Multipath cannot directly provide failover for ioctls in the kernel because it doesn't know what each ioctl means and which result could indicate a path error. Userspace generally knows what the ioctl it issued means and if it might be a path error, but neither does it know which path the ioctl took nor does it necessarily have the privileges to fail a path using the control device.

This series adds an interface that userspace can use to probe paths and fail the bad ones after seeing a potential path error in an ioctl result. Once the bad paths are eliminated, the ioctl can be retried.

While the fundamental problem is relatively broad and can affect any sort of ioctl, the immediate motivation for this is the use of SG_IO in QEMU for SCSI passthrough. A preliminary QEMU side patch that makes use of the new interface to support multipath failover with SCSI passthrough can be found at:
https://repo.or.cz/qemu/kevin.git/commitdiff/78a474da3b39469b11fbb1d4e0ddf4797b637d35

Kevin Wolf (2):
  dm: Allow .prepare_ioctl to handle ioctls directly
  dm mpath: Interface for explicit probing of active paths

 include/linux/device-mapper.h |  9 +++-
 include/uapi/linux/dm-ioctl.h |  9 +++-
 drivers/md/dm-dust.c          |  4 +-
 drivers/md/dm-ebs-target.c    |  3 +-
 drivers/md/dm-flakey.c        |  4 +-
 drivers/md/dm-ioctl.c         |  1 +
 drivers/md/dm-linear.c        |  4 +-
 drivers/md/dm-log-writes.c    |  4 +-
 drivers/md/dm-mpath.c         | 95 ++++++++++++++++++++++++++++++++++-
 drivers/md/dm-switch.c        |  4 +-
 drivers/md/dm-verity-target.c |  4 +-
 drivers/md/dm-zoned-target.c  |  3 +-
 drivers/md/dm.c               | 17 ++++---
 13 files changed, 142 insertions(+), 19 deletions(-)

-- 
2.49.0
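As a rough illustration of the intended usage from userspace (this is not the QEMU patch itself; the probe ioctl name is taken from this series' dm-ioctl.h changes, and the path-error classification below is a deliberately simplified stand-in for the SCSI sense/host-status inspection userspace actually does):

    /*
     * Hedged sketch: retry an SG_IO that may have hit a path error by asking
     * dm-mpath to probe and fail bad paths first. DM_MPATH_PROBE_PATHS is the
     * ioctl added by this series (exact definition per the dm-ioctl.h hunk).
     */
    #include <sys/ioctl.h>
    #include <scsi/sg.h>
    #include <linux/dm-ioctl.h>
    #include <errno.h>
    #include <stdbool.h>

    static bool looks_like_path_error(const struct sg_io_hdr *io)
    {
        /* Simplified: any transport/host-level error suggests a bad path. */
        return io->host_status != 0;
    }

    static int sg_io_with_failover(int fd, struct sg_io_hdr *io, int max_retries)
    {
        for (int i = 0; i <= max_retries; i++) {
            if (ioctl(fd, SG_IO, io) < 0)
                return -errno;
            if (!looks_like_path_error(io))
                return 0;                 /* done, or a non-path error */
            /* Ask dm-mpath to probe the active paths and fail the bad ones,
             * then retry; this fails with e.g. ENOTTY if fd isn't dm-mpath. */
            if (ioctl(fd, DM_MPATH_PROBE_PATHS) < 0)
                return -errno;
        }
        return -EIO;
    }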
Hello Kevin, [I'm sorry for the previous email. It seems that I clicked "send" in the wrong window]. On Tue, 2025-04-29 at 18:50 +0200, Kevin Wolf wrote: > Multipath cannot directly provide failover for ioctls in the kernel > because it doesn't know what each ioctl means and which result could > indicate a path error. Userspace generally knows what the ioctl it > issued means and if it might be a path error, but neither does it > know > which path the ioctl took nor does it necessarily have the privileges > to > fail a path using the control device. Thanks for this effort. I have some remarks about your approach. The most important one is that the DM_MPATH_PROBE_PATHS_CMD ioctl appears to be a dangerous command. It sends IO to possibly broken paths and waits for it to complete. In the common error case of a device not responding, this might cause the code to hang for a long time in the kernel ioctl code path, in uninterruptible sleep (note that these probing commands will be queued after other possibly hanging IO). In the past we have put a lot of effort into other code paths in multipath-tools and elsewhere to avoid this kind of hang to the extent possible. It seems to me that your set re-introduces this risk. Apart from that, minor observations are that in patch 2/2 you don't check whether the map is in queueing state, and I don't quite understand why you only check paths in the map's active path group without attempting a PG failover in the case where all paths in the current PG fail. I am wondering whether the DM_MPATH_PROBE_PATHS_CMD ioctl is necessary at all. Rather than triggering explicit path probing, you could have dm-mpath "handle" IO errors by failing the active path for "path errors". That would be similar to my patch set from 2021 [1], but it would avoid the extra IO and thus the additional risk of hanging in the kernel. Contrary to my set, you wouldn't attempt retries in the kernel, but leave this part to qemu instead, like in the current set. That would avoid Christoph's main criticism that "Failing over SG_IO does not make sense" [2]. Doing the failover in qemu has the general disadvantage that qemu has no notion about the number of available and/or healthy paths; it can thus only guess the reasonable number of retries for any given I/O. But that's unavoidable, given that the idea to do kernel-level failover on SG_IO is rejected. Please let me know your thoughts. Best Regards Martin [1] https://lore.kernel.org/all/20210628151558.2289-1-mwilck@suse.com/ [2] https://lore.kernel.org/all/20210701075629.GA25768@lst.de/
Am 08.05.2025 um 15:51 hat Martin Wilck geschrieben:
> Hello Kevin,
>
> [I'm sorry for the previous email. It seems that I clicked "send" in the wrong window].
>
> On Tue, 2025-04-29 at 18:50 +0200, Kevin Wolf wrote:
> > Multipath cannot directly provide failover for ioctls in the kernel because it doesn't know what each ioctl means and which result could indicate a path error. Userspace generally knows what the ioctl it issued means and if it might be a path error, but neither does it know which path the ioctl took nor does it necessarily have the privileges to fail a path using the control device.
>
> Thanks for this effort.
>
> I have some remarks about your approach. The most important one is that the DM_MPATH_PROBE_PATHS_CMD ioctl appears to be a dangerous command. It sends IO to possibly broken paths and waits for it to complete. In the common error case of a device not responding, this might cause the code to hang for a long time in the kernel ioctl code path, in uninterruptible sleep (note that these probing commands will be queued after other possibly hanging IO). In the past we have put a lot of effort into other code paths in multipath-tools and elsewhere to avoid this kind of hang to the extent possible. It seems to me that your set re-introduces this risk.

That's a fair point. I don't expect this code to be fully final; another thing that isn't really optimal (as mentioned in the commit message) is that we're not probing paths in parallel, potentially adding up timeouts from multiple paths.

I don't think this is a problem of the interface, though, but we can improve the implementation while keeping the same interface.

Maybe I'm missing something, but I think the reason why it has to be uninterruptible is just that submit_bio_wait() has the completion on the stack? So I suppose we could just kmalloc() some state, submit all the bios, let the completion callback free it, and then we can just stop waiting early (i.e. wait_for_completion_interruptible/killable) and let the requests complete on their own in the background.

Would this address your concerns or is there another part to it?

> Apart from that, minor observations are that in patch 2/2 you don't check whether the map is in queueing state, and I don't quite understand why you only check paths in the map's active path group without attempting a PG failover in the case where all paths in the current PG fail.

Ben already fixed up the missing check.

Not attempting PG failover was also his suggestion, on the basis that it would be additional work for no real benefit when you can only submit requests for the current PG anyway. If userspace retries the SG_IO, it will already pick a different PG, so having the kernel do the same doesn't really improve anything.

> I am wondering whether the DM_MPATH_PROBE_PATHS_CMD ioctl is necessary at all. Rather than triggering explicit path probing, you could have dm-mpath "handle" IO errors by failing the active path for "path errors". That would be similar to my patch set from 2021 [1], but it would avoid the extra IO and thus the additional risk of hanging in the kernel.
>
> Contrary to my set, you wouldn't attempt retries in the kernel, but leave this part to qemu instead, like in the current set. That would avoid Christoph's main criticism that "Failing over SG_IO does not make sense" [2].
Maybe I misunderstood, but my understanding from the feedback you got back then was that no SCSI-specific code was wanted in device mapper. This is why we kept interpreting the status codes in userspace. While discussing the approaches with Mikuláš and Ben, one option that was briefly discussed was a pair of ioctls: One that wraps ioctls and returns which path the ioctl took, and another one to fail that path if inspection of the result in userspace comes to the conclusion that it's a path error. I didn't think this could be implemented without also allowing an unprivileged process to DoS the device by just failing all paths even if they are still good, so we didn't continue there. Anyway, if it actually were acceptable for the kernel to check SG_IO results for path errors and fail the path while still returning the result unchanged, then that would work for us for the specific case, of course. But I don't expect this really addresses Christoph's concerns. If you think it does, is there another reason why you didn't try this before? (And it wouldn't be generic, so would we potentially have to repeat the exercise for other ioctls in the future?) > Doing the failover in qemu has the general disadvantage that qemu has > no notion about the number of available and/or healthy paths; it can > thus only guess the reasonable number of retries for any given I/O. But > that's unavoidable, given that the idea to do kernel-level failover on > SG_IO is rejected. Yes, it's a bit unfortunate, but we have to work with what we have. QEMU doesn't even necessarily know that it's dealing with a multipath device, so it just has to blindly try the ioctl and see if it works. Kevin
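A minimal sketch of the non-blocking wait Kevin describes above: heap-allocate the probe state, let whoever drops the last reference free it, and wait killably so the ioctl caller can give up while the probe bios finish in the background. This is only an illustration of the idea, not the actual dm-mpath code; the names (probe_ctx etc.) are made up.

    #include <linux/bio.h>
    #include <linux/completion.h>
    #include <linux/slab.h>

    struct probe_ctx {
            struct completion done;
            atomic_t refs;          /* submitter + one per in-flight bio */
            atomic_t pending;       /* bios not yet completed */
            blk_status_t status;
    };

    static void probe_ctx_put(struct probe_ctx *ctx)
    {
            if (atomic_dec_and_test(&ctx->refs))
                    kfree(ctx);
    }

    static void probe_bio_end_io(struct bio *bio)
    {
            struct probe_ctx *ctx = bio->bi_private;

            if (bio->bi_status)
                    ctx->status = bio->bi_status;   /* records *some* failure */
            if (atomic_dec_and_test(&ctx->pending))
                    complete(&ctx->done);
            bio_put(bio);
            probe_ctx_put(ctx);
    }

    /* Submit one empty read per path, then wait killably instead of
     * uninterruptibly; bios that are still in flight clean up after
     * themselves if the caller gives up. */
    static int probe_paths_nonblocking(struct block_device **bdevs, int nr)
    {
            struct probe_ctx *ctx = kmalloc(sizeof(*ctx), GFP_KERNEL);
            int i, ret;

            if (!ctx)
                    return -ENOMEM;
            init_completion(&ctx->done);
            atomic_set(&ctx->refs, nr + 1);
            atomic_set(&ctx->pending, nr);
            ctx->status = BLK_STS_OK;

            for (i = 0; i < nr; i++) {
                    struct bio *bio = bio_alloc(bdevs[i], 0, REQ_OP_READ,
                                                GFP_KERNEL);

                    bio->bi_private = ctx;
                    bio->bi_end_io = probe_bio_end_io;
                    submit_bio(bio);
            }

            if (wait_for_completion_killable(&ctx->done))
                    ret = -EINTR;   /* caller gave up; bios finish on their own */
            else
                    ret = blk_status_to_errno(ctx->status);
            probe_ctx_put(ctx);
            return ret;
    }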
Hi Kevin, On Mon, 2025-05-12 at 17:18 +0200, Kevin Wolf wrote: > Am 08.05.2025 um 15:51 hat Martin Wilck geschrieben: > > Hello Kevin, > > > > [I'm sorry for the previous email. It seems that I clicked "send" > > in > > the wrong window]. > > > > On Tue, 2025-04-29 at 18:50 +0200, Kevin Wolf wrote: > > > Multipath cannot directly provide failover for ioctls in the > > > kernel > > > because it doesn't know what each ioctl means and which result > > > could > > > indicate a path error. Userspace generally knows what the ioctl > > > it > > > issued means and if it might be a path error, but neither does it > > > know > > > which path the ioctl took nor does it necessarily have the > > > privileges > > > to > > > fail a path using the control device. > > > > Thanks for this effort. > > > > I have some remarks about your approach. The most important one is > > that > > the DM_MPATH_PROBE_PATHS_CMD ioctl appears to be a dangerous > > command. > > It sends IO to possibly broken paths and waits for it to complete. > > In > > the common error case of a device not responding, this might cause > > the > > code to hang for a long time in the kernel ioctl code path, in > > uninterruptible sleep (note that these probing commands will be > > queued > > after other possibly hanging IO). In the past we have put a lot of > > effort into other code paths in multipath-tools and elsewhere to > > avoid > > this kind of hang to the extent possible. It seems to me that your > > set > > re-introduces this risk. > > That's a fair point. I don't expect this code to be fully final, > another > thing that isn't really optimal (as mentioned in the comment message) > is > that we're not probing paths in parallel, potentially adding up > timeouts > from multiple paths. > > I don't think this is a problem of the interface, though, but we can > improve the implementation keeping the same interface. > Maybe I'm missing something, but I think the reason why it has to be > uninterruptible is just that submit_bio_wait() has the completion on > the > stack? So I suppose we could just kmalloc() some state, submit all > the > bios, let the completion callback free it, and then we can just stop > waiting early (i.e. wait_for_completion_interruptible/killable) and > let > the requests complete on their own in the background. > > Would this address your concerns or is there another part to it? It'd be an improvement. Not sure if it'd solve the problem entirely. > > Apart from that, minor observations are that in patch 2/2 you don't > > check whether the map is in queueing state, and I don't quite > > understand why you only check paths in the map's active path group > > without attempting a PG failover in the case where all paths in the > > current PG fail. > > Ben already fixed up the missing check. I missed that. > Not attempting PG failover was also his suggestion, on the basis that > it > would be additional work for no real benefit when you can only submit > requests for the current PG anyway. If userspace retries the SG_IO, > it > will already pick a different PG, so having the kernel do the same > doesn't really improve anything. Ok. But continuing along this line of reasoning, I'd conclude that it would be even more useful to fail the path in device-mapper directly after a failed IO request (given that we find an agreeable way to do this, see below). Qemu could then retry, depending on the error code. Paths would be failed one after the other, and eventually a PG failover would occur. 
> > I am wondering whether the DM_MPATH_PROBE_PATHS_CMD ioctl is > > necessary > > at all. Rather than triggering explicit path probing, you could > > have > > dm-mpath "handle" IO errors by failing the active path for "path > > errors". That would be similar to my patch set from 2021 [1], but > > it > > would avoid the extra IO and thus the additional risk of hanging in > > the > > kernel. > > > > Contrary to my set, you wouldn't attempt retries in the kernel, but > > leave this part to qemu instead, like in the current set. That > > would > > avoid Christoph's main criticism that "Failing over SG_IO does not > > make > > sense" [2]. > > Maybe I misunderstood, but my understanding from the feedback you got > back then was that no SCSI-specific code was wanted in device mapper. > This is why we kept interpreting the status codes in userspace. As I wrote in my reply to Mikulas already, my understanding was that Christoph's main objection was against retrying SG_IO ioctls in the kernel on different paths, not against inspecting the error code. > While discussing the approaches with Mikuláš and Ben, one option that > was briefly discussed was a pair of ioctls: One that wraps ioctls and > returns which path the ioctl took, and another one to fail that path > if > inspection of the result in userspace comes to the conclusion that > it's > a path error. I didn't think this could be implemented without also > allowing an unprivileged process to DoS the device by just failing > all > paths even if they are still good, so we didn't continue there. This doesn't sound too promising IMO. If we want to find a solution that's focused on user space, I'd suggest to use multipathd rather than additional ioctls. After all, all necessary information is available in multipathd. Already today, qemu can use the multipath socket to query the current state of paths in a map. I can imagine a new multipathd CLI command that would instruct multipathd to check all paths of a given map immediately (@Ben: set pp->tick to 1 for all paths of the map). This command could be sent to multipathd in case of a suspected path failure. This would have similar semantics as the DM_MPATH_PROBE_PATHS_CMD ioctl, with some advantages, as it would take the multipath configuration of the specific storage array into account. It'd also avoid blocking qemu. multipathd would allow this command for non-root connections, but to avoid DoS, accept it only once every N seconds. > Anyway, if it actually were acceptable for the kernel to check SG_IO > results for path errors and fail the path while still returning the > result unchanged, then that would work for us for the specific case, > of > course. But I don't expect this really addresses Christoph's > concerns. For regular block IO, the "path error" logic is in blk_path_error() [1] and scsi_result_to_blk_status() [2]. blk_path_error() can be called from dm; the problem in the SG_IO code path is that we don't have a blk_status_t to examine. Therefore, in my old code, I replicated the logic of scsi_result_to_blk_status() in the dm layer. It works, but it's admittedly not the cleanest possible approach. OTOH, SG_IO on dm devices isn't the cleanest thing to do in general ;-) > If you think it does, is there another reason why you didn't try this > before? It didn't occur to me back then that we could fail paths without retrying in the kernel. Perhaps we could have the sg driver pass the blk_status_t (which is available on the sg level) to device mapper somehow in the sg_io_hdr structure? 
That way we could entirely avoid the layering violation between SCSI and dm. Not sure if that would be acceptable to Christoph, as blk_status_t is supposed to be exclusive to the kernel. Can we find a way to make sure it's passed to DM, but not to user space?

Alternatively (if maintainers strictly object to the error code inspection by dm), I wonder whether we could just _always_ fail the current path in case of errors in the dm-mpath SG_IO code path, without inspecting the error code. Rationale: a) most potential error conditions are treated as "path error" in the block IO code path, b) qemu can still inspect the error code and avoid retrying for errors that aren't path errors, and c) if multipathd is running and the path has been failed mistakenly, it will reinstate that path after just a few seconds. In the worst case, an unintentional failover would occur. That isn't as bad as it used to be, as active-passive configurations with explicit TPGS are less common than in the past.

> (And it wouldn't be generic, so would we potentially have to repeat the exercise for other ioctls in the future?)

I don't think so. Do you have anything specific in mind?

> > Doing the failover in qemu has the general disadvantage that qemu has no notion about the number of available and/or healthy paths; it can thus only guess the reasonable number of retries for any given I/O. But that's unavoidable, given that the idea to do kernel-level failover on SG_IO is rejected.
>
> Yes, it's a bit unfortunate, but we have to work with what we have. QEMU doesn't even necessarily know that it's dealing with a multipath device, so it just has to blindly try the ioctl and see if it works.

See the paragraph about multipathd above.

Regards
Martin

[1] https://elixir.bootlin.com/linux/v6.14.6/source/include/linux/blk_types.h#L185
[2] https://elixir.bootlin.com/linux/v6.14.6/source/drivers/scsi/scsi_lib.c#L685

PS: Yet another option that Christoph has suggested in the past would be to move away from qemu's "scsi-block", and use the generic block reservation mechanism to allow passthrough of reservation commands from VMs to the host. Nobody seems to have explored this option seriously so far.
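For readers following reference [1] above: blk_path_error() is roughly the following (paraphrased from include/linux/blk_types.h; the exact list of non-path statuses has shifted slightly across kernel versions, e.g. BLK_STS_NEXUS was later renamed):

    /* Anything not in this list is treated as a potential path failure
     * and may be retried on another path by dm-mpath. */
    static inline bool blk_path_error(blk_status_t error)
    {
            switch (error) {
            case BLK_STS_NOTSUPP:
            case BLK_STS_NOSPC:
            case BLK_STS_TARGET:
            case BLK_STS_RESV_CONFLICT:     /* BLK_STS_NEXUS in older kernels */
            case BLK_STS_MEDIUM:
            case BLK_STS_PROTECTION:
                    return false;
            }
            return true;    /* everything else could be a path error */
    }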
On Tue, 2025-05-13 at 10:00 +0200, Martin Wilck wrote:
> > If you think it does, is there another reason why you didn't try this before?
>
> It didn't occur to me back then that we could fail paths without retrying in the kernel.
>
> Perhaps we could have the sg driver pass the blk_status_t (which is available on the sg level) to device mapper somehow in the sg_io_hdr structure? That way we could entirely avoid the layering violation between SCSI and dm. Not sure if that would be acceptable to Christoph, as blk_status_t is supposed to be exclusive to the kernel. Can we find a way to make sure it's passed to DM, but not to user space?

I have to correct myself. I was confused by my old patches, which contain special casing for SG_IO. The current upstream code does of course not support special-casing SG_IO in any way. device-mapper neither looks at the ioctl `cmd` value nor at any arguments, and has only the Unix error code to examine when the ioctl returns. The device mapper layer has access to *less* information than the user space process that issued the ioctl. Adding hooks to the sg driver wouldn't buy us anything in this situation.

If we can't change this, we can't fail paths in the SG_IO error code path, end of story.

With Kevin's patch 1/2 applied, it would in principle be feasible to special-case SG_IO, handle it in dm-multipath, retrieve the blk_status_t somehow, and possibly initiate path failover. This way we'd at least keep the generic dm layer clean of SCSI-specific code. But still, the end result would look very similar to my attempt from 2021 and would therefore lead us nowhere, probably.

I'm still not too fond of DM_MPATH_PROBE_PATHS_CMD, but I can't offer a better solution at this time. If the side issues are fixed, it will be an improvement over the current upstream situation, where we can do no path failover at all.

In the long term, we should evaluate alternatives. If my conjecture in my previous post is correct that we need only PRIN/PROUT commands, there might be a better solution than scsi-block for our customers. Using regular block IO should actually also improve performance.

Regards
Martin
Am 14.05.2025 um 23:21 hat Martin Wilck geschrieben: > On Tue, 2025-05-13 at 10:00 +0200, Martin Wilck wrote: > > > If you think it does, is there another reason why you didn't try > > > this > > > before? > > > > It didn't occur to me back then that we could fail paths without > > retrying in the kernel. > > > > Perhaps we could have the sg driver pass the blk_status_t (which is > > available on the sg level) to device mapper somehow in the sg_io_hdr > > structure? That way we could entirely avoid the layering violation > > between SCSI and dm. Not sure if that would be acceptible to > > Christoph, > > as blk_status_t is supposed to be exclusive to the kernel. Can we > > find > > a way to make sure it's passed to DM, but not to user space? > > I have to correct myself. I was confused by my old patches which > contain special casing for SG_IO. The current upstream code does of > course not support special-casing SG_IO in any way. device-mapper > neither looks at the ioctl `cmd` value nor at any arguments, and has > only the Unix error code to examine when the ioctl returns. The device > mapper layer has access to *less* information than the user space > process that issued the ioctl. Adding hooks to the sg driver wouldn't > buy us anything in this situation. > > If we can't change this, we can't fail paths in the SG_IO error code > path, end of story. Yes, as long as we can't look at the sg_io_hdr, there is no way to figure out if we got a path error. > With Kevin's patch 1/2 applied, it would in principle be feasible to > special-case SG_IO, handle it in the dm-multipath, retrieve the > blk_status_t somehow, and possibly initiate path failover. This way > we'd at least keep the generic dm layer clean of SCSI specific code. > But still, the end result would look very similar attempt from 2021 and > would therefore lead us nowhere, probably. Right, that was my impression, too. The interfaces could be made look a bit different, and we could return -EAGAIN to userspace instead of retrying immediately (not that it makes sense to me, but if that were really the issue, fine with me), but the core logic with copying the sg_io_hdr, calling sg_io() directly and then inspecting the status and possibly failing paths would have to be pretty much the same as you had. > I'm still not too fond of DM_MPATH_PROBE_PATHS_CMD, but I can't offer a > better solution at this time. If the side issues are fixed, it will be > an improvement over the current upstream, situation where we can do no > path failover at all. Yes, I agree we should focus on improving what we have, rather than trying to find another radically different approach that none of us have thought of before. > In the long term, we should evaluate alternatives. If my conjecture in > my previous post is correct we need only PRIN/PROUT commands, there > might be a better solution than scsi-block for our customers. Using > regular block IO should actually also improved performance. If you're talking about SG_IO in dm-mpath, then PRIN/PROUT commands are actually the one thing that we don't need. libmpathpersist sends the commands to the individual path devices, so dm-mpath will never see those. It's mostly about getting the full results on the SCSI level for normal I/O commands. There has actually been a patch series on qemu-devel last year (that I haven't found the time to review properly yet) that would add explicit persistent reservation operations to QEMU's block layer that could then be used with the emulated scsi-hd device. 
On the backend, it only implemented it for iscsi, but I suppose we could implement it for file-posix, too (using the same libmpathpersist code as for passthrough). If that works, maybe at least some users can move away from SCSI passthrough. The thing that we need to make sure, though, is that the emulated status we can expose to the guest is actually good enough. That Paolo said that the problem with reservation conflicts was mostly because -EBADE wasn't a thing yet gives me some hope that at least this wouldn't be a problem any more today. We would still lose other parts of the SCSI status, so I'm still a bit cautious here with making a prediction for how many users could eventually (I expect years) use the emulated device instead and how many would keep using passthrough even in the long term. Kevin
On Thu, May 15, 2025 at 12:11:49PM +0200, Kevin Wolf wrote: > If you're talking about SG_IO in dm-mpath, then PRIN/PROUT commands are > actually the one thing that we don't need. libmpathpersist sends the > commands to the individual path devices, so dm-mpath will never see > those. It's mostly about getting the full results on the SCSI level for > normal I/O commands. > > There has actually been a patch series on qemu-devel last year (that I > haven't found the time to review properly yet) that would add explicit > persistent reservation operations to QEMU's block layer that could then > be used with the emulated scsi-hd device. On the backend, it only > implemented it for iscsi, but I suppose we could implement it for > file-posix, too (using the same libmpathpersist code as for > passthrough). If that works, maybe at least some users can move away > from SCSI passthrough. Please call into the kernel PR code instead of hacking up more of this, which will just run into problems again. > The thing that we need to make sure, though, is that the emulated status > we can expose to the guest is actually good enough. That Paolo said that > the problem with reservation conflicts was mostly because -EBADE wasn't > a thing yet gives me some hope that at least this wouldn't be a problem > any more today. And if you need more information we can add that to the kernel PR API.
On Thu, 2025-05-15 at 23:00 -0700, Christoph Hellwig wrote:
> On Thu, May 15, 2025 at 12:11:49PM +0200, Kevin Wolf wrote:
> > If you're talking about SG_IO in dm-mpath, then PRIN/PROUT commands are actually the one thing that we don't need. libmpathpersist sends the commands to the individual path devices, so dm-mpath will never see those. It's mostly about getting the full results on the SCSI level for normal I/O commands.
> >
> > There has actually been a patch series on qemu-devel last year (that I haven't found the time to review properly yet) that would add explicit persistent reservation operations to QEMU's block layer that could then be used with the emulated scsi-hd device. On the backend, it only implemented it for iscsi, but I suppose we could implement it for file-posix, too (using the same libmpathpersist code as for passthrough). If that works, maybe at least some users can move away from SCSI passthrough.
>
> Please call into the kernel PR code instead of hacking up more of this, which will just run into problems again.

I still don't get what this buys us.

The guest (which might be Windows or whatever else) sends SCSI reservation commands. qemu will have to intercept these anyway, unless the backend device is a plain SCSI device (in which case transformation into a generic PR command would be a strange thing to do).

If the backend device is multipath on SCSI, qemu-pr-helper would take the most appropriate action and return the most appropriate result code. The dm-multipath layer can't do it as well, as it doesn't have the full information about the target that's available in user space (see Ben's note about ALL_TG_PT for example). I don't see any benefit from using a generic reservation on dm-mpath instead of using qemu-pr-helper for this important use case.

I also don't see why this way of handling SCSI PR commands would be dangerous. You are of course right to say that passthrough of other SCSI commands (except regular IO and PR) is highly dangerous, but in the concept that Kevin described that wouldn't happen.

Transforming the SCSI reservation commands into generic reservation commands makes sense for _other_ types of backend devices. NVMe comes to mind, but (for real-world applications) not much else. (But does it make sense to present NVMe devices to guests as SCSI devices?)

Regards
Martin
On Mon, May 19, 2025 at 07:33:50PM +0200, Martin Wilck wrote:
> I still don't get what this buys us.

You get one layer to deal with instead of poking into a leaky abstraction. Qemu sees a block device, checks if it supports persistent reservations, and everything is taken care of underneath instead of having to try to understand what is below and working around it.
Am 16.05.2025 um 08:00 hat Christoph Hellwig geschrieben: > On Thu, May 15, 2025 at 12:11:49PM +0200, Kevin Wolf wrote: > > If you're talking about SG_IO in dm-mpath, then PRIN/PROUT commands are > > actually the one thing that we don't need. libmpathpersist sends the > > commands to the individual path devices, so dm-mpath will never see > > those. It's mostly about getting the full results on the SCSI level for > > normal I/O commands. > > > > There has actually been a patch series on qemu-devel last year (that I > > haven't found the time to review properly yet) that would add explicit > > persistent reservation operations to QEMU's block layer that could then > > be used with the emulated scsi-hd device. On the backend, it only > > implemented it for iscsi, but I suppose we could implement it for > > file-posix, too (using the same libmpathpersist code as for > > passthrough). If that works, maybe at least some users can move away > > from SCSI passthrough. > > Please call into the kernel PR code instead of hacking up more of > this, which will just run into problems again. I agree that using kernel code is preferable to doing things behind the kernel's back. However, libmpathpersist is the official interface for doing these things with multipath devices, so I think the necessary work to make this happen should primarily be done in the library (and possibly the kernel if the existing interfaces aren't good enough). QEMU could directly call the kernel when qemu-pr-helper isn't in use. I don't know enough about how libmpathpersist works internally to tell if running it this way would be a good idea for multipath devices. Can multipathd still do its job with reservations being made behind its back? (It would probably be good to allow this eventually, but is it the case today?) Kevin
On Thu, May 15, 2025 at 11:00:17PM -0700, Christoph Hellwig wrote:
> On Thu, May 15, 2025 at 12:11:49PM +0200, Kevin Wolf wrote:
> > If you're talking about SG_IO in dm-mpath, then PRIN/PROUT commands are actually the one thing that we don't need. libmpathpersist sends the commands to the individual path devices, so dm-mpath will never see those. It's mostly about getting the full results on the SCSI level for normal I/O commands.
> >
> > There has actually been a patch series on qemu-devel last year (that I haven't found the time to review properly yet) that would add explicit persistent reservation operations to QEMU's block layer that could then be used with the emulated scsi-hd device. On the backend, it only implemented it for iscsi, but I suppose we could implement it for file-posix, too (using the same libmpathpersist code as for passthrough). If that works, maybe at least some users can move away from SCSI passthrough.
>
> Please call into the kernel PR code instead of hacking up more of this, which will just run into problems again.
>
> > The thing that we need to make sure, though, is that the emulated status we can expose to the guest is actually good enough. That Paolo said that the problem with reservation conflicts was mostly because -EBADE wasn't a thing yet gives me some hope that at least this wouldn't be a problem any more today.
>
> And if you need more information we can add that to the kernel PR API.

I've run into SCSI arrays that always act like they were called with the ALL_TG_PT flag set (or perhaps they were just configured that way, I never could get a straight answer about that). libmpathpersist accepts this flag, and only sends one registration per initiator & target device when it's set, instead of one per initiator & target port. dm-multipath currently doesn't have the knowledge necessary to figure out which devices it needs to send reservations to, even if the kernel PR API supported this.

But I suppose it could just send the registration normally down one path and if that works, it could just ignore the existing key when it sends the registration down all the other paths.

-Ben
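For what it's worth, the "register and ignore existing key" variant Ben mentions already has a direct equivalent in the kernel PR API that Christoph refers to. A minimal userspace sketch using linux/pr.h (error handling omitted):

    #include <sys/ioctl.h>
    #include <linux/pr.h>
    #include <stdint.h>

    /* Register a key on a block device while ignoring any key that is
     * already registered, i.e. the equivalent of SCSI REGISTER AND
     * IGNORE EXISTING KEY. */
    static int pr_register_ignore(int fd, uint64_t key)
    {
        struct pr_registration reg = {
            .old_key = 0,
            .new_key = key,
            .flags   = PR_FL_IGNORE_KEY,
        };

        return ioctl(fd, IOC_PR_REGISTER, &reg);
    }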
On Fri, May 16, 2025 at 12:06:21PM -0400, Benjamin Marzinski wrote: > I've run into SCSI arrays that always act like they were called with the > ALL_TG_PT flag set (or perhaps they were just configured that way, I > never could get a straight answer about that). Hmm, that's pretty strange. > libmpathpersist accepts > this flag, and only sends one registration per initiator & target device > when it's set, instead of one per initator & target port. With "this flag" you mean ALL_TG_PT? > dm-multipath > currently doesn't have the knowledge necessary to figure out which > devices it needs to send reservations to, even if the kernel PR API > supported this. > > But I supposed it could just send the registration normally down one > path and if that works, it could just ignore the existing key when it > sends the registration down all the other paths. We could add support for a similar flag to the kernel PR API. The problem is to figure out how to discover support for it. What does the library do for that currently?
On Sun, 2025-05-18 at 22:32 -0700, Christoph Hellwig wrote: > On Fri, May 16, 2025 at 12:06:21PM -0400, Benjamin Marzinski wrote: > > I've run into SCSI arrays that always act like they were called > > with the > > ALL_TG_PT flag set (or perhaps they were just configured that way, > > I > > never could get a straight answer about that). > > Hmm, that's pretty strange. > > > libmpathpersist accepts > > this flag, and only sends one registration per initiator & target > > device > > when it's set, instead of one per initator & target port. > > With "this flag" you mean ALL_TG_PT? multipath-tools allows setting the flag "all_tg_pt" in multipath.conf for certain storage devices or multipath maps. If this flag is set, mpathpersist will always act as if the --param-alltgpt flag was given. This means that the REGISTER commands are sent once per initiator port (SCSI host) only, instead of once per I_T nexus. By default, the all_tg_pt flag is not set for any device, not even for EMC VNX, where Ben observed the described behavior. It seems that devices that behave like this are a rare exception. > > dm-multipath > > currently doesn't have the knowledge necessary to figure out which > > devices it needs to send reservations to, even if the kernel PR API > > supported this. > > > > But I supposed it could just send the registration normally down > > one > > path and if that works, it could just ignore the existing key when > > it > > sends the registration down all the other paths. > > We could add support for a similar flag to the kernel PR API. The > problem is to figure out how to discover support for it. What does > the library do for that currently? dm-multipath has no knowledge about target or initiator ports, and I fail to see how we could provide this knowledge to it without adding yet another layering violation. Just sending the command once will not be sufficient if there are multiple local ports. Regards Martin
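For completeness, this is roughly how the option is enabled per storage array in multipath.conf (the vendor/product strings are placeholders):

    devices {
        device {
            vendor    "VENDOR"
            product   "PRODUCT"
            all_tg_pt yes
        }
    }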
On Sun, May 18, 2025 at 10:32:17PM -0700, Christoph Hellwig wrote:
> On Fri, May 16, 2025 at 12:06:21PM -0400, Benjamin Marzinski wrote:
> > I've run into SCSI arrays that always act like they were called with the ALL_TG_PT flag set (or perhaps they were just configured that way, I never could get a straight answer about that).
>
> Hmm, that's pretty strange.
>
> > libmpathpersist accepts this flag, and only sends one registration per initiator & target device when it's set, instead of one per initiator & target port.
>
> With "this flag" you mean ALL_TG_PT?

Yes.

> > dm-multipath currently doesn't have the knowledge necessary to figure out which devices it needs to send reservations to, even if the kernel PR API supported this.
> >
> > But I suppose it could just send the registration normally down one path and if that works, it could just ignore the existing key when it sends the registration down all the other paths.
>
> We could add support for a similar flag to the kernel PR API. The problem is to figure out how to discover support for it. What does the library do for that currently?

libmpathpersist knows which scsi path devices map to the same initiator, but different target ports. When ALL_TG_PT is set, it only sends one registration command to each group, instead of sending one to each path device (like when ALL_TG_PT isn't set). If it sent a registration to every SCSI device on these arrays that act like ALL_TG_PT is always set, it would get reservation conflicts when sending the command to devices using ports after the first, since they would already have a registered key. That's why I mentioned the possibility of using Register and Ignore, which doesn't care if there's a key already registered.

I don't think it would take much work for the kernel API to support ALL_TG_PT, if the SCSI implementation just passed the flag down to the device. The problem really only exists for things like dm-multipath, which are composed of a number of different devices that all point to the same underlying target LUN. To handle this without relying on something like Register and Ignore, it would need to be able to group all the devices that come from the same initiator but different target ports together, and only send a registration to one device per group. This takes more SCSI-specific knowledge than it currently has, or really wants to have.

-Ben
On Thu, 2025-05-15 at 12:11 +0200, Kevin Wolf wrote:
> Am 14.05.2025 um 23:21 hat Martin Wilck geschrieben:
> > In the long term, we should evaluate alternatives. If my conjecture in my previous post is correct that we need only PRIN/PROUT commands, there might be a better solution than scsi-block for our customers. Using regular block IO should actually also improve performance.
>
> If you're talking about SG_IO in dm-mpath, then PRIN/PROUT commands are actually the one thing that we don't need. libmpathpersist sends the commands to the individual path devices, so dm-mpath will never see those. It's mostly about getting the full results on the SCSI level for normal I/O commands.

I know. I meant "need" from the PoV of the guest, in the sense of "which commands need to be passed from the guest to the device (in some reasonable way) except the normal READ and WRITE commands?"

Regards
Martin
On 5/15/25 12:11, Kevin Wolf wrote: > The thing that we need to make sure, though, is that the emulated status > we can expose to the guest is actually good enough. That Paolo said that > the problem with reservation conflicts was mostly because -EBADE wasn't > a thing yet gives me some hope that at least this wouldn't be a problem > any more today. Hmm I'm not sure about that anymore... I was confused because some of the refactoring of block errors was done in 2017, which is around the same time I was working on persistent reservations in QEMU. However, EBADE handling dates back to 2011 (commit 63583cca745f, "[SCSI] Add detailed SCSI I/O errors", 2011-02-12) and yet the Windows tests for PR were failing before QEMU switched to SG_IO for reads and writes. I guess I have to try reverting that and retest, though. Paolo
On Thu, 2025-05-15 at 13:09 +0200, Paolo Bonzini wrote: > However, EBADE handling dates back to 2011 (commit 63583cca745f, > "[SCSI] > Add detailed SCSI I/O errors", 2011-02-12) and yet the Windows tests > for > PR were failing before QEMU switched to SG_IO for reads and writes. > I > guess I have to try reverting that and retest, though. Thanks! This makes me realize that we could summarize the goal for future efforts (independent of the current patch set) roughly like this: "Emulate a SCSI disk on top of a (host) multipath device in a way that 1) failover works properly (like it would work for regular IO from the host itself), 2) Windows tests for PR (plus test case X, Y, ...) can be run successfully". Does this make sense? It implies that PR commands don't just need to be forwarded appropriately, we also need to pass meaningful error codes back to the guest. Martin
On Tue, 2025-05-13 at 10:00 +0200, Martin Wilck wrote: > On Mon, 2025-05-12 at 17:18 +0200, Kevin Wolf wrote: > > > > > Ben already fixed up the missing check. > > I missed that. Saw it now. Martin
On 5/12/25 17:18, Kevin Wolf wrote: > Am 08.05.2025 um 15:51 hat Martin Wilck geschrieben: >> Hello Kevin, >> >> [I'm sorry for the previous email. It seems that I clicked "send" in >> the wrong window]. >> >> On Tue, 2025-04-29 at 18:50 +0200, Kevin Wolf wrote: >>> Multipath cannot directly provide failover for ioctls in the kernel >>> because it doesn't know what each ioctl means and which result could >>> indicate a path error. Userspace generally knows what the ioctl it >>> issued means and if it might be a path error, but neither does it >>> know >>> which path the ioctl took nor does it necessarily have the privileges >>> to >>> fail a path using the control device. >> >> Thanks for this effort. >> >> I have some remarks about your approach. The most important one is that >> the DM_MPATH_PROBE_PATHS_CMD ioctl appears to be a dangerous command. >> It sends IO to possibly broken paths and waits for it to complete. In >> the common error case of a device not responding, this might cause the >> code to hang for a long time in the kernel ioctl code path, in >> uninterruptible sleep (note that these probing commands will be queued >> after other possibly hanging IO). In the past we have put a lot of >> effort into other code paths in multipath-tools and elsewhere to avoid >> this kind of hang to the extent possible. It seems to me that your set >> re-introduces this risk. > > That's a fair point. I don't expect this code to be fully final, another > thing that isn't really optimal (as mentioned in the comment message) is > that we're not probing paths in parallel, potentially adding up timeouts > from multiple paths. > > I don't think this is a problem of the interface, though, but we can > improve the implementation keeping the same interface. > > Maybe I'm missing something, but I think the reason why it has to be > uninterruptible is just that submit_bio_wait() has the completion on the > stack? So I suppose we could just kmalloc() some state, submit all the > bios, let the completion callback free it, and then we can just stop > waiting early (i.e. wait_for_completion_interruptible/killable) and let > the requests complete on their own in the background. > > Would this address your concerns or is there another part to it? > Not really. There are two issues which ultimately made us move away from read I/O as path checkers: - Spurious timeouts due to blocked or congested submission - Error handling delays and stalls When using normal read/write functions for path checking you are subjected to normal I/O processing rules, ie I/O is inserted at the end of the submission queue. If the system is busy the path checker will time out prematurely without ever having reached the target. One could avoid that by extending the timeout, but that would make the path checker unreliable. But the real issue is error handling. Path checker are precisely there to check if the path is healthy, and as such _will_ be subjected to error handling (or might even trigger error handling itself). The thing about error handling is that you can only return affected commands once error handling is complete, as only then you can be sure that no DMA is pending on the buffers and one can free/re-use the associated pages. On SCSI error handling can be an _extremely_ lengthy affair (we had reports for over one hour), and the last thing you want is to delay your path checker for that time. 
(And besides, I didn't even mention the third problem with I/O-based path checkers, namely that they check entirely the wrong thing; we are not interested in whether we can do I/O, we are interested in whether we can send commands. In the light of e.g. ALUA these are two very different things.)

> > Apart from that, minor observations are that in patch 2/2 you don't check whether the map is in queueing state, and I don't quite understand why you only check paths in the map's active path group without attempting a PG failover in the case where all paths in the current PG fail.
>
> Ben already fixed up the missing check.
>
> Not attempting PG failover was also his suggestion, on the basis that it would be additional work for no real benefit when you can only submit requests for the current PG anyway. If userspace retries the SG_IO, it will already pick a different PG, so having the kernel do the same doesn't really improve anything.

Precisely.

I think the best way forward here is to have a 'post_ioctl' handler (much like we have a pre_ioctl handler nowadays). This would check the return code from the ioctl, and invalidate the current path if there was an I/O error. That way the user can resubmit the ioctl, until all paths are exhausted. For that to work we need to agree on two error codes, one for 'path failed, retry' and one for 'path failed, no retry possible'.

Resetting path status will be delegated to multipathd, but then that's its main task anyway.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                   Kernel Storage Architect
hare@suse.de                                 +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
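A sketch of what such a 'post_ioctl' hook might look like at the dm target level. This is purely hypothetical (no such hook exists today), and the two return codes are placeholders for whatever contract would be agreed on with userspace; only fail_path(), current_pgpath and nr_valid_paths are existing dm-mpath internals.

    /* Hypothetical addition to struct target_type in device-mapper.h. */
    typedef int (*dm_post_ioctl_fn)(struct dm_target *ti,
                                    struct block_device *bdev,
                                    unsigned int cmd, int ioctl_ret);

    /* Hypothetical dm-mpath implementation: fail the path the ioctl went
     * down if it returned an error, and tell userspace whether a retry
     * down another path can possibly succeed. */
    static int multipath_post_ioctl(struct dm_target *ti,
                                    struct block_device *bdev,
                                    unsigned int cmd, int ioctl_ret)
    {
            struct multipath *m = ti->private;
            struct pgpath *pgpath = READ_ONCE(m->current_pgpath);

            if (ioctl_ret >= 0 || !pgpath || pgpath->path.dev->bdev != bdev)
                    return ioctl_ret;

            fail_path(pgpath);      /* existing dm-mpath helper */

            /* placeholder error codes for the agreed-on contract */
            return atomic_read(&m->nr_valid_paths) ? -EAGAIN   /* retry */
                                                    : -ENODEV;  /* no paths left */
    }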
On Tue, May 13, 2025 at 08:30:55AM +0200, Hannes Reinecke wrote: > On 5/12/25 17:18, Kevin Wolf wrote: > > Am 08.05.2025 um 15:51 hat Martin Wilck geschrieben: > > > Hello Kevin, > > > > > > [I'm sorry for the previous email. It seems that I clicked "send" in > > > the wrong window]. > > > > > > On Tue, 2025-04-29 at 18:50 +0200, Kevin Wolf wrote: > > > > Multipath cannot directly provide failover for ioctls in the kernel > > > > because it doesn't know what each ioctl means and which result could > > > > indicate a path error. Userspace generally knows what the ioctl it > > > > issued means and if it might be a path error, but neither does it > > > > know > > > > which path the ioctl took nor does it necessarily have the privileges > > > > to > > > > fail a path using the control device. > > > > > > Thanks for this effort. > > > > > > I have some remarks about your approach. The most important one is that > > > the DM_MPATH_PROBE_PATHS_CMD ioctl appears to be a dangerous command. > > > It sends IO to possibly broken paths and waits for it to complete. In > > > the common error case of a device not responding, this might cause the > > > code to hang for a long time in the kernel ioctl code path, in > > > uninterruptible sleep (note that these probing commands will be queued > > > after other possibly hanging IO). In the past we have put a lot of > > > effort into other code paths in multipath-tools and elsewhere to avoid > > > this kind of hang to the extent possible. It seems to me that your set > > > re-introduces this risk. > > > > That's a fair point. I don't expect this code to be fully final, another > > thing that isn't really optimal (as mentioned in the comment message) is > > that we're not probing paths in parallel, potentially adding up timeouts > > from multiple paths. > > > > I don't think this is a problem of the interface, though, but we can > > improve the implementation keeping the same interface. > > > > Maybe I'm missing something, but I think the reason why it has to be > > uninterruptible is just that submit_bio_wait() has the completion on the > > stack? So I suppose we could just kmalloc() some state, submit all the > > bios, let the completion callback free it, and then we can just stop > > waiting early (i.e. wait_for_completion_interruptible/killable) and let > > the requests complete on their own in the background. > > > > Would this address your concerns or is there another part to it? > > > Not really. There are two issues which ultimately made us move away > from read I/O as path checkers: > > - Spurious timeouts due to blocked or congested submission > - Error handling delays and stalls > > When using normal read/write functions for path checking you are > subjected to normal I/O processing rules, ie I/O is inserted > at the end of the submission queue. If the system is busy the > path checker will time out prematurely without ever having reached > the target. One could avoid that by extending the timeout, but that > would make the path checker unreliable. I get your complaint in general, but this interface is just giving users the ability to probe the active pathgroup by sending normal IOs to its paths. Presumably, they won't use it if they are already sending normal IOs to the multipath device, since this is just duplicating the effect of those IOs. > But the real issue is error handling. Path checker are precisely there > to check if the path is healthy, and as such _will_ be subjected > to error handling (or might even trigger error handling itself). 
> The thing about error handling is that you can only return affected commands once error handling is complete, as only then you can be sure that no DMA is pending on the buffers and one can free/re-use the associated pages. On SCSI error handling can be an _extremely_ lengthy affair (we had reports for over one hour), and the last thing you want is to delay your path checker for that time.

I can send a patch that will keep probe_active_paths() from holding the work_mutex. This means that probe_active_paths() won't delay multipath_message(), so the path checkers will not be affected by this.

> (And besides, I didn't even mention the third problem with I/O-based path checkers, namely that they check entirely the wrong thing; we are not interested in whether we can do I/O, we are interested in whether we can send commands. In the light of e.g. ALUA these are two very different things.)

I actually disagree with this. There are issues with IO-based checkers, but in general, what we care about is paths being able to do IO, not sending commands. All the improvements on the directio checker were made because we still need it. For instance, some SCSI devices will claim they are usable when you run a SCSI Test Unit Ready command, but they aren't able to actually handle IO. These paths will have their state ping-pong back and forth as multipathd restores them and the kernel uses them and immediately fails them again. The solution is often to switch multipathd to the directio checker, since it verifies that the path can actually handle IO. It's problematic when dealing with paths that need to be initialized before accepting IO, but that's why this patch only checks the paths in the active pathgroup.

> > > Apart from that, minor observations are that in patch 2/2 you don't check whether the map is in queueing state, and I don't quite understand why you only check paths in the map's active path group without attempting a PG failover in the case where all paths in the current PG fail.
> >
> > Ben already fixed up the missing check.
> >
> > Not attempting PG failover was also his suggestion, on the basis that it would be additional work for no real benefit when you can only submit requests for the current PG anyway. If userspace retries the SG_IO, it will already pick a different PG, so having the kernel do the same doesn't really improve anything.
>
> Precisely.
>
> I think the best way forward here is to have a 'post_ioctl' handler (much like we have a pre_ioctl handler nowadays). This would check the return code from the ioctl, and invalidate the current path if there was an I/O error. That way the user can resubmit the ioctl, until all paths are exhausted. For that to work we need to agree on two error codes, one for 'path failed, retry' and one for 'path failed, no retry possible'. Resetting path status will be delegated to multipathd, but then that's its main task anyway.

Isn't this basically the idea that Martin had, here:
https://lore.kernel.org/all/20210628151558.2289-1-mwilck@suse.com/

If this was acceptable, I would be all for it. But we often can't even tell if an SG_IO has failed without decoding the scsi status/sense data, and that got shot down when Martin posted patches to do it. Thus, we have a general solution that has nothing to do with SG_IO ioctls and will work for any type of path device, any time a user wants to verify the state of the active paths.

-Ben

> Cheers,
>
> Hannes
> -- 
> Dr. Hannes Reinecke                   Kernel Storage Architect
> hare@suse.de                                 +49 911 74053 688
> SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
> HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
On Mon, May 12, 2025 at 05:18:43PM +0200, Kevin Wolf wrote: > Yes, it's a bit unfortunate, but we have to work with what we have. QEMU > doesn't even necessarily know that it's dealing with a multipath device, > so it just has to blindly try the ioctl and see if it works. Why is qemu even using SG_IO to start with?
Am 13.05.2025 um 07:55 hat Christoph Hellwig geschrieben: > On Mon, May 12, 2025 at 05:18:43PM +0200, Kevin Wolf wrote: > > Yes, it's a bit unfortunate, but we have to work with what we have. QEMU > > doesn't even necessarily know that it's dealing with a multipath device, > > so it just has to blindly try the ioctl and see if it works. > > Why is qemu even using SG_IO to start with? How else would you do SCSI passthrough? Ok, from your replies to Hannes I understand an implicit message, you wouldn't. But I don't think that's really an answer, at least not for all users. Yes, I'll give you that in the long term, we might be able to move some passthrough users away from it by using more specific interfaces like for persistent reservations (though see below). But for example, it's also used for vendor-specific commands and I don't see how Linux could ever provide more specific interfaces than SG_IO for those. But it's even about more than just accessing commands that aren't otherwise accessible. Mapping a SCSI response from the device to a single errno and back into SCSI status for the guest isn't lossless. QEMU's scsi-block device actually started off using normal I/O for reads and writes and using SG_IO only for things that aren't covered by normal I/O. But then those had to be changed to SG_IO, too, so that the guest would actually see the full SCSI status. Things the commit message mentions are unit attention codes and handling RESERVATION CONFLICT correctly (which made me unsure above if the more specific interface for reservations could actually be used to fully get rid of SG_IO). For more context, I'm adding Paolo who actually made that change back then. He may remember the concrete bug(s) this fixed. So if you want the guest device to behave mostly the same as your host device, I don't really see any way around SCSI passthrough and therefore SG_IO. Kevin
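To make the "lossless" point concrete: with SG_IO the caller gets the whole SCSI completion back rather than a single errno. A minimal sketch (TEST UNIT READY chosen only because it needs no data transfer):

    #include <scsi/sg.h>
    #include <sys/ioctl.h>
    #include <string.h>
    #include <stdio.h>

    /* Issue TEST UNIT READY and dump the raw SCSI completion that would be
     * flattened to a single errno if the request went through the normal
     * block I/O path. */
    static int sg_tur(int fd)
    {
        unsigned char cdb[6] = { 0x00 };        /* TEST UNIT READY */
        unsigned char sense[32];
        struct sg_io_hdr io;

        memset(&io, 0, sizeof(io));
        io.interface_id = 'S';
        io.dxfer_direction = SG_DXFER_NONE;
        io.cmdp = cdb;
        io.cmd_len = sizeof(cdb);
        io.sbp = sense;
        io.mx_sb_len = sizeof(sense);
        io.timeout = 10000;                     /* ms */

        if (ioctl(fd, SG_IO, &io) < 0)
            return -1;

        /* SCSI status, transport result and sense data all survive. */
        printf("status=%#x host=%#x driver=%#x sense_len=%u\n",
               io.status, io.host_status, io.driver_status, io.sb_len_wr);
        return 0;
    }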
On Tue, May 13, 2025 at 11:29:09AM +0200, Kevin Wolf wrote:
> Am 13.05.2025 um 07:55 hat Christoph Hellwig geschrieben:
> > On Mon, May 12, 2025 at 05:18:43PM +0200, Kevin Wolf wrote:
> > > Yes, it's a bit unfortunate, but we have to work with what we have. QEMU doesn't even necessarily know that it's dealing with a multipath device, so it just has to blindly try the ioctl and see if it works.
> >
> > Why is qemu even using SG_IO to start with?
>
> How else would you do SCSI passthrough?
>
> Ok, from your replies to Hannes I understand an implicit message, you wouldn't. But I don't think that's really an answer, at least not for all users.

SG_IO is fine and the only way for SCSI passthrough. But doing SCSI passthrough through md-multipath just doesn't work. SCSI isn't built for layering, and ALUA and its vendor-specific variants and alternatives certainly aren't. If you try that you're playing with fire and there is no chance of it ever working properly.
On Tue, May 13, 2025 at 09:57:51PM -0700, Christoph Hellwig wrote: > On Tue, May 13, 2025 at 11:29:09AM +0200, Kevin Wolf wrote: > > Am 13.05.2025 um 07:55 hat Christoph Hellwig geschrieben: > > > On Mon, May 12, 2025 at 05:18:43PM +0200, Kevin Wolf wrote: > > > > Yes, it's a bit unfortunate, but we have to work with what we have. QEMU > > > > doesn't even necessarily know that it's dealing with a multipath device, > > > > so it just has to blindly try the ioctl and see if it works. > > > > > > Why is qemu even using SG_IO to start with? > > > > How else would you do SCSI passthrough? > > > > Ok, from your replies to Hannes I understand an implicit message, you > > wouldn't. But I don't think that's really an answer, at least not for > > all users. > > SG_IO is fine and the only way for SCSI passthrough. But doing > SCSI passthrough through md-multipath just doesn't work. SCSI isn't > built for layering, and ALUA and it's vendor-specific variants and > alternatives certainly isn't. If you try that you're playing with > fire and is not chance of ever moving properly.

Could you be a bit more specific? All multipath is doing here is forwarding the ioctls to an underlying scsi device, and passing back up the result. Admittedly, it doesn't always make sense to pass the ioctl on from the multipath device to just one scsi device. Persistent Reservations are a perfect example of this, and that's why QEMU doesn't use DM's ioctl passthrough code to handle them. Also, when you have ALUA setups, not all the scsi devices are equal. But multipath isn't naively assuming that they are. It's only passing ioctls to the highest priority activated paths, just like it does for IO, and multipath is in charge of handling explicit alua devices. This hasn't proved to be problematic in practice.

The reality of the situation is that customers have been using this for a while, and the only issue that they run into is that multipath can't tell when a SG_IO has failed due to a retryable error. Currently, they're left waiting for multipathd's preemptive path checking to fail the path so they can retry down a new one. The purpose of this patchset and Martin's previous one is to handle this problem. If there are unavoidable critical problems that you see with this setup, it would be really helpful to know what they are.

-Ben
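For illustration, here is a rough sketch of the retry pattern the series is meant to enable from userspace. The assumptions made here are mine, not taken from the patches: that the new command is an argument-less ioctl issued on the dm-mpath block device fd, spelled DM_MPATH_PROBE_PATHS_CMD as referenced in this thread, and that it fails once no usable path remains. The real ABI is whatever the series' dm-ioctl.h changes define.

```c
/*
 * Sketch only.  ASSUMED ABI (check the series' uapi header for the real
 * definition): probe command is an argument-less ioctl on the dm-mpath
 * block device fd that errors out once no usable path is left.
 */
#include <errno.h>
#include <sys/ioctl.h>
#include <scsi/sg.h>
#include <linux/dm-ioctl.h>

/* Userspace policy: decide from the returned status whether this smells
 * like a path/transport error rather than a target error.  Simplified. */
static int looks_like_path_error(const struct sg_io_hdr *hdr)
{
	return hdr->host_status != 0;
}

static int sg_io_with_failover(int dm_fd, struct sg_io_hdr *hdr, int max_retries)
{
	for (int i = 0; i <= max_retries; i++) {
		if (ioctl(dm_fd, SG_IO, hdr) < 0)
			return -errno;
		if (!looks_like_path_error(hdr))
			return 0;	/* status, sense etc. are now valid */
		/* Ask dm-mpath to probe the active pathgroup and fail broken
		 * paths; if this errors out, no usable path remains. */
		if (ioctl(dm_fd, DM_MPATH_PROBE_PATHS_CMD) < 0)
			return -errno;
	}
	return -EIO;
}
```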
Hello Ben, hello Christoph, On Wed, 2025-05-14 at 12:23 -0400, Benjamin Marzinski wrote: > On Tue, May 13, 2025 at 09:57:51PM -0700, Christoph Hellwig wrote: > > > > SG_IO is fine and the only way for SCSI passthrough. But doing > > SCSI passthrough through md-multipath just doesn't work. SCSI > > isn't > > built for layering, and ALUA and it's vendor-specific variants and > > alternatives certainly isn't. If you try that you're playing with > > fire and is not chance of ever moving properly. > > Could you be a bit more specific. All multipath is doing here is > forwarding the ioctls to an underlying scsi device, and passing back > up > the result. Admittedly, it doesn't always make sense to pass the > ioctl > on from the multipath device to just one scsi device. Persistent > Reservations are perfect example of this, and that's why QEMU doesn't > use DMs ioctl passthrough code to handle them. I'd go one step further. Christoph is right to say that what we're currently doing in qemu – passing through every command except the PRIN/PROUT to a multipath device – is a dangerous thing to do. Passthrough from a dm-multipath device to a SCSI device makes sense only for a small subset of the SCSI command set. Basically just for the regular IO commands like the various READ and WRITE variants and the occasional UNMAP. However, in practice these commands account for 99.y% percent of the actual commands sent to devices. The fact that customers have been running these setups in large deployments over many years suggests that, if other commands ever get passed through to member devices, it has rarely had fatal consequences. Nobody would seriously consider sending ALUA commands to the multipath devices. TUR and REQUEST SENSE are other examples for commands that can't be reasonably passed through to random member devices of a multipath map. There are certainly many more examples. I guess it would make sense to review the command set and add some filtering in the qemu passthrough code. AFAIK the only commands that we really need to pass through (except the standard ones) are the reservation commands, which get special handling by qemu anyway. @Ben, @Kevin, are you aware of anything else? So: admittedly we're using a framework for passing through any command, where we actually need to pass through only a tiny subset of commands. Thinking about it this way, it really doesn't look like the perfect tool for the job, and we may want to look into a different approach for the future. > Also, when you have ALUA > setups, not all the scsi devices are equal. But multipath isn't > naievely > assuming that they are. It's only passing ioctls to the highest > priority > activated paths, just like it does for IO, and multipath is in charge > of > handling explicit alua devices. This hasn't proved to be problematic > in > practice. > > The reality of the situation is that customers have been using this > for > a while, and the only issue that they run into is that multipath > can't > tell when a SG_IO has failed due to a retryable error. Currently, > they're left with waiting for multipathd's preemptive path checking > to > fail the path so they can retry down a new one. The purpose of this > patchset and Martin's previous one is to handle this problem. If > there > are unavoidable critical problems that you see with this setup, it > would > be really helpful to know what they are. I'd also be interested in understanding this better. 
As noted above, I'm aware that passing through everything is dangerous and wrong in principle. But in practice, we haven't observed anything serious except (as Ben already said) the failure to do path failover in the SG_IO code path, which both this patch set and my set from the past are intended to fix.

While I am open to looking for better alternatives, I still hope that we can find an agreement on a short/mid-term solution that would allow us to serve our customers who currently use SCSI passthrough setups. That would not just benefit us (the enterprise distros); it would also help us fund upstream contributions.

Regards Martin
Il mer 14 mag 2025, 13:37 Martin Wilck <mwilck@suse.com> ha scritto: > > I'd go one step further. Christoph is right to say that what we're > currently doing in qemu – passing through every command except the > PRIN/PROUT to a multipath device – is a dangerous thing to do. > > Passthrough from a dm-multipath device to a SCSI device makes sense > only for a small subset of the SCSI command set. Basically just for the > regular IO commands like the various READ and WRITE variants and the > occasional UNMAP. The fact that customers > have been running these setups in large deployments over many years > suggests that, if other commands ever get passed through to member > devices, it has rarely had fatal consequences. > > Nobody would seriously consider sending ALUA commands to the multipath > devices. TUR and REQUEST SENSE are other examples for commands that > can't be reasonably passed through to random member devices of a > multipath map. Yes, as usual things are a bit more complicated. First, a handful of commands are special (REQUEST SENSE would be for HBAs that don't use auto sense, but that is fortunately not something you encounter). Second, there's already a filter in the kernel in drivers/scsi/scsi_ioctl.c for commands that are allowed without CAP_SYS_RAWIO. QEMU is subject to that so the commands you'll see are mostly I/O, INQUIRY, TUR, MODE SENSE/SELECT and that's it. Any command that the kernel doesn't filter would be rejected, or handled specially in the case of PR commands (PR commands use a separate privileged helper to send them down to the device; the helper also knows about multipath and uses the userspace libmpathpersist if it receives a dm-mpath file descriptor via SCM_RIGHTS). > AFAIK the only commands that we really need to pass through (except the standard ones) are the reservation commands, which get special handling > by qemu anyway. @Ben, @Kevin, are you aware of anything else? .Of the ones that aren't simple I/O, mode parameters and TUR are the important cases. A TUR failure would be handled by the ioctl that Kevin proposed here by forcing a path switch. Mode parameters might not be shared(*) and would need to be sent down all the paths in that case; that can be fixed in userspace if necessary. > I'd also be interested in understanding this better. As noted above, > I'm aware that passing through everything is dangerous and wrong in > principle. But in practice, we haven't observed anything serious except > (as Ben already said) the failure to do path failover in the SG_IO code > path, which both this patch set and my set from the past are intended > to fix. Yes, the kernel filter is a PITA in the normal single path case but here it helps not doing something overly wrong. Paolo (*) in theory there's a Mode Page Policy VPD page to tell the initiator whether they are. I'm not sure if anyone supports it in the real world...
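As a side note, the SCM_RIGHTS mechanism Paolo mentions is plain Unix file descriptor passing; a minimal sketch follows. This is generic code, not QEMU's actual pr-helper protocol, which also carries the PR command payload alongside the descriptor.

```c
/* Minimal sketch of handing a file descriptor (e.g. a dm-mpath fd) to a
 * helper process over an AF_UNIX socket with SCM_RIGHTS. */
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

int send_fd(int sock, int fd_to_pass)
{
	char dummy = 0;
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} u;
	struct msghdr msg = {
		.msg_iov = &iov, .msg_iovlen = 1,
		.msg_control = u.buf, .msg_controllen = sizeof(u.buf),
	};
	struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd_to_pass, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}
```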
On Thu, 2025-05-15 at 04:53 +0200, Paolo Bonzini wrote: > Il mer 14 mag 2025, 13:37 Martin Wilck <mwilck@suse.com> ha scritto: > > > > I'd go one step further. Christoph is right to say that what we're > > currently doing in qemu – passing through every command except the > > PRIN/PROUT to a multipath device – is a dangerous thing to do. > > > > Passthrough from a dm-multipath device to a SCSI device makes sense > > only for a small subset of the SCSI command set. Basically just for > > the > > regular IO commands like the various READ and WRITE variants and > > the > > occasional UNMAP. The fact that customers > > have been running these setups in large deployments over many years > > suggests that, if other commands ever get passed through to member > > devices, it has rarely had fatal consequences. > > > > Nobody would seriously consider sending ALUA commands to the > > multipath > > devices. TUR and REQUEST SENSE are other examples for commands that > > can't be reasonably passed through to random member devices of a > > multipath map. > > Yes, as usual things are a bit more complicated. First, a handful of > commands are special (REQUEST SENSE would be for HBAs that don't use > auto sense, but that is fortunately not something you encounter). > Second, there's already a filter in the kernel in > drivers/scsi/scsi_ioctl.c for commands that are allowed without > CAP_SYS_RAWIO. QEMU is subject to that so the commands you'll see are > mostly I/O, INQUIRY, TUR, MODE SENSE/SELECT and that's it. Thanks for mentioning this. However, I suppose that depends on the permissions with which the qemu process is started, no? Wouldn't qemu need CAP_SYS_RAWIO for PCI passthrough as well? I admit that I'm confused by the many indirections in qemu's scsi-block code flow. AFAICS qemu forwards everything except PRIN/PROUT to the kernel block device in "scsi-block" mode. Correct me if I'm wrong. Anwyway, let's not forget that we're talking about the kernel here. While qemu is the main target application for this feature is created, any application can use it, and it may or may not run with CAP_SYS_RAWIO. > Any command that the kernel doesn't filter would be rejected, or > handled specially in the case of PR commands (PR commands use a > separate privileged helper to send them down to the device; the > helper > also knows about multipath and uses the userspace libmpathpersist if > it receives a dm-mpath file descriptor via SCM_RIGHTS). > > > AFAIK the only commands that we really need to pass through (except > > the standard ones) are the reservation commands, which get special > > handling > > by qemu anyway. @Ben, @Kevin, are you aware of anything else? > > .Of the ones that aren't simple I/O, mode parameters and TUR are the > important cases. A TUR failure would be handled by the ioctl that > Kevin proposed here by forcing a path switch. Mode parameters might > not be shared(*) and would need to be sent down all the paths in that > case; that can be fixed in userspace if necessary. Passing TUR from a multipath device to a random member doesn't make much sense to me. qemu would need to implement some logic to determine whether the map has any usable paths. > > I'd also be interested in understanding this better. As noted > > above, > > I'm aware that passing through everything is dangerous and wrong in > > principle. 
But in practice, we haven't observed anything serious > > except > > (as Ben already said) the failure to do path failover in the SG_IO > > code > > path, which both this patch set and my set from the past are > > intended > > to fix. > > Yes, the kernel filter is a PITA in the normal single path case but > here it helps not doing something overly wrong. This seems coincidental to me. Filtering by permissions and filtering for commands that make sense on multipath devices are orthogonal problems. Regards, Martin
On Thu, May 15, 2025 at 12:34:13PM +0200, Martin Wilck wrote: > On Thu, 2025-05-15 at 04:53 +0200, Paolo Bonzini wrote: > > Il mer 14 mag 2025, 13:37 Martin Wilck <mwilck@suse.com> ha scritto: > > > > > > I'd go one step further. Christoph is right to say that what we're > > > currently doing in qemu – passing through every command except the > > > PRIN/PROUT to a multipath device – is a dangerous thing to do. > > > > > > Passthrough from a dm-multipath device to a SCSI device makes sense > > > only for a small subset of the SCSI command set. Basically just for > > > the > > > regular IO commands like the various READ and WRITE variants and > > > the > > > occasional UNMAP. The fact that customers > > > have been running these setups in large deployments over many years > > > suggests that, if other commands ever get passed through to member > > > devices, it has rarely had fatal consequences. > > > > > > Nobody would seriously consider sending ALUA commands to the > > > multipath > > > devices. TUR and REQUEST SENSE are other examples for commands that > > > can't be reasonably passed through to random member devices of a > > > multipath map. > > > > Yes, as usual things are a bit more complicated. First, a handful of > > commands are special (REQUEST SENSE would be for HBAs that don't use > > auto sense, but that is fortunately not something you encounter). > > Second, there's already a filter in the kernel in > > drivers/scsi/scsi_ioctl.c for commands that are allowed without > > CAP_SYS_RAWIO. QEMU is subject to that so the commands you'll see are > > mostly I/O, INQUIRY, TUR, MODE SENSE/SELECT and that's it. > > Thanks for mentioning this. > > However, I suppose that depends on the permissions with which the qemu > process is started, no? Wouldn't qemu need CAP_SYS_RAWIO for PCI > passthrough as well? > > I admit that I'm confused by the many indirections in qemu's scsi-block > code flow. AFAICS qemu forwards everything except PRIN/PROUT to the > kernel block device in "scsi-block" mode. Correct me if I'm wrong. > > Anwyway, let's not forget that we're talking about the kernel here. > While qemu is the main target application for this feature is created, > any application can use it, and it may or may not run with > CAP_SYS_RAWIO. Maybe I'm missing your issue, but QEMU is just using DM's existing ioctl forwarding code, which was already designed for general use, and I don't see any capability issues with this patchset itself. If the caller has the capability to issue this ioctl, they already have the capability to do reads to the device directly, which will cause any unusable paths to be failed, exactly like the ioctl does. The only difference it that the user has no way to know how many reads they would need to issue to the multipath device directly to force it to try all the paths. This gives any application the ability to verify the paths just like doing enough IO to the multipath device would, but it guarantees that it will use the minimum IO. -Ben > > Any command that the kernel doesn't filter would be rejected, or > > handled specially in the case of PR commands (PR commands use a > > separate privileged helper to send them down to the device; the > > helper > > also knows about multipath and uses the userspace libmpathpersist if > > it receives a dm-mpath file descriptor via SCM_RIGHTS). > > > > > AFAIK the only commands that we really need to pass through (except > > > the standard ones) are the reservation commands, which get special > > > handling > > > by qemu anyway. 
@Ben, @Kevin, are you aware of anything else? > > > > .Of the ones that aren't simple I/O, mode parameters and TUR are the > > important cases. A TUR failure would be handled by the ioctl that > > Kevin proposed here by forcing a path switch. Mode parameters might > > not be shared(*) and would need to be sent down all the paths in that > > case; that can be fixed in userspace if necessary. > > Passing TUR from a multipath device to a random member doesn't make > much sense to me. qemu would need to implement some logic to determine > whether the map has any usable paths. > > > > I'd also be interested in understanding this better. As noted > > > above, > > > I'm aware that passing through everything is dangerous and wrong in > > > principle. But in practice, we haven't observed anything serious > > > except > > > (as Ben already said) the failure to do path failover in the SG_IO > > > code > > > path, which both this patch set and my set from the past are > > > intended > > > to fix. > > > > Yes, the kernel filter is a PITA in the normal single path case but > > here it helps not doing something overly wrong. > > This seems coincidental to me. Filtering by permissions and filtering > for commands that make sense on multipath devices are orthogonal > problems. > > Regards, > Martin
On Thu, 2025-05-15 at 10:29 -0400, Benjamin Marzinski wrote: > On Thu, May 15, 2025 at 12:34:13PM +0200, Martin Wilck wrote: > > > > However, I suppose that depends on the permissions with which the > > qemu > > process is started, no? Wouldn't qemu need CAP_SYS_RAWIO for PCI > > passthrough as well? > > > > I admit that I'm confused by the many indirections in qemu's scsi- > > block > > code flow. AFAICS qemu forwards everything except PRIN/PROUT to the > > kernel block device in "scsi-block" mode. Correct me if I'm wrong. > > > > Anwyway, let's not forget that we're talking about the kernel here. > > While qemu is the main target application for this feature is > > created, > > any application can use it, and it may or may not run with > > CAP_SYS_RAWIO. > > Maybe I'm missing your issue, but QEMU is just using DM's existing > ioctl > forwarding code, which was already designed for general use, and I > don't > see any capability issues with this patchset itself. I didn't mean it this way. I was rather musing about the question whether doing SG_IO on multipath devices by forwarding them to the current path makes sense. Sorry for the confusing post. Regards Martin
On Thu, May 15, 2025 at 05:00:41PM +0200, Martin Wilck wrote: > I didn't mean it this way. I was rather musing about the question > whether doing SG_IO on multipath devices by forwarding them to the > current path makes sense.

It doesn't, and that's the core of the problem. Someone back in the day thought that just forwarding SG_IO would solve whatever problem they had back then that didn't have a proper kernel abstraction, and it keeps adding problems.
On Thu, May 15, 2025 at 12:34 PM Martin Wilck <mwilck@suse.com> wrote: > On Thu, 2025-05-15 at 04:53 +0200, Paolo Bonzini wrote: > > Il mer 14 mag 2025, 13:37 Martin Wilck <mwilck@suse.com> ha scritto: > > Yes, as usual things are a bit more complicated. First, a handful of > > commands are special (REQUEST SENSE would be for HBAs that don't use > > auto sense, but that is fortunately not something you encounter). > > Second, there's already a filter in the kernel in > > drivers/scsi/scsi_ioctl.c for commands that are allowed without > > CAP_SYS_RAWIO. QEMU is subject to that so the commands you'll see are > > mostly I/O, INQUIRY, TUR, MODE SENSE/SELECT and that's it. > > Thanks for mentioning this. However, I suppose that depends on the > permissions with which the qemu process is started, no? Wouldn't > qemu need CAP_SYS_RAWIO for PCI passthrough as well? Generally you want to assume that the VM is hostile and run QEMU with as few privileges as possible (not just capabilities, but also in separate namespaces and with restrictions from device cgroups, SELinux, etc.). PCI passthrough is not an issue, it only needs access to the VFIO inodes and you can do it by setting the appropriate file permissions without extra capabilities. The actual privileged part is binding the device to VFIO, which is done outside QEMU anyway. > I admit that I'm confused by the many indirections in qemu's scsi-block > code flow. AFAICS qemu forwards everything except PRIN/PROUT to the > kernel block device in "scsi-block" mode. Correct me if I'm wrong. Yes, that's correct. The code for PRIN/PROUT calls out to a separate privileged process (in scsi/qemu-pr-helper.c if you're curious) which is aware of multipath and can be extended if needed. > Anwyway, let's not forget that we're talking about the kernel here. > While qemu is the main target application for this feature is created, > any application can use it, and it may or may not run with > CAP_SYS_RAWIO. Yes, but once you have CAP_SYS_RAWIO all bets for sanity are off... it even lets you do SG_IO on partitions. > > .Of the ones that aren't simple I/O, mode parameters and TUR are the > > important cases. A TUR failure would be handled by the ioctl that > > Kevin proposed here by forcing a path switch. Mode parameters might > > not be shared(*) and would need to be sent down all the paths in that > > case; that can be fixed in userspace if necessary. > > Passing TUR from a multipath device to a random member doesn't make > much sense to me. qemu would need to implement some logic to determine > whether the map has any usable paths. As long as one path replies to a TUR and the host is able to (eventually, somehow) steer I/O transparently to that path, that should be good enough. If the one path that the kernel tries is down, QEMU can probe which paths are up and retry. That seems consistent with what you want from TUR but maybe I'm missing something. > > Yes, the kernel filter is a PITA in the normal single path case but > > here it helps not doing something overly wrong. > > This seems coincidental to me. Filtering by permissions and filtering > for commands that make sense on multipath devices are orthogonal > problems. I agree (it helps, it doesn't solve the problem), however there is a large overlap between the two. Paolo
On Thu, 2025-05-15 at 12:51 +0200, Paolo Bonzini wrote: > On Thu, May 15, 2025 at 12:34 PM Martin Wilck <mwilck@suse.com> > > > > Thanks for mentioning this. However, I suppose that depends on the > > permissions with which the qemu process is started, no? Wouldn't > > qemu need CAP_SYS_RAWIO for PCI passthrough as well? > > Generally you want to assume that the VM is hostile and run QEMU with > as few privileges as possible (not just capabilities, but also in > separate namespaces and with restrictions from device cgroups, > SELinux, etc.). PCI passthrough is not an issue, it only needs access > to the VFIO inodes and you can do it by setting the appropriate file > permissions without extra capabilities. The actual privileged part is > binding the device to VFIO, which is done outside QEMU anyway. Thanks for the clarification. > > I admit that I'm confused by the many indirections in qemu's scsi- > > block > > code flow. AFAICS qemu forwards everything except PRIN/PROUT to the > > kernel block device in "scsi-block" mode. Correct me if I'm wrong. > > Yes, that's correct. The code for PRIN/PROUT calls out to a separate > privileged process (in scsi/qemu-pr-helper.c if you're curious) which > is aware of multipath and can be extended if needed. Sure, I was aware of the helper. I just wasn't 100% clear about how it gets called. Found the code in the meantime [1]. [1] https://github.com/qemu/qemu/blob/864813878951b44e964eb4c012d832fd21f8cc0c/block/file-posix.c#L4286 > > > .Of the ones that aren't simple I/O, mode parameters and TUR are > > > the > > > important cases. A TUR failure would be handled by the ioctl that > > > Kevin proposed here by forcing a path switch. Mode parameters > > > might > > > not be shared(*) and would need to be sent down all the paths in > > > that > > > case; that can be fixed in userspace if necessary. > > > > Passing TUR from a multipath device to a random member doesn't make > > much sense to me. qemu would need to implement some logic to > > determine > > whether the map has any usable paths. > > As long as one path replies to a TUR and the host is able to > (eventually, somehow) steer I/O transparently to that path, that > should be good enough. If the one path that the kernel tries is down, > QEMU can probe which paths are up and retry. That seems consistent > with what you want from TUR but maybe I'm missing something. It's ok-ish, in particular in combination with Kevin't patch. But using an equivalent of "multipath -C" would be closer to the real thing for TUR. Regards Martin
Il mar 13 mag 2025, 11:29 Kevin Wolf <kwolf@redhat.com> ha scritto: > QEMU's scsi-block device actually started off using normal I/O for reads > and writes and using SG_IO only for things that aren't covered by normal > I/O. But then those had to be changed to SG_IO, too, so that the guest > would actually see the full SCSI status. Things the commit message > mentions are unit attention codes and handling RESERVATION CONFLICT > correctly (which made me unsure above if the more specific interface for > reservations could actually be used to fully get rid of SG_IO). For more > context, I'm adding Paolo who actually made that change back then. He > may remember the concrete bug(s) this fixed. The original reason to avoid SG_IO for reads and writes was purely performance (using the host scheduler), but it turned out to be a bad idea. RESERVATION CONFLICT indeed was one of the reasons why I moved QEMU away from SG_IO. It has since been fixed, because these days Linux uses EBADE for it and likewise there are errno values for some other statuses or sense codes, but it helps more in general to have the precise SCSI status and sense data. Having the real status and sense for example lets QEMU decide which errors to pass to the guest and which should be handled in the host (for example by pausing the VM). Some HBAs also have equivalents of Linux's host_status, and passing that down also makes for more accurate pass through. Also, specifically for reservations, I also didn't like the idea that a guest command could be split in multiple host commands and the reservation conflict would appear in the middle due to a concurrent PR OUT command. To be honest I don't know how physical disks and controllers handle that, but I didn't want to make it worse. Thanks, Paolo > So if you want the guest device to behave mostly the same as your host > device, I don't really see any way around SCSI passthrough and therefore > SG_IO. > > Kevin >
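A tiny sketch of the difference Paolo describes, assuming the current behaviour where the block layer reports a reservation conflict as EBADE: the normal IO path yields a single errno, while a passthrough consumer wants the real status byte and sense data that only SG_IO returns.

```c
/* With plain read/write through the block layer, a reservation conflict
 * is only visible as errno == EBADE on current kernels; everything else
 * about the SCSI completion is lost. */
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

ssize_t checked_pread(int fd, void *buf, size_t len, off_t off)
{
	ssize_t n = pread(fd, buf, len, off);

	if (n < 0 && errno == EBADE)
		fprintf(stderr, "reservation conflict\n");
	/* A passthrough device instead wants the real SCSI status byte
	 * (RESERVATION CONFLICT, 0x18) and sense data, which only SG_IO
	 * can return to userspace. */
	return n;
}
```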
On 5/13/25 07:55, Christoph Hellwig wrote: > On Mon, May 12, 2025 at 05:18:43PM +0200, Kevin Wolf wrote: >> Yes, it's a bit unfortunate, but we have to work with what we have. QEMU >> doesn't even necessarily know that it's dealing with a multipath device, >> so it just has to blindly try the ioctl and see if it works. > > Why is qemu even using SG_IO to start with? > To be able to forward SCSI commands. Cheers, Hannes -- Dr. Hannes Reinecke Kernel Storage Architect hare@suse.de +49 911 74053 688 SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
On Tue, May 13, 2025 at 08:09:45AM +0200, Hannes Reinecke wrote: > On 5/13/25 07:55, Christoph Hellwig wrote: > > On Mon, May 12, 2025 at 05:18:43PM +0200, Kevin Wolf wrote: > > > Yes, it's a bit unfortunate, but we have to work with what we have. QEMU > > > doesn't even necessarily know that it's dealing with a multipath device, > > > so it just has to blindly try the ioctl and see if it works. > > > > Why is qemu even using SG_IO to start with? > > > To be able to forward SCSI commands. Why?
On 5/13/25 08:14, Christoph Hellwig wrote: > On Tue, May 13, 2025 at 08:09:45AM +0200, Hannes Reinecke wrote: >> On 5/13/25 07:55, Christoph Hellwig wrote: >>> On Mon, May 12, 2025 at 05:18:43PM +0200, Kevin Wolf wrote: >>>> Yes, it's a bit unfortunate, but we have to work with what we have. QEMU >>>> doesn't even necessarily know that it's dealing with a multipath device, >>>> so it just has to blindly try the ioctl and see if it works. >>> >>> Why is qemu even using SG_IO to start with? >>> >> To be able to forward SCSI commands. > > Why? >

Reservations and stuff. There are customers who use GPFS ... But yes, I know. We are working on it; it should be possible to hook in the generic block reservation code here.

Cheers, Hannes -- Dr. Hannes Reinecke Kernel Storage Architect hare@suse.de +49 911 74053 688 SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
On Tue, May 13, 2025 at 08:32:12AM +0200, Hannes Reinecke wrote: > Reservations and stuff.

They should use the kernel persistent reservation API.

> There are customer who use GPFS ...

Supporting illegal binary-only modules is already enough of a reason to NAK this.
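For reference, a minimal sketch of the kernel persistent reservation API Christoph means (include/uapi/linux/pr.h), as it would be used on a block device node. The device path and key values are examples only; how well these pr_ops semantics fit dm-mpath's requirements is exactly what is discussed later in this thread.

```c
/* Register a PR key and take a reservation through the kernel PR API.
 * /dev/dm-0 and the key are illustrative values. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/pr.h>
#include <unistd.h>

int main(void)
{
	struct pr_registration reg = { .old_key = 0, .new_key = 0x123abc };
	struct pr_reservation rsv = { .key = 0x123abc,
				      .type = PR_WRITE_EXCLUSIVE_REG_ONLY };
	int fd = open("/dev/dm-0", O_RDWR);	/* example device node */

	if (fd < 0) { perror("open"); return 1; }
	if (ioctl(fd, IOC_PR_REGISTER, &reg) < 0) { perror("IOC_PR_REGISTER"); return 1; }
	if (ioctl(fd, IOC_PR_RESERVE, &rsv) < 0) { perror("IOC_PR_RESERVE"); return 1; }
	close(fd);
	return 0;
}
```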
On Mon, May 12, 2025 at 11:49:19PM -0700, Christoph Hellwig wrote: > On Tue, May 13, 2025 at 08:32:12AM +0200, Hannes Reinecke wrote: > > Reservations and stuff. > > They should use the kernel persistent reservation API.

Currently QEMU isn't sending Persistent Reservation requests to multipath devices at all. It's sending those directly to the underlying scsi devices. The issue here is with all the other SCSI commands that users want to send to their SCSI passthrough device that is actually a multipath device on top of a number of SCSI paths. They expect to get back the actual sense and status information, so QEMU needs to send them via SG_IO.

Without reading that sense and status information in the kernel, the multipath target can't know if it needs to fail a path and retry the ioctl down a different path. QEMU can read this information, but it doesn't know which path the multipath device sent the ioctl down. This patch just gives users a way to check the paths in the active pathgroup (which all should be able to handle IO) and fail those that can't. While QEMU is the driver of this, it's completely general functionality.

-Ben

> > > There are customer who use GPFS ... > > Supporting illegal binary only modules that is already enough of a > reason to NAK this.
On Tue, May 13, 2025 at 12:29:51PM -0400, Benjamin Marzinski wrote: > On Mon, May 12, 2025 at 11:49:19PM -0700, Christoph Hellwig wrote: > > On Tue, May 13, 2025 at 08:32:12AM +0200, Hannes Reinecke wrote: > > > Reservations and stuff. > > > > They should use the kernel persistent reservation API. > > Currently QEMU isn't sending Persistent Reservation requests to > multipath devices at all. It's sending those directly to the underlying > scsi devices. The issue here is with all the other SCSI commands that > users want to send to their SCSI passthrough device that is actually a > multipath device on top of a number of SCSI paths. They expect to > get back the actual sense and status information, so QEMU needs to > send them via SG_IOs. And that's the problem. There is plenty of path (I_T_L) nexus specific information in SCSI, and just trying to glob it back together above means you'll get path specific information in something that doesn't claim to be multi-pathing and will get random and changing results depending on the underlying path selection. > Without reading that sense and status information in kernel, the > multipath target can't know if it needs to fail a path and retry the > ioctl down a different path. The blk_status_t has the information to fail over. But the whole idea of implementing SG_IO in dm-mpath or any other stacking layer is just really, really dangerous.
On 5/13/25 18:29, Benjamin Marzinski wrote: > On Mon, May 12, 2025 at 11:49:19PM -0700, Christoph Hellwig wrote: >> On Tue, May 13, 2025 at 08:32:12AM +0200, Hannes Reinecke wrote: >>> Reservations and stuff. >> >> They should use the kernel persistent reservation API. > > Currently QEMU isn't sending Persistent Reservation requests to > multipath devices at all. It's sending those directly to the underlying > scsi devices. The issue here is with all the other SCSI commands that > users want to send to their SCSI passthrough device that is actually a > multipath device on top of a number of SCSI paths. They expect to > get back the actual sense and status information, so QEMU needs to > send them via SG_IOs. > > Without reading that sense and status information in kernel, the > multipath target can't know if it needs to fail a path and retry the > ioctl down a different path. QEMU can read this information, but it > doesn't know what path the multipath device send the ioctl down. This > patch just gives users a way to check the paths in the active pathgroup > (which all should be able to handle IO) and fail those that can't. > While QEMU is the driver of this, it's completely general functionality. > Ouch. Different reservations on different paths? Really? How do you manage to keep the reservations up-to-date? My recommendation is to use the _same_ reservation on all paths, precisely to avoid having to re-do reservations on path failure. And for that scenario the kernel persistent reservation API would work fine. Cheers, Hannes -- Dr. Hannes Reinecke Kernel Storage Architect hare@suse.de +49 911 74053 688 SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
On Wed, May 14, 2025 at 08:39:25AM +0200, Hannes Reinecke wrote: > On 5/13/25 18:29, Benjamin Marzinski wrote: > > On Mon, May 12, 2025 at 11:49:19PM -0700, Christoph Hellwig wrote: > > > On Tue, May 13, 2025 at 08:32:12AM +0200, Hannes Reinecke wrote: > > > > Reservations and stuff. > > > > > > They should use the kernel persistent reservation API. > > > > Currently QEMU isn't sending Persistent Reservation requests to > > multipath devices at all. It's sending those directly to the underlying > > scsi devices. The issue here is with all the other SCSI commands that > > users want to send to their SCSI passthrough device that is actually a > > multipath device on top of a number of SCSI paths. They expect to > > get back the actual sense and status information, so QEMU needs to > > send them via SG_IOs. > > > > Without reading that sense and status information in kernel, the > > multipath target can't know if it needs to fail a path and retry the > > ioctl down a different path. QEMU can read this information, but it > > doesn't know what path the multipath device send the ioctl down. This > > patch just gives users a way to check the paths in the active pathgroup > > (which all should be able to handle IO) and fail those that can't. > > While QEMU is the driver of this, it's completely general functionality. > > > Ouch. > Different reservations on different paths? Really? > How do you manage to keep the reservations up-to-date? multipath's libmpathpersist library. You contributed some code to it IIRC. It predates the kernel interface, and even if we switched to using the kernel interface to send persistent reservation commands to the paths instead of doing that in userspace (the dm_pr_ops functions would needs some reworking to handle multipath, where the rules are different than for other dm-devices. For instance, you only want to reserve on one path, and it's not necessarily a failure if you can't register a key down all paths) we would still probably want to use the library, since it also coordinates with multipathd so that it can register keys on paths that are later added to the multipath device or were down when the registration initially happened. > My recommendation is to use the _same_ reservation on all paths, > precisely to avoid having to re-do reservations on path failure. > And for that scenario the kernel persistent reservation API > would work fine. Libmpathpersist does register the same key for all paths of a multipath device. Doing anything else would be crazy. Sorry if my last email was misleading. But the kernel API cannot help in the case where a path is later added to the multipath device, or was down when the initial registration happened. But again, all this is orthogonal to the patches in question, which have nothing to do with Persistent Reservations. -Ben > Cheers, > > Hannes > -- > Dr. Hannes Reinecke Kernel Storage Architect > hare@suse.de +49 911 74053 688 > SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg > HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
On Tue, May 13, 2025 at 12:29:51PM -0400, Benjamin Marzinski wrote: > Without reading that sense and status information in kernel, the > multipath target can't know if it needs to fail a path and retry the > ioctl down a different path. QEMU can read this information, but it > doesn't know what path the multipath device send the ioctl down. This > patch just gives users a way to check the paths in the active pathgroup > (which all should be able to handle IO) and fail those that can't. > While QEMU is the driver of this, it's completely general functionality.

As I just replied to Martin, the problem is that this setup fundamentally can't work. Either you pass a SCSI device through, which should (mostly, there are a few warts) work. Or you want host side multipathing, in which case you must pass through the block device abstraction and not a SCSI one, or at least do a full emulation of the SCSI interfaces instead of pretending to pass it through. The root of all evil here is that dm-multipath tries to pass through SG_IO, which is dangerous for all but the most trivial commands.
On Mon, 2025-05-12 at 23:49 -0700, Christoph Hellwig wrote: > On Tue, May 13, 2025 at 08:32:12AM +0200, Hannes Reinecke wrote: > > Reservations and stuff. > > They should use the kernel persistent reservation API. > > > There are customer who use GPFS ... > > Supporting illegal binary only modules that is already enough of a > reason to NAK this.

GPFS is not the only use case. The reason why we have "scsi-block" in qemu is simply SCSI emulation. If we emulate a SCSI device to a VM, the device should support commands like persistent reservations properly. "scsi-block" supports this, while still offering decent IO performance to VMs. In the multipath case, the idea is to be able to hide from the VM the fact that the device in question is actually a multipath device, and still allow this sort of SCSI command to be passed through to the storage.

And no, passing the SCSI devices to the VM and doing multipath in the guest doesn't work. The transport layer isn't properly emulated (bluntly speaking, we have no FC emulation).

Regards Martin
On Tue, May 13, 2025 at 10:17:58AM +0200, Martin Wilck wrote: > If we emulate a SCSI device to a VM, the device should support commands > like persistent reservations properly.

Then point it to an actual SCSI device, and not a multipath device. Trying to split responsibility for handling the initiator side work is not in any way supported in the SCSI spec, and by trying it anyway you are just creating tons of problems.

> And no, passing the SCSI devices to the VM and doing multipath in the > the guest doesn't work. The transport layer isn't properly emulated > (bluntly speaking, we have no FC emulation).

Then fix that. Because without it you will be in never ending pain due to impedance mismatch.
On 5/14/25 06:53, Christoph Hellwig wrote: >> And no, passing the SCSI devices to the VM and doing multipath in the >> the guest doesn't work. The transport layer isn't properly emulated >> (bluntly speaking, we have no FC emulation). > > Then fix that. Because without it you will be in never ending pain > due to impedance mismatch. FC HBAs don't even expose enough of the protocol to actually do it sensibly, so that can't really be done. Hannes tried. Paolo
Hi

On Thu, 8 May 2025, Martin Wilck wrote: > Hello Kevin, > > [I'm sorry for the previous email. It seems that I clicked "send" in > the wrong window]. > > On Tue, 2025-04-29 at 18:50 +0200, Kevin Wolf wrote: > > Multipath cannot directly provide failover for ioctls in the kernel > > because it doesn't know what each ioctl means and which result could > > indicate a path error. Userspace generally knows what the ioctl it > > issued means and if it might be a path error, but neither does it > > know > > which path the ioctl took nor does it necessarily have the privileges > > to > > fail a path using the control device. > > Thanks for this effort. > > I have some remarks about your approach. The most important one is that > the DM_MPATH_PROBE_PATHS_CMD ioctl appears to be a dangerous command. > It sends IO to possibly broken paths and waits for it to complete. In > the common error case of a device not responding, this might cause the > code to hang for a long time in the kernel ioctl code path, in > uninterruptible sleep (note that these probing commands will be queued

Normal reads and writes would also hang in an uninterruptible sleep if a path stops responding.

> after other possibly hanging IO). In the past we have put a lot of > effort into other code paths in multipath-tools and elsewhere to avoid > this kind of hang to the extent possible. It seems to me that your set > re-introduces this risk. > > Apart from that, minor observations are that in patch 2/2 you don't > check whether the map is in queueing state, and I don't quite > understand why you only check paths in the map's active path group > without attempting a PG failover in the case where all paths in the > current PG fail. > > I am wondering whether the DM_MPATH_PROBE_PATHS_CMD ioctl is necessary > at all. Rather than triggering explicit path probing, you could have > dm-mpath "handle" IO errors by failing the active path for "path > errors". That would be similar to my patch set from 2021 [1], but it > would avoid the extra IO and thus the additional risk of hanging in the > kernel.

What exactly do you mean? You say "you could have dm-mpath 'handle' IO errors by failing the active path for 'path errors'". Christoph doesn't want dm-mpath to look at the error code - so dm-mpath doesn't know whether a path error occurred or not. qemu could look at the error code, but qemu doesn't know which path the SG_IO went through - so it can't instruct dm-mpath to fail that path.

One of the possibilities that we discussed was to add a path-id to SG_IO, so that dm-mpath would mark which path the SG_IO went through and qemu could instruct dm-mpath to fail that path. But we rejected it as being more complicated than the current approach.

> Contrary to my set, you wouldn't attempt retries in the kernel, but > leave this part to qemu instead, like in the current set. That would > avoid Christoph's main criticism that "Failing over SG_IO does not make > sense" [2]. > > Doing the failover in qemu has the general disadvantage that qemu has > no notion about the number of available and/or healthy paths; it can > thus only guess the reasonable number of retries for any given I/O. But > that's unavoidable, given that the idea to do kernel-level failover on > SG_IO is rejected. > > Please let me know your thoughts. > > Best Regards > Martin > > [1] https://lore.kernel.org/all/20210628151558.2289-1-mwilck@suse.com/ > [2] https://lore.kernel.org/all/20210701075629.GA25768@lst.de/

Mikulas
Hello Mikulas, On Mon, 2025-05-12 at 15:46 +0200, Mikulas Patocka wrote: > Hi > > > On Thu, 8 May 2025, Martin Wilck wrote: > > > Hello Kevin, > > > > [I'm sorry for the previous email. It seems that I clicked "send" > > in > > the wrong window]. > > > > On Tue, 2025-04-29 at 18:50 +0200, Kevin Wolf wrote: > > > Multipath cannot directly provide failover for ioctls in the > > > kernel > > > because it doesn't know what each ioctl means and which result > > > could > > > indicate a path error. Userspace generally knows what the ioctl > > > it > > > issued means and if it might be a path error, but neither does it > > > know > > > which path the ioctl took nor does it necessarily have the > > > privileges > > > to > > > fail a path using the control device. > > > > Thanks for this effort. > > > > I have some remarks about your approach. The most important one is > > that > > the DM_MPATH_PROBE_PATHS_CMD ioctl appears to be a dangerous > > command. > > It sends IO to possibly broken paths and waits for it to complete. > > In > > the common error case of a device not responding, this might cause > > the > > code to hang for a long time in the kernel ioctl code path, in > > uninterruptible sleep (note that these probing commands will be > > queued > > Normal reads and writes would also hang in an uninterruptible sleep > if a > path stops responding. Right. That's why multipathd uses TEST UNIT READY if supported by the storage, and either aio or separate I/O threads to avoid the main thread being blocked, and generally goes to great lengths to make sure that misbehaving storage hardware can cause no freeze. The way I read Kevin's code, you'd queue up additional IO in the same context that had submitted the original failing IO. I realize that qemu uses asynchronous IO, but AFAIK the details depend on the qemu options for the individual VM, and it isn't clear to me whether or not DM_MPATH_PROBE_PATHS_CMD can bring the entire VM to halt. > > [...] > > > > I am wondering whether the DM_MPATH_PROBE_PATHS_CMD ioctl is > > necessary > > at all. Rather than triggering explicit path probing, you could > > have > > dm-mpath "handle" IO errors by failing the active path for "path > > errors". That would be similar to my patch set from 2021 [1], but > > it > > would avoid the extra IO and thus the additional risk of hanging in > > the > > kernel. > > What exactly do you mean? You say "you could have dm-mpath 'handle' > IO > errors by failing the active path for "path errors". What I meant was that dm-mpath could call fail_path() in case of an error, using a similar mechanism to the block IO code path. > Christoph doesn't want dm-mpath can't look at the error code - so dm- > mpath > doesn't know if path error occured or not. My impression from the previous discussion was that Christoph mainly objected to SG_IO requests being retried in the kernel [1], not so much to the inspection of the error code by device mapper. I may have misunderstood this of course. Adding Christoph into the loop so that he can clarify what he meant. > qemu could look at the error > code, but qemu doesn't know which path did the SG_IO go through - so > it > can't instruct qemu to fail that path. > > One of the possibility that we discussed was to add a path-id to > SG_IO, so > that dm-mpath would mark which path did the SG_IO go through and qemu > could instruct dm-mpath to fail that path. But we rejected it as > being > more complicated than the current approach. 
If we discuss extending SG_IO itself, it might be easier to have it store the blkstatus_t somewhere in the sg_io_hdr. More about that idea in my reply to Kevin. Regards, Martin [1] https://lore.kernel.org/all/20210701075629.GA25768@lst.de/ > > Contrary to my set, you wouldn't attempt retries in the kernel, but > > leave this part to qemu instead, like in the current set. That > > would > > avoid Christoph's main criticism that "Failing over SG_IO does not > > make > > sense" [2]. > > > > Doing the failover in qemu has the general disadvantage that qemu > > has > > no notion about the number of available and/or healthy paths; it > > can > > thus only guess the reasonable number of retries for any given I/O. > > But > > that's unavoidable, given that the idea to do kernel-level failover > > on > > SG_IO is rejected. > > > > Please let me know your thoughts. > > > > Best Regards > > Martin > > > > [1] > > https://lore.kernel.org/all/20210628151558.2289-1-mwilck@suse.com/ > > [2] https://lore.kernel.org/all/20210701075629.GA25768@lst.de/ > > Mikulas