Multipath cannot directly provide failover for ioctls in the kernel because it doesn't know what each ioctl means and which result could indicate a path error. Userspace generally knows what the ioctl it issued means and if it might be a path error, but neither does it know which path the ioctl took nor does it necessarily have the privileges to fail a path using the control device.

This series adds an interface that userspace can use to probe paths and fail the bad ones after seeing a potential path error in an ioctl result. Once the bad paths are eliminated, the ioctl can be retried.

While the fundamental problem is relatively broad and can affect any sort of ioctl, the immediate motivation for this is the use of SG_IO in QEMU for SCSI passthrough. A preliminary QEMU side patch that makes use of the new interface to support multipath failover with SCSI passthrough can be found at:
https://repo.or.cz/qemu/kevin.git/commitdiff/78a474da3b39469b11fbb1d4e0ddf4797b637d35

Kevin Wolf (2):
  dm: Allow .prepare_ioctl to handle ioctls directly
  dm mpath: Interface for explicit probing of active paths

 include/linux/device-mapper.h |  9 +++-
 include/uapi/linux/dm-ioctl.h |  9 +++-
 drivers/md/dm-dust.c          |  4 +-
 drivers/md/dm-ebs-target.c    |  3 +-
 drivers/md/dm-flakey.c        |  4 +-
 drivers/md/dm-ioctl.c         |  1 +
 drivers/md/dm-linear.c        |  4 +-
 drivers/md/dm-log-writes.c    |  4 +-
 drivers/md/dm-mpath.c         | 95 ++++++++++++++++++++++++++++++++++-
 drivers/md/dm-switch.c        |  4 +-
 drivers/md/dm-verity-target.c |  4 +-
 drivers/md/dm-zoned-target.c  |  3 +-
 drivers/md/dm.c               | 17 ++++---
 13 files changed, 142 insertions(+), 19 deletions(-)

-- 
2.49.0
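As a rough illustration of the intended usage from userspace (this is not the QEMU patch itself; the probe ioctl name is taken from this series' dm-ioctl.h changes, and the path-error classification below is a deliberately simplified stand-in for the SCSI sense/host-status inspection userspace actually does):

    /*
     * Hedged sketch: retry an SG_IO that may have hit a path error by asking
     * dm-mpath to probe and fail bad paths first. DM_MPATH_PROBE_PATHS is the
     * ioctl added by this series (exact definition per the dm-ioctl.h hunk).
     */
    #include <sys/ioctl.h>
    #include <scsi/sg.h>
    #include <linux/dm-ioctl.h>
    #include <errno.h>
    #include <stdbool.h>

    static bool looks_like_path_error(const struct sg_io_hdr *io)
    {
        /* Simplified: any transport/host-level error suggests a bad path. */
        return io->host_status != 0;
    }

    static int sg_io_with_failover(int fd, struct sg_io_hdr *io, int max_retries)
    {
        for (int i = 0; i <= max_retries; i++) {
            if (ioctl(fd, SG_IO, io) < 0)
                return -errno;
            if (!looks_like_path_error(io))
                return 0;                 /* done, or a non-path error */
            /* Ask dm-mpath to probe the active paths and fail the bad ones,
             * then retry; this fails with e.g. ENOTTY if fd isn't dm-mpath. */
            if (ioctl(fd, DM_MPATH_PROBE_PATHS) < 0)
                return -errno;
        }
        return -EIO;
    }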
Hello Kevin, [I'm sorry for the previous email. It seems that I clicked "send" in the wrong window]. On Tue, 2025-04-29 at 18:50 +0200, Kevin Wolf wrote: > Multipath cannot directly provide failover for ioctls in the kernel > because it doesn't know what each ioctl means and which result could > indicate a path error. Userspace generally knows what the ioctl it > issued means and if it might be a path error, but neither does it > know > which path the ioctl took nor does it necessarily have the privileges > to > fail a path using the control device. Thanks for this effort. I have some remarks about your approach. The most important one is that the DM_MPATH_PROBE_PATHS_CMD ioctl appears to be a dangerous command. It sends IO to possibly broken paths and waits for it to complete. In the common error case of a device not responding, this might cause the code to hang for a long time in the kernel ioctl code path, in uninterruptible sleep (note that these probing commands will be queued after other possibly hanging IO). In the past we have put a lot of effort into other code paths in multipath-tools and elsewhere to avoid this kind of hang to the extent possible. It seems to me that your set re-introduces this risk. Apart from that, minor observations are that in patch 2/2 you don't check whether the map is in queueing state, and I don't quite understand why you only check paths in the map's active path group without attempting a PG failover in the case where all paths in the current PG fail. I am wondering whether the DM_MPATH_PROBE_PATHS_CMD ioctl is necessary at all. Rather than triggering explicit path probing, you could have dm-mpath "handle" IO errors by failing the active path for "path errors". That would be similar to my patch set from 2021 [1], but it would avoid the extra IO and thus the additional risk of hanging in the kernel. Contrary to my set, you wouldn't attempt retries in the kernel, but leave this part to qemu instead, like in the current set. That would avoid Christoph's main criticism that "Failing over SG_IO does not make sense" [2]. Doing the failover in qemu has the general disadvantage that qemu has no notion about the number of available and/or healthy paths; it can thus only guess the reasonable number of retries for any given I/O. But that's unavoidable, given that the idea to do kernel-level failover on SG_IO is rejected. Please let me know your thoughts. Best Regards Martin [1] https://lore.kernel.org/all/20210628151558.2289-1-mwilck@suse.com/ [2] https://lore.kernel.org/all/20210701075629.GA25768@lst.de/
Am 08.05.2025 um 15:51 hat Martin Wilck geschrieben:
> Hello Kevin,
>
> [I'm sorry for the previous email. It seems that I clicked "send" in the wrong window].
>
> On Tue, 2025-04-29 at 18:50 +0200, Kevin Wolf wrote:
> > Multipath cannot directly provide failover for ioctls in the kernel because it doesn't know what each ioctl means and which result could indicate a path error. Userspace generally knows what the ioctl it issued means and if it might be a path error, but neither does it know which path the ioctl took nor does it necessarily have the privileges to fail a path using the control device.
>
> Thanks for this effort.
>
> I have some remarks about your approach. The most important one is that the DM_MPATH_PROBE_PATHS_CMD ioctl appears to be a dangerous command. It sends IO to possibly broken paths and waits for it to complete. In the common error case of a device not responding, this might cause the code to hang for a long time in the kernel ioctl code path, in uninterruptible sleep (note that these probing commands will be queued after other possibly hanging IO). In the past we have put a lot of effort into other code paths in multipath-tools and elsewhere to avoid this kind of hang to the extent possible. It seems to me that your set re-introduces this risk.

That's a fair point. I don't expect this code to be fully final; another thing that isn't really optimal (as mentioned in the commit message) is that we're not probing paths in parallel, potentially adding up timeouts from multiple paths.

I don't think this is a problem of the interface, though, but we can improve the implementation while keeping the same interface.

Maybe I'm missing something, but I think the reason why it has to be uninterruptible is just that submit_bio_wait() has the completion on the stack? So I suppose we could just kmalloc() some state, submit all the bios, let the completion callback free it, and then we can just stop waiting early (i.e. wait_for_completion_interruptible/killable) and let the requests complete on their own in the background.

Would this address your concerns or is there another part to it?

> Apart from that, minor observations are that in patch 2/2 you don't check whether the map is in queueing state, and I don't quite understand why you only check paths in the map's active path group without attempting a PG failover in the case where all paths in the current PG fail.

Ben already fixed up the missing check.

Not attempting PG failover was also his suggestion, on the basis that it would be additional work for no real benefit when you can only submit requests for the current PG anyway. If userspace retries the SG_IO, it will already pick a different PG, so having the kernel do the same doesn't really improve anything.

> I am wondering whether the DM_MPATH_PROBE_PATHS_CMD ioctl is necessary at all. Rather than triggering explicit path probing, you could have dm-mpath "handle" IO errors by failing the active path for "path errors". That would be similar to my patch set from 2021 [1], but it would avoid the extra IO and thus the additional risk of hanging in the kernel.
>
> Contrary to my set, you wouldn't attempt retries in the kernel, but leave this part to qemu instead, like in the current set. That would avoid Christoph's main criticism that "Failing over SG_IO does not make sense" [2].
Maybe I misunderstood, but my understanding from the feedback you got back then was that no SCSI-specific code was wanted in device mapper. This is why we kept interpreting the status codes in userspace. While discussing the approaches with Mikuláš and Ben, one option that was briefly discussed was a pair of ioctls: One that wraps ioctls and returns which path the ioctl took, and another one to fail that path if inspection of the result in userspace comes to the conclusion that it's a path error. I didn't think this could be implemented without also allowing an unprivileged process to DoS the device by just failing all paths even if they are still good, so we didn't continue there. Anyway, if it actually were acceptable for the kernel to check SG_IO results for path errors and fail the path while still returning the result unchanged, then that would work for us for the specific case, of course. But I don't expect this really addresses Christoph's concerns. If you think it does, is there another reason why you didn't try this before? (And it wouldn't be generic, so would we potentially have to repeat the exercise for other ioctls in the future?) > Doing the failover in qemu has the general disadvantage that qemu has > no notion about the number of available and/or healthy paths; it can > thus only guess the reasonable number of retries for any given I/O. But > that's unavoidable, given that the idea to do kernel-level failover on > SG_IO is rejected. Yes, it's a bit unfortunate, but we have to work with what we have. QEMU doesn't even necessarily know that it's dealing with a multipath device, so it just has to blindly try the ioctl and see if it works. Kevin
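A minimal sketch of the non-blocking wait Kevin describes above: heap-allocate the probe state, let whoever drops the last reference free it, and wait killably so the ioctl caller can give up while the probe bios finish in the background. This is only an illustration of the idea, not the actual dm-mpath code; the names (probe_ctx etc.) are made up.

    #include <linux/bio.h>
    #include <linux/completion.h>
    #include <linux/slab.h>

    struct probe_ctx {
            struct completion done;
            atomic_t refs;          /* submitter + one per in-flight bio */
            atomic_t pending;       /* bios not yet completed */
            blk_status_t status;
    };

    static void probe_ctx_put(struct probe_ctx *ctx)
    {
            if (atomic_dec_and_test(&ctx->refs))
                    kfree(ctx);
    }

    static void probe_bio_end_io(struct bio *bio)
    {
            struct probe_ctx *ctx = bio->bi_private;

            if (bio->bi_status)
                    ctx->status = bio->bi_status;   /* records *some* failure */
            if (atomic_dec_and_test(&ctx->pending))
                    complete(&ctx->done);
            bio_put(bio);
            probe_ctx_put(ctx);
    }

    /* Submit one empty read per path, then wait killably instead of
     * uninterruptibly; bios that are still in flight clean up after
     * themselves if the caller gives up. */
    static int probe_paths_nonblocking(struct block_device **bdevs, int nr)
    {
            struct probe_ctx *ctx = kmalloc(sizeof(*ctx), GFP_KERNEL);
            int i, ret;

            if (!ctx)
                    return -ENOMEM;
            init_completion(&ctx->done);
            atomic_set(&ctx->refs, nr + 1);
            atomic_set(&ctx->pending, nr);
            ctx->status = BLK_STS_OK;

            for (i = 0; i < nr; i++) {
                    struct bio *bio = bio_alloc(bdevs[i], 0, REQ_OP_READ,
                                                GFP_KERNEL);

                    bio->bi_private = ctx;
                    bio->bi_end_io = probe_bio_end_io;
                    submit_bio(bio);
            }

            if (wait_for_completion_killable(&ctx->done))
                    ret = -EINTR;   /* caller gave up; bios finish on their own */
            else
                    ret = blk_status_to_errno(ctx->status);
            probe_ctx_put(ctx);
            return ret;
    }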
Hi Kevin, On Mon, 2025-05-12 at 17:18 +0200, Kevin Wolf wrote: > Am 08.05.2025 um 15:51 hat Martin Wilck geschrieben: > > Hello Kevin, > > > > [I'm sorry for the previous email. It seems that I clicked "send" > > in > > the wrong window]. > > > > On Tue, 2025-04-29 at 18:50 +0200, Kevin Wolf wrote: > > > Multipath cannot directly provide failover for ioctls in the > > > kernel > > > because it doesn't know what each ioctl means and which result > > > could > > > indicate a path error. Userspace generally knows what the ioctl > > > it > > > issued means and if it might be a path error, but neither does it > > > know > > > which path the ioctl took nor does it necessarily have the > > > privileges > > > to > > > fail a path using the control device. > > > > Thanks for this effort. > > > > I have some remarks about your approach. The most important one is > > that > > the DM_MPATH_PROBE_PATHS_CMD ioctl appears to be a dangerous > > command. > > It sends IO to possibly broken paths and waits for it to complete. > > In > > the common error case of a device not responding, this might cause > > the > > code to hang for a long time in the kernel ioctl code path, in > > uninterruptible sleep (note that these probing commands will be > > queued > > after other possibly hanging IO). In the past we have put a lot of > > effort into other code paths in multipath-tools and elsewhere to > > avoid > > this kind of hang to the extent possible. It seems to me that your > > set > > re-introduces this risk. > > That's a fair point. I don't expect this code to be fully final, > another > thing that isn't really optimal (as mentioned in the comment message) > is > that we're not probing paths in parallel, potentially adding up > timeouts > from multiple paths. > > I don't think this is a problem of the interface, though, but we can > improve the implementation keeping the same interface. > Maybe I'm missing something, but I think the reason why it has to be > uninterruptible is just that submit_bio_wait() has the completion on > the > stack? So I suppose we could just kmalloc() some state, submit all > the > bios, let the completion callback free it, and then we can just stop > waiting early (i.e. wait_for_completion_interruptible/killable) and > let > the requests complete on their own in the background. > > Would this address your concerns or is there another part to it? It'd be an improvement. Not sure if it'd solve the problem entirely. > > Apart from that, minor observations are that in patch 2/2 you don't > > check whether the map is in queueing state, and I don't quite > > understand why you only check paths in the map's active path group > > without attempting a PG failover in the case where all paths in the > > current PG fail. > > Ben already fixed up the missing check. I missed that. > Not attempting PG failover was also his suggestion, on the basis that > it > would be additional work for no real benefit when you can only submit > requests for the current PG anyway. If userspace retries the SG_IO, > it > will already pick a different PG, so having the kernel do the same > doesn't really improve anything. Ok. But continuing along this line of reasoning, I'd conclude that it would be even more useful to fail the path in device-mapper directly after a failed IO request (given that we find an agreeable way to do this, see below). Qemu could then retry, depending on the error code. Paths would be failed one after the other, and eventually a PG failover would occur. 
> > I am wondering whether the DM_MPATH_PROBE_PATHS_CMD ioctl is > > necessary > > at all. Rather than triggering explicit path probing, you could > > have > > dm-mpath "handle" IO errors by failing the active path for "path > > errors". That would be similar to my patch set from 2021 [1], but > > it > > would avoid the extra IO and thus the additional risk of hanging in > > the > > kernel. > > > > Contrary to my set, you wouldn't attempt retries in the kernel, but > > leave this part to qemu instead, like in the current set. That > > would > > avoid Christoph's main criticism that "Failing over SG_IO does not > > make > > sense" [2]. > > Maybe I misunderstood, but my understanding from the feedback you got > back then was that no SCSI-specific code was wanted in device mapper. > This is why we kept interpreting the status codes in userspace. As I wrote in my reply to Mikulas already, my understanding was that Christoph's main objection was against retrying SG_IO ioctls in the kernel on different paths, not against inspecting the error code. > While discussing the approaches with Mikuláš and Ben, one option that > was briefly discussed was a pair of ioctls: One that wraps ioctls and > returns which path the ioctl took, and another one to fail that path > if > inspection of the result in userspace comes to the conclusion that > it's > a path error. I didn't think this could be implemented without also > allowing an unprivileged process to DoS the device by just failing > all > paths even if they are still good, so we didn't continue there. This doesn't sound too promising IMO. If we want to find a solution that's focused on user space, I'd suggest to use multipathd rather than additional ioctls. After all, all necessary information is available in multipathd. Already today, qemu can use the multipath socket to query the current state of paths in a map. I can imagine a new multipathd CLI command that would instruct multipathd to check all paths of a given map immediately (@Ben: set pp->tick to 1 for all paths of the map). This command could be sent to multipathd in case of a suspected path failure. This would have similar semantics as the DM_MPATH_PROBE_PATHS_CMD ioctl, with some advantages, as it would take the multipath configuration of the specific storage array into account. It'd also avoid blocking qemu. multipathd would allow this command for non-root connections, but to avoid DoS, accept it only once every N seconds. > Anyway, if it actually were acceptable for the kernel to check SG_IO > results for path errors and fail the path while still returning the > result unchanged, then that would work for us for the specific case, > of > course. But I don't expect this really addresses Christoph's > concerns. For regular block IO, the "path error" logic is in blk_path_error() [1] and scsi_result_to_blk_status() [2]. blk_path_error() can be called from dm; the problem in the SG_IO code path is that we don't have a blk_status_t to examine. Therefore, in my old code, I replicated the logic of scsi_result_to_blk_status() in the dm layer. It works, but it's admittedly not the cleanest possible approach. OTOH, SG_IO on dm devices isn't the cleanest thing to do in general ;-) > If you think it does, is there another reason why you didn't try this > before? It didn't occur to me back then that we could fail paths without retrying in the kernel. Perhaps we could have the sg driver pass the blk_status_t (which is available on the sg level) to device mapper somehow in the sg_io_hdr structure? 
That way we could entirely avoid the layering violation between SCSI and dm. Not sure if that would be acceptable to Christoph, as blk_status_t is supposed to be exclusive to the kernel. Can we find a way to make sure it's passed to DM, but not to user space?

Alternatively (if maintainers strictly object to the error code inspection by dm), I wonder whether we could just _always_ fail the current path in case of errors in the dm-mpath SG_IO code path, without inspecting the error code. Rationale: a) most potential error conditions are treated as "path error" in the block IO code path, b) qemu can still inspect the error code and avoid retrying for errors that aren't path errors, and c) if multipathd is running and the path has been failed mistakenly, it will reinstate that path after just a few seconds. In the worst case, an unintentional failover would occur. That isn't as bad as it used to be, as active-passive configurations with explicit TPGS are less common than in the past.

> (And it wouldn't be generic, so would we potentially have to repeat the exercise for other ioctls in the future?)

I don't think so. Do you have anything specific in mind?

> > Doing the failover in qemu has the general disadvantage that qemu has no notion about the number of available and/or healthy paths; it can thus only guess the reasonable number of retries for any given I/O. But that's unavoidable, given that the idea to do kernel-level failover on SG_IO is rejected.
>
> Yes, it's a bit unfortunate, but we have to work with what we have. QEMU doesn't even necessarily know that it's dealing with a multipath device, so it just has to blindly try the ioctl and see if it works.

See the paragraph about multipathd above.

Regards
Martin

[1] https://elixir.bootlin.com/linux/v6.14.6/source/include/linux/blk_types.h#L185
[2] https://elixir.bootlin.com/linux/v6.14.6/source/drivers/scsi/scsi_lib.c#L685

PS: Yet another option that Christoph has suggested in the past would be to move away from qemu's "scsi-block", and use the generic block reservation mechanism to allow passthrough of reservation commands from VMs to the host. Nobody seems to have explored this option seriously so far.
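For readers following reference [1] above: blk_path_error() is roughly the following (paraphrased from include/linux/blk_types.h; the exact list of non-path statuses has shifted slightly across kernel versions, e.g. BLK_STS_NEXUS was later renamed):

    /* Anything not in this list is treated as a potential path failure
     * and may be retried on another path by dm-mpath. */
    static inline bool blk_path_error(blk_status_t error)
    {
            switch (error) {
            case BLK_STS_NOTSUPP:
            case BLK_STS_NOSPC:
            case BLK_STS_TARGET:
            case BLK_STS_RESV_CONFLICT:     /* BLK_STS_NEXUS in older kernels */
            case BLK_STS_MEDIUM:
            case BLK_STS_PROTECTION:
                    return false;
            }
            return true;    /* everything else could be a path error */
    }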
On Tue, 2025-05-13 at 10:00 +0200, Martin Wilck wrote:
> > If you think it does, is there another reason why you didn't try this before?
>
> It didn't occur to me back then that we could fail paths without retrying in the kernel.
>
> Perhaps we could have the sg driver pass the blk_status_t (which is available on the sg level) to device mapper somehow in the sg_io_hdr structure? That way we could entirely avoid the layering violation between SCSI and dm. Not sure if that would be acceptable to Christoph, as blk_status_t is supposed to be exclusive to the kernel. Can we find a way to make sure it's passed to DM, but not to user space?

I have to correct myself. I was confused by my old patches, which contain special casing for SG_IO. The current upstream code does of course not support special-casing SG_IO in any way. device-mapper neither looks at the ioctl `cmd` value nor at any arguments, and has only the Unix error code to examine when the ioctl returns. The device mapper layer has access to *less* information than the user space process that issued the ioctl. Adding hooks to the sg driver wouldn't buy us anything in this situation.

If we can't change this, we can't fail paths in the SG_IO error code path, end of story.

With Kevin's patch 1/2 applied, it would in principle be feasible to special-case SG_IO, handle it in dm-multipath, retrieve the blk_status_t somehow, and possibly initiate path failover. This way we'd at least keep the generic dm layer clean of SCSI-specific code. But still, the end result would look very similar to my attempt from 2021 and would therefore lead us nowhere, probably.

I'm still not too fond of DM_MPATH_PROBE_PATHS_CMD, but I can't offer a better solution at this time. If the side issues are fixed, it will be an improvement over the current upstream situation, where we can do no path failover at all.

In the long term, we should evaluate alternatives. If my conjecture in my previous post is correct that we need only PRIN/PROUT commands, there might be a better solution than scsi-block for our customers. Using regular block IO should actually also improve performance.

Regards
Martin
Am 14.05.2025 um 23:21 hat Martin Wilck geschrieben: > On Tue, 2025-05-13 at 10:00 +0200, Martin Wilck wrote: > > > If you think it does, is there another reason why you didn't try > > > this > > > before? > > > > It didn't occur to me back then that we could fail paths without > > retrying in the kernel. > > > > Perhaps we could have the sg driver pass the blk_status_t (which is > > available on the sg level) to device mapper somehow in the sg_io_hdr > > structure? That way we could entirely avoid the layering violation > > between SCSI and dm. Not sure if that would be acceptible to > > Christoph, > > as blk_status_t is supposed to be exclusive to the kernel. Can we > > find > > a way to make sure it's passed to DM, but not to user space? > > I have to correct myself. I was confused by my old patches which > contain special casing for SG_IO. The current upstream code does of > course not support special-casing SG_IO in any way. device-mapper > neither looks at the ioctl `cmd` value nor at any arguments, and has > only the Unix error code to examine when the ioctl returns. The device > mapper layer has access to *less* information than the user space > process that issued the ioctl. Adding hooks to the sg driver wouldn't > buy us anything in this situation. > > If we can't change this, we can't fail paths in the SG_IO error code > path, end of story. Yes, as long as we can't look at the sg_io_hdr, there is no way to figure out if we got a path error. > With Kevin's patch 1/2 applied, it would in principle be feasible to > special-case SG_IO, handle it in the dm-multipath, retrieve the > blk_status_t somehow, and possibly initiate path failover. This way > we'd at least keep the generic dm layer clean of SCSI specific code. > But still, the end result would look very similar attempt from 2021 and > would therefore lead us nowhere, probably. Right, that was my impression, too. The interfaces could be made look a bit different, and we could return -EAGAIN to userspace instead of retrying immediately (not that it makes sense to me, but if that were really the issue, fine with me), but the core logic with copying the sg_io_hdr, calling sg_io() directly and then inspecting the status and possibly failing paths would have to be pretty much the same as you had. > I'm still not too fond of DM_MPATH_PROBE_PATHS_CMD, but I can't offer a > better solution at this time. If the side issues are fixed, it will be > an improvement over the current upstream, situation where we can do no > path failover at all. Yes, I agree we should focus on improving what we have, rather than trying to find another radically different approach that none of us have thought of before. > In the long term, we should evaluate alternatives. If my conjecture in > my previous post is correct we need only PRIN/PROUT commands, there > might be a better solution than scsi-block for our customers. Using > regular block IO should actually also improved performance. If you're talking about SG_IO in dm-mpath, then PRIN/PROUT commands are actually the one thing that we don't need. libmpathpersist sends the commands to the individual path devices, so dm-mpath will never see those. It's mostly about getting the full results on the SCSI level for normal I/O commands. There has actually been a patch series on qemu-devel last year (that I haven't found the time to review properly yet) that would add explicit persistent reservation operations to QEMU's block layer that could then be used with the emulated scsi-hd device. 
On the backend, it only implemented it for iscsi, but I suppose we could implement it for file-posix, too (using the same libmpathpersist code as for passthrough). If that works, maybe at least some users can move away from SCSI passthrough. The thing that we need to make sure, though, is that the emulated status we can expose to the guest is actually good enough. That Paolo said that the problem with reservation conflicts was mostly because -EBADE wasn't a thing yet gives me some hope that at least this wouldn't be a problem any more today. We would still lose other parts of the SCSI status, so I'm still a bit cautious here with making a prediction for how many users could eventually (I expect years) use the emulated device instead and how many would keep using passthrough even in the long term. Kevin
On Thu, May 15, 2025 at 12:11:49PM +0200, Kevin Wolf wrote: > If you're talking about SG_IO in dm-mpath, then PRIN/PROUT commands are > actually the one thing that we don't need. libmpathpersist sends the > commands to the individual path devices, so dm-mpath will never see > those. It's mostly about getting the full results on the SCSI level for > normal I/O commands. > > There has actually been a patch series on qemu-devel last year (that I > haven't found the time to review properly yet) that would add explicit > persistent reservation operations to QEMU's block layer that could then > be used with the emulated scsi-hd device. On the backend, it only > implemented it for iscsi, but I suppose we could implement it for > file-posix, too (using the same libmpathpersist code as for > passthrough). If that works, maybe at least some users can move away > from SCSI passthrough. Please call into the kernel PR code instead of hacking up more of this, which will just run into problems again. > The thing that we need to make sure, though, is that the emulated status > we can expose to the guest is actually good enough. That Paolo said that > the problem with reservation conflicts was mostly because -EBADE wasn't > a thing yet gives me some hope that at least this wouldn't be a problem > any more today. And if you need more information we can add that to the kernel PR API.
On Thu, 2025-05-15 at 23:00 -0700, Christoph Hellwig wrote:
> On Thu, May 15, 2025 at 12:11:49PM +0200, Kevin Wolf wrote:
> > If you're talking about SG_IO in dm-mpath, then PRIN/PROUT commands are actually the one thing that we don't need. libmpathpersist sends the commands to the individual path devices, so dm-mpath will never see those. It's mostly about getting the full results on the SCSI level for normal I/O commands.
> >
> > There has actually been a patch series on qemu-devel last year (that I haven't found the time to review properly yet) that would add explicit persistent reservation operations to QEMU's block layer that could then be used with the emulated scsi-hd device. On the backend, it only implemented it for iscsi, but I suppose we could implement it for file-posix, too (using the same libmpathpersist code as for passthrough). If that works, maybe at least some users can move away from SCSI passthrough.
>
> Please call into the kernel PR code instead of hacking up more of this, which will just run into problems again.

I still don't get what this buys us.

The guest (which might be Windows or whatever else) sends SCSI reservation commands. qemu will have to intercept these anyway, unless the backend device is a plain SCSI device (in which case transformation into a generic PR command would be a strange thing to do).

If the backend device is multipath on SCSI, qemu-pr-helper would take the most appropriate action and return the most appropriate result code. The dm-multipath layer can't do it as well, as it doesn't have the full information about the target that's available in user space (see Ben's note about ALL_TG_PT for example). I don't see any benefit from using a generic reservation on dm-mpath instead of using qemu-pr-helper for this important use case.

I also don't see why this way of handling SCSI PR commands would be dangerous. You are of course right to say that passthrough of other SCSI commands (except regular IO and PR) is highly dangerous, but in the concept that Kevin described that wouldn't happen.

Transforming the SCSI reservation commands into generic reservation commands makes sense for _other_ types of backend devices. NVMe comes to mind, but (for real-world applications) not much else. (But does it make sense to present NVMe devices to guests as SCSI devices?)

Regards
Martin
On Mon, May 19, 2025 at 07:33:50PM +0200, Martin Wilck wrote:
> I still don't get what this buys us.

You get one layer to deal with instead of poking into a leaky abstraction. Qemu sees a block device, checks if it supports persistent reservations, and everything is taken care of underneath instead of having to try to understand what is below and working around it.
Am 16.05.2025 um 08:00 hat Christoph Hellwig geschrieben: > On Thu, May 15, 2025 at 12:11:49PM +0200, Kevin Wolf wrote: > > If you're talking about SG_IO in dm-mpath, then PRIN/PROUT commands are > > actually the one thing that we don't need. libmpathpersist sends the > > commands to the individual path devices, so dm-mpath will never see > > those. It's mostly about getting the full results on the SCSI level for > > normal I/O commands. > > > > There has actually been a patch series on qemu-devel last year (that I > > haven't found the time to review properly yet) that would add explicit > > persistent reservation operations to QEMU's block layer that could then > > be used with the emulated scsi-hd device. On the backend, it only > > implemented it for iscsi, but I suppose we could implement it for > > file-posix, too (using the same libmpathpersist code as for > > passthrough). If that works, maybe at least some users can move away > > from SCSI passthrough. > > Please call into the kernel PR code instead of hacking up more of > this, which will just run into problems again. I agree that using kernel code is preferable to doing things behind the kernel's back. However, libmpathpersist is the official interface for doing these things with multipath devices, so I think the necessary work to make this happen should primarily be done in the library (and possibly the kernel if the existing interfaces aren't good enough). QEMU could directly call the kernel when qemu-pr-helper isn't in use. I don't know enough about how libmpathpersist works internally to tell if running it this way would be a good idea for multipath devices. Can multipathd still do its job with reservations being made behind its back? (It would probably be good to allow this eventually, but is it the case today?) Kevin
On Thu, May 15, 2025 at 11:00:17PM -0700, Christoph Hellwig wrote:
> On Thu, May 15, 2025 at 12:11:49PM +0200, Kevin Wolf wrote:
> > If you're talking about SG_IO in dm-mpath, then PRIN/PROUT commands are actually the one thing that we don't need. libmpathpersist sends the commands to the individual path devices, so dm-mpath will never see those. It's mostly about getting the full results on the SCSI level for normal I/O commands.
> >
> > There has actually been a patch series on qemu-devel last year (that I haven't found the time to review properly yet) that would add explicit persistent reservation operations to QEMU's block layer that could then be used with the emulated scsi-hd device. On the backend, it only implemented it for iscsi, but I suppose we could implement it for file-posix, too (using the same libmpathpersist code as for passthrough). If that works, maybe at least some users can move away from SCSI passthrough.
>
> Please call into the kernel PR code instead of hacking up more of this, which will just run into problems again.
>
> > The thing that we need to make sure, though, is that the emulated status we can expose to the guest is actually good enough. That Paolo said that the problem with reservation conflicts was mostly because -EBADE wasn't a thing yet gives me some hope that at least this wouldn't be a problem any more today.
>
> And if you need more information we can add that to the kernel PR API.

I've run into SCSI arrays that always act like they were called with the ALL_TG_PT flag set (or perhaps they were just configured that way, I never could get a straight answer about that). libmpathpersist accepts this flag, and only sends one registration per initiator & target device when it's set, instead of one per initiator & target port. dm-multipath currently doesn't have the knowledge necessary to figure out which devices it needs to send reservations to, even if the kernel PR API supported this.

But I suppose it could just send the registration normally down one path and if that works, it could just ignore the existing key when it sends the registration down all the other paths.

-Ben
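For what it's worth, the "register and ignore existing key" variant Ben mentions already has a direct equivalent in the kernel PR API that Christoph refers to. A minimal userspace sketch using linux/pr.h (error handling omitted):

    #include <sys/ioctl.h>
    #include <linux/pr.h>
    #include <stdint.h>

    /* Register a key on a block device while ignoring any key that is
     * already registered, i.e. the equivalent of SCSI REGISTER AND
     * IGNORE EXISTING KEY. */
    static int pr_register_ignore(int fd, uint64_t key)
    {
        struct pr_registration reg = {
            .old_key = 0,
            .new_key = key,
            .flags   = PR_FL_IGNORE_KEY,
        };

        return ioctl(fd, IOC_PR_REGISTER, &reg);
    }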
On Fri, May 16, 2025 at 12:06:21PM -0400, Benjamin Marzinski wrote: > I've run into SCSI arrays that always act like they were called with the > ALL_TG_PT flag set (or perhaps they were just configured that way, I > never could get a straight answer about that). Hmm, that's pretty strange. > libmpathpersist accepts > this flag, and only sends one registration per initiator & target device > when it's set, instead of one per initator & target port. With "this flag" you mean ALL_TG_PT? > dm-multipath > currently doesn't have the knowledge necessary to figure out which > devices it needs to send reservations to, even if the kernel PR API > supported this. > > But I supposed it could just send the registration normally down one > path and if that works, it could just ignore the existing key when it > sends the registration down all the other paths. We could add support for a similar flag to the kernel PR API. The problem is to figure out how to discover support for it. What does the library do for that currently?
On Sun, 2025-05-18 at 22:32 -0700, Christoph Hellwig wrote: > On Fri, May 16, 2025 at 12:06:21PM -0400, Benjamin Marzinski wrote: > > I've run into SCSI arrays that always act like they were called > > with the > > ALL_TG_PT flag set (or perhaps they were just configured that way, > > I > > never could get a straight answer about that). > > Hmm, that's pretty strange. > > > libmpathpersist accepts > > this flag, and only sends one registration per initiator & target > > device > > when it's set, instead of one per initator & target port. > > With "this flag" you mean ALL_TG_PT? multipath-tools allows setting the flag "all_tg_pt" in multipath.conf for certain storage devices or multipath maps. If this flag is set, mpathpersist will always act as if the --param-alltgpt flag was given. This means that the REGISTER commands are sent once per initiator port (SCSI host) only, instead of once per I_T nexus. By default, the all_tg_pt flag is not set for any device, not even for EMC VNX, where Ben observed the described behavior. It seems that devices that behave like this are a rare exception. > > dm-multipath > > currently doesn't have the knowledge necessary to figure out which > > devices it needs to send reservations to, even if the kernel PR API > > supported this. > > > > But I supposed it could just send the registration normally down > > one > > path and if that works, it could just ignore the existing key when > > it > > sends the registration down all the other paths. > > We could add support for a similar flag to the kernel PR API. The > problem is to figure out how to discover support for it. What does > the library do for that currently? dm-multipath has no knowledge about target or initiator ports, and I fail to see how we could provide this knowledge to it without adding yet another layering violation. Just sending the command once will not be sufficient if there are multiple local ports. Regards Martin
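For completeness, this is roughly how the option is enabled per storage array in multipath.conf (the vendor/product strings are placeholders):

    devices {
        device {
            vendor    "VENDOR"
            product   "PRODUCT"
            all_tg_pt yes
        }
    }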
On Sun, May 18, 2025 at 10:32:17PM -0700, Christoph Hellwig wrote:
> On Fri, May 16, 2025 at 12:06:21PM -0400, Benjamin Marzinski wrote:
> > I've run into SCSI arrays that always act like they were called with the ALL_TG_PT flag set (or perhaps they were just configured that way, I never could get a straight answer about that).
>
> Hmm, that's pretty strange.
>
> > libmpathpersist accepts this flag, and only sends one registration per initiator & target device when it's set, instead of one per initiator & target port.
>
> With "this flag" you mean ALL_TG_PT?

Yes.

> > dm-multipath currently doesn't have the knowledge necessary to figure out which devices it needs to send reservations to, even if the kernel PR API supported this.
> >
> > But I suppose it could just send the registration normally down one path and if that works, it could just ignore the existing key when it sends the registration down all the other paths.
>
> We could add support for a similar flag to the kernel PR API. The problem is to figure out how to discover support for it. What does the library do for that currently?

libmpathpersist knows which scsi path devices map to the same initiator, but different target ports. When ALL_TG_PT is set, it only sends one registration command to each group, instead of sending one to each path device (like when ALL_TG_PT isn't set). If it sent a registration to every SCSI device on these arrays that act like ALL_TG_PT is always set, it would get reservation conflicts when sending the command to devices using ports after the first, since they would already have a registered key. That's why I mentioned the possibility of using Register and Ignore, which doesn't care if there's a key already registered.

I don't think it would take much work for the kernel API to support ALL_TG_PT, if the SCSI implementation just passed the flag down to the device. The problem really only exists for things like dm-multipath, which are composed of a number of different devices that all point to the same underlying target LUN. To handle this without relying on something like Register and Ignore, it would need to be able to group all the devices that come from the same initiator but different target ports together, and only send a registration to one device per group. This takes more SCSI-specific knowledge than it currently has, or really wants to have.

-Ben
On Thu, 2025-05-15 at 12:11 +0200, Kevin Wolf wrote:
> Am 14.05.2025 um 23:21 hat Martin Wilck geschrieben:
> > In the long term, we should evaluate alternatives. If my conjecture in my previous post is correct that we need only PRIN/PROUT commands, there might be a better solution than scsi-block for our customers. Using regular block IO should actually also improve performance.
>
> If you're talking about SG_IO in dm-mpath, then PRIN/PROUT commands are actually the one thing that we don't need. libmpathpersist sends the commands to the individual path devices, so dm-mpath will never see those. It's mostly about getting the full results on the SCSI level for normal I/O commands.

I know. I meant "need" from the PoV of the guest, in the sense of "which commands need to be passed from the guest to the device (in some reasonable way) except the normal READ and WRITE commands?"

Regards
Martin
On 5/15/25 12:11, Kevin Wolf wrote: > The thing that we need to make sure, though, is that the emulated status > we can expose to the guest is actually good enough. That Paolo said that > the problem with reservation conflicts was mostly because -EBADE wasn't > a thing yet gives me some hope that at least this wouldn't be a problem > any more today. Hmm I'm not sure about that anymore... I was confused because some of the refactoring of block errors was done in 2017, which is around the same time I was working on persistent reservations in QEMU. However, EBADE handling dates back to 2011 (commit 63583cca745f, "[SCSI] Add detailed SCSI I/O errors", 2011-02-12) and yet the Windows tests for PR were failing before QEMU switched to SG_IO for reads and writes. I guess I have to try reverting that and retest, though. Paolo
On Thu, 2025-05-15 at 13:09 +0200, Paolo Bonzini wrote: > However, EBADE handling dates back to 2011 (commit 63583cca745f, > "[SCSI] > Add detailed SCSI I/O errors", 2011-02-12) and yet the Windows tests > for > PR were failing before QEMU switched to SG_IO for reads and writes. > I > guess I have to try reverting that and retest, though. Thanks! This makes me realize that we could summarize the goal for future efforts (independent of the current patch set) roughly like this: "Emulate a SCSI disk on top of a (host) multipath device in a way that 1) failover works properly (like it would work for regular IO from the host itself), 2) Windows tests for PR (plus test case X, Y, ...) can be run successfully". Does this make sense? It implies that PR commands don't just need to be forwarded appropriately, we also need to pass meaningful error codes back to the guest. Martin
On Tue, 2025-05-13 at 10:00 +0200, Martin Wilck wrote: > On Mon, 2025-05-12 at 17:18 +0200, Kevin Wolf wrote: > > > > > Ben already fixed up the missing check. > > I missed that. Saw it now. Martin
On 5/12/25 17:18, Kevin Wolf wrote: > Am 08.05.2025 um 15:51 hat Martin Wilck geschrieben: >> Hello Kevin, >> >> [I'm sorry for the previous email. It seems that I clicked "send" in >> the wrong window]. >> >> On Tue, 2025-04-29 at 18:50 +0200, Kevin Wolf wrote: >>> Multipath cannot directly provide failover for ioctls in the kernel >>> because it doesn't know what each ioctl means and which result could >>> indicate a path error. Userspace generally knows what the ioctl it >>> issued means and if it might be a path error, but neither does it >>> know >>> which path the ioctl took nor does it necessarily have the privileges >>> to >>> fail a path using the control device. >> >> Thanks for this effort. >> >> I have some remarks about your approach. The most important one is that >> the DM_MPATH_PROBE_PATHS_CMD ioctl appears to be a dangerous command. >> It sends IO to possibly broken paths and waits for it to complete. In >> the common error case of a device not responding, this might cause the >> code to hang for a long time in the kernel ioctl code path, in >> uninterruptible sleep (note that these probing commands will be queued >> after other possibly hanging IO). In the past we have put a lot of >> effort into other code paths in multipath-tools and elsewhere to avoid >> this kind of hang to the extent possible. It seems to me that your set >> re-introduces this risk. > > That's a fair point. I don't expect this code to be fully final, another > thing that isn't really optimal (as mentioned in the comment message) is > that we're not probing paths in parallel, potentially adding up timeouts > from multiple paths. > > I don't think this is a problem of the interface, though, but we can > improve the implementation keeping the same interface. > > Maybe I'm missing something, but I think the reason why it has to be > uninterruptible is just that submit_bio_wait() has the completion on the > stack? So I suppose we could just kmalloc() some state, submit all the > bios, let the completion callback free it, and then we can just stop > waiting early (i.e. wait_for_completion_interruptible/killable) and let > the requests complete on their own in the background. > > Would this address your concerns or is there another part to it? > Not really. There are two issues which ultimately made us move away from read I/O as path checkers: - Spurious timeouts due to blocked or congested submission - Error handling delays and stalls When using normal read/write functions for path checking you are subjected to normal I/O processing rules, ie I/O is inserted at the end of the submission queue. If the system is busy the path checker will time out prematurely without ever having reached the target. One could avoid that by extending the timeout, but that would make the path checker unreliable. But the real issue is error handling. Path checker are precisely there to check if the path is healthy, and as such _will_ be subjected to error handling (or might even trigger error handling itself). The thing about error handling is that you can only return affected commands once error handling is complete, as only then you can be sure that no DMA is pending on the buffers and one can free/re-use the associated pages. On SCSI error handling can be an _extremely_ lengthy affair (we had reports for over one hour), and the last thing you want is to delay your path checker for that time. 
(And besides, I didn't even mention the third problem with I/O-based path checkers, namely that they check entirely the wrong thing; we are not interested in whether we can do I/O, we are interested in whether we can send commands. In the light of e.g. ALUA these are two very different things.)

> > Apart from that, minor observations are that in patch 2/2 you don't check whether the map is in queueing state, and I don't quite understand why you only check paths in the map's active path group without attempting a PG failover in the case where all paths in the current PG fail.
>
> Ben already fixed up the missing check.
>
> Not attempting PG failover was also his suggestion, on the basis that it would be additional work for no real benefit when you can only submit requests for the current PG anyway. If userspace retries the SG_IO, it will already pick a different PG, so having the kernel do the same doesn't really improve anything.

Precisely.

I think the best way forward here is to have a 'post_ioctl' handler (much like we have a pre_ioctl handler nowadays). This would check the return code from the ioctl, and invalidate the current path if there was an I/O error. That way the user can resubmit the ioctl, until all paths are exhausted. For that to work we need to agree on two error codes, one for 'path failed, retry' and one for 'path failed, no retry possible'.

Resetting path status will be delegated to multipathd, but then that's its main task anyway.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                   Kernel Storage Architect
hare@suse.de                                 +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
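A sketch of what such a 'post_ioctl' hook might look like at the dm target level. This is purely hypothetical (no such hook exists today), and the two return codes are placeholders for whatever contract would be agreed on with userspace; only fail_path(), current_pgpath and nr_valid_paths are existing dm-mpath internals.

    /* Hypothetical addition to struct target_type in device-mapper.h. */
    typedef int (*dm_post_ioctl_fn)(struct dm_target *ti,
                                    struct block_device *bdev,
                                    unsigned int cmd, int ioctl_ret);

    /* Hypothetical dm-mpath implementation: fail the path the ioctl went
     * down if it returned an error, and tell userspace whether a retry
     * down another path can possibly succeed. */
    static int multipath_post_ioctl(struct dm_target *ti,
                                    struct block_device *bdev,
                                    unsigned int cmd, int ioctl_ret)
    {
            struct multipath *m = ti->private;
            struct pgpath *pgpath = READ_ONCE(m->current_pgpath);

            if (ioctl_ret >= 0 || !pgpath || pgpath->path.dev->bdev != bdev)
                    return ioctl_ret;

            fail_path(pgpath);      /* existing dm-mpath helper */

            /* placeholder error codes for the agreed-on contract */
            return atomic_read(&m->nr_valid_paths) ? -EAGAIN   /* retry */
                                                    : -ENODEV;  /* no paths left */
    }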
On Tue, May 13, 2025 at 08:30:55AM +0200, Hannes Reinecke wrote: > On 5/12/25 17:18, Kevin Wolf wrote: > > Am 08.05.2025 um 15:51 hat Martin Wilck geschrieben: > > > Hello Kevin, > > > > > > [I'm sorry for the previous email. It seems that I clicked "send" in > > > the wrong window]. > > > > > > On Tue, 2025-04-29 at 18:50 +0200, Kevin Wolf wrote: > > > > Multipath cannot directly provide failover for ioctls in the kernel > > > > because it doesn't know what each ioctl means and which result could > > > > indicate a path error. Userspace generally knows what the ioctl it > > > > issued means and if it might be a path error, but neither does it > > > > know > > > > which path the ioctl took nor does it necessarily have the privileges > > > > to > > > > fail a path using the control device. > > > > > > Thanks for this effort. > > > > > > I have some remarks about your approach. The most important one is that > > > the DM_MPATH_PROBE_PATHS_CMD ioctl appears to be a dangerous command. > > > It sends IO to possibly broken paths and waits for it to complete. In > > > the common error case of a device not responding, this might cause the > > > code to hang for a long time in the kernel ioctl code path, in > > > uninterruptible sleep (note that these probing commands will be queued > > > after other possibly hanging IO). In the past we have put a lot of > > > effort into other code paths in multipath-tools and elsewhere to avoid > > > this kind of hang to the extent possible. It seems to me that your set > > > re-introduces this risk. > > > > That's a fair point. I don't expect this code to be fully final, another > > thing that isn't really optimal (as mentioned in the comment message) is > > that we're not probing paths in parallel, potentially adding up timeouts > > from multiple paths. > > > > I don't think this is a problem of the interface, though, but we can > > improve the implementation keeping the same interface. > > > > Maybe I'm missing something, but I think the reason why it has to be > > uninterruptible is just that submit_bio_wait() has the completion on the > > stack? So I suppose we could just kmalloc() some state, submit all the > > bios, let the completion callback free it, and then we can just stop > > waiting early (i.e. wait_for_completion_interruptible/killable) and let > > the requests complete on their own in the background. > > > > Would this address your concerns or is there another part to it? > > > Not really. There are two issues which ultimately made us move away > from read I/O as path checkers: > > - Spurious timeouts due to blocked or congested submission > - Error handling delays and stalls > > When using normal read/write functions for path checking you are > subjected to normal I/O processing rules, ie I/O is inserted > at the end of the submission queue. If the system is busy the > path checker will time out prematurely without ever having reached > the target. One could avoid that by extending the timeout, but that > would make the path checker unreliable. I get your complaint in general, but this interface is just giving users the ability to probe the active pathgroup by sending normal IOs to its paths. Presumably, they won't use it if they are already sending normal IOs to the multipath device, since this is just duplicating the effect of those IOs. > But the real issue is error handling. Path checker are precisely there > to check if the path is healthy, and as such _will_ be subjected > to error handling (or might even trigger error handling itself). 
> The thing about error handling is that you can only return affected commands once error handling is complete, as only then you can be sure that no DMA is pending on the buffers and one can free/re-use the associated pages. On SCSI error handling can be an _extremely_ lengthy affair (we had reports for over one hour), and the last thing you want is to delay your path checker for that time.

I can send a patch that will keep probe_active_paths() from holding the work_mutex. This means that probe_active_paths() won't delay multipath_message(), so the path checkers will not be affected by this.

> (And besides, I didn't even mention the third problem with I/O-based path checkers, namely that they check entirely the wrong thing; we are not interested in whether we can do I/O, we are interested in whether we can send commands. In the light of e.g. ALUA these are two very different things.)

I actually disagree with this. There are issues with IO-based checkers, but in general, what we care about is paths being able to do IO, not sending commands. All the improvements on the directio checker were made because we still need it. For instance, some SCSI devices will claim they are usable when you run a SCSI Test Unit Ready command, but they aren't able to actually handle IO. These paths will have their state ping-pong back and forth as multipathd restores them and the kernel uses them and immediately fails them again. The solution is often to switch multipathd to the directio checker, since it verifies that the path can actually handle IO. It's problematic when dealing with paths that need to be initialized before accepting IO, but that's why this patch only checks the paths in the active pathgroup.

> > > Apart from that, minor observations are that in patch 2/2 you don't check whether the map is in queueing state, and I don't quite understand why you only check paths in the map's active path group without attempting a PG failover in the case where all paths in the current PG fail.
> >
> > Ben already fixed up the missing check.
> >
> > Not attempting PG failover was also his suggestion, on the basis that it would be additional work for no real benefit when you can only submit requests for the current PG anyway. If userspace retries the SG_IO, it will already pick a different PG, so having the kernel do the same doesn't really improve anything.
>
> Precisely.
>
> I think the best way forward here is to have a 'post_ioctl' handler (much like we have a pre_ioctl handler nowadays). This would check the return code from the ioctl, and invalidate the current path if there was an I/O error. That way the user can resubmit the ioctl, until all paths are exhausted. For that to work we need to agree on two error codes, one for 'path failed, retry' and one for 'path failed, no retry possible'. Resetting path status will be delegated to multipathd, but then that's its main task anyway.

Isn't this basically the idea that Martin had, here:
https://lore.kernel.org/all/20210628151558.2289-1-mwilck@suse.com/

If this was acceptable, I would be all for it. But we often can't even tell if an SG_IO has failed without decoding the scsi status/sense data, and that got shot down when Martin posted patches to do it. Thus, we have a general solution that has nothing to do with SG_IO ioctls and will work for any type of path device, any time a user wants to verify the state of the active paths.

-Ben

> Cheers,
>
> Hannes
> -- 
> Dr. Hannes Reinecke                   Kernel Storage Architect
> hare@suse.de                                 +49 911 74053 688
> SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
> HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
On Mon, May 12, 2025 at 05:18:43PM +0200, Kevin Wolf wrote: > Yes, it's a bit unfortunate, but we have to work with what we have. QEMU > doesn't even necessarily know that it's dealing with a multipath device, > so it just has to blindly try the ioctl and see if it works. Why is qemu even using SG_IO to start with?
Am 13.05.2025 um 07:55 hat Christoph Hellwig geschrieben: > On Mon, May 12, 2025 at 05:18:43PM +0200, Kevin Wolf wrote: > > Yes, it's a bit unfortunate, but we have to work with what we have. QEMU > > doesn't even necessarily know that it's dealing with a multipath device, > > so it just has to blindly try the ioctl and see if it works. > > Why is qemu even using SG_IO to start with? How else would you do SCSI passthrough? Ok, from your replies to Hannes I understand an implicit message, you wouldn't. But I don't think that's really an answer, at least not for all users. Yes, I'll give you that in the long term, we might be able to move some passthrough users away from it by using more specific interfaces like for persistent reservations (though see below). But for example, it's also used for vendor-specific commands and I don't see how Linux could ever provide more specific interfaces than SG_IO for those. But it's even about more than just accessing commands that aren't otherwise accessible. Mapping a SCSI response from the device to a single errno and back into SCSI status for the guest isn't lossless. QEMU's scsi-block device actually started off using normal I/O for reads and writes and using SG_IO only for things that aren't covered by normal I/O. But then those had to be changed to SG_IO, too, so that the guest would actually see the full SCSI status. Things the commit message mentions are unit attention codes and handling RESERVATION CONFLICT correctly (which made me unsure above if the more specific interface for reservations could actually be used to fully get rid of SG_IO). For more context, I'm adding Paolo who actually made that change back then. He may remember the concrete bug(s) this fixed. So if you want the guest device to behave mostly the same as your host device, I don't really see any way around SCSI passthrough and therefore SG_IO. Kevin
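To make the "lossless" point concrete: with SG_IO the caller gets the whole SCSI completion back rather than a single errno. A minimal sketch (TEST UNIT READY chosen only because it needs no data transfer):

    #include <scsi/sg.h>
    #include <sys/ioctl.h>
    #include <string.h>
    #include <stdio.h>

    /* Issue TEST UNIT READY and dump the raw SCSI completion that would be
     * flattened to a single errno if the request went through the normal
     * block I/O path. */
    static int sg_tur(int fd)
    {
        unsigned char cdb[6] = { 0x00 };        /* TEST UNIT READY */
        unsigned char sense[32];
        struct sg_io_hdr io;

        memset(&io, 0, sizeof(io));
        io.interface_id = 'S';
        io.dxfer_direction = SG_DXFER_NONE;
        io.cmdp = cdb;
        io.cmd_len = sizeof(cdb);
        io.sbp = sense;
        io.mx_sb_len = sizeof(sense);
        io.timeout = 10000;                     /* ms */

        if (ioctl(fd, SG_IO, &io) < 0)
            return -1;

        /* SCSI status, transport result and sense data all survive. */
        printf("status=%#x host=%#x driver=%#x sense_len=%u\n",
               io.status, io.host_status, io.driver_status, io.sb_len_wr);
        return 0;
    }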
On Tue, May 13, 2025 at 11:29:09AM +0200, Kevin Wolf wrote:
> Am 13.05.2025 um 07:55 hat Christoph Hellwig geschrieben:
> > On Mon, May 12, 2025 at 05:18:43PM +0200, Kevin Wolf wrote:
> > > Yes, it's a bit unfortunate, but we have to work with what we have. QEMU doesn't even necessarily know that it's dealing with a multipath device, so it just has to blindly try the ioctl and see if it works.
> >
> > Why is qemu even using SG_IO to start with?
>
> How else would you do SCSI passthrough?
>
> Ok, from your replies to Hannes I understand an implicit message, you wouldn't. But I don't think that's really an answer, at least not for all users.

SG_IO is fine and the only way for SCSI passthrough. But doing SCSI passthrough through md-multipath just doesn't work. SCSI isn't built for layering, and ALUA and its vendor-specific variants and alternatives certainly aren't. If you try that you're playing with fire and there is no chance of it ever working properly.
On Tue, May 13, 2025 at 09:57:51PM -0700, Christoph Hellwig wrote: > On Tue, May 13, 2025 at 11:29:09AM +0200, Kevin Wolf wrote: > > Am 13.05.2025 um 07:55 hat Christoph Hellwig geschrieben: > > > On Mon, May 12, 2025 at 05:18:43PM +0200, Kevin Wolf wrote: > > > > Yes, it's a bit unfortunate, but we have to work with what we have. QEMU > > > > doesn't even necessarily know that it's dealing with a multipath device, > > > > so it just has to blindly try the ioctl and see if it works. > > > > > > Why is qemu even using SG_IO to start with? > > > > How else would you do SCSI passthrough? > > > > Ok, from your replies to Hannes I understand an implicit message, you > > wouldn't. But I don't think that's really an answer, at least not for > > all users. > > SG_IO is fine and the only way for SCSI passthrough. But doing > SCSI passthrough through md-multipath just doesn't work. SCSI isn't > built for layering, and ALUA and it's vendor-specific variants and > alternatives certainly isn't. If you try that you're playing with > fire and is not chance of ever moving properly.

Could you be a bit more specific? All multipath is doing here is forwarding the ioctls to an underlying scsi device, and passing back up the result. Admittedly, it doesn't always make sense to pass the ioctl on from the multipath device to just one scsi device. Persistent Reservations are a perfect example of this, and that's why QEMU doesn't use DM's ioctl passthrough code to handle them. Also, when you have ALUA setups, not all the scsi devices are equal. But multipath isn't naively assuming that they are. It's only passing ioctls to the highest priority activated paths, just like it does for IO, and multipath is in charge of handling explicit alua devices. This hasn't proved to be problematic in practice.

The reality of the situation is that customers have been using this for a while, and the only issue that they run into is that multipath can't tell when a SG_IO has failed due to a retryable error. Currently, they're left waiting for multipathd's preemptive path checking to fail the path so they can retry down a new one. The purpose of this patchset and Martin's previous one is to handle this problem. If there are unavoidable critical problems that you see with this setup, it would be really helpful to know what they are.

-Ben
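For illustration, here is a rough sketch of the retry pattern the series is meant to enable from userspace. The assumptions made here are mine, not taken from the patches: that the new command is an argument-less ioctl issued on the dm-mpath block device fd, spelled DM_MPATH_PROBE_PATHS_CMD as referenced in this thread, and that it fails once no usable path remains. The real ABI is whatever the series' dm-ioctl.h changes define.

```c
/*
 * Sketch only.  ASSUMED ABI (check the series' uapi header for the real
 * definition): probe command is an argument-less ioctl on the dm-mpath
 * block device fd that errors out once no usable path is left.
 */
#include <errno.h>
#include <sys/ioctl.h>
#include <scsi/sg.h>
#include <linux/dm-ioctl.h>

/* Userspace policy: decide from the returned status whether this smells
 * like a path/transport error rather than a target error.  Simplified. */
static int looks_like_path_error(const struct sg_io_hdr *hdr)
{
	return hdr->host_status != 0;
}

static int sg_io_with_failover(int dm_fd, struct sg_io_hdr *hdr, int max_retries)
{
	for (int i = 0; i <= max_retries; i++) {
		if (ioctl(dm_fd, SG_IO, hdr) < 0)
			return -errno;
		if (!looks_like_path_error(hdr))
			return 0;	/* status, sense etc. are now valid */
		/* Ask dm-mpath to probe the active pathgroup and fail broken
		 * paths; if this errors out, no usable path remains. */
		if (ioctl(dm_fd, DM_MPATH_PROBE_PATHS_CMD) < 0)
			return -errno;
	}
	return -EIO;
}
```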
Hello Ben, hello Christoph, On Wed, 2025-05-14 at 12:23 -0400, Benjamin Marzinski wrote: > On Tue, May 13, 2025 at 09:57:51PM -0700, Christoph Hellwig wrote: > > > > SG_IO is fine and the only way for SCSI passthrough. But doing > > SCSI passthrough through md-multipath just doesn't work. SCSI > > isn't > > built for layering, and ALUA and it's vendor-specific variants and > > alternatives certainly isn't. If you try that you're playing with > > fire and is not chance of ever moving properly. > > Could you be a bit more specific. All multipath is doing here is > forwarding the ioctls to an underlying scsi device, and passing back > up > the result. Admittedly, it doesn't always make sense to pass the > ioctl > on from the multipath device to just one scsi device. Persistent > Reservations are perfect example of this, and that's why QEMU doesn't > use DMs ioctl passthrough code to handle them. I'd go one step further. Christoph is right to say that what we're currently doing in qemu – passing through every command except the PRIN/PROUT to a multipath device – is a dangerous thing to do. Passthrough from a dm-multipath device to a SCSI device makes sense only for a small subset of the SCSI command set. Basically just for the regular IO commands like the various READ and WRITE variants and the occasional UNMAP. However, in practice these commands account for 99.y% percent of the actual commands sent to devices. The fact that customers have been running these setups in large deployments over many years suggests that, if other commands ever get passed through to member devices, it has rarely had fatal consequences. Nobody would seriously consider sending ALUA commands to the multipath devices. TUR and REQUEST SENSE are other examples for commands that can't be reasonably passed through to random member devices of a multipath map. There are certainly many more examples. I guess it would make sense to review the command set and add some filtering in the qemu passthrough code. AFAIK the only commands that we really need to pass through (except the standard ones) are the reservation commands, which get special handling by qemu anyway. @Ben, @Kevin, are you aware of anything else? So: admittedly we're using a framework for passing through any command, where we actually need to pass through only a tiny subset of commands. Thinking about it this way, it really doesn't look like the perfect tool for the job, and we may want to look into a different approach for the future. > Also, when you have ALUA > setups, not all the scsi devices are equal. But multipath isn't > naievely > assuming that they are. It's only passing ioctls to the highest > priority > activated paths, just like it does for IO, and multipath is in charge > of > handling explicit alua devices. This hasn't proved to be problematic > in > practice. > > The reality of the situation is that customers have been using this > for > a while, and the only issue that they run into is that multipath > can't > tell when a SG_IO has failed due to a retryable error. Currently, > they're left with waiting for multipathd's preemptive path checking > to > fail the path so they can retry down a new one. The purpose of this > patchset and Martin's previous one is to handle this problem. If > there > are unavoidable critical problems that you see with this setup, it > would > be really helpful to know what they are. I'd also be interested in understanding this better. 
As noted above, I'm aware that passing through everything is dangerous and wrong in principle. But in practice, we haven't observed anything serious except (as Ben already said) the failure to do path failover in the SG_IO code path, which both this patch set and my set from the past are intended to fix.

While I am open to looking for better alternatives, I still hope that we can find an agreement on a short/mid-term solution that would allow us to serve our customers who currently use SCSI passthrough setups. That would not just benefit us (the enterprise distros); it would also help us fund upstream contributions.

Regards Martin
Il mer 14 mag 2025, 13:37 Martin Wilck <mwilck@suse.com> ha scritto: > > I'd go one step further. Christoph is right to say that what we're > currently doing in qemu – passing through every command except the > PRIN/PROUT to a multipath device – is a dangerous thing to do. > > Passthrough from a dm-multipath device to a SCSI device makes sense > only for a small subset of the SCSI command set. Basically just for the > regular IO commands like the various READ and WRITE variants and the > occasional UNMAP. The fact that customers > have been running these setups in large deployments over many years > suggests that, if other commands ever get passed through to member > devices, it has rarely had fatal consequences. > > Nobody would seriously consider sending ALUA commands to the multipath > devices. TUR and REQUEST SENSE are other examples for commands that > can't be reasonably passed through to random member devices of a > multipath map. Yes, as usual things are a bit more complicated. First, a handful of commands are special (REQUEST SENSE would be for HBAs that don't use auto sense, but that is fortunately not something you encounter). Second, there's already a filter in the kernel in drivers/scsi/scsi_ioctl.c for commands that are allowed without CAP_SYS_RAWIO. QEMU is subject to that so the commands you'll see are mostly I/O, INQUIRY, TUR, MODE SENSE/SELECT and that's it. Any command that the kernel doesn't filter would be rejected, or handled specially in the case of PR commands (PR commands use a separate privileged helper to send them down to the device; the helper also knows about multipath and uses the userspace libmpathpersist if it receives a dm-mpath file descriptor via SCM_RIGHTS). > AFAIK the only commands that we really need to pass through (except the standard ones) are the reservation commands, which get special handling > by qemu anyway. @Ben, @Kevin, are you aware of anything else? .Of the ones that aren't simple I/O, mode parameters and TUR are the important cases. A TUR failure would be handled by the ioctl that Kevin proposed here by forcing a path switch. Mode parameters might not be shared(*) and would need to be sent down all the paths in that case; that can be fixed in userspace if necessary. > I'd also be interested in understanding this better. As noted above, > I'm aware that passing through everything is dangerous and wrong in > principle. But in practice, we haven't observed anything serious except > (as Ben already said) the failure to do path failover in the SG_IO code > path, which both this patch set and my set from the past are intended > to fix. Yes, the kernel filter is a PITA in the normal single path case but here it helps not doing something overly wrong. Paolo (*) in theory there's a Mode Page Policy VPD page to tell the initiator whether they are. I'm not sure if anyone supports it in the real world...
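As a side note, the SCM_RIGHTS mechanism Paolo mentions is plain Unix file descriptor passing; a minimal sketch follows. This is generic code, not QEMU's actual pr-helper protocol, which also carries the PR command payload alongside the descriptor.

```c
/* Minimal sketch of handing a file descriptor (e.g. a dm-mpath fd) to a
 * helper process over an AF_UNIX socket with SCM_RIGHTS. */
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

int send_fd(int sock, int fd_to_pass)
{
	char dummy = 0;
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} u;
	struct msghdr msg = {
		.msg_iov = &iov, .msg_iovlen = 1,
		.msg_control = u.buf, .msg_controllen = sizeof(u.buf),
	};
	struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd_to_pass, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}
```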
On Thu, 2025-05-15 at 04:53 +0200, Paolo Bonzini wrote: > Il mer 14 mag 2025, 13:37 Martin Wilck <mwilck@suse.com> ha scritto: > > > > I'd go one step further. Christoph is right to say that what we're > > currently doing in qemu – passing through every command except the > > PRIN/PROUT to a multipath device – is a dangerous thing to do. > > > > Passthrough from a dm-multipath device to a SCSI device makes sense > > only for a small subset of the SCSI command set. Basically just for > > the > > regular IO commands like the various READ and WRITE variants and > > the > > occasional UNMAP. The fact that customers > > have been running these setups in large deployments over many years > > suggests that, if other commands ever get passed through to member > > devices, it has rarely had fatal consequences. > > > > Nobody would seriously consider sending ALUA commands to the > > multipath > > devices. TUR and REQUEST SENSE are other examples for commands that > > can't be reasonably passed through to random member devices of a > > multipath map. > > Yes, as usual things are a bit more complicated. First, a handful of > commands are special (REQUEST SENSE would be for HBAs that don't use > auto sense, but that is fortunately not something you encounter). > Second, there's already a filter in the kernel in > drivers/scsi/scsi_ioctl.c for commands that are allowed without > CAP_SYS_RAWIO. QEMU is subject to that so the commands you'll see are > mostly I/O, INQUIRY, TUR, MODE SENSE/SELECT and that's it. Thanks for mentioning this. However, I suppose that depends on the permissions with which the qemu process is started, no? Wouldn't qemu need CAP_SYS_RAWIO for PCI passthrough as well? I admit that I'm confused by the many indirections in qemu's scsi-block code flow. AFAICS qemu forwards everything except PRIN/PROUT to the kernel block device in "scsi-block" mode. Correct me if I'm wrong. Anwyway, let's not forget that we're talking about the kernel here. While qemu is the main target application for this feature is created, any application can use it, and it may or may not run with CAP_SYS_RAWIO. > Any command that the kernel doesn't filter would be rejected, or > handled specially in the case of PR commands (PR commands use a > separate privileged helper to send them down to the device; the > helper > also knows about multipath and uses the userspace libmpathpersist if > it receives a dm-mpath file descriptor via SCM_RIGHTS). > > > AFAIK the only commands that we really need to pass through (except > > the standard ones) are the reservation commands, which get special > > handling > > by qemu anyway. @Ben, @Kevin, are you aware of anything else? > > .Of the ones that aren't simple I/O, mode parameters and TUR are the > important cases. A TUR failure would be handled by the ioctl that > Kevin proposed here by forcing a path switch. Mode parameters might > not be shared(*) and would need to be sent down all the paths in that > case; that can be fixed in userspace if necessary. Passing TUR from a multipath device to a random member doesn't make much sense to me. qemu would need to implement some logic to determine whether the map has any usable paths. > > I'd also be interested in understanding this better. As noted > > above, > > I'm aware that passing through everything is dangerous and wrong in > > principle. 
But in practice, we haven't observed anything serious > > except > > (as Ben already said) the failure to do path failover in the SG_IO > > code > > path, which both this patch set and my set from the past are > > intended > > to fix. > > Yes, the kernel filter is a PITA in the normal single path case but > here it helps not doing something overly wrong. This seems coincidental to me. Filtering by permissions and filtering for commands that make sense on multipath devices are orthogonal problems. Regards, Martin
On Thu, May 15, 2025 at 12:34:13PM +0200, Martin Wilck wrote: > On Thu, 2025-05-15 at 04:53 +0200, Paolo Bonzini wrote: > > Il mer 14 mag 2025, 13:37 Martin Wilck <mwilck@suse.com> ha scritto: > > > > > > I'd go one step further. Christoph is right to say that what we're > > > currently doing in qemu – passing through every command except the > > > PRIN/PROUT to a multipath device – is a dangerous thing to do. > > > > > > Passthrough from a dm-multipath device to a SCSI device makes sense > > > only for a small subset of the SCSI command set. Basically just for > > > the > > > regular IO commands like the various READ and WRITE variants and > > > the > > > occasional UNMAP. The fact that customers > > > have been running these setups in large deployments over many years > > > suggests that, if other commands ever get passed through to member > > > devices, it has rarely had fatal consequences. > > > > > > Nobody would seriously consider sending ALUA commands to the > > > multipath > > > devices. TUR and REQUEST SENSE are other examples for commands that > > > can't be reasonably passed through to random member devices of a > > > multipath map. > > > > Yes, as usual things are a bit more complicated. First, a handful of > > commands are special (REQUEST SENSE would be for HBAs that don't use > > auto sense, but that is fortunately not something you encounter). > > Second, there's already a filter in the kernel in > > drivers/scsi/scsi_ioctl.c for commands that are allowed without > > CAP_SYS_RAWIO. QEMU is subject to that so the commands you'll see are > > mostly I/O, INQUIRY, TUR, MODE SENSE/SELECT and that's it. > > Thanks for mentioning this. > > However, I suppose that depends on the permissions with which the qemu > process is started, no? Wouldn't qemu need CAP_SYS_RAWIO for PCI > passthrough as well? > > I admit that I'm confused by the many indirections in qemu's scsi-block > code flow. AFAICS qemu forwards everything except PRIN/PROUT to the > kernel block device in "scsi-block" mode. Correct me if I'm wrong. > > Anwyway, let's not forget that we're talking about the kernel here. > While qemu is the main target application for this feature is created, > any application can use it, and it may or may not run with > CAP_SYS_RAWIO. Maybe I'm missing your issue, but QEMU is just using DM's existing ioctl forwarding code, which was already designed for general use, and I don't see any capability issues with this patchset itself. If the caller has the capability to issue this ioctl, they already have the capability to do reads to the device directly, which will cause any unusable paths to be failed, exactly like the ioctl does. The only difference it that the user has no way to know how many reads they would need to issue to the multipath device directly to force it to try all the paths. This gives any application the ability to verify the paths just like doing enough IO to the multipath device would, but it guarantees that it will use the minimum IO. -Ben > > Any command that the kernel doesn't filter would be rejected, or > > handled specially in the case of PR commands (PR commands use a > > separate privileged helper to send them down to the device; the > > helper > > also knows about multipath and uses the userspace libmpathpersist if > > it receives a dm-mpath file descriptor via SCM_RIGHTS). > > > > > AFAIK the only commands that we really need to pass through (except > > > the standard ones) are the reservation commands, which get special > > > handling > > > by qemu anyway. 
@Ben, @Kevin, are you aware of anything else? > > > > .Of the ones that aren't simple I/O, mode parameters and TUR are the > > important cases. A TUR failure would be handled by the ioctl that > > Kevin proposed here by forcing a path switch. Mode parameters might > > not be shared(*) and would need to be sent down all the paths in that > > case; that can be fixed in userspace if necessary. > > Passing TUR from a multipath device to a random member doesn't make > much sense to me. qemu would need to implement some logic to determine > whether the map has any usable paths. > > > > I'd also be interested in understanding this better. As noted > > > above, > > > I'm aware that passing through everything is dangerous and wrong in > > > principle. But in practice, we haven't observed anything serious > > > except > > > (as Ben already said) the failure to do path failover in the SG_IO > > > code > > > path, which both this patch set and my set from the past are > > > intended > > > to fix. > > > > Yes, the kernel filter is a PITA in the normal single path case but > > here it helps not doing something overly wrong. > > This seems coincidental to me. Filtering by permissions and filtering > for commands that make sense on multipath devices are orthogonal > problems. > > Regards, > Martin
On Thu, 2025-05-15 at 10:29 -0400, Benjamin Marzinski wrote: > On Thu, May 15, 2025 at 12:34:13PM +0200, Martin Wilck wrote: > > > > However, I suppose that depends on the permissions with which the > > qemu > > process is started, no? Wouldn't qemu need CAP_SYS_RAWIO for PCI > > passthrough as well? > > > > I admit that I'm confused by the many indirections in qemu's scsi- > > block > > code flow. AFAICS qemu forwards everything except PRIN/PROUT to the > > kernel block device in "scsi-block" mode. Correct me if I'm wrong. > > > > Anwyway, let's not forget that we're talking about the kernel here. > > While qemu is the main target application for this feature is > > created, > > any application can use it, and it may or may not run with > > CAP_SYS_RAWIO. > > Maybe I'm missing your issue, but QEMU is just using DM's existing > ioctl > forwarding code, which was already designed for general use, and I > don't > see any capability issues with this patchset itself. I didn't mean it this way. I was rather musing about the question whether doing SG_IO on multipath devices by forwarding them to the current path makes sense. Sorry for the confusing post. Regards Martin
On Thu, May 15, 2025 at 05:00:41PM +0200, Martin Wilck wrote: > I didn't mean it this way. I was rather musing about the question > whether doing SG_IO on multipath devices by forwarding them to the > current path makes sense.

It doesn't, and that's the core of the problem. Someone back in the day thought that just forwarding SG_IO would solve whatever problem they had back then that didn't have a proper kernel abstraction, and it keeps adding problems.
On Thu, May 15, 2025 at 12:34 PM Martin Wilck <mwilck@suse.com> wrote: > On Thu, 2025-05-15 at 04:53 +0200, Paolo Bonzini wrote: > > Il mer 14 mag 2025, 13:37 Martin Wilck <mwilck@suse.com> ha scritto: > > Yes, as usual things are a bit more complicated. First, a handful of > > commands are special (REQUEST SENSE would be for HBAs that don't use > > auto sense, but that is fortunately not something you encounter). > > Second, there's already a filter in the kernel in > > drivers/scsi/scsi_ioctl.c for commands that are allowed without > > CAP_SYS_RAWIO. QEMU is subject to that so the commands you'll see are > > mostly I/O, INQUIRY, TUR, MODE SENSE/SELECT and that's it. > > Thanks for mentioning this. However, I suppose that depends on the > permissions with which the qemu process is started, no? Wouldn't > qemu need CAP_SYS_RAWIO for PCI passthrough as well? Generally you want to assume that the VM is hostile and run QEMU with as few privileges as possible (not just capabilities, but also in separate namespaces and with restrictions from device cgroups, SELinux, etc.). PCI passthrough is not an issue, it only needs access to the VFIO inodes and you can do it by setting the appropriate file permissions without extra capabilities. The actual privileged part is binding the device to VFIO, which is done outside QEMU anyway. > I admit that I'm confused by the many indirections in qemu's scsi-block > code flow. AFAICS qemu forwards everything except PRIN/PROUT to the > kernel block device in "scsi-block" mode. Correct me if I'm wrong. Yes, that's correct. The code for PRIN/PROUT calls out to a separate privileged process (in scsi/qemu-pr-helper.c if you're curious) which is aware of multipath and can be extended if needed. > Anwyway, let's not forget that we're talking about the kernel here. > While qemu is the main target application for this feature is created, > any application can use it, and it may or may not run with > CAP_SYS_RAWIO. Yes, but once you have CAP_SYS_RAWIO all bets for sanity are off... it even lets you do SG_IO on partitions. > > .Of the ones that aren't simple I/O, mode parameters and TUR are the > > important cases. A TUR failure would be handled by the ioctl that > > Kevin proposed here by forcing a path switch. Mode parameters might > > not be shared(*) and would need to be sent down all the paths in that > > case; that can be fixed in userspace if necessary. > > Passing TUR from a multipath device to a random member doesn't make > much sense to me. qemu would need to implement some logic to determine > whether the map has any usable paths. As long as one path replies to a TUR and the host is able to (eventually, somehow) steer I/O transparently to that path, that should be good enough. If the one path that the kernel tries is down, QEMU can probe which paths are up and retry. That seems consistent with what you want from TUR but maybe I'm missing something. > > Yes, the kernel filter is a PITA in the normal single path case but > > here it helps not doing something overly wrong. > > This seems coincidental to me. Filtering by permissions and filtering > for commands that make sense on multipath devices are orthogonal > problems. I agree (it helps, it doesn't solve the problem), however there is a large overlap between the two. Paolo
On Thu, 2025-05-15 at 12:51 +0200, Paolo Bonzini wrote: > On Thu, May 15, 2025 at 12:34 PM Martin Wilck <mwilck@suse.com> > > > > Thanks for mentioning this. However, I suppose that depends on the > > permissions with which the qemu process is started, no? Wouldn't > > qemu need CAP_SYS_RAWIO for PCI passthrough as well? > > Generally you want to assume that the VM is hostile and run QEMU with > as few privileges as possible (not just capabilities, but also in > separate namespaces and with restrictions from device cgroups, > SELinux, etc.). PCI passthrough is not an issue, it only needs access > to the VFIO inodes and you can do it by setting the appropriate file > permissions without extra capabilities. The actual privileged part is > binding the device to VFIO, which is done outside QEMU anyway. Thanks for the clarification. > > I admit that I'm confused by the many indirections in qemu's scsi- > > block > > code flow. AFAICS qemu forwards everything except PRIN/PROUT to the > > kernel block device in "scsi-block" mode. Correct me if I'm wrong. > > Yes, that's correct. The code for PRIN/PROUT calls out to a separate > privileged process (in scsi/qemu-pr-helper.c if you're curious) which > is aware of multipath and can be extended if needed. Sure, I was aware of the helper. I just wasn't 100% clear about how it gets called. Found the code in the meantime [1]. [1] https://github.com/qemu/qemu/blob/864813878951b44e964eb4c012d832fd21f8cc0c/block/file-posix.c#L4286 > > > .Of the ones that aren't simple I/O, mode parameters and TUR are > > > the > > > important cases. A TUR failure would be handled by the ioctl that > > > Kevin proposed here by forcing a path switch. Mode parameters > > > might > > > not be shared(*) and would need to be sent down all the paths in > > > that > > > case; that can be fixed in userspace if necessary. > > > > Passing TUR from a multipath device to a random member doesn't make > > much sense to me. qemu would need to implement some logic to > > determine > > whether the map has any usable paths. > > As long as one path replies to a TUR and the host is able to > (eventually, somehow) steer I/O transparently to that path, that > should be good enough. If the one path that the kernel tries is down, > QEMU can probe which paths are up and retry. That seems consistent > with what you want from TUR but maybe I'm missing something. It's ok-ish, in particular in combination with Kevin't patch. But using an equivalent of "multipath -C" would be closer to the real thing for TUR. Regards Martin
Il mar 13 mag 2025, 11:29 Kevin Wolf <kwolf@redhat.com> ha scritto: > QEMU's scsi-block device actually started off using normal I/O for reads > and writes and using SG_IO only for things that aren't covered by normal > I/O. But then those had to be changed to SG_IO, too, so that the guest > would actually see the full SCSI status. Things the commit message > mentions are unit attention codes and handling RESERVATION CONFLICT > correctly (which made me unsure above if the more specific interface for > reservations could actually be used to fully get rid of SG_IO). For more > context, I'm adding Paolo who actually made that change back then. He > may remember the concrete bug(s) this fixed. The original reason to avoid SG_IO for reads and writes was purely performance (using the host scheduler), but it turned out to be a bad idea. RESERVATION CONFLICT indeed was one of the reasons why I moved QEMU away from SG_IO. It has since been fixed, because these days Linux uses EBADE for it and likewise there are errno values for some other statuses or sense codes, but it helps more in general to have the precise SCSI status and sense data. Having the real status and sense for example lets QEMU decide which errors to pass to the guest and which should be handled in the host (for example by pausing the VM). Some HBAs also have equivalents of Linux's host_status, and passing that down also makes for more accurate pass through. Also, specifically for reservations, I also didn't like the idea that a guest command could be split in multiple host commands and the reservation conflict would appear in the middle due to a concurrent PR OUT command. To be honest I don't know how physical disks and controllers handle that, but I didn't want to make it worse. Thanks, Paolo > So if you want the guest device to behave mostly the same as your host > device, I don't really see any way around SCSI passthrough and therefore > SG_IO. > > Kevin >
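A tiny sketch of the difference Paolo describes, assuming the current behaviour where the block layer reports a reservation conflict as EBADE: the normal IO path yields a single errno, while a passthrough consumer wants the real status byte and sense data that only SG_IO returns.

```c
/* With plain read/write through the block layer, a reservation conflict
 * is only visible as errno == EBADE on current kernels; everything else
 * about the SCSI completion is lost. */
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

ssize_t checked_pread(int fd, void *buf, size_t len, off_t off)
{
	ssize_t n = pread(fd, buf, len, off);

	if (n < 0 && errno == EBADE)
		fprintf(stderr, "reservation conflict\n");
	/* A passthrough device instead wants the real SCSI status byte
	 * (RESERVATION CONFLICT, 0x18) and sense data, which only SG_IO
	 * can return to userspace. */
	return n;
}
```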
On 5/13/25 07:55, Christoph Hellwig wrote: > On Mon, May 12, 2025 at 05:18:43PM +0200, Kevin Wolf wrote: >> Yes, it's a bit unfortunate, but we have to work with what we have. QEMU >> doesn't even necessarily know that it's dealing with a multipath device, >> so it just has to blindly try the ioctl and see if it works. > > Why is qemu even using SG_IO to start with? > To be able to forward SCSI commands. Cheers, Hannes -- Dr. Hannes Reinecke Kernel Storage Architect hare@suse.de +49 911 74053 688 SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
On Tue, May 13, 2025 at 08:09:45AM +0200, Hannes Reinecke wrote: > On 5/13/25 07:55, Christoph Hellwig wrote: > > On Mon, May 12, 2025 at 05:18:43PM +0200, Kevin Wolf wrote: > > > Yes, it's a bit unfortunate, but we have to work with what we have. QEMU > > > doesn't even necessarily know that it's dealing with a multipath device, > > > so it just has to blindly try the ioctl and see if it works. > > > > Why is qemu even using SG_IO to start with? > > > To be able to forward SCSI commands. Why?
On 5/13/25 08:14, Christoph Hellwig wrote: > On Tue, May 13, 2025 at 08:09:45AM +0200, Hannes Reinecke wrote: >> On 5/13/25 07:55, Christoph Hellwig wrote: >>> On Mon, May 12, 2025 at 05:18:43PM +0200, Kevin Wolf wrote: >>>> Yes, it's a bit unfortunate, but we have to work with what we have. QEMU >>>> doesn't even necessarily know that it's dealing with a multipath device, >>>> so it just has to blindly try the ioctl and see if it works. >>> >>> Why is qemu even using SG_IO to start with? >>> >> To be able to forward SCSI commands. > > Why? >

Reservations and stuff. There are customers who use GPFS ... But yes, I know. We are working on it; it should be possible to hook in the generic block reservation code here.

Cheers, Hannes -- Dr. Hannes Reinecke Kernel Storage Architect hare@suse.de +49 911 74053 688 SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
On Tue, May 13, 2025 at 08:32:12AM +0200, Hannes Reinecke wrote: > Reservations and stuff.

They should use the kernel persistent reservation API.

> There are customer who use GPFS ...

Supporting illegal binary-only modules is already enough of a reason to NAK this.
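For reference, a minimal sketch of the kernel persistent reservation API Christoph means (include/uapi/linux/pr.h), as it would be used on a block device node. The device path and key values are examples only; how well these pr_ops semantics fit dm-mpath's requirements is exactly what is discussed later in this thread.

```c
/* Register a PR key and take a reservation through the kernel PR API.
 * /dev/dm-0 and the key are illustrative values. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/pr.h>
#include <unistd.h>

int main(void)
{
	struct pr_registration reg = { .old_key = 0, .new_key = 0x123abc };
	struct pr_reservation rsv = { .key = 0x123abc,
				      .type = PR_WRITE_EXCLUSIVE_REG_ONLY };
	int fd = open("/dev/dm-0", O_RDWR);	/* example device node */

	if (fd < 0) { perror("open"); return 1; }
	if (ioctl(fd, IOC_PR_REGISTER, &reg) < 0) { perror("IOC_PR_REGISTER"); return 1; }
	if (ioctl(fd, IOC_PR_RESERVE, &rsv) < 0) { perror("IOC_PR_RESERVE"); return 1; }
	close(fd);
	return 0;
}
```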
On Mon, May 12, 2025 at 11:49:19PM -0700, Christoph Hellwig wrote: > On Tue, May 13, 2025 at 08:32:12AM +0200, Hannes Reinecke wrote: > > Reservations and stuff. > > They should use the kernel persistent reservation API.

Currently QEMU isn't sending Persistent Reservation requests to multipath devices at all. It's sending those directly to the underlying scsi devices. The issue here is with all the other SCSI commands that users want to send to their SCSI passthrough device that is actually a multipath device on top of a number of SCSI paths. They expect to get back the actual sense and status information, so QEMU needs to send them via SG_IO.

Without reading that sense and status information in the kernel, the multipath target can't know if it needs to fail a path and retry the ioctl down a different path. QEMU can read this information, but it doesn't know which path the multipath device sent the ioctl down. This patch just gives users a way to check the paths in the active pathgroup (which all should be able to handle IO) and fail those that can't. While QEMU is the driver of this, it's completely general functionality.

-Ben

> > > There are customer who use GPFS ... > > Supporting illegal binary only modules that is already enough of a > reason to NAK this.
On Tue, May 13, 2025 at 12:29:51PM -0400, Benjamin Marzinski wrote: > On Mon, May 12, 2025 at 11:49:19PM -0700, Christoph Hellwig wrote: > > On Tue, May 13, 2025 at 08:32:12AM +0200, Hannes Reinecke wrote: > > > Reservations and stuff. > > > > They should use the kernel persistent reservation API. > > Currently QEMU isn't sending Persistent Reservation requests to > multipath devices at all. It's sending those directly to the underlying > scsi devices. The issue here is with all the other SCSI commands that > users want to send to their SCSI passthrough device that is actually a > multipath device on top of a number of SCSI paths. They expect to > get back the actual sense and status information, so QEMU needs to > send them via SG_IOs. And that's the problem. There is plenty of path (I_T_L) nexus specific information in SCSI, and just trying to glob it back together above means you'll get path specific information in something that doesn't claim to be multi-pathing and will get random and changing results depending on the underlying path selection. > Without reading that sense and status information in kernel, the > multipath target can't know if it needs to fail a path and retry the > ioctl down a different path. The blk_status_t has the information to fail over. But the whole idea of implementing SG_IO in dm-mpath or any other stacking layer is just really, really dangerous.
On 5/13/25 18:29, Benjamin Marzinski wrote: > On Mon, May 12, 2025 at 11:49:19PM -0700, Christoph Hellwig wrote: >> On Tue, May 13, 2025 at 08:32:12AM +0200, Hannes Reinecke wrote: >>> Reservations and stuff. >> >> They should use the kernel persistent reservation API. > > Currently QEMU isn't sending Persistent Reservation requests to > multipath devices at all. It's sending those directly to the underlying > scsi devices. The issue here is with all the other SCSI commands that > users want to send to their SCSI passthrough device that is actually a > multipath device on top of a number of SCSI paths. They expect to > get back the actual sense and status information, so QEMU needs to > send them via SG_IOs. > > Without reading that sense and status information in kernel, the > multipath target can't know if it needs to fail a path and retry the > ioctl down a different path. QEMU can read this information, but it > doesn't know what path the multipath device send the ioctl down. This > patch just gives users a way to check the paths in the active pathgroup > (which all should be able to handle IO) and fail those that can't. > While QEMU is the driver of this, it's completely general functionality. > Ouch. Different reservations on different paths? Really? How do you manage to keep the reservations up-to-date? My recommendation is to use the _same_ reservation on all paths, precisely to avoid having to re-do reservations on path failure. And for that scenario the kernel persistent reservation API would work fine. Cheers, Hannes -- Dr. Hannes Reinecke Kernel Storage Architect hare@suse.de +49 911 74053 688 SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
On Wed, May 14, 2025 at 08:39:25AM +0200, Hannes Reinecke wrote: > On 5/13/25 18:29, Benjamin Marzinski wrote: > > On Mon, May 12, 2025 at 11:49:19PM -0700, Christoph Hellwig wrote: > > > On Tue, May 13, 2025 at 08:32:12AM +0200, Hannes Reinecke wrote: > > > > Reservations and stuff. > > > > > > They should use the kernel persistent reservation API. > > > > Currently QEMU isn't sending Persistent Reservation requests to > > multipath devices at all. It's sending those directly to the underlying > > scsi devices. The issue here is with all the other SCSI commands that > > users want to send to their SCSI passthrough device that is actually a > > multipath device on top of a number of SCSI paths. They expect to > > get back the actual sense and status information, so QEMU needs to > > send them via SG_IOs. > > > > Without reading that sense and status information in kernel, the > > multipath target can't know if it needs to fail a path and retry the > > ioctl down a different path. QEMU can read this information, but it > > doesn't know what path the multipath device send the ioctl down. This > > patch just gives users a way to check the paths in the active pathgroup > > (which all should be able to handle IO) and fail those that can't. > > While QEMU is the driver of this, it's completely general functionality. > > > Ouch. > Different reservations on different paths? Really? > How do you manage to keep the reservations up-to-date? multipath's libmpathpersist library. You contributed some code to it IIRC. It predates the kernel interface, and even if we switched to using the kernel interface to send persistent reservation commands to the paths instead of doing that in userspace (the dm_pr_ops functions would needs some reworking to handle multipath, where the rules are different than for other dm-devices. For instance, you only want to reserve on one path, and it's not necessarily a failure if you can't register a key down all paths) we would still probably want to use the library, since it also coordinates with multipathd so that it can register keys on paths that are later added to the multipath device or were down when the registration initially happened. > My recommendation is to use the _same_ reservation on all paths, > precisely to avoid having to re-do reservations on path failure. > And for that scenario the kernel persistent reservation API > would work fine. Libmpathpersist does register the same key for all paths of a multipath device. Doing anything else would be crazy. Sorry if my last email was misleading. But the kernel API cannot help in the case where a path is later added to the multipath device, or was down when the initial registration happened. But again, all this is orthogonal to the patches in question, which have nothing to do with Persistent Reservations. -Ben > Cheers, > > Hannes > -- > Dr. Hannes Reinecke Kernel Storage Architect > hare@suse.de +49 911 74053 688 > SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg > HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
On Tue, May 13, 2025 at 12:29:51PM -0400, Benjamin Marzinski wrote: > Without reading that sense and status information in kernel, the > multipath target can't know if it needs to fail a path and retry the > ioctl down a different path. QEMU can read this information, but it > doesn't know what path the multipath device send the ioctl down. This > patch just gives users a way to check the paths in the active pathgroup > (which all should be able to handle IO) and fail those that can't. > While QEMU is the driver of this, it's completely general functionality.

As I just replied to Martin, the problem is that this setup fundamentally can't work. Either you pass a SCSI device through, which should (mostly, there are a few warts) work. Or you want host side multipathing, in which case you must pass through the block device abstraction and not a SCSI one, or at least do a full emulation of the SCSI interfaces instead of pretending to pass it through. The root of all evil here is that dm-multipath tries to pass through SG_IO, which is dangerous for all but the most trivial commands.
On Mon, 2025-05-12 at 23:49 -0700, Christoph Hellwig wrote: > On Tue, May 13, 2025 at 08:32:12AM +0200, Hannes Reinecke wrote: > > Reservations and stuff. > > They should use the kernel persistent reservation API. > > > There are customer who use GPFS ... > > Supporting illegal binary only modules that is already enough of a > reason to NAK this.

GPFS is not the only use case. The reason why we have "scsi-block" in qemu is simply SCSI emulation. If we emulate a SCSI device to a VM, the device should support commands like persistent reservations properly. "scsi-block" supports this, while still offering decent IO performance to VMs. In the multipath case, the idea is to be able to hide from the VM the fact that the device in question is actually a multipath device, and still allow this sort of SCSI command to be passed through to the storage.

And no, passing the SCSI devices to the VM and doing multipath in the guest doesn't work. The transport layer isn't properly emulated (bluntly speaking, we have no FC emulation).

Regards Martin
On Tue, May 13, 2025 at 10:17:58AM +0200, Martin Wilck wrote: > If we emulate a SCSI device to a VM, the device should support commands > like persistent reservations properly.

Then point it to an actual SCSI device, and not a multipath device. Trying to split responsibility for handling the initiator side work is not in any way supported in the SCSI spec, and by trying it anyway you are just creating tons of problems.

> And no, passing the SCSI devices to the VM and doing multipath in the > the guest doesn't work. The transport layer isn't properly emulated > (bluntly speaking, we have no FC emulation).

Then fix that. Because without it you will be in never ending pain due to impedance mismatch.
On 5/14/25 06:53, Christoph Hellwig wrote: >> And no, passing the SCSI devices to the VM and doing multipath in the >> the guest doesn't work. The transport layer isn't properly emulated >> (bluntly speaking, we have no FC emulation). > > Then fix that. Because without it you will be in never ending pain > due to impedance mismatch. FC HBAs don't even expose enough of the protocol to actually do it sensibly, so that can't really be done. Hannes tried. Paolo
Hi

On Thu, 8 May 2025, Martin Wilck wrote: > Hello Kevin, > > [I'm sorry for the previous email. It seems that I clicked "send" in > the wrong window]. > > On Tue, 2025-04-29 at 18:50 +0200, Kevin Wolf wrote: > > Multipath cannot directly provide failover for ioctls in the kernel > > because it doesn't know what each ioctl means and which result could > > indicate a path error. Userspace generally knows what the ioctl it > > issued means and if it might be a path error, but neither does it > > know > > which path the ioctl took nor does it necessarily have the privileges > > to > > fail a path using the control device. > > Thanks for this effort. > > I have some remarks about your approach. The most important one is that > the DM_MPATH_PROBE_PATHS_CMD ioctl appears to be a dangerous command. > It sends IO to possibly broken paths and waits for it to complete. In > the common error case of a device not responding, this might cause the > code to hang for a long time in the kernel ioctl code path, in > uninterruptible sleep (note that these probing commands will be queued

Normal reads and writes would also hang in an uninterruptible sleep if a path stops responding.

> after other possibly hanging IO). In the past we have put a lot of > effort into other code paths in multipath-tools and elsewhere to avoid > this kind of hang to the extent possible. It seems to me that your set > re-introduces this risk. > > Apart from that, minor observations are that in patch 2/2 you don't > check whether the map is in queueing state, and I don't quite > understand why you only check paths in the map's active path group > without attempting a PG failover in the case where all paths in the > current PG fail. > > I am wondering whether the DM_MPATH_PROBE_PATHS_CMD ioctl is necessary > at all. Rather than triggering explicit path probing, you could have > dm-mpath "handle" IO errors by failing the active path for "path > errors". That would be similar to my patch set from 2021 [1], but it > would avoid the extra IO and thus the additional risk of hanging in the > kernel.

What exactly do you mean? You say "you could have dm-mpath 'handle' IO errors by failing the active path for 'path errors'". Christoph doesn't want dm-mpath to look at the error code - so dm-mpath doesn't know whether a path error occurred or not. qemu could look at the error code, but qemu doesn't know which path the SG_IO went through - so it can't instruct dm-mpath to fail that path.

One of the possibilities that we discussed was to add a path-id to SG_IO, so that dm-mpath would mark which path the SG_IO went through and qemu could instruct dm-mpath to fail that path. But we rejected it as being more complicated than the current approach.

> Contrary to my set, you wouldn't attempt retries in the kernel, but > leave this part to qemu instead, like in the current set. That would > avoid Christoph's main criticism that "Failing over SG_IO does not make > sense" [2]. > > Doing the failover in qemu has the general disadvantage that qemu has > no notion about the number of available and/or healthy paths; it can > thus only guess the reasonable number of retries for any given I/O. But > that's unavoidable, given that the idea to do kernel-level failover on > SG_IO is rejected. > > Please let me know your thoughts. > > Best Regards > Martin > > [1] https://lore.kernel.org/all/20210628151558.2289-1-mwilck@suse.com/ > [2] https://lore.kernel.org/all/20210701075629.GA25768@lst.de/

Mikulas
Hello Mikulas, On Mon, 2025-05-12 at 15:46 +0200, Mikulas Patocka wrote: > Hi > > > On Thu, 8 May 2025, Martin Wilck wrote: > > > Hello Kevin, > > > > [I'm sorry for the previous email. It seems that I clicked "send" > > in > > the wrong window]. > > > > On Tue, 2025-04-29 at 18:50 +0200, Kevin Wolf wrote: > > > Multipath cannot directly provide failover for ioctls in the > > > kernel > > > because it doesn't know what each ioctl means and which result > > > could > > > indicate a path error. Userspace generally knows what the ioctl > > > it > > > issued means and if it might be a path error, but neither does it > > > know > > > which path the ioctl took nor does it necessarily have the > > > privileges > > > to > > > fail a path using the control device. > > > > Thanks for this effort. > > > > I have some remarks about your approach. The most important one is > > that > > the DM_MPATH_PROBE_PATHS_CMD ioctl appears to be a dangerous > > command. > > It sends IO to possibly broken paths and waits for it to complete. > > In > > the common error case of a device not responding, this might cause > > the > > code to hang for a long time in the kernel ioctl code path, in > > uninterruptible sleep (note that these probing commands will be > > queued > > Normal reads and writes would also hang in an uninterruptible sleep > if a > path stops responding. Right. That's why multipathd uses TEST UNIT READY if supported by the storage, and either aio or separate I/O threads to avoid the main thread being blocked, and generally goes to great lengths to make sure that misbehaving storage hardware can cause no freeze. The way I read Kevin's code, you'd queue up additional IO in the same context that had submitted the original failing IO. I realize that qemu uses asynchronous IO, but AFAIK the details depend on the qemu options for the individual VM, and it isn't clear to me whether or not DM_MPATH_PROBE_PATHS_CMD can bring the entire VM to halt. > > [...] > > > > I am wondering whether the DM_MPATH_PROBE_PATHS_CMD ioctl is > > necessary > > at all. Rather than triggering explicit path probing, you could > > have > > dm-mpath "handle" IO errors by failing the active path for "path > > errors". That would be similar to my patch set from 2021 [1], but > > it > > would avoid the extra IO and thus the additional risk of hanging in > > the > > kernel. > > What exactly do you mean? You say "you could have dm-mpath 'handle' > IO > errors by failing the active path for "path errors". What I meant was that dm-mpath could call fail_path() in case of an error, using a similar mechanism to the block IO code path. > Christoph doesn't want dm-mpath can't look at the error code - so dm- > mpath > doesn't know if path error occured or not. My impression from the previous discussion was that Christoph mainly objected to SG_IO requests being retried in the kernel [1], not so much to the inspection of the error code by device mapper. I may have misunderstood this of course. Adding Christoph into the loop so that he can clarify what he meant. > qemu could look at the error > code, but qemu doesn't know which path did the SG_IO go through - so > it > can't instruct qemu to fail that path. > > One of the possibility that we discussed was to add a path-id to > SG_IO, so > that dm-mpath would mark which path did the SG_IO go through and qemu > could instruct dm-mpath to fail that path. But we rejected it as > being > more complicated than the current approach. 
If we discuss extending SG_IO itself, it might be easier to have it store the blkstatus_t somewhere in the sg_io_hdr. More about that idea in my reply to Kevin. Regards, Martin [1] https://lore.kernel.org/all/20210701075629.GA25768@lst.de/ > > Contrary to my set, you wouldn't attempt retries in the kernel, but > > leave this part to qemu instead, like in the current set. That > > would > > avoid Christoph's main criticism that "Failing over SG_IO does not > > make > > sense" [2]. > > > > Doing the failover in qemu has the general disadvantage that qemu > > has > > no notion about the number of available and/or healthy paths; it > > can > > thus only guess the reasonable number of retries for any given I/O. > > But > > that's unavoidable, given that the idea to do kernel-level failover > > on > > SG_IO is rejected. > > > > Please let me know your thoughts. > > > > Best Regards > > Martin > > > > [1] > > https://lore.kernel.org/all/20210628151558.2289-1-mwilck@suse.com/ > > [2] https://lore.kernel.org/all/20210701075629.GA25768@lst.de/ > > Mikulas