file-posix: Handle suspended dm-multipath better for SG_IO

[PATCH for-10.2] file-posix: Handle suspended dm-multipath better for SG_IO

Posted by Kevin Wolf 2 months, 1 week ago

When introducing DM_MPATH_PROBE_PATHS, we already anticipated that
dm-multipath devices might be suspended for a short time when the DM
tables are reloaded and that they return -EAGAIN in this case. The
behaviour promised in the comment wasn't actually implemented, though:
We don't get SG_IO_MAX_RETRIES in practice, because after the first
1ms sleep, DM_MPATH_PROBE_PATHS is called and if that still fails with
-EAGAIN, we error out immediately without any retry.

However, meanwhile it has also turned out that libmpathpersist (which is
used by qemu-pr-helper) may need to perform more complex recovery
operations to get reservations back to expected state if a path failure
happened in the middle of a PR operation. In this case, the device is
suspended for a longer time compared to the case we originally expected.

This patch changes hdev_co_ioctl() to treat -EAGAIN separately so that
it doesn't result in an immediate failure if the device is suspended for
more than 1ms, and moves to incremental backoff to cover both quick and
slow cases without excessive delays.

Buglink: https://issues.redhat.com/browse/RHEL-121543
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/file-posix.c | 56 ++++++++++++++++++++++++++++------------------
 1 file changed, 34 insertions(+), 22 deletions(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index c9e367a222..6265d2e248 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -4288,25 +4288,8 @@ hdev_open_Mac_error:
 static bool coroutine_fn sgio_path_error(int ret, sg_io_hdr_t *io_hdr)
 {
     if (ret < 0) {
-        switch (ret) {
-        case -ENODEV:
-            return true;
-        case -EAGAIN:
-            /*
-             * The device is probably suspended. This happens while the dm table
-             * is reloaded, e.g. because a path is added or removed. This is an
-             * operation that should complete within 1ms, so just wait a bit and
-             * retry.
-             *
-             * If the device was suspended for another reason, we'll wait and
-             * retry SG_IO_MAX_RETRIES times. This is a tolerable delay before
-             * we return an error and potentially stop the VM.
-             */
-            qemu_co_sleep_ns(QEMU_CLOCK_REALTIME, 1000000);
-            return true;
-        default:
-            return false;
-        }
+        /* Path errors sometimes result in -ENODEV */
+        return ret == -ENODEV;
     }
 
     if (io_hdr->host_status != SCSI_HOST_OK) {
@@ -4375,6 +4358,7 @@ hdev_co_ioctl(BlockDriverState *bs, unsigned long int req, void *buf)
 {
     BDRVRawState *s = bs->opaque;
     RawPosixAIOData acb;
+    uint64_t eagain_sleep_ns = 1 * SCALE_MS;
     int retries = SG_IO_MAX_RETRIES;
     int ret;
 
@@ -4403,9 +4387,37 @@ hdev_co_ioctl(BlockDriverState *bs, unsigned long int req, void *buf)
         },
     };
 
-    do {
-        ret = raw_thread_pool_submit(handle_aiocb_ioctl, &acb);
-    } while (req == SG_IO && retries-- && hdev_co_ioctl_sgio_retry(&acb, ret));
+retry:
+    ret = raw_thread_pool_submit(handle_aiocb_ioctl, &acb);
+    if (req == SG_IO && s->use_mpath) {
+        if (ret == -EAGAIN && eagain_sleep_ns < NANOSECONDS_PER_SECOND) {
+            /*
+             * If this is a multipath device, it is probably suspended.
+             *
+             * This can happen while the dm table is reloaded, e.g. because a
+             * path is added or removed. This is an operation that should
+             * complete within 1ms, so just wait a bit and retry.
+             *
+             * There are also some cases in which libmpathpersist must recover
+             * from path failure during its operation, which can leave the
+             * device suspended for a bit longer while the library brings back
+             * reservations into the expected state.
+             *
+             * Use increasing delays to cover both cases without waiting
+             * excessively, and stop after a bit more than a second (1023 ms).
+             * This is a tolerable delay before we return an error and
+             * potentially stop the VM.
+             */
+            qemu_co_sleep_ns(QEMU_CLOCK_REALTIME, eagain_sleep_ns);
+            eagain_sleep_ns *= 2;
+            goto retry;
+        }
+
+        /* Even for ret == 0, the SG_IO header can contain an error */
+        if (retries-- && hdev_co_ioctl_sgio_retry(&acb, ret)) {
+            goto retry;
+        }
+    }
 
     return ret;
 }
-- 
2.51.1

Re: [PATCH for-10.2] file-posix: Handle suspended dm-multipath better for SG_IO

Posted by Hanna Czenczek 2 months ago

On 28.11.25 23:14, Kevin Wolf wrote:
> When introducing DM_MPATH_PROBE_PATHS, we already anticipated that
> dm-multipath devices might be suspended for a short time when the DM
> tables are reloaded and that they return -EAGAIN in this case. The
> behaviour promised in the comment wasn't actually implemented, though:
> We don't get SG_IO_MAX_RETRIES in practice, because after the first
> 1ms sleep, DM_MPATH_PROBE_PATHS is called and if that still fails with
> -EAGAIN, we error out immediately without any retry.

How so?  `hdev_co_ioctl_sgio_retry()` is what issues 
`DM_MPATH_PROBE_PATHS`, and if it gets `-EAGAIN` it will return `true`, 
requesting a retry.

> However, meanwhile it has also turned out that libmpathpersist (which is
> used by qemu-pr-helper) may need to perform more complex recovery
> operations to get reservations back to expected state if a path failure
> happened in the middle of a PR operation. In this case, the device is
> suspended for a longer time compared to the case we originally expected.

In any case, this does warrant the change.

> This patch changes hdev_co_ioctl() to treat -EAGAIN separately so that
> it doesn't result in an immediate failure if the device is suspended for
> more than 1ms, and moves to incremental backoff to cover both quick and
> slow cases without excessive delays.
>
> Buglink: https://issues.redhat.com/browse/RHEL-121543
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>   block/file-posix.c | 56 ++++++++++++++++++++++++++++------------------
>   1 file changed, 34 insertions(+), 22 deletions(-)

Reviewed-by: Hanna Czenczek <hreitz@redhat.com>

> diff --git a/block/file-posix.c b/block/file-posix.c
> index c9e367a222..6265d2e248 100644
> --- a/block/file-posix.c
> +++ b/block/file-posix.c
> @@ -4288,25 +4288,8 @@ hdev_open_Mac_error:
>   static bool coroutine_fn sgio_path_error(int ret, sg_io_hdr_t *io_hdr)
>   {
>       if (ret < 0) {
> -        switch (ret) {
> -        case -ENODEV:
> -            return true;
> -        case -EAGAIN:
> -            /*
> -             * The device is probably suspended. This happens while the dm table
> -             * is reloaded, e.g. because a path is added or removed. This is an
> -             * operation that should complete within 1ms, so just wait a bit and
> -             * retry.
> -             *
> -             * If the device was suspended for another reason, we'll wait and
> -             * retry SG_IO_MAX_RETRIES times. This is a tolerable delay before
> -             * we return an error and potentially stop the VM.
> -             */
> -            qemu_co_sleep_ns(QEMU_CLOCK_REALTIME, 1000000);
> -            return true;
> -        default:
> -            return false;
> -        }
> +        /* Path errors sometimes result in -ENODEV */
> +        return ret == -ENODEV;
>       }
>   
>       if (io_hdr->host_status != SCSI_HOST_OK) {
> @@ -4375,6 +4358,7 @@ hdev_co_ioctl(BlockDriverState *bs, unsigned long int req, void *buf)
>   {
>       BDRVRawState *s = bs->opaque;
>       RawPosixAIOData acb;
> +    uint64_t eagain_sleep_ns = 1 * SCALE_MS;
>       int retries = SG_IO_MAX_RETRIES;
>       int ret;
>   
> @@ -4403,9 +4387,37 @@ hdev_co_ioctl(BlockDriverState *bs, unsigned long int req, void *buf)
>           },
>       };
>   
> -    do {
> -        ret = raw_thread_pool_submit(handle_aiocb_ioctl, &acb);
> -    } while (req == SG_IO && retries-- && hdev_co_ioctl_sgio_retry(&acb, ret));
> +retry:
> +    ret = raw_thread_pool_submit(handle_aiocb_ioctl, &acb);
> +    if (req == SG_IO && s->use_mpath) {
> +        if (ret == -EAGAIN && eagain_sleep_ns < NANOSECONDS_PER_SECOND) {
> +            /*
> +             * If this is a multipath device, it is probably suspended.
> +             *
> +             * This can happen while the dm table is reloaded, e.g. because a
> +             * path is added or removed. This is an operation that should
> +             * complete within 1ms, so just wait a bit and retry.
> +             *
> +             * There are also some cases in which libmpathpersist must recover
> +             * from path failure during its operation, which can leave the
> +             * device suspended for a bit longer while the library brings back
> +             * reservations into the expected state.
> +             *
> +             * Use increasing delays to cover both cases without waiting
> +             * excessively, and stop after a bit more than a second (1023 ms).
> +             * This is a tolerable delay before we return an error and
> +             * potentially stop the VM.
> +             */
> +            qemu_co_sleep_ns(QEMU_CLOCK_REALTIME, eagain_sleep_ns);
> +            eagain_sleep_ns *= 2;
> +            goto retry;
> +        }
> +
> +        /* Even for ret == 0, the SG_IO header can contain an error */
> +        if (retries-- && hdev_co_ioctl_sgio_retry(&acb, ret)) {
> +            goto retry;
> +        }
> +    }
>   
>       return ret;
>   }

Re: [PATCH for-10.2] file-posix: Handle suspended dm-multipath better for SG_IO

Posted by Kevin Wolf 2 months ago

Am 03.12.2025 um 18:27 hat Hanna Czenczek geschrieben:
> On 28.11.25 23:14, Kevin Wolf wrote:
> > When introducing DM_MPATH_PROBE_PATHS, we already anticipated that
> > dm-multipath devices might be suspended for a short time when the DM
> > tables are reloaded and that they return -EAGAIN in this case. The
> > behaviour promised in the comment wasn't actually implemented, though:
> > We don't get SG_IO_MAX_RETRIES in practice, because after the first
> > 1ms sleep, DM_MPATH_PROBE_PATHS is called and if that still fails with
> > -EAGAIN, we error out immediately without any retry.
> 
> How so?  `hdev_co_ioctl_sgio_retry()` is what issues `DM_MPATH_PROBE_PATHS`,
> and if it gets `-EAGAIN` it will return `true`, requesting a retry.

Oops... You're right. I somehow ended up testing the old state with a
commit that isn't exactly the same as what went into master and it
didn't have the -EAGAIN special case yet. So this is just me running and
looking at the wrong version of the code.

Unfortunately, it's too late to fix the commit message now. But then,
it's only a random comment about the previous state and doesn't really
make a difference for this patch.

Kevin