From: Kevin Wolf
To: qemu-block@nongnu.org
Cc: kwolf@redhat.com, stefanha@redhat.com, hreitz@redhat.com, bmarzins@redhat.com, pbonzini@redhat.com, qemu-devel@nongnu.org
Subject: [PATCH v2] file-posix: Probe paths and retry SG_IO on potential path errors
Date: Thu, 22 May 2025 15:08:03 +0200
Message-ID: <20250522130803.34738-1-kwolf@redhat.com>

When scsi-block is used on a host multipath device, it runs into the
problem that the kernel dm-mpath doesn't know anything about SCSI or
SG_IO and therefore can't decide if a SG_IO request returned an error
and needs to be retried on a different path. Instead of getting working
failover, an error is returned to scsi-block and handled according to
the configured error policy. This is not what users want: they want
working failover.

QEMU can parse the SG_IO result and determine whether this could have
been a path error, but simply retrying the same request could send it to
the same failing path again and result in the same error.

With a kernel that supports the DM_MPATH_PROBE_PATHS ioctl on dm-mpath
block devices (queued in the device mapper tree for Linux 6.16), we can
tell the kernel to probe all paths and tell us if any usable paths
remained. If so, we can now retry the SG_IO ioctl and expect it to be
sent to a working path.

Signed-off-by: Kevin Wolf
Reviewed-by: Hanna Czenczek
Reviewed-by: Stefan Hajnoczi
---
v2:
- Add a comment to explain retry scenarios [Stefan]
- Handle -EAGAIN returned for suspended devices [Ben]

 block/file-posix.c | 115 ++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 114 insertions(+), 1 deletion(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index ec95b74869..569f4ca472 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -41,6 +41,7 @@
 
 #include "scsi/pr-manager.h"
 #include "scsi/constants.h"
+#include "scsi/utils.h"
 
 #if defined(__APPLE__) && (__MACH__)
 #include
@@ -72,6 +73,7 @@
 #include
 #endif
 #include
+#include
 #include
 #include
 #include
@@ -138,6 +140,22 @@
 #define RAW_LOCK_PERM_BASE 100
 #define RAW_LOCK_SHARED_BASE 200
 
+/*
+ * Multiple retries are mostly meant for two separate scenarios:
+ *
+ * - DM_MPATH_PROBE_PATHS returns success, but before SG_IO completes, another
+ *   path goes down.
+ *
+ * - DM_MPATH_PROBE_PATHS failed all paths in the current path group, so we have
+ *   to send another SG_IO to switch to another path group to probe the paths in
+ *   it.
+ *
+ * Even if each path is in a separate path group (path_grouping_policy set to
+ * failover), it's rare to have more than eight path groups - and even then
+ * pretty unlikely that only bad path groups would be chosen in eight retries.
+ */
+#define SG_IO_MAX_RETRIES 8
+
 typedef struct BDRVRawState {
     int fd;
     bool use_lock;
@@ -165,6 +183,7 @@ typedef struct BDRVRawState {
     bool use_linux_aio:1;
     bool has_laio_fdsync:1;
     bool use_linux_io_uring:1;
+    bool use_mpath:1;
     int page_cache_inconsistent; /* errno from fdatasync failure */
     bool has_fallocate;
     bool needs_alignment;
@@ -4264,15 +4283,105 @@ hdev_open_Mac_error:
     /* Since this does ioctl the device must be already opened */
     bs->sg = hdev_is_sg(bs);
 
+    /* sg devices aren't even block devices and can't use dm-mpath */
+    s->use_mpath = !bs->sg;
+
     return ret;
 }
 
 #if defined(__linux__)
+#if defined(DM_MPATH_PROBE_PATHS)
+static bool sgio_path_error(int ret, sg_io_hdr_t *io_hdr)
+{
+    if (ret < 0) {
+        switch (ret) {
+        case -ENODEV:
+            return true;
+        case -EAGAIN:
+            /*
+             * The device is probably suspended. This happens while the dm table
+             * is reloaded, e.g. because a path is added or removed. This is an
+             * operation that should complete within 1ms, so just wait a bit and
+             * retry.
+             *
+             * If the device was suspended for another reason, we'll wait and
+             * retry SG_IO_MAX_RETRIES times. This is a tolerable delay before
+             * we return an error and potentially stop the VM.
+             */
+            qemu_co_sleep_ns(QEMU_CLOCK_REALTIME, 1000000);
+            return true;
+        default:
+            return false;
+        }
+    }
+
+    if (io_hdr->host_status != SCSI_HOST_OK) {
+        return true;
+    }
+
+    switch (io_hdr->status) {
+    case GOOD:
+    case CONDITION_GOOD:
+    case INTERMEDIATE_GOOD:
+    case INTERMEDIATE_C_GOOD:
+    case RESERVATION_CONFLICT:
+    case COMMAND_TERMINATED:
+        return false;
+    case CHECK_CONDITION:
+        return !scsi_sense_buf_is_guest_recoverable(io_hdr->sbp,
+                                                    io_hdr->mx_sb_len);
+    default:
+        return true;
+    }
+}
+
+static bool coroutine_fn hdev_co_ioctl_sgio_retry(RawPosixAIOData *acb, int ret)
+{
+    BDRVRawState *s = acb->bs->opaque;
+    RawPosixAIOData probe_acb;
+
+    if (!s->use_mpath) {
+        return false;
+    }
+
+    if (!sgio_path_error(ret, acb->ioctl.buf)) {
+        return false;
+    }
+
+    probe_acb = (RawPosixAIOData) {
+        .bs = acb->bs,
+        .aio_type = QEMU_AIO_IOCTL,
+        .aio_fildes = s->fd,
+        .aio_offset = 0,
+        .ioctl = {
+            .buf = NULL,
+            .cmd = DM_MPATH_PROBE_PATHS,
+        },
+    };
+
+    ret = raw_thread_pool_submit(handle_aiocb_ioctl, &probe_acb);
+    if (ret == -ENOTTY) {
+        s->use_mpath = false;
+    } else if (ret == -EAGAIN) {
+        /* The device might be suspended for a table reload, worth retrying */
+        return true;
+    }
+
+    return ret == 0;
+}
+#else
+static bool coroutine_fn hdev_co_ioctl_sgio_retry(RawPosixAIOData *acb, int ret)
+{
+    return false;
+}
+#endif /* DM_MPATH_PROBE_PATHS */
+
 static int coroutine_fn
 hdev_co_ioctl(BlockDriverState *bs, unsigned long int req, void *buf)
 {
     BDRVRawState *s = bs->opaque;
     RawPosixAIOData acb;
+    int retries = SG_IO_MAX_RETRIES;
     int ret;
 
     ret = fd_open(bs);
@@ -4300,7 +4409,11 @@ hdev_co_ioctl(BlockDriverState *bs, unsigned long int req, void *buf)
         },
     };
 
-    return raw_thread_pool_submit(handle_aiocb_ioctl, &acb);
+    do {
+        ret = raw_thread_pool_submit(handle_aiocb_ioctl, &acb);
+    } while (req == SG_IO && retries-- && hdev_co_ioctl_sgio_retry(&acb, ret));
+
+    return ret;
 }
 #endif /* linux */
 
-- 
2.49.0
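
The bounded retry loop at the end of hdev_co_ioctl() can be hard to reason about at a glance, so here is a minimal standalone sketch of the same control flow. It is not QEMU code: FakeDevice, fake_submit(), fake_should_retry() and submit_with_retries() are hypothetical stand-ins for the device, the SG_IO submission and the DM_MPATH_PROBE_PATHS probe, and the probe is assumed to always report that a usable path remains. Only the loop shape and the limit of eight retries are taken from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_RETRIES 8 /* same value as SG_IO_MAX_RETRIES in the patch */

/* Hypothetical device: the first `failures` submissions return `error`. */
typedef struct FakeDevice {
    int failures;    /* number of submissions that fail before success */
    int error;       /* negative errno returned by failing submissions */
    int submissions; /* how often the request was submitted */
    int probes;      /* how often the (simulated) path probe ran */
} FakeDevice;

/* Stand-in for submitting the SG_IO request. */
static int fake_submit(FakeDevice *dev)
{
    dev->submissions++;
    return dev->submissions <= dev->failures ? dev->error : 0;
}

/*
 * Stand-in for the retry decision: only -ENODEV is treated as a potential
 * path error here, and the simulated probe always finds a usable path.
 */
static bool fake_should_retry(FakeDevice *dev, int ret)
{
    if (ret != -ENODEV) {
        return false;
    }
    dev->probes++;
    return true;
}

/* Same loop shape as in the patch: one initial attempt plus up to 8 retries. */
static int submit_with_retries(FakeDevice *dev)
{
    int retries = MAX_RETRIES;
    int ret;

    assert(dev != NULL);

    do {
        ret = fake_submit(dev);
    } while (retries-- && fake_should_retry(dev, ret));

    return ret;
}
```

Because `retries--` is evaluated before the retry check, the request is submitted at most nine times (one initial attempt plus eight retries), and the probe is skipped once the retry budget is exhausted. An error that is not classified as a path error is returned after the first attempt without probing.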