From: Kenta Akagi <k@mgml.me>
To: Song Liu, Yu Kuai, Shaohua Li, Mariusz Tkaczyk, Guoqing Jiang
Cc: linux-raid@vger.kernel.org, linux-kernel@vger.kernel.org, Kenta Akagi <k@mgml.me>
Subject: [PATCH v5 11/16] md/raid1: Prevent setting MD_BROKEN on failfast bio failure
Date: Tue, 28 Oct 2025 00:04:28 +0900
Message-ID: <20251027150433.18193-12-k@mgml.me>
X-Mailer: git-send-email 2.50.1
In-Reply-To: <20251027150433.18193-1-k@mgml.me>
References: <20251027150433.18193-1-k@mgml.me>

Failfast is a feature implemented only for RAID1 and RAID10. It instructs
the block device backing the rdev to return a bio error immediately,
without retrying, when a problem occurs. This allows a problematic rdev
to be detached quickly and minimizes I/O latency.
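(For context, not part of this patch: MD_FAILFAST maps to the block layer's
failfast request flags, and the RAID1 submission paths only set it while more
than one device is working. A simplified sketch, paraphrased from md.h and
raid1_write_request(); exact conditions vary by kernel version:

    /* md.h */
    #define MD_FAILFAST	(REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT)

    /* raid1_write_request(), simplified: the flag is only set while the
     * array is not degraded, so it normally never reaches the last
     * working rdev.
     */
    if (test_bit(FailFast, &rdev->flags) &&
        !test_bit(WriteMostly, &rdev->flags) &&
        conf->raid_disks - mddev->degraded > 1)
            mbio->bi_opf |= MD_FAILFAST;
)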
Due to its nature, failfast bios can fail easily, and md must not mark an
essential rdev as Faulty or set MD_BROKEN on the array just because a
failfast bio failed.

When failfast was introduced, RAID1 and RAID10 were designed to continue
operating normally even if md_error was called for the last rdev. However,
since MD_BROKEN was introduced for RAID1/RAID10 in commit 9631abdbf406
("md: Set MD_BROKEN for RAID1 and RAID10"), calling md_error for the last
rdev prevents further writes to the array. Despite this, the current
failfast error handler still assumes that calling md_error will not break
the array.

Normally this is not an issue, because MD_FAILFAST is not set when a bio is
issued to the last rdev. However, if the array is not degraded and bios with
MD_FAILFAST have been issued, simultaneous failures can break the array.
This is unusual but possible; for example, it can occur with NVMe over TCP
when all rdevs depend on a single Ethernet link.

In other words, this becomes a problem under the following conditions:

Preconditions:
* Failfast is enabled on all rdevs.
* All rdevs are In_sync - this is required for a bio to be submitted with
  MD_FAILFAST.
* At least one bio has been submitted but has not yet completed.

Trigger condition:
* All underlying devices of the rdevs return an error for their failfast
  bios.

Whether the bios are reads or writes, both rdevs will eventually be lost.
In the write case, md_error is invoked on each rdev through its bi_end_io
handler. In the read case, if the bio has been issued to multiple rdevs via
read_balance, the outcome is the same as in the write case. Even in the
read case where a bio has been issued to only a single rdev, both rdevs are
lost in the following sequence:
1. Losing the first rdev triggers a metadata update.
2. md_super_write issues the superblock bio with MD_FAILFAST, causing it to
   fail immediately. md_super_write always issues an MD_FAILFAST bio if the
   rdev has FailFast set, regardless of whether other rdevs remain.
3. Because the bio issued by md_super_write failed, super_written calls
   md_error on the remaining rdev.

This commit fixes the write case and the first read case: a failfast bio
failure no longer causes the last rdev to become Faulty or the array to be
marked MD_BROKEN. The second read case, i.e. the failure of the metadata
update, was already fixed in the previous commit.

Signed-off-by: Kenta Akagi <k@mgml.me>
---
 drivers/md/raid1.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index a70ca6bc28f3..bf96ae78a8b1 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -470,7 +470,7 @@ static void raid1_end_write_request(struct bio *bio)
 		    (bio->bi_opf & MD_FAILFAST) &&
 		    /* We never try FailFast to WriteMostly devices */
 		    !test_bit(WriteMostly, &rdev->flags)) {
-			md_error(r1_bio->mddev, rdev);
+			md_cond_error(r1_bio->mddev, rdev, bio);
 		}
 
 	/*
@@ -2177,8 +2177,7 @@ static int fix_sync_read_error(struct r1bio *r1_bio)
 			if (test_bit(FailFast, &rdev->flags)) {
 				/* Don't try recovering from here - just fail it
 				 * ... unless it is the last working device of course */
-				md_error(mddev, rdev);
-				if (test_bit(Faulty, &rdev->flags))
+				if (md_cond_error(mddev, rdev, bio))
 					/* Don't try to read from here, but make sure
 					 * put_buf does it's thing
 					 */
@@ -2671,20 +2670,20 @@ static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio)
 	 */
 
 	bio = r1_bio->bios[r1_bio->read_disk];
-	bio_put(bio);
 	r1_bio->bios[r1_bio->read_disk] = NULL;
 
 	rdev = conf->mirrors[r1_bio->read_disk].rdev;
 	if (mddev->ro) {
 		r1_bio->bios[r1_bio->read_disk] = IO_BLOCKED;
 	} else if (test_bit(FailFast, &rdev->flags)) {
-		md_error(mddev, rdev);
+		md_cond_error(mddev, rdev, bio);
 	} else {
 		freeze_array(conf, 1);
 		fix_read_error(conf, r1_bio);
 		unfreeze_array(conf);
 	}
 
+	bio_put(bio);
 	rdev_dec_pending(rdev, conf->mddev);
 	sector = r1_bio->sector;
 
-- 
2.50.1
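(Not shown in this mail: md_cond_error() is presumably introduced by an
earlier patch in this series. A hypothetical sketch of the semantics the
hunks above rely on - skip md_error() when the failure comes only from a
failfast bio on the last working rdev, and report whether the rdev is
Faulty; the real helper and the is_last_working_rdev check below are
assumptions and may differ:

    /* Hypothetical reconstruction, not the actual helper from the series:
     * call md_error() unless the failure is a failfast bio error on the
     * last working rdev (the caller is expected to retry without
     * failfast). Returns true if the rdev is Faulty afterwards.
     */
    static bool md_cond_error(struct mddev *mddev, struct md_rdev *rdev,
                              struct bio *bio)
    {
            /* assumption: "last working" approximated via degraded count */
            bool last = (mddev->raid_disks - mddev->degraded) <= 1;

            if (!(bio->bi_opf & MD_FAILFAST) || !last)
                    md_error(mddev, rdev);
            return test_bit(Faulty, &rdev->flags);
    }
)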