From nobody Thu Oct 2 16:30:53 2025
From: Kenta Akagi <k@mgml.me>
To: Song Liu, Yu Kuai, Mariusz Tkaczyk, Shaohua Li, Guoqing Jiang
Cc: linux-raid@vger.kernel.org, linux-kernel@vger.kernel.org, Kenta Akagi
Subject: [PATCH v4 1/9] md/raid1,raid10: Set the LastDev flag when the configuration changes
Date: Mon, 15 Sep 2025 12:42:02 +0900
Message-ID: <20250915034210.8533-2-k@mgml.me>
In-Reply-To: <20250915034210.8533-1-k@mgml.me>
References: <20250915034210.8533-1-k@mgml.me>

Currently, the LastDev flag is set on an rdev that failed a failfast
metadata write and called md_error, but did not become Faulty. It is
cleared when the metadata write retry succeeds. This is problematic for
the following reasons:

* Despite its name, the flag is only set during a metadata write window.
* Unlike when LastDev and Failfast were introduced, md_error on the last
  rdev of a RAID1/10 array now sets MD_BROKEN. Thus, by the time LastDev
  is set, the array is already unwritable.

A following commit will prevent failfast bios from breaking the array,
which requires knowing from outside the personality whether an rdev is
the last one. For that purpose, LastDev should be set on rdevs that must
not be lost.
This commit ensures that LastDev is set on the indispensable rdev in a
degraded RAID1/10 array.

Signed-off-by: Kenta Akagi <k@mgml.me>
---
 drivers/md/md.c     |  4 +---
 drivers/md/md.h     |  6 +++---
 drivers/md/raid1.c  | 34 +++++++++++++++++++++++++++++++++-
 drivers/md/raid10.c | 34 +++++++++++++++++++++++++++++++++-
 4 files changed, 70 insertions(+), 8 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 4e033c26fdd4..268410b66b83 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -1007,10 +1007,8 @@ static void super_written(struct bio *bio)
 		if (!test_bit(Faulty, &rdev->flags)
 		    && (bio->bi_opf & MD_FAILFAST)) {
 			set_bit(MD_SB_NEED_REWRITE, &mddev->sb_flags);
-			set_bit(LastDev, &rdev->flags);
 		}
-	} else
-		clear_bit(LastDev, &rdev->flags);
+	}
 
 	bio_put(bio);
 
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 51af29a03079..ec598f9a8381 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -281,9 +281,9 @@ enum flag_bits {
 				 * It is expects that no bad block log
 				 * is present.
 				 */
-	LastDev,		/* Seems to be the last working dev as
-				 * it didn't fail, so don't use FailFast
-				 * any more for metadata
+	LastDev,		/* This is the last working rdev.
+				 * so don't use FailFast any more for
+				 * metadata.
 				 */
 	CollisionCheck,		/*
 				 * check if there is collision between raid1
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index bf44878ec640..32ad6b102ff7 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1733,6 +1733,33 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev)
 	seq_printf(seq, "]");
 }
 
+/**
+ * update_lastdev - Set or clear LastDev flag for all rdevs in array
+ * @conf: pointer to r1conf
+ *
+ * Sets LastDev if the device is In_sync and cannot be lost for the array.
+ * Otherwise, clear it.
+ *
+ * Caller must hold ->device_lock.
+ */
+static void update_lastdev(struct r1conf *conf)
+{
+	int i;
+	int alive_disks = conf->raid_disks - conf->mddev->degraded;
+
+	for (i = 0; i < conf->raid_disks; i++) {
+		struct md_rdev *rdev = conf->mirrors[i].rdev;
+
+		if (rdev) {
+			if (test_bit(In_sync, &rdev->flags) &&
+			    alive_disks == 1)
+				set_bit(LastDev, &rdev->flags);
+			else
+				clear_bit(LastDev, &rdev->flags);
+		}
+	}
+}
+
 /**
  * raid1_error() - RAID1 error handler.
  * @mddev: affected md device.
@@ -1767,8 +1794,10 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
 		}
 	}
 	set_bit(Blocked, &rdev->flags);
-	if (test_and_clear_bit(In_sync, &rdev->flags))
+	if (test_and_clear_bit(In_sync, &rdev->flags)) {
 		mddev->degraded++;
+		update_lastdev(conf);
+	}
 	set_bit(Faulty, &rdev->flags);
 	spin_unlock_irqrestore(&conf->device_lock, flags);
 	/*
@@ -1864,6 +1893,7 @@ static int raid1_spare_active(struct mddev *mddev)
 		}
 	}
 	mddev->degraded -= count;
+	update_lastdev(conf);
 	spin_unlock_irqrestore(&conf->device_lock, flags);
 
 	print_conf(conf);
@@ -3290,6 +3320,7 @@ static int raid1_run(struct mddev *mddev)
 	rcu_assign_pointer(conf->thread, NULL);
 	mddev->private = conf;
 	set_bit(MD_FAILFAST_SUPPORTED, &mddev->flags);
+	update_lastdev(conf);
 
 	md_set_array_sectors(mddev, raid1_size(mddev, 0, 0));
 
@@ -3427,6 +3458,7 @@ static int raid1_reshape(struct mddev *mddev)
 
 	spin_lock_irqsave(&conf->device_lock, flags);
 	mddev->degraded += (raid_disks - conf->raid_disks);
+	update_lastdev(conf);
 	spin_unlock_irqrestore(&conf->device_lock, flags);
 	conf->raid_disks = mddev->raid_disks = raid_disks;
 	mddev->delta_disks = 0;
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index b60c30bfb6c7..dc4edd4689f8 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1983,6 +1983,33 @@ static int enough(struct r10conf *conf, int ignore)
 		_enough(conf, 1, ignore);
 }
 
+/**
+ * update_lastdev - Set or clear LastDev flag for all rdevs in array
+ * @conf: pointer to r10conf
+ *
+ * Sets LastDev if the device is In_sync and cannot be lost for the array.
+ * Otherwise, clear it.
+ *
+ * Caller must hold ->reconfig_mutex or ->device_lock.
+ */
+static void update_lastdev(struct r10conf *conf)
+{
+	int i;
+	int raid_disks = max(conf->geo.raid_disks, conf->prev.raid_disks);
+
+	for (i = 0; i < raid_disks; i++) {
+		struct md_rdev *rdev = conf->mirrors[i].rdev;
+
+		if (rdev) {
+			if (test_bit(In_sync, &rdev->flags) &&
+			    !enough(conf, i))
+				set_bit(LastDev, &rdev->flags);
+			else
+				clear_bit(LastDev, &rdev->flags);
+		}
+	}
+}
+
 /**
  * raid10_error() - RAID10 error handler.
  * @mddev: affected md device.
@@ -2013,8 +2040,10 @@ static void raid10_error(struct mddev *mddev, struct md_rdev *rdev)
 			return;
 		}
 	}
-	if (test_and_clear_bit(In_sync, &rdev->flags))
+	if (test_and_clear_bit(In_sync, &rdev->flags)) {
 		mddev->degraded++;
+		update_lastdev(conf);
+	}
 
 	set_bit(MD_RECOVERY_INTR, &mddev->recovery);
 	set_bit(Blocked, &rdev->flags);
@@ -2102,6 +2131,7 @@ static int raid10_spare_active(struct mddev *mddev)
 	}
 	spin_lock_irqsave(&conf->device_lock, flags);
 	mddev->degraded -= count;
+	update_lastdev(conf);
 	spin_unlock_irqrestore(&conf->device_lock, flags);
 
 	print_conf(conf);
@@ -4159,6 +4189,7 @@ static int raid10_run(struct mddev *mddev)
 	md_set_array_sectors(mddev, size);
 	mddev->resync_max_sectors = size;
 	set_bit(MD_FAILFAST_SUPPORTED, &mddev->flags);
+	update_lastdev(conf);
 
 	if (md_integrity_register(mddev))
 		goto out_free_conf;
@@ -4567,6 +4598,7 @@ static int raid10_start_reshape(struct mddev *mddev)
 	 */
 	spin_lock_irq(&conf->device_lock);
 	mddev->degraded = calc_degraded(conf);
+	update_lastdev(conf);
 	spin_unlock_irq(&conf->device_lock);
 	mddev->raid_disks = conf->geo.raid_disks;
 	mddev->reshape_position = conf->reshape_progress;
-- 
2.50.1

From nobody Thu Oct 2 16:30:53 2025
From: Kenta Akagi <k@mgml.me>
To: Song Liu, Yu Kuai, Mariusz Tkaczyk, Shaohua Li, Guoqing Jiang
Cc: linux-raid@vger.kernel.org, linux-kernel@vger.kernel.org, Kenta Akagi
Subject: [PATCH v4 2/9] md: serialize md_error()
Date: Mon, 15 Sep 2025 12:42:03 +0900
Message-ID: <20250915034210.8533-3-k@mgml.me>
In-Reply-To: <20250915034210.8533-1-k@mgml.me>
References: <20250915034210.8533-1-k@mgml.me>

md_error() is mainly called when a bio fails, so it can run in parallel.
Each personality's error_handler takes device_lock, so concurrent calls
are safe today. However, RAID1 and RAID10 need changes for failfast bio
error handling, which requires a special helper alongside md_error().
For that helper to work, the regular md_error() must also be serialized.
The helper function, md_bio_failure_error(), will be introduced in a
subsequent commit.

This commit serializes md_error() for all RAID personalities. While this
is unnecessary for levels other than 1 and 10, it has no performance
impact because this is a cold path.
Signed-off-by: Kenta Akagi <k@mgml.me>
---
 drivers/md/md.c | 10 +++++++++-
 drivers/md/md.h |  4 ++++
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 268410b66b83..5607578a6db9 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -705,6 +705,7 @@ int mddev_init(struct mddev *mddev)
 	atomic_set(&mddev->openers, 0);
 	atomic_set(&mddev->sync_seq, 0);
 	spin_lock_init(&mddev->lock);
+	spin_lock_init(&mddev->error_handle_lock);
 	init_waitqueue_head(&mddev->sb_wait);
 	init_waitqueue_head(&mddev->recovery_wait);
 	mddev->reshape_position = MaxSector;
@@ -8262,7 +8263,7 @@ void md_unregister_thread(struct mddev *mddev, struct md_thread __rcu **threadp)
 }
 EXPORT_SYMBOL(md_unregister_thread);
 
-void md_error(struct mddev *mddev, struct md_rdev *rdev)
+void _md_error(struct mddev *mddev, struct md_rdev *rdev)
 {
 	if (!rdev || test_bit(Faulty, &rdev->flags))
 		return;
@@ -8287,6 +8288,13 @@ void md_error(struct mddev *mddev, struct md_rdev *rdev)
 		queue_work(md_misc_wq, &mddev->event_work);
 	md_new_event();
 }
+
+void md_error(struct mddev *mddev, struct md_rdev *rdev)
+{
+	spin_lock(&mddev->error_handle_lock);
+	_md_error(mddev, rdev);
+	spin_unlock(&mddev->error_handle_lock);
+}
 EXPORT_SYMBOL(md_error);
 
 /* seq_file implementation /proc/mdstat */
diff --git a/drivers/md/md.h b/drivers/md/md.h
index ec598f9a8381..5177cb609e4b 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -619,6 +619,9 @@ struct mddev {
 	/* The sequence number for sync thread */
 	atomic_t sync_seq;
 
+	/* Lock for serializing md_error */
+	spinlock_t error_handle_lock;
+
 	bool has_superblocks:1;
 	bool fail_last_dev:1;
 	bool serialize_policy:1;
@@ -901,6 +904,7 @@ extern void md_write_start(struct mddev *mddev, struct bio *bi);
 extern void md_write_inc(struct mddev *mddev, struct bio *bi);
 extern void md_write_end(struct mddev *mddev);
 extern void md_done_sync(struct mddev *mddev, int blocks, int ok);
+void _md_error(struct mddev *mddev, struct md_rdev *rdev);
 extern void md_error(struct mddev *mddev, struct md_rdev *rdev);
 extern void md_finish_reshape(struct mddev *mddev);
 void md_submit_discard_bio(struct mddev *mddev, struct md_rdev *rdev,
-- 
2.50.1

From nobody Thu Oct 2 16:30:53 2025
From: Kenta Akagi <k@mgml.me>
To: Song Liu, Yu Kuai, Mariusz Tkaczyk, Shaohua Li, Guoqing Jiang
Cc: linux-raid@vger.kernel.org, linux-kernel@vger.kernel.org, Kenta Akagi
Subject: [PATCH v4 3/9] md: introduce md_bio_failure_error()
Date: Mon, 15 Sep 2025 12:42:04 +0900
Message-ID: <20250915034210.8533-4-k@mgml.me>
In-Reply-To: <20250915034210.8533-1-k@mgml.me>
References: <20250915034210.8533-1-k@mgml.me>

Add a new helper function, md_bio_failure_error(). It is serialized with
md_error() under the same lock and behaves almost identically, with two
differences:

* It takes the failed bio as an argument.
* If MD_FAILFAST is set in bi_opf and the target rdev is LastDev, it
  does not mark the rdev Faulty.

Failfast bios must not break the array, but in the current
implementation this can happen. This is because MD_BROKEN was
introduced in RAID1/RAID10 and is set when md_error() is called on an
rdev required for mddev operation. At the time failfast was introduced,
this was not the case.
In the preceding commits, md_error() was serialized, and RAID1/RAID10
now mark the rdevs that failfast must not set Faulty with the LastDev
flag. The actual change in bio error handling will follow in a later
commit.

Signed-off-by: Kenta Akagi <k@mgml.me>
---
 drivers/md/md.c | 42 ++++++++++++++++++++++++++++++++++++++++++
 drivers/md/md.h |  4 +++-
 2 files changed, 45 insertions(+), 1 deletion(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 5607578a6db9..65fdd9bae8f4 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -8297,6 +8297,48 @@ void md_error(struct mddev *mddev, struct md_rdev *rdev)
 }
 EXPORT_SYMBOL(md_error);
 
+/**
+ * md_bio_failure_error() - md error handler for MD_FAILFAST bios
+ * @mddev: affected md device.
+ * @rdev: member device to fail.
+ * @bio: bio that triggered the device failure.
+ *
+ * This is almost the same as md_error(). That is, it is serialized at
+ * the same level as md_error(), marks the rdev as Faulty, and changes
+ * the mddev status.
+ * However, if all of the following conditions are met, it does nothing.
+ * This is because MD_FAILFAST bios must not stop the array.
+ * * RAID1 or RAID10
+ * * LastDev - if rdev becomes Faulty, mddev will stop
+ * * The failed bio has MD_FAILFAST set
+ *
+ * Returns: true if _md_error() was called, false if not.
+ */
+bool md_bio_failure_error(struct mddev *mddev, struct md_rdev *rdev, struct bio *bio)
+{
+	bool do_md_error = true;
+
+	spin_lock(&mddev->error_handle_lock);
+	if (mddev->pers) {
+		if (mddev->pers->head.id == ID_RAID1 ||
+		    mddev->pers->head.id == ID_RAID10) {
+			if (test_bit(LastDev, &rdev->flags) &&
+			    test_bit(FailFast, &rdev->flags) &&
+			    bio != NULL && (bio->bi_opf & MD_FAILFAST))
+				do_md_error = false;
+		}
+	}
+
+	if (do_md_error)
+		_md_error(mddev, rdev);
+	else
+		pr_warn_ratelimited("md: %s: %s didn't do anything for %pg\n",
+				    mdname(mddev), __func__, rdev->bdev);
+
+	spin_unlock(&mddev->error_handle_lock);
+	return do_md_error;
+}
+EXPORT_SYMBOL(md_bio_failure_error);
+
 /* seq_file implementation /proc/mdstat */
 
 static void status_unused(struct seq_file *seq)
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 5177cb609e4b..11389ea58431 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -283,7 +283,8 @@ enum flag_bits {
 				 */
 	LastDev,		/* This is the last working rdev.
 				 * so don't use FailFast any more for
-				 * metadata.
+				 * metadata and don't Fail rdev
+				 * when FailFast bio failure.
 				 */
 	CollisionCheck,		/*
 				 * check if there is collision between raid1
@@ -906,6 +907,7 @@ extern void md_write_end(struct mddev *mddev);
 extern void md_done_sync(struct mddev *mddev, int blocks, int ok);
 void _md_error(struct mddev *mddev, struct md_rdev *rdev);
 extern void md_error(struct mddev *mddev, struct md_rdev *rdev);
+extern bool md_bio_failure_error(struct mddev *mddev, struct md_rdev *rdev, struct bio *bio);
 extern void md_finish_reshape(struct mddev *mddev);
 void md_submit_discard_bio(struct mddev *mddev, struct md_rdev *rdev,
 			   struct bio *bio, sector_t start, sector_t size);
-- 
2.50.1

From nobody Thu Oct 2 16:30:53 2025
From: Kenta Akagi <k@mgml.me>
To: Song Liu, Yu Kuai, Mariusz Tkaczyk, Shaohua Li, Guoqing Jiang
Cc: linux-raid@vger.kernel.org, linux-kernel@vger.kernel.org, Kenta Akagi
Subject: [PATCH v4 4/9] md/raid1,raid10: Don't set MD_BROKEN on failfast bio failure
Date: Mon, 15 Sep 2025 12:42:05 +0900
Message-ID: <20250915034210.8533-5-k@mgml.me>
In-Reply-To: <20250915034210.8533-1-k@mgml.me>
References: <20250915034210.8533-1-k@mgml.me>

Failfast is a feature implemented only for RAID1 and RAID10. It
instructs the block device backing the rdev to return a bio error
immediately, without retrying, when a problem occurs.
This allows a problematic rdev to be detached quickly and minimizes IO
latency. By its nature, a failfast bio can fail easily, and md must not
mark an essential rdev Faulty or set MD_BROKEN on the array just
because a failfast bio failed.

When failfast was introduced, RAID1 and RAID10 were designed to
continue operating normally even if md_error() was called for the last
rdev. However, since MD_BROKEN was introduced for RAID1/RAID10 in
commit 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10"), calling
md_error() for the last rdev prevents further writes to the array.
Despite this, the current failfast error handler still assumes that
calling md_error() will not break the array.

Normally this is not an issue, because MD_FAILFAST is not set when a
bio is issued to the last rdev. However, if the array is not degraded
and bios with MD_FAILFAST have been issued, simultaneous failures can
break the array. This is unusual but possible; for example, it can
occur with NVMe over TCP when all rdevs depend on a single Ethernet
link.

In other words, this becomes a problem under the following conditions:

Preconditions:
* Failfast is enabled on all rdevs.
* All rdevs are In_sync - a requirement for a bio to be submitted with
  MD_FAILFAST.
* At least one bio has been submitted but has not yet completed.

Trigger condition:
* All underlying devices of the rdevs return an error for their
  failfast bios.

Whether the bio is a read or a write makes little difference to the
outcome. In the write case, md_error() is invoked on each rdev through
its bi_end_io handler. In the read case, losing the first rdev triggers
a metadata update. Then md_super_write(), unlike raid1_write_request(),
issues the bio with MD_FAILFAST if the rdev supports it, so the bio
fails immediately - before this patchset, LastDev was set only by the
failure path in super_written().
Consequently, super_written() calls md_error() on the remaining rdev.
Prior to this commit, the following changes were introduced:

* The helper function md_bio_failure_error(), which skips the error
  handler if a failfast bio targets the last rdev.
* Serialization of md_error() and md_bio_failure_error().
* Setting the LastDev flag on rdevs that must not be lost.

This commit uses md_bio_failure_error() instead of md_error() for
failfast bio failures, ensuring that failfast bios do not stop array
operations.

Fixes: 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10")
Signed-off-by: Kenta Akagi <k@mgml.me>
---
 drivers/md/md.c     |  5 +----
 drivers/md/raid1.c  | 37 ++++++++++++++++++-------------------
 drivers/md/raid10.c |  9 +++++----
 3 files changed, 24 insertions(+), 27 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 65fdd9bae8f4..65814bbe9bad 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -1004,11 +1004,8 @@ static void super_written(struct bio *bio)
 	if (bio->bi_status) {
 		pr_err("md: %s gets error=%d\n", __func__,
 		       blk_status_to_errno(bio->bi_status));
-		md_error(mddev, rdev);
-		if (!test_bit(Faulty, &rdev->flags)
-		    && (bio->bi_opf & MD_FAILFAST)) {
+		if (!md_bio_failure_error(mddev, rdev, bio))
 			set_bit(MD_SB_NEED_REWRITE, &mddev->sb_flags);
-		}
 	}
 
 	bio_put(bio);
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 32ad6b102ff7..8fff9dacc6e0 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -470,7 +470,7 @@ static void raid1_end_write_request(struct bio *bio)
 		    (bio->bi_opf & MD_FAILFAST) &&
 		    /* We never try FailFast to WriteMostly devices */
 		    !test_bit(WriteMostly, &rdev->flags)) {
-			md_error(r1_bio->mddev, rdev);
+			md_bio_failure_error(r1_bio->mddev, rdev, bio);
 		}
 
 	/*
@@ -2178,8 +2178,7 @@ static int fix_sync_read_error(struct r1bio *r1_bio)
 			if (test_bit(FailFast, &rdev->flags)) {
 				/* Don't try recovering from here - just fail it
 				 * ... unless it is the last working device of course */
-				md_error(mddev, rdev);
-				if (test_bit(Faulty, &rdev->flags))
+				if (md_bio_failure_error(mddev, rdev, bio))
 					/* Don't try to read from here, but make sure
 					 * put_buf does it's thing
 					 */
@@ -2657,9 +2656,8 @@ static void handle_write_finished(struct r1conf *conf, struct r1bio *r1_bio)
 static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio)
 {
 	struct mddev *mddev = conf->mddev;
-	struct bio *bio;
+	struct bio *bio, *updated_bio;
 	struct md_rdev *rdev;
-	sector_t sector;
 
 	clear_bit(R1BIO_ReadError, &r1_bio->state);
 	/* we got a read error. Maybe the drive is bad.  Maybe just
@@ -2672,29 +2670,30 @@ static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio)
 	 */
 
 	bio = r1_bio->bios[r1_bio->read_disk];
-	bio_put(bio);
-	r1_bio->bios[r1_bio->read_disk] = NULL;
+	updated_bio = NULL;
 
 	rdev = conf->mirrors[r1_bio->read_disk].rdev;
-	if (mddev->ro == 0
-	    && !test_bit(FailFast, &rdev->flags)) {
-		freeze_array(conf, 1);
-		fix_read_error(conf, r1_bio);
-		unfreeze_array(conf);
-	} else if (mddev->ro == 0 && test_bit(FailFast, &rdev->flags)) {
-		md_error(mddev, rdev);
+	if (mddev->ro == 0) {
+		if (!test_bit(FailFast, &rdev->flags)) {
+			freeze_array(conf, 1);
+			fix_read_error(conf, r1_bio);
+			unfreeze_array(conf);
+		} else {
+			md_bio_failure_error(mddev, rdev, bio);
+		}
 	} else {
-		r1_bio->bios[r1_bio->read_disk] = IO_BLOCKED;
+		updated_bio = IO_BLOCKED;
 	}
 
+	bio_put(bio);
+	r1_bio->bios[r1_bio->read_disk] = updated_bio;
+
 	rdev_dec_pending(rdev, conf->mddev);
-	sector = r1_bio->sector;
-	bio = r1_bio->master_bio;
 
 	/* Reuse the old r1_bio so that the IO_BLOCKED settings are preserved */
 	r1_bio->state = 0;
-	raid1_read_request(mddev, bio, r1_bio->sectors, r1_bio);
-	allow_barrier(conf, sector);
+	raid1_read_request(mddev, r1_bio->master_bio, r1_bio->sectors, r1_bio);
+	allow_barrier(conf, r1_bio->sector);
 }
 
 static void raid1d(struct md_thread *thread)
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index dc4edd4689f8..b73af94a88b0 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -488,7 +488,7 @@ static void raid10_end_write_request(struct bio *bio)
 		dec_rdev = 0;
 		if (test_bit(FailFast, &rdev->flags) &&
 		    (bio->bi_opf & MD_FAILFAST)) {
-			md_error(rdev->mddev, rdev);
+			md_bio_failure_error(rdev->mddev, rdev, bio);
 		}
 
 	/*
@@ -2443,7 +2443,7 @@ static void sync_request_write(struct mddev *mddev, struct r10bio *r10_bio)
 			continue;
 		} else if (test_bit(FailFast, &rdev->flags)) {
 			/* Just give up on this device */
-			md_error(rdev->mddev, rdev);
+			md_bio_failure_error(rdev->mddev, rdev, tbio);
 			continue;
 		}
 		/* Ok, we need to write this bio, either to correct an
@@ -2895,8 +2895,9 @@ static void handle_read_error(struct mddev *mddev, struct r10bio *r10_bio)
 		freeze_array(conf, 1);
 		fix_read_error(conf, mddev, r10_bio);
 		unfreeze_array(conf);
-	} else
-		md_error(mddev, rdev);
+	} else {
+		md_bio_failure_error(mddev, rdev, bio);
+	}
 
 	rdev_dec_pending(rdev, mddev);
 	r10_bio->state = 0;
-- 
2.50.1

From nobody Thu Oct 2 16:30:53 2025
From: Kenta Akagi
To: Song Liu, Yu Kuai, Mariusz Tkaczyk, Shaohua Li, Guoqing Jiang
Cc: linux-raid@vger.kernel.org, linux-kernel@vger.kernel.org, Kenta Akagi
Subject: [PATCH v4 5/9] md/raid1,raid10: Set R{1,10}BIO_Uptodate when successful retry of a failed bio
Date: Mon, 15 Sep 2025 12:42:06 +0900
Message-ID: <20250915034210.8533-6-k@mgml.me>
In-Reply-To: <20250915034210.8533-1-k@mgml.me>
References: <20250915034210.8533-1-k@mgml.me>

In the current implementation, when a write bio fails, the retry flow is
as follows:

* In bi_end_io, e.g. raid1_end_write_request, R1BIO_WriteError is set on
  the r1bio.
* The md thread calls handle_write_finished for this r1bio.
* Inside handle_write_finished, narrow_write_error is invoked.
* narrow_write_error rewrites the r1bio on a per-sector basis, marking any
  failed sectors as badblocks. It returns true if all sectors succeed, or
  if the failed sectors are successfully recorded via rdev_set_badblocks.
  It returns false if rdev_set_badblocks fails or if badblocks are
  disabled.
* handle_write_finished marks the rdev Faulty if narrow_write_error
  returns false. Otherwise, it does nothing.

This can cause a problem where an r1bio that succeeded on retry is
incorrectly reported as failed to the higher layer, for example in the
following case:

* Only one In_sync rdev exists, and
* The write bio initially failed, but all retries in narrow_write_error
  succeeded.

This commit ensures that if a write initially fails but all retries in
narrow_write_error succeed, R1BIO_Uptodate or R10BIO_Uptodate is set and
the higher layer receives a successful write status.
Signed-off-by: Kenta Akagi
---
 drivers/md/raid1.c  | 32 ++++++++++++++++++++++++++------
 drivers/md/raid10.c | 21 +++++++++++++++++++++
 2 files changed, 47 insertions(+), 6 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 8fff9dacc6e0..806f5cb33a8e 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -2517,6 +2517,21 @@ static void fix_read_error(struct r1conf *conf, struct r1bio *r1_bio)
 	}
 }
 
+/**
+ * narrow_write_error() - Retry write and set badblock
+ * @r1_bio: the r1bio containing the write error
+ * @i: which device to retry
+ *
+ * Rewrites the bio, splitting it at the least common multiple of the logical
+ * block size and the badblock size. Blocks that fail to be written are marked
+ * as bad. If badblocks are disabled, no write is attempted and false is
+ * returned immediately.
+ *
+ * Return:
+ * * %true - all blocks were written or marked bad successfully
+ * * %false - bbl disabled or
+ *   one or more blocks write failed and could not be marked bad
+ */
 static bool narrow_write_error(struct r1bio *r1_bio, int i)
 {
 	struct mddev *mddev = r1_bio->mddev;
@@ -2614,9 +2629,9 @@ static void handle_write_finished(struct r1conf *conf, struct r1bio *r1_bio)
 	int m, idx;
 	bool fail = false;
 
-	for (m = 0; m < conf->raid_disks * 2 ; m++)
+	for (m = 0; m < conf->raid_disks * 2 ; m++) {
+		struct md_rdev *rdev = conf->mirrors[m].rdev;
 		if (r1_bio->bios[m] == IO_MADE_GOOD) {
-			struct md_rdev *rdev = conf->mirrors[m].rdev;
 			rdev_clear_badblocks(rdev,
 					     r1_bio->sector,
 					     r1_bio->sectors, 0);
@@ -2628,12 +2643,17 @@ static void handle_write_finished(struct r1conf *conf, struct r1bio *r1_bio)
 			 */
 			fail = true;
 			if (!narrow_write_error(r1_bio, m))
-				md_error(conf->mddev,
-					 conf->mirrors[m].rdev);
+				md_error(conf->mddev, rdev);
 				/* an I/O failed, we can't clear the bitmap */
-			rdev_dec_pending(conf->mirrors[m].rdev,
-					 conf->mddev);
+			else if (test_bit(In_sync, &rdev->flags) &&
+				 !test_bit(Faulty, &rdev->flags) &&
+				 rdev_has_badblock(rdev,
+						   r1_bio->sector,
+						   r1_bio->sectors) == 0)
+				set_bit(R1BIO_Uptodate, &r1_bio->state);
+			rdev_dec_pending(rdev, conf->mddev);
 		}
+	}
 	if (fail) {
 		spin_lock_irq(&conf->device_lock);
 		list_add(&r1_bio->retry_list, &conf->bio_end_io_list);
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index b73af94a88b0..21c2821453e1 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -2809,6 +2809,21 @@ static void fix_read_error(struct r10conf *conf, struct mddev *mddev, struct r10
 	}
 }
 
+/**
+ * narrow_write_error() - Retry write and set badblock
+ * @r10_bio: the r10bio containing the write error
+ * @i: which device to retry
+ *
+ * Rewrites the bio, splitting it at the least common multiple of the logical
+ * block size and the badblock size. Blocks that fail to be written are marked
+ * as bad. If badblocks are disabled, no write is attempted and false is
+ * returned immediately.
+ *
+ * Return:
+ * * %true - all blocks were written or marked bad successfully
+ * * %false - bbl disabled or
+ *   one or more blocks write failed and could not be marked bad
+ */
 static bool narrow_write_error(struct r10bio *r10_bio, int i)
 {
 	struct bio *bio = r10_bio->master_bio;
@@ -2975,6 +2990,12 @@ static void handle_write_completed(struct r10conf *conf, struct r10bio *r10_bio)
 			fail = true;
 			if (!narrow_write_error(r10_bio, m))
 				md_error(conf->mddev, rdev);
+			else if (test_bit(In_sync, &rdev->flags) &&
+				 !test_bit(Faulty, &rdev->flags) &&
+				 rdev_has_badblock(rdev,
+						   r10_bio->devs[m].addr,
+						   r10_bio->sectors) == 0)
+				set_bit(R10BIO_Uptodate, &r10_bio->state);
 			rdev_dec_pending(rdev, conf->mddev);
 		}
 		bio = r10_bio->devs[m].repl_bio;
-- 
2.50.1

From nobody Thu Oct 2 16:30:53 2025
From: Kenta Akagi
To: Song Liu, Yu Kuai, Mariusz Tkaczyk, Shaohua Li, Guoqing Jiang
Cc: linux-raid@vger.kernel.org, linux-kernel@vger.kernel.org, Kenta Akagi
Subject: [PATCH v4 6/9] md/raid1,raid10: Fix missing retries of Failfast write bios on no-bbl rdevs
Date: Mon, 15 Sep 2025 12:42:07 +0900
Message-ID: <20250915034210.8533-7-k@mgml.me>
In-Reply-To: <20250915034210.8533-1-k@mgml.me>
References: <20250915034210.8533-1-k@mgml.me>

In the current implementation, write failures are not retried on rdevs
with badblocks disabled. This is because narrow_write_error, which issues
the retry bios, returns immediately when badblocks are disabled. As a
result, a single write failure on such an rdev immediately marks it
Faulty.

The retry mechanism appears to have been implemented under the assumption
that a bad block is involved in the failure. However, the retry after an
MD_FAILFAST write failure depends on this code, and a failfast write
request may fail for reasons unrelated to bad blocks. Consequently, if
failfast is enabled and badblocks are disabled on all rdevs, and all rdevs
encounter a failfast write bio failure at the same time, no retries occur
and the entire array can be lost.

This commit adds a path in narrow_write_error to retry writes even on
rdevs where badblocks are disabled; failed bios marked with MD_FAILFAST
use this path. For non-failfast cases, the behavior is unchanged: no retry
writes are attempted on rdevs with badblocks disabled.
Fixes: 1919cbb23bf1 ("md/raid10: add failfast handling for writes.")
Fixes: 212e7eb7a340 ("md/raid1: add failfast handling for writes.")
Signed-off-by: Kenta Akagi
---
 drivers/md/raid1.c  | 44 +++++++++++++++++++++++++++++---------------
 drivers/md/raid10.c | 37 ++++++++++++++++++++++++-------------
 2 files changed, 53 insertions(+), 28 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 806f5cb33a8e..55213bcd82f4 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -2521,18 +2521,19 @@ static void fix_read_error(struct r1conf *conf, struct r1bio *r1_bio)
  * narrow_write_error() - Retry write and set badblock
  * @r1_bio: the r1bio containing the write error
  * @i: which device to retry
+ * @force: Retry writing even if badblock is disabled
  *
  * Rewrites the bio, splitting it at the least common multiple of the logical
  * block size and the badblock size. Blocks that fail to be written are marked
- * as bad. If badblocks are disabled, no write is attempted and false is
- * returned immediately.
+ * as bad. If bbl disabled and @force is not set, no retry is attempted.
+ * If bbl disabled and @force is set, the write is retried in the same way.
  *
  * Return:
  * * %true - all blocks were written or marked bad successfully
  * * %false - bbl disabled or
  *   one or more blocks write failed and could not be marked bad
  */
-static bool narrow_write_error(struct r1bio *r1_bio, int i)
+static bool narrow_write_error(struct r1bio *r1_bio, int i, bool force)
 {
 	struct mddev *mddev = r1_bio->mddev;
 	struct r1conf *conf = mddev->private;
@@ -2553,13 +2554,17 @@ static bool narrow_write_error(struct r1bio *r1_bio, int i)
 	sector_t sector;
 	int sectors;
 	int sect_to_write = r1_bio->sectors;
-	bool ok = true;
+	bool write_ok = true;
+	bool setbad_ok = true;
+	bool bbl_enabled = !(rdev->badblocks.shift < 0);
 
-	if (rdev->badblocks.shift < 0)
+	if (!force && !bbl_enabled)
 		return false;
 
-	block_sectors = roundup(1 << rdev->badblocks.shift,
-				bdev_logical_block_size(rdev->bdev) >> 9);
+	block_sectors = bdev_logical_block_size(rdev->bdev) >> 9;
+	if (bbl_enabled)
+		block_sectors = roundup(1 << rdev->badblocks.shift,
+					block_sectors);
 	sector = r1_bio->sector;
 	sectors = ((sector + block_sectors) & ~(sector_t)(block_sectors - 1))
@@ -2587,18 +2592,22 @@ static bool narrow_write_error(struct r1bio *r1_bio, int i)
 		bio_trim(wbio, sector - r1_bio->sector, sectors);
 		wbio->bi_iter.bi_sector += rdev->data_offset;
 
-		if (submit_bio_wait(wbio) < 0)
+		if (submit_bio_wait(wbio) < 0) {
 			/* failure! */
-			ok = rdev_set_badblocks(rdev, sector,
-						sectors, 0)
-				&& ok;
+			write_ok = false;
+			if (bbl_enabled)
+				setbad_ok = rdev_set_badblocks(rdev, sector,
+							       sectors, 0)
+					&& setbad_ok;
+		}
 
 		bio_put(wbio);
 		sect_to_write -= sectors;
 		sector += sectors;
 		sectors = block_sectors;
 	}
-	return ok;
+	return (write_ok ||
+		(bbl_enabled && setbad_ok));
 }
 
 static void handle_sync_write_finished(struct r1conf *conf, struct r1bio *r1_bio)
@@ -2631,18 +2640,23 @@ static void handle_write_finished(struct r1conf *conf, struct r1bio *r1_bio)
 
 	for (m = 0; m < conf->raid_disks * 2 ; m++) {
 		struct md_rdev *rdev = conf->mirrors[m].rdev;
-		if (r1_bio->bios[m] == IO_MADE_GOOD) {
+		struct bio *bio = r1_bio->bios[m];
+
+		if (bio == IO_MADE_GOOD) {
 			rdev_clear_badblocks(rdev,
 					     r1_bio->sector,
 					     r1_bio->sectors, 0);
 			rdev_dec_pending(rdev, conf->mddev);
-		} else if (r1_bio->bios[m] != NULL) {
+		} else if (bio != NULL) {
 			/* This drive got a write error. We need to
 			 * narrow down and record precise write
 			 * errors.
 			 */
 			fail = true;
-			if (!narrow_write_error(r1_bio, m))
+			if (!narrow_write_error(
+				    r1_bio, m,
+				    test_bit(FailFast, &rdev->flags) &&
+				    (bio->bi_opf & MD_FAILFAST)))
 				md_error(conf->mddev, rdev);
 				/* an I/O failed, we can't clear the bitmap */
 			else if (test_bit(In_sync, &rdev->flags) &&
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 21c2821453e1..92cf3047dce6 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -2813,18 +2813,18 @@ static void fix_read_error(struct r10conf *conf, struct mddev *mddev, struct r10
  * narrow_write_error() - Retry write and set badblock
  * @r10_bio: the r10bio containing the write error
  * @i: which device to retry
+ * @force: Retry writing even if badblock is disabled
  *
  * Rewrites the bio, splitting it at the least common multiple of the logical
  * block size and the badblock size. Blocks that fail to be written are marked
- * as bad. If badblocks are disabled, no write is attempted and false is
- * returned immediately.
+ * as bad. If bbl disabled and @force is not set, no retry is attempted.
  *
  * Return:
  * * %true - all blocks were written or marked bad successfully
  * * %false - bbl disabled or
  *   one or more blocks write failed and could not be marked bad
  */
-static bool narrow_write_error(struct r10bio *r10_bio, int i)
+static bool narrow_write_error(struct r10bio *r10_bio, int i, bool force)
 {
 	struct bio *bio = r10_bio->master_bio;
 	struct mddev *mddev = r10_bio->mddev;
@@ -2845,13 +2845,17 @@ static bool narrow_write_error(struct r10bio *r10_bio, int i)
 	sector_t sector;
 	int sectors;
 	int sect_to_write = r10_bio->sectors;
-	bool ok = true;
+	bool write_ok = true;
+	bool setbad_ok = true;
+	bool bbl_enabled = !(rdev->badblocks.shift < 0);
 
-	if (rdev->badblocks.shift < 0)
+	if (!force && !bbl_enabled)
 		return false;
 
-	block_sectors = roundup(1 << rdev->badblocks.shift,
-				bdev_logical_block_size(rdev->bdev) >> 9);
+	block_sectors = bdev_logical_block_size(rdev->bdev) >> 9;
+	if (bbl_enabled)
+		block_sectors = roundup(1 << rdev->badblocks.shift,
+					block_sectors);
 	sector = r10_bio->sector;
 	sectors = ((r10_bio->sector + block_sectors) &
 		   ~(sector_t)(block_sectors - 1))
@@ -2871,18 +2875,22 @@ static bool narrow_write_error(struct r10bio *r10_bio, int i)
 			choose_data_offset(r10_bio, rdev);
 		wbio->bi_opf = REQ_OP_WRITE;
 
-		if (submit_bio_wait(wbio) < 0)
+		if (submit_bio_wait(wbio) < 0) {
 			/* Failure! */
-			ok = rdev_set_badblocks(rdev, wsector,
-						sectors, 0)
-				&& ok;
+			write_ok = false;
+			if (bbl_enabled)
+				setbad_ok = rdev_set_badblocks(rdev, wsector,
+							       sectors, 0)
+					&& setbad_ok;
+		}
 
 		bio_put(wbio);
 		sect_to_write -= sectors;
 		sector += sectors;
 		sectors = block_sectors;
 	}
-	return ok;
+	return (write_ok ||
+		(bbl_enabled && setbad_ok));
 }
 
 static void handle_read_error(struct mddev *mddev, struct r10bio *r10_bio)
@@ -2988,7 +2996,10 @@ static void handle_write_completed(struct r10conf *conf, struct r10bio *r10_bio)
 			rdev_dec_pending(rdev, conf->mddev);
 		} else if (bio != NULL && bio->bi_status) {
 			fail = true;
-			if (!narrow_write_error(r10_bio, m))
+			if (!narrow_write_error(
+				    r10_bio, m,
+				    test_bit(FailFast, &rdev->flags) &&
+				    (bio->bi_opf & MD_FAILFAST)))
 				md_error(conf->mddev, rdev);
 			else if (test_bit(In_sync, &rdev->flags) &&
 				 !test_bit(Faulty, &rdev->flags) &&
-- 
2.50.1

From nobody Thu Oct 2 16:30:53 2025
From: Kenta Akagi
To: Song Liu, Yu Kuai, Mariusz Tkaczyk, Shaohua Li, Guoqing Jiang
Cc: linux-raid@vger.kernel.org, linux-kernel@vger.kernel.org, Kenta Akagi
Subject: [PATCH v4 7/9] md/raid10: fix failfast read error not rescheduled
Date: Mon, 15 Sep 2025 12:42:08 +0900
Message-ID: <20250915034210.8533-8-k@mgml.me>
In-Reply-To: <20250915034210.8533-1-k@mgml.me>
References: <20250915034210.8533-1-k@mgml.me>

raid10_end_read_request lacks a path to retry when a FailFast IO fails.
As a result, when failfast read IOs fail on all rdevs, the upper layer
receives EIO without the read being rescheduled.

Looking at the two commits below, it seems that only
raid10_end_read_request lacks the failfast read retry handling, while
raid1_end_read_request has it. In RAID1, the retry works as expected.

* commit 8d3ca83dcf9c ("md/raid10: add failfast handling for reads.")
* commit 2e52d449bcec ("md/raid1: add failfast handling for reads.")

I don't know why raid10_end_read_request lacks this, but it is probably
a simple oversight. This commit makes a failfast read bio on the last
rdev in raid10 be retried if it fails.

Fixes: 8d3ca83dcf9c ("md/raid10: add failfast handling for reads.")
Signed-off-by: Kenta Akagi
Reviewed-by: Li Nan
---
 drivers/md/raid10.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 92cf3047dce6..86c0eacd37cb 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -399,6 +399,11 @@ static void raid10_end_read_request(struct bio *bio)
 		 * wait for the 'master' bio.
 		 */
 		set_bit(R10BIO_Uptodate, &r10_bio->state);
+	} else if (test_bit(FailFast, &rdev->flags) &&
+		   test_bit(R10BIO_FailFast, &r10_bio->state)) {
+		/* This was a fail-fast read so we definitely
+		 * want to retry */
+		;
 	} else if (!raid1_should_handle_error(bio)) {
 		uptodate = 1;
 	} else {
-- 
2.50.1

From nobody Thu Oct 2 16:30:53 2025
From: Kenta Akagi
To: Song Liu, Yu Kuai, Mariusz Tkaczyk, Shaohua Li, Guoqing Jiang
Cc: linux-raid@vger.kernel.org, linux-kernel@vger.kernel.org, Kenta Akagi
Subject: [PATCH v4 8/9] md/raid1,raid10: Add error message when setting MD_BROKEN
Date: Mon, 15 Sep 2025 12:42:09 +0900
Message-ID: <20250915034210.8533-9-k@mgml.me>
In-Reply-To: <20250915034210.8533-1-k@mgml.me>
References: <20250915034210.8533-1-k@mgml.me>

Once MD_BROKEN is set on an array, no further writes can be performed to
it. The user must be informed that the array cannot continue operation.

Signed-off-by: Kenta Akagi
---
 drivers/md/raid1.c  | 4 ++++
 drivers/md/raid10.c | 4 ++++
 2 files changed, 8 insertions(+)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 55213bcd82f4..febe2849a71a 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1786,6 +1786,10 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
 	if (test_bit(In_sync, &rdev->flags) &&
 	    (conf->raid_disks - mddev->degraded) == 1) {
 		set_bit(MD_BROKEN, &mddev->flags);
+		pr_crit("md/raid1:%s: Disk failure on %pg, this is the last device.\n"
+			"md/raid1:%s: Cannot continue operation (%d/%d failed).\n",
+			mdname(mddev), rdev->bdev,
+			mdname(mddev), mddev->degraded + 1, conf->raid_disks);
 
 		if (!mddev->fail_last_dev) {
 			conf->recovery_disabled = mddev->recovery_disabled;
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 86c0eacd37cb..be5fd77da3e1 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -2039,6 +2039,10 @@ static void raid10_error(struct mddev *mddev, struct md_rdev *rdev)
 
 	if (test_bit(In_sync, &rdev->flags) && !enough(conf, rdev->raid_disk)) {
 		set_bit(MD_BROKEN, &mddev->flags);
+		pr_crit("md/raid10:%s: Disk failure on %pg, this is the last device.\n"
+			"md/raid10:%s: Cannot continue operation (%d/%d failed).\n",
+			mdname(mddev), rdev->bdev,
+			mdname(mddev), mddev->degraded + 1, conf->geo.raid_disks);
 
 		if (!mddev->fail_last_dev) {
 			spin_unlock_irqrestore(&conf->device_lock, flags);
-- 
2.50.1

From nobody Thu Oct 2 16:30:53 2025
From: Kenta Akagi
To: Song Liu, Yu Kuai, Mariusz Tkaczyk, Shaohua Li, Guoqing Jiang
Cc: linux-raid@vger.kernel.org, linux-kernel@vger.kernel.org, Kenta Akagi
Subject: [PATCH v4 9/9] md/raid1,raid10: Fix: Operation continuing on 0 devices.
Date: Mon, 15 Sep 2025 12:42:10 +0900
Message-ID: <20250915034210.8533-10-k@mgml.me>
In-Reply-To: <20250915034210.8533-1-k@mgml.me>
References: <20250915034210.8533-1-k@mgml.me>

Since commit 9a567843f7ce ("md: allow last device to be forcibly removed
from RAID1/RAID10."), RAID1/10 arrays can lose all rdevs. Before that
commit, losing the array's last rdev, or reaching the end of
raid{1,10}_error without an early return, never occurred. However, both
situations can occur in the current implementation.

As a result, when mddev->fail_last_dev is set, a spurious pr_crit message
can be printed. This patch prevents "Operation continuing" from being
printed if the array is not operational.

root@fedora:~# mdadm --create --verbose /dev/md0 --level=1 \
    --raid-devices=2 /dev/loop0 /dev/loop1
mdadm: Note: this array has metadata at the start and
    may not be suitable as a boot device.  If you plan to
    store '/boot' on this device please ensure that
    your boot-loader understands md/v1.x metadata, or use
    --metadata=0.90
mdadm: size set to 1046528K
Continue creating array? y
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.
root@fedora:~# echo 1 > /sys/block/md0/md/fail_last_dev
root@fedora:~# mdadm --fail /dev/md0 loop0
mdadm: set loop0 faulty in /dev/md0
root@fedora:~# mdadm --fail /dev/md0 loop1
mdadm: set device faulty failed for loop1: Device or resource busy
root@fedora:~# dmesg | tail -n 4
[ 1314.359674] md/raid1:md0: Disk failure on loop0, disabling device.
               md/raid1:md0: Operation continuing on 1 devices.
[ 1315.506633] md/raid1:md0: Disk failure on loop1, disabling device.
               md/raid1:md0: Operation continuing on 0 devices.
root@fedora:~#

Fixes: 9a567843f7ce ("md: allow last device to be forcibly removed from RAID1/RAID10.")
Signed-off-by: Kenta Akagi
---
 drivers/md/raid1.c  | 9 +++++----
 drivers/md/raid10.c | 9 +++++----
 2 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index febe2849a71a..b3c845855841 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1803,6 +1803,11 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
 		update_lastdev(conf);
 	}
 	set_bit(Faulty, &rdev->flags);
+	if ((conf->raid_disks - mddev->degraded) > 0)
+		pr_crit("md/raid1:%s: Disk failure on %pg, disabling device.\n"
+			"md/raid1:%s: Operation continuing on %d devices.\n",
+			mdname(mddev), rdev->bdev,
+			mdname(mddev), conf->raid_disks - mddev->degraded);
 	spin_unlock_irqrestore(&conf->device_lock, flags);
 	/*
 	 * if recovery is running, make sure it aborts.
@@ -1810,10 +1815,6 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
 	set_bit(MD_RECOVERY_INTR, &mddev->recovery);
 	set_mask_bits(&mddev->sb_flags, 0,
 		      BIT(MD_SB_CHANGE_DEVS) | BIT(MD_SB_CHANGE_PENDING));
-	pr_crit("md/raid1:%s: Disk failure on %pg, disabling device.\n"
-		"md/raid1:%s: Operation continuing on %d devices.\n",
-		mdname(mddev), rdev->bdev,
-		mdname(mddev), conf->raid_disks - mddev->degraded);
 }
 
 static void print_conf(struct r1conf *conf)
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index be5fd77da3e1..4f3ef43ebd2a 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -2059,11 +2059,12 @@ static void raid10_error(struct mddev *mddev, struct md_rdev *rdev)
 	set_bit(Faulty, &rdev->flags);
 	set_mask_bits(&mddev->sb_flags, 0,
 		      BIT(MD_SB_CHANGE_DEVS) | BIT(MD_SB_CHANGE_PENDING));
+	if (enough(conf, -1))
+		pr_crit("md/raid10:%s: Disk failure on %pg, disabling device.\n"
+			"md/raid10:%s: Operation continuing on %d devices.\n",
+			mdname(mddev), rdev->bdev,
+			mdname(mddev), conf->geo.raid_disks - mddev->degraded);
 	spin_unlock_irqrestore(&conf->device_lock, flags);
-	pr_crit("md/raid10:%s: Disk failure on %pg, disabling device.\n"
-		"md/raid10:%s: Operation continuing on %d devices.\n",
-		mdname(mddev), rdev->bdev,
-		mdname(mddev), conf->geo.raid_disks - mddev->degraded);
 }
 
 static void print_conf(struct r10conf *conf)
-- 
2.50.1