From: Kenta Akagi
To: Song Liu, Yu Kuai, Shaohua Li, Mariusz Tkaczyk, Guoqing Jiang
Cc: linux-raid@vger.kernel.org, linux-kernel@vger.kernel.org, Kenta Akagi
Subject: [PATCH v5 01/16] md: move device_lock from conf to mddev
Date: Tue, 28 Oct 2025 00:04:18 +0900
Message-ID: <20251027150433.18193-2-k@mgml.me>
X-Mailer: git-send-email 2.50.1
In-Reply-To: <20251027150433.18193-1-k@mgml.me>
References: <20251027150433.18193-1-k@mgml.me>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

Move device_lock from mddev->private (r1conf, r10conf, r5conf) to the
mddev structure, in preparation for serializing md_error() and for
introducing a new function that calls md_error() conditionally, to
improve failfast bio error handling.

This commit only moves the location of device_lock; there are no
functional changes. Subsequent commits will serialize md_error() and
introduce the new conditional md_error() helper.
Signed-off-by: Kenta Akagi Reviewed-by: Yu Kuai --- drivers/md/md.c | 1 + drivers/md/md.h | 2 + drivers/md/raid1.c | 51 +++++++------- drivers/md/raid1.h | 2 - drivers/md/raid10.c | 61 +++++++++-------- drivers/md/raid10.h | 1 - drivers/md/raid5-cache.c | 16 ++--- drivers/md/raid5.c | 139 +++++++++++++++++++-------------------- drivers/md/raid5.h | 1 - 9 files changed, 135 insertions(+), 139 deletions(-) diff --git a/drivers/md/md.c b/drivers/md/md.c index 6062e0deb616..d667580e3125 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -760,6 +760,7 @@ int mddev_init(struct mddev *mddev) atomic_set(&mddev->openers, 0); atomic_set(&mddev->sync_seq, 0); spin_lock_init(&mddev->lock); + spin_lock_init(&mddev->device_lock); init_waitqueue_head(&mddev->sb_wait); init_waitqueue_head(&mddev->recovery_wait); mddev->reshape_position =3D MaxSector; diff --git a/drivers/md/md.h b/drivers/md/md.h index 5d5f780b8447..64ac22edf372 100644 --- a/drivers/md/md.h +++ b/drivers/md/md.h @@ -543,6 +543,8 @@ struct mddev { /* used for register new sync thread */ struct work_struct sync_work; =20 + spinlock_t device_lock; + /* "lock" protects: * flush_bio transition from NULL to !NULL * rdev superblocks, events diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c index 592a40233004..7924d5ee189d 100644 --- a/drivers/md/raid1.c +++ b/drivers/md/raid1.c @@ -282,10 +282,10 @@ static void reschedule_retry(struct r1bio *r1_bio) int idx; =20 idx =3D sector_to_idx(r1_bio->sector); - spin_lock_irqsave(&conf->device_lock, flags); + spin_lock_irqsave(&conf->mddev->device_lock, flags); list_add(&r1_bio->retry_list, &conf->retry_list); atomic_inc(&conf->nr_queued[idx]); - spin_unlock_irqrestore(&conf->device_lock, flags); + spin_unlock_irqrestore(&conf->mddev->device_lock, flags); =20 wake_up(&conf->wait_barrier); md_wakeup_thread(mddev->thread); @@ -387,12 +387,12 @@ static void raid1_end_read_request(struct bio *bio) * Here we redefine "uptodate" to mean "Don't want to retry" */ unsigned long 
flags; - spin_lock_irqsave(&conf->device_lock, flags); + spin_lock_irqsave(&conf->mddev->device_lock, flags); if (r1_bio->mddev->degraded =3D=3D conf->raid_disks || (r1_bio->mddev->degraded =3D=3D conf->raid_disks-1 && test_bit(In_sync, &rdev->flags))) uptodate =3D 1; - spin_unlock_irqrestore(&conf->device_lock, flags); + spin_unlock_irqrestore(&conf->mddev->device_lock, flags); } =20 if (uptodate) { @@ -917,14 +917,14 @@ static void flush_pending_writes(struct r1conf *conf) /* Any writes that have been queued but are awaiting * bitmap updates get flushed here. */ - spin_lock_irq(&conf->device_lock); + spin_lock_irq(&conf->mddev->device_lock); =20 if (conf->pending_bio_list.head) { struct blk_plug plug; struct bio *bio; =20 bio =3D bio_list_get(&conf->pending_bio_list); - spin_unlock_irq(&conf->device_lock); + spin_unlock_irq(&conf->mddev->device_lock); =20 /* * As this is called in a wait_event() loop (see freeze_array), @@ -940,7 +940,7 @@ static void flush_pending_writes(struct r1conf *conf) flush_bio_list(conf, bio); blk_finish_plug(&plug); } else - spin_unlock_irq(&conf->device_lock); + spin_unlock_irq(&conf->mddev->device_lock); } =20 /* Barriers.... 
@@ -1274,9 +1274,9 @@ static void raid1_unplug(struct blk_plug_cb *cb, bool= from_schedule) struct bio *bio; =20 if (from_schedule) { - spin_lock_irq(&conf->device_lock); + spin_lock_irq(&conf->mddev->device_lock); bio_list_merge(&conf->pending_bio_list, &plug->pending); - spin_unlock_irq(&conf->device_lock); + spin_unlock_irq(&conf->mddev->device_lock); wake_up_barrier(conf); md_wakeup_thread(mddev->thread); kfree(plug); @@ -1664,9 +1664,9 @@ static void raid1_write_request(struct mddev *mddev, = struct bio *bio, /* flush_pending_writes() needs access to the rdev so...*/ mbio->bi_bdev =3D (void *)rdev; if (!raid1_add_bio_to_plug(mddev, mbio, raid1_unplug, disks)) { - spin_lock_irqsave(&conf->device_lock, flags); + spin_lock_irqsave(&conf->mddev->device_lock, flags); bio_list_add(&conf->pending_bio_list, mbio); - spin_unlock_irqrestore(&conf->device_lock, flags); + spin_unlock_irqrestore(&conf->mddev->device_lock, flags); md_wakeup_thread(mddev->thread); } } @@ -1753,7 +1753,7 @@ static void raid1_error(struct mddev *mddev, struct m= d_rdev *rdev) struct r1conf *conf =3D mddev->private; unsigned long flags; =20 - spin_lock_irqsave(&conf->device_lock, flags); + spin_lock_irqsave(&conf->mddev->device_lock, flags); =20 if (test_bit(In_sync, &rdev->flags) && (conf->raid_disks - mddev->degraded) =3D=3D 1) { @@ -1761,7 +1761,7 @@ static void raid1_error(struct mddev *mddev, struct m= d_rdev *rdev) =20 if (!mddev->fail_last_dev) { conf->recovery_disabled =3D mddev->recovery_disabled; - spin_unlock_irqrestore(&conf->device_lock, flags); + spin_unlock_irqrestore(&conf->mddev->device_lock, flags); return; } } @@ -1769,7 +1769,7 @@ static void raid1_error(struct mddev *mddev, struct m= d_rdev *rdev) if (test_and_clear_bit(In_sync, &rdev->flags)) mddev->degraded++; set_bit(Faulty, &rdev->flags); - spin_unlock_irqrestore(&conf->device_lock, flags); + spin_unlock_irqrestore(&conf->mddev->device_lock, flags); /* * if recovery is running, make sure it aborts. 
*/ @@ -1831,7 +1831,7 @@ static int raid1_spare_active(struct mddev *mddev) * device_lock used to avoid races with raid1_end_read_request * which expects 'In_sync' flags and ->degraded to be consistent. */ - spin_lock_irqsave(&conf->device_lock, flags); + spin_lock_irqsave(&conf->mddev->device_lock, flags); for (i =3D 0; i < conf->raid_disks; i++) { struct md_rdev *rdev =3D conf->mirrors[i].rdev; struct md_rdev *repl =3D conf->mirrors[conf->raid_disks + i].rdev; @@ -1863,7 +1863,7 @@ static int raid1_spare_active(struct mddev *mddev) } } mddev->degraded -=3D count; - spin_unlock_irqrestore(&conf->device_lock, flags); + spin_unlock_irqrestore(&conf->mddev->device_lock, flags); =20 print_conf(conf); return count; @@ -2605,11 +2605,11 @@ static void handle_write_finished(struct r1conf *co= nf, struct r1bio *r1_bio) conf->mddev); } if (fail) { - spin_lock_irq(&conf->device_lock); + spin_lock_irq(&conf->mddev->device_lock); list_add(&r1_bio->retry_list, &conf->bio_end_io_list); idx =3D sector_to_idx(r1_bio->sector); atomic_inc(&conf->nr_queued[idx]); - spin_unlock_irq(&conf->device_lock); + spin_unlock_irq(&conf->mddev->device_lock); /* * In case freeze_array() is waiting for condition * get_unqueued_pending() =3D=3D extra to be true. 
@@ -2681,10 +2681,10 @@ static void raid1d(struct md_thread *thread) if (!list_empty_careful(&conf->bio_end_io_list) && !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags)) { LIST_HEAD(tmp); - spin_lock_irqsave(&conf->device_lock, flags); + spin_lock_irqsave(&conf->mddev->device_lock, flags); if (!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags)) list_splice_init(&conf->bio_end_io_list, &tmp); - spin_unlock_irqrestore(&conf->device_lock, flags); + spin_unlock_irqrestore(&conf->mddev->device_lock, flags); while (!list_empty(&tmp)) { r1_bio =3D list_first_entry(&tmp, struct r1bio, retry_list); @@ -2702,16 +2702,16 @@ static void raid1d(struct md_thread *thread) =20 flush_pending_writes(conf); =20 - spin_lock_irqsave(&conf->device_lock, flags); + spin_lock_irqsave(&conf->mddev->device_lock, flags); if (list_empty(head)) { - spin_unlock_irqrestore(&conf->device_lock, flags); + spin_unlock_irqrestore(&conf->mddev->device_lock, flags); break; } r1_bio =3D list_entry(head->prev, struct r1bio, retry_list); list_del(head->prev); idx =3D sector_to_idx(r1_bio->sector); atomic_dec(&conf->nr_queued[idx]); - spin_unlock_irqrestore(&conf->device_lock, flags); + spin_unlock_irqrestore(&conf->mddev->device_lock, flags); =20 mddev =3D r1_bio->mddev; conf =3D mddev->private; @@ -3131,7 +3131,6 @@ static struct r1conf *setup_conf(struct mddev *mddev) goto abort; =20 err =3D -EINVAL; - spin_lock_init(&conf->device_lock); conf->raid_disks =3D mddev->raid_disks; rdev_for_each(rdev, mddev) { int disk_idx =3D rdev->raid_disk; @@ -3429,9 +3428,9 @@ static int raid1_reshape(struct mddev *mddev) kfree(conf->mirrors); conf->mirrors =3D newmirrors; =20 - spin_lock_irqsave(&conf->device_lock, flags); + spin_lock_irqsave(&conf->mddev->device_lock, flags); mddev->degraded +=3D (raid_disks - conf->raid_disks); - spin_unlock_irqrestore(&conf->device_lock, flags); + spin_unlock_irqrestore(&conf->mddev->device_lock, flags); conf->raid_disks =3D mddev->raid_disks =3D raid_disks; mddev->delta_disks =3D 0; 
=20 diff --git a/drivers/md/raid1.h b/drivers/md/raid1.h index 2ebe35aaa534..7af8e294e7ae 100644 --- a/drivers/md/raid1.h +++ b/drivers/md/raid1.h @@ -57,8 +57,6 @@ struct r1conf { int raid_disks; int nonrot_disks; =20 - spinlock_t device_lock; - /* list of 'struct r1bio' that need to be processed by raid1d, * whether to retry a read, writeout a resync or recovery * block, or anything else. diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c index 14dcd5142eb4..57c887070df3 100644 --- a/drivers/md/raid10.c +++ b/drivers/md/raid10.c @@ -301,10 +301,10 @@ static void reschedule_retry(struct r10bio *r10_bio) struct mddev *mddev =3D r10_bio->mddev; struct r10conf *conf =3D mddev->private; =20 - spin_lock_irqsave(&conf->device_lock, flags); + spin_lock_irqsave(&conf->mddev->device_lock, flags); list_add(&r10_bio->retry_list, &conf->retry_list); conf->nr_queued ++; - spin_unlock_irqrestore(&conf->device_lock, flags); + spin_unlock_irqrestore(&conf->mddev->device_lock, flags); =20 /* wake up frozen array... */ wake_up(&conf->wait_barrier); @@ -863,14 +863,14 @@ static void flush_pending_writes(struct r10conf *conf) /* Any writes that have been queued but are awaiting * bitmap updates get flushed here. */ - spin_lock_irq(&conf->device_lock); + spin_lock_irq(&conf->mddev->device_lock); =20 if (conf->pending_bio_list.head) { struct blk_plug plug; struct bio *bio; =20 bio =3D bio_list_get(&conf->pending_bio_list); - spin_unlock_irq(&conf->device_lock); + spin_unlock_irq(&conf->mddev->device_lock); =20 /* * As this is called in a wait_event() loop (see freeze_array), @@ -896,7 +896,7 @@ static void flush_pending_writes(struct r10conf *conf) } blk_finish_plug(&plug); } else - spin_unlock_irq(&conf->device_lock); + spin_unlock_irq(&conf->mddev->device_lock); } =20 /* Barriers.... 
@@ -1089,9 +1089,9 @@ static void raid10_unplug(struct blk_plug_cb *cb, boo= l from_schedule) struct bio *bio; =20 if (from_schedule) { - spin_lock_irq(&conf->device_lock); + spin_lock_irq(&conf->mddev->device_lock); bio_list_merge(&conf->pending_bio_list, &plug->pending); - spin_unlock_irq(&conf->device_lock); + spin_unlock_irq(&conf->mddev->device_lock); wake_up_barrier(conf); md_wakeup_thread(mddev->thread); kfree(plug); @@ -1278,9 +1278,9 @@ static void raid10_write_one_disk(struct mddev *mddev= , struct r10bio *r10_bio, atomic_inc(&r10_bio->remaining); =20 if (!raid1_add_bio_to_plug(mddev, mbio, raid10_unplug, conf->copies)) { - spin_lock_irqsave(&conf->device_lock, flags); + spin_lock_irqsave(&conf->mddev->device_lock, flags); bio_list_add(&conf->pending_bio_list, mbio); - spin_unlock_irqrestore(&conf->device_lock, flags); + spin_unlock_irqrestore(&conf->mddev->device_lock, flags); md_wakeup_thread(mddev->thread); } } @@ -1997,13 +1997,13 @@ static void raid10_error(struct mddev *mddev, struc= t md_rdev *rdev) struct r10conf *conf =3D mddev->private; unsigned long flags; =20 - spin_lock_irqsave(&conf->device_lock, flags); + spin_lock_irqsave(&conf->mddev->device_lock, flags); =20 if (test_bit(In_sync, &rdev->flags) && !enough(conf, rdev->raid_disk)) { set_bit(MD_BROKEN, &mddev->flags); =20 if (!mddev->fail_last_dev) { - spin_unlock_irqrestore(&conf->device_lock, flags); + spin_unlock_irqrestore(&conf->mddev->device_lock, flags); return; } } @@ -2015,7 +2015,7 @@ static void raid10_error(struct mddev *mddev, struct = md_rdev *rdev) set_bit(Faulty, &rdev->flags); set_mask_bits(&mddev->sb_flags, 0, BIT(MD_SB_CHANGE_DEVS) | BIT(MD_SB_CHANGE_PENDING)); - spin_unlock_irqrestore(&conf->device_lock, flags); + spin_unlock_irqrestore(&conf->mddev->device_lock, flags); pr_crit("md/raid10:%s: Disk failure on %pg, disabling device.\n" "md/raid10:%s: Operation continuing on %d devices.\n", mdname(mddev), rdev->bdev, @@ -2094,9 +2094,9 @@ static int 
raid10_spare_active(struct mddev *mddev) sysfs_notify_dirent_safe(tmp->rdev->sysfs_state); } } - spin_lock_irqsave(&conf->device_lock, flags); + spin_lock_irqsave(&conf->mddev->device_lock, flags); mddev->degraded -=3D count; - spin_unlock_irqrestore(&conf->device_lock, flags); + spin_unlock_irqrestore(&conf->mddev->device_lock, flags); =20 print_conf(conf); return count; @@ -2951,10 +2951,10 @@ static void handle_write_completed(struct r10conf *= conf, struct r10bio *r10_bio) } } if (fail) { - spin_lock_irq(&conf->device_lock); + spin_lock_irq(&conf->mddev->device_lock); list_add(&r10_bio->retry_list, &conf->bio_end_io_list); conf->nr_queued++; - spin_unlock_irq(&conf->device_lock); + spin_unlock_irq(&conf->mddev->device_lock); /* * In case freeze_array() is waiting for condition * nr_pending =3D=3D nr_queued + extra to be true. @@ -2984,14 +2984,14 @@ static void raid10d(struct md_thread *thread) if (!list_empty_careful(&conf->bio_end_io_list) && !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags)) { LIST_HEAD(tmp); - spin_lock_irqsave(&conf->device_lock, flags); + spin_lock_irqsave(&conf->mddev->device_lock, flags); if (!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags)) { while (!list_empty(&conf->bio_end_io_list)) { list_move(conf->bio_end_io_list.prev, &tmp); conf->nr_queued--; } } - spin_unlock_irqrestore(&conf->device_lock, flags); + spin_unlock_irqrestore(&conf->mddev->device_lock, flags); while (!list_empty(&tmp)) { r10_bio =3D list_first_entry(&tmp, struct r10bio, retry_list); @@ -3009,15 +3009,15 @@ static void raid10d(struct md_thread *thread) =20 flush_pending_writes(conf); =20 - spin_lock_irqsave(&conf->device_lock, flags); + spin_lock_irqsave(&conf->mddev->device_lock, flags); if (list_empty(head)) { - spin_unlock_irqrestore(&conf->device_lock, flags); + spin_unlock_irqrestore(&conf->mddev->device_lock, flags); break; } r10_bio =3D list_entry(head->prev, struct r10bio, retry_list); list_del(head->prev); conf->nr_queued--; - 
spin_unlock_irqrestore(&conf->device_lock, flags); + spin_unlock_irqrestore(&conf->mddev->device_lock, flags); =20 mddev =3D r10_bio->mddev; conf =3D mddev->private; @@ -3960,7 +3960,6 @@ static struct r10conf *setup_conf(struct mddev *mddev) conf->prev.stride =3D conf->dev_sectors; } conf->reshape_safe =3D conf->reshape_progress; - spin_lock_init(&conf->device_lock); INIT_LIST_HEAD(&conf->retry_list); INIT_LIST_HEAD(&conf->bio_end_io_list); =20 @@ -4467,7 +4466,7 @@ static int raid10_start_reshape(struct mddev *mddev) return -EINVAL; =20 conf->offset_diff =3D min_offset_diff; - spin_lock_irq(&conf->device_lock); + spin_lock_irq(&conf->mddev->device_lock); if (conf->mirrors_new) { memcpy(conf->mirrors_new, conf->mirrors, sizeof(struct raid10_info)*conf->prev.raid_disks); @@ -4482,7 +4481,7 @@ static int raid10_start_reshape(struct mddev *mddev) if (mddev->reshape_backwards) { sector_t size =3D raid10_size(mddev, 0, 0); if (size < mddev->array_sectors) { - spin_unlock_irq(&conf->device_lock); + spin_unlock_irq(&conf->mddev->device_lock); pr_warn("md/raid10:%s: array size must be reduce before number of disks= \n", mdname(mddev)); return -EINVAL; @@ -4492,7 +4491,7 @@ static int raid10_start_reshape(struct mddev *mddev) } else conf->reshape_progress =3D 0; conf->reshape_safe =3D conf->reshape_progress; - spin_unlock_irq(&conf->device_lock); + spin_unlock_irq(&conf->mddev->device_lock); =20 if (mddev->delta_disks && mddev->bitmap) { struct mdp_superblock_1 *sb =3D NULL; @@ -4561,9 +4560,9 @@ static int raid10_start_reshape(struct mddev *mddev) * ->degraded is measured against the larger of the * pre and post numbers. 
*/ - spin_lock_irq(&conf->device_lock); + spin_lock_irq(&conf->mddev->device_lock); mddev->degraded =3D calc_degraded(conf); - spin_unlock_irq(&conf->device_lock); + spin_unlock_irq(&conf->mddev->device_lock); mddev->raid_disks =3D conf->geo.raid_disks; mddev->reshape_position =3D conf->reshape_progress; set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags); @@ -4579,7 +4578,7 @@ static int raid10_start_reshape(struct mddev *mddev) =20 abort: mddev->recovery =3D 0; - spin_lock_irq(&conf->device_lock); + spin_lock_irq(&conf->mddev->device_lock); conf->geo =3D conf->prev; mddev->raid_disks =3D conf->geo.raid_disks; rdev_for_each(rdev, mddev) @@ -4588,7 +4587,7 @@ static int raid10_start_reshape(struct mddev *mddev) conf->reshape_progress =3D MaxSector; conf->reshape_safe =3D MaxSector; mddev->reshape_position =3D MaxSector; - spin_unlock_irq(&conf->device_lock); + spin_unlock_irq(&conf->mddev->device_lock); return ret; } =20 @@ -4947,13 +4946,13 @@ static void end_reshape(struct r10conf *conf) if (test_bit(MD_RECOVERY_INTR, &conf->mddev->recovery)) return; =20 - spin_lock_irq(&conf->device_lock); + spin_lock_irq(&conf->mddev->device_lock); conf->prev =3D conf->geo; md_finish_reshape(conf->mddev); smp_wmb(); conf->reshape_progress =3D MaxSector; conf->reshape_safe =3D MaxSector; - spin_unlock_irq(&conf->device_lock); + spin_unlock_irq(&conf->mddev->device_lock); =20 mddev_update_io_opt(conf->mddev, raid10_nr_stripes(conf)); conf->fullsync =3D 0; diff --git a/drivers/md/raid10.h b/drivers/md/raid10.h index da00a55f7a55..5f6c8b21ecd0 100644 --- a/drivers/md/raid10.h +++ b/drivers/md/raid10.h @@ -29,7 +29,6 @@ struct r10conf { struct mddev *mddev; struct raid10_info *mirrors; struct raid10_info *mirrors_new, *mirrors_old; - spinlock_t device_lock; =20 /* geometry */ struct geom { diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c index ba768ca7f422..177759b18c72 100644 --- a/drivers/md/raid5-cache.c +++ b/drivers/md/raid5-cache.c @@ -1356,7 +1356,7 @@ static void 
r5l_write_super_and_discard_space(struct = r5l_log *log, * r5c_flush_stripe moves stripe from cached list to handle_list. When cal= led, * the stripe must be on r5c_cached_full_stripes or r5c_cached_partial_str= ipes. * - * must hold conf->device_lock + * must hold conf->mddev->device_lock */ static void r5c_flush_stripe(struct r5conf *conf, struct stripe_head *sh) { @@ -1366,10 +1366,10 @@ static void r5c_flush_stripe(struct r5conf *conf, s= truct stripe_head *sh) =20 /* * The stripe is not ON_RELEASE_LIST, so it is safe to call - * raid5_release_stripe() while holding conf->device_lock + * raid5_release_stripe() while holding conf->mddev->device_lock */ BUG_ON(test_bit(STRIPE_ON_RELEASE_LIST, &sh->state)); - lockdep_assert_held(&conf->device_lock); + lockdep_assert_held(&conf->mddev->device_lock); =20 list_del_init(&sh->lru); atomic_inc(&sh->count); @@ -1396,7 +1396,7 @@ void r5c_flush_cache(struct r5conf *conf, int num) int count; struct stripe_head *sh, *next; =20 - lockdep_assert_held(&conf->device_lock); + lockdep_assert_held(&conf->mddev->device_lock); if (!READ_ONCE(conf->log)) return; =20 @@ -1455,15 +1455,15 @@ static void r5c_do_reclaim(struct r5conf *conf) stripes_to_flush =3D -1; =20 if (stripes_to_flush >=3D 0) { - spin_lock_irqsave(&conf->device_lock, flags); + spin_lock_irqsave(&conf->mddev->device_lock, flags); r5c_flush_cache(conf, stripes_to_flush); - spin_unlock_irqrestore(&conf->device_lock, flags); + spin_unlock_irqrestore(&conf->mddev->device_lock, flags); } =20 /* if log space is tight, flush stripes on stripe_in_journal_list */ if (test_bit(R5C_LOG_TIGHT, &conf->cache_state)) { spin_lock_irqsave(&log->stripe_in_journal_lock, flags); - spin_lock(&conf->device_lock); + spin_lock(&conf->mddev->device_lock); list_for_each_entry(sh, &log->stripe_in_journal_list, r5c) { /* * stripes on stripe_in_journal_list could be in any @@ -1481,7 +1481,7 @@ static void r5c_do_reclaim(struct r5conf *conf) break; } } - spin_unlock(&conf->device_lock); + 
spin_unlock(&conf->mddev->device_lock); spin_unlock_irqrestore(&log->stripe_in_journal_lock, flags); } =20 diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index 24b32a0c95b4..3350dcf9cab6 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -83,34 +83,34 @@ static inline int stripe_hash_locks_hash(struct r5conf = *conf, sector_t sect) } =20 static inline void lock_device_hash_lock(struct r5conf *conf, int hash) - __acquires(&conf->device_lock) + __acquires(&conf->mddev->device_lock) { spin_lock_irq(conf->hash_locks + hash); - spin_lock(&conf->device_lock); + spin_lock(&conf->mddev->device_lock); } =20 static inline void unlock_device_hash_lock(struct r5conf *conf, int hash) - __releases(&conf->device_lock) + __releases(&conf->mddev->device_lock) { - spin_unlock(&conf->device_lock); + spin_unlock(&conf->mddev->device_lock); spin_unlock_irq(conf->hash_locks + hash); } =20 static inline void lock_all_device_hash_locks_irq(struct r5conf *conf) - __acquires(&conf->device_lock) + __acquires(&conf->mddev->device_lock) { int i; spin_lock_irq(conf->hash_locks); for (i =3D 1; i < NR_STRIPE_HASH_LOCKS; i++) spin_lock_nest_lock(conf->hash_locks + i, conf->hash_locks); - spin_lock(&conf->device_lock); + spin_lock(&conf->mddev->device_lock); } =20 static inline void unlock_all_device_hash_locks_irq(struct r5conf *conf) - __releases(&conf->device_lock) + __releases(&conf->mddev->device_lock) { int i; - spin_unlock(&conf->device_lock); + spin_unlock(&conf->mddev->device_lock); for (i =3D NR_STRIPE_HASH_LOCKS - 1; i; i--) spin_unlock(conf->hash_locks + i); spin_unlock_irq(conf->hash_locks); @@ -172,7 +172,7 @@ static bool stripe_is_lowprio(struct stripe_head *sh) } =20 static void raid5_wakeup_stripe_thread(struct stripe_head *sh) - __must_hold(&sh->raid_conf->device_lock) + __must_hold(&sh->raid_conf->mddev->device_lock) { struct r5conf *conf =3D sh->raid_conf; struct r5worker_group *group; @@ -220,7 +220,7 @@ static void raid5_wakeup_stripe_thread(struct stripe_he= ad 
*sh) =20 static void do_release_stripe(struct r5conf *conf, struct stripe_head *sh, struct list_head *temp_inactive_list) - __must_hold(&conf->device_lock) + __must_hold(&conf->mddev->device_lock) { int i; int injournal =3D 0; /* number of date pages with R5_InJournal */ @@ -306,7 +306,7 @@ static void do_release_stripe(struct r5conf *conf, stru= ct stripe_head *sh, =20 static void __release_stripe(struct r5conf *conf, struct stripe_head *sh, struct list_head *temp_inactive_list) - __must_hold(&conf->device_lock) + __must_hold(&conf->mddev->device_lock) { if (atomic_dec_and_test(&sh->count)) do_release_stripe(conf, sh, temp_inactive_list); @@ -363,7 +363,7 @@ static void release_inactive_stripe_list(struct r5conf = *conf, =20 static int release_stripe_list(struct r5conf *conf, struct list_head *temp_inactive_list) - __must_hold(&conf->device_lock) + __must_hold(&conf->mddev->device_lock) { struct stripe_head *sh, *t; int count =3D 0; @@ -412,11 +412,11 @@ void raid5_release_stripe(struct stripe_head *sh) return; slow_path: /* we are ok here if STRIPE_ON_RELEASE_LIST is set or not */ - if (atomic_dec_and_lock_irqsave(&sh->count, &conf->device_lock, flags)) { + if (atomic_dec_and_lock_irqsave(&sh->count, &conf->mddev->device_lock, fl= ags)) { INIT_LIST_HEAD(&list); hash =3D sh->hash_lock_index; do_release_stripe(conf, sh, &list); - spin_unlock_irqrestore(&conf->device_lock, flags); + spin_unlock_irqrestore(&conf->mddev->device_lock, flags); release_inactive_stripe_list(conf, &list, hash); } } @@ -647,7 +647,7 @@ static struct stripe_head *find_get_stripe(struct r5con= f *conf, * references it with the device_lock held. 
*/ =20 - spin_lock(&conf->device_lock); + spin_lock(&conf->mddev->device_lock); if (!atomic_read(&sh->count)) { if (!test_bit(STRIPE_HANDLE, &sh->state)) atomic_inc(&conf->active_stripes); @@ -666,7 +666,7 @@ static struct stripe_head *find_get_stripe(struct r5con= f *conf, } } atomic_inc(&sh->count); - spin_unlock(&conf->device_lock); + spin_unlock(&conf->mddev->device_lock); =20 return sh; } @@ -684,7 +684,7 @@ static struct stripe_head *find_get_stripe(struct r5con= f *conf, * of the two sections, and some non-in_sync devices may * be insync in the section most affected by failed devices. * - * Most calls to this function hold &conf->device_lock. Calls + * Most calls to this function hold &conf->mddev->device_lock. Calls * in raid5_run() do not require the lock as no other threads * have been started yet. */ @@ -2913,7 +2913,7 @@ static void raid5_error(struct mddev *mddev, struct m= d_rdev *rdev) pr_crit("md/raid:%s: Disk failure on %pg, disabling device.\n", mdname(mddev), rdev->bdev); =20 - spin_lock_irqsave(&conf->device_lock, flags); + spin_lock_irqsave(&conf->mddev->device_lock, flags); set_bit(Faulty, &rdev->flags); clear_bit(In_sync, &rdev->flags); mddev->degraded =3D raid5_calc_degraded(conf); @@ -2929,7 +2929,7 @@ static void raid5_error(struct mddev *mddev, struct m= d_rdev *rdev) mdname(mddev), conf->raid_disks - mddev->degraded); } =20 - spin_unlock_irqrestore(&conf->device_lock, flags); + spin_unlock_irqrestore(&conf->mddev->device_lock, flags); set_bit(MD_RECOVERY_INTR, &mddev->recovery); =20 set_bit(Blocked, &rdev->flags); @@ -5294,7 +5294,7 @@ static void handle_stripe(struct stripe_head *sh) } =20 static void raid5_activate_delayed(struct r5conf *conf) - __must_hold(&conf->device_lock) + __must_hold(&conf->mddev->device_lock) { if (atomic_read(&conf->preread_active_stripes) < IO_THRESHOLD) { while (!list_empty(&conf->delayed_list)) { @@ -5313,7 +5313,7 @@ static void raid5_activate_delayed(struct r5conf *con= f) =20 static void 
activate_bit_delay(struct r5conf *conf, struct list_head *temp_inactive_list) - __must_hold(&conf->device_lock) + __must_hold(&conf->mddev->device_lock) { struct list_head head; list_add(&head, &conf->bitmap_list); @@ -5348,12 +5348,12 @@ static void add_bio_to_retry(struct bio *bi,struct = r5conf *conf) { unsigned long flags; =20 - spin_lock_irqsave(&conf->device_lock, flags); + spin_lock_irqsave(&conf->mddev->device_lock, flags); =20 bi->bi_next =3D conf->retry_read_aligned_list; conf->retry_read_aligned_list =3D bi; =20 - spin_unlock_irqrestore(&conf->device_lock, flags); + spin_unlock_irqrestore(&conf->mddev->device_lock, flags); md_wakeup_thread(conf->mddev->thread); } =20 @@ -5472,11 +5472,11 @@ static int raid5_read_one_chunk(struct mddev *mddev= , struct bio *raid_bio) */ if (did_inc && atomic_dec_and_test(&conf->active_aligned_reads)) wake_up(&conf->wait_for_quiescent); - spin_lock_irq(&conf->device_lock); + spin_lock_irq(&conf->mddev->device_lock); wait_event_lock_irq(conf->wait_for_quiescent, conf->quiesce =3D=3D 0, - conf->device_lock); + conf->mddev->device_lock); atomic_inc(&conf->active_aligned_reads); - spin_unlock_irq(&conf->device_lock); + spin_unlock_irq(&conf->mddev->device_lock); } =20 mddev_trace_remap(mddev, align_bio, raid_bio->bi_iter.bi_sector); @@ -5516,7 +5516,7 @@ static struct bio *chunk_aligned_read(struct mddev *m= ddev, struct bio *raid_bio) * handle_list. 
*/ static struct stripe_head *__get_priority_stripe(struct r5conf *conf, int = group) - __must_hold(&conf->device_lock) + __must_hold(&conf->mddev->device_lock) { struct stripe_head *sh, *tmp; struct list_head *handle_list =3D NULL; @@ -5625,7 +5625,7 @@ static void raid5_unplug(struct blk_plug_cb *blk_cb, = bool from_schedule) int hash; =20 if (cb->list.next && !list_empty(&cb->list)) { - spin_lock_irq(&conf->device_lock); + spin_lock_irq(&conf->mddev->device_lock); while (!list_empty(&cb->list)) { sh =3D list_first_entry(&cb->list, struct stripe_head, lru); list_del_init(&sh->lru); @@ -5644,7 +5644,7 @@ static void raid5_unplug(struct blk_plug_cb *blk_cb, = bool from_schedule) __release_stripe(conf, sh, &cb->temp_inactive_list[hash]); cnt++; } - spin_unlock_irq(&conf->device_lock); + spin_unlock_irq(&conf->mddev->device_lock); } release_inactive_stripe_list(conf, cb->temp_inactive_list, NR_STRIPE_HASH_LOCKS); @@ -5793,14 +5793,14 @@ static bool stripe_ahead_of_reshape(struct mddev *m= ddev, struct r5conf *conf, max_sector =3D max(max_sector, sh->dev[dd_idx].sector); } =20 - spin_lock_irq(&conf->device_lock); + spin_lock_irq(&conf->mddev->device_lock); =20 if (!range_ahead_of_reshape(mddev, min_sector, max_sector, conf->reshape_progress)) /* mismatch, need to try again */ ret =3D true; =20 - spin_unlock_irq(&conf->device_lock); + spin_unlock_irq(&conf->mddev->device_lock); =20 return ret; } @@ -5880,10 +5880,10 @@ static enum reshape_loc get_reshape_loc(struct mdde= v *mddev, * to the stripe that we think it is, we will have * to check again. 
*/ - spin_lock_irq(&conf->device_lock); + spin_lock_irq(&conf->mddev->device_lock); reshape_progress =3D conf->reshape_progress; reshape_safe =3D conf->reshape_safe; - spin_unlock_irq(&conf->device_lock); + spin_unlock_irq(&conf->mddev->device_lock); if (reshape_progress =3D=3D MaxSector) return LOC_NO_RESHAPE; if (ahead_of_reshape(mddev, logical_sector, reshape_progress)) @@ -6373,9 +6373,9 @@ static sector_t reshape_request(struct mddev *mddev, = sector_t sector_nr, int *sk test_bit(MD_RECOVERY_INTR, &mddev->recovery)); if (test_bit(MD_RECOVERY_INTR, &mddev->recovery)) return 0; - spin_lock_irq(&conf->device_lock); + spin_lock_irq(&conf->mddev->device_lock); conf->reshape_safe =3D mddev->reshape_position; - spin_unlock_irq(&conf->device_lock); + spin_unlock_irq(&conf->mddev->device_lock); wake_up(&conf->wait_for_reshape); sysfs_notify_dirent_safe(mddev->sysfs_completed); } @@ -6413,12 +6413,12 @@ static sector_t reshape_request(struct mddev *mddev= , sector_t sector_nr, int *sk } list_add(&sh->lru, &stripes); } - spin_lock_irq(&conf->device_lock); + spin_lock_irq(&conf->mddev->device_lock); if (mddev->reshape_backwards) conf->reshape_progress -=3D reshape_sectors * new_data_disks; else conf->reshape_progress +=3D reshape_sectors * new_data_disks; - spin_unlock_irq(&conf->device_lock); + spin_unlock_irq(&conf->mddev->device_lock); /* Ok, those stripe are ready. We can start scheduling * reads on the source stripes. 
 	 * The source stripes are determined by mapping the first and last
@@ -6482,9 +6482,9 @@ static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr, int *sk
 			   || test_bit(MD_RECOVERY_INTR, &mddev->recovery));
 		if (test_bit(MD_RECOVERY_INTR, &mddev->recovery))
 			goto ret;
-		spin_lock_irq(&conf->device_lock);
+		spin_lock_irq(&conf->mddev->device_lock);
 		conf->reshape_safe = mddev->reshape_position;
-		spin_unlock_irq(&conf->device_lock);
+		spin_unlock_irq(&conf->mddev->device_lock);
 		wake_up(&conf->wait_for_reshape);
 		sysfs_notify_dirent_safe(mddev->sysfs_completed);
 	}
@@ -6651,7 +6651,7 @@ static int retry_aligned_read(struct r5conf *conf, struct bio *raid_bio,
 static int handle_active_stripes(struct r5conf *conf, int group,
 				 struct r5worker *worker,
 				 struct list_head *temp_inactive_list)
-	__must_hold(&conf->device_lock)
+	__must_hold(&conf->mddev->device_lock)
 {
 	struct stripe_head *batch[MAX_STRIPE_BATCH], *sh;
 	int i, batch_size = 0, hash;
@@ -6666,21 +6666,21 @@ static int handle_active_stripes(struct r5conf *conf, int group,
 			if (!list_empty(temp_inactive_list + i))
 				break;
 		if (i == NR_STRIPE_HASH_LOCKS) {
-			spin_unlock_irq(&conf->device_lock);
+			spin_unlock_irq(&conf->mddev->device_lock);
 			log_flush_stripe_to_raid(conf);
-			spin_lock_irq(&conf->device_lock);
+			spin_lock_irq(&conf->mddev->device_lock);
 			return batch_size;
 		}
 		release_inactive = true;
 	}
-	spin_unlock_irq(&conf->device_lock);
+	spin_unlock_irq(&conf->mddev->device_lock);
 
 	release_inactive_stripe_list(conf, temp_inactive_list,
 				     NR_STRIPE_HASH_LOCKS);
 
 	r5l_flush_stripe_to_raid(conf->log);
 	if (release_inactive) {
-		spin_lock_irq(&conf->device_lock);
+		spin_lock_irq(&conf->mddev->device_lock);
 		return 0;
 	}
 
@@ -6690,7 +6690,7 @@ static int handle_active_stripes(struct r5conf *conf, int group,
 
 	cond_resched();
 
-	spin_lock_irq(&conf->device_lock);
+	spin_lock_irq(&conf->mddev->device_lock);
 	for (i = 0; i < batch_size; i++) {
 		hash = batch[i]->hash_lock_index;
 		__release_stripe(conf, batch[i], &temp_inactive_list[hash]);
@@ -6712,7 +6712,7 @@ static void raid5_do_work(struct work_struct *work)
 
 	blk_start_plug(&plug);
 	handled = 0;
-	spin_lock_irq(&conf->device_lock);
+	spin_lock_irq(&conf->mddev->device_lock);
 	while (1) {
 		int batch_size, released;
 
@@ -6726,11 +6726,11 @@ static void raid5_do_work(struct work_struct *work)
 		handled += batch_size;
 		wait_event_lock_irq(mddev->sb_wait,
 			!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags),
-			conf->device_lock);
+			conf->mddev->device_lock);
 	}
 	pr_debug("%d stripes handled\n", handled);
 
-	spin_unlock_irq(&conf->device_lock);
+	spin_unlock_irq(&conf->mddev->device_lock);
 
 	flush_deferred_bios(conf);
 
@@ -6762,7 +6762,7 @@ static void raid5d(struct md_thread *thread)
 
 	blk_start_plug(&plug);
 	handled = 0;
-	spin_lock_irq(&conf->device_lock);
+	spin_lock_irq(&conf->mddev->device_lock);
 	while (1) {
 		struct bio *bio;
 		int batch_size, released;
@@ -6779,10 +6779,10 @@ static void raid5d(struct md_thread *thread)
 		    !list_empty(&conf->bitmap_list)) {
 			/* Now is a good time to flush some bitmap updates */
 			conf->seq_flush++;
-			spin_unlock_irq(&conf->device_lock);
+			spin_unlock_irq(&conf->mddev->device_lock);
 			if (md_bitmap_enabled(mddev, true))
 				mddev->bitmap_ops->unplug(mddev, true);
-			spin_lock_irq(&conf->device_lock);
+			spin_lock_irq(&conf->mddev->device_lock);
 			conf->seq_write = conf->seq_flush;
 			activate_bit_delay(conf, conf->temp_inactive_list);
 		}
@@ -6790,9 +6790,9 @@ static void raid5d(struct md_thread *thread)
 
 		while ((bio = remove_bio_from_retry(conf, &offset))) {
 			int ok;
-			spin_unlock_irq(&conf->device_lock);
+			spin_unlock_irq(&conf->mddev->device_lock);
 			ok = retry_aligned_read(conf, bio, offset);
-			spin_lock_irq(&conf->device_lock);
+			spin_lock_irq(&conf->mddev->device_lock);
 			if (!ok)
 				break;
 			handled++;
@@ -6805,14 +6805,14 @@ static void raid5d(struct md_thread *thread)
 		handled += batch_size;
 
 		if (mddev->sb_flags & ~(1 << MD_SB_CHANGE_PENDING)) {
-			spin_unlock_irq(&conf->device_lock);
+			spin_unlock_irq(&conf->mddev->device_lock);
 			md_check_recovery(mddev);
-			spin_lock_irq(&conf->device_lock);
+			spin_lock_irq(&conf->mddev->device_lock);
 		}
 	}
 	pr_debug("%d stripes handled\n", handled);
 
-	spin_unlock_irq(&conf->device_lock);
+	spin_unlock_irq(&conf->mddev->device_lock);
 	if (test_and_clear_bit(R5_ALLOC_MORE, &conf->cache_state) &&
 	    mutex_trylock(&conf->cache_size_mutex)) {
 		grow_one_stripe(conf, __GFP_NOWARN);
@@ -7197,11 +7197,11 @@ raid5_store_group_thread_cnt(struct mddev *mddev, const char *page, size_t len)
 
 	err = alloc_thread_groups(conf, new, &group_cnt, &new_groups);
 	if (!err) {
-		spin_lock_irq(&conf->device_lock);
+		spin_lock_irq(&conf->mddev->device_lock);
 		conf->group_cnt = group_cnt;
 		conf->worker_cnt_per_group = new;
 		conf->worker_groups = new_groups;
-		spin_unlock_irq(&conf->device_lock);
+		spin_unlock_irq(&conf->mddev->device_lock);
 
 		if (old_groups)
 			kfree(old_groups[0].workers);
@@ -7504,8 +7504,7 @@ static struct r5conf *setup_conf(struct mddev *mddev)
 		conf->worker_groups = new_group;
 	} else
 		goto abort;
-	spin_lock_init(&conf->device_lock);
-	seqcount_spinlock_init(&conf->gen_lock, &conf->device_lock);
+	seqcount_spinlock_init(&conf->gen_lock, &conf->mddev->device_lock);
 	mutex_init(&conf->cache_size_mutex);
 
 	init_waitqueue_head(&conf->wait_for_quiescent);
@@ -8151,9 +8150,9 @@ static int raid5_spare_active(struct mddev *mddev)
 			sysfs_notify_dirent_safe(rdev->sysfs_state);
 		}
 	}
-	spin_lock_irqsave(&conf->device_lock, flags);
+	spin_lock_irqsave(&conf->mddev->device_lock, flags);
 	mddev->degraded = raid5_calc_degraded(conf);
-	spin_unlock_irqrestore(&conf->device_lock, flags);
+	spin_unlock_irqrestore(&conf->mddev->device_lock, flags);
 	print_raid5_conf(conf);
 	return count;
 }
@@ -8474,7 +8473,7 @@ static int raid5_start_reshape(struct mddev *mddev)
 	}
 
 	atomic_set(&conf->reshape_stripes, 0);
-	spin_lock_irq(&conf->device_lock);
+	spin_lock_irq(&conf->mddev->device_lock);
 	write_seqcount_begin(&conf->gen_lock);
 	conf->previous_raid_disks = conf->raid_disks;
 	conf->raid_disks += mddev->delta_disks;
@@ -8493,7 +8492,7 @@ static int raid5_start_reshape(struct mddev *mddev)
 		conf->reshape_progress = 0;
 	conf->reshape_safe = conf->reshape_progress;
 	write_seqcount_end(&conf->gen_lock);
-	spin_unlock_irq(&conf->device_lock);
+	spin_unlock_irq(&conf->mddev->device_lock);
 
 	/* Now make sure any requests that proceeded on the assumption
 	 * the reshape wasn't running - like Discard or Read - have
@@ -8533,9 +8532,9 @@ static int raid5_start_reshape(struct mddev *mddev)
 		 * ->degraded is measured against the larger of the
 		 * pre and post number of devices.
 		 */
-		spin_lock_irqsave(&conf->device_lock, flags);
+		spin_lock_irqsave(&conf->mddev->device_lock, flags);
 		mddev->degraded = raid5_calc_degraded(conf);
-		spin_unlock_irqrestore(&conf->device_lock, flags);
+		spin_unlock_irqrestore(&conf->mddev->device_lock, flags);
 	}
 	mddev->raid_disks = conf->raid_disks;
 	mddev->reshape_position = conf->reshape_progress;
@@ -8560,7 +8559,7 @@ static void end_reshape(struct r5conf *conf)
 	if (!test_bit(MD_RECOVERY_INTR, &conf->mddev->recovery)) {
 		struct md_rdev *rdev;
 
-		spin_lock_irq(&conf->device_lock);
+		spin_lock_irq(&conf->mddev->device_lock);
 		conf->previous_raid_disks = conf->raid_disks;
 		md_finish_reshape(conf->mddev);
 		smp_wmb();
@@ -8571,7 +8570,7 @@ static void end_reshape(struct r5conf *conf)
 			    !test_bit(Journal, &rdev->flags) &&
 			    !test_bit(In_sync, &rdev->flags))
 				rdev->recovery_offset = MaxSector;
-		spin_unlock_irq(&conf->device_lock);
+		spin_unlock_irq(&conf->mddev->device_lock);
 		wake_up(&conf->wait_for_reshape);
 
 		mddev_update_io_opt(conf->mddev,
@@ -8591,9 +8590,9 @@ static void raid5_finish_reshape(struct mddev *mddev)
 
 	if (mddev->delta_disks <= 0) {
 		int d;
-		spin_lock_irq(&conf->device_lock);
+		spin_lock_irq(&conf->mddev->device_lock);
 		mddev->degraded = raid5_calc_degraded(conf);
-		spin_unlock_irq(&conf->device_lock);
+		spin_unlock_irq(&conf->mddev->device_lock);
 		for (d = conf->raid_disks ;
 		     d < conf->raid_disks - mddev->delta_disks;
 		     d++) {
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index eafc6e9ed6ee..8ec60e06dc05 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -668,7 +668,6 @@ struct r5conf {
 	unsigned long		cache_state;
 	struct shrinker		*shrinker;
 	int			pool_size; /* number of disks in stripeheads in pool */
-	spinlock_t		device_lock;
 	struct disk_info	*disks;
 	struct bio_set		bio_split;
 
-- 
2.50.1

From: Kenta Akagi
Subject: [PATCH v5 02/16] md: serialize md_error()
Date: Tue, 28 Oct 2025 00:04:19 +0900
Message-ID: <20251027150433.18193-3-k@mgml.me>

Serialize the md_error() function in preparation for the introduction
of a conditional md_error() in a subsequent commit. The conditional
md_error() is intended to prevent unintentional setting of MD_BROKEN
during RAID1/10 failfast handling.
To enhance failfast bio error handling, the error handler must verify
that the affected rdev is not the last working device before marking
it Faulty. Without serialization, a race can occur when multiple
failfast bios invoke the error handler concurrently:

failfast bio1                   failfast bio2
---                             ---
md_cond_error(md,rdev1,bio)     md_cond_error(md,rdev2,bio)
if(!is_degraded(md))            if(!is_degraded(md))
raid1_error(md,rdev1)           raid1_error(md,rdev2)
spin_lock(md)
set_faulty(rdev1)
spin_unlock(md)
                                spin_lock(md)
                                set_faulty(rdev2)
                                set_broken(md)
                                spin_unlock(md)

This can unintentionally stop the array in situations where the 'Last'
rdev should not be marked Faulty.

This commit serializes md_error() for all RAID personalities to avoid
this race. Future commits will introduce a conditional md_error()
specifically for failfast bio handling. Serialization is applied to
both the standard and the conditional md_error() for the following
reasons:

- Both use the same error-handling mechanism, so it is clearer to
  serialize them consistently.
- The md_error() path is cold, so serialization has no performance
  impact.
Signed-off-by: Kenta Akagi
---
 drivers/md/md-linear.c |  1 +
 drivers/md/md.c        | 10 +++++++++-
 drivers/md/md.h        |  1 +
 drivers/md/raid0.c     |  1 +
 drivers/md/raid1.c     |  6 +-----
 drivers/md/raid10.c    |  9 ++-------
 drivers/md/raid5.c     |  4 +---
 7 files changed, 16 insertions(+), 16 deletions(-)

diff --git a/drivers/md/md-linear.c b/drivers/md/md-linear.c
index 7033d982d377..0f6893e4b9f5 100644
--- a/drivers/md/md-linear.c
+++ b/drivers/md/md-linear.c
@@ -298,6 +298,7 @@ static void linear_status(struct seq_file *seq, struct mddev *mddev)
 }
 
 static void linear_error(struct mddev *mddev, struct md_rdev *rdev)
+	__must_hold(&mddev->device_lock)
 {
 	if (!test_and_set_bit(MD_BROKEN, &mddev->flags)) {
 		char *md_name = mdname(mddev);
diff --git a/drivers/md/md.c b/drivers/md/md.c
index d667580e3125..4ad9cb0ac98c 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -8444,7 +8444,8 @@ void md_unregister_thread(struct mddev *mddev, struct md_thread __rcu **threadp)
 }
 EXPORT_SYMBOL(md_unregister_thread);
 
-void md_error(struct mddev *mddev, struct md_rdev *rdev)
+void _md_error(struct mddev *mddev, struct md_rdev *rdev)
+	__must_hold(&mddev->device_lock)
 {
 	if (!rdev || test_bit(Faulty, &rdev->flags))
 		return;
@@ -8469,6 +8470,13 @@ void md_error(struct mddev *mddev, struct md_rdev *rdev)
 		queue_work(md_misc_wq, &mddev->event_work);
 	md_new_event();
 }
+
+void md_error(struct mddev *mddev, struct md_rdev *rdev)
+{
+	spin_lock(&mddev->device_lock);
+	_md_error(mddev, rdev);
+	spin_unlock(&mddev->device_lock);
+}
 EXPORT_SYMBOL(md_error);
 
 /* seq_file implementation /proc/mdstat */
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 64ac22edf372..c982598cbf97 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -913,6 +913,7 @@ extern void md_write_start(struct mddev *mddev, struct bio *bi);
 extern void md_write_inc(struct mddev *mddev, struct bio *bi);
 extern void md_write_end(struct mddev *mddev);
 extern void md_done_sync(struct mddev *mddev, int blocks, int ok);
+void _md_error(struct mddev *mddev, struct md_rdev *rdev);
 extern void md_error(struct mddev *mddev, struct md_rdev *rdev);
 extern void md_finish_reshape(struct mddev *mddev);
 void md_submit_discard_bio(struct mddev *mddev, struct md_rdev *rdev,
diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index e443e478645a..8cf3caf9defd 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -625,6 +625,7 @@ static void raid0_status(struct seq_file *seq, struct mddev *mddev)
 }
 
 static void raid0_error(struct mddev *mddev, struct md_rdev *rdev)
+	__must_hold(&mddev->device_lock)
 {
 	if (!test_and_set_bit(MD_BROKEN, &mddev->flags)) {
 		char *md_name = mdname(mddev);
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 7924d5ee189d..202e510f73a4 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1749,11 +1749,9 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev)
  * &mddev->fail_last_dev is off.
  */
 static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
+	__must_hold(&mddev->device_lock)
 {
 	struct r1conf *conf = mddev->private;
-	unsigned long flags;
-
-	spin_lock_irqsave(&conf->mddev->device_lock, flags);
 
 	if (test_bit(In_sync, &rdev->flags) &&
 	    (conf->raid_disks - mddev->degraded) == 1) {
@@ -1761,7 +1759,6 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
 
 		if (!mddev->fail_last_dev) {
 			conf->recovery_disabled = mddev->recovery_disabled;
-			spin_unlock_irqrestore(&conf->mddev->device_lock, flags);
 			return;
 		}
 	}
@@ -1769,7 +1766,6 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
 	if (test_and_clear_bit(In_sync, &rdev->flags))
 		mddev->degraded++;
 	set_bit(Faulty, &rdev->flags);
-	spin_unlock_irqrestore(&conf->mddev->device_lock, flags);
 	/*
 	 * if recovery is running, make sure it aborts.
 	 */
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 57c887070df3..25c0ab09807b 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1993,19 +1993,15 @@ static int enough(struct r10conf *conf, int ignore)
  * &mddev->fail_last_dev is off.
  */
 static void raid10_error(struct mddev *mddev, struct md_rdev *rdev)
+	__must_hold(&mddev->device_lock)
 {
 	struct r10conf *conf = mddev->private;
-	unsigned long flags;
-
-	spin_lock_irqsave(&conf->mddev->device_lock, flags);
 
 	if (test_bit(In_sync, &rdev->flags) &&
 	    !enough(conf, rdev->raid_disk)) {
 		set_bit(MD_BROKEN, &mddev->flags);
 
-		if (!mddev->fail_last_dev) {
-			spin_unlock_irqrestore(&conf->mddev->device_lock, flags);
+		if (!mddev->fail_last_dev)
 			return;
-		}
 	}
 	if (test_and_clear_bit(In_sync, &rdev->flags))
 		mddev->degraded++;
@@ -2015,7 +2011,6 @@ static void raid10_error(struct mddev *mddev, struct md_rdev *rdev)
 	set_bit(Faulty, &rdev->flags);
 	set_mask_bits(&mddev->sb_flags, 0,
 		      BIT(MD_SB_CHANGE_DEVS) | BIT(MD_SB_CHANGE_PENDING));
-	spin_unlock_irqrestore(&conf->mddev->device_lock, flags);
 	pr_crit("md/raid10:%s: Disk failure on %pg, disabling device.\n"
 		"md/raid10:%s: Operation continuing on %d devices.\n",
 		mdname(mddev), rdev->bdev,
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 3350dcf9cab6..d1372b1bc405 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2905,15 +2905,14 @@ static void raid5_end_write_request(struct bio *bi)
 }
 
 static void raid5_error(struct mddev *mddev, struct md_rdev *rdev)
+	__must_hold(&mddev->device_lock)
 {
 	struct r5conf *conf = mddev->private;
-	unsigned long flags;
 	pr_debug("raid456: error called\n");
 
 	pr_crit("md/raid:%s: Disk failure on %pg, disabling device.\n",
 		mdname(mddev), rdev->bdev);
 
-	spin_lock_irqsave(&conf->mddev->device_lock, flags);
 	set_bit(Faulty, &rdev->flags);
 	clear_bit(In_sync, &rdev->flags);
 	mddev->degraded = raid5_calc_degraded(conf);
@@ -2929,7 +2928,6 @@ static void raid5_error(struct mddev *mddev, struct md_rdev *rdev)
 		mdname(mddev), conf->raid_disks - mddev->degraded);
 	}
 
-	spin_unlock_irqrestore(&conf->mddev->device_lock, flags);
 	set_bit(MD_RECOVERY_INTR, &mddev->recovery);
 
 	set_bit(Blocked, &rdev->flags);
-- 
2.50.1

From: Kenta Akagi
Subject: [PATCH v5 03/16] md: add pers->should_error() callback
Date: Tue, 28 Oct 2025 00:04:20 +0900
Message-ID: <20251027150433.18193-4-k@mgml.me>

The failfast feature in RAID1 and RAID10 assumes that when md_error()
is called, the array remains functional because the last rdev is
neither failed nor is MD_BROKEN set. However, the current
implementation can cause the array to lose its last in-sync device or
be marked as MD_BROKEN, which breaks this assumption and can lead to
array failure. To address this issue, a new handler, md_cond_error(),
will be introduced to ensure that failfast I/O does not mark the array
as broken.
As preparation, this commit adds a helper, pers->should_error(), to
determine from outside the personality whether an rdev can fail
safely; it is needed by md_cond_error().

Signed-off-by: Kenta Akagi
---
 drivers/md/md.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/md/md.h b/drivers/md/md.h
index c982598cbf97..01c8182431d1 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -763,6 +763,7 @@ struct md_personality
 	 * if appropriate, and should abort recovery if needed
 	 */
 	void (*error_handler)(struct mddev *mddev, struct md_rdev *rdev);
+	bool (*should_error)(struct mddev *mddev, struct md_rdev *rdev, struct bio *bio);
 	int (*hot_add_disk) (struct mddev *mddev, struct md_rdev *rdev);
 	int (*hot_remove_disk) (struct mddev *mddev, struct md_rdev *rdev);
 	int (*spare_active) (struct mddev *mddev);
-- 
2.50.1

From: Kenta Akagi
Subject: [PATCH v5 04/16] md: introduce md_cond_error()
Date: Tue, 28 Oct 2025 00:04:21 +0900
Message-ID: <20251027150433.18193-5-k@mgml.me>

The failfast feature in RAID1 and RAID10 assumes that when md_error()
is called, the array remains functional because the last rdev is
neither failed nor is MD_BROKEN set. However, the current
implementation can cause the array to lose its last in-sync device or
be marked as MD_BROKEN, which breaks this assumption and can lead to
array failure.

To address this issue, introduce md_cond_error(), which handles
failfast bio errors without stopping the array. This function checks
whether the array would become inoperable if an rdev failed, and if
so, it skips error handling so that the array remains operational.
Callers of md_error() will be updated to use this new function in
subsequent commits to properly handle failfast scenarios.

Signed-off-by: Kenta Akagi
---
 drivers/md/md.c | 33 +++++++++++++++++++++++++++++++++
 drivers/md/md.h |  1 +
 2 files changed, 34 insertions(+)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 4ad9cb0ac98c..e33ab564f26b 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -8479,6 +8479,39 @@ void md_error(struct mddev *mddev, struct md_rdev *rdev)
 }
 EXPORT_SYMBOL(md_error);
 
+/**
+ * md_cond_error() - conditionally call md_error()
+ * @mddev: affected md device
+ * @rdev: member device to fail
+ * @bio: bio that triggered the device failure
+ *
+ * Check if the personality wants to fail this rdev for this bio,
+ * and if so, call _md_error().
+ * This function behaves exactly like md_error() except for
+ * raid1/10 with failfast-enabled rdevs.
+ *
+ * Returns: %true if rdev is already or becomes Faulty, %false if not.
+ */
+bool md_cond_error(struct mddev *mddev, struct md_rdev *rdev, struct bio *bio)
+{
+	if (WARN_ON_ONCE(!mddev->pers))
+		/* return true because we don't want caller to retry */
+		return true;
+
+	spin_lock(&mddev->device_lock);
+
+	if (mddev->pers->should_error &&
+	    !mddev->pers->should_error(mddev, rdev, bio)) {
+		spin_unlock(&mddev->device_lock);
+		return test_bit(Faulty, &rdev->flags);
+	}
+
+	_md_error(mddev, rdev);
+	spin_unlock(&mddev->device_lock);
+
+	return !WARN_ON_ONCE(!test_bit(Faulty, &rdev->flags));
+}
+EXPORT_SYMBOL(md_cond_error);
+
 /* seq_file implementation /proc/mdstat */
 
 static void status_unused(struct seq_file *seq)
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 01c8182431d1..38f9874538a6 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -916,6 +916,7 @@ extern void md_write_end(struct mddev *mddev);
 extern void md_done_sync(struct mddev *mddev, int blocks, int ok);
 void _md_error(struct mddev *mddev, struct md_rdev *rdev);
 extern void md_error(struct mddev *mddev, struct md_rdev *rdev);
+extern bool md_cond_error(struct mddev *mddev, struct md_rdev *rdev, struct bio *bio);
 extern void md_finish_reshape(struct mddev *mddev);
 void md_submit_discard_bio(struct mddev *mddev, struct md_rdev *rdev,
 			   struct bio *bio, sector_t start, sector_t size);
-- 
2.50.1
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1761579862; c=relaxed/simple; bh=fDDLXYRr8DG1J/Sy8d2/00lhjPfpEUt3qTyk2cUc91I=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=h5CxCAXENpVTA/a5rO9fxvkrn7yph/es7GJ09lvIV6AkrjySd1ThL8SfalxqUgn267hXMecM60mKLrqhqdx7P3Gkeqy6CQXy1mP1Iw34yiRL/aXFpxVvl4C+LK4y7opN5WWl+dZwfr77rgx3j9GmuL+XXZH1cCrMxMpaU+W+g/s= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=mgml.me; spf=pass smtp.mailfrom=mgml.me; dkim=pass (2048-bit key) header.d=mgml.me header.i=@mgml.me header.b=gmDLhGOJ; arc=none smtp.client-ip=133.167.8.150 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=mgml.me Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=mgml.me Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=mgml.me header.i=@mgml.me header.b="gmDLhGOJ" Received: from fedora (p3796170-ipxg00h01tokaisakaetozai.aichi.ocn.ne.jp [180.53.173.170]) (authenticated bits=0) by www5210.sakura.ne.jp (8.16.1/8.16.1) with ESMTPSA id 59RF4hAf090988 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NO); Tue, 28 Oct 2025 00:04:45 +0900 (JST) (envelope-from k@mgml.me) DKIM-Signature: a=rsa-sha256; bh=f+Io4aR19zfdAavz436gI+/sCfOgtfjxIdEmYkp7QcM=; c=relaxed/relaxed; d=mgml.me; h=From:To:Subject:Date:Message-ID; s=rs20250315; t=1761577486; v=1; b=gmDLhGOJsxfZZXfRDZeX7upNwLJXLq3D8b9gzVbA8YWcN0EOmgQAwAZlt0sdQcQS kTzn6DPXx/MkkKKYl1NvsZmwdT9B8aZROt3IQ7W2u9Z/UJHgAqgKXGDtqDdBFOhT zErFPiEIm8l6LV5lKH1JMNJiYHkPoWp80PzHt8BJeUxb84rUIJ8Egz/gxDvDPcAg JcizZCQ3FijjhEiOqsNOO/wAR0FGSHtxZdxem9w7i2LNXBWuGWZQ1WoNV5jDMIch 5f8cCxU2SwoK+HZCPQ4HzS9LqtKtUHPyQEdgND5g5ZMArs5GHMGmqm0m+3kyc9kU 2QtaHKhoXThJnAR1kFEuIg== From: Kenta Akagi To: Song Liu , Yu Kuai , Shaohua Li , Mariusz Tkaczyk , Guoqing Jiang Cc: linux-raid@vger.kernel.org, linux-kernel@vger.kernel.org, Kenta Akagi 
Subject: [PATCH v5 05/16] md/raid1: implement pers->should_error()
Date: Tue, 28 Oct 2025 00:04:22 +0900
Message-ID: <20251027150433.18193-6-k@mgml.me>
In-Reply-To: <20251027150433.18193-1-k@mgml.me>
References: <20251027150433.18193-1-k@mgml.me>

The failfast feature in RAID1 and RAID10 assumes that when md_error()
is called, the array remains functional because the last rdev neither
fails nor sets MD_BROKEN. However, the current implementation can cause
the array to lose its last in-sync device or be marked as MD_BROKEN,
which breaks this assumption and can lead to array failure.

To address this issue, introduce a new handler, md_cond_error(), to
ensure that failfast I/O does not mark the array as broken.
md_cond_error() checks whether a device should be faulted based on
pers->should_error().

This commit implements the should_error() callback for the raid1
personality, which returns false if faulting the specified rdev would
cause the mddev to become non-functional.

Signed-off-by: Kenta Akagi
---
 drivers/md/raid1.c | 35 +++++++++++++++++++++++++++++++++++
 1 file changed, 35 insertions(+)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 202e510f73a4..69b7730f3875 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1732,6 +1732,40 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev)
 	seq_printf(seq, "]");
 }
 
+/**
+ * raid1_should_error() - Determine if this rdev should be failed
+ * @mddev: affected md device
+ * @rdev: member device to check
+ * @bio: the bio that caused the failure
+ *
+ * When a failfast bio fails, the rdev may fail, but the mddev must not
+ * fail. This function tells md_cond_error() not to fail the rdev if the
+ * bio is failfast and the rdev is the last in-sync device.
+ *
+ * Returns: %false if the bio is failfast and the rdev is the last
+ * in-sync device. Otherwise %true - this rdev should be failed.
+ */
+static bool raid1_should_error(struct mddev *mddev, struct md_rdev *rdev, struct bio *bio)
+{
+	int i;
+	struct r1conf *conf = mddev->private;
+
+	if (!(bio->bi_opf & MD_FAILFAST) ||
+	    !test_bit(FailFast, &rdev->flags) ||
+	    test_bit(Faulty, &rdev->flags))
+		return true;
+
+	for (i = 0; i < conf->raid_disks; i++) {
+		struct md_rdev *rdev2 = conf->mirrors[i].rdev;
+
+		if (rdev2 && rdev2 != rdev &&
+		    test_bit(In_sync, &rdev2->flags) &&
+		    !test_bit(Faulty, &rdev2->flags))
+			return true;
+	}
+	return false;
+}
+
 /**
  * raid1_error() - RAID1 error handler.
  * @mddev: affected md device.
@@ -3486,6 +3520,7 @@ static struct md_personality raid1_personality =
 	.free		= raid1_free,
 	.status		= raid1_status,
 	.error_handler	= raid1_error,
+	.should_error	= raid1_should_error,
 	.hot_add_disk	= raid1_add_disk,
 	.hot_remove_disk= raid1_remove_disk,
 	.spare_active	= raid1_spare_active,
-- 
2.50.1
From nobody Thu Dec 18 20:00:13 2025
From: Kenta Akagi
To: Song Liu, Yu Kuai, Shaohua Li, Mariusz Tkaczyk, Guoqing Jiang
Cc: linux-raid@vger.kernel.org, linux-kernel@vger.kernel.org, Kenta Akagi
Subject: [PATCH v5 06/16] md/raid10: implement pers->should_error()
Date: Tue, 28 Oct 2025 00:04:23 +0900
Message-ID: <20251027150433.18193-7-k@mgml.me>
In-Reply-To: <20251027150433.18193-1-k@mgml.me>
References: <20251027150433.18193-1-k@mgml.me>

The failfast feature in RAID1 and RAID10 assumes that when md_error()
is called, the array remains functional because the last rdev neither
fails nor sets MD_BROKEN. However, the current implementation can cause
the array to lose its last in-sync device or be marked as MD_BROKEN,
which breaks this assumption and can lead to array failure.

To address this issue, introduce a new handler, md_cond_error(), to
ensure that failfast I/O does not mark the array as broken.
md_cond_error() checks whether a device should be faulted based on
pers->should_error().

This commit implements the should_error() callback for the raid10
personality, which returns false if faulting the specified rdev would
cause the mddev to become non-functional.

Signed-off-by: Kenta Akagi
---
 drivers/md/raid10.c | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 25c0ab09807b..68dbab7b360b 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1977,6 +1977,31 @@ static int enough(struct r10conf *conf, int ignore)
 		_enough(conf, 1, ignore);
 }
 
+/**
+ * raid10_should_error() - Determine if this rdev should be failed
+ * @mddev: affected md device
+ * @rdev: member device to check
+ * @bio: the bio that caused the failure
+ *
+ * When a failfast bio fails, the rdev may fail, but the mddev must not
+ * fail. This function tells md_cond_error() not to fail the rdev if the
+ * bio is failfast and the rdev is the last in-sync device.
+ *
+ * Returns: %false if the bio is failfast and the rdev is the last
+ * in-sync device. Otherwise %true - this rdev should be failed.
+ */
+static bool raid10_should_error(struct mddev *mddev, struct md_rdev *rdev, struct bio *bio)
+{
+	struct r10conf *conf = mddev->private;
+
+	if (!(bio->bi_opf & MD_FAILFAST) ||
+	    !test_bit(FailFast, &rdev->flags) ||
+	    test_bit(Faulty, &rdev->flags))
+		return true;
+
+	return enough(conf, rdev->raid_disk);
+}
+
 /**
  * raid10_error() - RAID10 error handler.
  * @mddev: affected md device.
@@ -5116,6 +5141,7 @@ static struct md_personality raid10_personality =
 	.free		= raid10_free,
 	.status		= raid10_status,
 	.error_handler	= raid10_error,
+	.should_error	= raid10_should_error,
 	.hot_add_disk	= raid10_add_disk,
 	.hot_remove_disk= raid10_remove_disk,
 	.spare_active	= raid10_spare_active,
-- 
2.50.1
From nobody Thu Dec 18 20:00:13 2025
From: Kenta Akagi
To: Song Liu, Yu Kuai, Shaohua Li, Mariusz Tkaczyk, Guoqing Jiang
Cc: linux-raid@vger.kernel.org, linux-kernel@vger.kernel.org, Kenta Akagi
Subject: [PATCH v5 07/16] md/raid1: refactor handle_read_error()
Date: Tue, 28 Oct 2025 00:04:24 +0900
Message-ID: <20251027150433.18193-8-k@mgml.me>
In-Reply-To: <20251027150433.18193-1-k@mgml.me>
References: <20251027150433.18193-1-k@mgml.me>

The behavior of handle_read_error() will be changed in a subsequent
commit for the failfast bio feature; refactor it first. This commit
only refactors the code, with no functional changes. A subsequent
commit will replace md_error() with md_cond_error() to implement
proper failfast error handling.

Signed-off-by: Kenta Akagi
Reviewed-by: Yu Kuai
---
 drivers/md/raid1.c | 14 ++++++--------
 1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 69b7730f3875..a70ca6bc28f3 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -2675,24 +2675,22 @@ static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio)
 	r1_bio->bios[r1_bio->read_disk] = NULL;
 
 	rdev = conf->mirrors[r1_bio->read_disk].rdev;
-	if (mddev->ro == 0
-	    && !test_bit(FailFast, &rdev->flags)) {
+	if (mddev->ro) {
+		r1_bio->bios[r1_bio->read_disk] = IO_BLOCKED;
+	} else if (test_bit(FailFast, &rdev->flags)) {
+		md_error(mddev, rdev);
+	} else {
 		freeze_array(conf, 1);
 		fix_read_error(conf, r1_bio);
 		unfreeze_array(conf);
-	} else if (mddev->ro == 0 && test_bit(FailFast, &rdev->flags)) {
-		md_error(mddev, rdev);
-	} else {
-		r1_bio->bios[r1_bio->read_disk] = IO_BLOCKED;
 	}
 
 	rdev_dec_pending(rdev, conf->mddev);
 	sector = r1_bio->sector;
-	bio = r1_bio->master_bio;
 
 	/* Reuse the old r1_bio so that the IO_BLOCKED settings are preserved */
 	r1_bio->state = 0;
-	raid1_read_request(mddev, bio, r1_bio->sectors, r1_bio);
+	raid1_read_request(mddev, r1_bio->master_bio, r1_bio->sectors, r1_bio);
 	allow_barrier(conf, sector);
 }
 
-- 
2.50.1
From nobody Thu Dec 18 20:00:13 2025
From: Kenta Akagi
To: Song Liu, Yu Kuai, Shaohua Li, Mariusz Tkaczyk, Guoqing Jiang
Cc: linux-raid@vger.kernel.org, linux-kernel@vger.kernel.org, Kenta Akagi
Subject: [PATCH v5 08/16] md/raid10: refactor handle_read_error()
Date: Tue, 28 Oct 2025 00:04:25 +0900
Message-ID: <20251027150433.18193-9-k@mgml.me>
In-Reply-To: <20251027150433.18193-1-k@mgml.me>
References: <20251027150433.18193-1-k@mgml.me>

The behavior of handle_read_error() will be changed in a subsequent
commit for the failfast bio feature; refactor it first. This commit
only refactors the code, with no functional changes. A subsequent
commit will replace md_error() with md_cond_error() to implement
proper failfast error handling.

Signed-off-by: Kenta Akagi
Reviewed-by: Yu Kuai
---
 drivers/md/raid10.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 68dbab7b360b..87468113e31a 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -2873,14 +2873,15 @@ static void handle_read_error(struct mddev *mddev, struct r10bio *r10_bio)
 	bio_put(bio);
 	r10_bio->devs[slot].bio = NULL;
 
-	if (mddev->ro)
+	if (mddev->ro) {
 		r10_bio->devs[slot].bio = IO_BLOCKED;
-	else if (!test_bit(FailFast, &rdev->flags)) {
+	} else if (test_bit(FailFast, &rdev->flags)) {
+		md_error(mddev, rdev);
+	} else {
 		freeze_array(conf, 1);
 		fix_read_error(conf, mddev, r10_bio);
 		unfreeze_array(conf);
-	} else
-		md_error(mddev, rdev);
+	}
 
 	rdev_dec_pending(rdev, mddev);
 	r10_bio->state = 0;
-- 
2.50.1
From nobody Thu Dec 18 20:00:13 2025
From: Kenta Akagi
To: Song Liu, Yu Kuai, Shaohua Li, Mariusz Tkaczyk, Guoqing Jiang
Cc: linux-raid@vger.kernel.org, linux-kernel@vger.kernel.org, Kenta Akagi, Li Nan
Subject: [PATCH v5 09/16] md/raid10: fix failfast read error not rescheduled
Date: Tue, 28 Oct 2025 00:04:26 +0900
Message-ID: <20251027150433.18193-10-k@mgml.me>
In-Reply-To: <20251027150433.18193-1-k@mgml.me>
References: <20251027150433.18193-1-k@mgml.me>

raid10_end_read_request() lacks a path to retry when a failfast I/O
fails. As a result, when failfast read I/Os fail on all rdevs, the
upper layer receives EIO without the read being rescheduled.

Looking at the two commits below, only raid10_end_read_request() lacks
the failfast read retry handling; raid1_end_read_request() has it, and
in RAID1 the retry works as expected.

* commit 8d3ca83dcf9c ("md/raid10: add failfast handling for reads.")
* commit 2e52d449bcec ("md/raid1: add failfast handling for reads.")

This commit makes a failed failfast read bio on the last rdev in
raid10 be retried.
Fixes: 8d3ca83dcf9c ("md/raid10: add failfast handling for reads.")
Signed-off-by: Kenta Akagi
Reviewed-by: Li Nan
Reviewed-by: Yu Kuai
---
 drivers/md/raid10.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 87468113e31a..1dd27b9ef48e 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -401,6 +401,13 @@ static void raid10_end_read_request(struct bio *bio)
 		 * wait for the 'master' bio.
 		 */
 		set_bit(R10BIO_Uptodate, &r10_bio->state);
+	} else if (test_bit(FailFast, &rdev->flags) &&
+		   test_bit(R10BIO_FailFast, &r10_bio->state)) {
+		/*
+		 * This was a fail-fast read so we definitely
+		 * want to retry
+		 */
+		;
 	} else if (!raid1_should_handle_error(bio)) {
 		uptodate = 1;
 	} else {
-- 
2.50.1
From nobody Thu Dec 18 20:00:13 2025
From: Kenta Akagi
To: Song Liu, Yu Kuai, Shaohua Li, Mariusz Tkaczyk, Guoqing Jiang
Cc: linux-raid@vger.kernel.org, linux-kernel@vger.kernel.org, Kenta Akagi
Subject: [PATCH v5 10/16] md: prevent setting MD_BROKEN on super_write failure with failfast
Date: Tue, 28 Oct 2025 00:04:27 +0900
Message-ID: <20251027150433.18193-11-k@mgml.me>
In-Reply-To: <20251027150433.18193-1-k@mgml.me>
References: <20251027150433.18193-1-k@mgml.me>

Failfast is a feature implemented only for RAID1 and RAID10. It
instructs the block device providing the rdev to immediately return a
bio error without retrying if any issue occurs. This allows quickly
detaching a problematic rdev and minimizes I/O latency. Due to its
nature, failfast bios can fail easily, and md must not mark an
essential rdev as Faulty or set MD_BROKEN on the array just because a
failfast bio failed.

When failfast was introduced, RAID1 and RAID10 were designed to
continue operating normally even if md_error() was called for the last
rdev. However, with the introduction of MD_BROKEN in RAID1/RAID10 in
commit 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10"), calling
md_error() for the last rdev now prevents further writes to the array.
Despite this, the current failfast error handler still assumes that
calling md_error() will not break the array.

Normally, this is not an issue because MD_FAILFAST is not set when a
bio is issued to the last rdev. However, if the array is not degraded
and a bio with MD_FAILFAST has been issued, simultaneous failures could
potentially break the array. This is unusual but can happen; for
example, it can occur when using NVMe over TCP if all rdevs depend on
a single Ethernet link.

In other words, this becomes a problem under the following conditions:

Preconditions:
* Failfast is enabled on all rdevs.
* All rdevs are In_sync - this is a requirement for a bio to be
  submitted with MD_FAILFAST.
* At least one bio has been submitted but has not yet completed.

Trigger condition:
* All underlying devices of the rdevs return an error for their
  failfast bios.

Whether the bio is a read or a write, eventually both rdevs will be
lost. In the write case, md_error() is invoked on each rdev through
its bi_end_io handler. In the read case, if the bio has been issued to
multiple rdevs via read_balance(), the result is the same as for
writes. Even in the read case where a bio has been issued to only a
single rdev, both rdevs will be lost in the following sequence:

1. Losing the first rdev triggers a metadata update.
2. md_super_write() issues the bio with MD_FAILFAST, causing the bio
   to fail immediately. md_super_write() always issues an MD_FAILFAST
   bio if the rdev has FailFast set, regardless of whether there are
   other rdevs or not.
3. The bio issued by md_super_write() failed, so super_written() calls
   md_error() on the remaining rdev.

This commit fixes the second read case: ensure that a failfast
metadata write does not mark the last rdev as Faulty or set MD_BROKEN
on the array.

Fixes: 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10")
Fixes: 9a567843f7ce ("md: allow last device to be forcibly removed from RAID1/RAID10.")
Signed-off-by: Kenta Akagi
---
 drivers/md/md.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index e33ab564f26b..3c3f5703531b 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -1060,8 +1060,7 @@ static void super_written(struct bio *bio)
 	if (bio->bi_status) {
 		pr_err("md: %s gets error=%d\n", __func__,
 		       blk_status_to_errno(bio->bi_status));
-		md_error(mddev, rdev);
-		if (!test_bit(Faulty, &rdev->flags)
+		if (!md_cond_error(mddev, rdev, bio)
 		    && (bio->bi_opf & MD_FAILFAST)) {
 			set_bit(MD_SB_NEED_REWRITE, &mddev->sb_flags);
 			set_bit(LastDev, &rdev->flags);
-- 
2.50.1
From nobody Thu Dec 18 20:00:13 2025
From: Kenta Akagi <k@mgml.me>
To: Song Liu, Yu Kuai, Shaohua Li, Mariusz Tkaczyk, Guoqing Jiang
Cc: linux-raid@vger.kernel.org, linux-kernel@vger.kernel.org, Kenta Akagi
Subject: [PATCH v5 11/16] md/raid1: Prevent setting MD_BROKEN on failfast bio failure
Date: Tue, 28 Oct 2025 00:04:28 +0900
Message-ID: <20251027150433.18193-2-k@mgml.me>
In-Reply-To: <20251027150433.18193-1-k@mgml.me>
References: <20251027150433.18193-1-k@mgml.me>

Failfast is a feature implemented only for RAID1 and RAID10. It
instructs the block device backing the rdev to return a bio error
immediately, without retrying, if any issue occurs. This allows a
problematic rdev to be detached quickly and minimizes IO latency.

Due to its nature, failfast bios fail easily, and md must not mark an
essential rdev Faulty or set MD_BROKEN on the array just because a
failfast bio failed.

When failfast was introduced, RAID1 and RAID10 were designed to continue
operating normally even if md_error() was called for the last rdev.
However, since MD_BROKEN was introduced for RAID1/RAID10 in commit
9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10"), calling
md_error() for the last rdev prevents any further writes to the array.

Despite this, the current failfast error handler still assumes that
calling md_error() will not break the array. Normally this is not an
issue, because MD_FAILFAST is not set on bios issued to the last rdev.
However, if the array is not degraded and bios with MD_FAILFAST are in
flight, simultaneous failures can break the array. This is unusual but
possible; for example, it can occur with NVMe over TCP when all rdevs
depend on a single Ethernet link.

In other words, the problem occurs under the following conditions:

Preconditions:
* Failfast is enabled on all rdevs.
* All rdevs are In_sync - a requirement for bios to be submitted with
  MD_FAILFAST.
* At least one bio has been submitted but has not yet completed.

Trigger condition:
* The underlying devices of all rdevs return errors for their failfast
  bios.

Whether the bios are reads or writes, both rdevs are eventually lost.
In the write case, md_error() is invoked on each rdev through its
bi_end_io handler. In the read case, if the bio was issued to multiple
rdevs via read_balance(), the outcome is the same as for writes. Even in
the read case where only a single rdev received the bio, both rdevs are
lost in the following sequence:
1. Losing the first rdev triggers a metadata update.
2. md_super_write() issues the superblock bio with MD_FAILFAST, so the
   bio fails immediately. md_super_write() always sets MD_FAILFAST when
   the rdev has FailFast, regardless of whether other rdevs remain.
3. Since that bio failed, super_written() calls md_error() on the
   remaining rdev.

This commit fixes the write case and the first read case: ensure that a
failfast bio failure cannot cause the last rdev to become Faulty or the
array to be marked MD_BROKEN. The second read case, i.e. the metadata
update failure, was already fixed by the previous commit.

Signed-off-by: Kenta Akagi <k@mgml.me>
---
 drivers/md/raid1.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index a70ca6bc28f3..bf96ae78a8b1 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -470,7 +470,7 @@ static void raid1_end_write_request(struct bio *bio)
 		    (bio->bi_opf & MD_FAILFAST) &&
 		    /* We never try FailFast to WriteMostly devices */
 		    !test_bit(WriteMostly, &rdev->flags)) {
-			md_error(r1_bio->mddev, rdev);
+			md_cond_error(r1_bio->mddev, rdev, bio);
 		}
 
 	/*
@@ -2177,8 +2177,7 @@ static int fix_sync_read_error(struct r1bio *r1_bio)
 			if (test_bit(FailFast, &rdev->flags)) {
 				/* Don't try recovering from here - just fail it
 				 * ... unless it is the last working device of course */
-				md_error(mddev, rdev);
-				if (test_bit(Faulty, &rdev->flags))
+				if (md_cond_error(mddev, rdev, bio))
 					/* Don't try to read from here, but make
 					 * sure put_buf does it's thing
 					 */
@@ -2671,20 +2670,20 @@ static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio)
 	 */
 
 	bio = r1_bio->bios[r1_bio->read_disk];
-	bio_put(bio);
 	r1_bio->bios[r1_bio->read_disk] = NULL;
 
 	rdev = conf->mirrors[r1_bio->read_disk].rdev;
 	if (mddev->ro) {
 		r1_bio->bios[r1_bio->read_disk] = IO_BLOCKED;
 	} else if (test_bit(FailFast, &rdev->flags)) {
-		md_error(mddev, rdev);
+		md_cond_error(mddev, rdev, bio);
 	} else {
 		freeze_array(conf, 1);
 		fix_read_error(conf, r1_bio);
 		unfreeze_array(conf);
 	}
 
+	bio_put(bio);
 	rdev_dec_pending(rdev, conf->mddev);
 	sector = r1_bio->sector;
 
-- 
2.50.1
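The double-failure scenario described in this patch can be modeled in plain userspace C. This is a minimal illustrative sketch, not kernel code: the `*_model` functions and the two-disk array struct are hypothetical stand-ins for the kernel's md_error()/md_cond_error() and rdev state, showing only the decision "do not fail the last In_sync rdev on a failfast bio error".

```c
/* Userspace model of the failfast double-failure from the commit message.
 * Names echo the kernel's (In_sync, Faulty, MD_BROKEN) but the logic is a
 * simplified illustration, not the kernel implementation. */
#include <stdbool.h>

struct model_rdev { bool in_sync; bool faulty; };

struct model_array {
	struct model_rdev rdev[2];
	bool broken;		/* MD_BROKEN */
};

static int in_sync_count(const struct model_array *a)
{
	int n = 0;
	for (int i = 0; i < 2; i++)
		if (a->rdev[i].in_sync && !a->rdev[i].faulty)
			n++;
	return n;
}

/* Old behavior: md_error() runs unconditionally; failing the last
 * In_sync rdev sets MD_BROKEN (per commit 9631abdbf406). */
static void md_error_model(struct model_array *a, int i)
{
	if (a->rdev[i].in_sync && in_sync_count(a) == 1) {
		a->broken = true;	/* array can no longer be written */
		return;
	}
	a->rdev[i].faulty = true;
	a->rdev[i].in_sync = false;
}

/* New behavior: a failfast bio failure on the last In_sync rdev does not
 * fail the rdev; the caller can retry the bio without MD_FAILFAST. */
static bool md_cond_error_model(struct model_array *a, int i, bool failfast_bio)
{
	if (failfast_bio && in_sync_count(a) == 1)
		return false;		/* rdev kept; bio should be retried */
	md_error_model(a, i);
	return true;
}
```

With both mirrors In_sync and both failfast writes failing at once, the old path marks the array broken, while the conditional path leaves the last rdev usable.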
From nobody Thu Dec 18 20:00:13 2025
From: Kenta Akagi <k@mgml.me>
To: Song Liu, Yu Kuai, Shaohua Li, Mariusz Tkaczyk, Guoqing Jiang
Cc: linux-raid@vger.kernel.org, linux-kernel@vger.kernel.org, Kenta Akagi
Subject: [PATCH v5 12/16] md/raid10: Prevent setting MD_BROKEN on failfast bio failure
Date: Tue, 28 Oct 2025 00:04:29 +0900
Message-ID: <20251027150433.18193-13-k@mgml.me>
In-Reply-To: <20251027150433.18193-1-k@mgml.me>
References: <20251027150433.18193-1-k@mgml.me>

Failfast is a feature implemented only for RAID1 and RAID10. It
instructs the block device backing the rdev to return a bio error
immediately, without retrying, if any issue occurs. This allows a
problematic rdev to be detached quickly and minimizes IO latency.

Due to its nature, failfast bios fail easily, and md must not mark an
essential rdev Faulty or set MD_BROKEN on the array just because a
failfast bio failed.

When failfast was introduced, RAID1 and RAID10 were designed to continue
operating normally even if md_error() was called for the last rdev.
However, since MD_BROKEN was introduced for RAID1/RAID10 in commit
9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10"), calling
md_error() for the last rdev prevents any further writes to the array.

Despite this, the current failfast error handler still assumes that
calling md_error() will not break the array. Normally this is not an
issue, because MD_FAILFAST is not set on bios issued to the last rdev.
However, if the array is not degraded and bios with MD_FAILFAST are in
flight, simultaneous failures can break the array. This is unusual but
possible; for example, it can occur with NVMe over TCP when all rdevs
depend on a single Ethernet link.

In other words, the problem occurs under the following conditions:

Preconditions:
* Failfast is enabled on all rdevs.
* All rdevs are In_sync - a requirement for bios to be submitted with
  MD_FAILFAST.
* At least one bio has been submitted but has not yet completed.

Trigger condition:
* The underlying devices of all rdevs return errors for their failfast
  bios.

Whether the bios are reads or writes, both rdevs are eventually lost.
In the write case, md_error() is invoked on each rdev through its
bi_end_io handler. In the read case, if the bio was issued to multiple
rdevs via read_balance(), the outcome is the same as for writes. Even in
the read case where only a single rdev received the bio, both rdevs are
lost in the following sequence:
1. Losing the first rdev triggers a metadata update.
2. md_super_write() issues the superblock bio with MD_FAILFAST, so the
   bio fails immediately. md_super_write() always sets MD_FAILFAST when
   the rdev has FailFast, regardless of whether other rdevs remain.
3. Since that bio failed, super_written() calls md_error() on the
   remaining rdev.

This commit fixes the write case and the first read case: ensure that a
failfast bio failure cannot cause the last rdev to become Faulty or the
array to be marked MD_BROKEN. The second read case, i.e. the metadata
update failure, was already fixed by the previous commit.

Signed-off-by: Kenta Akagi <k@mgml.me>
---
 drivers/md/raid10.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 1dd27b9ef48e..aa9d328fe875 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -497,7 +497,7 @@ static void raid10_end_write_request(struct bio *bio)
 		dec_rdev = 0;
 		if (test_bit(FailFast, &rdev->flags) &&
 		    (bio->bi_opf & MD_FAILFAST)) {
-			md_error(rdev->mddev, rdev);
+			md_cond_error(rdev->mddev, rdev, bio);
 		}
 
 	/*
@@ -2434,7 +2434,7 @@ static void sync_request_write(struct mddev *mddev, struct r10bio *r10_bio)
 			continue;
 		} else if (test_bit(FailFast, &rdev->flags)) {
 			/* Just give up on this device */
-			md_error(rdev->mddev, rdev);
+			md_cond_error(rdev->mddev, rdev, r10_bio->devs[i].bio);
 			continue;
 		}
 		/* Ok, we need to write this bio, either to correct an
@@ -2877,19 +2877,19 @@ static void handle_read_error(struct mddev *mddev, struct r10bio *r10_bio)
 	 * frozen.
 	 */
 	bio = r10_bio->devs[slot].bio;
-	bio_put(bio);
 	r10_bio->devs[slot].bio = NULL;
 
 	if (mddev->ro) {
 		r10_bio->devs[slot].bio = IO_BLOCKED;
 	} else if (test_bit(FailFast, &rdev->flags)) {
-		md_error(mddev, rdev);
+		md_cond_error(mddev, rdev, bio);
 	} else {
 		freeze_array(conf, 1);
 		fix_read_error(conf, mddev, r10_bio);
 		unfreeze_array(conf);
 	}
 
+	bio_put(bio);
 	rdev_dec_pending(rdev, mddev);
 	r10_bio->state = 0;
 	raid10_read_request(mddev, r10_bio->master_bio, r10_bio, false);
-- 
2.50.1
From nobody Thu Dec 18 20:00:13 2025
From: Kenta Akagi <k@mgml.me>
To: Song Liu, Yu Kuai, Shaohua Li, Mariusz Tkaczyk, Guoqing Jiang
Cc: linux-raid@vger.kernel.org, linux-kernel@vger.kernel.org, Kenta Akagi
Subject: [PATCH v5 13/16] md/raid1: Add error message when setting MD_BROKEN
Date: Tue, 28 Oct 2025 00:04:30 +0900
Message-ID: <20251027150433.18193-14-k@mgml.me>
In-Reply-To: <20251027150433.18193-1-k@mgml.me>
References: <20251027150433.18193-1-k@mgml.me>

Once MD_BROKEN is set on an array, no further writes can be performed.
The user must be informed that the array cannot continue operating.

Add error logging when the MD_BROKEN flag is set on RAID1 arrays to
improve debugging and system administration visibility.

Signed-off-by: Kenta Akagi <k@mgml.me>
---
 drivers/md/raid1.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index bf96ae78a8b1..d58a60fb5b2f 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1790,6 +1790,10 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
 	if (test_bit(In_sync, &rdev->flags) &&
 	    (conf->raid_disks - mddev->degraded) == 1) {
 		set_bit(MD_BROKEN, &mddev->flags);
+		pr_crit("md/raid1:%s: Disk failure on %pg, this is the last device.\n"
+			"md/raid1:%s: Cannot continue operation (%d/%d failed).\n",
+			mdname(mddev), rdev->bdev,
+			mdname(mddev), mddev->degraded + 1, conf->raid_disks);
 
 		if (!mddev->fail_last_dev) {
 			conf->recovery_disabled = mddev->recovery_disabled;
-- 
2.50.1
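The counters in the new pr_crit() message come straight from the handler's state: the failing disk is not yet counted in mddev->degraded, so the message reports degraded + 1 out of conf->raid_disks. A small userspace sketch (snprintf standing in for pr_crit, `broken_msg` a hypothetical helper) makes the arithmetic concrete:

```c
/* Illustrative model of the "(%d/%d failed)" values in the new raid1
 * pr_crit() message; snprintf stands in for pr_crit here. */
#include <stdio.h>

static int broken_msg(char *buf, size_t n, const char *mdname,
		      int degraded, int raid_disks)
{
	/* +1: the rdev being failed right now is not yet in 'degraded' */
	return snprintf(buf, n,
			"md/raid1:%s: Cannot continue operation (%d/%d failed).",
			mdname, degraded + 1, raid_disks);
}
```

For a two-disk RAID1 that already lost one mirror, the final failure is reported as "(2/2 failed)".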
From nobody Thu Dec 18 20:00:13 2025
From: Kenta Akagi <k@mgml.me>
To: Song Liu, Yu Kuai, Shaohua Li, Mariusz Tkaczyk, Guoqing Jiang
Cc: linux-raid@vger.kernel.org, linux-kernel@vger.kernel.org, Kenta Akagi
Subject: [PATCH v5 14/16] md/raid10: Add error message when setting MD_BROKEN
Date: Tue, 28 Oct 2025 00:04:31 +0900
Message-ID: <20251027150433.18193-15-k@mgml.me>
In-Reply-To: <20251027150433.18193-1-k@mgml.me>
References: <20251027150433.18193-1-k@mgml.me>

Once MD_BROKEN is set on an array, no further writes can be performed.
The user must be informed that the array cannot continue operating.

Add error logging when the MD_BROKEN flag is set on RAID10 arrays to
improve debugging and system administration visibility.

Signed-off-by: Kenta Akagi <k@mgml.me>
---
 drivers/md/raid10.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index aa9d328fe875..369def3413c0 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -2031,6 +2031,10 @@ static void raid10_error(struct mddev *mddev, struct md_rdev *rdev)
 
 	if (test_bit(In_sync, &rdev->flags) && !enough(conf, rdev->raid_disk)) {
 		set_bit(MD_BROKEN, &mddev->flags);
+		pr_crit("md/raid10:%s: Disk failure on %pg, this is the last device.\n"
+			"md/raid10:%s: Cannot continue operation (%d/%d failed).\n",
+			mdname(mddev), rdev->bdev,
+			mdname(mddev), mddev->degraded + 1, conf->geo.raid_disks);
 
 		if (!mddev->fail_last_dev)
 			return;
-- 
2.50.1
From nobody Thu Dec 18 20:00:13 2025
From: Kenta Akagi <k@mgml.me>
To: Song Liu, Yu Kuai, Shaohua Li, Mariusz Tkaczyk, Guoqing Jiang
Cc: linux-raid@vger.kernel.org, linux-kernel@vger.kernel.org, Kenta Akagi
Subject: [PATCH v5 15/16] md: rename 'LastDev' rdev flag to 'RetryingSBWrite'
Date: Tue, 28 Oct 2025 00:04:32 +0900
Message-ID: <20251027150433.18193-16-k@mgml.me>
In-Reply-To: <20251027150433.18193-1-k@mgml.me>
References: <20251027150433.18193-1-k@mgml.me>

The rdev flag LastDev is set on an rdev whose metadata was written with
MD_FAILFAST, whose bio then failed, and which did not become Faulty
after md_error(). The metadata write is subsequently retried via the
MD_SB_NEED_REWRITE sb_flag; if LastDev is set at that point, FailFast is
not used, and LastDev is cleared once the write succeeds.

Although it is called LastDev, this rdev flag actually means "a metadata
write with FailFast failed for this rdev and a retry is required; do not
use FailFast for the retry." This differs from what the name LastDev
suggests and can be confusing when reading the code.

This commit renames the LastDev flag to RetryingSBWrite to better
reflect its actual behavior and improve readability. The implementation
remains unchanged.

Signed-off-by: Kenta Akagi <k@mgml.me>
---
 drivers/md/md.c | 6 +++---
 drivers/md/md.h | 7 ++++---
 2 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 3c3f5703531b..1cbb4fd8bbc0 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -1063,10 +1063,10 @@ static void super_written(struct bio *bio)
 		if (!md_cond_error(mddev, rdev, bio) &&
 		    (bio->bi_opf & MD_FAILFAST)) {
 			set_bit(MD_SB_NEED_REWRITE, &mddev->sb_flags);
-			set_bit(LastDev, &rdev->flags);
+			set_bit(RetryingSBWrite, &rdev->flags);
 		}
 	} else
-		clear_bit(LastDev, &rdev->flags);
+		clear_bit(RetryingSBWrite, &rdev->flags);
 
 	bio_put(bio);
 
@@ -1119,7 +1119,7 @@ void md_write_metadata(struct mddev *mddev, struct md_rdev *rdev,
 
 	if (test_bit(MD_FAILFAST_SUPPORTED, &mddev->flags) &&
 	    test_bit(FailFast, &rdev->flags) &&
-	    !test_bit(LastDev, &rdev->flags))
+	    !test_bit(RetryingSBWrite, &rdev->flags))
 		bio->bi_opf |= MD_FAILFAST;
 
 	atomic_inc(&mddev->pending_writes);
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 38f9874538a6..0943cc5a86aa 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -282,9 +282,10 @@ enum flag_bits {
 				 * It is expects that no bad block log
 				 * is present.
 				 */
-	LastDev,		/* Seems to be the last working dev as
-				 * it didn't fail, so don't use FailFast
-				 * any more for metadata
+	RetryingSBWrite,	/*
+				 * metadata write with MD_FAILFAST failed,
+				 * so it is being retried. Failfast
+				 * will not be used during the retry.
 				 */
 	CollisionCheck,		/*
 				 * check if there is collision between raid1
-- 
2.50.1
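The retry cycle this rename describes can be sketched as a small userspace state machine. Field and function names echo the kernel's (RetryingSBWrite, MD_SB_NEED_REWRITE, MD_FAILFAST, super_written, md_write_metadata), but this is a simplified illustrative model, not the kernel implementation:

```c
/* Userspace model of the superblock-write retry cycle: a failed failfast
 * metadata write schedules a non-failfast retry; success clears the flag. */
#include <stdbool.h>

struct sb_state {
	bool retrying_sb_write;	/* RetryingSBWrite rdev flag */
	bool need_rewrite;	/* MD_SB_NEED_REWRITE sb_flag */
	bool failfast_supported;
};

/* Mirror of the md_write_metadata() condition: MD_FAILFAST is used only
 * while no retry is pending for this rdev. */
static bool submit_uses_failfast(const struct sb_state *s)
{
	return s->failfast_supported && !s->retrying_sb_write;
}

/* Mirror of super_written(): on a failfast I/O error, mark the superblock
 * for rewrite and remember that the retry must not use failfast. */
static void super_written_model(struct sb_state *s, bool io_error,
				bool was_failfast)
{
	if (io_error) {
		if (was_failfast) {
			s->need_rewrite = true;
			s->retrying_sb_write = true;
		}
	} else {
		s->retrying_sb_write = false;
	}
}
```

One pass through the cycle: the first write goes out with failfast, fails, forces a non-failfast retry, and a successful retry re-enables failfast for future metadata writes.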
From nobody Thu Dec 18 20:00:13 2025
From: Kenta Akagi <k@mgml.me>
To: Song Liu, Yu Kuai, Shaohua Li, Mariusz Tkaczyk, Guoqing Jiang
Cc: linux-raid@vger.kernel.org, linux-kernel@vger.kernel.org, Kenta Akagi
Subject: [PATCH v5 16/16] md: Improve super_written() error logging
Date: Tue, 28 Oct 2025 00:04:33 +0900
Message-ID: <20251027150433.18193-17-k@mgml.me>
In-Reply-To: <20251027150433.18193-1-k@mgml.me>
References: <20251027150433.18193-1-k@mgml.me>

In the current implementation, when a super_write fails, the log output
looks like this:

md: super_written gets error=-5

This is useless unless combined with other logs, e.g. the I/O error
message from the block layer: it is impossible to tell which md array
and which rdev caused the problem, and if the problem occurs on multiple
devices it becomes completely impossible to determine from the logs
where the super_write failed.

Also, super_written() currently logs nothing when a metadata write is
retried. If the metadata write fails the array may be corrupted, but not
if the retry succeeds, so the user should be informed when a retry is
attempted.

This commit adds the array and device to the error message, and logs
when a metadata write retry is scheduled.

Signed-off-by: Kenta Akagi <k@mgml.me>
---
 drivers/md/md.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 1cbb4fd8bbc0..4cbb31552486 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -1058,10 +1058,13 @@ static void super_written(struct bio *bio)
 	struct mddev *mddev = rdev->mddev;
 
 	if (bio->bi_status) {
-		pr_err("md: %s gets error=%d\n", __func__,
+		pr_err("md: %s: %pg: %s gets error=%d\n",
+		       mdname(mddev), rdev->bdev, __func__,
 		       blk_status_to_errno(bio->bi_status));
 		if (!md_cond_error(mddev, rdev, bio) &&
 		    (bio->bi_opf & MD_FAILFAST)) {
+			pr_warn("md: %s: %pg: retrying metadata write\n",
+				mdname(mddev), rdev->bdev);
 			set_bit(MD_SB_NEED_REWRITE, &mddev->sb_flags);
 			set_bit(RetryingSBWrite, &rdev->flags);
 		}
-- 
2.50.1