From: Diangang Li <lidiangang@bytedance.com>
A production system reported hung tasks blocked for 300s+ in ext4 buffer_head
paths. Hung task reports were accompanied by disk IO errors, but profiling
showed that most individual reads completed (or failed) within 10s, with
the worst case around 60s.
At the same time, we observed a high rate of repeated reads to the same disk LBAs.
The repeated reads frequently showed seconds-level latency and ended with
IO errors, e.g.:
    [Tue Mar 24 14:16:24 2026] blk_update_request: I/O error, dev sdi,
        sector 10704150288 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
    [Tue Mar 24 14:16:25 2026] blk_update_request: I/O error, dev sdi,
        sector 10704488160 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
    [Tue Mar 24 14:16:26 2026] blk_update_request: I/O error, dev sdi,
        sector 10704382912 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
We also sampled repeated-LBA latency histograms on /dev/sdi and saw that
the same error-prone LBAs were re-submitted many times with ~1-4s latency:
    LBA 10704488160 (count=22): 1-2s: 20, 2-4s: 2
    LBA 10704382912 (count=21): 1-2s: 20, 2-4s: 1
    LBA 10704150288 (count=21): 1-2s: 19, 2-4s: 2
Root cause
==========
ext4 buffer_head reads serialize IO via BH_Lock. When one read fails, the
buffer remains !Uptodate. With multiple threads concurrently accessing
the same buffer_head, each waiter wakes up after the previous owner drops
BH_Lock, then submits the same read again and waits again. This makes the
latency grow linearly with the number of contending threads, leading to
300s+ hung tasks.
The failing IOs are repeatedly issued to the same LBA. The observed 1s+
per-IO latency is likely from device-side retry/error recovery. On SCSI the
driver typically retries reads several times (e.g. 5 retries in our
environment), so a single filesystem submission can easily accumulate 5s+
delay before failing. When multiple threads then re-submit the same failing
read and serialize on BH_Lock, the delay is amplified into 300s+ hung tasks.
Similar behavior exists for other devices (e.g. NVMe with multiple internal
retries).
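
As a rough back-of-the-envelope model of the amplification described above,
the two multiplications can be sketched as follows. The helper names and the
sample numbers in the note after the block are illustrative assumptions, not
measurements from the affected system.

```c
/*
 * Illustrative model only, not kernel code.
 *
 * Latency of one filesystem-level read when the driver makes several
 * attempts before reporting the error upward.
 */
static unsigned int submission_latency_sec(unsigned int attempts,
					   unsigned int sec_per_attempt)
{
	return attempts * sec_per_attempt;
}

/*
 * Delay seen by the last of nr_waiters threads that serialize on BH_Lock
 * and each re-submit the same failing read in turn.
 */
static unsigned int last_waiter_delay_sec(unsigned int nr_waiters,
					  unsigned int submission_sec)
{
	return nr_waiters * submission_sec;
}
```

For example, assuming 5 attempts of about 1s each and 64 contending threads,
submission_latency_sec(5, 1) gives 5s per read and
last_waiter_delay_sec(64, 5) gives 320s for the last waiter, which is in the
range of the hung task reports.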
Example hung stacks:
    INFO: task toutiao.infra.t:3760933 blocked for more than 327 seconds.
    Call Trace:
      __schedule
      io_schedule
      __wait_on_bit_lock
      bh_uptodate_or_lock
      __read_extent_tree_block
      ext4_find_extent
      ext4_ext_map_blocks
      ext4_map_blocks
      ext4_getblk
      ext4_bread
      __ext4_read_dirblock
      dx_probe
      ext4_htree_fill_tree
      ext4_readdir
      iterate_dir
      ksys_getdents64

    INFO: task toutiao.infra.t:2724456 blocked for more than 327 seconds.
    Call Trace:
      __schedule
      io_schedule
      __wait_on_bit_lock
      ext4_read_bh_lock
      ext4_bread
      __ext4_read_dirblock
      htree_dirblock_to_tree
      ext4_htree_fill_tree
      ext4_readdir
      iterate_dir
      ksys_getdents64
Approach
========
Record read failures on buffer_head (BH_Read_EIO + b_err_timestamp). When a
retry window is configured (sysfs: err_retry_sec), ext4 will skip submitting
another read for the buffer_head that already failed within the window and
return/unlock immediately. Clear the state on successful completion so the
buffer can recover if the error is transient.
err_retry_sec defaults to 0, which keeps the current behavior: after a read
error, callers may keep retrying the same read. Set it to a non-zero value
to throttle repeated reads within the window.
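
The window check described above can be modeled in user space roughly as
follows. This is a minimal sketch, not the code from the patch: struct
bh_model and read_is_throttled() are hypothetical names, and time is kept in
plain seconds here, whereas kernel code would more likely compare jiffies.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user-space stand-in for the per-buffer error state. */
struct bh_model {
	bool read_eio;          /* models BH_Read_EIO */
	uint64_t err_timestamp; /* models b_err_timestamp, in seconds */
};

/*
 * Return true if a new read should be skipped because the buffer failed
 * recently and the retry window has not expired yet.
 * A window of 0 disables throttling, which matches the default.
 */
static bool read_is_throttled(const struct bh_model *bh, uint64_t now_sec,
			      unsigned int err_retry_sec)
{
	if (err_retry_sec == 0 || !bh->read_eio)
		return false;
	return now_sec - bh->err_timestamp < err_retry_sec;
}
```

With a failure recorded at t=100 and a 10s window, a caller arriving at t=105
is throttled and returns immediately, while a caller arriving at t=110 is
allowed to submit the read again.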
Patch summary
=============
1) Add BH_Read_EIO, b_err_timestamp and a small helper for tracking read
failures on buffer_head.
2) Update end_buffer_read_sync() and end_buffer_write_sync() (success path)
to maintain that state.
3) Add ext4 sysfs knob err_retry_sec and throttle ext4 buffer_head reads
within the configured window.
4) Pass sb into ext4_read_bh_nowait(), ext4_read_bh() and ext4_read_bh_lock()
so __ext4_read_bh() can apply the per-sb retry window check.
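
Steps 1) and 2) amount to a small state machine on each buffer. The sketch
below models it in user space under stated assumptions: struct bh_err_state,
note_read_done() and note_write_done() are hypothetical names chosen for
illustration, and the real helpers in the patch may differ.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the error state tracked on a buffer_head. */
struct bh_err_state {
	bool read_eio;          /* models BH_Read_EIO */
	uint64_t err_timestamp; /* models b_err_timestamp, in seconds */
};

/* Read completion: record a failure, or clear the state on success. */
static void note_read_done(struct bh_err_state *st, bool uptodate,
			   uint64_t now_sec)
{
	if (uptodate) {
		st->read_eio = false;
		st->err_timestamp = 0;
	} else {
		st->read_eio = true;
		st->err_timestamp = now_sec;
	}
}

/* Write completion: only the success path touches the state. */
static void note_write_done(struct bh_err_state *st, bool uptodate)
{
	if (uptodate) {
		st->read_eio = false;
		st->err_timestamp = 0;
	}
}
```

Clearing on success is what lets a buffer recover from a transient error: the
next caller sees no recorded failure and reads normally.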
Diangang Li (1):
ext4: fail fast on repeated buffer_head reads after IO failure
fs/buffer.c | 2 ++
fs/ext4/balloc.c | 2 +-
fs/ext4/ext4.h | 13 ++++++----
fs/ext4/extents.c | 2 +-
fs/ext4/ialloc.c | 3 ++-
fs/ext4/indirect.c | 2 +-
fs/ext4/inode.c | 10 ++++----
fs/ext4/mmp.c | 2 +-
fs/ext4/move_extent.c | 2 +-
fs/ext4/resize.c | 2 +-
fs/ext4/super.c | 51 +++++++++++++++++++++++++++----------
fs/ext4/sysfs.c | 2 ++
include/linux/buffer_head.h | 16 ++++++++++++
13 files changed, 79 insertions(+), 30 deletions(-)
--
2.39.5

On Mon, Apr 13, 2026 at 02:24:59PM +0800, Diangang Li wrote:
> From: Diangang Li <lidiangang@bytedance.com>
>
> A production system reported hung tasks blocked for 300s+ in ext4
> buffer_head paths....
>
> [Tue Mar 24 14:16:24 2026] blk_update_request: I/O error, dev sdi,
> sector 10704150288 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
> [Tue Mar 24 14:16:25 2026] blk_update_request: I/O error, dev sdi,
> sector 10704488160 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
> [Tue Mar 24 14:16:26 2026] blk_update_request: I/O error, dev sdi,
> sector 10704382912 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0

I wonder whether the ext4 layer is the right place to handle this
sort of issue. For example, it could be handled by having a subsystem
scanning dmesg (or by wiring up notifications so block device errors
get sent to a userspace daemon), and when certain criteria are met, the
machine is automatically sent to hardware operations to run
diagnostics and (most likely) replace the failing disk.

It could also be handled in the driver or SCSI layer so the "fail
fast" semantics are handled there, so that it supports all file
systems, not just ext4. The SCSI layer also has more information
about the type of error; you might want to handle things like media
errors differently from Fibre Channel or iSCSI timeouts (which might
be something where "fail fast" is not appropriate).

By the time the error gets propagated up to the buffer head, we lose a
lot of detail about why the error took place. Also, in the long term
we will hopefully be moving away from using the buffer cache.

- Ted

On 4/13/26 8:47 PM, Theodore Tso wrote:
> On Mon, Apr 13, 2026 at 02:24:59PM +0800, Diangang Li wrote:
>> From: Diangang Li <lidiangang@bytedance.com>
>>
>> A production system reported hung tasks blocked for 300s+ in ext4
>> buffer_head paths....
>>
>> [Tue Mar 24 14:16:24 2026] blk_update_request: I/O error, dev sdi,
>> sector 10704150288 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
>> [Tue Mar 24 14:16:25 2026] blk_update_request: I/O error, dev sdi,
>> sector 10704488160 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
>> [Tue Mar 24 14:16:26 2026] blk_update_request: I/O error, dev sdi,
>> sector 10704382912 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
>
> I wonder whether the ext4 layer is the right place to handle this
> sort of issue. For example, it could be handled by having a subsystem
> scanning dmesg (or by wiring up notifications so block device errors
> get sent to a userspace daemon), and when certain criteria are met, the
> machine is automatically sent to hardware operations to run
> diagnostics and (most likely) replace the failing disk.
>
> It could also be handled in the driver or SCSI layer so the "fail
> fast" semantics are handled there, so that it supports all file
> systems, not just ext4. The SCSI layer also has more information
> about the type of error; you might want to handle things like media
> errors differently from Fibre Channel or iSCSI timeouts (which might
> be something where "fail fast" is not appropriate).
>
> By the time the error gets propagated up to the buffer head, we lose a
> lot of detail about why the error took place. Also, in the long term
> we will hopefully be moving away from using the buffer cache.
>
> - Ted

Hi Ted,

What about moving the fail-fast check into the buffer-head path
(submit_bh_wbc) so that it is not ext4-specific? We can update a
BH_Read_EIO bit in end_bio_bh_io_sync, and add a per-bdev/per-partition
sysfs knob for the retry window. That turns it into a generic guard for
buffer-head users, and it naturally goes away as buffer-head usage
shrinks.

We did think about doing this in the block layer (submit_bio) or in
SCSI/NVMe, but a generic solution there seems to need a per-device table
to cache the error LBAs. With buffer-head, we can keep the error state
on the bh itself.

I also checked f2fs (no buffer-head). It tracks repeated EIOs on
metadata/node pages to avoid infinite retry loops. How do you see that
compared with a buffer-head retry window?

Is either of these directions worth exploring further?

Thanks,
Diangang