public inbox for linux-ext4@vger.kernel.org
* [RFC v1 0/1] buffer_head: fail fast on repeated reads after I/O errors
@ 2026-05-06 13:50 Diangang Li
  2026-05-06 13:50 ` [RFC v1 1/1] " Diangang Li
  0 siblings, 1 reply; 2+ messages in thread
From: Diangang Li @ 2026-05-06 13:50 UTC (permalink / raw)
  To: axboe, viro, brauner
  Cc: linux-block, linux-ext4, linux-fsdevel, changfengnan, Diangang Li

From: Diangang Li <lidiangang@bytedance.com>

A production system reported hung tasks blocked for 300s+ in ext4
buffer_head paths. Hung task reports were accompanied by disk I/O errors,
but profiling showed that most individual reads completed (or failed)
within 10s, with the worst case around 60s.

At the same time, we observed a high rate of repeated reads to the same
disk LBAs. The repeated reads frequently showed seconds-level latency and
ended with
I/O errors, e.g.:

  [Tue Mar 24 14:16:24 2026] blk_update_request: I/O error, dev sdi,
      sector 10704150288 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
  [Tue Mar 24 14:16:25 2026] blk_update_request: I/O error, dev sdi,
      sector 10704488160 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
  [Tue Mar 24 14:16:26 2026] blk_update_request: I/O error, dev sdi,
      sector 10704382912 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0

We also sampled repeated-LBA latency histograms on /dev/sdi and saw that
the same error-prone LBAs were re-submitted many times with ~1-4s latency:

  LBA 10704488160 (count=22): 1-2s: 20, 2-4s: 2
  LBA 10704382912 (count=21): 1-2s: 20, 2-4s: 1
  LBA 10704150288 (count=21): 1-2s: 19, 2-4s: 2

Root cause
==========

buffer_head reads serialize I/O via BH_Lock. When one read fails, the
buffer remains !Uptodate. With multiple threads concurrently accessing
the same buffer_head, each waiter wakes up after the previous owner drops
BH_Lock, then submits the same read again and waits again. This makes the
latency grow linearly with the number of contending threads, leading to
300s+ hung tasks.

The failing I/Os are repeatedly issued to the same LBA. The observed 1s+
per-I/O latency is likely from device-side retry/error recovery. On SCSI
the driver typically retries reads several times (e.g. 5 retries in our
environment), so a single filesystem submission can easily accumulate 5s+
delay before failing. When multiple threads then re-submit the same
failing read and serialize on BH_Lock, the delay is amplified into 300s+
hung tasks.

Similar behavior exists for other devices (e.g. NVMe with multiple
internal retries).

Example hung stacks:

  INFO: task toutiao.infra.t:3760933 blocked for more than 327 seconds.
  Call Trace:
   __schedule
   io_schedule
   __wait_on_bit_lock
   bh_uptodate_or_lock
   __read_extent_tree_block
   ext4_find_extent
   ext4_ext_map_blocks
   ext4_map_blocks
   ext4_getblk
   ext4_bread
   __ext4_read_dirblock
   dx_probe
   ext4_htree_fill_tree
   ext4_readdir
   iterate_dir
   ksys_getdents64

  INFO: task toutiao.infra.t:2724456 blocked for more than 327 seconds.
  Call Trace:
   __schedule
   io_schedule
   __wait_on_bit_lock
   ext4_read_bh_lock
   ext4_bread
   __ext4_read_dirblock
   htree_dirblock_to_tree
   ext4_htree_fill_tree
   ext4_readdir
   iterate_dir
   ksys_getdents64

This series follows an earlier ext4-only RFC and moves the policy to the
generic buffer_head path so other buffer_head users can opt in with the
same per-block-device knob.

Approach
========

Record non-readahead read failures on buffer_head (BH_Read_EIO +
b_err_timestamp). When a per-bdev retry window is configured, submit_bh()
will skip submitting another non-readahead read for a buffer_head that
already failed within the window and complete it immediately with failure.
Clear the state on successful read or rewrite so the buffer can recover
if the error is transient.

The timestamp is recorded on the first failure only, so repeated failures
do not extend the retry window. After the window expires, the next
non-readahead read is submitted normally and can discover that the device
or media has recovered.

The retry window is configured per block device:

  /sys/block/<disk>/read_err_retry_sec
  /sys/block/<disk>/<part>/read_err_retry_sec

The default value is 0, which keeps the current behavior: after a read
error, callers may keep retrying the same read. Set it to a non-zero
value to fail repeated non-readahead reads fast within the window.

Patch summary
=============

  1) Add BH_Read_EIO and b_err_timestamp to buffer_head.
  2) Track non-readahead read failures in the submit_bh() bio completion
     path.
  3) Add per-bdev read_err_retry_sec sysfs knobs for disks and partitions.
  4) Fail repeated non-readahead submit_bh() reads fast within the
     configured window, while leaving readahead and other bio users
     unchanged.

Diangang Li (1):
  buffer_head: fail fast on repeated reads after I/O errors

 Documentation/ABI/stable/sysfs-block | 26 +++++++++++
 block/genhd.c                        | 24 ++++++++++
 block/partitions/core.c              | 24 ++++++++++
 fs/buffer.c                          | 65 ++++++++++++++++++++++++++++
 include/linux/blk_types.h            |  3 ++
 include/linux/buffer_head.h          | 10 +++++
 6 files changed, 152 insertions(+)

-- 
2.39.5
