From: Diangang Li <diangangli@gmail.com>
To: tytso@mit.edu, adilger.kernel@dilger.ca
Cc: linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, changfengnan@bytedance.com,
	Diangang Li <lidiangang@bytedance.com>
Subject: [RFC PATCH 0/1] ext4: fail fast on repeated metadata reads after IO failure
Date: Wed, 25 Mar 2026 17:33:48 +0800
Message-ID: <20260325093349.630193-1-diangangli@gmail.com>
From: Diangang Li <lidiangang@bytedance.com>
A production system reported hung tasks blocked for 300s+ in ext4 metadata
lookup paths. Hung task reports were accompanied by disk IO errors, but
profiling showed that most individual reads completed (or failed) within
10s, with the worst case around 60s.
At the same time, we observed the same disk LBAs being read over and over.
The repeated reads frequently took seconds to complete and ended with IO
errors, e.g.:
[Tue Mar 24 14:16:24 2026] blk_update_request: I/O error, dev sdi,
sector 10704150288 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[Tue Mar 24 14:16:25 2026] blk_update_request: I/O error, dev sdi,
sector 10704488160 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[Tue Mar 24 14:16:26 2026] blk_update_request: I/O error, dev sdi,
sector 10704382912 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
We also sampled repeated-LBA latency histograms on /dev/sdi and saw that
the same error-prone LBAs were re-submitted many times with ~1-4s latency:
LBA 10704488160 (count=22): 1-2s: 20, 2-4s: 2
LBA 10704382912 (count=21): 1-2s: 20, 2-4s: 1
LBA 10704150288 (count=21): 1-2s: 19, 2-4s: 2
Root cause
==========
ext4 metadata reads commonly use buffer_head caching and serialize IO via
BH_Lock (lock_buffer). When a read fails, the buffer remains !Uptodate.
With multiple threads concurrently accessing the same metadata block, each
waiter wakes up after the previous owner drops BH_Lock, then submits the
same read again and waits again. This makes the latency grow linearly with
the number of contending threads, leading to 300s+ hung tasks.
The failing IOs are repeatedly issued to the same LBA. The observed 1s+
per-IO latency is likely from device-side retry/error recovery. On SCSI the
driver typically retries reads several times (e.g. 5 retries in our
environment), so a single filesystem submission can easily accumulate 5s+
delay before failing. When multiple threads then re-submit the same failing
metadata read and serialize on BH_Lock, the delay is amplified into 300s+
hung tasks.
Similar behavior exists for other devices (e.g. NVMe with multiple internal
retries).
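The amplification described above can be put into numbers with a small
userspace model. This is only a sketch, not kernel code: struct read_model
and both helpers are hypothetical names, and the 60-waiter figure in the
usage note is an assumed value chosen for illustration, not a measurement.
It assumes every waiter that takes BH_Lock finds the buffer still !Uptodate
and re-submits the same failing read.

```c
/* Hypothetical model of serialized re-submission; not kernel code. */
struct read_model {
	unsigned long waiters;            /* threads contending for one buffer */
	unsigned long attempts;           /* device-level attempts per submission */
	unsigned long attempt_latency_ms; /* assumed latency of one attempt */
};

/* Time one filesystem submission takes before it fails. */
unsigned long submission_latency_ms(const struct read_model *m)
{
	return m->attempts * m->attempt_latency_ms;
}

/*
 * Wait seen by the last waiter: it queues behind every earlier
 * submission on BH_Lock and then pays for its own submission.
 */
unsigned long last_waiter_latency_ms(const struct read_model *m)
{
	return m->waiters * submission_latency_ms(m);
}
```

With attempts = 5 and attempt_latency_ms = 1000 (the 5s per submission
mentioned above), an assumed 60 waiters gives 60 * 5000 ms = 300000 ms,
i.e. the 300s range seen in the hung task reports.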
Example hung stacks:
INFO: task toutiao.infra.t:3760933 blocked for more than 327 seconds.
Call Trace:
 __schedule
 io_schedule
 __wait_on_bit_lock
 bh_uptodate_or_lock
 __read_extent_tree_block
 ext4_find_extent
 ext4_ext_map_blocks
 ext4_map_blocks
 ext4_getblk
 ext4_bread
 __ext4_read_dirblock
 dx_probe
 ext4_htree_fill_tree
 ext4_readdir
 iterate_dir
 ksys_getdents64

INFO: task toutiao.infra.t:2724456 blocked for more than 327 seconds.
Call Trace:
 __schedule
 io_schedule
 __wait_on_bit_lock
 ext4_read_bh_lock
 ext4_bread
 __ext4_read_dirblock
 htree_dirblock_to_tree
 ext4_htree_fill_tree
 ext4_readdir
 iterate_dir
 ksys_getdents64
Approach
========
Record read failures in the buffer_head and fail fast for ext4 metadata
reads once a buffer has already seen a read failure. Clear the flag on
successful read/write completion so the buffer can recover if the error is
transient.
Note that ext4 read-ahead uses ext4_read_bh_nowait(), so it does not set
the failure flag and remains best-effort.
Patch summary
=============
1) Add a BH_Read_EIO buffer_head state bit and helpers.
2) Clear BH_Read_EIO on successful read/write completion.
3) In ext4 metadata reads, if BH_Read_EIO is already set (and
   BH_Write_EIO is not), fail fast instead of re-submitting the same
   failing read. On read failure, set BH_Read_EIO.
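For readers who want the three steps in one place, here is a userspace
sketch of the intended state handling. It is not the patch itself: struct
mock_bh, mock_read_bh(), mock_end_write() and the two device callbacks are
hypothetical stand-ins, and the boolean fields only model the buffer_head
bits named above.

```c
#include <errno.h>
#include <stdbool.h>

/* Userspace stand-in for the buffer_head state bits; not kernel code. */
struct mock_bh {
	bool uptodate;            /* models BH_Uptodate */
	bool read_eio;            /* models the proposed BH_Read_EIO */
	bool write_eio;           /* models BH_Write_EIO */
	unsigned int submissions; /* reads actually sent to the "device" */
};

/* Hypothetical device callback: 0 on success, -EIO on failure. */
typedef int (*mock_submit_fn)(struct mock_bh *bh);

int mock_failing_device(struct mock_bh *bh)
{
	bh->submissions++;
	return -EIO;
}

int mock_healthy_device(struct mock_bh *bh)
{
	bh->submissions++;
	return 0;
}

/* Models a synchronous metadata read with the fail-fast check. */
int mock_read_bh(struct mock_bh *bh, mock_submit_fn submit)
{
	if (bh->uptodate)
		return 0;

	/* Step 3: an earlier read already failed, do not re-submit. */
	if (bh->read_eio && !bh->write_eio)
		return -EIO;

	if (submit(bh)) {
		bh->read_eio = true;  /* step 3: remember the failure */
		return -EIO;
	}

	bh->read_eio = false;         /* step 2: clear on successful read */
	bh->uptodate = true;
	return 0;
}

/* Models write completion; a successful write clears the read flag. */
void mock_end_write(struct mock_bh *bh, bool ok)
{
	if (ok) {
		bh->uptodate = true;
		bh->read_eio = false; /* step 2: clear on successful write */
	} else {
		bh->write_eio = true;
	}
}
```

In this model, 22 back-to-back mock_read_bh() calls against
mock_failing_device reach the device only once; the remaining 21 return
-EIO immediately. A later successful mock_end_write() clears the flag, so
the buffer is usable again.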
Diangang Li (1):
  ext4: fail fast on repeated metadata reads after IO failure

 fs/buffer.c                 |  2 ++
 fs/ext4/super.c             | 12 +++++++++++-
 include/linux/buffer_head.h |  2 ++
 3 files changed, 15 insertions(+), 1 deletion(-)
--
2.39.5