From: Diangang Li <diangangli@gmail.com>
To: tytso@mit.edu, adilger.kernel@dilger.ca
Cc: linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org,
linux-kernel@vger.kernel.org, changfengnan@bytedance.com,
Diangang Li <lidiangang@bytedance.com>
Subject: [RFC 1/1] ext4: fail fast on repeated metadata reads after IO failure
Date: Wed, 25 Mar 2026 17:33:49 +0800 [thread overview]
Message-ID: <20260325093349.630193-2-diangangli@gmail.com> (raw)
In-Reply-To: <20260325093349.630193-1-diangangli@gmail.com>
From: Diangang Li <lidiangang@bytedance.com>
ext4 metadata reads serialize on BH_Lock (lock_buffer). If the read fails,
the buffer remains !Uptodate. With concurrent callers, each waiter can
retry the same failing read after the previous holder drops BH_Lock. This
amplifies device retry latency and may trigger hung tasks.
In the normal read path the block driver already performs its own retries.
Once those retries are exhausted and the read still fails, re-submitting the
same metadata read from the filesystem only amplifies the latency, because
the waiters serialize on BH_Lock.
Remember read failures in the buffer_head state and fail fast on ext4
metadata reads once a buffer has already failed to read. Clear the flag on
successful read/write completion so the buffer can recover. ext4 read-ahead
uses ext4_read_bh_nowait(), so it does not set the failure flag and remains
best-effort.
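The intended effect can be sketched with a minimal userspace model (the
names model_bh, read_model_bh and waiters_hit_device are illustrative only,
not kernel API, and the serialized waiters are modeled as a sequential loop
since BH_Lock serializes them anyway): without a sticky read-error bit every
waiter re-submits the failing read; with it, only the first waiter touches
the device.

```c
#include <stdbool.h>

/* Hypothetical userspace model of a buffer_head. */
struct model_bh {
	bool uptodate;     /* stands in for BH_Uptodate */
	bool read_eio;     /* stands in for the new BH_Read_EIO */
	int device_reads;  /* reads that actually reach the device */
};

/* The device keeps failing, as it would after its own retries are
 * exhausted, so the buffer stays !uptodate. */
static void submit_read(struct model_bh *bh)
{
	bh->device_reads++;
	bh->uptodate = false;	/* read failed */
	bh->read_eio = true;	/* remember the failure */
}

/* One metadata read attempt: with fail_fast, return an error without
 * touching the device once a previous read has already failed. */
static int read_model_bh(struct model_bh *bh, bool fail_fast)
{
	if (fail_fast && bh->read_eio)
		return -5;		/* -EIO, no new submission */
	if (bh->uptodate)
		return 0;
	submit_read(bh);
	return bh->uptodate ? 0 : -5;
}

/* n serialized waiters; returns how many reads hit the device. */
int waiters_hit_device(int n, bool fail_fast)
{
	struct model_bh bh = { false, false, 0 };

	for (int i = 0; i < n; i++)
		read_model_bh(&bh, fail_fast);
	return bh.device_reads;
}
```

With 8 waiters the model submits 8 device reads without the flag and only
1 with it, which is the amplification the patch removes.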
Example hung stacks:
INFO: task toutiao.infra.t:3760933 blocked for more than 327 seconds.
Call Trace:
__schedule
io_schedule
__wait_on_bit_lock
bh_uptodate_or_lock
__read_extent_tree_block
ext4_find_extent
ext4_ext_map_blocks
ext4_map_blocks
ext4_getblk
ext4_bread
__ext4_read_dirblock
dx_probe
ext4_htree_fill_tree
ext4_readdir
iterate_dir
ksys_getdents64
INFO: task toutiao.infra.t:2724456 blocked for more than 327 seconds.
Call Trace:
__schedule
io_schedule
__wait_on_bit_lock
ext4_read_bh_lock
ext4_bread
__ext4_read_dirblock
htree_dirblock_to_tree
ext4_htree_fill_tree
ext4_readdir
iterate_dir
ksys_getdents64
Signed-off-by: Diangang Li <lidiangang@bytedance.com>
Reviewed-by: Fengnan Chang <changfengnan@bytedance.com>
---
 fs/buffer.c                 |  2 ++
 fs/ext4/super.c             | 12 +++++++++++-
 include/linux/buffer_head.h |  2 ++
 3 files changed, 15 insertions(+), 1 deletion(-)
diff --git a/fs/buffer.c b/fs/buffer.c
index 2d2e3ecec6b2b..b41d54b8b1f4d 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -145,6 +145,7 @@ static void __end_buffer_read_notouch(struct buffer_head *bh, int uptodate)
 {
 	if (uptodate) {
 		set_buffer_uptodate(bh);
+		clear_buffer_read_io_error(bh);
 	} else {
 		/* This happens, due to failed read-ahead attempts. */
 		clear_buffer_uptodate(bh);
@@ -167,6 +168,7 @@ void end_buffer_write_sync(struct buffer_head *bh, int uptodate)
 {
 	if (uptodate) {
 		set_buffer_uptodate(bh);
+		clear_buffer_read_io_error(bh);
 	} else {
 		buffer_io_error(bh, ", lost sync page write");
 		mark_buffer_write_io_error(bh);
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 781c083000c2e..89a99851864a0 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -198,7 +198,13 @@ int ext4_read_bh(struct buffer_head *bh, blk_opf_t op_flags,
 {
 	BUG_ON(!buffer_locked(bh));
 
+	if (!buffer_write_io_error(bh) && buffer_read_io_error(bh)) {
+		unlock_buffer(bh);
+		return -EIO;
+	}
+
 	if (ext4_buffer_uptodate(bh)) {
+		clear_buffer_read_io_error(bh);
 		unlock_buffer(bh);
 		return 0;
 	}
@@ -206,8 +212,12 @@ int ext4_read_bh(struct buffer_head *bh, blk_opf_t op_flags,
 	__ext4_read_bh(bh, op_flags, end_io, simu_fail);
 
 	wait_on_buffer(bh);
-	if (buffer_uptodate(bh))
+	if (buffer_uptodate(bh)) {
+		clear_buffer_read_io_error(bh);
 		return 0;
+	}
+	if (!buffer_write_io_error(bh))
+		set_buffer_read_io_error(bh);
 	return -EIO;
 }
 
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index b16b88bfbc3e7..be8bedcde379e 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -29,6 +29,7 @@ enum bh_state_bits {
 	BH_Delay,	/* Buffer is not yet allocated on disk */
 	BH_Boundary,	/* Block is followed by a discontiguity */
 	BH_Write_EIO,	/* I/O error on write */
+	BH_Read_EIO,	/* I/O error on read */
 	BH_Unwritten,	/* Buffer is allocated on disk but not written */
 	BH_Quiet,	/* Buffer Error Prinks to be quiet */
 	BH_Meta,	/* Buffer contains metadata */
@@ -132,6 +133,7 @@ BUFFER_FNS(Async_Write, async_write)
 BUFFER_FNS(Delay, delay)
 BUFFER_FNS(Boundary, boundary)
 BUFFER_FNS(Write_EIO, write_io_error)
+BUFFER_FNS(Read_EIO, read_io_error)
 BUFFER_FNS(Unwritten, unwritten)
 BUFFER_FNS(Meta, meta)
 BUFFER_FNS(Prio, prio)
--
2.39.5
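For readers unfamiliar with BUFFER_FNS: each invocation generates
set_/clear_/test accessors for one state bit, which is where the
set_buffer_read_io_error() family used above comes from. A userspace
approximation (plain bitwise ops standing in for the kernel's atomic
set_bit()/clear_bit()/test_bit(); the struct here is a stub, not the
real buffer_head):

```c
#include <stdbool.h>

/* Stub of the relevant state bits and buffer_head fields. */
enum bh_state_bits { BH_Uptodate, BH_Write_EIO, BH_Read_EIO };

struct buffer_head {
	unsigned long b_state;	/* one bit per bh_state_bits entry */
};

/* What BUFFER_FNS(Read_EIO, read_io_error) expands to, roughly. */
static void set_buffer_read_io_error(struct buffer_head *bh)
{
	bh->b_state |= 1UL << BH_Read_EIO;
}

static void clear_buffer_read_io_error(struct buffer_head *bh)
{
	bh->b_state &= ~(1UL << BH_Read_EIO);
}

static bool buffer_read_io_error(const struct buffer_head *bh)
{
	return bh->b_state & (1UL << BH_Read_EIO);
}
```

Setting and clearing the bit only touches BH_Read_EIO, so BH_Write_EIO
and the other state bits are unaffected.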