Re: [RFC 1/1] ext4: fail fast on repeated metadata reads after IO failure

public inbox for linux-fsdevel@vger.kernel.org
 help / color / mirror / Atom feed

From: "Diangang Li" <lidiangang@bytedance.com>
To: "Matthew Wilcox" <willy@infradead.org>,
	 "Andreas Dilger" <adilger@dilger.ca>
Cc: "Diangang Li" <diangangli@gmail.com>, <tytso@mit.edu>,
	 <linux-ext4@vger.kernel.org>, <linux-fsdevel@vger.kernel.org>,
	 <linux-kernel@vger.kernel.org>, <changfengnan@bytedance.com>
Subject: Re: [RFC 1/1] ext4: fail fast on repeated metadata reads after IO failure
Date: Thu, 26 Mar 2026 20:09:11 +0800	[thread overview]
Message-ID: <7792457f-453b-4534-95bb-f932cbc62d12@bytedance.com> (raw)
In-Reply-To: <acP5-v85CwUQZlMB@casper.infradead.org>

On 3/25/26 11:06 PM, Matthew Wilcox wrote:
> On Wed, Mar 25, 2026 at 04:15:42AM -0600, Andreas Dilger wrote:
>> On Mar 25, 2026, at 03:33, Diangang Li <diangangli@gmail.com> wrote:
>>>
>>> From: Diangang Li <lidiangang@bytedance.com>
>>>
>>> ext4 metadata reads serialize on BH_Lock (lock_buffer). If the read fails,
>>> the buffer remains !Uptodate. With concurrent callers, each waiter can
>>> retry the same failing read after the previous holder drops BH_Lock. This
>>> amplifies device retry latency and may trigger hung tasks.
>>>
>>> In the normal read path the block driver already performs its own retries.
>>> Once the retries keep failing, re-submitting the same metadata read from
>>> the filesystem just amplifies the latency by serializing waiters on
>>> BH_Lock.
>>>
>>> Remember read failures on buffer_head and fail fast for ext4 metadata reads
>>> once a buffer has already failed to read. Clear the flag on successful
>>> read/write completion so the buffer can recover. ext4 read-ahead uses
>>> ext4_read_bh_nowait(), so it does not set the failure flag and remains
>>> best-effort.
>>
>> Not that the patch is bad, but if the BH_Read_EIO flag is set on a buffer
>> and it prevents other tasks from reading that block again, how would the
>> buffer ever become Uptodate to clear the flag?  There isn't enough state
>> in a 1-bit flag to have any kind of expiry and later retry.
> 
> I've been thinking about this problem too, albeit from a folio read
> perspective, not from a buffer_head read perspective.  You're quite
> right that one bit isn't enough.  The solution I was considering but
> haven't implemented yet was to tell all the current waiters that
> the IO has failed, but not set any kind of permanent error flag.
> 
> I was thinking about starting with this:
> 
> +++ b/include/linux/wait_bit.h
> @@ -10,6 +10,7 @@
>   struct wait_bit_key {
>          unsigned long           *flags;
>          int                     bit_nr;
> +       int                     error;
>          unsigned long           timeout;
>   };
> 
> 
> and then adding/changing various APIs to allow an error to be passed in
> and noticed by the woken task.
> 
> With this change, the thundering herd all wake up, see the error and
> return immediately instead of each submitting their own I/O.  New reads
> will retry the read, but each will only be held up for a maximum of
> their own timeout.

Hi Matthew and all,

Thanks. The idea of waking the current waiters with an error makes a lot 
of sense.

I’ve been considering a smaller change on the buffer_head side that 
might get most of the benefit without touching the generic wait_bit 
APIs. The idea is to tell whether taking BH_Lock required waiting. If we 
had to wait, and the buffer is already marked with 
buffer_read_io_error(), then just return -EIO and don’t submit another 
read. If we got the lock without waiting, still submit the read. That 
should stop the thundering herd from reissuing the same failing IO.

Another option is a simple retry window. After a read failure, don’t
retry for some period of time, and size that window by error type. For
persistent media errors (e.g. MEDIUM ERROR, repeated IO ERROR) the 
window could be effectively infinite, while for transient cases (e.g. 
few IO ERROR, BLK_STS_RESOURCE) it could be small.

Any opinions on these two approaches, or other ideas for this problem?

Thanks,
Diangang

     prev parent reply	other threads:[~2026-03-26 12:09 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-03-25  9:33 [RFC PATCH 0/1] ext4: fail fast on repeated metadata reads after IO failure Diangang Li
2026-03-25  9:33 ` [RFC 1/1] " Diangang Li
2026-03-25 10:15   ` Andreas Dilger
2026-03-25 11:13     ` Diangang Li
2026-03-25 14:27       ` Zhang Yi
2026-03-26  2:26         ` changfengnan
2026-03-26  7:42         ` Diangang Li
2026-03-26 11:09           ` Zhang Yi
2026-03-25 15:06     ` Matthew Wilcox
2026-03-26 12:09       ` Diangang Li [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=7792457f-453b-4534-95bb-f932cbc62d12@bytedance.com \
    --to=lidiangang@bytedance.com \
    --cc=adilger@dilger.ca \
    --cc=changfengnan@bytedance.com \
    --cc=diangangli@gmail.com \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=tytso@mit.edu \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox