Linux EXT4 FS development
 help / color / mirror / Atom feed
From: Zhang Yi <yizhang089@gmail.com>
To: Baokun Li <libaokun@linux.alibaba.com>, linux-ext4@vger.kernel.org
Cc: tytso@mit.edu, adilger.kernel@dilger.ca, jack@suse.cz,
	yi.zhang@huawei.com, ojaswin@linux.ibm.com,
	ritesh.list@gmail.com, peng_wang@linux.alibaba.com
Subject: Re: [PATCH 2/2] ext4: base unaligned DIO lock decision on partial block zeroing
Date: Wed, 17 Jun 2026 10:45:10 +0800	[thread overview]
Message-ID: <d1adcf7c-c276-458d-9cac-68a4410f7626@gmail.com> (raw)
In-Reply-To: <060f63e0-d64f-40df-99a7-af53862049ee@linux.alibaba.com>

Hi, Baokun!

On 6/16/2026 9:10 PM, Baokun Li wrote:
> Hi all,
> 
> Thank you for your review!
> 
> After extensive testing, I found that after merging this patch, generic/746
> started failing intermittently on ext3 (mkfs.ext4 -O ^extents).  The test
> triggers a "Page cache invalidation failure on direct I/O" warning, and
> subsequent fsync returns -EIO.
> 
> The underlying race existed before this patch, but this patch appears to
> have widened the reproduction window considerably, so I thought it worth
> trying to address.  Here is my analysis:
> 
> On no-extent inodes, DIO writes that hit holes cannot use unwritten
> extents.  ext4_iomap_alloc() leaves m_flags=0, so ext4_map_blocks()
> returns 0 for a hole, and:
> 
>          if (!m_flags && !ret)
>                  ret = -ENOTBLK;
> 
> The iomap layer returns -ENOTBLK to ext4, which falls back to buffered
> I/O.  The fallback path dirties pages in the page cache, then flushes
> and invalidates them.  However, concurrent async DIO completions to
> other blocks on the same inode can run kiocb_invalidate_post_direct_write()
> without holding the inode lock.
> 
> Consider a file with two 4k extents: [hole][written].  Thread A does DIO
> to the written extent, while thread B does DIO spanning both extents:
> 
>    kworker A (4k DIO, allocated block)    kworker B (8k DIO, hole->fallback)
>    -----------------------------------    -----------------------------------
>    inode_lock_shared()                    inode_lock_shared()
>    iomap_dio_rw():                        iomap_dio_rw():
>      kiocb_invalidate_pages -> clean        iomap_begin -> -ENOTBLK
>      submit_bio (async)                     dio->size = 0
>    inode_unlock_shared()                  inode_unlock_shared()
> 
>    [bio pending in block layer]           /* fallback: inode lock released */
>                                           ext4_buffered_write_iter()
>                                             inode_lock(exclusive)
>                                             generic_perform_write()
>                                               -> dirty pages [0, 8k]
>                                             inode_unlock(exclusive)
> 
>                                           /* pages still dirty here */
>    [bio completes]                        filemap_write_and_wait_range()
>    iomap_dio_complete()                     -> flush dirty pages
>      kiocb_invalidate_post_direct_write() invalidate_mapping_pages()
>        invalidate_inode_pages2_range()
>        -> finds dirty page!               /* window closed */
>        -> dio_warn_stale_pagecache()
>        -> errseq_set(-EIO)
> 

It looks like this issue occurs when invalidate_inode_pages2_range()
checks beyond the DIO write range, which may only happen when folio size
is larger than block size. Is that correct?

> The critical window is the gap between ext4_buffered_write_iter() dirtying
> pages and filemap_write_and_wait_range() flushing them.  In this window the
> inode lock is not held, so another thread's async DIO completion is free to
> invalidate the still-dirty pages in the page cache.
> 
> This race has always existed on ext3 because indirect-block inodes lack
> unwritten-extent support.  However, the window was extremely narrow in
> practice, because the old ext4_overwrite_io() checked every block and
> would conservatively take an exclusive lock.  This patch replaced it
> with ext4_dio_needs_zeroing(), which only checks head and tail blocks,
> making unaligned DIO more likely to take a shared lock and
> proportionally increasing the chance of hitting the race.
> 
> I tried a couple of alternatives before settling on the patch below:
> 
> 1. Force exclusive lock + IOMAP_DIO_FORCE_WAIT for all no-extent DIO.
>     This closes the window for new DIO submissions, but does not protect
>     against bio completions from previously submitted async DIO, which
>     run independently of the inode lock.
> 
> 2. Wrap the fallback dirty+flush+invalidate sequence in
>     filemap_invalidate_lock().  However, the ext4 DIO and iomap layers
>     do not use this lock, so it would not serialise against DIO
>     completions.
> 

Could we add a call to inode_dio_wait() before falling back to buffered
I/O? That is, in thread B, when falling back to buffered I/O, could we
acquire the exclusive inode lock and then call inode_dio_wait() to wait
for in-flight DIO to complete? This should close the race window. Since
scenarios where DIO writes to holes on ext3 are relatively rare, the
performance impact should be minimal (I suppose).

> One straightforward approach that seems correct is to skip direct I/O
> for no-extent inodes entirely, by returning 0 from ext4_dio_alignment():
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -6131,6 +6131,8 @@ u32 ext4_dio_alignment(struct inode *inode)
>   {
>          if (fsverity_active(inode))
>                  return 0;
> +       if (!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
> +               return 0;
>          if (ext4_should_journal_data(inode))
>                  return 0;
>          if (ext4_has_inline_data(inode))
> 
> With this, ext4_should_use_dio() returns false for no-extent inodes, and
> all I/O goes through ext4_buffered_write_iter() directly, bypassing the
> DIO path entirely.  On ext3, DIO to a hole already falls back to buffered
> I/O, so there is essentially no performance benefit to using DIO in the
> first place.
> 
> Note that with this change, the fallback branch in ext4_dio_write_iter():
> 
>          if (ret >= 0 && iov_iter_count(from)) {
>                  /* buffered fallback */
>          }
> 
> would also become dead code for extent-based inodes (since unwritten
> extents guarantee iomap_dio_rw() never returns zero with unconsumed
> data), and could be removed in a follow-up cleanup.
> 
> Thoughts?  Is there a reason to preserve DIO on no-extent inodes that
> I'm missing?
> 

Hmm, this would also cause DIO to fall back to buffered I/O in common
extending write cases, which I think would be unacceptable.

Cheers,
Yi.

> Looking forward to your feedback.
> 
> 
> Thanks,
> Baokun
> 
> 
> 


  reply	other threads:[~2026-06-17  2:45 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-11 16:34 [PATCH 0/2] ext4: allow more DIO writes under shared i_rwsem Baokun Li
2026-06-11 16:34 ` [PATCH 1/2] ext4: skip overwrite check for aligned non-extending DIO writes Baokun Li
2026-06-12 12:46   ` Jan Kara
2026-06-15  3:24   ` Zhang Yi
2026-06-11 16:34 ` [PATCH 2/2] ext4: base unaligned DIO lock decision on partial block zeroing Baokun Li
2026-06-12 12:55   ` Jan Kara
2026-06-15  3:28   ` Zhang Yi
2026-06-16 13:10   ` Baokun Li
2026-06-17  2:45     ` Zhang Yi [this message]
2026-06-17  7:52       ` Baokun Li
2026-06-17 10:54         ` Zhang Yi
2026-06-17 11:08         ` Jan Kara
2026-06-18  9:49           ` Baokun Li

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=d1adcf7c-c276-458d-9cac-68a4410f7626@gmail.com \
    --to=yizhang089@gmail.com \
    --cc=adilger.kernel@dilger.ca \
    --cc=jack@suse.cz \
    --cc=libaokun@linux.alibaba.com \
    --cc=linux-ext4@vger.kernel.org \
    --cc=ojaswin@linux.ibm.com \
    --cc=peng_wang@linux.alibaba.com \
    --cc=ritesh.list@gmail.com \
    --cc=tytso@mit.edu \
    --cc=yi.zhang@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox