Linux EXT4 FS development
 help / color / mirror / Atom feed
From: Baokun Li <libaokun@linux.alibaba.com>
To: linux-ext4@vger.kernel.org
Cc: tytso@mit.edu, adilger.kernel@dilger.ca, jack@suse.cz,
	yi.zhang@huawei.com, ojaswin@linux.ibm.com,
	ritesh.list@gmail.com, peng_wang@linux.alibaba.com
Subject: Re: [PATCH 2/2] ext4: base unaligned DIO lock decision on partial block zeroing
Date: Tue, 16 Jun 2026 21:10:12 +0800	[thread overview]
Message-ID: <060f63e0-d64f-40df-99a7-af53862049ee@linux.alibaba.com> (raw)
In-Reply-To: <20260611163441.2431805-3-libaokun@linux.alibaba.com>

Hi all,

Thank you for your review!

After extensive testing, I found that after merging this patch, generic/746
started failing intermittently on ext3 (mkfs.ext4 -O ^extents).  The test
triggers a "Page cache invalidation failure on direct I/O" warning, and
subsequent fsync returns -EIO.

The underlying race existed before this patch, but this patch appears to
have widened the reproduction window considerably, so I thought it worth
trying to address.  Here is my analysis:

On no-extent inodes, DIO writes that hit holes cannot use unwritten
extents.  ext4_iomap_alloc() leaves m_flags=0, so ext4_map_blocks()
returns 0 for a hole, and:

        if (!m_flags && !ret)
                ret = -ENOTBLK;

The iomap layer returns -ENOTBLK to ext4, which falls back to buffered
I/O.  The fallback path dirties pages in the page cache, then flushes
and invalidates them.  However, concurrent async DIO completions to
other blocks on the same inode can run kiocb_invalidate_post_direct_write()
without holding the inode lock.

Consider a file with two 4k extents: [hole][written].  Thread A does DIO
to the written extent, while thread B does DIO spanning both extents:

  kworker A (4k DIO, allocated block)    kworker B (8k DIO, hole->fallback)
  -----------------------------------    -----------------------------------
  inode_lock_shared()                    inode_lock_shared()
  iomap_dio_rw():                        iomap_dio_rw():
    kiocb_invalidate_pages -> clean        iomap_begin -> -ENOTBLK
    submit_bio (async)                     dio->size = 0
  inode_unlock_shared()                  inode_unlock_shared()

  [bio pending in block layer]           /* fallback: inode lock released */
                                         ext4_buffered_write_iter()
                                           inode_lock(exclusive)
                                           generic_perform_write()
                                             -> dirty pages [0, 8k]
                                           inode_unlock(exclusive)

                                         /* pages still dirty here */
  [bio completes]                        filemap_write_and_wait_range()
  iomap_dio_complete()                     -> flush dirty pages
    kiocb_invalidate_post_direct_write() invalidate_mapping_pages()
      invalidate_inode_pages2_range()
      -> finds dirty page!               /* window closed */
      -> dio_warn_stale_pagecache()
      -> errseq_set(-EIO)

The critical window is the gap between ext4_buffered_write_iter() dirtying
pages and filemap_write_and_wait_range() flushing them.  In this window the
inode lock is not held, so another thread's async DIO completion is free to
invalidate the still-dirty pages in the page cache.

This race has always existed on ext3 because indirect-block inodes lack
unwritten-extent support.  However, the window was extremely narrow in
practice, because the old ext4_overwrite_io() checked every block and
would conservatively take an exclusive lock.  This patch replaced it
with ext4_dio_needs_zeroing(), which only checks head and tail blocks,
making unaligned DIO more likely to take a shared lock and
proportionally increasing the chance of hitting the race.

I tried a couple of alternatives before settling on the patch below:

1. Force exclusive lock + IOMAP_DIO_FORCE_WAIT for all no-extent DIO.
   This closes the window for new DIO submissions, but does not protect
   against bio completions from previously submitted async DIO, which
   run independently of the inode lock.

2. Wrap the fallback dirty+flush+invalidate sequence in
   filemap_invalidate_lock().  However, the ext4 DIO and iomap layers
   do not use this lock, so it would not serialise against DIO
   completions.

One straightforward approach that seems correct is to skip direct I/O
for no-extent inodes entirely, by returning 0 from ext4_dio_alignment():

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -6131,6 +6131,8 @@ u32 ext4_dio_alignment(struct inode *inode)
 {
        if (fsverity_active(inode))
                return 0;
+       if (!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
+               return 0;
        if (ext4_should_journal_data(inode))
                return 0;
        if (ext4_has_inline_data(inode))

With this, ext4_should_use_dio() returns false for no-extent inodes, and
all I/O goes through ext4_buffered_write_iter() directly, bypassing the
DIO path entirely.  On ext3, DIO to a hole already falls back to buffered
I/O, so there is essentially no performance benefit to using DIO in the
first place.

Note that with this change, the fallback branch in ext4_dio_write_iter():

        if (ret >= 0 && iov_iter_count(from)) {
                /* buffered fallback */
        }

would also become dead code for extent-based inodes (since unwritten
extents guarantee iomap_dio_rw() never returns zero with unconsumed
data), and could be removed in a follow-up cleanup.

Thoughts?  Is there a reason to preserve DIO on no-extent inodes that
I'm missing?

Looking forward to your feedback.


Thanks,
Baokun



  parent reply	other threads:[~2026-06-16 13:10 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-11 16:34 [PATCH 0/2] ext4: allow more DIO writes under shared i_rwsem Baokun Li
2026-06-11 16:34 ` [PATCH 1/2] ext4: skip overwrite check for aligned non-extending DIO writes Baokun Li
2026-06-12 12:46   ` Jan Kara
2026-06-15  3:24   ` Zhang Yi
2026-06-11 16:34 ` [PATCH 2/2] ext4: base unaligned DIO lock decision on partial block zeroing Baokun Li
2026-06-12 12:55   ` Jan Kara
2026-06-15  3:28   ` Zhang Yi
2026-06-16 13:10   ` Baokun Li [this message]
2026-06-17  2:45     ` Zhang Yi
2026-06-17  7:52       ` Baokun Li
2026-06-17 10:54         ` Zhang Yi
2026-06-17 11:08         ` Jan Kara
2026-06-18  9:49           ` Baokun Li

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=060f63e0-d64f-40df-99a7-af53862049ee@linux.alibaba.com \
    --to=libaokun@linux.alibaba.com \
    --cc=adilger.kernel@dilger.ca \
    --cc=jack@suse.cz \
    --cc=linux-ext4@vger.kernel.org \
    --cc=ojaswin@linux.ibm.com \
    --cc=peng_wang@linux.alibaba.com \
    --cc=ritesh.list@gmail.com \
    --cc=tytso@mit.edu \
    --cc=yi.zhang@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox