* [PATCH v3 0/9] ext4: allow more DIO writes under shared i_rwsem
From: Baokun Li @ 2026-06-26 8:35 UTC (permalink / raw)
To: linux-ext4
Cc: tytso, adilger.kernel, jack, yi.zhang, ojaswin, ritesh.list,
peng_wang
Changes since v2:
* Collect RVB from Honza and Yi. (Thank you for your review!)
* Patch 6: improved EXT4_GET_BLOCKS_CACHED_NOWAIT handling - cache-hit
path now uses "goto found" to run check_block_validity(), closing a
security bypass for malicious extents from crafted filesystem images.
* Added Patch 9 to fix NOWAIT semantic violation in DAX extending writes
reported by Sashiko.
Changes since v1:
* Collect RVB from Honza and Yi. (Thank you for your review!)
* Added Patch 1 to fix NOWAIT issues reported by Sashiko.
* Added Patch 2 to fix ext3 DIO and DIO fallback data race issue.
(Patch 4 increases the probability of this race)
* Added Patches 5-8 to fix other NOWAIT issues discovered during
investigation.
v1: https://lore.kernel.org/linux-ext4/20260611163441.2431805-1-libaokun@linux.alibaba.com/
v2: https://lore.kernel.org/linux-ext4/20260618125735.4156639-1-libaokun@linux.alibaba.com/
======
Hi all,
This series relaxes the i_rwsem requirements of ext4_dio_write_iter()
so that more direct I/O writes can proceed under the shared lock.
It continues the work started by Peng Wang's RFC [1]; I'm taking
over this effort going forward.
ext4_dio_write_checks() currently calls ext4_overwrite_io() to decide
whether the shared lock is sufficient. Its single ext4_map_blocks()
lookup only sees the first contiguous extent of the same type, which
forces the exclusive lock for two cases that are actually safe under
the shared lock (see individual patches for the full safety
argument):
1. Aligned writes spanning multiple already-allocated extents (e.g.
written + unwritten, or two discontiguous written extents).
2. Unaligned writes whose head/tail partial blocks land on written
extents but whose fully covered middle blocks include holes or
unwritten extents.
Patch 1 fixes a NOWAIT issue where ext4_iomap_alloc() may sleep when
IOMAP_NOWAIT is set.
Patch 2 fixes a data race between DIO completion and buffered I/O
fallback on ext3 (no-extent inodes). This race was made more likely
by Patch 4.
Patch 3 skips the ext4_overwrite_io() pre-check entirely for aligned
non-extending writes, letting them proceed under the shared lock
regardless of extent state.
Patch 4 replaces ext4_overwrite_io() with ext4_dio_needs_zeroing(),
which directly answers the question driving the lock decision. It
checks only the head and tail partial blocks (at most two
ext4_map_blocks() calls), and ignores the state of middle blocks.
Patch 5 fixes a NOWAIT issue by using kiocb_modified instead of
file_modified in DIO/DAX write paths.
Patch 6 improves EXT4_GET_BLOCKS_CACHED_NOWAIT handling in
ext4_map_blocks(): the cache-hit path now jumps to validation instead
of returning directly, a cache miss returns -EAGAIN instead of 0, and
a WARN_ON_ONCE asserts that CREATE and CACHED_NOWAIT are never
combined.
Patch 7 adds cache-only lookup support to ext4_iomap_begin() for
IOMAP_NOWAIT requests.
Patch 8 adds cache-only lookup support to ext4_dio_needs_zeroing()
for IOCB_NOWAIT requests.
Patch 9 fixes a NOWAIT semantic violation in DAX extending writes
where ext4_journal_start() could sleep when the write extends past
i_disksize.
Testing
=======
"kvm-xfstests -c ext4/all -g auto" passes with no new failures.
Performance
===========
Hardware: /dev/sda (rotational disk, ~1 GB/s sustained write)
Filesystem: ext4 default mkfs
Test 1: aligned 8K DIO writes spanning written+unwritten extent
boundaries. Each thread writes its own 1G region sequentially; the
file is rebuilt between runs so every block is written exactly once.
Metric: IOPS.
JOBS  base     +patch 3  +patch 3+4  speedup
----  -------  --------  ----------  -------
1     42,322   43,329    43,087      1.02x
2     68,516   70,677    66,958      0.98x
4     62,489   97,072    101,468     1.62x
8     58,701   110,819   113,679     1.94x
16    58,569   116,392   115,272     1.97x
32    60,860   117,244   119,621     1.97x
Wall time at JOBS=32: 69.2s (base) -> 35.4s (patched), 1.96x faster.
Test 2: unaligned DIO writes (14336 bytes at +512 within each 16K
stripe). Each stripe is laid out as [written][unwritten][unwritten]
[written], so the head and tail partial blocks land on written
extents but the middle is unwritten. Metric: IOPS.
JOBS  base     +patch 3  +patch 3+4  speedup
----  -------  --------  ----------  -------
1     15,547   15,975    17,381      1.12x
2     15,910   14,808    34,172      2.15x
4     15,014   14,828    57,567      3.83x
8     15,022   14,648    81,947      5.46x
16    14,586   14,262    99,126      6.80x
32    14,047   13,809    92,519      6.59x
Wall time at JOBS=32: 149.3s (base) -> 22.7s (patched), 6.58x faster.
In test 2, patch 3 alone has no effect (slight noise) because patch 3
only touches the aligned write path. Patch 4 introduces
ext4_dio_needs_zeroing() which precisely identifies when partial
block zeroing is required, allowing the shared lock for the much
larger set of unaligned writes that don't actually trigger zeroing.
Comments and questions are, as always, welcome.
Thanks,
Baokun
[1]: https://lore.kernel.org/linux-ext4/20260607124935.6168-1-peng_wang@linux.alibaba.com/
Baokun Li (9):
ext4: prevent sleeping allocation in NOWAIT write path
ext4: drain in-flight DIO before buffered write fallback
ext4: skip overwrite check for aligned non-extending DIO writes
ext4: base unaligned DIO lock decision on partial block zeroing
ext4: use kiocb_modified instead of file_modified in DIO/DAX write
path
ext4: improve EXT4_GET_BLOCKS_CACHED_NOWAIT handling in
ext4_map_blocks
ext4: handle IOMAP_NOWAIT in ext4_iomap_begin() with cache-only lookup
ext4: handle IOCB_NOWAIT in ext4_dio_needs_zeroing() with cache-only
lookup
ext4: fix NOWAIT semantic violation in DAX extending writes
fs/ext4/file.c | 154 ++++++++++++++++++++++++++++++++++--------------
fs/ext4/inode.c | 22 +++++--
2 files changed, 127 insertions(+), 49 deletions(-)
--
2.43.7
* [PATCH v3 1/9] ext4: prevent sleeping allocation in NOWAIT write path
From: Baokun Li @ 2026-06-26 8:35 UTC (permalink / raw)
To: linux-ext4
Cc: tytso, adilger.kernel, jack, yi.zhang, ojaswin, ritesh.list,
peng_wang, Sashiko
Block allocation requires journal access which may sleep, violating
NOWAIT semantics. Return -EAGAIN early when IOMAP_NOWAIT is set,
allowing the caller to retry without the NOWAIT constraint.
This ensures that write paths using IOMAP_NOWAIT (e.g., DIO with
RWF_NOWAIT) will not block on journal operations when blocks need
to be allocated.
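From the userspace side, the retry described above might look like the sketch below. It is an illustration only: write_nowait_then_block() and nowait_demo() are hypothetical names that do not appear in this series, and the demo uses an ordinary buffered scratch file, so it exercises only the fallback logic. A real DIO caller would open the file with O_DIRECT and use suitably aligned buffers and offsets.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef RWF_NOWAIT
#define RWF_NOWAIT 0x00000008	/* same value as the Linux UAPI header */
#endif

/*
 * Hypothetical helper: try a non-blocking write first and fall back to a
 * blocking one when the kernel reports that it would have to sleep
 * (EAGAIN) or that NOWAIT is not supported for this file (EOPNOTSUPP).
 * If nowait_refused is non-NULL it records whether the fallback was taken.
 */
static ssize_t write_nowait_then_block(int fd, const void *buf, size_t len,
				       off_t off, int *nowait_refused)
{
	struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
	ssize_t ret;

	assert(buf != NULL);
	if (nowait_refused)
		*nowait_refused = 0;

	ret = pwritev2(fd, &iov, 1, off, RWF_NOWAIT);
	if (ret >= 0 || (errno != EAGAIN && errno != EOPNOTSUPP))
		return ret;

	/* The kernel declined to proceed without blocking: retry normally. */
	if (nowait_refused)
		*nowait_refused = 1;
	return pwrite(fd, buf, len, off);
}

/* Round-trip check on a scratch file; returns 0 on success, -1 otherwise. */
static int nowait_demo(void)
{
	char path[] = "/tmp/nowait-demo-XXXXXX";
	char back[5];
	int refused = -1, ok;
	int fd = mkstemp(path);

	if (fd < 0)
		return -1;
	ok = write_nowait_then_block(fd, "hello", 5, 0, &refused) == 5 &&
	     (refused == 0 || refused == 1) &&
	     pread(fd, back, sizeof(back), 0) == 5 &&
	     memcmp(back, "hello", 5) == 0;
	close(fd);
	unlink(path);
	return ok ? 0 : -1;
}
```

Whether the first attempt succeeds depends on the filesystem and on whether blocks must be allocated; with this patch, an ext4 DIO write that needs allocation is expected to take the fallback branch.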
Reported-by: Sashiko <sashiko-bot@kernel.org>
Closes: https://sashiko.dev/#/patchset/20260611163441.2431805-1-libaokun@linux.alibaba.com?part=1
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Baokun Li <libaokun@linux.alibaba.com>
---
fs/ext4/inode.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index c2c2d6ac7f3d..832794294ccf 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3672,6 +3672,9 @@ static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
int ret, dio_credits, m_flags = 0, retries = 0;
bool force_commit = false;
+ if (flags & IOMAP_NOWAIT)
+ return -EAGAIN;
+
/*
* Trim the mapping request to the maximum value that we can map at
* once for direct I/O.
--
2.43.7
* [PATCH v3 2/9] ext4: drain in-flight DIO before buffered write fallback
From: Baokun Li @ 2026-06-26 8:35 UTC (permalink / raw)
To: linux-ext4
Cc: tytso, adilger.kernel, jack, yi.zhang, ojaswin, ritesh.list,
peng_wang
generic/746 started failing intermittently on ext3 (no-extent inodes).
The test triggers 'Page cache invalidation failure on direct I/O'
warnings and subsequent fsync returns -EIO. Adding a 50ms delay
between ext4_buffered_write_iter() and filemap_write_and_wait_range()
in ext4_dio_write_iter() makes the race almost always reproducible.
On no-extent inodes, DIO writes to holes cannot use unwritten extents,
so ext4_iomap_alloc() leaves m_flags=0 and ext4_map_blocks() returns 0.
The iomap layer then returns -ENOTBLK, causing fallback to buffered I/O.
The fallback path in ext4_dio_write_iter() calls
ext4_buffered_write_iter() which dirties pages, then does flush and
invalidate. However, there's an unprotected window between
ext4_buffered_write_iter() returning (with inode lock released) and
the subsequent flush+invalidate.
Concurrent async DIO completions from other threads can run
kiocb_invalidate_post_direct_write() during this window. If pages have
been re-dirtied, post-invalidation finds dirty pages and triggers the
warning, setting -EIO in the error sequence.
Consider a file with two 4k extents: [hole][written]. Thread A does
DIO to the written extent, while thread B does DIO spanning both:
kworker A (4k DIO, allocated block) kworker B (8k DIO, fallback)
----------------------------------- ----------------------------
inode_lock_shared() inode_lock_shared()
iomap_dio_rw(): iomap_dio_rw():
kiocb_invalidate_pages -> clean iomap_begin -> -ENOTBLK
submit_bio (async) dio->size = 0
inode_unlock_shared() inode_unlock_shared()
[bio pending in block layer] /* fallback: lock released */
ext4_buffered_write_iter()
inode_lock(exclusive)
generic_perform_write()
-> dirty pages [0, 8k]
inode_unlock(exclusive)
/* pages dirty, no lock */
[bio completes] filemap_write_and_wait_range()
iomap_dio_complete() -> flush dirty pages
kiocb_invalidate_post_direct_write() invalidate_mapping_pages()
invalidate_inode_pages2_range()
-> finds dirty page!
-> dio_warn_stale_pagecache()
-> errseq_set(-EIO)
This issue can be triggered through normal I/O paths, not just
intentionally overlapping DIO writes from userspace. For example,
generic/746 uses a loop device where multiple kworkers issue concurrent
I/O to the backing file. Additionally, when block_size < folio_size,
non-overlapping DIO writes that share a large folio can also trigger
the race.
Add inode_dio_wait() in ext4_buffered_write_iter() before
ext4_write_checks() to drain all in-flight DIO. With this change,
every DIO invalidates existing pages before submitting I/O (via
kiocb_invalidate_pages()), every buffered write waits for in-flight
DIO to complete before dirtying pages (via inode_dio_wait()), and
ext4_write_checks() observes the inode size only after that DIO has
completed, so ext4_block_zero_eof() does not race with in-flight DIO.
Together these eliminate the race.
Fixes: 378f32bab371 ("ext4: introduce direct I/O write using iomap infrastructure")
Suggested-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/d1adcf7c-c276-458d-9cac-68a4410f7626@gmail.com
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Baokun Li <libaokun@linux.alibaba.com>
---
fs/ext4/file.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index eb1a323962b1..130edf1ac242 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -309,6 +309,13 @@ static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
return -EOPNOTSUPP;
inode_lock(inode);
+
+ /*
+ * Prevent concurrent direct I/O and buffered I/O to the same file
+ * range. Wait for in-flight DIO to finish before dirtying pages.
+ */
+ inode_dio_wait(inode);
+
ret = ext4_write_checks(iocb, from);
if (ret <= 0)
goto out;
--
2.43.7
* [PATCH v3 3/9] ext4: skip overwrite check for aligned non-extending DIO writes
From: Baokun Li @ 2026-06-26 8:35 UTC (permalink / raw)
To: linux-ext4
Cc: tytso, adilger.kernel, jack, yi.zhang, ojaswin, ritesh.list,
peng_wang
Currently, ext4_dio_write_checks() calls ext4_overwrite_io() to
determine if a write is a pure overwrite, and upgrades to exclusive
i_rwsem if not. However, ext4_overwrite_io() uses a single
ext4_map_blocks() call which only returns the first contiguous extent of
the same type. A write spanning multiple pre-allocated extents (e.g.
written + unwritten, or two physically discontiguous written extents)
produces a false negative, forcing an unnecessary exclusive lock upgrade.
After commit 5d87c7fca2c1 ("ext4: avoid starting handle when dio
writing an unwritten extent") and commit 012924f0eeef ("ext4: remove
useless ext4_iomap_overwrite_ops"), ext4_iomap_begin()'s fast path
accepts both EXT4_MAP_MAPPED and EXT4_MAP_UNWRITTEN without starting a
journal transaction. The iomap iteration naturally handles multi-extent
ranges: each call returns the mapping for the current segment, and
unwritten-to-written conversion is deferred to ext4_dio_write_end_io().
This means the common case of mixed written/unwritten extents never
reaches ext4_iomap_alloc() at all.
Even for the less common case where the range contains a hole and
ext4_iomap_alloc() is needed, exclusive i_rwsem is still unnecessary for
aligned non-extending writes:
- truncate/punch_hole are kept out: they require exclusive i_rwsem
(blocked by our shared lock during allocation), and inode_dio_begin()
keeps their inode_dio_wait() blocked until in-flight bios complete.
- i_data_sem write-lock inside ext4_map_blocks() serializes concurrent
extent tree modifications (parallel writers to the same hole).
- The journal handle is per-thread and does not require i_rwsem
exclusion.
- i_disksize and orphan list are not involved in non-extending writes.
Skip the ext4_overwrite_io() check entirely for aligned writes by
initializing overwrite to true and only calling ext4_overwrite_io() for
unaligned writes. Unaligned writes still need the extent state check
because concurrent partial block zeroing in the DIO layer requires
exclusive serialization unless the range is a pure written-extent
overwrite.
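The change to the upgrade condition can be modelled in plain C as follows. This is a userspace sketch, not kernel code: struct dio_write_state and both function names are hypothetical, and the booleans stand in for the values ext4_dio_write_checks() derives (IS_NOSEC(), ext4_extending_io(), ext4_unaligned_io(), and the ext4_overwrite_io() result).

```c
#include <assert.h>
#include <stdbool.h>

/* Modelled inputs for one DIO write (hypothetical struct). */
struct dio_write_state {
	bool nosec;	/* IS_NOSEC(inode): no security update pending */
	bool extend;	/* write extends i_size or i_disksize */
	bool unaligned;	/* write is not block-aligned */
	bool overwrite;	/* what ext4_overwrite_io() would return */
	bool unwritten;	/* the looked-up extent is unwritten */
};

/* Upgrade condition before this patch: extent state is always consulted. */
static bool needs_exclusive_old(const struct dio_write_state *s)
{
	return !s->nosec || s->extend || !s->overwrite ||
	       (s->unaligned && s->unwritten);
}

/* Upgrade condition after this patch: extent state matters only if unaligned. */
static bool needs_exclusive_new(const struct dio_write_state *s)
{
	bool overwrite = true, unwritten = false;

	if (s->unaligned) {
		overwrite = s->overwrite;
		unwritten = s->unwritten;
	}
	return !s->nosec || s->extend ||
	       (s->unaligned && (!overwrite || unwritten));
}
```

An aligned, non-extending write for which the single-extent lookup reports "not a pure overwrite" (for example because it spans two extents) takes the exclusive lock in the old model and the shared lock in the new one; every unaligned case is decided as before.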
Performance:
Hardware: /dev/sda (rotational disk, ~1 GB/s sustained write)
Filesystem: ext4 default mkfs
Aligned 8K DIO writes spanning written+unwritten extent boundaries.
Each thread writes its own 1G region sequentially; the file is rebuilt
between runs so every block is written exactly once. Metric: IOPS.
JOBS  Before   After    speedup
----  -------  -------  -------
1     42,322   43,329   1.02x
2     68,516   70,677   1.03x
4     62,489   97,072   1.55x
8     58,701   110,819  1.89x
16    58,569   116,392  1.99x
32    60,860   117,244  1.93x
Wall time at JOBS=32: 69.2s (Before) -> 35.4s (After), 1.96x faster.
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Baokun Li <libaokun@linux.alibaba.com>
---
fs/ext4/file.c | 52 +++++++++++++++++++++++++++++---------------------
1 file changed, 30 insertions(+), 22 deletions(-)
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 130edf1ac242..7d453d7c003b 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -435,16 +435,27 @@ static const struct iomap_dio_ops ext4_dio_write_ops = {
* condition requires an exclusive inode lock. If yes, then we restart the
* whole operation by releasing the shared lock and acquiring exclusive lock.
*
- * - For unaligned_io we never take shared lock as it may cause data corruption
- * when two unaligned IO tries to modify the same block e.g. while zeroing.
+ * The decision is layered, evaluated in this order:
*
- * - For extending writes case we don't take the shared lock, since it requires
- * updating inode i_disksize and/or orphan handling with exclusive lock.
+ * 1. If file_modified() needs to update security info (!IS_NOSEC), upgrade
+ * to the exclusive lock -- the security update itself requires it,
+ * regardless of whether the write extends the file or is aligned.
*
- * - shared locking will only be true mostly with overwrites, including
- * initialized blocks and unwritten blocks.
+ * 2. If the write extends i_size or i_disksize, upgrade to the exclusive
+ * lock to safely update i_disksize and the orphan list, regardless of
+ * alignment.
*
- * - Otherwise we will switch to exclusive i_rwsem lock.
+ * 3. Otherwise, for aligned non-extending writes, shared lock is always
+ * sufficient regardless of extent state (written, unwritten, or hole).
+ * truncate/punch_hole cannot run while we hold the shared i_rwsem
+ * (they need it exclusively); after we release it, inode_dio_begin()
+ * keeps their inode_dio_wait() blocked until in-flight bios complete.
+ * i_data_sem serializes concurrent extent tree modifications.
+ *
+ * 4. Otherwise, the write is unaligned and non-extending. Shared lock is
+ * only safe for pure written-extent overwrites. Unwritten extents or
+ * holes require exclusive lock because concurrent partial block zeroing
+ * in the DIO layer could corrupt data.
*/
static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
bool *ilock_shared, bool *extend,
@@ -455,7 +466,7 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
loff_t offset;
size_t count;
ssize_t ret;
- bool overwrite, unaligned_io, unwritten;
+ bool overwrite = true, unaligned_io, unwritten = false;
restart:
ret = ext4_generic_write_checks(iocb, from);
@@ -467,22 +478,19 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
unaligned_io = ext4_unaligned_io(inode, from, offset);
*extend = ext4_extending_io(inode, offset, count);
- overwrite = ext4_overwrite_io(inode, offset, count, &unwritten);
/*
- * Determine whether we need to upgrade to an exclusive lock. This is
- * required to change security info in file_modified(), for extending
- * I/O, any form of non-overwrite I/O, and unaligned I/O to unwritten
- * extents (as partial block zeroing may be required).
- *
- * Note that unaligned writes are allowed under shared lock so long as
- * they are pure overwrites. Otherwise, concurrent unaligned writes risk
- * data corruption due to partial block zeroing in the dio layer, and so
- * the I/O must occur exclusively.
+ * For unaligned writes we need to know the extent state to determine
+ * whether shared lock is safe. For aligned writes we skip this check
+ * entirely since allocation under shared lock is safe.
*/
+ if (unaligned_io)
+ overwrite = ext4_overwrite_io(inode, offset, count, &unwritten);
+
+ /* Determine whether we need to upgrade to an exclusive lock. */
if (*ilock_shared &&
- ((!IS_NOSEC(inode) || *extend || !overwrite ||
- (unaligned_io && unwritten)))) {
+ ((!IS_NOSEC(inode) || *extend ||
+ (unaligned_io && (!overwrite || unwritten))))) {
if (iocb->ki_flags & IOCB_NOWAIT) {
ret = -EAGAIN;
goto out;
@@ -497,8 +505,8 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
* Now that locking is settled, determine dio flags and exclusivity
* requirements. We don't use DIO_OVERWRITE_ONLY because we enforce
* behavior already. The inode lock is already held exclusive if the
- * write is non-overwrite or extending, so drain all outstanding dio and
- * set the force wait dio flag.
+ * write is unaligned non-overwrite or extending, so drain all
+ * outstanding dio and set the force wait dio flag.
*/
if (!*ilock_shared && (unaligned_io || *extend)) {
if (iocb->ki_flags & IOCB_NOWAIT) {
--
2.43.7
* [PATCH v3 4/9] ext4: base unaligned DIO lock decision on partial block zeroing
From: Baokun Li @ 2026-06-26 8:35 UTC (permalink / raw)
To: linux-ext4
Cc: tytso, adilger.kernel, jack, yi.zhang, ojaswin, ritesh.list,
peng_wang
For unaligned DIO writes, the previous ext4_overwrite_io() required the
entire range to fall within a single written extent. This was overly
conservative: the DIO layer only performs partial block zeroing for the
head and tail blocks when they are partially covered by the write.
Middle blocks that are fully covered are written as whole blocks
without any zeroing, so they are safe regardless of extent state.
Therefore exclusive lock is only required when partial block zeroing
will actually happen:
- The head partial block (if any) lands on a hole or unwritten extent.
- The tail partial block (if any) lands on a hole or unwritten extent.
Middle full-cover blocks can be in any state (hole, unwritten, or
written) - block allocation under shared lock is safe per the previous
patch's analysis (inode_dio_begin + i_data_sem protection).
Replace ext4_overwrite_io() with ext4_dio_needs_zeroing(), which
directly answers the question driving the lock decision. It uses at
most two ext4_map_blocks() calls: one for the head partial block (also
catching the case where it spans through the tail), and one for the
tail partial block if not already covered.
This enables shared lock for previously-rejected scenarios such as:
- Unaligned write spanning written extent + mid-range hole + written
extent at the tail.
- Unaligned write where the partial blocks land on written extents but
the middle has unwritten extents.
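The head/tail rule can be illustrated with a toy userspace model. This is a sketch under stated assumptions, not the kernel implementation: the per-block state array replaces ext4_map_blocks(), enum blk_state and toy_needs_zeroing() are hypothetical names, and the i_size check done by the real function is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for extent state, one entry per logical block. */
enum blk_state { BLK_HOLE, BLK_UNWRITTEN, BLK_WRITTEN };

/*
 * Return true when a write of len bytes at pos would need partial block
 * zeroing: a partially covered head or tail block that is not a written
 * block. Fully covered middle blocks are never inspected.
 */
static bool toy_needs_zeroing(const enum blk_state *blocks, size_t nblocks,
			      unsigned int blkbits, uint64_t pos, uint64_t len)
{
	uint64_t blockmask = ((uint64_t)1 << blkbits) - 1;
	uint64_t end = pos + len;
	uint64_t head_lblk = pos >> blkbits;
	uint64_t tail_lblk = (end - 1) >> blkbits;

	assert(len > 0 && tail_lblk < nblocks);

	/* Head block only partially covered and not written. */
	if ((pos & blockmask) && blocks[head_lblk] != BLK_WRITTEN)
		return true;
	/* Tail block only partially covered and not written. */
	if ((end & blockmask) && blocks[tail_lblk] != BLK_WRITTEN)
		return true;
	return false;
}
```

Assuming a 4K block size (the block size is not stated in the benchmark description), the benchmark stripe [written][unwritten][unwritten][written] with a 14336-byte write at offset 512 puts the head in block 0 and the tail in block 3, both written, so the model reports that no zeroing is needed even though blocks 1 and 2 are unwritten.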
Performance:
Hardware: /dev/sda (rotational disk, ~1 GB/s sustained write)
Filesystem: ext4 default mkfs
Unaligned DIO writes (14336 bytes at +512 within each 16K stripe).
Each stripe is laid out as [written][unwritten][unwritten][written],
so the head and tail partial blocks land on written extents but the
middle is unwritten. Metric: IOPS.
JOBS  Before   After    speedup
----  -------  -------  -------
1     15,547   17,381   1.12x
2     15,910   34,172   2.15x
4     15,014   57,567   3.83x
8     15,022   81,947   5.46x
16    14,586   99,126   6.80x
32    14,047   92,519   6.59x
Wall time at JOBS=32: 149.3s (Before) -> 22.7s (After), 6.58x faster.
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Baokun Li <libaokun@linux.alibaba.com>
---
fs/ext4/file.c | 108 +++++++++++++++++++++++++++++++++----------------
1 file changed, 73 insertions(+), 35 deletions(-)
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 7d453d7c003b..d12445e3907a 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -213,31 +213,60 @@ ext4_extending_io(struct inode *inode, loff_t offset, size_t len)
return false;
}
-/* Is IO overwriting allocated or initialized blocks? */
-static bool ext4_overwrite_io(struct inode *inode,
- loff_t pos, loff_t len, bool *unwritten)
+/*
+ * Does an unaligned DIO write require partial block zeroing?
+ *
+ * Partial block zeroing is performed only for the head and tail blocks
+ * when they are partially covered by the write and the underlying extent
+ * is a hole or unwritten. Middle blocks (fully covered by the write)
+ * are written as whole blocks without zeroing.
+ *
+ * When zeroing is required, two concurrent unaligned DIO writes to the
+ * same partial block can race and corrupt each other's data, so the
+ * caller must take the exclusive i_rwsem and drain in-flight DIO. When
+ * zeroing is not required, shared lock is safe -- block allocation and
+ * unwritten conversion for middle blocks are protected by i_data_sem
+ * and inode_dio_begin().
+ */
+static bool ext4_dio_needs_zeroing(struct inode *inode, loff_t pos, loff_t len)
{
struct ext4_map_blocks map;
unsigned int blkbits = inode->i_blkbits;
- int err, blklen;
+ unsigned long blockmask = inode->i_sb->s_blocksize - 1;
+ bool head_partial, tail_partial;
+ ext4_lblk_t head_lblk, tail_lblk;
+ int err;
if (pos + len > i_size_read(inode))
- return false;
+ return true;
- map.m_lblk = pos >> blkbits;
- map.m_len = EXT4_MAX_BLOCKS(len, pos, blkbits);
- blklen = map.m_len;
+ head_partial = (pos & blockmask) != 0;
+ tail_partial = ((pos + len) & blockmask) != 0;
+ head_lblk = pos >> blkbits;
+ tail_lblk = (pos + len - 1) >> blkbits;
+
+ /* Check the head partial block. */
+ if (head_partial) {
+ map.m_lblk = head_lblk;
+ map.m_len = tail_lblk - head_lblk + 1;
+ err = ext4_map_blocks(NULL, inode, &map, 0);
+ if (err <= 0 || !(map.m_flags & EXT4_MAP_MAPPED))
+ return true;
+ /* If this mapping already covers the tail block, we're done. */
+ if (!tail_partial || map.m_lblk + err > tail_lblk)
+ return false;
+ }
- err = ext4_map_blocks(NULL, inode, &map, 0);
- if (err != blklen)
- return false;
- /*
- * 'err==len' means that all of the blocks have been preallocated,
- * regardless of whether they have been initialized or not. We need to
- * check m_flags to distinguish the unwritten extents.
- */
- *unwritten = !(map.m_flags & EXT4_MAP_MAPPED);
- return true;
+ /* Check the tail partial block. */
+ if (tail_partial) {
+ map.m_lblk = tail_lblk;
+ map.m_len = 1;
+ err = ext4_map_blocks(NULL, inode, &map, 0);
+ if (err <= 0 || !(map.m_flags & EXT4_MAP_MAPPED))
+ return true;
+ }
+
+ return false;
}
static ssize_t ext4_generic_write_checks(struct kiocb *iocb,
@@ -453,9 +482,10 @@ static const struct iomap_dio_ops ext4_dio_write_ops = {
* i_data_sem serializes concurrent extent tree modifications.
*
* 4. Otherwise, the write is unaligned and non-extending. Shared lock is
- * only safe for pure written-extent overwrites. Unwritten extents or
- * holes require exclusive lock because concurrent partial block zeroing
- * in the DIO layer could corrupt data.
+ * safe unless the DIO layer needs to perform partial block zeroing --
+ * i.e. the head or tail partial block sits on a hole or unwritten
+ * extent. In that case upgrade to the exclusive lock and drain
+ * in-flight DIO to avoid races with concurrent partial block zeroing.
*/
static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
bool *ilock_shared, bool *extend,
@@ -466,7 +496,7 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
loff_t offset;
size_t count;
ssize_t ret;
- bool overwrite = true, unaligned_io, unwritten = false;
+ bool needs_zeroing = false;
restart:
ret = ext4_generic_write_checks(iocb, from);
@@ -476,21 +506,22 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
offset = iocb->ki_pos;
count = ret;
- unaligned_io = ext4_unaligned_io(inode, from, offset);
*extend = ext4_extending_io(inode, offset, count);
/*
- * For unaligned writes we need to know the extent state to determine
- * whether shared lock is safe. For aligned writes we skip this check
- * entirely since allocation under shared lock is safe.
+ * For unaligned writes, check whether partial block zeroing will be
+ * needed. If so, exclusive lock is required to serialize against
+ * concurrent DIO that could race with the zeroing.
+ *
+ * For aligned writes we skip this check entirely since allocation
+ * under shared lock is safe.
*/
- if (unaligned_io)
- overwrite = ext4_overwrite_io(inode, offset, count, &unwritten);
+ if (ext4_unaligned_io(inode, from, offset))
+ needs_zeroing = ext4_dio_needs_zeroing(inode, offset, count);
/* Determine whether we need to upgrade to an exclusive lock. */
if (*ilock_shared &&
- ((!IS_NOSEC(inode) || *extend ||
- (unaligned_io && (!overwrite || unwritten))))) {
+ (!IS_NOSEC(inode) || *extend || needs_zeroing)) {
if (iocb->ki_flags & IOCB_NOWAIT) {
ret = -EAGAIN;
goto out;
@@ -504,16 +535,23 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
/*
* Now that locking is settled, determine dio flags and exclusivity
* requirements. We don't use DIO_OVERWRITE_ONLY because we enforce
- * behavior already. The inode lock is already held exclusive if the
- * write is unaligned non-overwrite or extending, so drain all
- * outstanding dio and set the force wait dio flag.
+ * behavior already. When holding the exclusive lock for a write that
+ * needs partial block zeroing or is extending the file, we must wait
+ * for the I/O to complete synchronously:
+ *
+ * - needs_zeroing: drain in-flight DIO whose end_io could race with
+ * our partial block zeroing, and force synchronous completion so we
+ * don't leave in-flight zeroing bios for the next writer to drain.
+ *
+ * - extend: the caller must update i_disksize after I/O completion,
+ * which requires the data to be on disk first.
*/
- if (!*ilock_shared && (unaligned_io || *extend)) {
+ if (!*ilock_shared && (needs_zeroing || *extend)) {
if (iocb->ki_flags & IOCB_NOWAIT) {
ret = -EAGAIN;
goto out;
}
- if (unaligned_io && (!overwrite || unwritten))
+ if (needs_zeroing)
inode_dio_wait(inode);
*dio_flags = IOMAP_DIO_FORCE_WAIT;
}
--
2.43.7
* [PATCH v3 5/9] ext4: use kiocb_modified instead of file_modified in DIO/DAX write path
From: Baokun Li @ 2026-06-26 8:35 UTC (permalink / raw)
To: linux-ext4
Cc: tytso, adilger.kernel, jack, yi.zhang, ojaswin, ritesh.list,
peng_wang
file_modified() passes flags=0 which drops IOCB_NOWAIT, causing
file_update_time() to sleep in ext4_journal_start() via
ext4_dirty_inode() even in non-blocking contexts.
kiocb_modified(iocb) propagates iocb->ki_flags so that
generic_update_time() correctly returns -EAGAIN when IOCB_NOWAIT
is set and ->dirty_inode could block, matching the behavior
already adopted by XFS, FUSE, and ext2.
Affected paths:
- ext4_dio_write_checks(): DIO NOWAIT write
- ext4_write_checks(): shared by buffered (rejects NOWAIT upfront)
and DAX write (supports NOWAIT)
ext4_fallocate() in extents.c is not affected as it has no kiocb.
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Baokun Li <libaokun@linux.alibaba.com>
---
fs/ext4/file.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index d12445e3907a..0e9448a110dc 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -307,7 +307,7 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
if (count <= 0)
return count;
- ret = file_modified(iocb->ki_filp);
+ ret = kiocb_modified(iocb);
if (ret)
return ret;
@@ -466,7 +466,7 @@ static const struct iomap_dio_ops ext4_dio_write_ops = {
*
* The decision is layered, evaluated in this order:
*
- * 1. If file_modified() needs to update security info (!IS_NOSEC), upgrade
+ * 1. If kiocb_modified() needs to update security info (!IS_NOSEC), upgrade
* to the exclusive lock -- the security update itself requires it,
* regardless of whether the write extends the file or is aligned.
*
@@ -556,7 +556,7 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
*dio_flags = IOMAP_DIO_FORCE_WAIT;
}
- ret = file_modified(file);
+ ret = kiocb_modified(iocb);
if (ret < 0)
goto out;
--
2.43.7
* [PATCH v3 6/9] ext4: improve EXT4_GET_BLOCKS_CACHED_NOWAIT handling in ext4_map_blocks
2026-06-26 8:35 [PATCH v3 0/9] ext4: allow more DIO writes under shared i_rwsem Baokun Li
` (4 preceding siblings ...)
2026-06-26 8:35 ` [PATCH v3 5/9] ext4: use kiocb_modified instead of file_modified in DIO/DAX write path Baokun Li
@ 2026-06-26 8:35 ` Baokun Li
[not found] ` <20260626085003.BD4BC1F000E9@smtp.kernel.org>
2026-06-26 8:35 ` [PATCH v3 7/9] ext4: handle IOMAP_NOWAIT in ext4_iomap_begin() with cache-only lookup Baokun Li
` (2 subsequent siblings)
8 siblings, 1 reply; 12+ messages in thread
From: Baokun Li @ 2026-06-26 8:35 UTC (permalink / raw)
To: linux-ext4
Cc: tytso, adilger.kernel, jack, yi.zhang, ojaswin, ritesh.list,
peng_wang

When EXT4_GET_BLOCKS_CACHED_NOWAIT is set and the extent status cache
hits, ext4_map_blocks() returns immediately without running
check_block_validity(). This allows malicious extents from crafted
filesystem images to bypass validation if they have been cached by a
previous blocking read.

Make three improvements to the EXT4_GET_BLOCKS_CACHED_NOWAIT handling:

1. Change the cache-hit path from "return retval" to "goto found" so
   that check_block_validity() always runs, closing the security bypass.

2. Return -EAGAIN instead of 0 on cache miss to distinguish it from a
   cache hit on a hole or delayed extent (which returns 0). The only
   existing caller (ext4_get_link() -> ext4_getblk() -> ERR_PTR())
   converts both -EAGAIN and 0 to ERR_PTR(-ECHILD), so the end result
   is unchanged.

3. Add WARN_ON_ONCE after the EXT4_GET_BLOCKS_CREATE==0 early return
   to assert that EXT4_GET_BLOCKS_CREATE and EXT4_GET_BLOCKS_CACHED_NOWAIT
   are never combined, since EXT4_GET_BLOCKS_CREATE requires blocking on
   i_data_sem.

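The return-value convention in points 1 and 2 can be sketched as a userspace model. All names below are hypothetical, the validity check is a toy range test, and -EIO merely stands in for whatever error the real validation reports; this is not the kernel implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical cache states for one lookup; not kernel types. */
enum model_es_state { MODEL_ES_MISS, MODEL_ES_HOLE, MODEL_ES_WRITTEN };

struct model_extent {
	enum model_es_state state;
	unsigned long pblk;	/* first physical block, if written */
	unsigned int len;	/* number of blocks, if written */
};

/* Toy validity check: the physical range must sit inside [first, last]. */
static bool model_block_valid(const struct model_extent *ex,
			      unsigned long first, unsigned long last)
{
	return ex->len > 0 && ex->pblk >= first &&
	       ex->pblk + ex->len - 1 <= last;
}

/*
 * Cache-only lookup in the model:
 *   > 0      cache hit on a mapped extent that passed validation
 *   0        cache hit on a hole
 *   -EAGAIN  cache miss (caller should retry in blocking context)
 *   -EIO     cache hit on an extent that failed validation (placeholder)
 */
static int model_map_cached_nowait(const struct model_extent *ex,
				   unsigned long first, unsigned long last)
{
	assert(ex != NULL);
	switch (ex->state) {
	case MODEL_ES_MISS:
		return -EAGAIN;
	case MODEL_ES_HOLE:
		return 0;
	case MODEL_ES_WRITTEN:
		/* Validation runs even on the cache-hit path. */
		return model_block_valid(ex, first, last) ? (int)ex->len : -EIO;
	}
	return -EIO;
}
```

The property the model is meant to show is that a miss and a hole are now distinguishable, and that a cached extent never skips validation.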
Signed-off-by: Baokun Li <libaokun@linux.alibaba.com>
---
fs/ext4/inode.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 832794294ccf..7f9ae584ad98 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -759,8 +759,9 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
BUG();
}
+ /* Skip blocking operations and jump to extent validation. */
if (flags & EXT4_GET_BLOCKS_CACHED_NOWAIT)
- return retval;
+ goto found;
#ifdef ES_AGGRESSIVE_TEST
ext4_map_blocks_es_recheck(handle, inode, map,
&orig_map, flags);
@@ -776,7 +777,7 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
* cannot find extent in the cache.
*/
if (flags & EXT4_GET_BLOCKS_CACHED_NOWAIT)
- return 0;
+ return -EAGAIN;
/*
* Try to see if we can get the block without requesting a new
@@ -797,6 +798,9 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
if ((flags & EXT4_GET_BLOCKS_CREATE) == 0)
return retval;
+ /* EXT4_GET_BLOCKS_CREATE cannot operate in NOWAIT mode */
+ WARN_ON_ONCE(flags & EXT4_GET_BLOCKS_CACHED_NOWAIT);
+
/*
* Returns if the blocks have already allocated
*
--
2.43.7
* [PATCH v3 7/9] ext4: handle IOMAP_NOWAIT in ext4_iomap_begin() with cache-only lookup
2026-06-26 8:35 [PATCH v3 0/9] ext4: allow more DIO writes under shared i_rwsem Baokun Li
` (5 preceding siblings ...)
2026-06-26 8:35 ` [PATCH v3 6/9] ext4: improve EXT4_GET_BLOCKS_CACHED_NOWAIT handling in ext4_map_blocks Baokun Li
@ 2026-06-26 8:35 ` Baokun Li
2026-06-26 8:35 ` [PATCH v3 8/9] ext4: handle IOCB_NOWAIT in ext4_dio_needs_zeroing() " Baokun Li
2026-06-26 8:35 ` [PATCH v3 9/9] ext4: fix NOWAIT semantic violation in DAX extending writes Baokun Li
8 siblings, 0 replies; 12+ messages in thread
From: Baokun Li @ 2026-06-26 8:35 UTC (permalink / raw)
To: linux-ext4
Cc: tytso, adilger.kernel, jack, yi.zhang, ojaswin, ritesh.list,
peng_wang

Pass the EXT4_GET_BLOCKS_CACHED_NOWAIT flag to ext4_map_blocks() when
IOMAP_NOWAIT is set, ensuring that extent lookups only use the cached
extent status tree. If the cache misses, ext4_map_blocks() returns
-EAGAIN instead of sleeping on down_read(i_data_sem) to read the extent
tree from disk.

This applies to both the write and read paths in ext4_iomap_begin(),
allowing DIO/DAX operations with RWF_NOWAIT to avoid blocking on
extent tree lookups.

Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Baokun Li <libaokun@linux.alibaba.com>
---
fs/ext4/inode.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 7f9ae584ad98..3c05094475db 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3784,6 +3784,7 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
struct ext4_map_blocks map;
u8 blkbits = inode->i_blkbits;
unsigned int orig_mlen;
+ int map_flags = 0;
if ((offset >> blkbits) > EXT4_MAX_LOGICAL_BLOCK)
return -EINVAL;
@@ -3798,6 +3799,12 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
map.m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
EXT4_MAX_LOGICAL_BLOCK) - map.m_lblk + 1;
orig_mlen = map.m_len;
+ /*
+ * In NOWAIT context, only use cached extent info. If es cache misses,
+ * return -EAGAIN to avoid sleeping on down_read(i_data_sem).
+ */
+ if (flags & IOMAP_NOWAIT)
+ map_flags = EXT4_GET_BLOCKS_CACHED_NOWAIT;
if (flags & IOMAP_WRITE) {
/*
@@ -3807,7 +3814,7 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
* especially in multi-threaded overwrite requests.
*/
if (offset + length <= i_size_read(inode)) {
- ret = ext4_map_blocks(NULL, inode, &map, 0);
+ ret = ext4_map_blocks(NULL, inode, &map, map_flags);
/*
* For DAX we convert extents to initialized ones before
* copying the data, otherwise we do it after I/O so
@@ -3828,7 +3835,7 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
}
ret = ext4_iomap_alloc(inode, &map, flags);
} else {
- ret = ext4_map_blocks(NULL, inode, &map, 0);
+ ret = ext4_map_blocks(NULL, inode, &map, map_flags);
}
if (ret < 0)
--
2.43.7
* [PATCH v3 8/9] ext4: handle IOCB_NOWAIT in ext4_dio_needs_zeroing() with cache-only lookup
2026-06-26 8:35 [PATCH v3 0/9] ext4: allow more DIO writes under shared i_rwsem Baokun Li
` (6 preceding siblings ...)
2026-06-26 8:35 ` [PATCH v3 7/9] ext4: handle IOMAP_NOWAIT in ext4_iomap_begin() with cache-only lookup Baokun Li
@ 2026-06-26 8:35 ` Baokun Li
2026-06-26 8:35 ` [PATCH v3 9/9] ext4: fix NOWAIT semantic violation in DAX extending writes Baokun Li
8 siblings, 0 replies; 12+ messages in thread
From: Baokun Li @ 2026-06-26 8:35 UTC (permalink / raw)
To: linux-ext4
Cc: tytso, adilger.kernel, jack, yi.zhang, ojaswin, ritesh.list,
peng_wang

Add a nowait parameter to ext4_dio_needs_zeroing() and pass the
EXT4_GET_BLOCKS_CACHED_NOWAIT flag to ext4_map_blocks() when
IOCB_NOWAIT is set. This ensures the needs_zeroing check only uses
cached extent info. If the cache misses, ext4_map_blocks() returns
-EAGAIN, causing ext4_dio_needs_zeroing() to conservatively return
true (needs zeroing). The caller then tries to upgrade to the exclusive
lock, which returns -EAGAIN for NOWAIT, avoiding a potential sleep on
down_read(i_data_sem).

The caller in ext4_dio_write_checks() is updated to pass the
IOCB_NOWAIT flag from the kiocb.

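As a worked example of the partial-block arithmetic and the conservative fallback, here is a userspace sketch. The `model_*` names are hypothetical, and the `tail_lblk` formula is an assumption, since that line is not visible in the hunks of this patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

struct model_partial {
	bool head_partial;	/* write starts inside a block */
	bool tail_partial;	/* write ends inside a block */
	uint64_t head_lblk;	/* first logical block touched */
	uint64_t tail_lblk;	/* last logical block touched */
};

/* Mirrors the head/tail computation shown in the diff context. */
static struct model_partial model_partial_blocks(uint64_t pos, uint64_t len,
						 unsigned int blkbits)
{
	uint64_t blockmask = ((uint64_t)1 << blkbits) - 1;
	struct model_partial p;

	assert(len > 0 && blkbits < 63);
	p.head_partial = (pos & blockmask) != 0;
	p.tail_partial = ((pos + len) & blockmask) != 0;
	p.head_lblk = pos >> blkbits;
	/* Assumption: last block is derived from the last byte written. */
	p.tail_lblk = (pos + len - 1) >> blkbits;
	return p;
}

/*
 * Conservative decision from the diff: any lookup result <= 0 (including
 * -EAGAIN from a cache miss) or an unmapped result means "needs zeroing".
 */
static bool model_needs_zeroing(int lookup_ret, bool mapped)
{
	return lookup_ret <= 0 || !mapped;
}

/* Usage: 8 KiB written at byte 4196 with 4 KiB blocks spans blocks 1..3. */
static void model_partial_demo(void)
{
	struct model_partial p = model_partial_blocks(4196, 8192, 12);

	assert(p.head_partial && p.tail_partial);
	assert(p.head_lblk == 1 && p.tail_lblk == 3);
	assert(model_needs_zeroing(-EAGAIN, false));
}
```

Treating a cache miss as "needs zeroing" costs at most an unnecessary -EAGAIN for the NOWAIT caller; it never lets an unverified write proceed under the shared lock.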
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Baokun Li <libaokun@linux.alibaba.com>
---
fs/ext4/file.c | 14 ++++++++++----
1 file changed, 10 insertions(+), 4 deletions(-)
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 0e9448a110dc..49074cc13751 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -228,7 +228,8 @@ ext4_extending_io(struct inode *inode, loff_t offset, size_t len)
* unwritten conversion for middle blocks are protected by i_data_sem
* and inode_dio_begin().
*/
-static bool ext4_dio_needs_zeroing(struct inode *inode, loff_t pos, loff_t len)
+static bool ext4_dio_needs_zeroing(struct inode *inode, loff_t pos, loff_t len,
+ bool nowait)
{
struct ext4_map_blocks map;
unsigned int blkbits = inode->i_blkbits;
@@ -236,10 +237,14 @@ static bool ext4_dio_needs_zeroing(struct inode *inode, loff_t pos, loff_t len)
bool head_partial, tail_partial;
ext4_lblk_t head_lblk, tail_lblk;
int err;
+ int map_flags = 0;
if (pos + len > i_size_read(inode))
return true;
+ if (nowait)
+ map_flags = EXT4_GET_BLOCKS_CACHED_NOWAIT;
+
head_partial = (pos & blockmask) != 0;
tail_partial = ((pos + len) & blockmask) != 0;
head_lblk = pos >> blkbits;
@@ -249,7 +254,7 @@ static bool ext4_dio_needs_zeroing(struct inode *inode, loff_t pos, loff_t len)
if (head_partial) {
map.m_lblk = head_lblk;
map.m_len = tail_lblk - head_lblk + 1;
- err = ext4_map_blocks(NULL, inode, &map, 0);
+ err = ext4_map_blocks(NULL, inode, &map, map_flags);
if (err <= 0 || !(map.m_flags & EXT4_MAP_MAPPED))
return true;
/* If this mapping already covers the tail block, we're done. */
@@ -261,7 +266,7 @@ static bool ext4_dio_needs_zeroing(struct inode *inode, loff_t pos, loff_t len)
if (tail_partial) {
map.m_lblk = tail_lblk;
map.m_len = 1;
- err = ext4_map_blocks(NULL, inode, &map, 0);
+ err = ext4_map_blocks(NULL, inode, &map, map_flags);
if (err <= 0 || !(map.m_flags & EXT4_MAP_MAPPED))
return true;
}
@@ -517,7 +522,8 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
* under shared lock is safe.
*/
if (ext4_unaligned_io(inode, from, offset))
- needs_zeroing = ext4_dio_needs_zeroing(inode, offset, count);
+ needs_zeroing = ext4_dio_needs_zeroing(inode, offset, count,
+ iocb->ki_flags & IOCB_NOWAIT);
/* Determine whether we need to upgrade to an exclusive lock. */
if (*ilock_shared &&
--
2.43.7
* [PATCH v3 9/9] ext4: fix NOWAIT semantic violation in DAX extending writes
2026-06-26 8:35 [PATCH v3 0/9] ext4: allow more DIO writes under shared i_rwsem Baokun Li
` (7 preceding siblings ...)
2026-06-26 8:35 ` [PATCH v3 8/9] ext4: handle IOCB_NOWAIT in ext4_dio_needs_zeroing() " Baokun Li
@ 2026-06-26 8:35 ` Baokun Li
2026-06-26 14:32 ` Jan Kara
8 siblings, 1 reply; 12+ messages in thread
From: Baokun Li @ 2026-06-26 8:35 UTC (permalink / raw)
To: linux-ext4
Cc: tytso, adilger.kernel, jack, yi.zhang, ojaswin, ritesh.list,
peng_wang, Sashiko

When a DAX write starts before EOF but extends past i_disksize,
ext4_write_checks() skips the IOCB_NOWAIT check because
iocb->ki_pos <= old_size. However, ext4_dax_write_iter() later calls
ext4_journal_start() to prepare for inode extension, which can sleep
waiting for journal space or transaction commit.

This violates NOWAIT semantics and can stall asynchronous I/O frameworks
like io_uring that rely on non-blocking behavior.

Fix this by checking IOCB_NOWAIT before calling ext4_journal_start()
in the extending write path. If NOWAIT is set and extension is needed,
return -EAGAIN so the caller can retry in blocking context.

Example scenario:
- File: i_size = 1000, i_disksize = 1000
- DAX NOWAIT write: offset = 500, count = 2000
- ext4_write_checks(): ki_pos (500) <= old_size (1000), skip NOWAIT check
- ext4_dax_write_iter(): offset + count (2500) > i_disksize (1000)
- ext4_journal_start() → may sleep → violates NOWAIT

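The scenario above can be replayed with a small userspace model. The function names are hypothetical, and the form of the first check is inferred only from the wording of this commit message (it fires when ki_pos > old_size), not copied from ext4_write_checks().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Model of the existing early check: per the description, it only rejects
 * a NOWAIT write that starts beyond the old file size.
 */
static bool model_write_checks_rejects(uint64_t ki_pos, uint64_t old_size,
				       bool nowait)
{
	return nowait && ki_pos > old_size;
}

/*
 * Model of the check added by this patch: an extending write would need a
 * journal handle, which may sleep, so a NOWAIT caller gets -EAGAIN first.
 */
static int model_dax_extend_check(uint64_t offset, uint64_t count,
				  uint64_t i_disksize, bool nowait)
{
	if (offset + count > i_disksize && nowait)
		return -EAGAIN;
	return 0;
}

/* Usage: the numbers from the example scenario. */
static void model_dax_demo(void)
{
	/* Starts before EOF, so the early check lets it through. */
	assert(!model_write_checks_rejects(500, 1000, true));
	/* 500 + 2000 = 2500 > 1000, so the new check returns -EAGAIN. */
	assert(model_dax_extend_check(500, 2000, 1000, true) == -EAGAIN);
}
```

A non-extending NOWAIT write, or any blocking write, passes the second check in this model and continues as before.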
Reported-by: Sashiko <sashiko-bot@kernel.org>
Closes: https://sashiko.dev/#/patchset/20260618125735.4156639-1-libaokun@linux.alibaba.com?part=5
Signed-off-by: Baokun Li <libaokun@linux.alibaba.com>
---
fs/ext4/file.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 49074cc13751..610e109fcd36 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -731,6 +731,11 @@ ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
count = iov_iter_count(from);
if (offset + count > EXT4_I(inode)->i_disksize) {
+ if (iocb->ki_flags & IOCB_NOWAIT) {
+ ret = -EAGAIN;
+ goto out;
+ }
+
handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
if (IS_ERR(handle)) {
ret = PTR_ERR(handle);
--
2.43.7
* Re: [PATCH v3 6/9] ext4: improve EXT4_GET_BLOCKS_CACHED_NOWAIT handling in ext4_map_blocks
[not found] ` <20260626085003.BD4BC1F000E9@smtp.kernel.org>
@ 2026-06-26 10:10 ` Baokun Li
0 siblings, 0 replies; 12+ messages in thread
From: Baokun Li @ 2026-06-26 10:10 UTC (permalink / raw)
To: Ext4 Developers List
Cc: tytso, sashiko-reviews, Andreas Dilger, Jan Kara, Zhang Yi,
Ojaswin Mujoo, Ritesh Harjani (IBM)
On 2026/6/26 16:50, sashiko-bot@kernel.org wrote:
> Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
>
> New issues:
> - [High] Encountering an invalid block during EXT4_GET_BLOCKS_CACHED_NOWAIT causes ext4_error_inode() to block while in RCU-walk mode.
>
> Pre-existing issues:
> - [High] check_block_validity() is bypassed for unwritten extents because EXT4_MAP_MAPPED is not set, leaving the filesystem vulnerable to arbitrary block reads.
> --
>
> commit 16a7856d68391995dd0a737adac6e20b3e6ddb76
> Author: Baokun Li <libaokun@linux.alibaba.com>
>
> ext4: improve EXT4_GET_BLOCKS_CACHED_NOWAIT handling in ext4_map_blocks
>
> When EXT4_GET_BLOCKS_CACHED_NOWAIT is set and the extent status cache hits,
> ext4_map_blocks() returns immediately without running check_block_validity().
> This allows malicious extents from crafted filesystem images to bypass
> validation if they have been cached by a previous blocking read.
>
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index a60ecf0bebd04..2479f8d93ec8b 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -759,8 +759,9 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
>> BUG();
>> }
>>
>> + /* Skip blocking operations and jump to extent validation. */
>> if (flags & EXT4_GET_BLOCKS_CACHED_NOWAIT)
>> - return retval;
>> + goto found;
> [Severity: High]
> Can jumping to the extent validation cause a scheduling while atomic BUG?
>
> If a crafted filesystem provides a malicious extent that is successfully cached,
> a subsequent path walk in RCU mode will call ext4_getblk() with
> EXT4_GET_BLOCKS_CACHED_NOWAIT.
>
> Jumping to found executes check_block_validity(). If the extent is malicious,
> this calls ext4_error_inode(), which can block:
>
> check_block_validity()
> __check_block_validity()
> ext4_error_inode()
> ext4_handle_error()
> ext4_commit_super()
> wait_on_buffer()
>
> Since RCU-walk mode holds rcu_read_lock(), will blocking in wait_on_buffer()
> cause a crash?

Ah, right. Skipping check_block_validity() means potentially invalid
extents are used directly, but running check_block_validity() here might
sleep in atomic context. It seems that validating extents before
inserting them into the es tree is the thorough solution.

> [Severity: High]
> This isn't a bug introduced by this patch, but does the extent validation
> actually check unwritten extents?
>
> When looking up an unwritten extent in the ES cache, the flags are set to
> EXT4_MAP_UNWRITTEN but omit EXT4_MAP_MAPPED:
>
> ext4_map_blocks() {
> ...
> map->m_flags |= ext4_es_is_written(&es) ?
> EXT4_MAP_MAPPED : EXT4_MAP_UNWRITTEN;
> ...
> }
>
> At the found label, check_block_validity() is only called if EXT4_MAP_MAPPED
> is set:
>
> found:
> if (retval > 0 && map->m_flags & EXT4_MAP_MAPPED) {
> ret = check_block_validity(inode, map);
> ...
>
> Since unwritten extents skip check_block_validity(), could an attacker craft
> a filesystem with an unwritten extent in a directory pointing to critical
> metadata, forcing ext4_getblk() to read those blocks and leak their contents?
>
This is a false positive:
1. Symlinks cannot have unwritten extents. ext4 only creates
unwritten extents via fallocate for regular files; symlink
data is always fully written at creation time.
2. Reading unwritten extents does not leak stale data.
3. Writing to unwritten extents triggers re-validation. The
unwritten-to-written conversion re-maps through the full
ext4_map_blocks() path (without CACHED_NOWAIT), which runs
check_block_validity() before any user data reaches the
extent.
* Re: [PATCH v3 9/9] ext4: fix NOWAIT semantic violation in DAX extending writes
2026-06-26 8:35 ` [PATCH v3 9/9] ext4: fix NOWAIT semantic violation in DAX extending writes Baokun Li
@ 2026-06-26 14:32 ` Jan Kara
0 siblings, 0 replies; 12+ messages in thread
From: Jan Kara @ 2026-06-26 14:32 UTC (permalink / raw)
To: Baokun Li
Cc: linux-ext4, tytso, adilger.kernel, jack, yi.zhang, ojaswin,
ritesh.list, peng_wang, Sashiko
On Fri 26-06-26 16:35:18, Baokun Li wrote:
> When a DAX write starts before EOF but extends past i_disksize,
> ext4_write_checks() skips the IOCB_NOWAIT check because
> iocb->ki_pos <= old_size. However, ext4_dax_write_iter() later calls
> ext4_journal_start() to prepare for inode extension, which can sleep
> waiting for journal space or transaction commit.
>
> This violates NOWAIT semantics and can stall asynchronous I/O frameworks
> like io_uring that rely on non-blocking behavior.
>
> Fix this by checking IOCB_NOWAIT before calling ext4_journal_start()
> in the extending write path. If NOWAIT is set and extension is needed,
> return -EAGAIN so the caller can retry in blocking context.
>
> Example scenario:
> - File: i_size = 1000, i_disksize = 1000
> - DAX NOWAIT write: offset = 500, count = 2000
> - ext4_write_checks(): ki_pos (500) <= old_size (1000), skip NOWAIT check
> - ext4_dax_write_iter(): offset + count (2500) > i_disksize (1000)
> - ext4_journal_start() → may sleep → violates NOWAIT
>
> Reported-by: Sashiko <sashiko-bot@kernel.org>
> Closes: https://sashiko.dev/#/patchset/20260618125735.4156639-1-libaokun@linux.alibaba.com?part=5
> Signed-off-by: Baokun Li <libaokun@linux.alibaba.com>
Looks good. Feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
> ---
> fs/ext4/file.c | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 49074cc13751..610e109fcd36 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -731,6 +731,11 @@ ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
> count = iov_iter_count(from);
>
> if (offset + count > EXT4_I(inode)->i_disksize) {
> + if (iocb->ki_flags & IOCB_NOWAIT) {
> + ret = -EAGAIN;
> + goto out;
> + }
> +
> handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
> if (IS_ERR(handle)) {
> ret = PTR_ERR(handle);
> --
> 2.43.7
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR