From: Peng Wang <peng_wang@linux.alibaba.com>
To: tytso@mit.edu, adilger.kernel@dilger.ca,
	libaokun@linux.alibaba.com, jack@suse.cz, ojaswin@linux.ibm.com,
	ritesh.list@gmail.com, yi.zhang@huawei.com
Cc: linux-ext4@vger.kernel.org, inux-kernel@vger.kernel.org,
	Peng Wang <peng_wang@linux.alibaba.com>
Subject: [RFC PATCH] ext4: fix false-negative overwrite check for DIO spanning extent boundaries
Date: Sun, 7 Jun 2026 20:49:35 +0800
Message-ID: <20260607124935.6168-1-peng_wang@linux.alibaba.com>

ext4_overwrite_io() decides whether a direct I/O write is an overwrite
(all target blocks already allocated) so the write can proceed under a
shared inode lock. It calls ext4_map_blocks() once and returns false
if the mapped length is shorter than the requested length.

ext4_map_blocks() maps at most one extent per call. When a write
straddles two extents (e.g. a written extent and an adjacent unwritten
extent created by fallocate), the single call returns only the first
extent's length. ext4_overwrite_io() then mis-classifies the write as
non-overwrite and forces the caller to cycle i_rwsem from shared to
exclusive.

On workloads where a DIO writer appends through a fallocated region
while a DIO reader tails the same file, every write that crosses a
written/unwritten extent boundary triggers an exclusive lock
acquisition. The writer must wait for the reader's shared lock to be
released, and while waiting the RWSEM_FLAG_WAITERS bit blocks all
other shared acquirers. This serialises all writers to queue-depth 1
and throughput collapses.

Fix by looping ext4_map_blocks() over the remaining range. As long as
every queried extent reports allocated blocks (written or unwritten),
the function returns true and the write keeps the shared lock.
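
As a rough user-space illustration of the difference (not kernel code:
the toy_* names, the extent table layout and the block states are all
made up for this sketch), the old single-call check and the new loop
could be modelled like this:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical block states for this sketch; not ext4 definitions. */
enum toy_state { TOY_HOLE, TOY_UNWRITTEN, TOY_WRITTEN };

struct toy_extent {
	unsigned int start;	/* first logical block */
	unsigned int len;	/* length in blocks */
	enum toy_state state;
};

/*
 * Stand-in for ext4_map_blocks(): looks at one extent only. Returns the
 * number of blocks found at lblk (at most want) and stores their state,
 * or returns 0 if lblk is a hole or is not covered by the table.
 */
static int toy_map_blocks(const struct toy_extent *ext, size_t n,
			  unsigned int lblk, unsigned int want,
			  enum toy_state *state)
{
	size_t i;

	for (i = 0; i < n; i++) {
		unsigned int left;

		if (lblk < ext[i].start || lblk - ext[i].start >= ext[i].len)
			continue;
		if (ext[i].state == TOY_HOLE)
			return 0;
		left = ext[i].len - (lblk - ext[i].start);
		*state = ext[i].state;
		return (int)(left < want ? left : want);
	}
	return 0;
}

/* Old behaviour: a single call, so a short mapping is reported as "no". */
static bool toy_overwrite_old(const struct toy_extent *ext, size_t n,
			      unsigned int lblk, unsigned int blklen,
			      bool *unwritten)
{
	enum toy_state state = TOY_HOLE;

	if (toy_map_blocks(ext, n, lblk, blklen, &state) != (int)blklen)
		return false;
	*unwritten = (state == TOY_UNWRITTEN);
	return true;
}

/* New behaviour: keep mapping until the whole range has been covered. */
static bool toy_overwrite_new(const struct toy_extent *ext, size_t n,
			      unsigned int lblk, unsigned int blklen,
			      bool *unwritten)
{
	*unwritten = false;
	while (blklen > 0) {
		enum toy_state state = TOY_HOLE;
		int ret = toy_map_blocks(ext, n, lblk, blklen, &state);

		if (ret <= 0)
			return false;	/* hole: blocks must be allocated */
		if (state == TOY_UNWRITTEN)
			*unwritten = true;	/* OR across extents */
		blklen -= (unsigned int)ret;
		lblk += (unsigned int)ret;
	}
	return true;
}
```

With a table of { 0, 8, TOY_WRITTEN } followed by { 8, 8, TOY_UNWRITTEN },
a request for blocks 4..11 is rejected by toy_overwrite_old() but
accepted by toy_overwrite_new(), which also sets *unwritten.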

The *unwritten output now uses OR semantics across extents: it is set
if any block in the range is unwritten. This is correct for the two
callers:

  - (unaligned_io && unwritten) takes the exclusive lock, which is
    needed if any block requires partial-block zeroing.
  - (ilock_shared && !unwritten) selects ext4_iomap_overwrite_ops,
    which skips journal transactions and is only safe when every block
    is written/mapped.
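
A minimal model of how the two callers consume the result, covering
only the two conditions named above. The real checks in fs/ext4/file.c
look at more state than this, and the toy_* names are hypothetical:

```c
#include <stdbool.h>

enum toy_lock { TOY_LOCK_SHARED, TOY_LOCK_EXCLUSIVE };
enum toy_ops { TOY_OPS_DEFAULT, TOY_OPS_OVERWRITE };

struct toy_plan {
	enum toy_lock lock;	/* how i_rwsem would be held */
	enum toy_ops ops;	/* which iomap ops would be used */
};

/*
 * overwrite and unwritten are the two outputs of the overwrite check;
 * unaligned_io says whether the write is not block aligned.
 */
static struct toy_plan toy_plan_dio_write(bool overwrite, bool unwritten,
					  bool unaligned_io)
{
	struct toy_plan plan = { TOY_LOCK_SHARED, TOY_OPS_DEFAULT };

	/* Allocation, or partial-block zeroing of an unwritten block. */
	if (!overwrite || (unaligned_io && unwritten))
		plan.lock = TOY_LOCK_EXCLUSIVE;

	/* The overwrite ops are only safe if no block is unwritten. */
	if (plan.lock == TOY_LOCK_SHARED && !unwritten)
		plan.ops = TOY_OPS_OVERWRITE;

	return plan;
}
```

Because *unwritten is an OR over all extents in the range, a single
unwritten block is enough to keep an unaligned write on the exclusive
path and to keep an aligned one off the overwrite ops.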

The loop adds at most one extra ext4_map_blocks() call per extent
boundary, which is negligible compared to the lock contention it
eliminates.

Reproducer: two threads doing O_DIRECT I/O on a fallocated ext4 file.
Thread 1 appends sequentially in 4-16 KB writes. Thread 2 reads from
the tail of the file in up to 1 MB reads. Both use the same fd with
the file preallocated via posix_fallocate().

Tested on ext4 over NVMe, 6.6-based kernel:

                                  before      after
  writer-only throughput:         399 MB/s    412 MB/s
  mixed (writer + reader):         11 MB/s    381 MB/s
  write latency (mixed):          880 us       21 us
  rwsem_down_write_slowpath
    (5 s sample, mixed):         1792           2

Signed-off-by: Peng Wang <peng_wang@linux.alibaba.com>
---
 fs/ext4/file.c | 25 ++++++++++++++++---------
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index eb1a323962b1..d060de8eddac 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -228,15 +228,22 @@ static bool ext4_overwrite_io(struct inode *inode,
 	map.m_len = EXT4_MAX_BLOCKS(len, pos, blkbits);
 	blklen = map.m_len;

-	err = ext4_map_blocks(NULL, inode, &map, 0);
-	if (err != blklen)
-		return false;
-	/*
-	 * 'err==len' means that all of the blocks have been preallocated,
-	 * regardless of whether they have been initialized or not. We need to
-	 * check m_flags to distinguish the unwritten extents.
-	 */
-	*unwritten = !(map.m_flags & EXT4_MAP_MAPPED);
+	*unwritten = false;
+
+	while (blklen > 0) {
+		map.m_len = blklen;
+		err = ext4_map_blocks(NULL, inode, &map, 0);
+		/*
+		 * err <= 0 means a hole or error; the write needs block
+		 * allocation so it cannot be treated as an overwrite.
+		 */
+		if (err <= 0)
+			return false;
+		if (!(map.m_flags & EXT4_MAP_MAPPED))
+			*unwritten = true;
+		blklen -= err;
+		map.m_lblk += err;
+	}
 	return true;
 }
--
2.43.0