From: Baokun Li <libaokun@linux.alibaba.com>
To: linux-ext4@vger.kernel.org
Cc: tytso@mit.edu, adilger.kernel@dilger.ca, jack@suse.cz,
yi.zhang@huawei.com, ojaswin@linux.ibm.com,
ritesh.list@gmail.com, peng_wang@linux.alibaba.com
Subject: [PATCH 1/2] ext4: skip overwrite check for aligned non-extending DIO writes
Date: Fri, 12 Jun 2026 00:34:40 +0800 [thread overview]
Message-ID: <20260611163441.2431805-2-libaokun@linux.alibaba.com> (raw)
In-Reply-To: <20260611163441.2431805-1-libaokun@linux.alibaba.com>
Currently, ext4_dio_write_checks() calls ext4_overwrite_io() to
determine if a write is a pure overwrite, and upgrades to exclusive
i_rwsem if not. However, ext4_overwrite_io() uses a single
ext4_map_blocks() call which only returns the first contiguous extent of
the same type. A write spanning multiple pre-allocated extents (e.g.
written + unwritten, or two physically discontiguous written extents)
produces a false negative, forcing an unnecessary exclusive lock upgrade.
After commit 5d87c7fca2c1 ("ext4: avoid starting handle when dio
writing an unwritten extent") and commit 012924f0eeef ("ext4: remove
useless ext4_iomap_overwrite_ops"), ext4_iomap_begin()'s fast path
accepts both EXT4_MAP_MAPPED and EXT4_MAP_UNWRITTEN without starting a
journal transaction. The iomap iteration naturally handles multi-extent
ranges: each call returns the mapping for the current segment, and
unwritten-to-written conversion is deferred to ext4_dio_write_end_io().
This means the common case of mixed written/unwritten extents never
reaches ext4_iomap_alloc() at all.
Even for the less common case where the range contains a hole and
ext4_iomap_alloc() is needed, exclusive i_rwsem is still unnecessary for
aligned non-extending writes:
- truncate/punch_hole are kept out: they require exclusive i_rwsem
(blocked by our shared lock during allocation), and inode_dio_begin()
keeps their inode_dio_wait() blocked until in-flight bios complete.
- i_data_sem write-lock inside ext4_map_blocks() serializes concurrent
extent tree modifications (parallel writers to the same hole).
- The journal handle is per-thread and does not require i_rwsem
exclusion.
- i_disksize and orphan list are not involved in non-extending writes.
Skip the ext4_overwrite_io() check entirely for aligned writes by
initializing overwrite to true and only calling ext4_overwrite_io() for
unaligned writes. Unaligned writes still need the extent state check
because concurrent partial block zeroing in the DIO layer requires
exclusive serialization unless the range is a pure written-extent
overwrite.
Performance:
Hardware: /dev/sda (rotational disk, ~1 GB/s sustained write)
Filesystem: ext4 default mkfs
Aligned 8K DIO writes spanning written+unwritten extent boundaries.
Each thread writes its own 1G region sequentially; the file is rebuilt
between runs so every block is written exactly once. Metric: IOPS.
JOBS Before After speedup
---- -------- --------- -------
1 42,322 43,329 1.02x
2 68,516 70,677 1.03x
4 62,489 97,072 1.55x
8 58,701 110,819 1.89x
16 58,569 116,392 1.99x
32 60,860 117,244 1.93x
Wall time at JOBS=32: 69.2s (Before) -> 35.4s (After), 1.96x faster.
Signed-off-by: Baokun Li <libaokun@linux.alibaba.com>
---
fs/ext4/file.c | 52 +++++++++++++++++++++++++++++---------------------
1 file changed, 30 insertions(+), 22 deletions(-)
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index eb1a323962b1..6f3886465ce3 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -428,16 +428,27 @@ static const struct iomap_dio_ops ext4_dio_write_ops = {
* condition requires an exclusive inode lock. If yes, then we restart the
* whole operation by releasing the shared lock and acquiring exclusive lock.
*
- * - For unaligned_io we never take shared lock as it may cause data corruption
- * when two unaligned IO tries to modify the same block e.g. while zeroing.
+ * The decision is layered, evaluated in this order:
*
- * - For extending writes case we don't take the shared lock, since it requires
- * updating inode i_disksize and/or orphan handling with exclusive lock.
+ * 1. If file_modified() needs to update security info (!IS_NOSEC), upgrade
+ * to the exclusive lock -- the security update itself requires it,
+ * regardless of whether the write extends the file or is aligned.
*
- * - shared locking will only be true mostly with overwrites, including
- * initialized blocks and unwritten blocks.
+ * 2. If the write extends i_size or i_disksize, upgrade to the exclusive
+ * lock to safely update i_disksize and the orphan list, regardless of
+ * alignment.
*
- * - Otherwise we will switch to exclusive i_rwsem lock.
+ * 3. Otherwise, for aligned non-extending writes, shared lock is always
+ * sufficient regardless of extent state (written, unwritten, or hole).
+ * truncate/punch_hole cannot run while we hold the shared i_rwsem
+ * (they need it exclusively); after we release it, inode_dio_begin()
+ * keeps their inode_dio_wait() blocked until in-flight bios complete.
+ * i_data_sem serializes concurrent extent tree modifications.
+ *
+ * 4. Otherwise, the write is unaligned and non-extending. Shared lock is
+ * only safe for pure written-extent overwrites. Unwritten extents or
+ * holes require exclusive lock because concurrent partial block zeroing
+ * in the DIO layer could corrupt data.
*/
static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
bool *ilock_shared, bool *extend,
@@ -448,7 +459,7 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
loff_t offset;
size_t count;
ssize_t ret;
- bool overwrite, unaligned_io, unwritten;
+ bool overwrite = true, unaligned_io, unwritten = false;
restart:
ret = ext4_generic_write_checks(iocb, from);
@@ -460,22 +471,19 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
unaligned_io = ext4_unaligned_io(inode, from, offset);
*extend = ext4_extending_io(inode, offset, count);
- overwrite = ext4_overwrite_io(inode, offset, count, &unwritten);
/*
- * Determine whether we need to upgrade to an exclusive lock. This is
- * required to change security info in file_modified(), for extending
- * I/O, any form of non-overwrite I/O, and unaligned I/O to unwritten
- * extents (as partial block zeroing may be required).
- *
- * Note that unaligned writes are allowed under shared lock so long as
- * they are pure overwrites. Otherwise, concurrent unaligned writes risk
- * data corruption due to partial block zeroing in the dio layer, and so
- * the I/O must occur exclusively.
+ * For unaligned writes we need to know the extent state to determine
+ * whether shared lock is safe. For aligned writes we skip this check
+ * entirely since allocation under shared lock is safe.
*/
+ if (unaligned_io)
+ overwrite = ext4_overwrite_io(inode, offset, count, &unwritten);
+
+ /* Determine whether we need to upgrade to an exclusive lock. */
if (*ilock_shared &&
- ((!IS_NOSEC(inode) || *extend || !overwrite ||
- (unaligned_io && unwritten)))) {
+ ((!IS_NOSEC(inode) || *extend ||
+ (unaligned_io && (!overwrite || unwritten))))) {
if (iocb->ki_flags & IOCB_NOWAIT) {
ret = -EAGAIN;
goto out;
@@ -490,8 +498,8 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
* Now that locking is settled, determine dio flags and exclusivity
* requirements. We don't use DIO_OVERWRITE_ONLY because we enforce
* behavior already. The inode lock is already held exclusive if the
- * write is non-overwrite or extending, so drain all outstanding dio and
- * set the force wait dio flag.
+ * write is unaligned non-overwrite or extending, so drain all
+ * outstanding dio and set the force wait dio flag.
*/
if (!*ilock_shared && (unaligned_io || *extend)) {
if (iocb->ki_flags & IOCB_NOWAIT) {
--
2.43.7
next prev parent reply other threads:[~2026-06-11 16:35 UTC|newest]
Thread overview: 3+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-06-11 16:34 [PATCH 0/2] ext4: allow more DIO writes under shared i_rwsem Baokun Li
2026-06-11 16:34 ` Baokun Li [this message]
2026-06-11 16:34 ` [PATCH 2/2] ext4: base unaligned DIO lock decision on partial block zeroing Baokun Li
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260611163441.2431805-2-libaokun@linux.alibaba.com \
--to=libaokun@linux.alibaba.com \
--cc=adilger.kernel@dilger.ca \
--cc=jack@suse.cz \
--cc=linux-ext4@vger.kernel.org \
--cc=ojaswin@linux.ibm.com \
--cc=peng_wang@linux.alibaba.com \
--cc=ritesh.list@gmail.com \
--cc=tytso@mit.edu \
--cc=yi.zhang@huawei.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox