* [PATCH 0/2] ext4: allow more DIO writes under shared i_rwsem
@ 2026-06-11 16:34 Baokun Li
  2026-06-11 16:34 ` [PATCH 1/2] ext4: skip overwrite check for aligned non-extending DIO writes Baokun Li
  2026-06-11 16:34 ` [PATCH 2/2] ext4: base unaligned DIO lock decision on partial block zeroing Baokun Li
  0 siblings, 2 replies; 3+ messages in thread
From: Baokun Li @ 2026-06-11 16:34 UTC
  To: linux-ext4
  Cc: tytso, adilger.kernel, jack, yi.zhang, ojaswin, ritesh.list,
	peng_wang

Hi all,

This series relaxes the i_rwsem requirements of ext4_dio_write_iter()
so that more direct I/O writes can proceed under the shared lock.

It continues the work started by Peng Wang's RFC [1]; I'm taking
over this effort going forward.

ext4_dio_write_checks() currently calls ext4_overwrite_io() to decide
whether the shared lock is sufficient. Its single ext4_map_blocks()
lookup only sees the first contiguous extent of the same type, which
forces the exclusive lock for two cases that are actually safe under
the shared lock (see individual patches for the full safety
argument):

  1. Aligned writes spanning multiple already-allocated extents (e.g.
     written + unwritten, or two discontiguous written extents).

  2. Unaligned writes whose head/tail partial blocks land on written
     extents but whose fully covered middle blocks include holes or
     unwritten extents.

Patch 1 skips the ext4_overwrite_io() pre-check entirely for aligned
non-extending writes, letting them proceed under the shared lock
regardless of extent state.

Patch 2 replaces ext4_overwrite_io() with ext4_dio_needs_zeroing(),
which directly answers the question driving the lock decision. It
checks only the head and tail partial blocks (at most two
ext4_map_blocks() calls), and ignores the state of middle blocks.

Testing
=======

"kvm-xfstests -c ext4/all -g auto" passes with no new failures.

Performance
===========

Hardware: /dev/sda (rotational disk, ~1 GB/s sustained write)
Filesystem: ext4 default mkfs

Test 1: aligned 8K DIO writes spanning written+unwritten extent
boundaries. Each thread writes its own 1G region sequentially; the
file is rebuilt between runs so every block is written exactly once.
Metric: IOPS; speedup compares +patch 1+2 against base.

  JOBS         base    +patch 1    +patch 1+2    speedup
  ----    ---------    --------    ----------    -------
     1       42,322      43,329        43,087      1.02x
     2       68,516      70,677        66,958      0.98x
     4       62,489      97,072       101,468      1.62x
     8       58,701     110,819       113,679      1.94x
    16       58,569     116,392       115,272      1.97x
    32       60,860     117,244       119,621      1.97x

Wall time at JOBS=32: 69.2s (base) -> 35.4s (patched), 1.96x faster.

Test 2: unaligned DIO writes (14336 bytes at +512 within each 16K
stripe). Each stripe is laid out as [written][unwritten][unwritten]
[written], so the head and tail partial blocks land on written
extents but the middle is unwritten. Metric: IOPS; speedup compares
+patch 1+2 against base.

  JOBS         base    +patch 1    +patch 1+2    speedup
  ----    ---------    --------    ----------    -------
     1       15,547      15,975        17,381      1.12x
     2       15,910      14,808        34,172      2.15x
     4       15,014      14,828        57,567      3.83x
     8       15,022      14,648        81,947      5.46x
    16       14,586      14,262        99,126      6.80x
    32       14,047      13,809        92,519      6.59x

Wall time at JOBS=32: 149.3s (base) -> 22.7s (patched), 6.58x faster.

In test 2, patch 1 alone has no measurable effect (the small differences
are noise) because it only touches the aligned write path. Patch 2 introduces
ext4_dio_needs_zeroing() which precisely identifies when partial
block zeroing is required, allowing the shared lock for the much
larger set of unaligned writes that don't actually trigger zeroing.
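
For illustration only (not part of the series), the block arithmetic
behind test 2 can be sketched in userspace C. The sketch assumes a 4096
byte block size, which the text above does not state; split_write() and
struct write_split are hypothetical names invented for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical summary of how a write lines up with filesystem blocks. */
struct write_split {
	uint64_t head_lblk;	/* first logical block touched */
	uint64_t tail_lblk;	/* last logical block touched */
	bool head_partial;	/* write starts inside a block */
	bool tail_partial;	/* write ends inside a block */
	uint64_t full_blocks;	/* blocks covered from start to end */
};

/* Classify a write of len bytes at byte offset pos. */
static struct write_split split_write(uint64_t pos, uint64_t len,
				      uint64_t blksize)
{
	struct write_split s;
	uint64_t first_full, end_full;

	assert(len > 0 && blksize > 0);

	s.head_lblk = pos / blksize;
	s.tail_lblk = (pos + len - 1) / blksize;
	s.head_partial = (pos % blksize) != 0;
	s.tail_partial = ((pos + len) % blksize) != 0;

	/* Fully covered blocks exclude a partial head and a partial tail. */
	first_full = s.head_partial ? s.head_lblk + 1 : s.head_lblk;
	end_full = s.tail_partial ? s.tail_lblk : s.tail_lblk + 1;
	s.full_blocks = end_full > first_full ? end_full - first_full : 0;
	return s;
}
```

Under that block size assumption, split_write(512, 14336, 4096) reports
a partial head in block 0, a partial tail in block 3, and two fully
covered blocks in between, which matches the
[written][unwritten][unwritten][written] stripe layout used in test 2.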

Comments and questions are, as always, welcome.

Thanks,
Baokun

[1]: https://patch.msgid.link/20260607124935.6168-1-peng_wang@linux.alibaba.com

Baokun Li (2):
  ext4: skip overwrite check for aligned non-extending DIO writes
  ext4: base unaligned DIO lock decision on partial block zeroing

 fs/ext4/file.c | 132 +++++++++++++++++++++++++++++++++----------------
 1 file changed, 89 insertions(+), 43 deletions(-)

-- 
2.43.7


* [PATCH 1/2] ext4: skip overwrite check for aligned non-extending DIO writes
From: Baokun Li @ 2026-06-11 16:34 UTC
  To: linux-ext4
  Cc: tytso, adilger.kernel, jack, yi.zhang, ojaswin, ritesh.list,
	peng_wang

Currently, ext4_dio_write_checks() calls ext4_overwrite_io() to
determine if a write is a pure overwrite, and upgrades to exclusive
i_rwsem if not. However, ext4_overwrite_io() uses a single
ext4_map_blocks() call which only returns the first contiguous extent of
the same type. A write spanning multiple pre-allocated extents (e.g.
written + unwritten, or two physically discontiguous written extents)
produces a false negative, forcing an unnecessary exclusive lock upgrade.
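
The false negative can be reproduced with a small userspace model. This
is only a sketch: struct toy_extent, toy_map_once() and the other toy_*
helpers are hypothetical stand-ins invented here, and toy_map_once()
imitates just one property of the real lookup, namely that a single
call never reports more than one extent.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical extent states for this toy model (not kernel definitions). */
enum blk_state { HOLE, UNWRITTEN, WRITTEN };

/* One toy extent: logical start, length, physical start, state. */
struct toy_extent {
	unsigned int lblk;
	unsigned int len;
	unsigned int pblk;
	enum blk_state state;
};

/*
 * Toy single lookup: report how many blocks of [lblk, lblk + len) are
 * covered by the one extent that contains lblk, or 0 for a hole.
 */
static unsigned int toy_map_once(const struct toy_extent *ext, int n,
				 unsigned int lblk, unsigned int len,
				 enum blk_state *state)
{
	assert(len > 0);
	for (int i = 0; i < n; i++) {
		if (lblk >= ext[i].lblk && lblk < ext[i].lblk + ext[i].len) {
			unsigned int avail = ext[i].lblk + ext[i].len - lblk;

			*state = ext[i].state;
			return avail < len ? avail : len;
		}
	}
	*state = HOLE;
	return 0;
}

/* Old-style check: "overwrite" only if one lookup covers the range. */
static bool toy_overwrite_single_lookup(const struct toy_extent *ext, int n,
					unsigned int lblk, unsigned int len)
{
	enum blk_state state;

	return toy_map_once(ext, n, lblk, len, &state) == len;
}

/* Reference answer: is every block in the range already allocated? */
static bool toy_all_allocated(const struct toy_extent *ext, int n,
			      unsigned int lblk, unsigned int len)
{
	enum blk_state state;

	for (unsigned int b = lblk; b < lblk + len; b++)
		if (toy_map_once(ext, n, b, 1, &state) != 1)
			return false;
	return true;
}
```

With blocks 0-1 written and blocks 2-3 unwritten, a two-block write
starting at block 1 is fully allocated, yet the single-lookup check
reports it as a non-overwrite because the first lookup stops at the
extent boundary.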

After commit 5d87c7fca2c1 ("ext4: avoid starting handle when dio
writing an unwritten extent") and commit 012924f0eeef ("ext4: remove
useless ext4_iomap_overwrite_ops"), ext4_iomap_begin()'s fast path
accepts both EXT4_MAP_MAPPED and EXT4_MAP_UNWRITTEN without starting a
journal transaction. The iomap iteration naturally handles multi-extent
ranges: each call returns the mapping for the current segment, and
unwritten-to-written conversion is deferred to ext4_dio_write_end_io().
This means the common case of mixed written/unwritten extents never
reaches ext4_iomap_alloc() at all.
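
As a rough userspace illustration of the segment-by-segment iteration
described above (a toy model, not the iomap API): toy_map() and
toy_iterate() are hypothetical helpers, and the model tracks only the
state of each block, not its physical location.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical block states for this toy model. */
enum blk_state { HOLE, UNWRITTEN, WRITTEN };

/*
 * Toy lookup: starting at lblk, count how many consecutive blocks (at
 * most max_len) share the state of block lblk.
 */
static size_t toy_map(const enum blk_state *blocks, size_t nblocks,
		      size_t lblk, size_t max_len, enum blk_state *state)
{
	size_t n = 0;

	assert(lblk < nblocks && max_len > 0);
	*state = blocks[lblk];
	while (n < max_len && lblk + n < nblocks && blocks[lblk + n] == *state)
		n++;
	return n;
}

/*
 * Walk [lblk, lblk + len) one segment at a time, the way an iterating
 * caller would: map, consume what was returned, advance, repeat.
 * Fills counts[] with blocks seen per state; returns the segment count.
 */
static size_t toy_iterate(const enum blk_state *blocks, size_t nblocks,
			  size_t lblk, size_t len, size_t counts[3])
{
	size_t segments = 0;

	assert(lblk + len <= nblocks);
	counts[HOLE] = counts[UNWRITTEN] = counts[WRITTEN] = 0;

	while (len > 0) {
		enum blk_state state;
		size_t n = toy_map(blocks, nblocks, lblk, len, &state);

		counts[state] += n;
		lblk += n;
		len -= n;
		segments++;
	}
	return segments;
}
```

For a layout of written, written, unwritten, unwritten, written, walking
blocks 1-4 yields three segments and never needs the whole range to be
one extent, which is why a single up-front lookup is a poor predictor
of what the iteration will encounter.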

Even for the less common case where the range contains a hole and
ext4_iomap_alloc() is needed, exclusive i_rwsem is still unnecessary for
aligned non-extending writes:

 - truncate/punch_hole are kept out: they require exclusive i_rwsem
   (blocked by our shared lock during allocation), and inode_dio_begin()
   keeps their inode_dio_wait() blocked until in-flight bios complete.
 - The i_data_sem write lock taken inside ext4_map_blocks() serializes
   concurrent extent tree modifications (e.g. parallel writers to the
   same hole).
 - The journal handle is per-thread and does not require i_rwsem
   exclusion.
 - i_disksize and orphan list are not involved in non-extending writes.

Skip the ext4_overwrite_io() check entirely for aligned writes by
initializing overwrite to true and only calling ext4_overwrite_io() for
unaligned writes. Unaligned writes still need the extent state check
because concurrent partial block zeroing in the DIO layer requires
exclusive serialization unless the range is a pure written-extent
overwrite.
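
The change in the upgrade condition can be modelled as two pure
predicates. This is a userspace sketch for reasoning about the truth
table only: struct dio_write_facts and both function names are
hypothetical, and the real code evaluates these inputs from the inode
and iocb rather than from a struct.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bundle of the inputs the lock decision looks at. */
struct dio_write_facts {
	bool nosec;	/* IS_NOSEC(inode): no security info update needed */
	bool extend;	/* write extends i_size or i_disksize */
	bool unaligned;	/* write is not block aligned */
	bool overwrite;	/* what ext4_overwrite_io() would return */
	bool unwritten;	/* the looked-up extent is unwritten */
};

/* Upgrade condition before this patch. */
static bool needs_exclusive_before(const struct dio_write_facts *f)
{
	assert(f);
	return !f->nosec || f->extend || !f->overwrite ||
	       (f->unaligned && f->unwritten);
}

/*
 * Upgrade condition after this patch: the extent state matters only
 * for unaligned writes. In the patched kernel code the overwrite check
 * is not even run for aligned writes; here it is simply ignored.
 */
static bool needs_exclusive_after(const struct dio_write_facts *f)
{
	assert(f);
	return !f->nosec || f->extend ||
	       (f->unaligned && (!f->overwrite || f->unwritten));
}
```

The only inputs on which the two predicates differ are aligned,
non-extending writes with IS_NOSEC set for which the single lookup
reported a non-overwrite; those previously forced the exclusive lock
and now stay on the shared lock.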

Performance:

Hardware: /dev/sda (rotational disk, ~1 GB/s sustained write)
Filesystem: ext4 default mkfs

Aligned 8K DIO writes spanning written+unwritten extent boundaries.
Each thread writes its own 1G region sequentially; the file is rebuilt
between runs so every block is written exactly once. Metric: IOPS.

  JOBS      Before        After    speedup
  ----    --------    ---------    -------
     1      42,322       43,329      1.02x
     2      68,516       70,677      1.03x
     4      62,489       97,072      1.55x
     8      58,701      110,819      1.89x
    16      58,569      116,392      1.99x
    32      60,860      117,244      1.93x

Wall time at JOBS=32: 69.2s (Before) -> 35.4s (After), 1.96x faster.

Signed-off-by: Baokun Li <libaokun@linux.alibaba.com>
---
 fs/ext4/file.c | 52 +++++++++++++++++++++++++++++---------------------
 1 file changed, 30 insertions(+), 22 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index eb1a323962b1..6f3886465ce3 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -428,16 +428,27 @@ static const struct iomap_dio_ops ext4_dio_write_ops = {
  * condition requires an exclusive inode lock. If yes, then we restart the
  * whole operation by releasing the shared lock and acquiring exclusive lock.
  *
- * - For unaligned_io we never take shared lock as it may cause data corruption
- *   when two unaligned IO tries to modify the same block e.g. while zeroing.
+ * The decision is layered, evaluated in this order:
  *
- * - For extending writes case we don't take the shared lock, since it requires
- *   updating inode i_disksize and/or orphan handling with exclusive lock.
+ * 1. If file_modified() needs to update security info (!IS_NOSEC), upgrade
+ *    to the exclusive lock -- the security update itself requires it,
+ *    regardless of whether the write extends the file or is aligned.
  *
- * - shared locking will only be true mostly with overwrites, including
- *   initialized blocks and unwritten blocks.
+ * 2. If the write extends i_size or i_disksize, upgrade to the exclusive
+ *    lock to safely update i_disksize and the orphan list, regardless of
+ *    alignment.
  *
- * - Otherwise we will switch to exclusive i_rwsem lock.
+ * 3. Otherwise, for aligned non-extending writes, shared lock is always
+ *    sufficient regardless of extent state (written, unwritten, or hole).
+ *    truncate/punch_hole cannot run while we hold the shared i_rwsem
+ *    (they need it exclusively); after we release it, inode_dio_begin()
+ *    keeps their inode_dio_wait() blocked until in-flight bios complete.
+ *    i_data_sem serializes concurrent extent tree modifications.
+ *
+ * 4. Otherwise, the write is unaligned and non-extending. Shared lock is
+ *    only safe for pure written-extent overwrites. Unwritten extents or
+ *    holes require exclusive lock because concurrent partial block zeroing
+ *    in the DIO layer could corrupt data.
  */
 static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
 				     bool *ilock_shared, bool *extend,
@@ -448,7 +459,7 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
 	loff_t offset;
 	size_t count;
 	ssize_t ret;
-	bool overwrite, unaligned_io, unwritten;
+	bool overwrite = true, unaligned_io, unwritten = false;
 
 restart:
 	ret = ext4_generic_write_checks(iocb, from);
@@ -460,22 +471,19 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
 
 	unaligned_io = ext4_unaligned_io(inode, from, offset);
 	*extend = ext4_extending_io(inode, offset, count);
-	overwrite = ext4_overwrite_io(inode, offset, count, &unwritten);
 
 	/*
-	 * Determine whether we need to upgrade to an exclusive lock. This is
-	 * required to change security info in file_modified(), for extending
-	 * I/O, any form of non-overwrite I/O, and unaligned I/O to unwritten
-	 * extents (as partial block zeroing may be required).
-	 *
-	 * Note that unaligned writes are allowed under shared lock so long as
-	 * they are pure overwrites. Otherwise, concurrent unaligned writes risk
-	 * data corruption due to partial block zeroing in the dio layer, and so
-	 * the I/O must occur exclusively.
+	 * For unaligned writes we need to know the extent state to determine
+	 * whether shared lock is safe. For aligned writes we skip this check
+	 * entirely since allocation under shared lock is safe.
 	 */
+	if (unaligned_io)
+		overwrite = ext4_overwrite_io(inode, offset, count, &unwritten);
+
+	/* Determine whether we need to upgrade to an exclusive lock. */
 	if (*ilock_shared &&
-	    ((!IS_NOSEC(inode) || *extend || !overwrite ||
-	     (unaligned_io && unwritten)))) {
+	    ((!IS_NOSEC(inode) || *extend ||
+	     (unaligned_io && (!overwrite || unwritten))))) {
 		if (iocb->ki_flags & IOCB_NOWAIT) {
 			ret = -EAGAIN;
 			goto out;
@@ -490,8 +498,8 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
 	 * Now that locking is settled, determine dio flags and exclusivity
 	 * requirements. We don't use DIO_OVERWRITE_ONLY because we enforce
 	 * behavior already. The inode lock is already held exclusive if the
-	 * write is non-overwrite or extending, so drain all outstanding dio and
-	 * set the force wait dio flag.
+	 * write is unaligned non-overwrite or extending, so drain all
+	 * outstanding dio and set the force wait dio flag.
 	 */
 	if (!*ilock_shared && (unaligned_io || *extend)) {
 		if (iocb->ki_flags & IOCB_NOWAIT) {
-- 
2.43.7



* [PATCH 2/2] ext4: base unaligned DIO lock decision on partial block zeroing
From: Baokun Li @ 2026-06-11 16:34 UTC
  To: linux-ext4
  Cc: tytso, adilger.kernel, jack, yi.zhang, ojaswin, ritesh.list,
	peng_wang

For unaligned DIO writes, the previous ext4_overwrite_io() required the
entire range to fall within a single written extent.  This was overly
conservative: the DIO layer only performs partial block zeroing for the
head and tail blocks when they are partially covered by the write.
Middle blocks that are fully covered are written as whole blocks
without any zeroing, so they are safe regardless of extent state.

Therefore the exclusive lock is required only when partial block zeroing
will actually happen, that is, when either of the following holds:
 - The head partial block (if any) lands on a hole or unwritten extent.
 - The tail partial block (if any) lands on a hole or unwritten extent.

Middle full-cover blocks can be in any state (hole, unwritten, or
written) - block allocation under shared lock is safe per the previous
patch's analysis (inode_dio_begin + i_data_sem protection).

Replace ext4_overwrite_io() with ext4_dio_needs_zeroing(), which
directly answers the question driving the lock decision.  It uses at
most two ext4_map_blocks() calls: one for the head partial block (also
catching the case where it spans through the tail), and one for the
tail partial block if not already covered.

This enables shared lock for previously-rejected scenarios such as:
 - Unaligned write spanning written extent + mid-range hole + written
   extent at the tail.
 - Unaligned write where the partial blocks land on written extents but
   the middle has unwritten extents.

Performance:

Hardware: /dev/sda (rotational disk, ~1 GB/s sustained write)
Filesystem: ext4 default mkfs

Unaligned DIO writes (14336 bytes at +512 within each 16K stripe).
Each stripe is laid out as [written][unwritten][unwritten][written],
so the head and tail partial blocks land on written extents but the
middle is unwritten.  Metric: IOPS.

  JOBS      Before        After    speedup
  ----    --------    ---------    -------
     1      15,547       17,381      1.12x
     2      15,910       34,172      2.15x
     4      15,014       57,567      3.83x
     8      15,022       81,947      5.46x
    16      14,586       99,126      6.80x
    32      14,047       92,519      6.59x

Wall time at JOBS=32: 149.3s (Before) -> 22.7s (After), 6.58x faster.

Signed-off-by: Baokun Li <libaokun@linux.alibaba.com>
---
 fs/ext4/file.c | 108 +++++++++++++++++++++++++++++++++----------------
 1 file changed, 73 insertions(+), 35 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 6f3886465ce3..aa926e641739 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -213,31 +213,60 @@ ext4_extending_io(struct inode *inode, loff_t offset, size_t len)
 	return false;
 }
 
-/* Is IO overwriting allocated or initialized blocks? */
-static bool ext4_overwrite_io(struct inode *inode,
-			      loff_t pos, loff_t len, bool *unwritten)
+/*
+ * Does an unaligned DIO write require partial block zeroing?
+ *
+ * Partial block zeroing is performed only for the head and tail blocks
+ * when they are partially covered by the write and the underlying extent
+ * is a hole or unwritten. Middle blocks (fully covered by the write)
+ * are written as whole blocks without zeroing.
+ *
+ * When zeroing is required, two concurrent unaligned DIO writes to the
+ * same partial block can race and corrupt each other's data, so the
+ * caller must take the exclusive i_rwsem and drain in-flight DIO. When
+ * zeroing is not required, shared lock is safe -- block allocation and
+ * unwritten conversion for middle blocks are protected by i_data_sem
+ * and inode_dio_begin().
+ */
+static bool ext4_dio_needs_zeroing(struct inode *inode, loff_t pos, loff_t len)
 {
 	struct ext4_map_blocks map;
 	unsigned int blkbits = inode->i_blkbits;
-	int err, blklen;
+	unsigned long blockmask = inode->i_sb->s_blocksize - 1;
+	bool head_partial, tail_partial;
+	ext4_lblk_t head_lblk, tail_lblk;
+	int err;
 
 	if (pos + len > i_size_read(inode))
-		return false;
+		return true;
 
-	map.m_lblk = pos >> blkbits;
-	map.m_len = EXT4_MAX_BLOCKS(len, pos, blkbits);
-	blklen = map.m_len;
+	head_partial = (pos & blockmask) != 0;
+	tail_partial = ((pos + len) & blockmask) != 0;
+	head_lblk = pos >> blkbits;
+	tail_lblk = (pos + len - 1) >> blkbits;
+
+	/* Check the head partial block. */
+	if (head_partial) {
+		map.m_lblk = head_lblk;
+		map.m_len = tail_lblk - head_lblk + 1;
+		err = ext4_map_blocks(NULL, inode, &map, 0);
+		if (err <= 0 || !(map.m_flags & EXT4_MAP_MAPPED))
+			return true;
+		/* If this mapping already covers the tail block, we're done. */
+		if (!tail_partial || map.m_lblk + err > tail_lblk)
+			return false;
+	}
 
-	err = ext4_map_blocks(NULL, inode, &map, 0);
-	if (err != blklen)
-		return false;
-	/*
-	 * 'err==len' means that all of the blocks have been preallocated,
-	 * regardless of whether they have been initialized or not. We need to
-	 * check m_flags to distinguish the unwritten extents.
-	 */
-	*unwritten = !(map.m_flags & EXT4_MAP_MAPPED);
-	return true;
+	/* Check the tail partial block. */
+	if (tail_partial) {
+		map.m_lblk = tail_lblk;
+		map.m_len = 1;
+		err = ext4_map_blocks(NULL, inode, &map, 0);
+		if (err <= 0 || !(map.m_flags & EXT4_MAP_MAPPED))
+			return true;
+	}
+
+	return false;
 }
 
 static ssize_t ext4_generic_write_checks(struct kiocb *iocb,
@@ -446,9 +475,10 @@ static const struct iomap_dio_ops ext4_dio_write_ops = {
  *    i_data_sem serializes concurrent extent tree modifications.
  *
  * 4. Otherwise, the write is unaligned and non-extending. Shared lock is
- *    only safe for pure written-extent overwrites. Unwritten extents or
- *    holes require exclusive lock because concurrent partial block zeroing
- *    in the DIO layer could corrupt data.
+ *    safe unless the DIO layer needs to perform partial block zeroing --
+ *    i.e. the head or tail partial block sits on a hole or unwritten
+ *    extent. In that case upgrade to the exclusive lock and drain
+ *    in-flight DIO to avoid races with concurrent partial block zeroing.
  */
 static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
 				     bool *ilock_shared, bool *extend,
@@ -459,7 +489,7 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
 	loff_t offset;
 	size_t count;
 	ssize_t ret;
-	bool overwrite = true, unaligned_io, unwritten = false;
+	bool needs_zeroing = false;
 
 restart:
 	ret = ext4_generic_write_checks(iocb, from);
@@ -469,21 +499,22 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
 	offset = iocb->ki_pos;
 	count = ret;
 
-	unaligned_io = ext4_unaligned_io(inode, from, offset);
 	*extend = ext4_extending_io(inode, offset, count);
 
 	/*
-	 * For unaligned writes we need to know the extent state to determine
-	 * whether shared lock is safe. For aligned writes we skip this check
-	 * entirely since allocation under shared lock is safe.
+	 * For unaligned writes, check whether partial block zeroing will be
+	 * needed. If so, exclusive lock is required to serialize against
+	 * concurrent DIO that could race with the zeroing.
+	 *
+	 * For aligned writes we skip this check entirely since allocation
+	 * under shared lock is safe.
 	 */
-	if (unaligned_io)
-		overwrite = ext4_overwrite_io(inode, offset, count, &unwritten);
+	if (ext4_unaligned_io(inode, from, offset))
+		needs_zeroing = ext4_dio_needs_zeroing(inode, offset, count);
 
 	/* Determine whether we need to upgrade to an exclusive lock. */
 	if (*ilock_shared &&
-	    ((!IS_NOSEC(inode) || *extend ||
-	     (unaligned_io && (!overwrite || unwritten))))) {
+	    (!IS_NOSEC(inode) || *extend || needs_zeroing)) {
 		if (iocb->ki_flags & IOCB_NOWAIT) {
 			ret = -EAGAIN;
 			goto out;
@@ -497,16 +528,23 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
 	/*
 	 * Now that locking is settled, determine dio flags and exclusivity
 	 * requirements. We don't use DIO_OVERWRITE_ONLY because we enforce
-	 * behavior already. The inode lock is already held exclusive if the
-	 * write is unaligned non-overwrite or extending, so drain all
-	 * outstanding dio and set the force wait dio flag.
+	 * behavior already. When holding the exclusive lock for a write that
+	 * needs partial block zeroing or is extending the file, we must wait
+	 * for the I/O to complete synchronously:
+	 *
+	 *  - needs_zeroing: drain in-flight DIO whose end_io could race with
+	 *    our partial block zeroing, and force synchronous completion so we
+	 *    don't leave in-flight zeroing bios for the next writer to drain.
+	 *
+	 *  - extend: the caller must update i_disksize after I/O completion,
+	 *    which requires the data to be on disk first.
 	 */
-	if (!*ilock_shared && (unaligned_io || *extend)) {
+	if (!*ilock_shared && (needs_zeroing || *extend)) {
 		if (iocb->ki_flags & IOCB_NOWAIT) {
 			ret = -EAGAIN;
 			goto out;
 		}
-		if (unaligned_io && (!overwrite || unwritten))
+		if (needs_zeroing)
 			inode_dio_wait(inode);
 		*dio_flags = IOMAP_DIO_FORCE_WAIT;
 	}
-- 
2.43.7

