Linux EXT4 FS development
 help / color / mirror / Atom feed
From: Peng Wang <peng_wang@linux.alibaba.com>
To: tytso@mit.edu, adilger.kernel@dilger.ca,
	libaokun@linux.alibaba.com, jack@suse.cz, ojaswin@linux.ibm.com,
	ritesh.list@gmail.com, yi.zhang@huawei.com
Cc: linux-ext4@vger.kernel.org, inux-kernel@vger.kernel.org,
	Peng Wang <peng_wang@linux.alibaba.com>
Subject: [RFC PATCH] ext4: fix false-negative overwrite check for DIO spanning extent boundaries
Date: Sun,  7 Jun 2026 20:49:35 +0800	[thread overview]
Message-ID: <20260607124935.6168-1-peng_wang@linux.alibaba.com> (raw)

ext4_overwrite_io() decides whether a direct I/O write is an overwrite
(all target blocks already allocated) so the write can proceed under a
shared inode lock.  It calls ext4_map_blocks() once and returns false
if the mapped length is shorter than the requested length.

ext4_map_blocks() maps at most one extent per call.  When a write
straddles two extents (e.g. a written extent and an adjacent unwritten
extent created by fallocate), the single call returns only the first
extent's length.  ext4_overwrite_io() then mis-classifies the write as
non-overwrite and forces the caller to cycle i_rwsem from shared to
exclusive.

On workloads where a DIO writer appends through a fallocated region
while a DIO reader tails the same file, every write that crosses a
written/unwritten extent boundary triggers an exclusive lock
acquisition.  The writer must wait for the reader's shared lock to be
released, and while waiting the RWSEM_FLAG_WAITERS bit blocks all
other shared acquirers.  This serialises all writers to queue-depth 1
and throughput collapses.

Fix by looping ext4_map_blocks() over the remaining range.  As long as
every queried extent reports allocated blocks (written or unwritten),
the function returns true and the write keeps the shared lock.

The *unwritten output now uses OR semantics across extents: set if any
block in the range is unwritten.  This is correct for the two callers:

 - (unaligned_io && unwritten) takes the exclusive lock, which is
   needed if any block requires partial-block zeroing.
 - (ilock_shared && !unwritten) selects ext4_iomap_overwrite_ops,
   which skips journal transactions and is only safe when every block
   is written/mapped.

The loop adds at most one extra ext4_map_blocks() call per extent
boundary, which is negligible compared to the lock contention it
eliminates.

Reproducer: two threads doing O_DIRECT I/O on a fallocated ext4 file.
Thread 1 appends sequentially in 4-16 KB writes.  Thread 2 reads from
the tail of the file in up to 1 MB reads.  Both use the same fd with
the file preallocated via posix_fallocate().

Tested on ext4 over NVMe, 6.6 based kernel:

                              before          after
  writer-only throughput:     399 MB/s        412 MB/s
  mixed (writer + reader):     11 MB/s        381 MB/s
  write latency (mixed):     880 us            21 us
  rwsem_down_write_slowpath
   (5 s sample, mixed):       1792              2

Signed-off-by: Peng Wang <peng_wang@linux.alibaba.com>
---
 fs/ext4/file.c | 25 ++++++++++++++++---------
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index eb1a323962b1..d060de8eddac 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -228,15 +228,22 @@ static bool ext4_overwrite_io(struct inode *inode,
 	map.m_len = EXT4_MAX_BLOCKS(len, pos, blkbits);
 	blklen = map.m_len;
 
-	err = ext4_map_blocks(NULL, inode, &map, 0);
-	if (err != blklen)
-		return false;
-	/*
-	 * 'err==len' means that all of the blocks have been preallocated,
-	 * regardless of whether they have been initialized or not. We need to
-	 * check m_flags to distinguish the unwritten extents.
-	 */
-	*unwritten = !(map.m_flags & EXT4_MAP_MAPPED);
+	*unwritten = false;
+
+	while (blklen > 0) {
+		map.m_len = blklen;
+		err = ext4_map_blocks(NULL, inode, &map, 0);
+		/*
+		 * err <= 0 means a hole or error; the write needs block
+		 * allocation so it cannot be treated as an overwrite.
+		 */
+		if (err <= 0)
+			return false;
+		if (!(map.m_flags & EXT4_MAP_MAPPED))
+			*unwritten = true;
+		blklen -= err;
+		map.m_lblk += err;
+	}
 	return true;
 }
 
-- 
2.43.0


             reply	other threads:[~2026-06-07 12:49 UTC|newest]

Thread overview: 2+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-07 12:49 Peng Wang [this message]
2026-06-08  2:25 ` [RFC PATCH] ext4: fix false-negative overwrite check for DIO spanning extent boundaries Baokun Li

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260607124935.6168-1-peng_wang@linux.alibaba.com \
    --to=peng_wang@linux.alibaba.com \
    --cc=adilger.kernel@dilger.ca \
    --cc=inux-kernel@vger.kernel.org \
    --cc=jack@suse.cz \
    --cc=libaokun@linux.alibaba.com \
    --cc=linux-ext4@vger.kernel.org \
    --cc=ojaswin@linux.ibm.com \
    --cc=ritesh.list@gmail.com \
    --cc=tytso@mit.edu \
    --cc=yi.zhang@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox