From: Baokun Li <libaokun@linux.alibaba.com>
To: linux-ext4@vger.kernel.org
Cc: tytso@mit.edu, adilger.kernel@dilger.ca, jack@suse.cz,
yi.zhang@huawei.com, ojaswin@linux.ibm.com,
ritesh.list@gmail.com, peng_wang@linux.alibaba.com
Subject: [PATCH v2 0/8] ext4: allow more DIO writes under shared i_rwsem
Date: Thu, 18 Jun 2026 20:57:27 +0800
Message-ID: <20260618125735.4156639-1-libaokun@linux.alibaba.com>
Changes since v1:
* Collected Reviewed-by tags from Honza and Yi. (Thank you for your
  review!)
* Added Patch 1 to fix NOWAIT issues reported by Sashiko.
* Added Patch 2 to fix a data race between ext3 DIO and the DIO
  fallback path. (Patch 4 increases the probability of this race.)
* Added Patches 5-8 to fix other NOWAIT issues discovered during
  investigation.

v1: https://patch.msgid.link/20260611163441.2431805-1-libaokun@linux.alibaba.com
======
Hi all,
This series relaxes the i_rwsem requirements of ext4_dio_write_iter()
so that more direct I/O writes can proceed under the shared lock.
It continues the work started by Peng Wang's RFC [1]; I'm taking
over this effort going forward.
ext4_dio_write_checks() currently calls ext4_overwrite_io() to decide
whether the shared lock is sufficient. Its single ext4_map_blocks()
lookup only sees the first contiguous extent of the same type, which
forces the exclusive lock for two cases that are actually safe under
the shared lock (see individual patches for the full safety
argument):
1. Aligned writes spanning multiple already-allocated extents (e.g.
   written + unwritten, or two discontiguous written extents).
2. Unaligned writes whose head/tail partial blocks land on written
   extents while the fully covered middle blocks include holes or
   unwritten extents.
Patch 1 fixes a NOWAIT issue where ext4_iomap_alloc() may sleep when
IOMAP_NOWAIT is set.
Patch 2 fixes a data race between DIO completion and buffered I/O
fallback on ext3 (no-extent inodes). This race was made more likely
by Patch 4.
Patch 3 skips the ext4_overwrite_io() pre-check entirely for aligned
non-extending writes, letting them proceed under the shared lock
regardless of extent state.
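
As a rough userspace illustration of what "aligned non-extending" means
here, the predicate can be sketched as below. This is not the code from
Patch 3: the helper name and its parameters are hypothetical, and the
kernel's own checks may consider more inode state than a single size
value.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not the kernel's code: returns true when the
 * write [offset, offset + len) starts and ends on block boundaries and
 * does not reach past the current file size.
 */
static bool dio_write_is_aligned_non_extending(uint64_t offset, uint64_t len,
                                               uint64_t i_size,
                                               uint32_t blocksize)
{
    /* blocksize is assumed to be a power of two */
    uint64_t mask = (uint64_t)blocksize - 1;
    bool aligned = ((offset | len) & mask) == 0;
    bool extending = offset + len > i_size;

    return aligned && !extending;
}
```

With a 4K block size, an 8K write at offset 8192 into a 1G file
satisfies the predicate, while the 14336-byte write at +512 used in
Test 2 does not.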
Patch 4 replaces ext4_overwrite_io() with ext4_dio_needs_zeroing(),
which directly answers the question driving the lock decision. It
checks only the head and tail partial blocks (at most two
ext4_map_blocks() calls), and ignores the state of middle blocks.
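
The head/tail idea can be modelled in userspace roughly as follows.
This is only a sketch under stated assumptions: the per-block state
array, the enum, and model_dio_needs_zeroing() are hypothetical
stand-ins (the kernel works on extents via ext4_map_blocks(), not on an
array), and the rule "a partial block needs zeroing unless it is
already written" is my reading of the description above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy per-block states; the kernel tracks these in extents instead. */
enum blk_state { BLK_HOLE, BLK_UNWRITTEN, BLK_WRITTEN };

/*
 * Hypothetical model of the lock decision: only the head and tail
 * blocks are inspected, and only when the write covers them partially.
 * Fully covered middle blocks are never looked at.
 */
static bool model_dio_needs_zeroing(const enum blk_state *blocks,
                                    uint64_t offset, uint64_t len,
                                    uint32_t blocksize)
{
    uint64_t end = offset + len;            /* exclusive end; len > 0 */
    uint64_t head = offset / blocksize;     /* block holding first byte */
    uint64_t tail = (end - 1) / blocksize;  /* block holding last byte */

    if (offset % blocksize && blocks[head] != BLK_WRITTEN)
        return true;
    if (end % blocksize && blocks[tail] != BLK_WRITTEN)
        return true;
    return false;
}
```

For a 16K stripe laid out as [written][unwritten][unwritten][written]
with 4K blocks (the Test 2 layout), a 14336-byte write at +512 touches
blocks 0 and 3 partially and blocks 1 and 2 fully, so the model reports
that no zeroing is needed.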
Patch 5 fixes a NOWAIT issue by using kiocb_modified() instead of
file_modified() in the DIO/DAX write paths.
Patch 6 makes ext4_map_blocks() return -EAGAIN instead of 0 when
EXT4_GET_BLOCKS_CACHED_NOWAIT is set and the cache lookup misses.
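
One way to picture the return convention of Patch 6 is the toy model
below. Everything in it is hypothetical (struct toy_lookup and
toy_map_blocks_nowait() do not exist in ext4), and the motivation shown
in the comments, that a 0 return on a cache miss would look the same as
a hole, is an assumption on my part rather than something stated above.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the result of a cache-only mapping lookup. */
struct toy_lookup {
    bool cached;     /* is the range present in the in-memory cache? */
    int mapped_len;  /* number of mapped blocks; 0 models a hole */
};

/*
 * Toy model of a NOWAIT lookup: a miss is reported as -EAGAIN so the
 * caller can tell "unknown, retry where blocking is allowed" apart
 * from "known hole" (0) and "mapped" (> 0).
 */
static int toy_map_blocks_nowait(const struct toy_lookup *l)
{
    if (!l->cached)
        return -EAGAIN;
    return l->mapped_len;
}
```

A caller in this model would treat -EAGAIN as a reason to bail out of
the non-blocking attempt, and would only treat 0 as a hole.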
Patch 7 adds cache-only lookup support to ext4_iomap_begin() for
IOMAP_NOWAIT requests.
Patch 8 adds cache-only lookup support to ext4_dio_needs_zeroing()
for IOCB_NOWAIT requests.
Testing
=======
"kvm-xfstests -c ext4/all -g auto" passes with no new failures.
Performance
===========
Hardware: /dev/sda (rotational disk, ~1 GB/s sustained write)
Filesystem: ext4 default mkfs
Test 1: aligned 8K DIO writes spanning written+unwritten extent
boundaries. Each thread writes its own 1G region sequentially; the
file is rebuilt between runs so every block is written exactly once.
Metric: IOPS.
JOBS  base    +patch 3  +patch 3+4  speedup
----  ------  --------  ----------  -------
   1  42,322    43,329      43,087    1.02x
   2  68,516    70,677      66,958    1.03x
   4  62,489    97,072     101,468    1.62x
   8  58,701   110,819     113,679    1.94x
  16  58,569   116,392     115,272    1.97x
  32  60,860   117,244     119,621    1.97x
Wall time at JOBS=32: 69.2s (base) -> 35.4s (patched), 1.96x faster.
Test 2: unaligned DIO writes (14336 bytes at +512 within each 16K
stripe). Each stripe is laid out as [written][unwritten][unwritten]
[written], so the head and tail partial blocks land on written
extents but the middle is unwritten. Metric: IOPS.
JOBS  base    +patch 3  +patch 3+4  speedup
----  ------  --------  ----------  -------
   1  15,547    15,975      17,381    1.12x
   2  15,910    14,808      34,172    2.15x
   4  15,014    14,828      57,567    3.83x
   8  15,022    14,648      81,947    5.46x
  16  14,586    14,262      99,126    6.80x
  32  14,047    13,809      92,519    6.59x
Wall time at JOBS=32: 149.3s (base) -> 22.7s (patched), 6.58x faster.
In test 2, patch 3 alone has no effect (slight noise) because patch 3
only touches the aligned write path. Patch 4 introduces
ext4_dio_needs_zeroing() which precisely identifies when partial
block zeroing is required, allowing the shared lock for the much
larger set of unaligned writes that don't actually trigger zeroing.
Comments and questions are, as always, welcome.
Thanks,
Baokun
[1]: https://patch.msgid.link/20260607124935.6168-1-peng_wang@linux.alibaba.com
Baokun Li (8):
  ext4: prevent sleeping allocation in NOWAIT write path
  ext4: drain in-flight DIO before buffered write fallback
  ext4: skip overwrite check for aligned non-extending DIO writes
  ext4: base unaligned DIO lock decision on partial block zeroing
  ext4: use kiocb_modified instead of file_modified in DIO/DAX write
    path
  ext4: return -EAGAIN from ext4_map_blocks() in NOWAIT cache miss
  ext4: handle IOMAP_NOWAIT in ext4_iomap_begin() with cache-only lookup
  ext4: handle IOCB_NOWAIT in ext4_dio_needs_zeroing() with cache-only
    lookup
fs/ext4/file.c | 148 +++++++++++++++++++++++++++++++++---------------
fs/ext4/inode.c | 19 +++++--
2 files changed, 118 insertions(+), 49 deletions(-)
--
2.43.7