From: Baokun Li <libaokun@linux.alibaba.com>
To: linux-ext4@vger.kernel.org
Cc: tytso@mit.edu, adilger.kernel@dilger.ca, jack@suse.cz,
	yi.zhang@huawei.com, ojaswin@linux.ibm.com,
	ritesh.list@gmail.com, peng_wang@linux.alibaba.com
Subject: [PATCH v2 0/8] ext4: allow more DIO writes under shared i_rwsem
Date: Thu, 18 Jun 2026 20:57:27 +0800
Message-ID: <20260618125735.4156639-1-libaokun@linux.alibaba.com>


Changes since v1:
  * Collected Reviewed-by tags from Honza and Yi. (Thank you for your review!)
  * Added Patch 1 to fix NOWAIT issues reported by Sashiko.
  * Added Patch 2 to fix ext3 DIO and DIO fallback data race issue.
    (Patch 4 increases the probability of this race)
  * Added Patches 5-8 to fix other NOWAIT issues discovered during
    investigation.

v1: https://patch.msgid.link/20260611163441.2431805-1-libaokun@linux.alibaba.com


======

Hi all,

This series relaxes the i_rwsem requirements of ext4_dio_write_iter()
so that more direct I/O writes can proceed under the shared lock.

It continues the work started by Peng Wang's RFC [1]; I'm taking
over this effort going forward.

ext4_dio_write_checks() currently calls ext4_overwrite_io() to decide
whether the shared lock is sufficient. Its single ext4_map_blocks()
lookup only sees the first contiguous extent of the same type, which
forces the exclusive lock for two cases that are actually safe under
the shared lock (see individual patches for the full safety
argument):

  1. Aligned writes spanning multiple already-allocated extents (e.g.
     written + unwritten, or two discontiguous written extents).

  2. Unaligned writes whose head/tail partial blocks land on written
     extents while the fully covered middle blocks include holes or
     unwritten extents.
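
To illustrate why a single lookup rejects case 1, here is a minimal
user-space sketch. It is only a model: the types and the helpers
map_first_extent() and single_lookup_covers() are hypothetical names
invented for this example, not ext4 code. Gaps between extents stand
in for holes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model; none of these names exist in ext4. */
enum ext_state { EXT_WRITTEN, EXT_UNWRITTEN };

struct extent {
	unsigned long lblk;	/* first logical block */
	unsigned long len;	/* length in blocks */
	enum ext_state state;
};

/*
 * Model of one mapping lookup: report how many of the @want blocks
 * starting at @lblk fall inside the single extent containing @lblk.
 * Returns 0 when @lblk is not covered by any extent (a hole).
 */
static unsigned long map_first_extent(const struct extent *ext, size_t n,
				      unsigned long lblk, unsigned long want)
{
	for (size_t i = 0; i < n; i++) {
		unsigned long end = ext[i].lblk + ext[i].len;

		if (lblk >= ext[i].lblk && lblk < end) {
			unsigned long avail = end - lblk;

			return avail < want ? avail : want;
		}
	}
	return 0;
}

/* A single-lookup pre-check passes only if one extent covers the range. */
static bool single_lookup_covers(const struct extent *ext, size_t n,
				 unsigned long lblk, unsigned long want)
{
	return map_first_extent(ext, n, lblk, want) == want;
}
```

With a layout of one written block followed by one unwritten block, a
two-block aligned write fails this check even though every block is
already allocated, which is the situation case 1 describes.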

Patch 1 fixes a NOWAIT issue where ext4_iomap_alloc() may sleep when
IOMAP_NOWAIT is set.

Patch 2 fixes a data race between DIO completion and buffered I/O
fallback on ext3 (no-extent inodes). This race was made more likely
by Patch 4.

Patch 3 skips the ext4_overwrite_io() pre-check entirely for aligned
non-extending writes, letting them proceed under the shared lock
regardless of extent state.

Patch 4 replaces ext4_overwrite_io() with ext4_dio_needs_zeroing(),
which directly answers the question driving the lock decision. It
checks only the head and tail partial blocks (at most two
ext4_map_blocks() calls), and ignores the state of middle blocks.
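
The head/tail check can be sketched in user space roughly as follows.
This is a simplified model under stated assumptions, not the code from
the patch: dio_needs_zeroing_model() and the per-block state array are
hypothetical, and the real function queries ext4_map_blocks() rather
than an array. The caller is assumed to pass an array that covers the
whole write.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-block state; state[i] describes logical block i. */
enum blk_state { BLK_HOLE, BLK_WRITTEN, BLK_UNWRITTEN };

/*
 * Model of the lock decision: only the partially covered head and tail
 * blocks are inspected. Fully covered middle blocks are never looked at.
 * Returns true when a partial block is not written, i.e. zeroing of the
 * uncovered part of that block would be needed.
 */
static bool dio_needs_zeroing_model(const enum blk_state *state,
				    unsigned long long pos,
				    unsigned long long len,
				    unsigned int blkbits)
{
	unsigned long long blksize = 1ULL << blkbits;
	unsigned long long end = pos + len;

	if (len == 0)
		return false;

	/* Head block is partial when the write starts mid-block. */
	if ((pos & (blksize - 1)) && state[pos >> blkbits] != BLK_WRITTEN)
		return true;

	/* Tail block is partial when the write ends mid-block. */
	if ((end & (blksize - 1)) && state[end >> blkbits] != BLK_WRITTEN)
		return true;

	return false;
}
```

In this model a fully aligned write never inspects any block, and an
unaligned write performs at most two state lookups regardless of its
length.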

Patch 5 fixes a NOWAIT issue by using kiocb_modified instead of
file_modified in DIO/DAX write paths.

Patch 6 makes ext4_map_blocks() return -EAGAIN instead of 0 when
EXT4_GET_BLOCKS_CACHED_NOWAIT is set and the cache lookup misses.
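
The reason a distinct error code helps can be shown with a small
user-space model. Everything here is hypothetical (struct map_cache and
cached_lookup() are made up for illustration), and it assumes that a
return value of 0 otherwise means "no blocks mapped", so a miss that
also returned 0 would be indistinguishable from a hole.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical one-entry cache standing in for the real mapping cache. */
struct map_cache {
	bool valid;
	unsigned long lblk;	/* logical block this entry describes */
	unsigned long len;	/* mapped length; 0 means a cached hole */
};

/*
 * Model of a cache-only lookup:
 *   > 0      number of mapped blocks found in the cache
 *   0        the cache knows this block is a hole
 *   -EAGAIN  the cache cannot answer; the caller must retry in a
 *            context that is allowed to block
 */
static long cached_lookup(const struct map_cache *c, unsigned long lblk)
{
	if (!c->valid || c->lblk != lblk)
		return -EAGAIN;
	return (long)c->len;
}
```

A NOWAIT caller in this model can treat -EAGAIN as "retry later" and
still trust a 0 return as a real hole.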

Patch 7 adds cache-only lookup support to ext4_iomap_begin() for
IOMAP_NOWAIT requests.

Patch 8 adds cache-only lookup support to ext4_dio_needs_zeroing()
for IOCB_NOWAIT requests.


Testing
=======

"kvm-xfstests -c ext4/all -g auto" passes with no new failures.


Performance
===========

Hardware: /dev/sda (rotational disk, ~1 GB/s sustained write)
Filesystem: ext4 default mkfs

Test 1: aligned 8K DIO writes spanning written+unwritten extent
boundaries. Each thread writes its own 1G region sequentially; the
file is rebuilt between runs so every block is written exactly once.
Metric: IOPS.

  JOBS         base    +patch 3    +patch 3+4    speedup
  ----    ---------    --------    ----------    -------
     1       42,322      43,329        43,087      1.02x
     2       68,516      70,677        66,958      0.98x
     4       62,489      97,072       101,468      1.62x
     8       58,701     110,819       113,679      1.94x
    16       58,569     116,392       115,272      1.97x
    32       60,860     117,244       119,621      1.97x

Wall time at JOBS=32: 69.2s (base) -> 35.4s (patched), 1.96x faster.

Test 2: unaligned DIO writes (14336 bytes at +512 within each 16K
stripe). Each stripe is laid out as [written][unwritten][unwritten]
[written], so the head and tail partial blocks land on written
extents but the middle is unwritten. Metric: IOPS.
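
The stripe geometry can be checked with a few lines of arithmetic. The
sketch below assumes a 4096-byte block size (four blocks per 16K
stripe); covered_bytes() is a hypothetical helper written for this
example only.

```c
#include <assert.h>

/*
 * Bytes of block @blk (of size @blksize) that are covered by the
 * write [pos, pos + len). Returns 0 if the write misses the block.
 */
static unsigned long long covered_bytes(unsigned long long pos,
					unsigned long long len,
					unsigned long long blk,
					unsigned long long blksize)
{
	unsigned long long start = blk * blksize;
	unsigned long long end = start + blksize;
	unsigned long long wend = pos + len;
	unsigned long long lo = pos > start ? pos : start;
	unsigned long long hi = wend < end ? wend : end;

	return hi > lo ? hi - lo : 0;
}
```

For pos = 512 and len = 14336 this gives 3584, 4096, 4096 and 2560
bytes for blocks 0 to 3: the head and tail blocks are partially
covered and the two middle blocks are fully covered.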

  JOBS         base    +patch 3    +patch 3+4    speedup
  ----    ---------    --------    ----------    -------
     1       15,547      15,975        17,381      1.12x
     2       15,910      14,808        34,172      2.15x
     4       15,014      14,828        57,567      3.83x
     8       15,022      14,648        81,947      5.46x
    16       14,586      14,262        99,126      6.80x
    32       14,047      13,809        92,519      6.59x

Wall time at JOBS=32: 149.3s (base) -> 22.7s (patched), 6.58x faster.

In test 2, patch 3 alone has no effect (slight noise) because patch 3
only touches the aligned write path. Patch 4 introduces
ext4_dio_needs_zeroing() which precisely identifies when partial
block zeroing is required, allowing the shared lock for the much
larger set of unaligned writes that don't actually trigger zeroing.

Comments and questions are, as always, welcome.

Thanks,
Baokun

[1]: https://patch.msgid.link/20260607124935.6168-1-peng_wang@linux.alibaba.com

Baokun Li (8):
  ext4: prevent sleeping allocation in NOWAIT write path
  ext4: drain in-flight DIO before buffered write fallback
  ext4: skip overwrite check for aligned non-extending DIO writes
  ext4: base unaligned DIO lock decision on partial block zeroing
  ext4: use kiocb_modified instead of file_modified in DIO/DAX write
    path
  ext4: return -EAGAIN from ext4_map_blocks() in NOWAIT cache miss
  ext4: handle IOMAP_NOWAIT in ext4_iomap_begin() with cache-only lookup
  ext4: handle IOCB_NOWAIT in ext4_dio_needs_zeroing() with cache-only
    lookup

 fs/ext4/file.c  | 148 +++++++++++++++++++++++++++++++++---------------
 fs/ext4/inode.c |  19 +++++--
 2 files changed, 118 insertions(+), 49 deletions(-)

-- 
2.43.7



Thread overview: 16+ messages
2026-06-18 12:57 Baokun Li [this message]
2026-06-18 12:57 ` [PATCH v2 1/8] ext4: prevent sleeping allocation in NOWAIT write path Baokun Li
2026-06-18 13:52   ` Jan Kara
2026-06-18 12:57 ` [PATCH v2 2/8] ext4: drain in-flight DIO before buffered write fallback Baokun Li
2026-06-18 13:54   ` Jan Kara
2026-06-18 12:57 ` [PATCH v2 3/8] ext4: skip overwrite check for aligned non-extending DIO writes Baokun Li
2026-06-18 12:57 ` [PATCH v2 4/8] ext4: base unaligned DIO lock decision on partial block zeroing Baokun Li
2026-06-18 12:57 ` [PATCH v2 5/8] ext4: use kiocb_modified instead of file_modified in DIO/DAX write path Baokun Li
2026-06-18 13:56   ` Jan Kara
2026-06-18 12:57 ` [PATCH v2 6/8] ext4: return -EAGAIN from ext4_map_blocks() in NOWAIT cache miss Baokun Li
2026-06-18 14:09   ` Jan Kara
2026-06-18 15:51     ` Baokun Li
2026-06-18 12:57 ` [PATCH v2 7/8] ext4: handle IOMAP_NOWAIT in ext4_iomap_begin() with cache-only lookup Baokun Li
2026-06-18 14:09   ` Jan Kara
2026-06-18 12:57 ` [PATCH v2 8/8] ext4: handle IOCB_NOWAIT in ext4_dio_needs_zeroing() with cache-only lookup Baokun Li
2026-06-18 14:10   ` Jan Kara
