The Linux Kernel Mailing List
* [PATCH v4 00/23] ext4: use iomap for regular file's buffered I/O path
@ 2026-05-11  7:23 Zhang Yi
From: Zhang Yi @ 2026-05-11  7:23 UTC
  To: linux-ext4, linux-fsdevel
  Cc: linux-kernel, tytso, adilger.kernel, libaokun, jack, ojaswin,
	ritesh.list, djwong, hch, yi.zhang, yi.zhang, yizhang089,
	yangerkun, yukuai

From: Zhang Yi <yi.zhang@huawei.com>

Hi,

This version is a small revision of v3 with no design changes. It fixes
some issues pointed out by Jan and Sashiko, and adds numerous comments
to clarify functionality and key considerations. You can get commits
here:

 https://github.com/zhangyi089/linux/commits/ext4_buffered_iomap_v4/

Original Cover-letter:
===

This series adds iomap buffered I/O path support for regular files,
based on the latest upstream kernel. It implements the core iomap APIs
on ext4 and introduces the 'buffered_iomap' mount option to enable the
iomap buffered I/O path. It supports the default features, the default
mount options, and the bigalloc feature. However, it does not support
online defragmentation, inline data, fs-verity, fscrypt, non-extent
inodes, or data=journal mode; it falls back to the buffer_head I/O path
automatically if any of these features or options is in use.

The iomap buffered I/O path is not enabled by default because the
features listed above are not yet supported. Users can explicitly enable
or disable it via the 'buffered_iomap' and 'nobuffered_iomap' mount
options.

Key notes
=========

1. Lock ordering difference

   The lock ordering of folio lock and transaction start in the iomap
   path is the opposite of that in the buffer_head path.

2. data=ordered mode is not used

   Two main reasons:
   a) The lock ordering of folio lock and transaction start for
      data=ordered mode is opposite to the iomap path, which would cause
      a deadlock.
   b) The iomap writeback path does not support partial folio submission
      (required by data=ordered mode when block size < folio size, and
      it is currently handled by ext4_bio_write_folio()), which would
      also cause a deadlock.

   To replace data=ordered mode functionality:

   - For append write: Always allocate unwritten extents (dioread_nolock
     behavior) to prevent stale data exposure.

   - For post-EOF partial block zeroing: Issue the zeroing I/O
     immediately, then wait for its completion (asynchronously or
     synchronously) before updating i_disksize. On ordered I/O
     completion, set i_disksize to i_size to avoid lost updates in the
     truncate up and append fallocate cases. (Suggested by Jan.)

   - For online defragmentation: Not supported yet, needs further
     consideration.

3. Always enable dioread_nolock

   Two main reasons:
   a) Since data=ordered mode cannot be used, allocating written blocks
      directly would expose stale data.
   b) To optimize writeback, we should allocate blocks based on writeback
      length rather than per-folio mapping. Direct written allocation
      would over-allocate blocks.

   dioread_nolock has been a default mount option for many years, and
   Jan pointed out that we may no longer need to disable it, so this
   mount option can be gradually removed in the future.
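
Key notes 1 and 2a can be illustrated with a small userspace toy model.
This is only a sketch: the names "folio_lock" and "transaction_start"
are plain labels, not kernel APIs, and which path takes which one first
is an assumption made for illustration. The point is only that two paths
acquiring the same pair in opposite orders form a cycle in the
acquisition-order graph, which is the classic ABBA deadlock condition.

```python
from collections import defaultdict

# Assumed acquisition orders, for illustration only.
BUFFER_HEAD_PATH = ["transaction_start", "folio_lock"]
IOMAP_PATH = ["folio_lock", "transaction_start"]


def build_order_graph(paths):
    """Record an edge a -> b whenever some path acquires a before b."""
    graph = defaultdict(set)
    for acquisitions in paths.values():
        for i, first in enumerate(acquisitions):
            for later in acquisitions[i + 1:]:
                graph[first].add(later)
    return graph


def has_cycle(graph):
    """Depth-first search for a cycle in the acquisition-order graph."""
    WHITE, GREY, BLACK = 0, 1, 2
    state = defaultdict(int)  # every node starts WHITE

    def visit(node):
        state[node] = GREY
        for nxt in graph.get(node, ()):
            if state[nxt] == GREY:
                return True  # back edge: inconsistent ordering
            if state[nxt] == WHITE and visit(nxt):
                return True
        state[node] = BLACK
        return False

    return any(state[n] == WHITE and visit(n) for n in list(graph))


def may_deadlock(paths):
    """True if the given paths use mutually inconsistent lock orders."""
    return has_cycle(build_order_graph(paths))
```

In this model, an inode that only ever uses one ordering has an acyclic
graph, while mixing both orderings on the same inode yields a cycle.
That is the reason, per note 2a, the data=ordered machinery cannot
simply be reused on the iomap path.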

Series structure
================

 - Patch 01-03: Simplify truncate operations and prepare for conversion.
 - Patch 04-16: Implement core iomap buffered read/write, writeback,
                mmap, and partial block zeroing paths.
 - Patch 17-21: Handle ordered I/O for zeroing post-EOF partial block.
 - Patch 22-23: Enable iomap buffered I/O path.

Testing and Performance
=======================

Tested with xfstests-bld using -g auto, fast_commit, and 64k
configurations. No new regressions were observed.

For the special case of zeroing a post-EOF partial block, I added a new
test, generic/790, to cover this scenario.

  https://lore.kernel.org/fstests/20260428085750.1072612-1-yi.zhang@huaweicloud.com/

Performance was tested with fio on a 150 GB memory-backed virtual machine
(not much difference compared to v2 and v3, so the numbers below are not
updated):

 Buffered write (MiB/s)
 ===

  bs       write cache    uncached write
           bh     iomap   bh      iomap
  1k       423    403     36.3    57
  4k       1067   1093    58.4    61
  64k      4321   6488    869     1206
  1M       4640   7378    3158    4818

 Buffered read (MiB/s)
 ===

  bs       read hole        read pre-cache     read ondisk data
           bh     iomap     bh     iomap       bh      iomap
  1k       635    643       661    653         605     602
  4k       1987   2075      2128   2159        1761    1716
  64k      6068   6267      9472   9545        4475    4451
  1M       5471   6072      8657   9191        4405    4467

Large I/O (64k and 1M) write performance improved by approximately 40%
to 60%.
Read performance showed no significant difference.
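
For readers who want to check the write numbers, the relative change
for the 64k and 1M rows can be recomputed from the buffered write table
above. The snippet below is just a convenience sketch; the dictionary
layout and function names are my own, and the values are copied from the
table.

```python
# Buffered write throughput in MiB/s, copied from the table above:
# (block size, workload) -> (buffer_head, iomap)
WRITE_MIBPS = {
    ("64k", "write cache"): (4321, 6488),
    ("64k", "uncached write"): (869, 1206),
    ("1M", "write cache"): (4640, 7378),
    ("1M", "uncached write"): (3158, 4818),
}


def gain_percent(bh, iomap):
    """Relative throughput change of iomap over buffer_head, in percent."""
    return (iomap - bh) / bh * 100.0


def write_gains(table=WRITE_MIBPS):
    """Map each (block size, workload) pair to its relative gain."""
    return {key: gain_percent(bh, iomap) for key, (bh, iomap) in table.items()}


if __name__ == "__main__":
    for (bs, workload), gain in write_gains().items():
        print(f"{bs:>4} {workload:<15} {gain:+.1f}%")
```

The same helper can be pointed at the 1k and 4k rows or at the read
table to see that the differences there are much smaller.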

Changes since v3:
 - Rebased on the latest upstream kernel.
 - Improve commit messages for patches 07-23 to clarify functionality
   and key considerations.
 - Move the patches that enable iomap to the end of this series.
 - Patch 02: Move ext4_set_inode_size() declarations from ext4.h into
   inode.c, move truncate_pagecache() and ext4_truncate() to
   ext4_truncate_down() as Jan suggested.
 - Patch 08: Add a check for non-extent inodes in the non-delalloc write
   path, and clarify the reason why we don't need to truncate blocks on
   short writes. (Pointed out by Sashiko)
 - Patch 09: Fix the issue where DATA_ERR_ABORT fails to work in
   overwrite scenarios. Replace iomap_finish_ioends() with
   iomap_finish_ioend() during end_io to prevent might_sleep() being
   called in interrupt context. (Pointed out by Sashiko)
 - Patch 11: Fix underflow of the nr_blks variable. (Pointed out by
   Sashiko)
 - Patch 17: Factor out ext4_iomap_submit_zero_block() helper to handle
   ordered mode after zeroing a post-EOF partial block in the iomap
   path; also add comments.
 - Patch 18: Fix off-by-one in ext4_iomap_wb_ordered_wait() and clarify
   why a single i_ordered_len tracker suffices. (Pointed out by Sashiko)
 - Patch 19: Fix an issue where the correct file size may be lost due to
   a missing memory barrier. (Pointed out by Sashiko)
 - Patch 20: Change the logic for waiting on ordered I/Os in the insert
   range and collapse range from asynchronous to synchronous.
 - Patch 21: Allow per-inode journal mode changes but disallow per-inode
   extent type changes, and add comments on the restrictions on using
   iomap.

Changes since v2:
 - Rebased on the latest upstream kernel (7.1-rc1).
 - Added patches 01-03 to simplify truncate operations.
 - Added patch 13 to fix incorrect did_zero parameter in
   iomap_zero_range().
 - Added patches 19-22 to handle ordered I/O for zeroing post-EOF
   partial block.
 - Minor code and comment optimizations.

Changes since v1:
 - Rebase this series on linux-next 20260122.
 - Refactor partial block zero range, stop passing handle to
   ext4_block_truncate_page() and ext4_zero_partial_blocks(), and move
   partial block zeroing operation outside an active journal transaction
   to prevent potential deadlocks because of the lock ordering of folio
   and transaction start.
 - Clarify the lock ordering of folio lock and transaction start, update
   the comments accordingly.
 - Fix some issues related to fast commit and to polluting the post-EOF
   folio.
 - Some minor code and comment optimizations.

v3:     https://lore.kernel.org/linux-ext4/20260422021042.4157510-1-yi.zhang@huaweicloud.com/
v2:     https://lore.kernel.org/linux-ext4/20260203062523.3869120-1-yi.zhang@huawei.com/
v1:     https://lore.kernel.org/linux-ext4/20241022111059.2566137-1-yi.zhang@huaweicloud.com/
RFC v4: https://lore.kernel.org/linux-ext4/20240410142948.2817554-1-yi.zhang@huaweicloud.com/
RFC v3: https://lore.kernel.org/linux-ext4/20240127015825.1608160-1-yi.zhang@huaweicloud.com/
RFC v2: https://lore.kernel.org/linux-ext4/20240102123918.799062-1-yi.zhang@huaweicloud.com/
RFC v1: https://lore.kernel.org/linux-ext4/20231123125121.4064694-1-yi.zhang@huaweicloud.com/

Comments and suggestions are welcome!

Thanks,
Yi.

Zhang Yi (23):
  ext4: simplify size updating in ext4_setattr()
  ext4: factor out ext4_truncate_[up|down]()
  ext4: simplify error handling in ext4_setattr()
  ext4: add iomap address space operations for buffered I/O
  ext4: implement buffered read path using iomap
  ext4: pass out extent seq counter when mapping da blocks
  ext4: do not use data=ordered mode for inodes using buffered iomap
    path
  ext4: implement buffered write path using iomap
  ext4: implement writeback path using iomap
  ext4: implement mmap path using iomap
  iomap: correct the range of a partial dirty clear
  iomap: support invalidating partial folios
  iomap: fix incorrect did_zero setting in iomap_zero_iter()
  ext4: implement partial block zero range path using iomap
  ext4: add block mapping tracepoints for iomap buffered I/O path
  ext4: disable online defrag when inode using iomap buffered I/O path
  ext4: submit zeroed post-EOF data immediately in the iomap buffered
    I/O path
  ext4: wait for ordered I/O in the iomap buffered I/O path
  ext4: update i_disksize to i_size on ordered I/O completion
  ext4: wait for ordered I/O to complete during insert and collapse
    range
  ext4: add tracepoints for ordered I/O in the iomap buffered I/O path
  ext4: partially enable iomap for the buffered I/O path of regular
    files
  ext4: introduce a mount option for iomap buffered I/O path

 fs/ext4/ext4.h              |   57 +-
 fs/ext4/ext4_jbd2.c         |    8 +-
 fs/ext4/ext4_jbd2.h         |    7 +-
 fs/ext4/extents.c           |   18 +
 fs/ext4/file.c              |   20 +-
 fs/ext4/ialloc.c            |    1 +
 fs/ext4/inode.c             | 1040 ++++++++++++++++++++++++++++++-----
 fs/ext4/migrate.c           |    2 +
 fs/ext4/move_extent.c       |   11 +
 fs/ext4/page-io.c           |  210 +++++++
 fs/ext4/super.c             |   55 +-
 fs/iomap/buffered-io.c      |   22 +-
 fs/iomap/ioend.c            |    3 +-
 include/linux/iomap.h       |    1 +
 include/trace/events/ext4.h |  142 +++++
 15 files changed, 1446 insertions(+), 151 deletions(-)

-- 
2.52.0

