* Re: [PATCH 00/27] ext4: use iomap for regular file's buffered I/O path and enable large folio
2024-10-22 11:10 [PATCH 00/27] ext4: use iomap for regular file's buffered I/O path and enable large folio Zhang Yi
@ 2024-10-22 6:59 ` Sedat Dilek
2024-10-22 9:22 ` Zhang Yi
2024-10-22 11:10 ` [PATCH 01/27] ext4: remove writable userspace mappings before truncating page cache Zhang Yi
` (26 subsequent siblings)
27 siblings, 1 reply; 59+ messages in thread
From: Sedat Dilek @ 2024-10-22 6:59 UTC (permalink / raw)
To: Zhang Yi
Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
jack, ritesh.list, hch, djwong, david, zokeefe, yi.zhang,
chengzhihao1, yukuai3, yangerkun
On Tue, Oct 22, 2024 at 5:13 AM Zhang Yi <yi.zhang@huaweicloud.com> wrote:
> [...]
> About performance:
>
> Fio tests with psync on my machine with Intel Xeon Gold 6240 CPU with
> 400GB system ram, 200GB ramdisk and 4TB nvme ssd disk.
>
> fio -directory=/mnt -direct=0 -iodepth=$iodepth -fsync=$sync -rw=$rw \
> -numjobs=${numjobs} -bs=${bs} -ioengine=psync -size=$size \
> -runtime=60 -norandommap=0 -fallocate=none -overwrite=$overwrite \
> -group_reporting -name=$name --output=/tmp/test_log
>
Hi Zhang Yi,
can you clarify the FIO values used for the various parameters?
Thanks.
BR,
-Sedat-
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 00/27] ext4: use iomap for regular file's buffered I/O path and enable large folio
2024-10-22 6:59 ` Sedat Dilek
@ 2024-10-22 9:22 ` Zhang Yi
2024-10-23 12:13 ` Sedat Dilek
0 siblings, 1 reply; 59+ messages in thread
From: Zhang Yi @ 2024-10-22 9:22 UTC (permalink / raw)
To: sedat.dilek
Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
jack, ritesh.list, hch, djwong, david, zokeefe, yi.zhang,
chengzhihao1, yukuai3, yangerkun
On 2024/10/22 14:59, Sedat Dilek wrote:
> On Tue, Oct 22, 2024 at 5:13 AM Zhang Yi <yi.zhang@huaweicloud.com> wrote:
>> [...]
>> About performance:
>>
>> Fio tests with psync on my machine with Intel Xeon Gold 6240 CPU with
>> 400GB system ram, 200GB ramdisk and 4TB nvme ssd disk.
>>
>> fio -directory=/mnt -direct=0 -iodepth=$iodepth -fsync=$sync -rw=$rw \
>> -numjobs=${numjobs} -bs=${bs} -ioengine=psync -size=$size \
>> -runtime=60 -norandommap=0 -fallocate=none -overwrite=$overwrite \
>> -group_reporting -name=$name --output=/tmp/test_log
>>
>
> Hi Zhang Yi,
>
> can you clarify the FIO values used for the various parameters?
>
Hi Sedat,
Sure, the test I present here is a simple single-thread, single-I/O-depth
case with the psync ioengine. Most of the FIO parameters are shown in the
tables of the cover letter.
For the rest, 'iodepth' and 'numjobs' are always set to 1 and the
'size' is 40GB. During the write cache tests, I also disable the
writeback process through:
echo 0 > /proc/sys/vm/dirty_writeback_centisecs
echo 100 > /proc/sys/vm/dirty_background_ratio
echo 100 > /proc/sys/vm/dirty_ratio
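Putting these values together, one concrete invocation for a 4K cache-write
case would look something like the sketch below. Only iodepth, numjobs and
size are stated above; the -rw, -fsync, -overwrite and -name values are
illustrative assumptions for the "cache N N N 4K" row of the table.

```shell
# Hypothetical expansion of the cover letter's fio command for one case:
# single job, single I/O depth, 40GB file, 4K buffered writes, no fsync.
fio -directory=/mnt -direct=0 -iodepth=1 -fsync=0 -rw=write \
    -numjobs=1 -bs=4k -ioengine=psync -size=40G \
    -runtime=60 -norandommap=0 -fallocate=none -overwrite=0 \
    -group_reporting -name=cache_write_4k --output=/tmp/test_log
```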
Thanks,
Yi.
* [PATCH 00/27] ext4: use iomap for regular file's buffered I/O path and enable large folio
@ 2024-10-22 11:10 Zhang Yi
2024-10-22 6:59 ` Sedat Dilek
` (27 more replies)
0 siblings, 28 replies; 59+ messages in thread
From: Zhang Yi @ 2024-10-22 11:10 UTC (permalink / raw)
To: linux-ext4
Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack,
ritesh.list, hch, djwong, david, zokeefe, yi.zhang, yi.zhang,
chengzhihao1, yukuai3, yangerkun
From: Zhang Yi <yi.zhang@huawei.com>
Hello!
This patch series is the latest version based on my previous RFC
series[1], which converts the buffered I/O path of ext4 regular files to
iomap and enables large folios. After several months of work, almost all
preparatory changes have been upstreamed, thanks a lot for the review
and comments from Jan, Dave, Christoph, Darrick and Ritesh. Now it is
time for the main implementation of this conversion.
This series is the main part of the iomap buffered I/O conversion. It is
based on 6.12-rc4, and its code context also depends on another cleanup
series of mine[1] (I've included that series here so we can merge it
directly). It fixes all the minor bugs found in my previous RFC v4
series. Additionally, I've updated the change logs in each patch and
included some code modifications following Dave's suggestions. This
series implements the core iomap APIs in ext4 and introduces a mount
option called "buffered_iomap" to enable the iomap buffered I/O path. We
already support the default features, the default mount options and the
bigalloc feature. However, we do not yet support online defragmentation,
inline data, fsverity, fscrypt, ext3, or data=journal mode; ext4 will
fall back to the buffer_head I/O path automatically if you use those
features or options. Some of these features should be supported
gradually in the near future.
Most of the implementation resembles the original buffer_head path;
however, there are a few key differences.
1. The first difference is block allocation in the writeback path. The
iomap framework will invoke ->map_blocks() at least once for each dirty
folio. To ensure optimal writeback performance, we aim to allocate a
range of delalloc blocks that is as long as possible within the
writeback length on each invocation. In certain situations, we may
allocate a range of blocks that exceeds the amount we will actually
write back. Therefore,
1) we cannot allocate a written extent for those blocks because it may
expose stale data in such short write cases. Instead, we should
allocate an unwritten extent, which means we must always enable the
dioread_nolock option. This change could also bring many other
benefits.
2) We should postpone updating the 'i_disksize' until the end of the I/O
process, based on the actual written length. This approach can also
prevent the exposure of zero data, which may occur if there is a
power failure during an append write.
3) We do not need to pre-split extents during writeback; we can
postpone this task to the end-of-I/O process, when converting
unwritten extents.
2. The second difference is that, since we always allocate unwritten
space for new blocks, there is no risk of exposing stale data. As a
result, we do not need to order the data, which allows us to disable
the data=ordered mode. Consequently, we also do not require a reserved
handle when converting unwritten extents in the final I/O worker;
we can directly start with a normal handle.
Series details:
Patches 1-10 are just another series of mine that refactors the fallocate
functions[1]. This series relies on their code context but has no
logical dependencies on them. I put them here just for easy access and
merging.
Patches 11-21 implement the iomap buffered read/write paths, the dirty
folio writeback path and the mmap path for ext4 regular files.
Patches 22-23 disable the unsupported online defragmentation function and
disable changing the inode journal flag to data=journal mode.
Please look at those patches for details.
Patches 24-27 introduce the "buffered_iomap" mount option (not enabled by
default for now) to partially enable the iomap buffered I/O path and
also enable large folios.
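As a usage sketch (assuming this series is applied; the device path is
illustrative), the new path would be opted into at mount time:

```shell
# Enable the iomap buffered I/O path via the mount option introduced
# by patches 24-27 (device path is illustrative).
mount -o buffered_iomap /dev/vdb /mnt
# Inodes using unsupported features (inline data, fscrypt, ...) still
# fall back to the buffer_head path automatically.
```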
About performance:
Fio tests with psync on my machine: an Intel Xeon Gold 6240 CPU, 400GB
of system RAM, a 200GB ramdisk and a 4TB NVMe SSD.
fio -directory=/mnt -direct=0 -iodepth=$iodepth -fsync=$sync -rw=$rw \
-numjobs=${numjobs} -bs=${bs} -ioengine=psync -size=$size \
-runtime=60 -norandommap=0 -fallocate=none -overwrite=$overwrite \
-group_reporting -name=$name --output=/tmp/test_log
== buffer read ==
buffer_head iomap + large folio
type bs IOPS BW(MiB/s) IOPS BW(MiB/s)
-------------------------------------------------------
hole 4K 576k 2253 762k 2975 +32%
hole 64K 48.7k 3043 77.8k 4860 +60%
hole 1M 2960 2960 4942 4942 +67%
ramdisk 4K 443k 1732 530k 2069 +19%
ramdisk 64K 34.5k 2156 45.6k 2850 +32%
ramdisk 1M 2093 2093 2841 2841 +36%
nvme 4K 339k 1323 364k 1425 +8%
nvme 64K 23.6k 1471 25.2k 1574 +7%
nvme 1M 2012 2012 2153 2153 +7%
== buffer write ==
buffer_head iomap + large folio
type Overwrite Sync Writeback bs IOPS BW(MiB/s) IOPS BW(MiB/s)
----------------------------------------------------------------------
cache N N N 4K 417k 1631 440k 1719 +5%
cache N N N 64K 33.4k 2088 81.5k 5092 +144%
cache N N N 1M 2143 2143 5716 5716 +167%
cache Y N N 4K 449k 1755 469k 1834 +5%
cache Y N N 64K 36.6k 2290 82.3k 5142 +125%
cache Y N N 1M 2352 2352 5577 5577 +137%
ramdisk N N Y 4K 365k 1424 354k 1384 -3%
ramdisk N N Y 64K 31.2k 1950 74.2k 4640 +138%
ramdisk N N Y 1M 1968 1968 5201 5201 +164%
ramdisk N Y N 4K 9984 39 12.9k 51 +29%
ramdisk N Y N 64K 5936 371 8960 560 +51%
ramdisk N Y N 1M 1050 1050 1835 1835 +75%
ramdisk Y N Y 4K 411k 1609 443k 1731 +8%
ramdisk Y N Y 64K 34.1k 2134 77.5k 4844 +127%
ramdisk Y N Y 1M 2248 2248 5372 5372 +139%
ramdisk Y Y N 4K 182k 711 186k 730 +3%
ramdisk Y Y N 64K 18.7k 1170 34.7k 2171 +86%
ramdisk Y Y N 1M 1229 1229 2269 2269 +85%
nvme N N Y 4K 373k 1458 387k 1512 +4%
nvme N N Y 64K 29.2k 1827 70.9k 4431 +143%
nvme N N Y 1M 1835 1835 4919 4919 +168%
nvme N Y N 4K 11.7k 46 11.7k 46 0%
nvme N Y N 64K 6453 403 8661 541 +34%
nvme N Y N 1M 649 649 1351 1351 +108%
nvme Y N Y 4K 372k 1456 433k 1693 +16%
nvme Y N Y 64K 33.0k 2064 74.7k 4669 +126%
nvme Y N Y 1M 2131 2131 5273 5273 +147%
nvme Y Y N 4K 56.7k 222 56.4k 220 -1%
nvme Y Y N 64K 13.4k 840 19.4k 1214 +45%
nvme Y Y N 1M 714 714 1504 1504 +111%
Thanks,
Yi.
Major changes since RFC v4:
- Disable unsupported online defragmentation, do not fall back to
buffer_head path.
- Write and wait data back when doing a partial block truncate down to
fix a stale data problem.
- Disable the online changing of the inode journal flag to data=journal
mode.
- Since iomap can zero out dirty pages with unwritten extent, do not
write data before zeroing out in ext4_zero_range(), and also do not
zero partial blocks under a started journal handle.
[1] https://lore.kernel.org/linux-ext4/20241010133333.146793-1-yi.zhang@huawei.com/
---
RFC v4: https://lore.kernel.org/linux-ext4/20240410142948.2817554-1-yi.zhang@huaweicloud.com/
RFC v3: https://lore.kernel.org/linux-ext4/20240127015825.1608160-1-yi.zhang@huaweicloud.com/
RFC v2: https://lore.kernel.org/linux-ext4/20240102123918.799062-1-yi.zhang@huaweicloud.com/
RFC v1: https://lore.kernel.org/linux-ext4/20231123125121.4064694-1-yi.zhang@huaweicloud.com/
Zhang Yi (27):
ext4: remove writable userspace mappings before truncating page cache
ext4: don't explicit update times in ext4_fallocate()
ext4: don't write back data before punch hole in nojournal mode
ext4: refactor ext4_punch_hole()
ext4: refactor ext4_zero_range()
ext4: refactor ext4_collapse_range()
ext4: refactor ext4_insert_range()
ext4: factor out ext4_do_fallocate()
ext4: move out inode_lock into ext4_fallocate()
ext4: move out common parts into ext4_fallocate()
ext4: use reserved metadata blocks when splitting extent on endio
ext4: introduce seq counter for the extent status entry
ext4: add a new iomap aops for regular file's buffered IO path
ext4: implement buffered read iomap path
ext4: implement buffered write iomap path
ext4: don't order data for inode with EXT4_STATE_BUFFERED_IOMAP
ext4: implement writeback iomap path
ext4: implement mmap iomap path
ext4: do not always order data when partial zeroing out a block
ext4: do not start handle if unnecessary while partial zeroing out a
block
ext4: implement zero_range iomap path
ext4: disable online defrag when inode using iomap buffered I/O path
ext4: disable inode journal mode when using iomap buffered I/O path
ext4: partially enable iomap for the buffered I/O path of regular
files
ext4: enable large folio for regular file with iomap buffered I/O path
ext4: change mount options code style
ext4: introduce a mount option for iomap buffered I/O path
fs/ext4/ext4.h | 17 +-
fs/ext4/ext4_jbd2.c | 3 +-
fs/ext4/ext4_jbd2.h | 8 +
fs/ext4/extents.c | 568 +++++++++++----------------
fs/ext4/extents_status.c | 13 +-
fs/ext4/file.c | 19 +-
fs/ext4/ialloc.c | 5 +
fs/ext4/inode.c | 755 ++++++++++++++++++++++++++++++------
fs/ext4/move_extent.c | 7 +
fs/ext4/page-io.c | 105 +++++
fs/ext4/super.c | 185 ++++-----
include/trace/events/ext4.h | 57 +--
12 files changed, 1153 insertions(+), 589 deletions(-)
--
2.46.1
* [PATCH 01/27] ext4: remove writable userspace mappings before truncating page cache
2024-10-22 11:10 [PATCH 00/27] ext4: use iomap for regular file's buffered I/O path and enable large folio Zhang Yi
2024-10-22 6:59 ` Sedat Dilek
@ 2024-10-22 11:10 ` Zhang Yi
2024-12-04 11:13 ` Jan Kara
2024-10-22 11:10 ` [PATCH 02/27] ext4: don't explicit update times in ext4_fallocate() Zhang Yi
` (25 subsequent siblings)
27 siblings, 1 reply; 59+ messages in thread
From: Zhang Yi @ 2024-10-22 11:10 UTC (permalink / raw)
To: linux-ext4
Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack,
ritesh.list, hch, djwong, david, zokeefe, yi.zhang, yi.zhang,
chengzhihao1, yukuai3, yangerkun
From: Zhang Yi <yi.zhang@huawei.com>
When zeroing a range of folios on a filesystem whose block size is
less than the page size, the file's mapped partial blocks within one
page will be marked as unwritten. We should remove writable userspace
mappings to ensure that ext4_page_mkwrite() can be called during
subsequent write access to these folios. Otherwise, data written by
subsequent mmap writes may not be saved to disk.
$mkfs.ext4 -b 1024 /dev/vdb
$mount /dev/vdb /mnt
$xfs_io -t -f -c "pwrite -S 0x58 0 4096" -c "mmap -rw 0 4096" \
-c "mwrite -S 0x5a 2048 2048" -c "fzero 2048 2048" \
-c "mwrite -S 0x59 2048 2048" -c "close" /mnt/foo
$od -Ax -t x1z /mnt/foo
000000 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58
*
000800 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59
*
001000
$umount /mnt && mount /dev/vdb /mnt
$od -Ax -t x1z /mnt/foo
000000 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58
*
000800 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
001000
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/ext4/ext4.h | 2 ++
fs/ext4/extents.c | 1 +
fs/ext4/inode.c | 41 +++++++++++++++++++++++++++++++++++++++++
3 files changed, 44 insertions(+)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 44b0d418143c..6d0267afd4c1 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -3020,6 +3020,8 @@ extern int ext4_inode_attach_jinode(struct inode *inode);
extern int ext4_can_truncate(struct inode *inode);
extern int ext4_truncate(struct inode *);
extern int ext4_break_layouts(struct inode *);
+extern void ext4_truncate_folios_range(struct inode *inode, loff_t start,
+ loff_t end);
extern int ext4_punch_hole(struct file *file, loff_t offset, loff_t length);
extern void ext4_set_inode_flags(struct inode *, bool init);
extern int ext4_alloc_da_blocks(struct inode *inode);
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 34e25eee6521..2a054c3689f0 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4677,6 +4677,7 @@ static long ext4_zero_range(struct file *file, loff_t offset,
}
/* Now release the pages and zero block aligned part of pages */
+ ext4_truncate_folios_range(inode, start, end);
truncate_pagecache_range(inode, start, end - 1);
inode_set_mtime_to_ts(inode, inode_set_ctime_current(inode));
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 54bdd4884fe6..8b34e79112d5 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -31,6 +31,7 @@
#include <linux/writeback.h>
#include <linux/pagevec.h>
#include <linux/mpage.h>
+#include <linux/rmap.h>
#include <linux/namei.h>
#include <linux/uio.h>
#include <linux/bio.h>
@@ -3870,6 +3871,46 @@ int ext4_update_disksize_before_punch(struct inode *inode, loff_t offset,
return ret;
}
+static inline void ext4_truncate_folio(struct inode *inode,
+ loff_t start, loff_t end)
+{
+ unsigned long blocksize = i_blocksize(inode);
+ struct folio *folio;
+
+ if (round_up(start, blocksize) >= round_down(end, blocksize))
+ return;
+
+ folio = filemap_lock_folio(inode->i_mapping, start >> PAGE_SHIFT);
+ if (IS_ERR(folio))
+ return;
+
+ if (folio_mkclean(folio))
+ folio_mark_dirty(folio);
+ folio_unlock(folio);
+ folio_put(folio);
+}
+
+/*
+ * When truncating a range of folios, if the block size is less than the
+ * page size, the file's mapped partial blocks within one page could be
+ * freed or converted to unwritten. We should call this function to remove
+ * writable userspace mappings so that ext4_page_mkwrite() can be called
+ * during subsequent write access to these folios.
+ */
+void ext4_truncate_folios_range(struct inode *inode, loff_t start, loff_t end)
+{
+ unsigned long blocksize = i_blocksize(inode);
+
+ if (end > inode->i_size)
+ end = inode->i_size;
+ if (start >= end || blocksize >= PAGE_SIZE)
+ return;
+
+ ext4_truncate_folio(inode, start, min(round_up(start, PAGE_SIZE), end));
+ if (end > round_up(start, PAGE_SIZE))
+ ext4_truncate_folio(inode, round_down(end, PAGE_SIZE), end);
+}
+
static void ext4_wait_dax_page(struct inode *inode)
{
filemap_invalidate_unlock(inode->i_mapping);
--
2.46.1
^ permalink raw reply related [flat|nested] 59+ messages in thread
* [PATCH 02/27] ext4: don't explicit update times in ext4_fallocate()
2024-10-22 11:10 [PATCH 00/27] ext4: use iomap for regular file's buffered I/O path and enable large folio Zhang Yi
2024-10-22 6:59 ` Sedat Dilek
2024-10-22 11:10 ` [PATCH 01/27] ext4: remove writable userspace mappings before truncating page cache Zhang Yi
@ 2024-10-22 11:10 ` Zhang Yi
2024-10-22 11:10 ` [PATCH 03/27] ext4: don't write back data before punch hole in nojournal mode Zhang Yi
` (24 subsequent siblings)
27 siblings, 0 replies; 59+ messages in thread
From: Zhang Yi @ 2024-10-22 11:10 UTC (permalink / raw)
To: linux-ext4
Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack,
ritesh.list, hch, djwong, david, zokeefe, yi.zhang, yi.zhang,
chengzhihao1, yukuai3, yangerkun
From: Zhang Yi <yi.zhang@huawei.com>
Since commit ad5cd4f4ee4d ("ext4: fix fallocate to use file_modified to
update permissions consistently"), we can update mtime and ctime
appropriately through file_modified() when doing zero range, collapse
range, insert range and punch hole, so there is no need to explicitly
update the times in those paths; just drop the updates.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
fs/ext4/extents.c | 4 ----
fs/ext4/inode.c | 1 -
2 files changed, 5 deletions(-)
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 2a054c3689f0..aa07b5ddaff8 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4679,7 +4679,6 @@ static long ext4_zero_range(struct file *file, loff_t offset,
/* Now release the pages and zero block aligned part of pages */
ext4_truncate_folios_range(inode, start, end);
truncate_pagecache_range(inode, start, end - 1);
- inode_set_mtime_to_ts(inode, inode_set_ctime_current(inode));
ret = ext4_alloc_file_blocks(file, lblk, max_blocks, new_size,
flags);
@@ -4704,7 +4703,6 @@ static long ext4_zero_range(struct file *file, loff_t offset,
goto out_mutex;
}
- inode_set_mtime_to_ts(inode, inode_set_ctime_current(inode));
if (new_size)
ext4_update_inode_size(inode, new_size);
ret = ext4_mark_inode_dirty(handle, inode);
@@ -5440,7 +5438,6 @@ static int ext4_collapse_range(struct file *file, loff_t offset, loff_t len)
up_write(&EXT4_I(inode)->i_data_sem);
if (IS_SYNC(inode))
ext4_handle_sync(handle);
- inode_set_mtime_to_ts(inode, inode_set_ctime_current(inode));
ret = ext4_mark_inode_dirty(handle, inode);
ext4_update_inode_fsync_trans(handle, inode, 1);
@@ -5550,7 +5547,6 @@ static int ext4_insert_range(struct file *file, loff_t offset, loff_t len)
/* Expand file to avoid data loss if there is error while shifting */
inode->i_size += len;
EXT4_I(inode)->i_disksize += len;
- inode_set_mtime_to_ts(inode, inode_set_ctime_current(inode));
ret = ext4_mark_inode_dirty(handle, inode);
if (ret)
goto out_stop;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 8b34e79112d5..f8796f7b0f94 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4085,7 +4085,6 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
if (IS_SYNC(inode))
ext4_handle_sync(handle);
- inode_set_mtime_to_ts(inode, inode_set_ctime_current(inode));
ret2 = ext4_mark_inode_dirty(handle, inode);
if (unlikely(ret2))
ret = ret2;
--
2.46.1
* [PATCH 03/27] ext4: don't write back data before punch hole in nojournal mode
2024-10-22 11:10 [PATCH 00/27] ext4: use iomap for regular file's buffered I/O path and enable large folio Zhang Yi
` (2 preceding siblings ...)
2024-10-22 11:10 ` [PATCH 02/27] ext4: don't explicit update times in ext4_fallocate() Zhang Yi
@ 2024-10-22 11:10 ` Zhang Yi
2024-11-18 23:15 ` Darrick J. Wong
2024-12-04 11:27 ` Jan Kara
2024-10-22 11:10 ` [PATCH 04/27] ext4: refactor ext4_punch_hole() Zhang Yi
` (23 subsequent siblings)
27 siblings, 2 replies; 59+ messages in thread
From: Zhang Yi @ 2024-10-22 11:10 UTC (permalink / raw)
To: linux-ext4
Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack,
ritesh.list, hch, djwong, david, zokeefe, yi.zhang, yi.zhang,
chengzhihao1, yukuai3, yangerkun
From: Zhang Yi <yi.zhang@huawei.com>
There is no need to write back all data before punching a hole in
data=ordered|writeback mode, since the data will be dropped soon after
removing the space; so just remove the filemap_write_and_wait_range()
call in these modes. However, in data=journal mode, we still need to
write dirty pages out before discarding the page cache, in case the
system crashes before the transaction freeing the data is committed,
which could expose old, stale data.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/ext4/inode.c | 26 +++++++++++++++-----------
1 file changed, 15 insertions(+), 11 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index f8796f7b0f94..94b923afcd9c 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3965,17 +3965,6 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
trace_ext4_punch_hole(inode, offset, length, 0);
- /*
- * Write out all dirty pages to avoid race conditions
- * Then release them.
- */
- if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
- ret = filemap_write_and_wait_range(mapping, offset,
- offset + length - 1);
- if (ret)
- return ret;
- }
-
inode_lock(inode);
/* No need to punch hole beyond i_size */
@@ -4037,6 +4026,21 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
ret = ext4_update_disksize_before_punch(inode, offset, length);
if (ret)
goto out_dio;
+
+ /*
+ * For journalled data we need to write (and checkpoint) pages
+ * before discarding page cache to avoid inconsistent data on
+ * disk in case of crash before punching trans is committed.
+ */
+ if (ext4_should_journal_data(inode)) {
+ ret = filemap_write_and_wait_range(mapping,
+ first_block_offset, last_block_offset);
+ if (ret)
+ goto out_dio;
+ }
+
+ ext4_truncate_folios_range(inode, first_block_offset,
+ last_block_offset + 1);
truncate_pagecache_range(inode, first_block_offset,
last_block_offset);
}
--
2.46.1
* [PATCH 04/27] ext4: refactor ext4_punch_hole()
2024-10-22 11:10 [PATCH 00/27] ext4: use iomap for regular file's buffered I/O path and enable large folio Zhang Yi
` (3 preceding siblings ...)
2024-10-22 11:10 ` [PATCH 03/27] ext4: don't write back data before punch hole in nojournal mode Zhang Yi
@ 2024-10-22 11:10 ` Zhang Yi
2024-11-18 23:27 ` Darrick J. Wong
2024-12-04 11:36 ` Jan Kara
2024-10-22 11:10 ` [PATCH 05/27] ext4: refactor ext4_zero_range() Zhang Yi
` (22 subsequent siblings)
27 siblings, 2 replies; 59+ messages in thread
From: Zhang Yi @ 2024-10-22 11:10 UTC (permalink / raw)
To: linux-ext4
Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack,
ritesh.list, hch, djwong, david, zokeefe, yi.zhang, yi.zhang,
chengzhihao1, yukuai3, yangerkun
From: Zhang Yi <yi.zhang@huawei.com>
The current implementation of ext4_punch_hole() contains complex
position calculations and stale error tags. To improve the code's
clarity and maintainability, clean it up by: a) simplifying and
renaming variables; b) eliminating unnecessary position calculations;
c) writing back all data in data=journal mode, and dropping the page
cache from the original offset to the end, rather than using aligned
blocks; and d) renaming the stale error tags.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/ext4/inode.c | 140 +++++++++++++++++++++---------------------------
1 file changed, 62 insertions(+), 78 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 94b923afcd9c..1d128333bd06 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3955,13 +3955,14 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
{
struct inode *inode = file_inode(file);
struct super_block *sb = inode->i_sb;
- ext4_lblk_t first_block, stop_block;
+ ext4_lblk_t start_lblk, end_lblk;
struct address_space *mapping = inode->i_mapping;
- loff_t first_block_offset, last_block_offset, max_length;
- struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
+ loff_t max_end = EXT4_SB(sb)->s_bitmap_maxbytes - sb->s_blocksize;
+ loff_t end = offset + length;
+ unsigned long blocksize = i_blocksize(inode);
handle_t *handle;
unsigned int credits;
- int ret = 0, ret2 = 0;
+ int ret = 0;
trace_ext4_punch_hole(inode, offset, length, 0);
@@ -3969,36 +3970,27 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
/* No need to punch hole beyond i_size */
if (offset >= inode->i_size)
- goto out_mutex;
+ goto out;
/*
- * If the hole extends beyond i_size, set the hole
- * to end after the page that contains i_size
+ * If the hole extends beyond i_size, set the hole to end after
+ * the page that contains i_size, and also make sure that the end
+ * of the hole stays within one block before the maximum byte limit.
*/
- if (offset + length > inode->i_size) {
- length = inode->i_size +
- PAGE_SIZE - (inode->i_size & (PAGE_SIZE - 1)) -
- offset;
- }
+ if (end > inode->i_size)
+ end = round_up(inode->i_size, PAGE_SIZE);
+ if (end > max_end)
+ end = max_end;
+ length = end - offset;
/*
- * For punch hole the length + offset needs to be within one block
- * before last range. Adjust the length if it goes beyond that limit.
+ * Attach jinode to inode for jbd2 if we do any zeroing of partial
+ * block.
*/
- max_length = sbi->s_bitmap_maxbytes - inode->i_sb->s_blocksize;
- if (offset + length > max_length)
- length = max_length - offset;
-
- if (offset & (sb->s_blocksize - 1) ||
- (offset + length) & (sb->s_blocksize - 1)) {
- /*
- * Attach jinode to inode for jbd2 if we do any zeroing of
- * partial block
- */
+ if (offset & (blocksize - 1) || end & (blocksize - 1)) {
ret = ext4_inode_attach_jinode(inode);
if (ret < 0)
- goto out_mutex;
-
+ goto out;
}
/* Wait all existing dio workers, newcomers will block on i_rwsem */
@@ -4006,7 +3998,7 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
ret = file_modified(file);
if (ret)
- goto out_mutex;
+ goto out;
/*
* Prevent page faults from reinstantiating pages we have released from
@@ -4016,34 +4008,24 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
ret = ext4_break_layouts(inode);
if (ret)
- goto out_dio;
-
- first_block_offset = round_up(offset, sb->s_blocksize);
- last_block_offset = round_down((offset + length), sb->s_blocksize) - 1;
+ goto out_invalidate_lock;
- /* Now release the pages and zero block aligned part of pages*/
- if (last_block_offset > first_block_offset) {
+ /*
+ * For journalled data we need to write (and checkpoint) pages
+ * before discarding page cache to avoid inconsistent data on
+ * disk in case of crash before punching trans is committed.
+ */
+ if (ext4_should_journal_data(inode)) {
+ ret = filemap_write_and_wait_range(mapping, offset, end - 1);
+ } else {
ret = ext4_update_disksize_before_punch(inode, offset, length);
- if (ret)
- goto out_dio;
-
- /*
- * For journalled data we need to write (and checkpoint) pages
- * before discarding page cache to avoid inconsistent data on
- * disk in case of crash before punching trans is committed.
- */
- if (ext4_should_journal_data(inode)) {
- ret = filemap_write_and_wait_range(mapping,
- first_block_offset, last_block_offset);
- if (ret)
- goto out_dio;
- }
-
- ext4_truncate_folios_range(inode, first_block_offset,
- last_block_offset + 1);
- truncate_pagecache_range(inode, first_block_offset,
- last_block_offset);
+ ext4_truncate_folios_range(inode, offset, end);
}
+ if (ret)
+ goto out_invalidate_lock;
+
+ /* Now release the pages and zero block aligned part of pages*/
+ truncate_pagecache_range(inode, offset, end - 1);
if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
credits = ext4_writepage_trans_blocks(inode);
@@ -4053,52 +4035,54 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
if (IS_ERR(handle)) {
ret = PTR_ERR(handle);
ext4_std_error(sb, ret);
- goto out_dio;
+ goto out_invalidate_lock;
}
- ret = ext4_zero_partial_blocks(handle, inode, offset,
- length);
+ ret = ext4_zero_partial_blocks(handle, inode, offset, length);
if (ret)
- goto out_stop;
-
- first_block = (offset + sb->s_blocksize - 1) >>
- EXT4_BLOCK_SIZE_BITS(sb);
- stop_block = (offset + length) >> EXT4_BLOCK_SIZE_BITS(sb);
+ goto out_handle;
/* If there are blocks to remove, do it */
- if (stop_block > first_block) {
- ext4_lblk_t hole_len = stop_block - first_block;
+ start_lblk = round_up(offset, blocksize) >> inode->i_blkbits;
+ end_lblk = end >> inode->i_blkbits;
+
+ if (end_lblk > start_lblk) {
+ ext4_lblk_t hole_len = end_lblk - start_lblk;
down_write(&EXT4_I(inode)->i_data_sem);
ext4_discard_preallocations(inode);
- ext4_es_remove_extent(inode, first_block, hole_len);
+ ext4_es_remove_extent(inode, start_lblk, hole_len);
if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
- ret = ext4_ext_remove_space(inode, first_block,
- stop_block - 1);
+ ret = ext4_ext_remove_space(inode, start_lblk,
+ end_lblk - 1);
else
- ret = ext4_ind_remove_space(handle, inode, first_block,
- stop_block);
+ ret = ext4_ind_remove_space(handle, inode, start_lblk,
+ end_lblk);
+ if (ret) {
+ up_write(&EXT4_I(inode)->i_data_sem);
+ goto out_handle;
+ }
- ext4_es_insert_extent(inode, first_block, hole_len, ~0,
+ ext4_es_insert_extent(inode, start_lblk, hole_len, ~0,
EXTENT_STATUS_HOLE, 0);
up_write(&EXT4_I(inode)->i_data_sem);
}
- ext4_fc_track_range(handle, inode, first_block, stop_block);
+ ext4_fc_track_range(handle, inode, start_lblk, end_lblk);
+
+ ret = ext4_mark_inode_dirty(handle, inode);
+ if (unlikely(ret))
+ goto out_handle;
+
+ ext4_update_inode_fsync_trans(handle, inode, 1);
if (IS_SYNC(inode))
ext4_handle_sync(handle);
-
- ret2 = ext4_mark_inode_dirty(handle, inode);
- if (unlikely(ret2))
- ret = ret2;
- if (ret >= 0)
- ext4_update_inode_fsync_trans(handle, inode, 1);
-out_stop:
+out_handle:
ext4_journal_stop(handle);
-out_dio:
+out_invalidate_lock:
filemap_invalidate_unlock(mapping);
-out_mutex:
+out:
inode_unlock(inode);
return ret;
}
--
2.46.1
* [PATCH 05/27] ext4: refactor ext4_zero_range()
2024-10-22 11:10 [PATCH 00/27] ext4: use iomap for regular file's buffered I/O path and enable large folio Zhang Yi
` (4 preceding siblings ...)
2024-10-22 11:10 ` [PATCH 04/27] ext4: refactor ext4_punch_hole() Zhang Yi
@ 2024-10-22 11:10 ` Zhang Yi
2024-12-04 11:52 ` Jan Kara
2024-10-22 11:10 ` [PATCH 06/27] ext4: refactor ext4_collapse_range() Zhang Yi
` (21 subsequent siblings)
27 siblings, 1 reply; 59+ messages in thread
From: Zhang Yi @ 2024-10-22 11:10 UTC (permalink / raw)
To: linux-ext4
Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack,
ritesh.list, hch, djwong, david, zokeefe, yi.zhang, yi.zhang,
chengzhihao1, yukuai3, yangerkun
From: Zhang Yi <yi.zhang@huawei.com>
The current implementation of ext4_zero_range() contains complex
position calculations and stale error tags. To improve the code's
clarity and maintainability, clean it up by: a) simplifying and
renaming variables, making the style the same as ext4_punch_hole();
b) eliminating unnecessary position calculations, writing back all
data in data=journal mode, and dropping the page cache from the
original offset to the end, rather than using aligned blocks; and
c) renaming the stale out_mutex tags.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/ext4/extents.c | 161 +++++++++++++++++++---------------------------
1 file changed, 65 insertions(+), 96 deletions(-)
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index aa07b5ddaff8..f843342e5164 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4565,40 +4565,15 @@ static long ext4_zero_range(struct file *file, loff_t offset,
struct inode *inode = file_inode(file);
struct address_space *mapping = file->f_mapping;
handle_t *handle = NULL;
- unsigned int max_blocks;
loff_t new_size = 0;
- int ret = 0;
- int flags;
- int credits;
- int partial_begin, partial_end;
- loff_t start, end;
- ext4_lblk_t lblk;
+ loff_t end = offset + len;
+ ext4_lblk_t start_lblk, end_lblk;
+ unsigned int blocksize = i_blocksize(inode);
unsigned int blkbits = inode->i_blkbits;
+ int ret, flags, credits;
trace_ext4_zero_range(inode, offset, len, mode);
- /*
- * Round up offset. This is not fallocate, we need to zero out
- * blocks, so convert interior block aligned part of the range to
- * unwritten and possibly manually zero out unaligned parts of the
- * range. Here, start and partial_begin are inclusive, end and
- * partial_end are exclusive.
- */
- start = round_up(offset, 1 << blkbits);
- end = round_down((offset + len), 1 << blkbits);
-
- if (start < offset || end > offset + len)
- return -EINVAL;
- partial_begin = offset & ((1 << blkbits) - 1);
- partial_end = (offset + len) & ((1 << blkbits) - 1);
-
- lblk = start >> blkbits;
- max_blocks = (end >> blkbits);
- if (max_blocks < lblk)
- max_blocks = 0;
- else
- max_blocks -= lblk;
-
inode_lock(inode);
/*
@@ -4606,88 +4581,78 @@ static long ext4_zero_range(struct file *file, loff_t offset,
*/
if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))) {
ret = -EOPNOTSUPP;
- goto out_mutex;
+ goto out;
}
if (!(mode & FALLOC_FL_KEEP_SIZE) &&
- (offset + len > inode->i_size ||
- offset + len > EXT4_I(inode)->i_disksize)) {
- new_size = offset + len;
+ (end > inode->i_size || end > EXT4_I(inode)->i_disksize)) {
+ new_size = end;
ret = inode_newsize_ok(inode, new_size);
if (ret)
- goto out_mutex;
+ goto out;
}
- flags = EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT;
-
/* Wait all existing dio workers, newcomers will block on i_rwsem */
inode_dio_wait(inode);
ret = file_modified(file);
if (ret)
- goto out_mutex;
-
- /* Preallocate the range including the unaligned edges */
- if (partial_begin || partial_end) {
- ret = ext4_alloc_file_blocks(file,
- round_down(offset, 1 << blkbits) >> blkbits,
- (round_up((offset + len), 1 << blkbits) -
- round_down(offset, 1 << blkbits)) >> blkbits,
- new_size, flags);
- if (ret)
- goto out_mutex;
-
- }
-
- /* Zero range excluding the unaligned edges */
- if (max_blocks > 0) {
- flags |= (EXT4_GET_BLOCKS_CONVERT_UNWRITTEN |
- EXT4_EX_NOCACHE);
+ goto out;
- /*
- * Prevent page faults from reinstantiating pages we have
- * released from page cache.
- */
- filemap_invalidate_lock(mapping);
+ /*
+ * Prevent page faults from reinstantiating pages we have released
+ * from page cache.
+ */
+ filemap_invalidate_lock(mapping);
- ret = ext4_break_layouts(inode);
- if (ret) {
- filemap_invalidate_unlock(mapping);
- goto out_mutex;
- }
+ ret = ext4_break_layouts(inode);
+ if (ret)
+ goto out_invalidate_lock;
+ /*
+ * For journalled data we need to write (and checkpoint) pages before
+ * discarding page cache to avoid inconsistent data on disk in case of
+ * crash before zeroing trans is committed.
+ */
+ if (ext4_should_journal_data(inode)) {
+ ret = filemap_write_and_wait_range(mapping, offset, end - 1);
+ } else {
ret = ext4_update_disksize_before_punch(inode, offset, len);
- if (ret) {
- filemap_invalidate_unlock(mapping);
- goto out_mutex;
- }
+ ext4_truncate_folios_range(inode, offset, end);
+ }
+ if (ret)
+ goto out_invalidate_lock;
- /*
- * For journalled data we need to write (and checkpoint) pages
- * before discarding page cache to avoid inconsitent data on
- * disk in case of crash before zeroing trans is committed.
- */
- if (ext4_should_journal_data(inode)) {
- ret = filemap_write_and_wait_range(mapping, start,
- end - 1);
- if (ret) {
- filemap_invalidate_unlock(mapping);
- goto out_mutex;
- }
- }
+ /* Now release the pages and zero block aligned part of pages */
+ truncate_pagecache_range(inode, offset, end - 1);
- /* Now release the pages and zero block aligned part of pages */
- ext4_truncate_folios_range(inode, start, end);
- truncate_pagecache_range(inode, start, end - 1);
+ flags = EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT;
+ /* Preallocate the range including the unaligned edges */
+ if (offset & (blocksize - 1) || end & (blocksize - 1)) {
+ ext4_lblk_t alloc_lblk = offset >> blkbits;
+ ext4_lblk_t len_lblk = EXT4_MAX_BLOCKS(len, offset, blkbits);
- ret = ext4_alloc_file_blocks(file, lblk, max_blocks, new_size,
- flags);
- filemap_invalidate_unlock(mapping);
+ ret = ext4_alloc_file_blocks(file, alloc_lblk, len_lblk,
+ new_size, flags);
if (ret)
- goto out_mutex;
+ goto out_invalidate_lock;
}
- if (!partial_begin && !partial_end)
- goto out_mutex;
+
+ /* Zero range excluding the unaligned edges */
+ start_lblk = round_up(offset, blocksize) >> blkbits;
+ end_lblk = end >> blkbits;
+ if (end_lblk > start_lblk) {
+ ext4_lblk_t zero_blks = end_lblk - start_lblk;
+
+ flags |= (EXT4_GET_BLOCKS_CONVERT_UNWRITTEN | EXT4_EX_NOCACHE);
+ ret = ext4_alloc_file_blocks(file, start_lblk, zero_blks,
+ new_size, flags);
+ if (ret)
+ goto out_invalidate_lock;
+ }
+ /* Finish zeroing out if it doesn't contain partial block */
+ if (!(offset & (blocksize - 1)) && !(end & (blocksize - 1)))
+ goto out_invalidate_lock;
/*
* In worst case we have to writeout two nonadjacent unwritten
@@ -4700,25 +4665,29 @@ static long ext4_zero_range(struct file *file, loff_t offset,
if (IS_ERR(handle)) {
ret = PTR_ERR(handle);
ext4_std_error(inode->i_sb, ret);
- goto out_mutex;
+ goto out_invalidate_lock;
}
+ /* Zero out partial block at the edges of the range */
+ ret = ext4_zero_partial_blocks(handle, inode, offset, len);
+ if (ret)
+ goto out_handle;
+
if (new_size)
ext4_update_inode_size(inode, new_size);
ret = ext4_mark_inode_dirty(handle, inode);
if (unlikely(ret))
goto out_handle;
- /* Zero out partial block at the edges of the range */
- ret = ext4_zero_partial_blocks(handle, inode, offset, len);
- if (ret >= 0)
- ext4_update_inode_fsync_trans(handle, inode, 1);
+ ext4_update_inode_fsync_trans(handle, inode, 1);
if (file->f_flags & O_SYNC)
ext4_handle_sync(handle);
out_handle:
ext4_journal_stop(handle);
-out_mutex:
+out_invalidate_lock:
+ filemap_invalidate_unlock(mapping);
+out:
inode_unlock(inode);
return ret;
}
--
2.46.1
* [PATCH 06/27] ext4: refactor ext4_collapse_range()
2024-10-22 11:10 [PATCH 00/27] ext4: use iomap for regular file's buffered I/O path and enable large folio Zhang Yi
` (5 preceding siblings ...)
2024-10-22 11:10 ` [PATCH 05/27] ext4: refactor ext4_zero_range() Zhang Yi
@ 2024-10-22 11:10 ` Zhang Yi
2024-12-04 11:58 ` Jan Kara
2024-10-22 11:10 ` [PATCH 07/27] ext4: refactor ext4_insert_range() Zhang Yi
` (20 subsequent siblings)
27 siblings, 1 reply; 59+ messages in thread
From: Zhang Yi @ 2024-10-22 11:10 UTC (permalink / raw)
To: linux-ext4
Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack,
ritesh.list, hch, djwong, david, zokeefe, yi.zhang, yi.zhang,
chengzhihao1, yukuai3, yangerkun
From: Zhang Yi <yi.zhang@huawei.com>
Simplify ext4_collapse_range() and align its code style with that of
ext4_zero_range() and ext4_punch_hole(). Refactor it by: a) renaming
variables, b) removing redundant input parameter checks and moving
the remaining checks under i_rwsem in preparation for future
refactoring, and c) renaming the three stale error tags.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/ext4/extents.c | 103 +++++++++++++++++++++-------------------------
1 file changed, 48 insertions(+), 55 deletions(-)
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index f843342e5164..a4e95f3b5f09 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -5295,43 +5295,36 @@ static int ext4_collapse_range(struct file *file, loff_t offset, loff_t len)
struct inode *inode = file_inode(file);
struct super_block *sb = inode->i_sb;
struct address_space *mapping = inode->i_mapping;
- ext4_lblk_t punch_start, punch_stop;
+ loff_t end = offset + len;
+ ext4_lblk_t start_lblk, end_lblk;
handle_t *handle;
unsigned int credits;
- loff_t new_size, ioffset;
+ loff_t start, new_size;
int ret;
- /*
- * We need to test this early because xfstests assumes that a
- * collapse range of (0, 1) will return EOPNOTSUPP if the file
- * system does not support collapse range.
- */
- if (!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
- return -EOPNOTSUPP;
+ trace_ext4_collapse_range(inode, offset, len);
- /* Collapse range works only on fs cluster size aligned regions. */
- if (!IS_ALIGNED(offset | len, EXT4_CLUSTER_SIZE(sb)))
- return -EINVAL;
+ inode_lock(inode);
- trace_ext4_collapse_range(inode, offset, len);
+ /* Currently just for extent based files */
+ if (!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) {
+ ret = -EOPNOTSUPP;
+ goto out;
+ }
- punch_start = offset >> EXT4_BLOCK_SIZE_BITS(sb);
- punch_stop = (offset + len) >> EXT4_BLOCK_SIZE_BITS(sb);
+ /* Collapse range works only on fs cluster size aligned regions. */
+ if (!IS_ALIGNED(offset | len, EXT4_CLUSTER_SIZE(sb))) {
+ ret = -EINVAL;
+ goto out;
+ }
- inode_lock(inode);
/*
* There is no need to overlap collapse range with EOF, in which case
* it is effectively a truncate operation
*/
- if (offset + len >= inode->i_size) {
+ if (end >= inode->i_size) {
ret = -EINVAL;
- goto out_mutex;
- }
-
- /* Currently just for extent based files */
- if (!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) {
- ret = -EOPNOTSUPP;
- goto out_mutex;
+ goto out;
}
/* Wait for existing dio to complete */
@@ -5339,7 +5332,7 @@ static int ext4_collapse_range(struct file *file, loff_t offset, loff_t len)
ret = file_modified(file);
if (ret)
- goto out_mutex;
+ goto out;
/*
* Prevent page faults from reinstantiating pages we have released from
@@ -5349,55 +5342,52 @@ static int ext4_collapse_range(struct file *file, loff_t offset, loff_t len)
ret = ext4_break_layouts(inode);
if (ret)
- goto out_mmap;
+ goto out_invalidate_lock;
/*
+ * Write tail of the last page before removed range and data that
+ * will be shifted since they will get removed from the page cache
+ * below. We are also protected from pages becoming dirty by
+ * i_rwsem and invalidate_lock.
* Need to round down offset to be aligned with page size boundary
* for page size > block size.
*/
- ioffset = round_down(offset, PAGE_SIZE);
- /*
- * Write tail of the last page before removed range since it will get
- * removed from the page cache below.
- */
- ret = filemap_write_and_wait_range(mapping, ioffset, offset);
- if (ret)
- goto out_mmap;
- /*
- * Write data that will be shifted to preserve them when discarding
- * page cache below. We are also protected from pages becoming dirty
- * by i_rwsem and invalidate_lock.
- */
- ret = filemap_write_and_wait_range(mapping, offset + len,
- LLONG_MAX);
+ start = round_down(offset, PAGE_SIZE);
+ ret = filemap_write_and_wait_range(mapping, start, offset);
+ if (!ret)
+ ret = filemap_write_and_wait_range(mapping, end, LLONG_MAX);
if (ret)
- goto out_mmap;
- truncate_pagecache(inode, ioffset);
+ goto out_invalidate_lock;
+
+ truncate_pagecache(inode, start);
credits = ext4_writepage_trans_blocks(inode);
handle = ext4_journal_start(inode, EXT4_HT_TRUNCATE, credits);
if (IS_ERR(handle)) {
ret = PTR_ERR(handle);
- goto out_mmap;
+ goto out_invalidate_lock;
}
ext4_fc_mark_ineligible(sb, EXT4_FC_REASON_FALLOC_RANGE, handle);
+ start_lblk = offset >> inode->i_blkbits;
+ end_lblk = (offset + len) >> inode->i_blkbits;
+
down_write(&EXT4_I(inode)->i_data_sem);
ext4_discard_preallocations(inode);
- ext4_es_remove_extent(inode, punch_start, EXT_MAX_BLOCKS - punch_start);
+ ext4_es_remove_extent(inode, start_lblk, EXT_MAX_BLOCKS - start_lblk);
- ret = ext4_ext_remove_space(inode, punch_start, punch_stop - 1);
+ ret = ext4_ext_remove_space(inode, start_lblk, end_lblk - 1);
if (ret) {
up_write(&EXT4_I(inode)->i_data_sem);
- goto out_stop;
+ goto out_handle;
}
ext4_discard_preallocations(inode);
- ret = ext4_ext_shift_extents(inode, handle, punch_stop,
- punch_stop - punch_start, SHIFT_LEFT);
+ ret = ext4_ext_shift_extents(inode, handle, end_lblk,
+ end_lblk - start_lblk, SHIFT_LEFT);
if (ret) {
up_write(&EXT4_I(inode)->i_data_sem);
- goto out_stop;
+ goto out_handle;
}
new_size = inode->i_size - len;
@@ -5405,16 +5395,19 @@ static int ext4_collapse_range(struct file *file, loff_t offset, loff_t len)
EXT4_I(inode)->i_disksize = new_size;
up_write(&EXT4_I(inode)->i_data_sem);
- if (IS_SYNC(inode))
- ext4_handle_sync(handle);
ret = ext4_mark_inode_dirty(handle, inode);
+ if (ret)
+ goto out_handle;
+
ext4_update_inode_fsync_trans(handle, inode, 1);
+ if (IS_SYNC(inode))
+ ext4_handle_sync(handle);
-out_stop:
+out_handle:
ext4_journal_stop(handle);
-out_mmap:
+out_invalidate_lock:
filemap_invalidate_unlock(mapping);
-out_mutex:
+out:
inode_unlock(inode);
return ret;
}
--
2.46.1
* [PATCH 07/27] ext4: refactor ext4_insert_range()
2024-10-22 11:10 [PATCH 00/27] ext4: use iomap for regular file's buffered I/O path and enable large folio Zhang Yi
` (6 preceding siblings ...)
2024-10-22 11:10 ` [PATCH 06/27] ext4: refactor ext4_collapse_range() Zhang Yi
@ 2024-10-22 11:10 ` Zhang Yi
2024-12-04 12:02 ` Jan Kara
2024-10-22 11:10 ` [PATCH 08/27] ext4: factor out ext4_do_fallocate() Zhang Yi
` (19 subsequent siblings)
27 siblings, 1 reply; 59+ messages in thread
From: Zhang Yi @ 2024-10-22 11:10 UTC (permalink / raw)
To: linux-ext4
Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack,
ritesh.list, hch, djwong, david, zokeefe, yi.zhang, yi.zhang,
chengzhihao1, yukuai3, yangerkun
From: Zhang Yi <yi.zhang@huawei.com>
Simplify ext4_insert_range() and align its code style with that of
ext4_collapse_range(). Refactor it by: a) renaming variables, b)
removing redundant input parameter checks and moving the remaining
checks under i_rwsem in preparation for future refactoring, and c)
renaming the three stale error tags.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/ext4/extents.c | 101 ++++++++++++++++++++++------------------------
1 file changed, 48 insertions(+), 53 deletions(-)
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index a4e95f3b5f09..4e35c2415e9b 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -5428,45 +5428,37 @@ static int ext4_insert_range(struct file *file, loff_t offset, loff_t len)
handle_t *handle;
struct ext4_ext_path *path;
struct ext4_extent *extent;
- ext4_lblk_t offset_lblk, len_lblk, ee_start_lblk = 0;
+ ext4_lblk_t start_lblk, len_lblk, ee_start_lblk = 0;
unsigned int credits, ee_len;
- int ret = 0, depth, split_flag = 0;
- loff_t ioffset;
-
- /*
- * We need to test this early because xfstests assumes that an
- * insert range of (0, 1) will return EOPNOTSUPP if the file
- * system does not support insert range.
- */
- if (!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
- return -EOPNOTSUPP;
-
- /* Insert range works only on fs cluster size aligned regions. */
- if (!IS_ALIGNED(offset | len, EXT4_CLUSTER_SIZE(sb)))
- return -EINVAL;
+ int ret, depth, split_flag = 0;
+ loff_t start;
trace_ext4_insert_range(inode, offset, len);
- offset_lblk = offset >> EXT4_BLOCK_SIZE_BITS(sb);
- len_lblk = len >> EXT4_BLOCK_SIZE_BITS(sb);
-
inode_lock(inode);
+
/* Currently just for extent based files */
if (!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) {
ret = -EOPNOTSUPP;
- goto out_mutex;
+ goto out;
}
- /* Check whether the maximum file size would be exceeded */
- if (len > inode->i_sb->s_maxbytes - inode->i_size) {
- ret = -EFBIG;
- goto out_mutex;
+ /* Insert range works only on fs cluster size aligned regions. */
+ if (!IS_ALIGNED(offset | len, EXT4_CLUSTER_SIZE(sb))) {
+ ret = -EINVAL;
+ goto out;
}
/* Offset must be less than i_size */
if (offset >= inode->i_size) {
ret = -EINVAL;
- goto out_mutex;
+ goto out;
+ }
+
+ /* Check whether the maximum file size would be exceeded */
+ if (len > inode->i_sb->s_maxbytes - inode->i_size) {
+ ret = -EFBIG;
+ goto out;
}
/* Wait for existing dio to complete */
@@ -5474,7 +5466,7 @@ static int ext4_insert_range(struct file *file, loff_t offset, loff_t len)
ret = file_modified(file);
if (ret)
- goto out_mutex;
+ goto out;
/*
* Prevent page faults from reinstantiating pages we have released from
@@ -5484,25 +5476,24 @@ static int ext4_insert_range(struct file *file, loff_t offset, loff_t len)
ret = ext4_break_layouts(inode);
if (ret)
- goto out_mmap;
+ goto out_invalidate_lock;
/*
- * Need to round down to align start offset to page size boundary
- * for page size > block size.
+ * Write out all dirty pages. Need to round down to align start offset
+ * to page size boundary for page size > block size.
*/
- ioffset = round_down(offset, PAGE_SIZE);
- /* Write out all dirty pages */
- ret = filemap_write_and_wait_range(inode->i_mapping, ioffset,
- LLONG_MAX);
+ start = round_down(offset, PAGE_SIZE);
+ ret = filemap_write_and_wait_range(mapping, start, LLONG_MAX);
if (ret)
- goto out_mmap;
- truncate_pagecache(inode, ioffset);
+ goto out_invalidate_lock;
+
+ truncate_pagecache(inode, start);
credits = ext4_writepage_trans_blocks(inode);
handle = ext4_journal_start(inode, EXT4_HT_TRUNCATE, credits);
if (IS_ERR(handle)) {
ret = PTR_ERR(handle);
- goto out_mmap;
+ goto out_invalidate_lock;
}
ext4_fc_mark_ineligible(sb, EXT4_FC_REASON_FALLOC_RANGE, handle);
@@ -5511,16 +5502,19 @@ static int ext4_insert_range(struct file *file, loff_t offset, loff_t len)
EXT4_I(inode)->i_disksize += len;
ret = ext4_mark_inode_dirty(handle, inode);
if (ret)
- goto out_stop;
+ goto out_handle;
+
+ start_lblk = offset >> inode->i_blkbits;
+ len_lblk = len >> inode->i_blkbits;
down_write(&EXT4_I(inode)->i_data_sem);
ext4_discard_preallocations(inode);
- path = ext4_find_extent(inode, offset_lblk, NULL, 0);
+ path = ext4_find_extent(inode, start_lblk, NULL, 0);
if (IS_ERR(path)) {
up_write(&EXT4_I(inode)->i_data_sem);
ret = PTR_ERR(path);
- goto out_stop;
+ goto out_handle;
}
depth = ext_depth(inode);
@@ -5530,16 +5524,16 @@ static int ext4_insert_range(struct file *file, loff_t offset, loff_t len)
ee_len = ext4_ext_get_actual_len(extent);
/*
- * If offset_lblk is not the starting block of extent, split
- * the extent @offset_lblk
+ * If start_lblk is not the starting block of extent, split
+ * the extent @start_lblk
*/
- if ((offset_lblk > ee_start_lblk) &&
- (offset_lblk < (ee_start_lblk + ee_len))) {
+ if ((start_lblk > ee_start_lblk) &&
+ (start_lblk < (ee_start_lblk + ee_len))) {
if (ext4_ext_is_unwritten(extent))
split_flag = EXT4_EXT_MARK_UNWRIT1 |
EXT4_EXT_MARK_UNWRIT2;
path = ext4_split_extent_at(handle, inode, path,
- offset_lblk, split_flag,
+ start_lblk, split_flag,
EXT4_EX_NOCACHE |
EXT4_GET_BLOCKS_PRE_IO |
EXT4_GET_BLOCKS_METADATA_NOFAIL);
@@ -5548,31 +5542,32 @@ static int ext4_insert_range(struct file *file, loff_t offset, loff_t len)
if (IS_ERR(path)) {
up_write(&EXT4_I(inode)->i_data_sem);
ret = PTR_ERR(path);
- goto out_stop;
+ goto out_handle;
}
}
ext4_free_ext_path(path);
- ext4_es_remove_extent(inode, offset_lblk, EXT_MAX_BLOCKS - offset_lblk);
+ ext4_es_remove_extent(inode, start_lblk, EXT_MAX_BLOCKS - start_lblk);
/*
- * if offset_lblk lies in a hole which is at start of file, use
+ * if start_lblk lies in a hole which is at start of file, use
* ee_start_lblk to shift extents
*/
ret = ext4_ext_shift_extents(inode, handle,
- max(ee_start_lblk, offset_lblk), len_lblk, SHIFT_RIGHT);
-
+ max(ee_start_lblk, start_lblk), len_lblk, SHIFT_RIGHT);
up_write(&EXT4_I(inode)->i_data_sem);
+ if (ret)
+ goto out_handle;
+
+ ext4_update_inode_fsync_trans(handle, inode, 1);
if (IS_SYNC(inode))
ext4_handle_sync(handle);
- if (ret >= 0)
- ext4_update_inode_fsync_trans(handle, inode, 1);
-out_stop:
+out_handle:
ext4_journal_stop(handle);
-out_mmap:
+out_invalidate_lock:
filemap_invalidate_unlock(mapping);
-out_mutex:
+out:
inode_unlock(inode);
return ret;
}
--
2.46.1
* [PATCH 08/27] ext4: factor out ext4_do_fallocate()
2024-10-22 11:10 [PATCH 00/27] ext4: use iomap for regular file's buffered I/O path and enable large folio Zhang Yi
` (7 preceding siblings ...)
2024-10-22 11:10 ` [PATCH 07/27] ext4: refactor ext4_insert_range() Zhang Yi
@ 2024-10-22 11:10 ` Zhang Yi
2024-10-22 11:10 ` [PATCH 09/27] ext4: move out inode_lock into ext4_fallocate() Zhang Yi
` (18 subsequent siblings)
27 siblings, 0 replies; 59+ messages in thread
From: Zhang Yi @ 2024-10-22 11:10 UTC (permalink / raw)
To: linux-ext4
Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack,
ritesh.list, hch, djwong, david, zokeefe, yi.zhang, yi.zhang,
chengzhihao1, yukuai3, yangerkun
From: Zhang Yi <yi.zhang@huawei.com>
Currently, the real work of a normal fallocate is open coded in
ext4_fallocate(). Factor it out into a new helper, ext4_do_fallocate(),
in the same way the other operations (e.g. ext4_zero_range()) called
from ext4_fallocate() are handled. This makes the code clearer. No
functional changes.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
fs/ext4/extents.c | 125 ++++++++++++++++++++++------------------------
1 file changed, 60 insertions(+), 65 deletions(-)
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 4e35c2415e9b..2f727104f53d 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4692,6 +4692,58 @@ static long ext4_zero_range(struct file *file, loff_t offset,
return ret;
}
+static long ext4_do_fallocate(struct file *file, loff_t offset,
+ loff_t len, int mode)
+{
+ struct inode *inode = file_inode(file);
+ loff_t end = offset + len;
+ loff_t new_size = 0;
+ ext4_lblk_t start_lblk, len_lblk;
+ int ret;
+
+ trace_ext4_fallocate_enter(inode, offset, len, mode);
+
+ start_lblk = offset >> inode->i_blkbits;
+ len_lblk = EXT4_MAX_BLOCKS(len, offset, inode->i_blkbits);
+
+ inode_lock(inode);
+
+ /* We only support preallocation for extent-based files only. */
+ if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))) {
+ ret = -EOPNOTSUPP;
+ goto out;
+ }
+
+ if (!(mode & FALLOC_FL_KEEP_SIZE) &&
+ (end > inode->i_size || end > EXT4_I(inode)->i_disksize)) {
+ new_size = end;
+ ret = inode_newsize_ok(inode, new_size);
+ if (ret)
+ goto out;
+ }
+
+ /* Wait all existing dio workers, newcomers will block on i_rwsem */
+ inode_dio_wait(inode);
+
+ ret = file_modified(file);
+ if (ret)
+ goto out;
+
+ ret = ext4_alloc_file_blocks(file, start_lblk, len_lblk, new_size,
+ EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT);
+ if (ret)
+ goto out;
+
+ if (file->f_flags & O_SYNC && EXT4_SB(inode->i_sb)->s_journal) {
+ ret = ext4_fc_commit(EXT4_SB(inode->i_sb)->s_journal,
+ EXT4_I(inode)->i_sync_tid);
+ }
+out:
+ inode_unlock(inode);
+ trace_ext4_fallocate_exit(inode, offset, len_lblk, ret);
+ return ret;
+}
+
/*
* preallocate space for a file. This implements ext4's fallocate file
* operation, which gets called from sys_fallocate system call.
@@ -4702,12 +4754,7 @@ static long ext4_zero_range(struct file *file, loff_t offset,
long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
{
struct inode *inode = file_inode(file);
- loff_t new_size = 0;
- unsigned int max_blocks;
- int ret = 0;
- int flags;
- ext4_lblk_t lblk;
- unsigned int blkbits = inode->i_blkbits;
+ int ret;
/*
* Encrypted inodes can't handle collapse range or insert
@@ -4729,71 +4776,19 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
ret = ext4_convert_inline_data(inode);
inode_unlock(inode);
if (ret)
- goto exit;
+ return ret;
- if (mode & FALLOC_FL_PUNCH_HOLE) {
+ if (mode & FALLOC_FL_PUNCH_HOLE)
ret = ext4_punch_hole(file, offset, len);
- goto exit;
- }
-
- if (mode & FALLOC_FL_COLLAPSE_RANGE) {
+ else if (mode & FALLOC_FL_COLLAPSE_RANGE)
ret = ext4_collapse_range(file, offset, len);
- goto exit;
- }
-
- if (mode & FALLOC_FL_INSERT_RANGE) {
+ else if (mode & FALLOC_FL_INSERT_RANGE)
ret = ext4_insert_range(file, offset, len);
- goto exit;
- }
-
- if (mode & FALLOC_FL_ZERO_RANGE) {
+ else if (mode & FALLOC_FL_ZERO_RANGE)
ret = ext4_zero_range(file, offset, len, mode);
- goto exit;
- }
- trace_ext4_fallocate_enter(inode, offset, len, mode);
- lblk = offset >> blkbits;
-
- max_blocks = EXT4_MAX_BLOCKS(len, offset, blkbits);
- flags = EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT;
-
- inode_lock(inode);
-
- /*
- * We only support preallocation for extent-based files only
- */
- if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))) {
- ret = -EOPNOTSUPP;
- goto out;
- }
-
- if (!(mode & FALLOC_FL_KEEP_SIZE) &&
- (offset + len > inode->i_size ||
- offset + len > EXT4_I(inode)->i_disksize)) {
- new_size = offset + len;
- ret = inode_newsize_ok(inode, new_size);
- if (ret)
- goto out;
- }
-
- /* Wait all existing dio workers, newcomers will block on i_rwsem */
- inode_dio_wait(inode);
-
- ret = file_modified(file);
- if (ret)
- goto out;
-
- ret = ext4_alloc_file_blocks(file, lblk, max_blocks, new_size, flags);
- if (ret)
- goto out;
+ else
+ ret = ext4_do_fallocate(file, offset, len, mode);
- if (file->f_flags & O_SYNC && EXT4_SB(inode->i_sb)->s_journal) {
- ret = ext4_fc_commit(EXT4_SB(inode->i_sb)->s_journal,
- EXT4_I(inode)->i_sync_tid);
- }
-out:
- inode_unlock(inode);
- trace_ext4_fallocate_exit(inode, offset, max_blocks, ret);
-exit:
return ret;
}
--
2.46.1
* [PATCH 09/27] ext4: move out inode_lock into ext4_fallocate()
2024-10-22 11:10 [PATCH 00/27] ext4: use iomap for regular file's buffered I/O path and enable large folio Zhang Yi
` (8 preceding siblings ...)
2024-10-22 11:10 ` [PATCH 08/27] ext4: factor out ext4_do_fallocate() Zhang Yi
@ 2024-10-22 11:10 ` Zhang Yi
2024-12-04 12:05 ` Jan Kara
2024-10-22 11:10 ` [PATCH 10/27] ext4: move out common parts " Zhang Yi
` (17 subsequent siblings)
27 siblings, 1 reply; 59+ messages in thread
From: Zhang Yi @ 2024-10-22 11:10 UTC (permalink / raw)
To: linux-ext4
Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack,
ritesh.list, hch, djwong, david, zokeefe, yi.zhang, yi.zhang,
chengzhihao1, yukuai3, yangerkun
From: Zhang Yi <yi.zhang@huawei.com>
Currently, all five sub-functions of ext4_fallocate() acquire the
inode's i_rwsem at the beginning and release it before exiting. This
process can be simplified by factoring out the management of i_rwsem
into the ext4_fallocate() function.
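The locking pattern this patch moves to — take i_rwsem once in the
dispatcher and have each sub-operation merely assert that it is held
(via WARN_ON_ONCE(!inode_is_locked(inode)) in the actual patch) — can be
illustrated with a minimal userspace sketch. All names here (toy_inode,
toy_fallocate, etc.) are hypothetical stand-ins, not the ext4 code:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the refactor: the single entry point owns the lock,
 * and each sub-operation only asserts that it is already held,
 * mirroring the WARN_ON_ONCE(!inode_is_locked()) checks. */
struct toy_inode { bool rwsem_held; };

static int do_zero_range(struct toy_inode *inode)
{
	assert(inode->rwsem_held);	/* caller must hold the lock */
	return 0;
}

static int do_punch_hole(struct toy_inode *inode)
{
	assert(inode->rwsem_held);
	return 0;
}

static int toy_fallocate(struct toy_inode *inode, int mode)
{
	int ret;

	inode->rwsem_held = true;	/* stands in for inode_lock() */
	if (mode == 0)
		ret = do_zero_range(inode);
	else
		ret = do_punch_hole(inode);
	inode->rwsem_held = false;	/* stands in for inode_unlock() */
	return ret;
}
```

The benefit, as the patch notes, is that the lock/unlock pairs and the
associated error-path labels disappear from every sub-function.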
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/ext4/extents.c | 90 +++++++++++++++--------------------------------
fs/ext4/inode.c | 13 +++----
2 files changed, 33 insertions(+), 70 deletions(-)
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 2f727104f53d..a2db4e85790f 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4573,23 +4573,18 @@ static long ext4_zero_range(struct file *file, loff_t offset,
int ret, flags, credits;
trace_ext4_zero_range(inode, offset, len, mode);
+ WARN_ON_ONCE(!inode_is_locked(inode));
- inode_lock(inode);
-
- /*
- * Indirect files do not support unwritten extents
- */
- if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))) {
- ret = -EOPNOTSUPP;
- goto out;
- }
+ /* Indirect files do not support unwritten extents */
+ if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)))
+ return -EOPNOTSUPP;
if (!(mode & FALLOC_FL_KEEP_SIZE) &&
(end > inode->i_size || end > EXT4_I(inode)->i_disksize)) {
new_size = end;
ret = inode_newsize_ok(inode, new_size);
if (ret)
- goto out;
+ return ret;
}
/* Wait all existing dio workers, newcomers will block on i_rwsem */
@@ -4597,7 +4592,7 @@ static long ext4_zero_range(struct file *file, loff_t offset,
ret = file_modified(file);
if (ret)
- goto out;
+ return ret;
/*
* Prevent page faults from reinstantiating pages we have released
@@ -4687,8 +4682,6 @@ static long ext4_zero_range(struct file *file, loff_t offset,
ext4_journal_stop(handle);
out_invalidate_lock:
filemap_invalidate_unlock(mapping);
-out:
- inode_unlock(inode);
return ret;
}
@@ -4702,12 +4695,11 @@ static long ext4_do_fallocate(struct file *file, loff_t offset,
int ret;
trace_ext4_fallocate_enter(inode, offset, len, mode);
+ WARN_ON_ONCE(!inode_is_locked(inode));
start_lblk = offset >> inode->i_blkbits;
len_lblk = EXT4_MAX_BLOCKS(len, offset, inode->i_blkbits);
- inode_lock(inode);
-
/* We only support preallocation for extent-based files only. */
if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))) {
ret = -EOPNOTSUPP;
@@ -4739,7 +4731,6 @@ static long ext4_do_fallocate(struct file *file, loff_t offset,
EXT4_I(inode)->i_sync_tid);
}
out:
- inode_unlock(inode);
trace_ext4_fallocate_exit(inode, offset, len_lblk, ret);
return ret;
}
@@ -4774,9 +4765,8 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
inode_lock(inode);
ret = ext4_convert_inline_data(inode);
- inode_unlock(inode);
if (ret)
- return ret;
+ goto out;
if (mode & FALLOC_FL_PUNCH_HOLE)
ret = ext4_punch_hole(file, offset, len);
@@ -4788,7 +4778,8 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
ret = ext4_zero_range(file, offset, len, mode);
else
ret = ext4_do_fallocate(file, offset, len, mode);
-
+out:
+ inode_unlock(inode);
return ret;
}
@@ -5298,36 +5289,27 @@ static int ext4_collapse_range(struct file *file, loff_t offset, loff_t len)
int ret;
trace_ext4_collapse_range(inode, offset, len);
-
- inode_lock(inode);
+ WARN_ON_ONCE(!inode_is_locked(inode));
/* Currently just for extent based files */
- if (!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) {
- ret = -EOPNOTSUPP;
- goto out;
- }
-
+ if (!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
+ return -EOPNOTSUPP;
/* Collapse range works only on fs cluster size aligned regions. */
- if (!IS_ALIGNED(offset | len, EXT4_CLUSTER_SIZE(sb))) {
- ret = -EINVAL;
- goto out;
- }
-
+ if (!IS_ALIGNED(offset | len, EXT4_CLUSTER_SIZE(sb)))
+ return -EINVAL;
/*
* There is no need to overlap collapse range with EOF, in which case
* it is effectively a truncate operation
*/
- if (end >= inode->i_size) {
- ret = -EINVAL;
- goto out;
- }
+ if (end >= inode->i_size)
+ return -EINVAL;
/* Wait for existing dio to complete */
inode_dio_wait(inode);
ret = file_modified(file);
if (ret)
- goto out;
+ return ret;
/*
* Prevent page faults from reinstantiating pages we have released from
@@ -5402,8 +5384,6 @@ static int ext4_collapse_range(struct file *file, loff_t offset, loff_t len)
ext4_journal_stop(handle);
out_invalidate_lock:
filemap_invalidate_unlock(mapping);
-out:
- inode_unlock(inode);
return ret;
}
@@ -5429,39 +5409,27 @@ static int ext4_insert_range(struct file *file, loff_t offset, loff_t len)
loff_t start;
trace_ext4_insert_range(inode, offset, len);
-
- inode_lock(inode);
+ WARN_ON_ONCE(!inode_is_locked(inode));
/* Currently just for extent based files */
- if (!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) {
- ret = -EOPNOTSUPP;
- goto out;
- }
-
+ if (!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
+ return -EOPNOTSUPP;
/* Insert range works only on fs cluster size aligned regions. */
- if (!IS_ALIGNED(offset | len, EXT4_CLUSTER_SIZE(sb))) {
- ret = -EINVAL;
- goto out;
- }
-
+ if (!IS_ALIGNED(offset | len, EXT4_CLUSTER_SIZE(sb)))
+ return -EINVAL;
/* Offset must be less than i_size */
- if (offset >= inode->i_size) {
- ret = -EINVAL;
- goto out;
- }
-
+ if (offset >= inode->i_size)
+ return -EINVAL;
/* Check whether the maximum file size would be exceeded */
- if (len > inode->i_sb->s_maxbytes - inode->i_size) {
- ret = -EFBIG;
- goto out;
- }
+ if (len > inode->i_sb->s_maxbytes - inode->i_size)
+ return -EFBIG;
/* Wait for existing dio to complete */
inode_dio_wait(inode);
ret = file_modified(file);
if (ret)
- goto out;
+ return ret;
/*
* Prevent page faults from reinstantiating pages we have released from
@@ -5562,8 +5530,6 @@ static int ext4_insert_range(struct file *file, loff_t offset, loff_t len)
ext4_journal_stop(handle);
out_invalidate_lock:
filemap_invalidate_unlock(mapping);
-out:
- inode_unlock(inode);
return ret;
}
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 1d128333bd06..bea19cd6e676 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3962,15 +3962,14 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
unsigned long blocksize = i_blocksize(inode);
handle_t *handle;
unsigned int credits;
- int ret = 0;
+ int ret;
trace_ext4_punch_hole(inode, offset, length, 0);
-
- inode_lock(inode);
+ WARN_ON_ONCE(!inode_is_locked(inode));
/* No need to punch hole beyond i_size */
if (offset >= inode->i_size)
- goto out;
+ return 0;
/*
* If the hole extends beyond i_size, set the hole to end after
@@ -3990,7 +3989,7 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
if (offset & (blocksize - 1) || end & (blocksize - 1)) {
ret = ext4_inode_attach_jinode(inode);
if (ret < 0)
- goto out;
+ return ret;
}
/* Wait all existing dio workers, newcomers will block on i_rwsem */
@@ -3998,7 +3997,7 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
ret = file_modified(file);
if (ret)
- goto out;
+ return ret;
/*
* Prevent page faults from reinstantiating pages we have released from
@@ -4082,8 +4081,6 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
ext4_journal_stop(handle);
out_invalidate_lock:
filemap_invalidate_unlock(mapping);
-out:
- inode_unlock(inode);
return ret;
}
--
2.46.1
* [PATCH 10/27] ext4: move out common parts into ext4_fallocate()
2024-10-22 11:10 [PATCH 00/27] ext4: use iomap for regular file's buffered I/O path and enable large folio Zhang Yi
` (9 preceding siblings ...)
2024-10-22 11:10 ` [PATCH 09/27] ext4: move out inode_lock into ext4_fallocate() Zhang Yi
@ 2024-10-22 11:10 ` Zhang Yi
2024-12-04 12:10 ` Jan Kara
2024-10-22 11:10 ` [PATCH 11/27] ext4: use reserved metadata blocks when splitting extent on endio Zhang Yi
` (16 subsequent siblings)
27 siblings, 1 reply; 59+ messages in thread
From: Zhang Yi @ 2024-10-22 11:10 UTC (permalink / raw)
To: linux-ext4
Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack,
ritesh.list, hch, djwong, david, zokeefe, yi.zhang, yi.zhang,
chengzhihao1, yukuai3, yangerkun
From: Zhang Yi <yi.zhang@huawei.com>
Currently, the zero range, punch hole, collapse range, and insert range
operations all first wait for existing direct I/O workers to complete,
then acquire the mapping's invalidate lock before performing the actual
work. These common steps are nearly identical, so simplify the code by
factoring them out into ext4_fallocate().
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/ext4/extents.c | 121 ++++++++++++++++------------------------------
fs/ext4/inode.c | 23 +--------
2 files changed, 43 insertions(+), 101 deletions(-)
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index a2db4e85790f..d5067d5aa449 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4587,23 +4587,6 @@ static long ext4_zero_range(struct file *file, loff_t offset,
return ret;
}
- /* Wait all existing dio workers, newcomers will block on i_rwsem */
- inode_dio_wait(inode);
-
- ret = file_modified(file);
- if (ret)
- return ret;
-
- /*
- * Prevent page faults from reinstantiating pages we have released
- * from page cache.
- */
- filemap_invalidate_lock(mapping);
-
- ret = ext4_break_layouts(inode);
- if (ret)
- goto out_invalidate_lock;
-
/*
* For journalled data we need to write (and checkpoint) pages before
* discarding page cache to avoid inconsitent data on disk in case of
@@ -4616,7 +4599,7 @@ static long ext4_zero_range(struct file *file, loff_t offset,
ext4_truncate_folios_range(inode, offset, end);
}
if (ret)
- goto out_invalidate_lock;
+ return ret;
/* Now release the pages and zero block aligned part of pages */
truncate_pagecache_range(inode, offset, end - 1);
@@ -4630,7 +4613,7 @@ static long ext4_zero_range(struct file *file, loff_t offset,
ret = ext4_alloc_file_blocks(file, alloc_lblk, len_lblk,
new_size, flags);
if (ret)
- goto out_invalidate_lock;
+ return ret;
}
/* Zero range excluding the unaligned edges */
@@ -4643,11 +4626,11 @@ static long ext4_zero_range(struct file *file, loff_t offset,
ret = ext4_alloc_file_blocks(file, start_lblk, zero_blks,
new_size, flags);
if (ret)
- goto out_invalidate_lock;
+ return ret;
}
/* Finish zeroing out if it doesn't contain partial block */
if (!(offset & (blocksize - 1)) && !(end & (blocksize - 1)))
- goto out_invalidate_lock;
+ return ret;
/*
* In worst case we have to writeout two nonadjacent unwritten
@@ -4660,7 +4643,7 @@ static long ext4_zero_range(struct file *file, loff_t offset,
if (IS_ERR(handle)) {
ret = PTR_ERR(handle);
ext4_std_error(inode->i_sb, ret);
- goto out_invalidate_lock;
+ return ret;
}
/* Zero out partial block at the edges of the range */
@@ -4680,8 +4663,6 @@ static long ext4_zero_range(struct file *file, loff_t offset,
out_handle:
ext4_journal_stop(handle);
-out_invalidate_lock:
- filemap_invalidate_unlock(mapping);
return ret;
}
@@ -4714,13 +4695,6 @@ static long ext4_do_fallocate(struct file *file, loff_t offset,
goto out;
}
- /* Wait all existing dio workers, newcomers will block on i_rwsem */
- inode_dio_wait(inode);
-
- ret = file_modified(file);
- if (ret)
- goto out;
-
ret = ext4_alloc_file_blocks(file, start_lblk, len_lblk, new_size,
EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT);
if (ret)
@@ -4745,6 +4719,7 @@ static long ext4_do_fallocate(struct file *file, loff_t offset,
long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
{
struct inode *inode = file_inode(file);
+ struct address_space *mapping = file->f_mapping;
int ret;
/*
@@ -4768,6 +4743,29 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
if (ret)
goto out;
+ /* Wait all existing dio workers, newcomers will block on i_rwsem */
+ inode_dio_wait(inode);
+
+ ret = file_modified(file);
+ if (ret)
+ return ret;
+
+ if ((mode & FALLOC_FL_MODE_MASK) == FALLOC_FL_ALLOCATE_RANGE) {
+ ret = ext4_do_fallocate(file, offset, len, mode);
+ goto out;
+ }
+
+ /*
+ * Follow-up operations will drop page cache, hold invalidate lock
+ * to prevent page faults from reinstantiating pages we have
+ * released from page cache.
+ */
+ filemap_invalidate_lock(mapping);
+
+ ret = ext4_break_layouts(inode);
+ if (ret)
+ goto out_invalidate_lock;
+
if (mode & FALLOC_FL_PUNCH_HOLE)
ret = ext4_punch_hole(file, offset, len);
else if (mode & FALLOC_FL_COLLAPSE_RANGE)
@@ -4777,7 +4775,10 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
else if (mode & FALLOC_FL_ZERO_RANGE)
ret = ext4_zero_range(file, offset, len, mode);
else
- ret = ext4_do_fallocate(file, offset, len, mode);
+ ret = -EOPNOTSUPP;
+
+out_invalidate_lock:
+ filemap_invalidate_unlock(mapping);
out:
inode_unlock(inode);
return ret;
@@ -5304,23 +5305,6 @@ static int ext4_collapse_range(struct file *file, loff_t offset, loff_t len)
if (end >= inode->i_size)
return -EINVAL;
- /* Wait for existing dio to complete */
- inode_dio_wait(inode);
-
- ret = file_modified(file);
- if (ret)
- return ret;
-
- /*
- * Prevent page faults from reinstantiating pages we have released from
- * page cache.
- */
- filemap_invalidate_lock(mapping);
-
- ret = ext4_break_layouts(inode);
- if (ret)
- goto out_invalidate_lock;
-
/*
* Write tail of the last page before removed range and data that
* will be shifted since they will get removed from the page cache
@@ -5334,16 +5318,15 @@ static int ext4_collapse_range(struct file *file, loff_t offset, loff_t len)
if (!ret)
ret = filemap_write_and_wait_range(mapping, end, LLONG_MAX);
if (ret)
- goto out_invalidate_lock;
+ return ret;
truncate_pagecache(inode, start);
credits = ext4_writepage_trans_blocks(inode);
handle = ext4_journal_start(inode, EXT4_HT_TRUNCATE, credits);
- if (IS_ERR(handle)) {
- ret = PTR_ERR(handle);
- goto out_invalidate_lock;
- }
+ if (IS_ERR(handle))
+ return PTR_ERR(handle);
+
ext4_fc_mark_ineligible(sb, EXT4_FC_REASON_FALLOC_RANGE, handle);
start_lblk = offset >> inode->i_blkbits;
@@ -5382,8 +5365,6 @@ static int ext4_collapse_range(struct file *file, loff_t offset, loff_t len)
out_handle:
ext4_journal_stop(handle);
-out_invalidate_lock:
- filemap_invalidate_unlock(mapping);
return ret;
}
@@ -5424,23 +5405,6 @@ static int ext4_insert_range(struct file *file, loff_t offset, loff_t len)
if (len > inode->i_sb->s_maxbytes - inode->i_size)
return -EFBIG;
- /* Wait for existing dio to complete */
- inode_dio_wait(inode);
-
- ret = file_modified(file);
- if (ret)
- return ret;
-
- /*
- * Prevent page faults from reinstantiating pages we have released from
- * page cache.
- */
- filemap_invalidate_lock(mapping);
-
- ret = ext4_break_layouts(inode);
- if (ret)
- goto out_invalidate_lock;
-
/*
* Write out all dirty pages. Need to round down to align start offset
* to page size boundary for page size > block size.
@@ -5448,16 +5412,15 @@ static int ext4_insert_range(struct file *file, loff_t offset, loff_t len)
start = round_down(offset, PAGE_SIZE);
ret = filemap_write_and_wait_range(mapping, start, LLONG_MAX);
if (ret)
- goto out_invalidate_lock;
+ return ret;
truncate_pagecache(inode, start);
credits = ext4_writepage_trans_blocks(inode);
handle = ext4_journal_start(inode, EXT4_HT_TRUNCATE, credits);
- if (IS_ERR(handle)) {
- ret = PTR_ERR(handle);
- goto out_invalidate_lock;
- }
+ if (IS_ERR(handle))
+ return PTR_ERR(handle);
+
ext4_fc_mark_ineligible(sb, EXT4_FC_REASON_FALLOC_RANGE, handle);
/* Expand file to avoid data loss if there is error while shifting */
@@ -5528,8 +5491,6 @@ static int ext4_insert_range(struct file *file, loff_t offset, loff_t len)
out_handle:
ext4_journal_stop(handle);
-out_invalidate_lock:
- filemap_invalidate_unlock(mapping);
return ret;
}
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index bea19cd6e676..1ccf84a64b7b 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3992,23 +3992,6 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
return ret;
}
- /* Wait all existing dio workers, newcomers will block on i_rwsem */
- inode_dio_wait(inode);
-
- ret = file_modified(file);
- if (ret)
- return ret;
-
- /*
- * Prevent page faults from reinstantiating pages we have released from
- * page cache.
- */
- filemap_invalidate_lock(mapping);
-
- ret = ext4_break_layouts(inode);
- if (ret)
- goto out_invalidate_lock;
-
/*
* For journalled data we need to write (and checkpoint) pages
* before discarding page cache to avoid inconsitent data on
@@ -4021,7 +4004,7 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
ext4_truncate_folios_range(inode, offset, end);
}
if (ret)
- goto out_invalidate_lock;
+ return ret;
/* Now release the pages and zero block aligned part of pages*/
truncate_pagecache_range(inode, offset, end - 1);
@@ -4034,7 +4017,7 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
if (IS_ERR(handle)) {
ret = PTR_ERR(handle);
ext4_std_error(sb, ret);
- goto out_invalidate_lock;
+ return ret;
}
ret = ext4_zero_partial_blocks(handle, inode, offset, length);
@@ -4079,8 +4062,6 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
ext4_handle_sync(handle);
out_handle:
ext4_journal_stop(handle);
-out_invalidate_lock:
- filemap_invalidate_unlock(mapping);
return ret;
}
--
2.46.1
* [PATCH 11/27] ext4: use reserved metadata blocks when splitting extent on endio
2024-10-22 11:10 [PATCH 00/27] ext4: use iomap for regular file's buffered I/O path and enable large folio Zhang Yi
` (10 preceding siblings ...)
2024-10-22 11:10 ` [PATCH 10/27] ext4: move out common parts " Zhang Yi
@ 2024-10-22 11:10 ` Zhang Yi
2024-12-04 12:16 ` Jan Kara
2024-10-22 11:10 ` [PATCH 12/27] ext4: introduce seq counter for the extent status entry Zhang Yi
` (15 subsequent siblings)
27 siblings, 1 reply; 59+ messages in thread
From: Zhang Yi @ 2024-10-22 11:10 UTC (permalink / raw)
To: linux-ext4
Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack,
ritesh.list, hch, djwong, david, zokeefe, yi.zhang, yi.zhang,
chengzhihao1, yukuai3, yangerkun
From: Zhang Yi <yi.zhang@huawei.com>
When performing buffered writes, we may need to split and convert an
unwritten extent into a written one during the end I/O process. However,
we do not reserve space specifically for these metadata changes; we only
reserve 2% of space or 4096 blocks. To address this, we use
EXT4_GET_BLOCKS_PRE_IO to potentially split extents in advance and
EXT4_GET_BLOCKS_METADATA_NOFAIL to utilize reserved space if necessary.
These two approaches reduce the likelihood of running out of space and
losing data. However, they are merely best efforts: we could still run
out of space, and since there is not much difference between converting
an extent during the writeback process and during the end I/O process,
postponing the conversion does not increase the risk of losing data.
Therefore, also use EXT4_GET_BLOCKS_METADATA_NOFAIL in
ext4_convert_unwritten_extents_endio() to prepare for the buffered I/O
iomap conversion, which may perform extent conversion during the end I/O
process.
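The "dip into a reserved pool only when a NOFAIL allocation would
otherwise fail" idea behind EXT4_GET_BLOCKS_METADATA_NOFAIL can be
sketched in a few lines of userspace C. The names (toy_pool,
toy_alloc_block, TOY_NOFAIL) are illustrative stand-ins, not the ext4
allocator, and the sketch deliberately shows that the fallback is still
best effort — the reserved pool can also run dry:

```c
#include <stdbool.h>

/* Toy model: a main pool plus a small reserved pool that only
 * NOFAIL-flagged allocations may consume. */
struct toy_pool { int main_free; int reserved_free; };

#define TOY_NOFAIL 0x1

static bool toy_alloc_block(struct toy_pool *p, int flags)
{
	if (p->main_free > 0) {
		p->main_free--;			/* normal path */
		return true;
	}
	if ((flags & TOY_NOFAIL) && p->reserved_free > 0) {
		p->reserved_free--;		/* last-resort reserve */
		return true;
	}
	return false;				/* best effort: can still fail */
}
```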
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/ext4/extents.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index d5067d5aa449..33bc2cc5aff4 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -3767,6 +3767,8 @@ ext4_convert_unwritten_extents_endio(handle_t *handle, struct inode *inode,
* illegal.
*/
if (ee_block != map->m_lblk || ee_len > map->m_len) {
+ int flags = EXT4_GET_BLOCKS_CONVERT |
+ EXT4_GET_BLOCKS_METADATA_NOFAIL;
#ifdef CONFIG_EXT4_DEBUG
ext4_warning(inode->i_sb, "Inode (%ld) finished: extent logical block %llu,"
" len %u; IO logical block %llu, len %u",
@@ -3774,7 +3776,7 @@ ext4_convert_unwritten_extents_endio(handle_t *handle, struct inode *inode,
(unsigned long long)map->m_lblk, map->m_len);
#endif
path = ext4_split_convert_extents(handle, inode, map, path,
- EXT4_GET_BLOCKS_CONVERT, NULL);
+ flags, NULL);
if (IS_ERR(path))
return path;
--
2.46.1
^ permalink raw reply related [flat|nested] 59+ messages in thread
* [PATCH 12/27] ext4: introduce seq counter for the extent status entry
2024-10-22 11:10 [PATCH 00/27] ext4: use iomap for regular file's buffered I/O path and enable large folio Zhang Yi
` (11 preceding siblings ...)
2024-10-22 11:10 ` [PATCH 11/27] ext4: use reserved metadata blocks when splitting extent on endio Zhang Yi
@ 2024-10-22 11:10 ` Zhang Yi
2024-12-04 12:42 ` Jan Kara
2024-10-22 11:10 ` [PATCH 13/27] ext4: add a new iomap aops for regular file's buffered IO path Zhang Yi
` (14 subsequent siblings)
27 siblings, 1 reply; 59+ messages in thread
From: Zhang Yi @ 2024-10-22 11:10 UTC (permalink / raw)
To: linux-ext4
Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack,
ritesh.list, hch, djwong, david, zokeefe, yi.zhang, yi.zhang,
chengzhihao1, yukuai3, yangerkun
From: Zhang Yi <yi.zhang@huawei.com>
In the iomap_write_iter(), the iomap buffered write frame does not hold
any locks between querying the inode extent mapping info and performing
page cache writes. As a result, the extent mapping can be changed due to
concurrent I/O in flight. Similarly, in the iomap_writepage_map(), the
write-back process faces a similar problem: concurrent changes can
invalidate the extent mapping before the I/O is submitted.
Therefore, both of these processes must recheck the mapping info after
acquiring the folio lock. To address this, similar to XFS, we propose
introducing an extent sequence number to serve as a validity cookie for
the extent. We will increment this number whenever the extent status
tree changes, thereby preparing for the buffered write iomap conversion.
Besides, it also changes the trace code style to make checkpatch.pl
happy.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/ext4/ext4.h | 1 +
fs/ext4/extents_status.c | 13 ++++++++-
fs/ext4/super.c | 1 +
include/trace/events/ext4.h | 57 +++++++++++++++++++++----------------
4 files changed, 46 insertions(+), 26 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 6d0267afd4c1..44f6867d3037 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1123,6 +1123,7 @@ struct ext4_inode_info {
ext4_lblk_t i_es_shrink_lblk; /* Offset where we start searching for
extents to shrink. Protected by
i_es_lock */
+ unsigned int i_es_seq; /* Change counter for extents */
/* ialloc */
ext4_group_t i_last_alloc_group;
diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index c786691dabd3..bea4f87db502 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -204,6 +204,13 @@ static inline ext4_lblk_t ext4_es_end(struct extent_status *es)
return es->es_lblk + es->es_len - 1;
}
+static inline void ext4_es_inc_seq(struct inode *inode)
+{
+ struct ext4_inode_info *ei = EXT4_I(inode);
+
+ WRITE_ONCE(ei->i_es_seq, READ_ONCE(ei->i_es_seq) + 1);
+}
+
/*
* search through the tree for an delayed extent with a given offset. If
* it can't be found, try to find next extent.
@@ -872,6 +879,7 @@ void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
BUG_ON(end < lblk);
WARN_ON_ONCE(status & EXTENT_STATUS_DELAYED);
+ ext4_es_inc_seq(inode);
newes.es_lblk = lblk;
newes.es_len = len;
ext4_es_store_pblock_status(&newes, pblk, status);
@@ -1519,13 +1527,15 @@ void ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
if (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY)
return;
- trace_ext4_es_remove_extent(inode, lblk, len);
es_debug("remove [%u/%u) from extent status tree of inode %lu\n",
lblk, len, inode->i_ino);
if (!len)
return;
+ ext4_es_inc_seq(inode);
+ trace_ext4_es_remove_extent(inode, lblk, len);
+
end = lblk + len - 1;
BUG_ON(end < lblk);
@@ -2107,6 +2117,7 @@ void ext4_es_insert_delayed_extent(struct inode *inode, ext4_lblk_t lblk,
WARN_ON_ONCE((EXT4_B2C(sbi, lblk) == EXT4_B2C(sbi, end)) &&
end_allocated);
+ ext4_es_inc_seq(inode);
newes.es_lblk = lblk;
newes.es_len = len;
ext4_es_store_pblock_status(&newes, ~0, EXTENT_STATUS_DELAYED);
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 16a4ce704460..a01e0bbe57c8 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1409,6 +1409,7 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
ei->i_es_all_nr = 0;
ei->i_es_shk_nr = 0;
ei->i_es_shrink_lblk = 0;
+ ei->i_es_seq = 0;
ei->i_reserved_data_blocks = 0;
spin_lock_init(&(ei->i_block_reservation_lock));
ext4_init_pending_tree(&ei->i_pending_tree);
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index 156908641e68..6f2bf9035216 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -2176,12 +2176,13 @@ DECLARE_EVENT_CLASS(ext4__es_extent,
TP_ARGS(inode, es),
TP_STRUCT__entry(
- __field( dev_t, dev )
- __field( ino_t, ino )
- __field( ext4_lblk_t, lblk )
- __field( ext4_lblk_t, len )
- __field( ext4_fsblk_t, pblk )
- __field( char, status )
+ __field(dev_t, dev)
+ __field(ino_t, ino)
+ __field(ext4_lblk_t, lblk)
+ __field(ext4_lblk_t, len)
+ __field(ext4_fsblk_t, pblk)
+ __field(char, status)
+ __field(unsigned int, seq)
),
TP_fast_assign(
@@ -2191,13 +2192,15 @@ DECLARE_EVENT_CLASS(ext4__es_extent,
__entry->len = es->es_len;
__entry->pblk = ext4_es_show_pblock(es);
__entry->status = ext4_es_status(es);
+ __entry->seq = EXT4_I(inode)->i_es_seq;
),
- TP_printk("dev %d,%d ino %lu es [%u/%u) mapped %llu status %s",
+ TP_printk("dev %d,%d ino %lu es [%u/%u) mapped %llu status %s seq %u",
MAJOR(__entry->dev), MINOR(__entry->dev),
(unsigned long) __entry->ino,
__entry->lblk, __entry->len,
- __entry->pblk, show_extent_status(__entry->status))
+ __entry->pblk, show_extent_status(__entry->status),
+ __entry->seq)
);
DEFINE_EVENT(ext4__es_extent, ext4_es_insert_extent,
@@ -2218,10 +2221,11 @@ TRACE_EVENT(ext4_es_remove_extent,
TP_ARGS(inode, lblk, len),
TP_STRUCT__entry(
- __field( dev_t, dev )
- __field( ino_t, ino )
- __field( loff_t, lblk )
- __field( loff_t, len )
+ __field(dev_t, dev)
+ __field(ino_t, ino)
+ __field(loff_t, lblk)
+ __field(loff_t, len)
+ __field(unsigned int, seq)
),
TP_fast_assign(
@@ -2229,12 +2233,13 @@ TRACE_EVENT(ext4_es_remove_extent,
__entry->ino = inode->i_ino;
__entry->lblk = lblk;
__entry->len = len;
+ __entry->seq = EXT4_I(inode)->i_es_seq;
),
- TP_printk("dev %d,%d ino %lu es [%lld/%lld)",
+ TP_printk("dev %d,%d ino %lu es [%lld/%lld) seq %u",
MAJOR(__entry->dev), MINOR(__entry->dev),
(unsigned long) __entry->ino,
- __entry->lblk, __entry->len)
+ __entry->lblk, __entry->len, __entry->seq)
);
TRACE_EVENT(ext4_es_find_extent_range_enter,
@@ -2486,14 +2491,15 @@ TRACE_EVENT(ext4_es_insert_delayed_extent,
TP_ARGS(inode, es, lclu_allocated, end_allocated),
TP_STRUCT__entry(
- __field( dev_t, dev )
- __field( ino_t, ino )
- __field( ext4_lblk_t, lblk )
- __field( ext4_lblk_t, len )
- __field( ext4_fsblk_t, pblk )
- __field( char, status )
- __field( bool, lclu_allocated )
- __field( bool, end_allocated )
+ __field(dev_t, dev)
+ __field(ino_t, ino)
+ __field(ext4_lblk_t, lblk)
+ __field(ext4_lblk_t, len)
+ __field(ext4_fsblk_t, pblk)
+ __field(char, status)
+ __field(bool, lclu_allocated)
+ __field(bool, end_allocated)
+ __field(unsigned int, seq)
),
TP_fast_assign(
@@ -2505,15 +2511,16 @@ TRACE_EVENT(ext4_es_insert_delayed_extent,
__entry->status = ext4_es_status(es);
__entry->lclu_allocated = lclu_allocated;
__entry->end_allocated = end_allocated;
+ __entry->seq = EXT4_I(inode)->i_es_seq;
),
- TP_printk("dev %d,%d ino %lu es [%u/%u) mapped %llu status %s "
- "allocated %d %d",
+ TP_printk("dev %d,%d ino %lu es [%u/%u) mapped %llu status %s allocated %d %d seq %u",
MAJOR(__entry->dev), MINOR(__entry->dev),
(unsigned long) __entry->ino,
__entry->lblk, __entry->len,
__entry->pblk, show_extent_status(__entry->status),
- __entry->lclu_allocated, __entry->end_allocated)
+ __entry->lclu_allocated, __entry->end_allocated,
+ __entry->seq)
);
/* fsmap traces */
--
2.46.1
^ permalink raw reply related [flat|nested] 59+ messages in thread
* [PATCH 13/27] ext4: add a new iomap aops for regular file's buffered IO path
2024-10-22 11:10 [PATCH 00/27] ext4: use iomap for regular file's buffered I/O path and enable large folio Zhang Yi
` (12 preceding siblings ...)
2024-10-22 11:10 ` [PATCH 12/27] ext4: introduce seq counter for the extent status entry Zhang Yi
@ 2024-10-22 11:10 ` Zhang Yi
2024-10-22 11:10 ` [PATCH 14/27] ext4: implement buffered read iomap path Zhang Yi
` (13 subsequent siblings)
27 siblings, 0 replies; 59+ messages in thread
From: Zhang Yi @ 2024-10-22 11:10 UTC (permalink / raw)
To: linux-ext4
Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack,
ritesh.list, hch, djwong, david, zokeefe, yi.zhang, yi.zhang,
chengzhihao1, yukuai3, yangerkun
From: Zhang Yi <yi.zhang@huawei.com>
This patch starts support for iomap in the buffered I/O path of ext4
regular files. First, it introduces a new iomap address space operation,
ext4_iomap_aops. Additionally, it adds an inode state flag,
EXT4_STATE_BUFFERED_IOMAP, which indicates that the inode uses the iomap
path instead of the original buffer_head path for buffered I/O. Most
callbacks of ext4_iomap_aops can directly utilize generic
implementations, the remaining functions .read_folio(), .readahead(),
and .writepages() will be implemented in later patches.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/ext4/ext4.h | 1 +
fs/ext4/inode.c | 32 ++++++++++++++++++++++++++++++++
2 files changed, 33 insertions(+)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 44f6867d3037..ee170196bfff 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1916,6 +1916,7 @@ enum {
EXT4_STATE_VERITY_IN_PROGRESS, /* building fs-verity Merkle tree */
EXT4_STATE_FC_COMMITTING, /* Fast commit ongoing */
EXT4_STATE_ORPHAN_FILE, /* Inode orphaned in orphan file */
+ EXT4_STATE_BUFFERED_IOMAP, /* Inode use iomap for buffered IO */
};
#define EXT4_INODE_BIT_FNS(name, field, offset) \
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 1ccf84a64b7b..b233f36efefa 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3526,6 +3526,22 @@ const struct iomap_ops ext4_iomap_report_ops = {
.iomap_begin = ext4_iomap_begin_report,
};
+static int ext4_iomap_read_folio(struct file *file, struct folio *folio)
+{
+ return 0;
+}
+
+static void ext4_iomap_readahead(struct readahead_control *rac)
+{
+
+}
+
+static int ext4_iomap_writepages(struct address_space *mapping,
+ struct writeback_control *wbc)
+{
+ return 0;
+}
+
/*
* For data=journal mode, folio should be marked dirty only when it was
* writeably mapped. When that happens, it was already attached to the
@@ -3612,6 +3628,20 @@ static const struct address_space_operations ext4_da_aops = {
.swap_activate = ext4_iomap_swap_activate,
};
+static const struct address_space_operations ext4_iomap_aops = {
+ .read_folio = ext4_iomap_read_folio,
+ .readahead = ext4_iomap_readahead,
+ .writepages = ext4_iomap_writepages,
+ .dirty_folio = iomap_dirty_folio,
+ .bmap = ext4_bmap,
+ .invalidate_folio = iomap_invalidate_folio,
+ .release_folio = iomap_release_folio,
+ .migrate_folio = filemap_migrate_folio,
+ .is_partially_uptodate = iomap_is_partially_uptodate,
+ .error_remove_folio = generic_error_remove_folio,
+ .swap_activate = ext4_iomap_swap_activate,
+};
+
static const struct address_space_operations ext4_dax_aops = {
.writepages = ext4_dax_writepages,
.dirty_folio = noop_dirty_folio,
@@ -3633,6 +3663,8 @@ void ext4_set_aops(struct inode *inode)
}
if (IS_DAX(inode))
inode->i_mapping->a_ops = &ext4_dax_aops;
+ else if (ext4_test_inode_state(inode, EXT4_STATE_BUFFERED_IOMAP))
+ inode->i_mapping->a_ops = &ext4_iomap_aops;
else if (test_opt(inode->i_sb, DELALLOC))
inode->i_mapping->a_ops = &ext4_da_aops;
else
--
2.46.1
^ permalink raw reply related [flat|nested] 59+ messages in thread
* [PATCH 14/27] ext4: implement buffered read iomap path
2024-10-22 11:10 [PATCH 00/27] ext4: use iomap for regular file's buffered I/O path and enable large folio Zhang Yi
` (13 preceding siblings ...)
2024-10-22 11:10 ` [PATCH 13/27] ext4: add a new iomap aops for regular file's buffered IO path Zhang Yi
@ 2024-10-22 11:10 ` Zhang Yi
2024-10-22 11:10 ` [PATCH 15/27] ext4: implement buffered write " Zhang Yi
` (12 subsequent siblings)
27 siblings, 0 replies; 59+ messages in thread
From: Zhang Yi @ 2024-10-22 11:10 UTC (permalink / raw)
To: linux-ext4
Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack,
ritesh.list, hch, djwong, david, zokeefe, yi.zhang, yi.zhang,
chengzhihao1, yukuai3, yangerkun
From: Zhang Yi <yi.zhang@huawei.com>
Introduce a new iomap_ops, ext4_iomap_buffered_read_ops to implement the
iomap read paths, specifically .read_folio() and .readahead() of
ext4_iomap_aops. This .iomap_begin() handle invokes ext4_map_blocks() to
query the extent mapping status of the read range and then converts the
mapping information to iomap.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/ext4/inode.c | 37 +++++++++++++++++++++++++++++++++++--
1 file changed, 35 insertions(+), 2 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index b233f36efefa..f0bc4b58ac4f 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3526,14 +3526,47 @@ const struct iomap_ops ext4_iomap_report_ops = {
.iomap_begin = ext4_iomap_begin_report,
};
-static int ext4_iomap_read_folio(struct file *file, struct folio *folio)
+static int ext4_iomap_buffered_read_begin(struct inode *inode, loff_t offset,
+ loff_t length, unsigned int flags, struct iomap *iomap,
+ struct iomap *srcmap)
{
+ int ret;
+ struct ext4_map_blocks map;
+ u8 blkbits = inode->i_blkbits;
+
+ if (unlikely(ext4_forced_shutdown(inode->i_sb)))
+ return -EIO;
+ if ((offset >> blkbits) > EXT4_MAX_LOGICAL_BLOCK)
+ return -EINVAL;
+ /* Inline data support is not yet available. */
+ if (WARN_ON_ONCE(ext4_has_inline_data(inode)))
+ return -ERANGE;
+
+ /* Calculate the first and last logical blocks respectively. */
+ map.m_lblk = offset >> blkbits;
+ map.m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
+ EXT4_MAX_LOGICAL_BLOCK) - map.m_lblk + 1;
+
+ ret = ext4_map_blocks(NULL, inode, &map, 0);
+ if (ret < 0)
+ return ret;
+
+ ext4_set_iomap(inode, iomap, &map, offset, length, flags);
return 0;
}
-static void ext4_iomap_readahead(struct readahead_control *rac)
+const struct iomap_ops ext4_iomap_buffered_read_ops = {
+ .iomap_begin = ext4_iomap_buffered_read_begin,
+};
+
+static int ext4_iomap_read_folio(struct file *file, struct folio *folio)
{
+ return iomap_read_folio(folio, &ext4_iomap_buffered_read_ops);
+}
+static void ext4_iomap_readahead(struct readahead_control *rac)
+{
+ iomap_readahead(rac, &ext4_iomap_buffered_read_ops);
}
static int ext4_iomap_writepages(struct address_space *mapping,
--
2.46.1
^ permalink raw reply related [flat|nested] 59+ messages in thread
* [PATCH 15/27] ext4: implement buffered write iomap path
2024-10-22 11:10 [PATCH 00/27] ext4: use iomap for regular file's buffered I/O path and enable large folio Zhang Yi
` (14 preceding siblings ...)
2024-10-22 11:10 ` [PATCH 14/27] ext4: implement buffered read iomap path Zhang Yi
@ 2024-10-22 11:10 ` Zhang Yi
2024-10-22 11:10 ` [PATCH 16/27] ext4: don't order data for inode with EXT4_STATE_BUFFERED_IOMAP Zhang Yi
` (11 subsequent siblings)
27 siblings, 0 replies; 59+ messages in thread
From: Zhang Yi @ 2024-10-22 11:10 UTC (permalink / raw)
To: linux-ext4
Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack,
ritesh.list, hch, djwong, david, zokeefe, yi.zhang, yi.zhang,
chengzhihao1, yukuai3, yangerkun
From: Zhang Yi <yi.zhang@huawei.com>
Introduce two new iomap_ops: ext4_iomap_buffered_write_ops and
ext4_iomap_buffered_da_write_ops to implement the iomap write path.
These operations invoke ext4_da_map_blocks() to map delayed allocation
extents and introduce ext4_iomap_get_blocks() to directly allocate
blocks in non-delayed allocation mode. Additionally, implement
ext4_iomap_valid() to check the validity of extent mapping.
There are two key differences between the buffer_head write path and the
iomap write path:
1) In the iomap write path, we always allocate unwritten extents for new
blocks, which means we consistently enable dioread_nolock. Therefore,
we do not need to truncate blocks for short writes and write failure.
2) The iomap write frame maps multi-blocks in the ->iomap_begin()
function, so we must remove the stale delayed allocation range from
the short writes and write failure. Otherwise, this could result in a
range of delayed extents being covered by a clean folio, leading to
inaccurate space reservation.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/ext4/ext4.h | 3 +
fs/ext4/file.c | 19 +++++-
fs/ext4/inode.c | 155 +++++++++++++++++++++++++++++++++++++++++++++---
3 files changed, 169 insertions(+), 8 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index ee170196bfff..a09f96ef17d8 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2985,6 +2985,7 @@ int ext4_walk_page_buffers(handle_t *handle,
struct buffer_head *bh));
int do_journal_get_write_access(handle_t *handle, struct inode *inode,
struct buffer_head *bh);
+int ext4_nonda_switch(struct super_block *sb);
#define FALL_BACK_TO_NONDELALLOC 1
#define CONVERT_INLINE_DATA 2
@@ -3845,6 +3846,8 @@ static inline void ext4_clear_io_unwritten_flag(ext4_io_end_t *io_end)
extern const struct iomap_ops ext4_iomap_ops;
extern const struct iomap_ops ext4_iomap_overwrite_ops;
extern const struct iomap_ops ext4_iomap_report_ops;
+extern const struct iomap_ops ext4_iomap_buffered_write_ops;
+extern const struct iomap_ops ext4_iomap_buffered_da_write_ops;
static inline int ext4_buffer_uptodate(struct buffer_head *bh)
{
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index f14aed14b9cf..92471865b4e5 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -282,6 +282,20 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
return count;
}
+static ssize_t ext4_iomap_buffered_write(struct kiocb *iocb,
+ struct iov_iter *from)
+{
+ struct inode *inode = file_inode(iocb->ki_filp);
+ const struct iomap_ops *iomap_ops;
+
+ if (test_opt(inode->i_sb, DELALLOC) && !ext4_nonda_switch(inode->i_sb))
+ iomap_ops = &ext4_iomap_buffered_da_write_ops;
+ else
+ iomap_ops = &ext4_iomap_buffered_write_ops;
+
+ return iomap_file_buffered_write(iocb, from, iomap_ops, NULL);
+}
+
static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
struct iov_iter *from)
{
@@ -296,7 +310,10 @@ static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
if (ret <= 0)
goto out;
- ret = generic_perform_write(iocb, from);
+ if (ext4_test_inode_state(inode, EXT4_STATE_BUFFERED_IOMAP))
+ ret = ext4_iomap_buffered_write(iocb, from);
+ else
+ ret = generic_perform_write(iocb, from);
out:
inode_unlock(inode);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index f0bc4b58ac4f..23cbcaab0a56 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2862,7 +2862,7 @@ static int ext4_dax_writepages(struct address_space *mapping,
return ret;
}
-static int ext4_nonda_switch(struct super_block *sb)
+int ext4_nonda_switch(struct super_block *sb)
{
s64 free_clusters, dirty_clusters;
struct ext4_sb_info *sbi = EXT4_SB(sb);
@@ -3257,6 +3257,15 @@ static bool ext4_inode_datasync_dirty(struct inode *inode)
return inode->i_state & I_DIRTY_DATASYNC;
}
+static bool ext4_iomap_valid(struct inode *inode, const struct iomap *iomap)
+{
+ return iomap->validity_cookie == READ_ONCE(EXT4_I(inode)->i_es_seq);
+}
+
+static const struct iomap_folio_ops ext4_iomap_folio_ops = {
+ .iomap_valid = ext4_iomap_valid,
+};
+
static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
struct ext4_map_blocks *map, loff_t offset,
loff_t length, unsigned int flags)
@@ -3287,6 +3296,9 @@ static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
iomap->flags |= IOMAP_F_MERGED;
+ iomap->validity_cookie = READ_ONCE(EXT4_I(inode)->i_es_seq);
+ iomap->folio_ops = &ext4_iomap_folio_ops;
+
/*
* Flags passed to ext4_map_blocks() for direct I/O writes can result
* in m_flags having both EXT4_MAP_MAPPED and EXT4_MAP_UNWRITTEN bits
@@ -3526,11 +3538,57 @@ const struct iomap_ops ext4_iomap_report_ops = {
.iomap_begin = ext4_iomap_begin_report,
};
-static int ext4_iomap_buffered_read_begin(struct inode *inode, loff_t offset,
- loff_t length, unsigned int flags, struct iomap *iomap,
- struct iomap *srcmap)
+static int ext4_iomap_get_blocks(struct inode *inode,
+ struct ext4_map_blocks *map)
{
- int ret;
+ loff_t i_size = i_size_read(inode);
+ handle_t *handle;
+ int ret, needed_blocks;
+
+ /*
+ * Check if the blocks have already been allocated, this could
+ * avoid initiating a new journal transaction and return the
+ * mapping information directly.
+ */
+ if ((map->m_lblk + map->m_len) <=
+ round_up(i_size, i_blocksize(inode)) >> inode->i_blkbits) {
+ ret = ext4_map_blocks(NULL, inode, map, 0);
+ if (ret < 0)
+ return ret;
+ if (map->m_flags & (EXT4_MAP_MAPPED | EXT4_MAP_UNWRITTEN |
+ EXT4_MAP_DELAYED))
+ return 0;
+ }
+
+ /*
+ * Reserve one block more for addition to orphan list in case
+ * we allocate blocks but write fails for some reason.
+ */
+ needed_blocks = ext4_writepage_trans_blocks(inode) + 1;
+ handle = ext4_journal_start(inode, EXT4_HT_WRITE_PAGE, needed_blocks);
+ if (IS_ERR(handle))
+ return PTR_ERR(handle);
+
+ ret = ext4_map_blocks(handle, inode, map,
+ EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT);
+ /*
+ * We need to stop handle here due to a potential deadlock caused
+ * by the subsequent call to balance_dirty_pages(). This function
+ * may wait for the dirty pages to be written back, which could
+ * initiate another handle and cause it to wait for the first
+ * handle to complete.
+ */
+ ext4_journal_stop(handle);
+
+ return ret;
+}
+
+static int ext4_iomap_buffered_begin(struct inode *inode, loff_t offset,
+ loff_t length, unsigned int flags,
+ struct iomap *iomap, struct iomap *srcmap,
+ bool delalloc)
+{
+ int ret, retries = 0;
struct ext4_map_blocks map;
u8 blkbits = inode->i_blkbits;
@@ -3541,13 +3599,23 @@ static int ext4_iomap_buffered_read_begin(struct inode *inode, loff_t offset,
/* Inline data support is not yet available. */
if (WARN_ON_ONCE(ext4_has_inline_data(inode)))
return -ERANGE;
-
+retry:
/* Calculate the first and last logical blocks respectively. */
map.m_lblk = offset >> blkbits;
map.m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
EXT4_MAX_LOGICAL_BLOCK) - map.m_lblk + 1;
+ if (flags & IOMAP_WRITE) {
+ if (delalloc)
+ ret = ext4_da_map_blocks(inode, &map);
+ else
+ ret = ext4_iomap_get_blocks(inode, &map);
- ret = ext4_map_blocks(NULL, inode, &map, 0);
+ if (ret == -ENOSPC &&
+ ext4_should_retry_alloc(inode->i_sb, &retries))
+ goto retry;
+ } else {
+ ret = ext4_map_blocks(NULL, inode, &map, 0);
+ }
if (ret < 0)
return ret;
@@ -3555,6 +3623,79 @@ static int ext4_iomap_buffered_read_begin(struct inode *inode, loff_t offset,
return 0;
}
+static int ext4_iomap_buffered_read_begin(struct inode *inode,
+ loff_t offset, loff_t length, unsigned int flags,
+ struct iomap *iomap, struct iomap *srcmap)
+{
+ return ext4_iomap_buffered_begin(inode, offset, length, flags,
+ iomap, srcmap, false);
+}
+
+static int ext4_iomap_buffered_write_begin(struct inode *inode,
+ loff_t offset, loff_t length, unsigned int flags,
+ struct iomap *iomap, struct iomap *srcmap)
+{
+ return ext4_iomap_buffered_begin(inode, offset, length, flags,
+ iomap, srcmap, false);
+}
+
+static int ext4_iomap_buffered_da_write_begin(struct inode *inode,
+ loff_t offset, loff_t length, unsigned int flags,
+ struct iomap *iomap, struct iomap *srcmap)
+{
+ return ext4_iomap_buffered_begin(inode, offset, length, flags,
+ iomap, srcmap, true);
+}
+
+/*
+ * Drop the staled delayed allocation range from the write failure,
+ * including both start and end blocks. If not, we could leave a range
+ * of delayed extents covered by a clean folio, it could lead to
+ * inaccurate space reservation.
+ */
+static void ext4_iomap_punch_delalloc(struct inode *inode, loff_t offset,
+ loff_t length, struct iomap *iomap)
+{
+ down_write(&EXT4_I(inode)->i_data_sem);
+ ext4_es_remove_extent(inode, offset >> inode->i_blkbits,
+ DIV_ROUND_UP_ULL(length, EXT4_BLOCK_SIZE(inode->i_sb)));
+ up_write(&EXT4_I(inode)->i_data_sem);
+}
+
+static int ext4_iomap_buffered_da_write_end(struct inode *inode, loff_t offset,
+ loff_t length, ssize_t written,
+ unsigned int flags,
+ struct iomap *iomap)
+{
+ loff_t start_byte, end_byte;
+
+ /* If we didn't reserve the blocks, we're not allowed to punch them. */
+ if (iomap->type != IOMAP_DELALLOC || !(iomap->flags & IOMAP_F_NEW))
+ return 0;
+
+ /* Nothing to do if we've written the entire delalloc extent */
+ start_byte = iomap_last_written_block(inode, offset, written);
+ end_byte = round_up(offset + length, i_blocksize(inode));
+ if (start_byte >= end_byte)
+ return 0;
+
+ filemap_invalidate_lock(inode->i_mapping);
+ iomap_write_delalloc_release(inode, start_byte, end_byte, flags,
+ iomap, ext4_iomap_punch_delalloc);
+ filemap_invalidate_unlock(inode->i_mapping);
+ return 0;
+}
+
+
+const struct iomap_ops ext4_iomap_buffered_write_ops = {
+ .iomap_begin = ext4_iomap_buffered_write_begin,
+};
+
+const struct iomap_ops ext4_iomap_buffered_da_write_ops = {
+ .iomap_begin = ext4_iomap_buffered_da_write_begin,
+ .iomap_end = ext4_iomap_buffered_da_write_end,
+};
+
const struct iomap_ops ext4_iomap_buffered_read_ops = {
.iomap_begin = ext4_iomap_buffered_read_begin,
};
--
2.46.1
^ permalink raw reply related [flat|nested] 59+ messages in thread
* [PATCH 16/27] ext4: don't order data for inode with EXT4_STATE_BUFFERED_IOMAP
2024-10-22 11:10 [PATCH 00/27] ext4: use iomap for regular file's buffered I/O path and enable large folio Zhang Yi
` (15 preceding siblings ...)
2024-10-22 11:10 ` [PATCH 15/27] ext4: implement buffered write " Zhang Yi
@ 2024-10-22 11:10 ` Zhang Yi
2024-10-22 11:10 ` [PATCH 17/27] ext4: implement writeback iomap path Zhang Yi
` (10 subsequent siblings)
27 siblings, 0 replies; 59+ messages in thread
From: Zhang Yi @ 2024-10-22 11:10 UTC (permalink / raw)
To: linux-ext4
Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack,
ritesh.list, hch, djwong, david, zokeefe, yi.zhang, yi.zhang,
chengzhihao1, yukuai3, yangerkun
From: Zhang Yi <yi.zhang@huawei.com>
In the iomap buffered I/O path, there is no risk of exposing stale data
because we always allocate unwritten extents for new allocated blocks,
the extent changes to written only when the I/O is completed. Therefore,
we do not need to order data in this mode.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/ext4/ext4_jbd2.h | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index 0c77697d5e90..9dca10027032 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -467,6 +467,14 @@ static inline int ext4_should_journal_data(struct inode *inode)
static inline int ext4_should_order_data(struct inode *inode)
{
+ /*
+ * There is no need to order data for inodes with iomap buffered I/O
+ * path since it always allocate unwritten extents for new allocated
+ * blocks and have no risk of stale data.
+ */
+ if (ext4_test_inode_state(inode, EXT4_STATE_BUFFERED_IOMAP))
+ return 0;
+
return ext4_inode_journal_mode(inode) & EXT4_INODE_ORDERED_DATA_MODE;
}
--
2.46.1
^ permalink raw reply related [flat|nested] 59+ messages in thread
* [PATCH 17/27] ext4: implement writeback iomap path
2024-10-22 11:10 [PATCH 00/27] ext4: use iomap for regular file's buffered I/O path and enable large folio Zhang Yi
` (16 preceding siblings ...)
2024-10-22 11:10 ` [PATCH 16/27] ext4: don't order data for inode with EXT4_STATE_BUFFERED_IOMAP Zhang Yi
@ 2024-10-22 11:10 ` Zhang Yi
2024-10-22 11:10 ` [PATCH 18/27] ext4: implement mmap " Zhang Yi
` (9 subsequent siblings)
27 siblings, 0 replies; 59+ messages in thread
From: Zhang Yi @ 2024-10-22 11:10 UTC (permalink / raw)
To: linux-ext4
Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack,
ritesh.list, hch, djwong, david, zokeefe, yi.zhang, yi.zhang,
chengzhihao1, yukuai3, yangerkun
From: Zhang Yi <yi.zhang@huawei.com>
Implement ext4_iomap_writepages(), introduce ext4_writeback_ops, and
create an end I/O extent conversion worker to implement the iomap
buffered write-back path. In the map_blocks() handler, we first query
the longest range of existing mapped extents. If the block range has not
already been allocated, we attempt to allocate a range of blocks that is
as long as possible to minimize the number of block mappings. This
allocation is based on the write-back length and the delalloc extent
length, rather than allocating for a single folio at a time. In the
->prepare_ioend() handler, we register the end I/O worker to convert
unwritten extents into written extents.
There are three key differences between the buffer_head write-back path
and the iomap write-back path:
1) Since we aim to allocate a range of blocks as long as possible within
the writeback length for each invocation of ->map_blocks(), we may
allocate a long range but write less in certain corner cases.
Therefore, we cannot convert the extent to written in advance within
->map_blocks(). Fortunately, there is minimal risk of losing data
between split extents during the write-back and the end I/O process.
We defer this action to the end I/O worker, where we can accurately
determine the actual written length. Besides, we should remove the
warning in ext4_convert_unwritten_extents_endio().
2) Since we do not order data, the journal thread is not required to
write back data. Besides, we also do not need to use the reserve
handle when converting the unwritten extent in the end I/O worker, we
can start normal handle directly.
3) We can also delay updating the i_disksize until the end of the I/O,
which could prevent the exposure of zero data that may occur during a
system crash while performing buffer append writes in the buffer_head
buffered write path.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/ext4/ext4.h | 4 +
fs/ext4/extents.c | 22 +++---
fs/ext4/inode.c | 188 +++++++++++++++++++++++++++++++++++++++++++++-
fs/ext4/page-io.c | 105 ++++++++++++++++++++++++++
fs/ext4/super.c | 2 +
5 files changed, 311 insertions(+), 10 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index a09f96ef17d8..d4d594d97634 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1151,6 +1151,8 @@ struct ext4_inode_info {
*/
struct list_head i_rsv_conversion_list;
struct work_struct i_rsv_conversion_work;
+ struct list_head i_iomap_ioend_list;
+ struct work_struct i_iomap_ioend_work;
spinlock_t i_block_reservation_lock;
@@ -3773,6 +3775,8 @@ int ext4_bio_write_folio(struct ext4_io_submit *io, struct folio *page,
size_t len);
extern struct ext4_io_end_vec *ext4_alloc_io_end_vec(ext4_io_end_t *io_end);
extern struct ext4_io_end_vec *ext4_last_io_end_vec(ext4_io_end_t *io_end);
+extern void ext4_iomap_end_io(struct work_struct *work);
+extern void ext4_iomap_end_bio(struct bio *bio);
/* mmp.c */
extern int ext4_multi_mount_protect(struct super_block *, ext4_fsblk_t);
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 33bc2cc5aff4..4b30e6f0a634 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -3760,20 +3760,24 @@ ext4_convert_unwritten_extents_endio(handle_t *handle, struct inode *inode,
ext_debug(inode, "logical block %llu, max_blocks %u\n",
(unsigned long long)ee_block, ee_len);
- /* If extent is larger than requested it is a clear sign that we still
- * have some extent state machine issues left. So extent_split is still
- * required.
- * TODO: Once all related issues will be fixed this situation should be
- * illegal.
+ /*
+ * If the extent is larger than requested, we should split it here.
+ * For inodes using the iomap buffered I/O path, we do not split in
+ * advance during the write-back process. Therefore, we may need to
+ * perform the split during the end I/O process here. However,
+ * other inodes should not require this action.
*/
if (ee_block != map->m_lblk || ee_len > map->m_len) {
int flags = EXT4_GET_BLOCKS_CONVERT |
EXT4_GET_BLOCKS_METADATA_NOFAIL;
#ifdef CONFIG_EXT4_DEBUG
- ext4_warning(inode->i_sb, "Inode (%ld) finished: extent logical block %llu,"
- " len %u; IO logical block %llu, len %u",
- inode->i_ino, (unsigned long long)ee_block, ee_len,
- (unsigned long long)map->m_lblk, map->m_len);
+ if (!ext4_test_inode_state(inode, EXT4_STATE_BUFFERED_IOMAP)) {
+ ext4_warning(inode->i_sb,
+ "Inode (%ld) finished: extent logical block %llu, len %u; IO logical block %llu, len %u",
+ inode->i_ino, (unsigned long long)ee_block,
+ ee_len, (unsigned long long)map->m_lblk,
+ map->m_len);
+ }
#endif
path = ext4_split_convert_extents(handle, inode, map, path,
flags, NULL);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 23cbcaab0a56..a260942fd2dd 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -44,6 +44,7 @@
#include <linux/iversion.h>
#include "ext4_jbd2.h"
+#include "ext4_extents.h"
#include "xattr.h"
#include "acl.h"
#include "truncate.h"
@@ -3710,10 +3711,195 @@ static void ext4_iomap_readahead(struct readahead_control *rac)
iomap_readahead(rac, &ext4_iomap_buffered_read_ops);
}
+struct ext4_writeback_ctx {
+ struct iomap_writepage_ctx ctx;
+ struct writeback_control *wbc;
+ unsigned int data_seq;
+};
+
+static int ext4_iomap_map_one_extent(struct inode *inode,
+ struct ext4_map_blocks *map)
+{
+ struct extent_status es;
+ handle_t *handle = NULL;
+ int credits, map_flags;
+ int retval;
+
+ credits = ext4_da_writepages_trans_blocks(inode);
+ handle = ext4_journal_start(inode, EXT4_HT_WRITE_PAGE, credits);
+ if (IS_ERR(handle))
+ return PTR_ERR(handle);
+
+ map->m_flags = 0;
+ /*
+ * It is necessary to look up extent and map blocks under i_data_sem
+ * in write mode, otherwise, the delalloc extent may become stale
+ * during concurrent truncate operations.
+ */
+ down_write(&EXT4_I(inode)->i_data_sem);
+ if (likely(ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es))) {
+ retval = es.es_len - (map->m_lblk - es.es_lblk);
+ map->m_len = min_t(unsigned int, retval, map->m_len);
+
+ if (ext4_es_is_delayed(&es)) {
+ map->m_flags |= EXT4_MAP_DELAYED;
+ trace_ext4_da_write_pages_extent(inode, map);
+ /*
+ * Call ext4_map_create_blocks() to allocate any
+ * delayed allocation blocks. It is possible that
+ * we're going to need more metadata blocks, however
+ * we must not fail because we're in writeback and
+ * there is nothing we can do so it might result in
+ * data loss. So use reserved blocks to allocate
+ * metadata if possible.
+ */
+ map_flags = EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT |
+ EXT4_GET_BLOCKS_METADATA_NOFAIL;
+
+ retval = ext4_map_create_blocks(handle, inode, map,
+ map_flags);
+ goto out;
+ }
+ if (unlikely(ext4_es_is_hole(&es)))
+ goto out;
+
+ /* Found written or unwritten extent. */
+ map->m_pblk = ext4_es_pblock(&es) + map->m_lblk -
+ es.es_lblk;
+ map->m_flags = ext4_es_is_written(&es) ?
+ EXT4_MAP_MAPPED : EXT4_MAP_UNWRITTEN;
+ goto out;
+ }
+
+ retval = ext4_map_query_blocks(handle, inode, map);
+out:
+ up_write(&EXT4_I(inode)->i_data_sem);
+ ext4_journal_stop(handle);
+ return retval < 0 ? retval : 0;
+}
+
+static int ext4_iomap_map_blocks(struct iomap_writepage_ctx *wpc,
+ struct inode *inode, loff_t offset,
+ unsigned int dirty_len)
+{
+ struct ext4_writeback_ctx *ewpc =
+ container_of(wpc, struct ext4_writeback_ctx, ctx);
+ struct super_block *sb = inode->i_sb;
+ struct journal_s *journal = EXT4_SB(sb)->s_journal;
+ struct ext4_inode_info *ei = EXT4_I(inode);
+ struct ext4_map_blocks map;
+ unsigned int blkbits = inode->i_blkbits;
+ unsigned int index = offset >> blkbits;
+ unsigned int end, len;
+ int ret;
+
+ if (unlikely(ext4_forced_shutdown(inode->i_sb)))
+ return -EIO;
+
+ /* Check validity of the cached writeback mapping. */
+ if (offset >= wpc->iomap.offset &&
+ offset < wpc->iomap.offset + wpc->iomap.length &&
+ ewpc->data_seq == READ_ONCE(ei->i_es_seq))
+ return 0;
+
+ end = min_t(unsigned int, (ewpc->wbc->range_end >> blkbits),
+ (UINT_MAX - 1));
+ len = (end > index + dirty_len) ? end - index + 1 : dirty_len;
+
+retry:
+ map.m_lblk = index;
+ map.m_len = min_t(unsigned int, MAX_WRITEPAGES_EXTENT_LEN, len);
+ ret = ext4_map_blocks(NULL, inode, &map, 0);
+ if (ret < 0)
+ return ret;
+
+ /*
+ * The map is not a delalloc extent; it must either be a hole
+ * or an extent that has already been allocated.
+ */
+ if (!(map.m_flags & EXT4_MAP_DELAYED))
+ goto out;
+
+ /* Map one delalloc extent. */
+ ret = ext4_iomap_map_one_extent(inode, &map);
+ if (ret < 0) {
+ if (ext4_forced_shutdown(sb))
+ return ret;
+
+ /*
+ * Retry transient ENOSPC errors, if
+ * ext4_count_free_blocks() is non-zero, a commit
+ * should free up blocks.
+ */
+ if (ret == -ENOSPC && journal && ext4_count_free_clusters(sb)) {
+ jbd2_journal_force_commit_nested(journal);
+ goto retry;
+ }
+
+ ext4_msg(sb, KERN_CRIT,
+ "Delayed block allocation failed for inode %lu at logical offset %llu with max blocks %u with error %d",
+ inode->i_ino, (unsigned long long)map.m_lblk,
+ (unsigned int)map.m_len, -ret);
+ ext4_msg(sb, KERN_CRIT,
+ "This should not happen!! Data will be lost\n");
+ if (ret == -ENOSPC)
+ ext4_print_free_blocks(inode);
+ return ret;
+ }
+out:
+ ewpc->data_seq = READ_ONCE(ei->i_es_seq);
+ ext4_set_iomap(inode, &wpc->iomap, &map, offset,
+ map.m_len << blkbits, 0);
+ return 0;
+}
+
+static int ext4_iomap_prepare_ioend(struct iomap_ioend *ioend, int status)
+{
+ struct ext4_inode_info *ei = EXT4_I(ioend->io_inode);
+
+ /* Need to convert unwritten extents when I/Os are completed. */
+ if (ioend->io_type == IOMAP_UNWRITTEN ||
+ ioend->io_offset + ioend->io_size > READ_ONCE(ei->i_disksize))
+ ioend->io_bio.bi_end_io = ext4_iomap_end_bio;
+
+ return status;
+}
+
+static void ext4_iomap_discard_folio(struct folio *folio, loff_t pos)
+{
+ struct inode *inode = folio->mapping->host;
+ loff_t length = folio_pos(folio) + folio_size(folio) - pos;
+
+ ext4_iomap_punch_delalloc(inode, pos, length, NULL);
+}
+
+static const struct iomap_writeback_ops ext4_writeback_ops = {
+ .map_blocks = ext4_iomap_map_blocks,
+ .prepare_ioend = ext4_iomap_prepare_ioend,
+ .discard_folio = ext4_iomap_discard_folio,
+};
+
static int ext4_iomap_writepages(struct address_space *mapping,
struct writeback_control *wbc)
{
- return 0;
+ struct inode *inode = mapping->host;
+ struct super_block *sb = inode->i_sb;
+ long nr = wbc->nr_to_write;
+ int alloc_ctx, ret;
+ struct ext4_writeback_ctx ewpc = {
+ .wbc = wbc,
+ };
+
+ if (unlikely(ext4_forced_shutdown(sb)))
+ return -EIO;
+
+ alloc_ctx = ext4_writepages_down_read(sb);
+ trace_ext4_writepages(inode, wbc);
+ ret = iomap_writepages(mapping, wbc, &ewpc.ctx, &ext4_writeback_ops);
+ trace_ext4_writepages_result(inode, wbc, ret, nr - wbc->nr_to_write);
+ ext4_writepages_up_read(sb, alloc_ctx);
+
+ return ret;
}
/*
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index ad5543866d21..659ee0fb7cea 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -22,6 +22,7 @@
#include <linux/bio.h>
#include <linux/workqueue.h>
#include <linux/kernel.h>
+#include <linux/iomap.h>
#include <linux/slab.h>
#include <linux/mm.h>
#include <linux/sched/mm.h>
@@ -562,3 +563,107 @@ int ext4_bio_write_folio(struct ext4_io_submit *io, struct folio *folio,
return 0;
}
+
+static void ext4_iomap_finish_ioend(struct iomap_ioend *ioend)
+{
+ struct inode *inode = ioend->io_inode;
+ struct ext4_inode_info *ei = EXT4_I(inode);
+ loff_t pos = ioend->io_offset;
+ size_t size = ioend->io_size;
+ loff_t new_disksize;
+ handle_t *handle;
+ int credits;
+ int ret, err;
+
+ ret = blk_status_to_errno(ioend->io_bio.bi_status);
+ if (unlikely(ret))
+ goto out;
+
+ /*
+ * We may need to convert up to one extent per block in
+ * the page and we may dirty the inode.
+ */
+ credits = ext4_chunk_trans_blocks(inode,
+ EXT4_MAX_BLOCKS(size, pos, inode->i_blkbits));
+ handle = ext4_journal_start(inode, EXT4_HT_EXT_CONVERT, credits);
+ if (IS_ERR(handle)) {
+ ret = PTR_ERR(handle);
+ goto out_err;
+ }
+
+ if (ioend->io_type == IOMAP_UNWRITTEN) {
+ ret = ext4_convert_unwritten_extents(handle, inode, pos, size);
+ if (ret)
+ goto out_journal;
+ }
+
+ /*
+ * Update on-disk size after IO is completed. Races with
+ * truncate are avoided by checking i_size under i_data_sem.
+ */
+ new_disksize = pos + size;
+ if (new_disksize > READ_ONCE(ei->i_disksize)) {
+ down_write(&ei->i_data_sem);
+ new_disksize = min(new_disksize, i_size_read(inode));
+ if (new_disksize > ei->i_disksize)
+ ei->i_disksize = new_disksize;
+ up_write(&ei->i_data_sem);
+ ret = ext4_mark_inode_dirty(handle, inode);
+ if (ret)
+ EXT4_ERROR_INODE_ERR(inode, -ret,
+ "Failed to mark inode dirty");
+ }
+
+out_journal:
+ err = ext4_journal_stop(handle);
+ if (!ret)
+ ret = err;
+out_err:
+ if (ret < 0 && !ext4_forced_shutdown(inode->i_sb)) {
+ ext4_msg(inode->i_sb, KERN_EMERG,
+ "failed to convert unwritten extents to written extents or update inode size -- potential data loss! (inode %lu, error %d)",
+ inode->i_ino, ret);
+ }
+out:
+ iomap_finish_ioends(ioend, ret);
+}
+
+/*
+ * Work on buffered iomap completed IO, to convert unwritten extents to
+ * mapped extents
+ */
+void ext4_iomap_end_io(struct work_struct *work)
+{
+ struct ext4_inode_info *ei = container_of(work, struct ext4_inode_info,
+ i_iomap_ioend_work);
+ struct iomap_ioend *ioend;
+ struct list_head ioend_list;
+ unsigned long flags;
+
+ spin_lock_irqsave(&ei->i_completed_io_lock, flags);
+ list_replace_init(&ei->i_iomap_ioend_list, &ioend_list);
+ spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
+
+ iomap_sort_ioends(&ioend_list);
+ while (!list_empty(&ioend_list)) {
+ ioend = list_entry(ioend_list.next, struct iomap_ioend, io_list);
+ list_del_init(&ioend->io_list);
+ iomap_ioend_try_merge(ioend, &ioend_list);
+ ext4_iomap_finish_ioend(ioend);
+ }
+}
+
+void ext4_iomap_end_bio(struct bio *bio)
+{
+ struct iomap_ioend *ioend = iomap_ioend_from_bio(bio);
+ struct ext4_inode_info *ei = EXT4_I(ioend->io_inode);
+ struct ext4_sb_info *sbi = EXT4_SB(ioend->io_inode->i_sb);
+ unsigned long flags;
+
+ /* Only reserved conversions from writeback should enter here */
+ spin_lock_irqsave(&ei->i_completed_io_lock, flags);
+ if (list_empty(&ei->i_iomap_ioend_list))
+ queue_work(sbi->rsv_conversion_wq, &ei->i_iomap_ioend_work);
+ list_add_tail(&ioend->io_list, &ei->i_iomap_ioend_list);
+ spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
+}
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index a01e0bbe57c8..56baadec27e0 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1419,11 +1419,13 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
#endif
ei->jinode = NULL;
INIT_LIST_HEAD(&ei->i_rsv_conversion_list);
+ INIT_LIST_HEAD(&ei->i_iomap_ioend_list);
spin_lock_init(&ei->i_completed_io_lock);
ei->i_sync_tid = 0;
ei->i_datasync_tid = 0;
atomic_set(&ei->i_unwritten, 0);
INIT_WORK(&ei->i_rsv_conversion_work, ext4_end_io_rsv_work);
+ INIT_WORK(&ei->i_iomap_ioend_work, ext4_iomap_end_io);
ext4_fc_init_inode(&ei->vfs_inode);
mutex_init(&ei->i_fc_lock);
return &ei->vfs_inode;
--
2.46.1
^ permalink raw reply related [flat|nested] 59+ messages in thread
* [PATCH 18/27] ext4: implement mmap iomap path
2024-10-22 11:10 [PATCH 00/27] ext4: use iomap for regular file's buffered I/O path and enable large folio Zhang Yi
` (17 preceding siblings ...)
2024-10-22 11:10 ` [PATCH 17/27] ext4: implement writeback iomap path Zhang Yi
@ 2024-10-22 11:10 ` Zhang Yi
2024-10-22 11:10 ` [PATCH 19/27] ext4: do not always order data when partial zeroing out a block Zhang Yi
` (8 subsequent siblings)
27 siblings, 0 replies; 59+ messages in thread
From: Zhang Yi @ 2024-10-22 11:10 UTC (permalink / raw)
To: linux-ext4
Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack,
ritesh.list, hch, djwong, david, zokeefe, yi.zhang, yi.zhang,
chengzhihao1, yukuai3, yangerkun
From: Zhang Yi <yi.zhang@huawei.com>
Introduce ext4_iomap_page_mkwrite() to implement the mmap iomap path. It
invokes iomap_page_mkwrite() and passes
ext4_iomap_buffered_[da_]write_ops to dirty the folio and map blocks.
Almost all other work is handled by iomap_page_mkwrite().
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/ext4/inode.c | 22 ++++++++++++++++++++++
1 file changed, 22 insertions(+)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index a260942fd2dd..0a9b73534257 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -6478,6 +6478,23 @@ static int ext4_bh_unmapped(handle_t *handle, struct inode *inode,
return !buffer_mapped(bh);
}
+static vm_fault_t ext4_iomap_page_mkwrite(struct vm_fault *vmf)
+{
+ struct inode *inode = file_inode(vmf->vma->vm_file);
+ const struct iomap_ops *iomap_ops;
+
+ /*
+ * ext4_nonda_switch() could writeback this folio, so have to
+ * call it before lock folio.
+ */
+ if (test_opt(inode->i_sb, DELALLOC) && !ext4_nonda_switch(inode->i_sb))
+ iomap_ops = &ext4_iomap_buffered_da_write_ops;
+ else
+ iomap_ops = &ext4_iomap_buffered_write_ops;
+
+ return iomap_page_mkwrite(vmf, iomap_ops);
+}
+
vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
{
struct vm_area_struct *vma = vmf->vma;
@@ -6501,6 +6518,11 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
filemap_invalidate_lock_shared(mapping);
+ if (ext4_test_inode_state(inode, EXT4_STATE_BUFFERED_IOMAP)) {
+ ret = ext4_iomap_page_mkwrite(vmf);
+ goto out;
+ }
+
err = ext4_convert_inline_data(inode);
if (err)
goto out_ret;
--
2.46.1
^ permalink raw reply related [flat|nested] 59+ messages in thread
* [PATCH 19/27] ext4: do not always order data when partial zeroing out a block
2024-10-22 11:10 [PATCH 00/27] ext4: use iomap for regular file's buffered I/O path and enable large folio Zhang Yi
` (18 preceding siblings ...)
2024-10-22 11:10 ` [PATCH 18/27] ext4: implement mmap " Zhang Yi
@ 2024-10-22 11:10 ` Zhang Yi
2024-10-22 11:10 ` [PATCH 20/27] ext4: do not start handle if unnecessary while " Zhang Yi
` (7 subsequent siblings)
27 siblings, 0 replies; 59+ messages in thread
From: Zhang Yi @ 2024-10-22 11:10 UTC (permalink / raw)
To: linux-ext4
Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack,
ritesh.list, hch, djwong, david, zokeefe, yi.zhang, yi.zhang,
chengzhihao1, yukuai3, yangerkun
From: Zhang Yi <yi.zhang@huawei.com>
When zeroing out a partial block during a partial truncate, zeroing
range, or punching a hole, it is essential to order the data only during
the partial truncate, because only that case risks exposing stale data.
Consider a scenario in which a crash occurs just after the i_disksize
transaction has been submitted but before the zeroed data is written
out. In this case, the tail block will retain stale data, which could be
exposed by a subsequent expanding truncate operation. Partial zeroing
range and punching hole do not have this risk. Therefore, move the
ext4_jbd2_inode_add_write() call out to ext4_truncate() and order data
only for the partial truncate.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/ext4/inode.c | 50 +++++++++++++++++++++++++++++++++++++------------
1 file changed, 38 insertions(+), 12 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 0a9b73534257..97be75cde481 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4038,7 +4038,9 @@ void ext4_set_aops(struct inode *inode)
* racing writeback can come later and flush the stale pagecache to disk.
*/
static int __ext4_block_zero_page_range(handle_t *handle,
- struct address_space *mapping, loff_t from, loff_t length)
+ struct address_space *mapping,
+ loff_t from, loff_t length,
+ bool *did_zero)
{
ext4_fsblk_t index = from >> PAGE_SHIFT;
unsigned offset = from & (PAGE_SIZE-1);
@@ -4116,14 +4118,16 @@ static int __ext4_block_zero_page_range(handle_t *handle,
if (ext4_should_journal_data(inode)) {
err = ext4_dirty_journalled_data(handle, bh);
+ if (err)
+ goto unlock;
} else {
err = 0;
mark_buffer_dirty(bh);
- if (ext4_should_order_data(inode))
- err = ext4_jbd2_inode_add_write(handle, inode, from,
- length);
}
+ if (did_zero)
+ *did_zero = true;
+
unlock:
folio_unlock(folio);
folio_put(folio);
@@ -4138,7 +4142,9 @@ static int __ext4_block_zero_page_range(handle_t *handle,
* that corresponds to 'from'
*/
static int ext4_block_zero_page_range(handle_t *handle,
- struct address_space *mapping, loff_t from, loff_t length)
+ struct address_space *mapping,
+ loff_t from, loff_t length,
+ bool *did_zero)
{
struct inode *inode = mapping->host;
unsigned offset = from & (PAGE_SIZE-1);
@@ -4156,7 +4162,8 @@ static int ext4_block_zero_page_range(handle_t *handle,
return dax_zero_range(inode, from, length, NULL,
&ext4_iomap_ops);
}
- return __ext4_block_zero_page_range(handle, mapping, from, length);
+ return __ext4_block_zero_page_range(handle, mapping, from, length,
+ did_zero);
}
/*
@@ -4166,12 +4173,15 @@ static int ext4_block_zero_page_range(handle_t *handle,
* of that block so it doesn't yield old data if the file is later grown.
*/
static int ext4_block_truncate_page(handle_t *handle,
- struct address_space *mapping, loff_t from)
+ struct address_space *mapping, loff_t from,
+ loff_t *zero_len)
{
unsigned offset = from & (PAGE_SIZE-1);
unsigned length;
unsigned blocksize;
struct inode *inode = mapping->host;
+ bool did_zero = false;
+ int ret;
/* If we are processing an encrypted inode during orphan list handling */
if (IS_ENCRYPTED(inode) && !fscrypt_has_encryption_key(inode))
@@ -4180,7 +4190,13 @@ static int ext4_block_truncate_page(handle_t *handle,
blocksize = inode->i_sb->s_blocksize;
length = blocksize - (offset & (blocksize - 1));
- return ext4_block_zero_page_range(handle, mapping, from, length);
+ ret = ext4_block_zero_page_range(handle, mapping, from, length,
+ &did_zero);
+ if (ret)
+ return ret;
+
+ *zero_len = length;
+ return 0;
}
int ext4_zero_partial_blocks(handle_t *handle, struct inode *inode,
@@ -4203,13 +4219,14 @@ int ext4_zero_partial_blocks(handle_t *handle, struct inode *inode,
if (start == end &&
(partial_start || (partial_end != sb->s_blocksize - 1))) {
err = ext4_block_zero_page_range(handle, mapping,
- lstart, length);
+ lstart, length, NULL);
return err;
}
/* Handle partial zero out on the start of the range */
if (partial_start) {
err = ext4_block_zero_page_range(handle, mapping,
- lstart, sb->s_blocksize);
+ lstart, sb->s_blocksize,
+ NULL);
if (err)
return err;
}
@@ -4217,7 +4234,7 @@ int ext4_zero_partial_blocks(handle_t *handle, struct inode *inode,
if (partial_end != sb->s_blocksize - 1)
err = ext4_block_zero_page_range(handle, mapping,
byte_end - partial_end,
- partial_end + 1);
+ partial_end + 1, NULL);
return err;
}
@@ -4517,6 +4534,7 @@ int ext4_truncate(struct inode *inode)
int err = 0, err2;
handle_t *handle;
struct address_space *mapping = inode->i_mapping;
+ loff_t zero_len = 0;
/*
* There is a possibility that we're either freeing the inode
@@ -4560,7 +4578,15 @@ int ext4_truncate(struct inode *inode)
}
if (inode->i_size & (inode->i_sb->s_blocksize - 1))
- ext4_block_truncate_page(handle, mapping, inode->i_size);
+ ext4_block_truncate_page(handle, mapping, inode->i_size,
+ &zero_len);
+
+ if (zero_len && ext4_should_order_data(inode)) {
+ err = ext4_jbd2_inode_add_write(handle, inode, inode->i_size,
+ zero_len);
+ if (err)
+ goto out_stop;
+ }
/*
* We add the inode to the orphan list, so that if this
--
2.46.1
^ permalink raw reply related [flat|nested] 59+ messages in thread
* [PATCH 20/27] ext4: do not start handle if unnecessary while partial zeroing out a block
2024-10-22 11:10 [PATCH 00/27] ext4: use iomap for regular file's buffered I/O path and enable large folio Zhang Yi
` (19 preceding siblings ...)
2024-10-22 11:10 ` [PATCH 19/27] ext4: do not always order data when partial zeroing out a block Zhang Yi
@ 2024-10-22 11:10 ` Zhang Yi
2024-10-22 11:10 ` [PATCH 21/27] ext4: implement zero_range iomap path Zhang Yi
` (6 subsequent siblings)
27 siblings, 0 replies; 59+ messages in thread
From: Zhang Yi @ 2024-10-22 11:10 UTC (permalink / raw)
To: linux-ext4
Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack,
ritesh.list, hch, djwong, david, zokeefe, yi.zhang, yi.zhang,
chengzhihao1, yukuai3, yangerkun
From: Zhang Yi <yi.zhang@huawei.com>
When zeroing out a partial block in __ext4_block_zero_page_range()
during a partial truncate, zeroing range, or punching a hole, we only
need to start a handle in data=journal mode, since only that mode must
log the zeroed data block; the other modes do not need the handle.
Therefore, start the handle inside ext4_block_zero_page_range() and
avoid performing the zeroing under a running handle in data=ordered or
writeback mode.
This change is essential for the conversion to iomap buffered I/O, as
it helps prevent a potential deadlock. After we switch to using
iomap_zero_range() to zero out a partial block in the later patches,
iomap_zero_range() may write out dirty folios and wait for I/O to
complete before zeroing. However, we cannot wait for I/O to complete
under a running handle, because the end-I/O process may itself wait for
this handle to stop if the running transaction has begun to commit or
the journal is running out of space.
Therefore, postpone starting the handle in the partial truncation,
zeroing range, and hole punching paths, in preparation for the buffered
write iomap conversion.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/ext4/ext4.h | 4 +--
fs/ext4/extents.c | 22 ++++++---------
fs/ext4/inode.c | 70 +++++++++++++++++++++++++----------------------
3 files changed, 47 insertions(+), 49 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index d4d594d97634..e1b7f7024f07 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -3034,8 +3034,8 @@ extern void ext4_set_aops(struct inode *inode);
extern int ext4_writepage_trans_blocks(struct inode *);
extern int ext4_normal_submit_inode_data_buffers(struct jbd2_inode *jinode);
extern int ext4_chunk_trans_blocks(struct inode *, int nrblocks);
-extern int ext4_zero_partial_blocks(handle_t *handle, struct inode *inode,
- loff_t lstart, loff_t lend);
+extern int ext4_zero_partial_blocks(struct inode *inode,
+ loff_t lstart, loff_t lend);
extern vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf);
extern qsize_t *ext4_get_reserved_space(struct inode *inode);
extern int ext4_get_projid(struct inode *inode, kprojid_t *projid);
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 4b30e6f0a634..20e56cd17847 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4576,7 +4576,7 @@ static long ext4_zero_range(struct file *file, loff_t offset,
ext4_lblk_t start_lblk, end_lblk;
unsigned int blocksize = i_blocksize(inode);
unsigned int blkbits = inode->i_blkbits;
- int ret, flags, credits;
+ int ret, flags;
trace_ext4_zero_range(inode, offset, len, mode);
WARN_ON_ONCE(!inode_is_locked(inode));
@@ -4638,27 +4638,21 @@ static long ext4_zero_range(struct file *file, loff_t offset,
if (!(offset & (blocksize - 1)) && !(end & (blocksize - 1)))
return ret;
- /*
- * In worst case we have to writeout two nonadjacent unwritten
- * blocks and update the inode
- */
- credits = (2 * ext4_ext_index_trans_blocks(inode, 2)) + 1;
- if (ext4_should_journal_data(inode))
- credits += 2;
- handle = ext4_journal_start(inode, EXT4_HT_MISC, credits);
+ /* Zero out partial block at the edges of the range */
+ ret = ext4_zero_partial_blocks(inode, offset, len);
+ if (ret)
+ return ret;
+
+ handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
if (IS_ERR(handle)) {
ret = PTR_ERR(handle);
ext4_std_error(inode->i_sb, ret);
return ret;
}
- /* Zero out partial block at the edges of the range */
- ret = ext4_zero_partial_blocks(handle, inode, offset, len);
- if (ret)
- goto out_handle;
-
if (new_size)
ext4_update_inode_size(inode, new_size);
+
ret = ext4_mark_inode_dirty(handle, inode);
if (unlikely(ret))
goto out_handle;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 97be75cde481..34701afe61c2 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4037,8 +4037,7 @@ void ext4_set_aops(struct inode *inode)
* ext4_punch_hole, etc) which needs to be properly zeroed out. Otherwise a
* racing writeback can come later and flush the stale pagecache to disk.
*/
-static int __ext4_block_zero_page_range(handle_t *handle,
- struct address_space *mapping,
+static int __ext4_block_zero_page_range(struct address_space *mapping,
loff_t from, loff_t length,
bool *did_zero)
{
@@ -4046,16 +4045,25 @@ static int __ext4_block_zero_page_range(handle_t *handle,
unsigned offset = from & (PAGE_SIZE-1);
unsigned blocksize, pos;
ext4_lblk_t iblock;
+ handle_t *handle;
struct inode *inode = mapping->host;
struct buffer_head *bh;
struct folio *folio;
int err = 0;
+ if (ext4_should_journal_data(inode)) {
+ handle = ext4_journal_start(inode, EXT4_HT_MISC, 1);
+ if (IS_ERR(handle))
+ return PTR_ERR(handle);
+ }
+
folio = __filemap_get_folio(mapping, from >> PAGE_SHIFT,
FGP_LOCK | FGP_ACCESSED | FGP_CREAT,
mapping_gfp_constraint(mapping, ~__GFP_FS));
- if (IS_ERR(folio))
- return PTR_ERR(folio);
+ if (IS_ERR(folio)) {
+ err = PTR_ERR(folio);
+ goto out;
+ }
blocksize = inode->i_sb->s_blocksize;
@@ -4106,22 +4114,24 @@ static int __ext4_block_zero_page_range(handle_t *handle,
}
}
}
+
if (ext4_should_journal_data(inode)) {
BUFFER_TRACE(bh, "get write access");
err = ext4_journal_get_write_access(handle, inode->i_sb, bh,
EXT4_JTR_NONE);
if (err)
goto unlock;
- }
- folio_zero_range(folio, offset, length);
- BUFFER_TRACE(bh, "zeroed end of block");
- if (ext4_should_journal_data(inode)) {
+ folio_zero_range(folio, offset, length);
+ BUFFER_TRACE(bh, "zeroed end of block");
+
err = ext4_dirty_journalled_data(handle, bh);
if (err)
goto unlock;
} else {
- err = 0;
+ folio_zero_range(folio, offset, length);
+ BUFFER_TRACE(bh, "zeroed end of block");
+
mark_buffer_dirty(bh);
}
@@ -4131,6 +4141,9 @@ static int __ext4_block_zero_page_range(handle_t *handle,
unlock:
folio_unlock(folio);
folio_put(folio);
+out:
+ if (ext4_should_journal_data(inode))
+ ext4_journal_stop(handle);
return err;
}
@@ -4141,8 +4154,7 @@ static int __ext4_block_zero_page_range(handle_t *handle,
* the end of the block it will be shortened to end of the block
* that corresponds to 'from'
*/
-static int ext4_block_zero_page_range(handle_t *handle,
- struct address_space *mapping,
+static int ext4_block_zero_page_range(struct address_space *mapping,
loff_t from, loff_t length,
bool *did_zero)
{
@@ -4162,8 +4174,7 @@ static int ext4_block_zero_page_range(handle_t *handle,
return dax_zero_range(inode, from, length, NULL,
&ext4_iomap_ops);
}
- return __ext4_block_zero_page_range(handle, mapping, from, length,
- did_zero);
+ return __ext4_block_zero_page_range(mapping, from, length, did_zero);
}
/*
@@ -4172,8 +4183,7 @@ static int ext4_block_zero_page_range(handle_t *handle,
* This required during truncate. We need to physically zero the tail end
* of that block so it doesn't yield old data if the file is later grown.
*/
-static int ext4_block_truncate_page(handle_t *handle,
- struct address_space *mapping, loff_t from,
+static int ext4_block_truncate_page(struct address_space *mapping, loff_t from,
loff_t *zero_len)
{
unsigned offset = from & (PAGE_SIZE-1);
@@ -4190,8 +4200,7 @@ static int ext4_block_truncate_page(handle_t *handle,
blocksize = inode->i_sb->s_blocksize;
length = blocksize - (offset & (blocksize - 1));
- ret = ext4_block_zero_page_range(handle, mapping, from, length,
- &did_zero);
+ ret = ext4_block_zero_page_range(mapping, from, length, &did_zero);
if (ret)
return ret;
@@ -4199,8 +4208,7 @@ static int ext4_block_truncate_page(handle_t *handle,
return 0;
}
-int ext4_zero_partial_blocks(handle_t *handle, struct inode *inode,
- loff_t lstart, loff_t length)
+int ext4_zero_partial_blocks(struct inode *inode, loff_t lstart, loff_t length)
{
struct super_block *sb = inode->i_sb;
struct address_space *mapping = inode->i_mapping;
@@ -4218,21 +4226,19 @@ int ext4_zero_partial_blocks(handle_t *handle, struct inode *inode,
/* Handle partial zero within the single block */
if (start == end &&
(partial_start || (partial_end != sb->s_blocksize - 1))) {
- err = ext4_block_zero_page_range(handle, mapping,
- lstart, length, NULL);
+ err = ext4_block_zero_page_range(mapping, lstart, length, NULL);
return err;
}
/* Handle partial zero out on the start of the range */
if (partial_start) {
- err = ext4_block_zero_page_range(handle, mapping,
- lstart, sb->s_blocksize,
- NULL);
+ err = ext4_block_zero_page_range(mapping, lstart,
+ sb->s_blocksize, NULL);
if (err)
return err;
}
/* Handle partial zero out on the end of the range */
if (partial_end != sb->s_blocksize - 1)
- err = ext4_block_zero_page_range(handle, mapping,
+ err = ext4_block_zero_page_range(mapping,
byte_end - partial_end,
partial_end + 1, NULL);
return err;
@@ -4418,6 +4424,10 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
/* Now release the pages and zero block aligned part of pages*/
truncate_pagecache_range(inode, offset, end - 1);
+ ret = ext4_zero_partial_blocks(inode, offset, length);
+ if (ret)
+ return ret;
+
if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
credits = ext4_writepage_trans_blocks(inode);
else
@@ -4429,10 +4439,6 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
return ret;
}
- ret = ext4_zero_partial_blocks(handle, inode, offset, length);
- if (ret)
- goto out_handle;
-
/* If there are blocks to remove, do it */
start_lblk = round_up(offset, blocksize) >> inode->i_blkbits;
end_lblk = end >> inode->i_blkbits;
@@ -4564,6 +4570,8 @@ int ext4_truncate(struct inode *inode)
err = ext4_inode_attach_jinode(inode);
if (err)
goto out_trace;
+
+ ext4_block_truncate_page(mapping, inode->i_size, &zero_len);
}
if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
@@ -4577,10 +4585,6 @@ int ext4_truncate(struct inode *inode)
goto out_trace;
}
- if (inode->i_size & (inode->i_sb->s_blocksize - 1))
- ext4_block_truncate_page(handle, mapping, inode->i_size,
- &zero_len);
-
if (zero_len && ext4_should_order_data(inode)) {
err = ext4_jbd2_inode_add_write(handle, inode, inode->i_size,
zero_len);
--
2.46.1
^ permalink raw reply related [flat|nested] 59+ messages in thread
* [PATCH 21/27] ext4: implement zero_range iomap path
2024-10-22 11:10 [PATCH 00/27] ext4: use iomap for regular file's buffered I/O path and enable large folio Zhang Yi
` (20 preceding siblings ...)
2024-10-22 11:10 ` [PATCH 20/27] ext4: do not start handle if unnecessary while " Zhang Yi
@ 2024-10-22 11:10 ` Zhang Yi
2024-10-22 11:10 ` [PATCH 22/27] ext4: disable online defrag when inode using iomap buffered I/O path Zhang Yi
` (5 subsequent siblings)
27 siblings, 0 replies; 59+ messages in thread
From: Zhang Yi @ 2024-10-22 11:10 UTC (permalink / raw)
To: linux-ext4
Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack,
ritesh.list, hch, djwong, david, zokeefe, yi.zhang, yi.zhang,
chengzhihao1, yukuai3, yangerkun
From: Zhang Yi <yi.zhang@huawei.com>
Introduce ext4_iomap_zero_range() to implement the zero_range iomap
path. Currently, this function directly invokes iomap_zero_range() to
zero out a mapped partial block during truncate down, zeroing range and
punching hole. Almost all operations are handled by iomap_zero_range().
One important aspect to consider is the truncate-down operation. Since
we do not order the data, it is essential to write out zeroed data
before the i_disksize update transaction is committed. Otherwise, stale
data may be left over in the last block, which could be exposed during
the next expanding truncate operation.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/ext4/inode.c | 25 +++++++++++++++++++++++++
1 file changed, 25 insertions(+)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 34701afe61c2..50e4afd17e93 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4147,6 +4147,13 @@ static int __ext4_block_zero_page_range(struct address_space *mapping,
return err;
}
+static int ext4_iomap_zero_range(struct inode *inode, loff_t from,
+ loff_t length, bool *did_zero)
+{
+ return iomap_zero_range(inode, from, length, did_zero,
+ &ext4_iomap_buffered_write_ops);
+}
+
/*
* ext4_block_zero_page_range() zeros out a mapping of length 'length'
* starting from file offset 'from'. The range to be zero'd must
@@ -4173,6 +4180,8 @@ static int ext4_block_zero_page_range(struct address_space *mapping,
if (IS_DAX(inode)) {
return dax_zero_range(inode, from, length, NULL,
&ext4_iomap_ops);
+ } else if (ext4_test_inode_state(inode, EXT4_STATE_BUFFERED_IOMAP)) {
+ return ext4_iomap_zero_range(inode, from, length, did_zero);
}
return __ext4_block_zero_page_range(mapping, from, length, did_zero);
}
@@ -4572,6 +4581,22 @@ int ext4_truncate(struct inode *inode)
goto out_trace;
ext4_block_truncate_page(mapping, inode->i_size, &zero_len);
+ /*
+ * inode with an iomap buffered I/O path does not order data,
+ * so it is necessary to write out zeroed data before the
+ * updating i_disksize transaction is committed. Otherwise,
+ * stale data may remain in the last block, which could be
+ * exposed during the next expand truncate operation.
+ */
+ if (zero_len && ext4_test_inode_state(inode,
+ EXT4_STATE_BUFFERED_IOMAP)) {
+ loff_t zero_end = inode->i_size + zero_len;
+
+ err = filemap_write_and_wait_range(mapping,
+ inode->i_size, zero_end - 1);
+ if (err)
+ goto out_trace;
+ }
}
if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
--
2.46.1
^ permalink raw reply related [flat|nested] 59+ messages in thread
* [PATCH 22/27] ext4: disable online defrag when inode using iomap buffered I/O path
2024-10-22 11:10 [PATCH 00/27] ext4: use iomap for regular file's buffered I/O path and enable large folio Zhang Yi
` (21 preceding siblings ...)
2024-10-22 11:10 ` [PATCH 21/27] ext4: implement zero_range iomap path Zhang Yi
@ 2024-10-22 11:10 ` Zhang Yi
2024-10-22 11:10 ` [PATCH 23/27] ext4: disable inode journal mode when " Zhang Yi
` (4 subsequent siblings)
27 siblings, 0 replies; 59+ messages in thread
From: Zhang Yi @ 2024-10-22 11:10 UTC (permalink / raw)
To: linux-ext4
Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack,
ritesh.list, hch, djwong, david, zokeefe, yi.zhang, yi.zhang,
chengzhihao1, yukuai3, yangerkun
From: Zhang Yi <yi.zhang@huawei.com>
Online defragmentation does not yet support inodes using the iomap
buffered I/O path, as it still relies on ext4_get_block() to map blocks
and copy data. Therefore, disable it for the time being.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/ext4/move_extent.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/fs/ext4/move_extent.c b/fs/ext4/move_extent.c
index b64661ea6e0e..508e342b4a1d 100644
--- a/fs/ext4/move_extent.c
+++ b/fs/ext4/move_extent.c
@@ -610,6 +610,13 @@ ext4_move_extents(struct file *o_filp, struct file *d_filp, __u64 orig_blk,
return -EOPNOTSUPP;
}
+ if (ext4_test_inode_state(orig_inode, EXT4_STATE_BUFFERED_IOMAP) ||
+ ext4_test_inode_state(donor_inode, EXT4_STATE_BUFFERED_IOMAP)) {
+ ext4_msg(orig_inode->i_sb, KERN_ERR,
+ "Online defrag not supported for inode with iomap buffered IO path");
+ return -EOPNOTSUPP;
+ }
+
/* Protect orig and donor inodes against a truncate */
lock_two_nondirectories(orig_inode, donor_inode);
--
2.46.1
* [PATCH 23/27] ext4: disable inode journal mode when using iomap buffered I/O path
2024-10-22 11:10 [PATCH 00/27] ext4: use iomap for regular file's buffered I/O path and enable large folio Zhang Yi
` (22 preceding siblings ...)
2024-10-22 11:10 ` [PATCH 22/27] ext4: disable online defrag when inode using iomap buffered I/O path Zhang Yi
@ 2024-10-22 11:10 ` Zhang Yi
2024-10-22 11:10 ` [PATCH 24/27] ext4: partially enable iomap for the buffered I/O path of regular files Zhang Yi
` (3 subsequent siblings)
27 siblings, 0 replies; 59+ messages in thread
From: Zhang Yi @ 2024-10-22 11:10 UTC (permalink / raw)
To: linux-ext4
Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack,
ritesh.list, hch, djwong, david, zokeefe, yi.zhang, yi.zhang,
chengzhihao1, yukuai3, yangerkun
From: Zhang Yi <yi.zhang@huawei.com>
Inodes in data=journal mode do not support the iomap buffered I/O
path, so just disable this mode if EXT4_STATE_BUFFERED_IOMAP is set.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/ext4/ext4_jbd2.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
index da4a82456383..367a29babe09 100644
--- a/fs/ext4/ext4_jbd2.c
+++ b/fs/ext4/ext4_jbd2.c
@@ -16,7 +16,8 @@ int ext4_inode_journal_mode(struct inode *inode)
ext4_test_inode_flag(inode, EXT4_INODE_EA_INODE) ||
test_opt(inode->i_sb, DATA_FLAGS) == EXT4_MOUNT_JOURNAL_DATA ||
(ext4_test_inode_flag(inode, EXT4_INODE_JOURNAL_DATA) &&
- !test_opt(inode->i_sb, DELALLOC))) {
+ !test_opt(inode->i_sb, DELALLOC) &&
+ !ext4_test_inode_state(inode, EXT4_STATE_BUFFERED_IOMAP))) {
/* We do not support data journalling for encrypted data */
if (S_ISREG(inode->i_mode) && IS_ENCRYPTED(inode))
return EXT4_INODE_ORDERED_DATA_MODE; /* ordered */
--
2.46.1
* [PATCH 24/27] ext4: partially enable iomap for the buffered I/O path of regular files
2024-10-22 11:10 [PATCH 00/27] ext4: use iomap for regular file's buffered I/O path and enable large folio Zhang Yi
` (23 preceding siblings ...)
2024-10-22 11:10 ` [PATCH 23/27] ext4: disable inode journal mode when " Zhang Yi
@ 2024-10-22 11:10 ` Zhang Yi
2024-10-22 11:10 ` [PATCH 25/27] ext4: enable large folio for regular file with iomap buffered I/O path Zhang Yi
` (2 subsequent siblings)
27 siblings, 0 replies; 59+ messages in thread
From: Zhang Yi @ 2024-10-22 11:10 UTC (permalink / raw)
To: linux-ext4
Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack,
ritesh.list, hch, djwong, david, zokeefe, yi.zhang, yi.zhang,
chengzhihao1, yukuai3, yangerkun
From: Zhang Yi <yi.zhang@huawei.com>
Partially enable iomap for the buffered I/O path of regular files with
the default mount options. This supports the default filesystem features
and the bigalloc feature. However, it does not yet support inline data,
fs_verity, fs_crypt, online defrag, or data=journal mode. Some of these
features will be supported gradually in the future. The filesystem
will fall back to the buffer_head path automatically if any of these
mount options or features are enabled.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/ext4/ext4.h | 1 +
fs/ext4/ialloc.c | 3 +++
fs/ext4/inode.c | 32 ++++++++++++++++++++++++++++++++
3 files changed, 36 insertions(+)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index e1b7f7024f07..0096191b454c 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2987,6 +2987,7 @@ int ext4_walk_page_buffers(handle_t *handle,
struct buffer_head *bh));
int do_journal_get_write_access(handle_t *handle, struct inode *inode,
struct buffer_head *bh);
+bool ext4_should_use_buffered_iomap(struct inode *inode);
int ext4_nonda_switch(struct super_block *sb);
#define FALL_BACK_TO_NONDELALLOC 1
#define CONVERT_INLINE_DATA 2
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index 7f1a5f90dbbd..2e3e257b9808 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -1333,6 +1333,9 @@ struct inode *__ext4_new_inode(struct mnt_idmap *idmap,
}
}
+ if (ext4_should_use_buffered_iomap(inode))
+ ext4_set_inode_state(inode, EXT4_STATE_BUFFERED_IOMAP);
+
ext4_update_inode_fsync_trans(handle, inode, 1);
err = ext4_mark_inode_dirty(handle, inode);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 50e4afd17e93..512094dc4117 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -776,6 +776,8 @@ static int _ext4_get_block(struct inode *inode, sector_t iblock,
if (ext4_has_inline_data(inode))
return -ERANGE;
+ if (WARN_ON(ext4_test_inode_state(inode, EXT4_STATE_BUFFERED_IOMAP)))
+ return -EINVAL;
map.m_lblk = iblock;
map.m_len = bh->b_size >> inode->i_blkbits;
@@ -2572,6 +2574,9 @@ static int ext4_do_writepages(struct mpage_da_data *mpd)
trace_ext4_writepages(inode, wbc);
+ if (WARN_ON(ext4_test_inode_state(inode, EXT4_STATE_BUFFERED_IOMAP)))
+ return -EINVAL;
+
/*
* No pages to write? This is mainly a kludge to avoid starting
* a transaction for special inodes like journal inode on last iput()
@@ -5144,6 +5149,30 @@ static const char *check_igot_inode(struct inode *inode, ext4_iget_flags flags)
return NULL;
}
+bool ext4_should_use_buffered_iomap(struct inode *inode)
+{
+ struct super_block *sb = inode->i_sb;
+
+ if (ext4_has_feature_inline_data(sb))
+ return false;
+ if (ext4_has_feature_verity(sb))
+ return false;
+ if (test_opt(sb, DATA_FLAGS) == EXT4_MOUNT_JOURNAL_DATA)
+ return false;
+ if (!S_ISREG(inode->i_mode))
+ return false;
+ if (IS_DAX(inode))
+ return false;
+ if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)))
+ return false;
+ if (ext4_test_inode_flag(inode, EXT4_INODE_EA_INODE))
+ return false;
+ if (ext4_test_inode_flag(inode, EXT4_INODE_ENCRYPT))
+ return false;
+
+ return true;
+}
+
struct inode *__ext4_iget(struct super_block *sb, unsigned long ino,
ext4_iget_flags flags, const char *function,
unsigned int line)
@@ -5408,6 +5437,9 @@ struct inode *__ext4_iget(struct super_block *sb, unsigned long ino,
if (ret)
goto bad_inode;
+ if (ext4_should_use_buffered_iomap(inode))
+ ext4_set_inode_state(inode, EXT4_STATE_BUFFERED_IOMAP);
+
if (S_ISREG(inode->i_mode)) {
inode->i_op = &ext4_file_inode_operations;
inode->i_fop = &ext4_file_operations;
--
2.46.1
* [PATCH 25/27] ext4: enable large folio for regular file with iomap buffered I/O path
2024-10-22 11:10 [PATCH 00/27] ext4: use iomap for regular file's buffered I/O path and enable large folio Zhang Yi
` (24 preceding siblings ...)
2024-10-22 11:10 ` [PATCH 24/27] ext4: partially enable iomap for the buffered I/O path of regular files Zhang Yi
@ 2024-10-22 11:10 ` Zhang Yi
2024-10-22 11:10 ` [PATCH 26/27] ext4: change mount options code style Zhang Yi
2024-10-22 11:10 ` [PATCH 27/27] ext4: introduce a mount option for iomap buffered I/O path Zhang Yi
27 siblings, 0 replies; 59+ messages in thread
From: Zhang Yi @ 2024-10-22 11:10 UTC (permalink / raw)
To: linux-ext4
Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack,
ritesh.list, hch, djwong, david, zokeefe, yi.zhang, yi.zhang,
chengzhihao1, yukuai3, yangerkun
From: Zhang Yi <yi.zhang@huawei.com>
Since we have converted the buffered I/O path to iomap for regular
files, we can enable large folio support as well. This should result in
significant performance gains for large I/O operations.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/ext4/ialloc.c | 4 +++-
fs/ext4/inode.c | 4 +++-
2 files changed, 6 insertions(+), 2 deletions(-)
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index 2e3e257b9808..6ff03fb74867 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -1333,8 +1333,10 @@ struct inode *__ext4_new_inode(struct mnt_idmap *idmap,
}
}
- if (ext4_should_use_buffered_iomap(inode))
+ if (ext4_should_use_buffered_iomap(inode)) {
ext4_set_inode_state(inode, EXT4_STATE_BUFFERED_IOMAP);
+ mapping_set_large_folios(inode->i_mapping);
+ }
ext4_update_inode_fsync_trans(handle, inode, 1);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 512094dc4117..97abc88e6658 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5437,8 +5437,10 @@ struct inode *__ext4_iget(struct super_block *sb, unsigned long ino,
if (ret)
goto bad_inode;
- if (ext4_should_use_buffered_iomap(inode))
+ if (ext4_should_use_buffered_iomap(inode)) {
ext4_set_inode_state(inode, EXT4_STATE_BUFFERED_IOMAP);
+ mapping_set_large_folios(inode->i_mapping);
+ }
if (S_ISREG(inode->i_mode)) {
inode->i_op = &ext4_file_inode_operations;
--
2.46.1
* [PATCH 26/27] ext4: change mount options code style
2024-10-22 11:10 [PATCH 00/27] ext4: use iomap for regular file's buffered I/O path and enable large folio Zhang Yi
` (25 preceding siblings ...)
2024-10-22 11:10 ` [PATCH 25/27] ext4: enable large folio for regular file with iomap buffered I/O path Zhang Yi
@ 2024-10-22 11:10 ` Zhang Yi
2024-10-22 11:10 ` [PATCH 27/27] ext4: introduce a mount option for iomap buffered I/O path Zhang Yi
27 siblings, 0 replies; 59+ messages in thread
From: Zhang Yi @ 2024-10-22 11:10 UTC (permalink / raw)
To: linux-ext4
Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack,
ritesh.list, hch, djwong, david, zokeefe, yi.zhang, yi.zhang,
chengzhihao1, yukuai3, yangerkun
From: Zhang Yi <yi.zhang@huawei.com>
Just remove the space between the macro name and the opening parenthesis
to satisfy the checkpatch.pl script and prevent it from complaining when
we add new mount options in the subsequent patch. This results in no
logical changes.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/ext4/super.c | 175 +++++++++++++++++++++++-------------------------
1 file changed, 84 insertions(+), 91 deletions(-)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 56baadec27e0..89955081c4fe 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1723,101 +1723,94 @@ static const struct constant_table ext4_param_dax[] = {
* separate for now.
*/
static const struct fs_parameter_spec ext4_param_specs[] = {
- fsparam_flag ("bsddf", Opt_bsd_df),
- fsparam_flag ("minixdf", Opt_minix_df),
- fsparam_flag ("grpid", Opt_grpid),
- fsparam_flag ("bsdgroups", Opt_grpid),
- fsparam_flag ("nogrpid", Opt_nogrpid),
- fsparam_flag ("sysvgroups", Opt_nogrpid),
- fsparam_gid ("resgid", Opt_resgid),
- fsparam_uid ("resuid", Opt_resuid),
- fsparam_u32 ("sb", Opt_sb),
- fsparam_enum ("errors", Opt_errors, ext4_param_errors),
- fsparam_flag ("nouid32", Opt_nouid32),
- fsparam_flag ("debug", Opt_debug),
- fsparam_flag ("oldalloc", Opt_removed),
- fsparam_flag ("orlov", Opt_removed),
- fsparam_flag ("user_xattr", Opt_user_xattr),
- fsparam_flag ("acl", Opt_acl),
- fsparam_flag ("norecovery", Opt_noload),
- fsparam_flag ("noload", Opt_noload),
- fsparam_flag ("bh", Opt_removed),
- fsparam_flag ("nobh", Opt_removed),
- fsparam_u32 ("commit", Opt_commit),
- fsparam_u32 ("min_batch_time", Opt_min_batch_time),
- fsparam_u32 ("max_batch_time", Opt_max_batch_time),
- fsparam_u32 ("journal_dev", Opt_journal_dev),
- fsparam_bdev ("journal_path", Opt_journal_path),
- fsparam_flag ("journal_checksum", Opt_journal_checksum),
- fsparam_flag ("nojournal_checksum", Opt_nojournal_checksum),
- fsparam_flag ("journal_async_commit",Opt_journal_async_commit),
- fsparam_flag ("abort", Opt_abort),
- fsparam_enum ("data", Opt_data, ext4_param_data),
- fsparam_enum ("data_err", Opt_data_err,
+ fsparam_flag("bsddf", Opt_bsd_df),
+ fsparam_flag("minixdf", Opt_minix_df),
+ fsparam_flag("grpid", Opt_grpid),
+ fsparam_flag("bsdgroups", Opt_grpid),
+ fsparam_flag("nogrpid", Opt_nogrpid),
+ fsparam_flag("sysvgroups", Opt_nogrpid),
+ fsparam_gid("resgid", Opt_resgid),
+ fsparam_uid("resuid", Opt_resuid),
+ fsparam_u32("sb", Opt_sb),
+ fsparam_enum("errors", Opt_errors, ext4_param_errors),
+ fsparam_flag("nouid32", Opt_nouid32),
+ fsparam_flag("debug", Opt_debug),
+ fsparam_flag("oldalloc", Opt_removed),
+ fsparam_flag("orlov", Opt_removed),
+ fsparam_flag("user_xattr", Opt_user_xattr),
+ fsparam_flag("acl", Opt_acl),
+ fsparam_flag("norecovery", Opt_noload),
+ fsparam_flag("noload", Opt_noload),
+ fsparam_flag("bh", Opt_removed),
+ fsparam_flag("nobh", Opt_removed),
+ fsparam_u32("commit", Opt_commit),
+ fsparam_u32("min_batch_time", Opt_min_batch_time),
+ fsparam_u32("max_batch_time", Opt_max_batch_time),
+ fsparam_u32("journal_dev", Opt_journal_dev),
+ fsparam_bdev("journal_path", Opt_journal_path),
+ fsparam_flag("journal_checksum", Opt_journal_checksum),
+ fsparam_flag("nojournal_checksum", Opt_nojournal_checksum),
+ fsparam_flag("journal_async_commit", Opt_journal_async_commit),
+ fsparam_flag("abort", Opt_abort),
+ fsparam_enum("data", Opt_data, ext4_param_data),
+ fsparam_enum("data_err", Opt_data_err,
ext4_param_data_err),
- fsparam_string_empty
- ("usrjquota", Opt_usrjquota),
- fsparam_string_empty
- ("grpjquota", Opt_grpjquota),
- fsparam_enum ("jqfmt", Opt_jqfmt, ext4_param_jqfmt),
- fsparam_flag ("grpquota", Opt_grpquota),
- fsparam_flag ("quota", Opt_quota),
- fsparam_flag ("noquota", Opt_noquota),
- fsparam_flag ("usrquota", Opt_usrquota),
- fsparam_flag ("prjquota", Opt_prjquota),
- fsparam_flag ("barrier", Opt_barrier),
- fsparam_u32 ("barrier", Opt_barrier),
- fsparam_flag ("nobarrier", Opt_nobarrier),
- fsparam_flag ("i_version", Opt_removed),
- fsparam_flag ("dax", Opt_dax),
- fsparam_enum ("dax", Opt_dax_type, ext4_param_dax),
- fsparam_u32 ("stripe", Opt_stripe),
- fsparam_flag ("delalloc", Opt_delalloc),
- fsparam_flag ("nodelalloc", Opt_nodelalloc),
- fsparam_flag ("warn_on_error", Opt_warn_on_error),
- fsparam_flag ("nowarn_on_error", Opt_nowarn_on_error),
- fsparam_u32 ("debug_want_extra_isize",
- Opt_debug_want_extra_isize),
- fsparam_flag ("mblk_io_submit", Opt_removed),
- fsparam_flag ("nomblk_io_submit", Opt_removed),
- fsparam_flag ("block_validity", Opt_block_validity),
- fsparam_flag ("noblock_validity", Opt_noblock_validity),
- fsparam_u32 ("inode_readahead_blks",
- Opt_inode_readahead_blks),
- fsparam_u32 ("journal_ioprio", Opt_journal_ioprio),
- fsparam_u32 ("auto_da_alloc", Opt_auto_da_alloc),
- fsparam_flag ("auto_da_alloc", Opt_auto_da_alloc),
- fsparam_flag ("noauto_da_alloc", Opt_noauto_da_alloc),
- fsparam_flag ("dioread_nolock", Opt_dioread_nolock),
- fsparam_flag ("nodioread_nolock", Opt_dioread_lock),
- fsparam_flag ("dioread_lock", Opt_dioread_lock),
- fsparam_flag ("discard", Opt_discard),
- fsparam_flag ("nodiscard", Opt_nodiscard),
- fsparam_u32 ("init_itable", Opt_init_itable),
- fsparam_flag ("init_itable", Opt_init_itable),
- fsparam_flag ("noinit_itable", Opt_noinit_itable),
+ fsparam_string_empty("usrjquota", Opt_usrjquota),
+ fsparam_string_empty("grpjquota", Opt_grpjquota),
+ fsparam_enum("jqfmt", Opt_jqfmt, ext4_param_jqfmt),
+ fsparam_flag("grpquota", Opt_grpquota),
+ fsparam_flag("quota", Opt_quota),
+ fsparam_flag("noquota", Opt_noquota),
+ fsparam_flag("usrquota", Opt_usrquota),
+ fsparam_flag("prjquota", Opt_prjquota),
+ fsparam_flag("barrier", Opt_barrier),
+ fsparam_u32("barrier", Opt_barrier),
+ fsparam_flag("nobarrier", Opt_nobarrier),
+ fsparam_flag("i_version", Opt_removed),
+ fsparam_flag("dax", Opt_dax),
+ fsparam_enum("dax", Opt_dax_type, ext4_param_dax),
+ fsparam_u32("stripe", Opt_stripe),
+ fsparam_flag("delalloc", Opt_delalloc),
+ fsparam_flag("nodelalloc", Opt_nodelalloc),
+ fsparam_flag("warn_on_error", Opt_warn_on_error),
+ fsparam_flag("nowarn_on_error", Opt_nowarn_on_error),
+ fsparam_u32("debug_want_extra_isize", Opt_debug_want_extra_isize),
+ fsparam_flag("mblk_io_submit", Opt_removed),
+ fsparam_flag("nomblk_io_submit", Opt_removed),
+ fsparam_flag("block_validity", Opt_block_validity),
+ fsparam_flag("noblock_validity", Opt_noblock_validity),
+ fsparam_u32("inode_readahead_blks", Opt_inode_readahead_blks),
+ fsparam_u32("journal_ioprio", Opt_journal_ioprio),
+ fsparam_u32("auto_da_alloc", Opt_auto_da_alloc),
+ fsparam_flag("auto_da_alloc", Opt_auto_da_alloc),
+ fsparam_flag("noauto_da_alloc", Opt_noauto_da_alloc),
+ fsparam_flag("dioread_nolock", Opt_dioread_nolock),
+ fsparam_flag("nodioread_nolock", Opt_dioread_lock),
+ fsparam_flag("dioread_lock", Opt_dioread_lock),
+ fsparam_flag("discard", Opt_discard),
+ fsparam_flag("nodiscard", Opt_nodiscard),
+ fsparam_u32("init_itable", Opt_init_itable),
+ fsparam_flag("init_itable", Opt_init_itable),
+ fsparam_flag("noinit_itable", Opt_noinit_itable),
#ifdef CONFIG_EXT4_DEBUG
- fsparam_flag ("fc_debug_force", Opt_fc_debug_force),
- fsparam_u32 ("fc_debug_max_replay", Opt_fc_debug_max_replay),
+ fsparam_flag("fc_debug_force", Opt_fc_debug_force),
+ fsparam_u32("fc_debug_max_replay", Opt_fc_debug_max_replay),
#endif
- fsparam_u32 ("max_dir_size_kb", Opt_max_dir_size_kb),
- fsparam_flag ("test_dummy_encryption",
- Opt_test_dummy_encryption),
- fsparam_string ("test_dummy_encryption",
- Opt_test_dummy_encryption),
- fsparam_flag ("inlinecrypt", Opt_inlinecrypt),
- fsparam_flag ("nombcache", Opt_nombcache),
- fsparam_flag ("no_mbcache", Opt_nombcache), /* for backward compatibility */
- fsparam_flag ("prefetch_block_bitmaps",
- Opt_removed),
- fsparam_flag ("no_prefetch_block_bitmaps",
+ fsparam_u32("max_dir_size_kb", Opt_max_dir_size_kb),
+ fsparam_flag("test_dummy_encryption", Opt_test_dummy_encryption),
+ fsparam_string("test_dummy_encryption", Opt_test_dummy_encryption),
+ fsparam_flag("inlinecrypt", Opt_inlinecrypt),
+ fsparam_flag("nombcache", Opt_nombcache),
+ fsparam_flag("no_mbcache", Opt_nombcache), /* for backward compatibility */
+ fsparam_flag("prefetch_block_bitmaps", Opt_removed),
+ fsparam_flag("no_prefetch_block_bitmaps",
Opt_no_prefetch_block_bitmaps),
- fsparam_s32 ("mb_optimize_scan", Opt_mb_optimize_scan),
- fsparam_string ("check", Opt_removed), /* mount option from ext2/3 */
- fsparam_flag ("nocheck", Opt_removed), /* mount option from ext2/3 */
- fsparam_flag ("reservation", Opt_removed), /* mount option from ext2/3 */
- fsparam_flag ("noreservation", Opt_removed), /* mount option from ext2/3 */
- fsparam_u32 ("journal", Opt_removed), /* mount option from ext2/3 */
+ fsparam_s32("mb_optimize_scan", Opt_mb_optimize_scan),
+ fsparam_string("check", Opt_removed), /* mount option from ext2/3 */
+ fsparam_flag("nocheck", Opt_removed), /* mount option from ext2/3 */
+ fsparam_flag("reservation", Opt_removed), /* mount option from ext2/3 */
+ fsparam_flag("noreservation", Opt_removed), /* mount option from ext2/3 */
+ fsparam_u32("journal", Opt_removed), /* mount option from ext2/3 */
{}
};
--
2.46.1
* [PATCH 27/27] ext4: introduce a mount option for iomap buffered I/O path
2024-10-22 11:10 [PATCH 00/27] ext4: use iomap for regular file's buffered I/O path and enable large folio Zhang Yi
` (26 preceding siblings ...)
2024-10-22 11:10 ` [PATCH 26/27] ext4: change mount options code style Zhang Yi
@ 2024-10-22 11:10 ` Zhang Yi
27 siblings, 0 replies; 59+ messages in thread
From: Zhang Yi @ 2024-10-22 11:10 UTC (permalink / raw)
To: linux-ext4
Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack,
ritesh.list, hch, djwong, david, zokeefe, yi.zhang, yi.zhang,
chengzhihao1, yukuai3, yangerkun
From: Zhang Yi <yi.zhang@huawei.com>
Introduce the buffered_iomap and nobuffered_iomap mount options to
enable or disable the iomap buffered I/O path for regular files. This
path is disabled by default until it supports a more comprehensive set
of features.
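With the series applied, the path could be toggled per mount roughly as
follows (a usage sketch based on the option names added below; the
device and mount point are placeholders, and this is not a tested
invocation):

```shell
# Enable the iomap buffered I/O path for inodes loaded after this mount:
mount -o buffered_iomap /dev/vdb /mnt/test

# Switch it off again on remount (the default behaviour):
mount -o remount,nobuffered_iomap /mnt/test
```

Note that, per the gating in ext4_should_use_buffered_iomap(), the
option only takes effect for regular, extent-mapped, unencrypted files
on filesystems without inline_data or verity.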
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/ext4/ext4.h | 1 +
fs/ext4/inode.c | 2 ++
fs/ext4/super.c | 7 +++++++
3 files changed, 10 insertions(+)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 0096191b454c..c2a44530e026 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1257,6 +1257,7 @@ struct ext4_inode_info {
* scanning in mballoc
*/
#define EXT4_MOUNT2_ABORT 0x00000100 /* Abort filesystem */
+#define EXT4_MOUNT2_BUFFERED_IOMAP 0x00000200 /* Use iomap for buffered IO */
#define clear_opt(sb, opt) EXT4_SB(sb)->s_mount_opt &= \
~EXT4_MOUNT_##opt
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 97abc88e6658..b6e041a423f9 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5153,6 +5153,8 @@ bool ext4_should_use_buffered_iomap(struct inode *inode)
{
struct super_block *sb = inode->i_sb;
+ if (!test_opt2(sb, BUFFERED_IOMAP))
+ return false;
if (ext4_has_feature_inline_data(sb))
return false;
if (ext4_has_feature_verity(sb))
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 89955081c4fe..435a866359d9 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1675,6 +1675,7 @@ enum {
Opt_discard, Opt_nodiscard, Opt_init_itable, Opt_noinit_itable,
Opt_max_dir_size_kb, Opt_nojournal_checksum, Opt_nombcache,
Opt_no_prefetch_block_bitmaps, Opt_mb_optimize_scan,
+ Opt_buffered_iomap, Opt_nobuffered_iomap,
Opt_errors, Opt_data, Opt_data_err, Opt_jqfmt, Opt_dax_type,
#ifdef CONFIG_EXT4_DEBUG
Opt_fc_debug_max_replay, Opt_fc_debug_force
@@ -1806,6 +1807,8 @@ static const struct fs_parameter_spec ext4_param_specs[] = {
fsparam_flag("no_prefetch_block_bitmaps",
Opt_no_prefetch_block_bitmaps),
fsparam_s32("mb_optimize_scan", Opt_mb_optimize_scan),
+ fsparam_flag("buffered_iomap", Opt_buffered_iomap),
+ fsparam_flag("nobuffered_iomap", Opt_nobuffered_iomap),
fsparam_string("check", Opt_removed), /* mount option from ext2/3 */
fsparam_flag("nocheck", Opt_removed), /* mount option from ext2/3 */
fsparam_flag("reservation", Opt_removed), /* mount option from ext2/3 */
@@ -1900,6 +1903,10 @@ static const struct mount_opts {
{Opt_nombcache, EXT4_MOUNT_NO_MBCACHE, MOPT_SET},
{Opt_no_prefetch_block_bitmaps, EXT4_MOUNT_NO_PREFETCH_BLOCK_BITMAPS,
MOPT_SET},
+ {Opt_buffered_iomap, EXT4_MOUNT2_BUFFERED_IOMAP,
+ MOPT_SET | MOPT_2 | MOPT_EXT4_ONLY},
+ {Opt_nobuffered_iomap, EXT4_MOUNT2_BUFFERED_IOMAP,
+ MOPT_CLEAR | MOPT_2 | MOPT_EXT4_ONLY},
#ifdef CONFIG_EXT4_DEBUG
{Opt_fc_debug_force, EXT4_MOUNT2_JOURNAL_FAST_COMMIT,
MOPT_SET | MOPT_2 | MOPT_EXT4_ONLY},
--
2.46.1
* Re: [PATCH 00/27] ext4: use iomap for regular file's buffered I/O path and enable large folio
2024-10-22 9:22 ` Zhang Yi
@ 2024-10-23 12:13 ` Sedat Dilek
2024-10-24 7:44 ` Zhang Yi
0 siblings, 1 reply; 59+ messages in thread
From: Sedat Dilek @ 2024-10-23 12:13 UTC (permalink / raw)
To: Zhang Yi
Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
jack, ritesh.list, hch, djwong, david, zokeefe, yi.zhang,
chengzhihao1, yukuai3, yangerkun
On Tue, Oct 22, 2024 at 11:22 AM Zhang Yi <yi.zhang@huaweicloud.com> wrote:
>
> On 2024/10/22 14:59, Sedat Dilek wrote:
> > On Tue, Oct 22, 2024 at 5:13 AM Zhang Yi <yi.zhang@huaweicloud.com> wrote:
> >>
> >> From: Zhang Yi <yi.zhang@huawei.com>
> >>
> >> Hello!
> >>
> >> This patch series is the latest version based on my previous RFC
> >> series[1], which converts the buffered I/O path of ext4 regular files to
> >> iomap and enables large folios. After several months of work, almost all
> >> preparatory changes have been upstreamed, thanks a lot for the review
> >> and comments from Jan, Dave, Christoph, Darrick and Ritesh. Now it is
> >> time for the main implementation of this conversion.
> >>
> >> This series is the main part of the iomap buffered I/O conversion.
> >> It is based on 6.12-rc4, its code context also depends on another
> >> cleanup series of mine[1] (I've put that in this series so we can
> >> merge it directly), and it fixes all minor bugs found in my previous
> >> RFC v4 series. Additionally, I've updated the change logs in each
> >> patch and made some code modifications following Dave's suggestions.
> >> This series implements the core iomap APIs in ext4 and introduces a
> >> mount option called "buffered_iomap" to enable the iomap buffered
> >> I/O path. We already support the default features, default mount
> >> options and the bigalloc feature. However, we do not yet support
> >> online defragmentation, inline data, fs_verity, fs_crypt, ext3, and
> >> data=journal mode; ext4 will fall back to the buffer_head I/O path
> >> automatically if you use those features and options. Some of these
> >> features should be supported gradually in the near future.
> >>
> >> Most of the implementations resemble the original buffer_head path;
> >> however, there are four key differences.
> >>
> >> 1. The first aspect is the block allocation in the writeback path. The
> >> iomap framework will invoke ->map_blocks() at least once for each dirty
> >> folio. To ensure optimal writeback performance, we aim to allocate a
> >> range of delalloc blocks that is as long as possible within the
> >> writeback length for each invocation. In certain situations, we may
> >> allocate a range of blocks that exceeds the amount we will actually
> >> write back. Therefore,
> >> 1) we cannot allocate a written extent for those blocks because it may
> >> expose stale data in such short write cases. Instead, we should
> >> allocate an unwritten extent, which means we must always enable the
> >> dioread_nolock option. This change could also bring many other
> >> benefits.
> >> 2) We should postpone updating the 'i_disksize' until the end of the I/O
> >> process, based on the actual written length. This approach can also
> >> prevent the exposure of zero data, which may occur if there is a
> >> power failure during an append write.
> >> 3) We do not need to pre-split extents during writeback; we can
> >> postpone this task until the end-of-I/O process while converting
> >> unwritten extents.
> >>
> >> 2. The second aspect is that, since we always allocate unwritten space
> >> for new blocks, there is no risk of exposing stale data. As a result,
> >> we do not need to order the data, which allows us to disable the
> >> data=ordered mode. Consequently, we also do not require the reserved
> >> handle when converting the unwritten extent in the final I/O worker;
> >> we can directly start with the normal handle.
> >>
> >> Series details:
> >>
> >> Patches 1-10 are just another series of mine that refactors the
> >> fallocate functions[1]. This series relies on its code context but has
> >> no logical dependencies on it. I put it here just for easy access and
> >> merging.
> >>
> >> Patches 11-21 implement the iomap buffered read/write path, the dirty
> >> folio writeback path and the mmap path for ext4 regular files.
> >>
> >> Patches 22-23 disable the unsupported online defragmentation function
> >> and prevent changing the inode journal flag to data=journal mode.
> >> Please look at the corresponding patches for details.
> >>
> >> Patches 24-27 introduce the "buffered_iomap" mount option (not enabled
> >> by default for now) to partially enable the iomap buffered I/O path and
> >> also enable large folios.
> >>
> >>
> >> About performance:
> >>
> >> Fio tests with psync on my machine with Intel Xeon Gold 6240 CPU with
> >> 400GB system ram, 200GB ramdisk and 4TB nvme ssd disk.
> >>
> >> fio -directory=/mnt -direct=0 -iodepth=$iodepth -fsync=$sync -rw=$rw \
> >> -numjobs=${numjobs} -bs=${bs} -ioengine=psync -size=$size \
> >> -runtime=60 -norandommap=0 -fallocate=none -overwrite=$overwrite \
> >> -group_reportin -name=$name --output=/tmp/test_log
> >>
> >
> > Hi Zhang Yi,
> >
> > can you clarify the FIO values for the various parameters?
> >
>
> Hi Sedat,
>
> Sure, the test I present here is a simple single-thread and single-I/O
> depth case with psync ioengine. Most of the FIO parameters are shown
> in the tables below.
>
Hi Zhang Yi,
Thanks for your reply.
Can you share a FIO config file with all (relevant) settings?
Maybe it is in the below link?
Link: https://packages.debian.org/sid/all/fio-examples/filelist
> For the rest, the 'iodepth' and 'numjobs' are always set to 1 and the
> 'size' is 40GB. During the write cache test, I also disable the write
> back process through:
>
> echo 0 > /proc/sys/vm/dirty_writeback_centisecs
> echo 100 > /proc/sys/vm/dirty_background_ratio
> echo 100 > /proc/sys/vm/dirty_ratio
>
^^ Is this info in one of the patches? If not, can you add it to the
next version's cover letter?
Are the patchset and its improvements relevant only for powerful
servers, or does a notebook user benefit from this as well?
If you have benchmark data, please share this.
I cannot promise that I will give this patchset a try.
Best thanks.
Best regards,
-Sedat-
> Thanks,
> Yi.
>
> >
> >> == buffer read ==
> >>
> >> buffer_head iomap + large folio
> >> type bs IOPS BW(MiB/s) IOPS BW(MiB/s)
> >> -------------------------------------------------------
> >> hole 4K 576k 2253 762k 2975 +32%
> >> hole 64K 48.7k 3043 77.8k 4860 +60%
> >> hole 1M 2960 2960 4942 4942 +67%
> >> ramdisk 4K 443k 1732 530k 2069 +19%
> >> ramdisk 64K 34.5k 2156 45.6k 2850 +32%
> >> ramdisk 1M 2093 2093 2841 2841 +36%
> >> nvme 4K 339k 1323 364k 1425 +8%
> >> nvme 64K 23.6k 1471 25.2k 1574 +7%
> >> nvme 1M 2012 2012 2153 2153 +7%
> >>
> >>
> >> == buffer write ==
> >>
> >> buffer_head iomap + large folio
> >> type Overwrite Sync Writeback bs IOPS BW IOPS BW(MiB/s)
> >> ----------------------------------------------------------------------
> >> cache N N N 4K 417k 1631 440k 1719 +5%
> >> cache N N N 64K 33.4k 2088 81.5k 5092 +144%
> >> cache N N N 1M 2143 2143 5716 5716 +167%
> >> cache Y N N 4K 449k 1755 469k 1834 +5%
> >> cache Y N N 64K 36.6k 2290 82.3k 5142 +125%
> >> cache Y N N 1M 2352 2352 5577 5577 +137%
> >> ramdisk N N Y 4K 365k 1424 354k 1384 -3%
> >> ramdisk N N Y 64K 31.2k 1950 74.2k 4640 +138%
> >> ramdisk N N Y 1M 1968 1968 5201 5201 +164%
> >> ramdisk N Y N 4K 9984 39 12.9k 51 +29%
> >> ramdisk N Y N 64K 5936 371 8960 560 +51%
> >> ramdisk N Y N 1M 1050 1050 1835 1835 +75%
> >> ramdisk Y N Y 4K 411k 1609 443k 1731 +8%
> >> ramdisk Y N Y 64K 34.1k 2134 77.5k 4844 +127%
> >> ramdisk Y N Y 1M 2248 2248 5372 5372 +139%
> >> ramdisk Y Y N 4K 182k 711 186k 730 +3%
> >> ramdisk Y Y N 64K 18.7k 1170 34.7k 2171 +86%
> >> ramdisk Y Y N 1M 1229 1229 2269 2269 +85%
> >> nvme N N Y 4K 373k 1458 387k 1512 +4%
> >> nvme N N Y 64K 29.2k 1827 70.9k 4431 +143%
> >> nvme N N Y 1M 1835 1835 4919 4919 +168%
> >> nvme N Y N 4K 11.7k 46 11.7k 46 0%
> >> nvme N Y N 64K 6453 403 8661 541 +34%
> >> nvme N Y N 1M 649 649 1351 1351 +108%
> >> nvme Y N Y 4K 372k 1456 433k 1693 +16%
> >> nvme Y N Y 64K 33.0k 2064 74.7k 4669 +126%
> >> nvme Y N Y 1M 2131 2131 5273 5273 +147%
> >> nvme Y Y N 4K 56.7k 222 56.4k 220 -1%
> >> nvme Y Y N 64K 13.4k 840 19.4k 1214 +45%
> >> nvme Y Y N 1M 714 714 1504 1504 +111%
> >>
> >> Thanks,
> >> Yi.
> >>
> >> Major changes since RFC v4:
> >> - Disable unsupported online defragmentation, do not fall back to
> >> buffer_head path.
> >> - Write and wait data back while doing partial block truncate down to
> >> fix a stale data problem.
> >> - Disable the online changing of the inode journal flag to data=journal
> >> mode.
> >> - Since iomap can zero out dirty pages with unwritten extent, do not
> >> write data before zeroing out in ext4_zero_range(), and also do not
> >> zero partial blocks under a started journal handle.
> >>
> >> [1] https://lore.kernel.org/linux-ext4/20241010133333.146793-1-yi.zhang@huawei.com/
> >>
> >> ---
> >> RFC v4: https://lore.kernel.org/linux-ext4/20240410142948.2817554-1-yi.zhang@huaweicloud.com/
> >> RFC v3: https://lore.kernel.org/linux-ext4/20240127015825.1608160-1-yi.zhang@huaweicloud.com/
> >> RFC v2: https://lore.kernel.org/linux-ext4/20240102123918.799062-1-yi.zhang@huaweicloud.com/
> >> RFC v1: https://lore.kernel.org/linux-ext4/20231123125121.4064694-1-yi.zhang@huaweicloud.com/
> >>
> >>
> >> Zhang Yi (27):
> >> ext4: remove writable userspace mappings before truncating page cache
> >> ext4: don't explicit update times in ext4_fallocate()
> >> ext4: don't write back data before punch hole in nojournal mode
> >> ext4: refactor ext4_punch_hole()
> >> ext4: refactor ext4_zero_range()
> >> ext4: refactor ext4_collapse_range()
> >> ext4: refactor ext4_insert_range()
> >> ext4: factor out ext4_do_fallocate()
> >> ext4: move out inode_lock into ext4_fallocate()
> >> ext4: move out common parts into ext4_fallocate()
> >> ext4: use reserved metadata blocks when splitting extent on endio
> >> ext4: introduce seq counter for the extent status entry
> >> ext4: add a new iomap aops for regular file's buffered IO path
> >> ext4: implement buffered read iomap path
> >> ext4: implement buffered write iomap path
> >> ext4: don't order data for inode with EXT4_STATE_BUFFERED_IOMAP
> >> ext4: implement writeback iomap path
> >> ext4: implement mmap iomap path
> >> ext4: do not always order data when partial zeroing out a block
> >> ext4: do not start handle if unnecessary while partial zeroing out a
> >> block
> >> ext4: implement zero_range iomap path
> >> ext4: disable online defrag when inode using iomap buffered I/O path
> >> ext4: disable inode journal mode when using iomap buffered I/O path
> >> ext4: partially enable iomap for the buffered I/O path of regular
> >> files
> >> ext4: enable large folio for regular file with iomap buffered I/O path
> >> ext4: change mount options code style
> >> ext4: introduce a mount option for iomap buffered I/O path
> >>
> >> fs/ext4/ext4.h | 17 +-
> >> fs/ext4/ext4_jbd2.c | 3 +-
> >> fs/ext4/ext4_jbd2.h | 8 +
> >> fs/ext4/extents.c | 568 +++++++++++----------------
> >> fs/ext4/extents_status.c | 13 +-
> >> fs/ext4/file.c | 19 +-
> >> fs/ext4/ialloc.c | 5 +
> >> fs/ext4/inode.c | 755 ++++++++++++++++++++++++++++++------
> >> fs/ext4/move_extent.c | 7 +
> >> fs/ext4/page-io.c | 105 +++++
> >> fs/ext4/super.c | 185 ++++-----
> >> include/trace/events/ext4.h | 57 +--
> >> 12 files changed, 1153 insertions(+), 589 deletions(-)
> >>
> >> --
> >> 2.46.1
> >>
> >>
>
* Re: [PATCH 00/27] ext4: use iomap for regular file's buffered I/O path and enable large folio
2024-10-23 12:13 ` Sedat Dilek
@ 2024-10-24 7:44 ` Zhang Yi
0 siblings, 0 replies; 59+ messages in thread
From: Zhang Yi @ 2024-10-24 7:44 UTC (permalink / raw)
To: sedat.dilek
Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
jack, ritesh.list, hch, djwong, david, zokeefe, yi.zhang,
chengzhihao1, yukuai3, yangerkun
[-- Attachment #1: Type: text/plain, Size: 13530 bytes --]
On 2024/10/23 20:13, Sedat Dilek wrote:
> On Tue, Oct 22, 2024 at 11:22 AM Zhang Yi <yi.zhang@huaweicloud.com> wrote:
>>
>> On 2024/10/22 14:59, Sedat Dilek wrote:
>>> On Tue, Oct 22, 2024 at 5:13 AM Zhang Yi <yi.zhang@huaweicloud.com> wrote:
>>>>
>>>> From: Zhang Yi <yi.zhang@huawei.com>
>>>>
>>>> Hello!
>>>>
>>>> This patch series is the latest version based on my previous RFC
>>>> series[1], which converts the buffered I/O path of ext4 regular files to
>>>> iomap and enables large folios. After several months of work, almost all
>>>> preparatory changes have been upstreamed, thanks a lot for the review
>>>> and comments from Jan, Dave, Christoph, Darrick and Ritesh. Now it is
>>>> time for the main implementation of this conversion.
>>>>
>>>> This series is the main part of the iomap buffered I/O conversion. It is
>>>> based on 6.12-rc4, and the code context also depends on another
>>>> cleanup series of mine[1] (I've put that in this series so we can merge it
>>>> directly), and fixes all minor bugs found in my previous RFC v4 series.
>>>> Additionally, I've updated the change logs in each patch and also included
>>>> some code modifications following Dave's suggestions. This series implements
>>>> the core iomap APIs on ext4 and introduces a mount option called
>>>> "buffered_iomap" to enable the iomap buffered I/O path. We already
>>>> support the default features, default mount options and the bigalloc
>>>> feature. However, we do not yet support online defragmentation, inline
>>>> data, fs-verity, fscrypt, ext3, and data=journal mode; ext4 will fall
>>>> back to the buffer_head I/O path automatically if you use those features
>>>> and options. Some of these features should be supported gradually in the
>>>> near future.
>>>>
>>>> Most of the implementation resembles the original buffer_head path;
>>>> however, there are four key differences.
>>>>
>>>> 1. The first aspect is the block allocation in the writeback path. The
>>>> iomap framework will invoke ->map_blocks() at least once for each dirty
>>>> folio. To ensure optimal writeback performance, we aim to allocate a
>>>> range of delalloc blocks that is as long as possible within the
>>>> writeback length for each invocation. In certain situations, we may
>>>> allocate a range of blocks that exceeds the amount we will actually
>>>> write back. Therefore,
>>>> 1) we cannot allocate a written extent for those blocks because it may
>>>> expose stale data in such short write cases. Instead, we should
>>>> allocate an unwritten extent, which means we must always enable the
>>>> dioread_nolock option. This change could also bring many other
>>>> benefits.
>>>> 2) We should postpone updating the 'i_disksize' until the end of the I/O
>>>> process, based on the actual written length. This approach can also
>>>> prevent the exposure of zero data, which may occur if there is a
>>>> power failure during an append write.
>>>> 3) We do not need to pre-split extents during write-back, we can
>>>> postpone this task until the end of the I/O process while converting
>>>> unwritten extents.
>>>>
>>>> 2. The second reason is that since we always allocate unwritten space
>>>> for new blocks, there is no risk of exposing stale data. As a result,
>>>> we do not need to order the data, which allows us to disable the
>>>> data=ordered mode. Consequently, we also do not require the reserved
>>>> handle when converting the unwritten extent in the final I/O worker;
>>>> we can start directly with the normal handle.
>>>>
>>>> Series details:
>>>>
>>>> Patches 1-10 are just another series of mine that refactors the fallocate
>>>> functions[1]. This series relies on the code context of that series but has
>>>> no logical dependencies; I put it here just for easy access and merging.
>>>>
>>>> Patches 11-21 implement the iomap buffered read/write path, dirty folio
>>>> writeback path and mmap path for ext4 regular files.
>>>>
>>>> Patches 22-23 disable the unsupported online-defragmentation function and
>>>> disable the changing of the inode journal flag to data=journal mode.
>>>> Please look at the following patch for details.
>>>>
>>>> Patches 24-27 introduce the "buffered_iomap" mount option (not enabled by
>>>> default for now) to partially enable the iomap buffered I/O path and also
>>>> enable large folios.
>>>>
>>>>
>>>> About performance:
>>>>
>>>> FIO tests with psync on my machine with an Intel Xeon Gold 6240 CPU,
>>>> 400GB of system RAM, a 200GB ramdisk and a 4TB NVMe SSD.
>>>>
>>>> fio -directory=/mnt -direct=0 -iodepth=$iodepth -fsync=$sync -rw=$rw \
>>>> -numjobs=${numjobs} -bs=${bs} -ioengine=psync -size=$size \
>>>> -runtime=60 -norandommap=0 -fallocate=none -overwrite=$overwrite \
>>>> -group_reporting -name=$name --output=/tmp/test_log
>>>>
>>>
>>> Hi Zhang Yi,
>>>
>>> can you clarify about the FIO values for the diverse parameters?
>>>
>>
>> Hi Sedat,
>>
>> Sure, the test I present here is a simple single-thread and single-I/O
>> depth case with psync ioengine. Most of the FIO parameters are shown
>> in the tables below.
>>
>
> Hi Zhang Yi,
>
> Thanks for your reply.
>
> Can you share a FIO config file with all (relevant) settings?
> Maybe it is in the below link?
>
> Link: https://packages.debian.org/sid/all/fio-examples/filelist
No, I don't have such a configuration file; I simply wrote two straightforward
scripts to do this test. This serves as a reference, primarily used for
performance analysis in basic read/write operations with different backends.
More complex cases should be adjusted based on the actual circumstances.
I have attached the scripts, feel free to use them. I suggest adjusting the
parameters according to your machine configuration and service I/O model.
>
>> For the rest, the 'iodepth' and 'numjobs' are always set to 1 and the
>> 'size' is 40GB. During the write cache test, I also disable the write
>> back process through:
>>
>> echo 0 > /proc/sys/vm/dirty_writeback_centisecs
>> echo 100 > /proc/sys/vm/dirty_background_ratio
>> echo 100 > /proc/sys/vm/dirty_ratio
>>
>
> ^^ Is this info in one of the patches? If not, can you add it to the
> next version's cover letter?
>
> Are the patchset and its improvements only relevant for powerful servers,
> or does a notebook user benefit as well?
The performance improvement is primarily attributed to the cost savings in
the kernel software stack with large I/O. Therefore, performance should
improve whenever the CPU becomes a bottleneck; i.e., the faster the disk,
the more pronounced the benefit, regardless of whether the system is a
server or a notebook.
Thanks,
Yi.
> If you have benchmark data, please share it.
>
> I cannot promise that I will give this patchset a try.
>
> Best thanks.
>
> Best regards,
> -Sedat-
>
>> Thanks,
>> Yi.
>>
>>>
>>>> == buffer read ==
>>>>
>>>> buffer_head iomap + large folio
>>>> type bs IOPS BW(MiB/s) IOPS BW(MiB/s)
>>>> -------------------------------------------------------
>>>> hole 4K 576k 2253 762k 2975 +32%
>>>> hole 64K 48.7k 3043 77.8k 4860 +60%
>>>> hole 1M 2960 2960 4942 4942 +67%
>>>> ramdisk 4K 443k 1732 530k 2069 +19%
>>>> ramdisk 64K 34.5k 2156 45.6k 2850 +32%
>>>> ramdisk 1M 2093 2093 2841 2841 +36%
>>>> nvme 4K 339k 1323 364k 1425 +8%
>>>> nvme 64K 23.6k 1471 25.2k 1574 +7%
>>>> nvme 1M 2012 2012 2153 2153 +7%
>>>>
>>>>
>>>> == buffer write ==
>>>>
>>>> buffer_head iomap + large folio
>>>> type Overwrite Sync Writeback bs IOPS BW(MiB/s) IOPS BW(MiB/s)
>>>> ----------------------------------------------------------------------
>>>> cache N N N 4K 417k 1631 440k 1719 +5%
>>>> cache N N N 64K 33.4k 2088 81.5k 5092 +144%
>>>> cache N N N 1M 2143 2143 5716 5716 +167%
>>>> cache Y N N 4K 449k 1755 469k 1834 +5%
>>>> cache Y N N 64K 36.6k 2290 82.3k 5142 +125%
>>>> cache Y N N 1M 2352 2352 5577 5577 +137%
>>>> ramdisk N N Y 4K 365k 1424 354k 1384 -3%
>>>> ramdisk N N Y 64K 31.2k 1950 74.2k 4640 +138%
>>>> ramdisk N N Y 1M 1968 1968 5201 5201 +164%
>>>> ramdisk N Y N 4K 9984 39 12.9k 51 +29%
>>>> ramdisk N Y N 64K 5936 371 8960 560 +51%
>>>> ramdisk N Y N 1M 1050 1050 1835 1835 +75%
>>>> ramdisk Y N Y 4K 411k 1609 443k 1731 +8%
>>>> ramdisk Y N Y 64K 34.1k 2134 77.5k 4844 +127%
>>>> ramdisk Y N Y 1M 2248 2248 5372 5372 +139%
>>>> ramdisk Y Y N 4K 182k 711 186k 730 +3%
>>>> ramdisk Y Y N 64K 18.7k 1170 34.7k 2171 +86%
>>>> ramdisk Y Y N 1M 1229 1229 2269 2269 +85%
>>>> nvme N N Y 4K 373k 1458 387k 1512 +4%
>>>> nvme N N Y 64K 29.2k 1827 70.9k 4431 +143%
>>>> nvme N N Y 1M 1835 1835 4919 4919 +168%
>>>> nvme N Y N 4K 11.7k 46 11.7k 46 0%
>>>> nvme N Y N 64K 6453 403 8661 541 +34%
>>>> nvme N Y N 1M 649 649 1351 1351 +108%
>>>> nvme Y N Y 4K 372k 1456 433k 1693 +16%
>>>> nvme Y N Y 64K 33.0k 2064 74.7k 4669 +126%
>>>> nvme Y N Y 1M 2131 2131 5273 5273 +147%
>>>> nvme Y Y N 4K 56.7k 222 56.4k 220 -1%
>>>> nvme Y Y N 64K 13.4k 840 19.4k 1214 +45%
>>>> nvme Y Y N 1M 714 714 1504 1504 +111%
>>>>
>>>> Thanks,
>>>> Yi.
>>>>
>>>> Major changes since RFC v4:
>>>> - Disable unsupported online defragmentation, do not fall back to
>>>> buffer_head path.
>>>> - Write and wait data back while doing partial block truncate down to
>>>> fix a stale data problem.
>>>> - Disable the online changing of the inode journal flag to data=journal
>>>> mode.
>>>> - Since iomap can zero out dirty pages with unwritten extent, do not
>>>> write data before zeroing out in ext4_zero_range(), and also do not
>>>> zero partial blocks under a started journal handle.
>>>>
>>>> [1] https://lore.kernel.org/linux-ext4/20241010133333.146793-1-yi.zhang@huawei.com/
>>>>
>>>> ---
>>>> RFC v4: https://lore.kernel.org/linux-ext4/20240410142948.2817554-1-yi.zhang@huaweicloud.com/
>>>> RFC v3: https://lore.kernel.org/linux-ext4/20240127015825.1608160-1-yi.zhang@huaweicloud.com/
>>>> RFC v2: https://lore.kernel.org/linux-ext4/20240102123918.799062-1-yi.zhang@huaweicloud.com/
>>>> RFC v1: https://lore.kernel.org/linux-ext4/20231123125121.4064694-1-yi.zhang@huaweicloud.com/
>>>>
>>>>
>>>> Zhang Yi (27):
>>>> ext4: remove writable userspace mappings before truncating page cache
>>>> ext4: don't explicit update times in ext4_fallocate()
>>>> ext4: don't write back data before punch hole in nojournal mode
>>>> ext4: refactor ext4_punch_hole()
>>>> ext4: refactor ext4_zero_range()
>>>> ext4: refactor ext4_collapse_range()
>>>> ext4: refactor ext4_insert_range()
>>>> ext4: factor out ext4_do_fallocate()
>>>> ext4: move out inode_lock into ext4_fallocate()
>>>> ext4: move out common parts into ext4_fallocate()
>>>> ext4: use reserved metadata blocks when splitting extent on endio
>>>> ext4: introduce seq counter for the extent status entry
>>>> ext4: add a new iomap aops for regular file's buffered IO path
>>>> ext4: implement buffered read iomap path
>>>> ext4: implement buffered write iomap path
>>>> ext4: don't order data for inode with EXT4_STATE_BUFFERED_IOMAP
>>>> ext4: implement writeback iomap path
>>>> ext4: implement mmap iomap path
>>>> ext4: do not always order data when partial zeroing out a block
>>>> ext4: do not start handle if unnecessary while partial zeroing out a
>>>> block
>>>> ext4: implement zero_range iomap path
>>>> ext4: disable online defrag when inode using iomap buffered I/O path
>>>> ext4: disable inode journal mode when using iomap buffered I/O path
>>>> ext4: partially enable iomap for the buffered I/O path of regular
>>>> files
>>>> ext4: enable large folio for regular file with iomap buffered I/O path
>>>> ext4: change mount options code style
>>>> ext4: introduce a mount option for iomap buffered I/O path
>>>>
>>>> fs/ext4/ext4.h | 17 +-
>>>> fs/ext4/ext4_jbd2.c | 3 +-
>>>> fs/ext4/ext4_jbd2.h | 8 +
>>>> fs/ext4/extents.c | 568 +++++++++++----------------
>>>> fs/ext4/extents_status.c | 13 +-
>>>> fs/ext4/file.c | 19 +-
>>>> fs/ext4/ialloc.c | 5 +
>>>> fs/ext4/inode.c | 755 ++++++++++++++++++++++++++++++------
>>>> fs/ext4/move_extent.c | 7 +
>>>> fs/ext4/page-io.c | 105 +++++
>>>> fs/ext4/super.c | 185 ++++-----
>>>> include/trace/events/ext4.h | 57 +--
>>>> 12 files changed, 1153 insertions(+), 589 deletions(-)
>>>>
>>>> --
>>>> 2.46.1
>>>>
>>>>
>>
[-- Attachment #2: ext4_iomap_test_read.sh --]
[-- Type: text/plain, Size: 2451 bytes --]
#!/bin/bash

ramdev=$1
nvmedev=$2
MOUNT_OPT=""
test_size=40G

function run_fio()
{
	local rw=read
	local sync=$1
	local bs=$2
	local iodepth=$3
	local numjobs=$4
	local overwrite=$5
	local name=1
	local size=$6

	fio -directory=/mnt -direct=0 -iodepth=$iodepth -fsync=$sync -rw=$rw \
	    -numjobs=${numjobs} -bs=${bs} -ioengine=psync -size=$size \
	    -runtime=60 -norandommap=0 -fallocate=none -overwrite=$overwrite \
	    -group_reporting -name=$name --output=/tmp/log
	cat /tmp/log >> /tmp/fio_result
}

function init_env()
{
	local hole=$1
	local size=$2
	local dev=$3

	rm -rf /mnt/*
	if [[ "$hole" == "1" ]]; then
		truncate -s $size /mnt/1.0.0
	else
		xfs_io -f -c "pwrite 0 $size" /mnt/1.0.0
	fi
	umount /mnt
	mount -o $MOUNT_OPT $dev /mnt
}

function reset_env()
{
	local dev=$1

	umount /mnt
	mount -o $MOUNT_OPT $dev /mnt
}

function do_one_test()
{
	local sync=0
	local hole=$1
	local size=$2
	local dev=$3

	echo "-------------------" | tee -a /tmp/fio_result
	echo "=== 4K:" | tee -a /tmp/fio_result
	reset_env $dev
	run_fio $sync 4k 1 1 0 $size
	echo "=== 64K:" | tee -a /tmp/fio_result
	reset_env $dev
	run_fio $sync 64k 1 1 0 $size
	echo "=== 1M:" | tee -a /tmp/fio_result
	reset_env $dev
	run_fio $sync 1M 1 1 0 $size
	echo "-------------------" | tee -a /tmp/fio_result
}

function run_one_round()
{
	local hole=$1
	local size=$2
	local dev=$3

	init_env $hole $size $dev
	do_one_test $hole $size $dev
}

function run_test()
{
	echo "---- TEST RAMDEV ----" | tee -a /tmp/fio_result
	mount -o $MOUNT_OPT $ramdev /mnt
	echo "----- 1. READ HOLE" | tee -a /tmp/fio_result
	run_one_round 1 $test_size $ramdev
	echo "----- 2. READ RAM DATA" | tee -a /tmp/fio_result
	run_one_round 0 $test_size $ramdev
	umount /mnt

	echo "---- TEST NVMEDEV ----" | tee -a /tmp/fio_result
	echo "----- 3. READ NVME DATA" | tee -a /tmp/fio_result
	mount -o $MOUNT_OPT $nvmedev /mnt
	run_one_round 0 $test_size $nvmedev
	umount /mnt
}

if [ -z "$ramdev" ] || [ -z "$nvmedev" ]; then
	echo "$0 <ramdev> <nvmedev>"
	exit
fi

umount /mnt
mkfs.ext4 -E lazy_itable_init=0,lazy_journal_init=0 -F $ramdev
mkfs.ext4 -E lazy_itable_init=0,lazy_journal_init=0 -F $nvmedev
cp /tmp/fio_result /tmp/fio_result.old
rm -f /tmp/fio_result

## TEST base ramdev
echo "==== TEST BASE ====" | tee -a /tmp/fio_result
MOUNT_OPT="nobuffered_iomap"
run_test

## TEST iomap ramdev
echo "==== TEST IOMAP ====" | tee -a /tmp/fio_result
MOUNT_OPT="buffered_iomap"
run_test
[-- Attachment #3: ext4_iomap_test_write.sh --]
[-- Type: text/plain, Size: 3203 bytes --]
#!/bin/bash

ramdev=$1
nvmedev=$2
MOUNT_OPT=""
test_size=40G

function run_fio()
{
	local rw=write
	local sync=$1
	local bs=$2
	local iodepth=$3
	local numjobs=$4
	local overwrite=$5
	local name=1
	local size=$6

	fio -directory=/mnt -direct=0 -iodepth=$iodepth -fsync=$sync -rw=$rw \
	    -numjobs=${numjobs} -bs=${bs} -ioengine=psync -size=$size \
	    -runtime=60 -norandommap=0 -fallocate=none -overwrite=$overwrite \
	    -group_reporting -name=$name --output=/tmp/log
	cat /tmp/log >> /tmp/fio_result
}

function init_env()
{
	local dev=$1

	rm -rf /mnt/*
	umount /mnt
	mount -o $MOUNT_OPT $dev /mnt
}

function reset_env()
{
	local overwrite=$1
	local dev=$2

	if [[ "$overwrite" == "0" ]]; then
		rm -rf /mnt/*
	fi
	umount /mnt
	mount -o $MOUNT_OPT $dev /mnt
}

function do_one_test()
{
	local sync=$1
	local overwrite=$2
	local size=$3
	local dev=$4

	echo "-------------------" | tee -a /tmp/fio_result
	echo "=== 4K:" | tee -a /tmp/fio_result
	reset_env $overwrite $dev
	run_fio $sync 4k 1 1 $overwrite $size
	echo "=== 64K:" | tee -a /tmp/fio_result
	reset_env $overwrite $dev
	run_fio $sync 64k 1 1 $overwrite $size
	echo "=== 1M:" | tee -a /tmp/fio_result
	reset_env $overwrite $dev
	run_fio $sync 1M 1 1 $overwrite $size
	echo "-------------------" | tee -a /tmp/fio_result
}

function run_one_round()
{
	local sync=$1
	local overwrite=$2
	local size=$3
	local dev=$4

	echo "Sync:$sync, Overwrite:$overwrite" | tee -a /tmp/fio_result
	init_env $dev
	do_one_test $sync $overwrite $size $dev
}

function run_test()
{
	echo "---- TEST RAMDEV ----" | tee -a /tmp/fio_result
	mount -o $MOUNT_OPT $ramdev /mnt
	echo "----- 1. WRITE CACHE" | tee -a /tmp/fio_result
	# Stop writeback
	echo 0 > /proc/sys/vm/dirty_writeback_centisecs
	echo 30000 > /proc/sys/vm/dirty_expire_centisecs
	echo 100 > /proc/sys/vm/dirty_background_ratio
	echo 100 > /proc/sys/vm/dirty_ratio
	run_one_round 0 0 $test_size $ramdev
	run_one_round 0 1 $test_size $ramdev

	echo "----- 2. WRITE RAM DISK" | tee -a /tmp/fio_result
	# Restore writeback
	echo 500 > /proc/sys/vm/dirty_writeback_centisecs
	echo 3000 > /proc/sys/vm/dirty_expire_centisecs
	echo 10 > /proc/sys/vm/dirty_background_ratio
	echo 20 > /proc/sys/vm/dirty_ratio
	run_one_round 0 0 $test_size $ramdev
	run_one_round 0 1 $test_size $ramdev
	run_one_round 1 0 $test_size $ramdev
	run_one_round 1 1 $test_size $ramdev
	umount /mnt

	echo "---- TEST NVMEDEV ----" | tee -a /tmp/fio_result
	echo "----- 3. WRITE NVME DISK" | tee -a /tmp/fio_result
	mount -o $MOUNT_OPT $nvmedev /mnt
	run_one_round 0 0 $test_size $nvmedev
	run_one_round 0 1 $test_size $nvmedev
	run_one_round 1 0 $test_size $nvmedev
	run_one_round 1 1 $test_size $nvmedev
	umount /mnt
}

if [ -z "$ramdev" ] || [ -z "$nvmedev" ]; then
	echo "$0 <ramdev> <nvmedev>"
	exit
fi

umount /mnt
mkfs.ext4 -E lazy_itable_init=0,lazy_journal_init=0 -F $ramdev
mkfs.ext4 -E lazy_itable_init=0,lazy_journal_init=0 -F $nvmedev
cp /tmp/fio_result /tmp/fio_result.old
rm -f /tmp/fio_result

## TEST base
echo "==== TEST BASE ====" | tee -a /tmp/fio_result
MOUNT_OPT="nobuffered_iomap"
run_test

## TEST iomap
echo "==== TEST IOMAP ====" | tee -a /tmp/fio_result
MOUNT_OPT="buffered_iomap"
run_test
* Re: [PATCH 03/27] ext4: don't write back data before punch hole in nojournal mode
2024-10-22 11:10 ` [PATCH 03/27] ext4: don't write back data before punch hole in nojournal mode Zhang Yi
@ 2024-11-18 23:15 ` Darrick J. Wong
2024-11-20 2:56 ` Zhang Yi
2024-12-04 11:27 ` Jan Kara
1 sibling, 1 reply; 59+ messages in thread
From: Darrick J. Wong @ 2024-11-18 23:15 UTC (permalink / raw)
To: Zhang Yi
Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
jack, ritesh.list, hch, david, zokeefe, yi.zhang, chengzhihao1,
yukuai3, yangerkun
On Tue, Oct 22, 2024 at 07:10:34PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
>
> There is no need to write back all data before punching a hole in
> data=ordered|writeback mode since it will be dropped soon after removing
> space, so just remove the filemap_write_and_wait_range() in these modes.
> However, in data=journal mode, we need to write dirty pages out before
> discarding page cache in case of crash before committing the freeing
> data transaction, which could expose old, stale data.
Can't the same thing happen with non-journaled data writes?
Say you write 1GB of "A"s to a file and fsync. Then you write "B"s to
the same 1GB of file and immediately start punching it. If the system
reboots before the mapping updates all get written to disk, won't you
risk seeing some of those "A"s because we no longer flush the "B"s?
Also, since the program didn't explicitly fsync the Bs, why bother
flushing the dirty data at all? Are data=journal writes supposed to be
synchronous flushing writes nowadays?
--D
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> ---
> fs/ext4/inode.c | 26 +++++++++++++++-----------
> 1 file changed, 15 insertions(+), 11 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index f8796f7b0f94..94b923afcd9c 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3965,17 +3965,6 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
>
> trace_ext4_punch_hole(inode, offset, length, 0);
>
> - /*
> - * Write out all dirty pages to avoid race conditions
> - * Then release them.
> - */
> - if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
> - ret = filemap_write_and_wait_range(mapping, offset,
> - offset + length - 1);
> - if (ret)
> - return ret;
> - }
> -
> inode_lock(inode);
>
> /* No need to punch hole beyond i_size */
> @@ -4037,6 +4026,21 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
> ret = ext4_update_disksize_before_punch(inode, offset, length);
> if (ret)
> goto out_dio;
> +
> + /*
> + * For journalled data we need to write (and checkpoint) pages
> + * before discarding page cache to avoid inconsitent data on
> + * disk in case of crash before punching trans is committed.
> + */
> + if (ext4_should_journal_data(inode)) {
> + ret = filemap_write_and_wait_range(mapping,
> + first_block_offset, last_block_offset);
> + if (ret)
> + goto out_dio;
> + }
> +
> + ext4_truncate_folios_range(inode, first_block_offset,
> + last_block_offset + 1);
> truncate_pagecache_range(inode, first_block_offset,
> last_block_offset);
> }
> --
> 2.46.1
>
>
* Re: [PATCH 04/27] ext4: refactor ext4_punch_hole()
2024-10-22 11:10 ` [PATCH 04/27] ext4: refactor ext4_punch_hole() Zhang Yi
@ 2024-11-18 23:27 ` Darrick J. Wong
2024-11-20 3:18 ` Zhang Yi
2024-12-04 11:36 ` Jan Kara
1 sibling, 1 reply; 59+ messages in thread
From: Darrick J. Wong @ 2024-11-18 23:27 UTC (permalink / raw)
To: Zhang Yi
Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
jack, ritesh.list, hch, david, zokeefe, yi.zhang, chengzhihao1,
yukuai3, yangerkun
On Tue, Oct 22, 2024 at 07:10:35PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
>
> The current implementation of ext4_punch_hole() contains complex
> position calculations and stale error tags. To improve the code's
> clarity and maintainability, clean up the code and improve its
> readability. This can be achieved by: a) simplifying and renaming
> variables; b) eliminating unnecessary position calculations;
> c) writing back all data in data=journal mode, and dropping the page
> cache from the original offset to the end, rather than using aligned
> blocks; d) renaming the stale error tags.
>
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> ---
> fs/ext4/inode.c | 140 +++++++++++++++++++++---------------------------
> 1 file changed, 62 insertions(+), 78 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 94b923afcd9c..1d128333bd06 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3955,13 +3955,14 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
> {
> struct inode *inode = file_inode(file);
> struct super_block *sb = inode->i_sb;
> - ext4_lblk_t first_block, stop_block;
> + ext4_lblk_t start_lblk, end_lblk;
> struct address_space *mapping = inode->i_mapping;
> - loff_t first_block_offset, last_block_offset, max_length;
> - struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
> + loff_t max_end = EXT4_SB(sb)->s_bitmap_maxbytes - sb->s_blocksize;
> + loff_t end = offset + length;
> + unsigned long blocksize = i_blocksize(inode);
> handle_t *handle;
> unsigned int credits;
> - int ret = 0, ret2 = 0;
> + int ret = 0;
>
> trace_ext4_punch_hole(inode, offset, length, 0);
>
> @@ -3969,36 +3970,27 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
>
> /* No need to punch hole beyond i_size */
> if (offset >= inode->i_size)
> - goto out_mutex;
> + goto out;
>
> /*
> - * If the hole extends beyond i_size, set the hole
> - * to end after the page that contains i_size
> + * If the hole extends beyond i_size, set the hole to end after
> + * the page that contains i_size, and also make sure that the hole
> + * within one block before last range.
> */
> - if (offset + length > inode->i_size) {
> - length = inode->i_size +
> - PAGE_SIZE - (inode->i_size & (PAGE_SIZE - 1)) -
> - offset;
> - }
> + if (end > inode->i_size)
> + end = round_up(inode->i_size, PAGE_SIZE);
> + if (end > max_end)
> + end = max_end;
> + length = end - offset;
>
> /*
> - * For punch hole the length + offset needs to be within one block
> - * before last range. Adjust the length if it goes beyond that limit.
> + * Attach jinode to inode for jbd2 if we do any zeroing of partial
> + * block.
> */
> - max_length = sbi->s_bitmap_maxbytes - inode->i_sb->s_blocksize;
> - if (offset + length > max_length)
> - length = max_length - offset;
> -
> - if (offset & (sb->s_blocksize - 1) ||
> - (offset + length) & (sb->s_blocksize - 1)) {
> - /*
> - * Attach jinode to inode for jbd2 if we do any zeroing of
> - * partial block
> - */
> + if (offset & (blocksize - 1) || end & (blocksize - 1)) {
IS_ALIGNED(offset | end, blocksize) ?
> ret = ext4_inode_attach_jinode(inode);
> if (ret < 0)
> - goto out_mutex;
> -
> + goto out;
> }
>
> /* Wait all existing dio workers, newcomers will block on i_rwsem */
> @@ -4006,7 +3998,7 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
>
> ret = file_modified(file);
> if (ret)
> - goto out_mutex;
> + goto out;
>
> /*
> * Prevent page faults from reinstantiating pages we have released from
> @@ -4016,34 +4008,24 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
>
> ret = ext4_break_layouts(inode);
> if (ret)
> - goto out_dio;
> -
> - first_block_offset = round_up(offset, sb->s_blocksize);
> - last_block_offset = round_down((offset + length), sb->s_blocksize) - 1;
> + goto out_invalidate_lock;
>
> - /* Now release the pages and zero block aligned part of pages*/
> - if (last_block_offset > first_block_offset) {
> + /*
> + * For journalled data we need to write (and checkpoint) pages
> + * before discarding page cache to avoid inconsitent data on
inconsistent
> + * disk in case of crash before punching trans is committed.
> + */
> + if (ext4_should_journal_data(inode)) {
> + ret = filemap_write_and_wait_range(mapping, offset, end - 1);
> + } else {
> ret = ext4_update_disksize_before_punch(inode, offset, length);
> - if (ret)
> - goto out_dio;
> -
> - /*
> - * For journalled data we need to write (and checkpoint) pages
> - * before discarding page cache to avoid inconsitent data on
> - * disk in case of crash before punching trans is committed.
> - */
> - if (ext4_should_journal_data(inode)) {
> - ret = filemap_write_and_wait_range(mapping,
> - first_block_offset, last_block_offset);
> - if (ret)
> - goto out_dio;
> - }
> -
> - ext4_truncate_folios_range(inode, first_block_offset,
> - last_block_offset + 1);
> - truncate_pagecache_range(inode, first_block_offset,
> - last_block_offset);
> + ext4_truncate_folios_range(inode, offset, end);
> }
> + if (ret)
> + goto out_invalidate_lock;
> +
> + /* Now release the pages and zero block aligned part of pages*/
> + truncate_pagecache_range(inode, offset, end - 1);
>
> if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
> credits = ext4_writepage_trans_blocks(inode);
> @@ -4053,52 +4035,54 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
> if (IS_ERR(handle)) {
> ret = PTR_ERR(handle);
> ext4_std_error(sb, ret);
> - goto out_dio;
> + goto out_invalidate_lock;
> }
>
> - ret = ext4_zero_partial_blocks(handle, inode, offset,
> - length);
> + ret = ext4_zero_partial_blocks(handle, inode, offset, length);
> if (ret)
> - goto out_stop;
> -
> - first_block = (offset + sb->s_blocksize - 1) >>
> - EXT4_BLOCK_SIZE_BITS(sb);
> - stop_block = (offset + length) >> EXT4_BLOCK_SIZE_BITS(sb);
> + goto out_handle;
>
> /* If there are blocks to remove, do it */
> - if (stop_block > first_block) {
> - ext4_lblk_t hole_len = stop_block - first_block;
> + start_lblk = round_up(offset, blocksize) >> inode->i_blkbits;
egad I wish ext4 had nicer unit conversion helpers.
static inline ext4_lblk_t
EXT4_B_TO_LBLK(struct ext4_sb_info *sbi, ..., loff_t offset)
{
return round_up(offset, blocksize) >> inode->i_blkbits;
}
start_lblk = EXT4_B_TO_LBLK(sbi, offset);
ah well.
> + end_lblk = end >> inode->i_blkbits;
> +
> + if (end_lblk > start_lblk) {
> + ext4_lblk_t hole_len = end_lblk - start_lblk;
>
> down_write(&EXT4_I(inode)->i_data_sem);
> ext4_discard_preallocations(inode);
>
> - ext4_es_remove_extent(inode, first_block, hole_len);
> + ext4_es_remove_extent(inode, start_lblk, hole_len);
>
> if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
> - ret = ext4_ext_remove_space(inode, first_block,
> - stop_block - 1);
> + ret = ext4_ext_remove_space(inode, start_lblk,
> + end_lblk - 1);
> else
> - ret = ext4_ind_remove_space(handle, inode, first_block,
> - stop_block);
> + ret = ext4_ind_remove_space(handle, inode, start_lblk,
> + end_lblk);
> + if (ret) {
> + up_write(&EXT4_I(inode)->i_data_sem);
> + goto out_handle;
> + }
>
> - ext4_es_insert_extent(inode, first_block, hole_len, ~0,
> + ext4_es_insert_extent(inode, start_lblk, hole_len, ~0,
> EXTENT_STATUS_HOLE, 0);
> up_write(&EXT4_I(inode)->i_data_sem);
> }
> - ext4_fc_track_range(handle, inode, first_block, stop_block);
> + ext4_fc_track_range(handle, inode, start_lblk, end_lblk);
> +
> + ret = ext4_mark_inode_dirty(handle, inode);
> + if (unlikely(ret))
> + goto out_handle;
> +
> + ext4_update_inode_fsync_trans(handle, inode, 1);
> if (IS_SYNC(inode))
> ext4_handle_sync(handle);
> -
> - ret2 = ext4_mark_inode_dirty(handle, inode);
> - if (unlikely(ret2))
> - ret = ret2;
> - if (ret >= 0)
> - ext4_update_inode_fsync_trans(handle, inode, 1);
> -out_stop:
> +out_handle:
> ext4_journal_stop(handle);
> -out_dio:
> +out_invalidate_lock:
> filemap_invalidate_unlock(mapping);
> -out_mutex:
> +out:
Why drop "_mutex"? You're unlocking *something* on the way out.
--D
> inode_unlock(inode);
> return ret;
> }
> --
> 2.46.1
>
>
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 03/27] ext4: don't write back data before punch hole in nojournal mode
2024-11-18 23:15 ` Darrick J. Wong
@ 2024-11-20 2:56 ` Zhang Yi
2024-12-04 11:26 ` Jan Kara
0 siblings, 1 reply; 59+ messages in thread
From: Zhang Yi @ 2024-11-20 2:56 UTC (permalink / raw)
To: Darrick J. Wong
Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
jack, ritesh.list, hch, david, zokeefe, yi.zhang, chengzhihao1,
yukuai3, yangerkun
On 2024/11/19 7:15, Darrick J. Wong wrote:
> On Tue, Oct 22, 2024 at 07:10:34PM +0800, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@huawei.com>
>>
>> There is no need to write back all data before punching a hole in
>> data=ordered|writeback mode since it will be dropped soon after removing
>> space, so just remove the filemap_write_and_wait_range() in these modes.
>> However, in data=journal mode, we need to write dirty pages out before
>> discarding page cache in case of crash before committing the freeing
>> data transaction, which could expose old, stale data.
>
> Can't the same thing happen with non-journaled data writes?
>
> Say you write 1GB of "A"s to a file and fsync. Then you write "B"s to
> the same 1GB of file and immediately start punching it. If the system
> reboots before the mapping updates all get written to disk, won't you
> risk seeing some of those "A" because we no longer flush the "B"s?
>
> Also, since the program didn't explicitly fsync the Bs, why bother
> flushing the dirty data at all? Are data=journal writes supposed to be
> synchronous flushing writes nowadays?
Thank you for your reply.
This case is not exactly the problem that can occur in data=journal
mode, the problem is even if we fsync "B"s before punching the hole, we
may still encounter old data ("A"s or even older) if the system reboots
before the hole-punching process is completed.
The details of this problem are that ext4_punch_hole()->
truncate_pagecache_range()-> ..->journal_unmap_buffer() will drop the
checkpoint transaction, which may contain B's journaled data. Consequently,
the journal tail could advance beyond this point. If we do not flush
the data before dropping the cache and a crash occurs before the punching
transaction is committed, B's transaction will never be recovered, resulting
in the loss of B's data. Therefore, this cannot happen in non-journaled
data writes.
This flush logic is copied from ext4_zero_range() since it has the same
problem, Jan added it in commit 783ae448b7a2 ("ext4: Fix special handling
of journalled data from extent zeroing"), please see it for more details.
Jan, please correct me if my understanding is incorrect.
Thanks,
Yi.
>
> --D
>
>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>> ---
>> fs/ext4/inode.c | 26 +++++++++++++++-----------
>> 1 file changed, 15 insertions(+), 11 deletions(-)
>>
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index f8796f7b0f94..94b923afcd9c 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -3965,17 +3965,6 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
>>
>> trace_ext4_punch_hole(inode, offset, length, 0);
>>
>> - /*
>> - * Write out all dirty pages to avoid race conditions
>> - * Then release them.
>> - */
>> - if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
>> - ret = filemap_write_and_wait_range(mapping, offset,
>> - offset + length - 1);
>> - if (ret)
>> - return ret;
>> - }
>> -
>> inode_lock(inode);
>>
>> /* No need to punch hole beyond i_size */
>> @@ -4037,6 +4026,21 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
>> ret = ext4_update_disksize_before_punch(inode, offset, length);
>> if (ret)
>> goto out_dio;
>> +
>> + /*
>> + * For journalled data we need to write (and checkpoint) pages
>> + * before discarding page cache to avoid inconsitent data on
>> + * disk in case of crash before punching trans is committed.
>> + */
>> + if (ext4_should_journal_data(inode)) {
>> + ret = filemap_write_and_wait_range(mapping,
>> + first_block_offset, last_block_offset);
>> + if (ret)
>> + goto out_dio;
>> + }
>> +
>> + ext4_truncate_folios_range(inode, first_block_offset,
>> + last_block_offset + 1);
>> truncate_pagecache_range(inode, first_block_offset,
>> last_block_offset);
>> }
>> --
>> 2.46.1
>>
>>
* Re: [PATCH 04/27] ext4: refactor ext4_punch_hole()
2024-11-18 23:27 ` Darrick J. Wong
@ 2024-11-20 3:18 ` Zhang Yi
0 siblings, 0 replies; 59+ messages in thread
From: Zhang Yi @ 2024-11-20 3:18 UTC (permalink / raw)
To: Darrick J. Wong
Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
jack, ritesh.list, hch, david, zokeefe, yi.zhang, chengzhihao1,
yukuai3, yangerkun
On 2024/11/19 7:27, Darrick J. Wong wrote:
> On Tue, Oct 22, 2024 at 07:10:35PM +0800, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@huawei.com>
>>
>> The current implementation of ext4_punch_hole() contains complex
>> position calculations and stale error tags. To improve the code's
>> clarity and maintainability, it is essential to clean up the code and
>> improve its readability, this can be achieved by: a) simplifying and
>> renaming variables; b) eliminating unnecessary position calculations;
>> c) writing back all data in data=journal mode, and drop page cache from
>> the original offset to the end, rather than using aligned blocks,
>> d) renaming the stale error tags.
>>
>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>> ---
>> fs/ext4/inode.c | 140 +++++++++++++++++++++---------------------------
>> 1 file changed, 62 insertions(+), 78 deletions(-)
>>
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index 94b923afcd9c..1d128333bd06 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -3955,13 +3955,14 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
>> {
>> struct inode *inode = file_inode(file);
>> struct super_block *sb = inode->i_sb;
>> - ext4_lblk_t first_block, stop_block;
>> + ext4_lblk_t start_lblk, end_lblk;
>> struct address_space *mapping = inode->i_mapping;
>> - loff_t first_block_offset, last_block_offset, max_length;
>> - struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
>> + loff_t max_end = EXT4_SB(sb)->s_bitmap_maxbytes - sb->s_blocksize;
>> + loff_t end = offset + length;
>> + unsigned long blocksize = i_blocksize(inode);
>> handle_t *handle;
>> unsigned int credits;
>> - int ret = 0, ret2 = 0;
>> + int ret = 0;
>>
>> trace_ext4_punch_hole(inode, offset, length, 0);
>>
>> @@ -3969,36 +3970,27 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
>>
>> /* No need to punch hole beyond i_size */
>> if (offset >= inode->i_size)
>> - goto out_mutex;
>> + goto out;
>>
>> /*
>> - * If the hole extends beyond i_size, set the hole
>> - * to end after the page that contains i_size
>> + * If the hole extends beyond i_size, set the hole to end after
>> + * the page that contains i_size, and also make sure that the hole
>> + * within one block before last range.
>> */
>> - if (offset + length > inode->i_size) {
>> - length = inode->i_size +
>> - PAGE_SIZE - (inode->i_size & (PAGE_SIZE - 1)) -
>> - offset;
>> - }
>> + if (end > inode->i_size)
>> + end = round_up(inode->i_size, PAGE_SIZE);
>> + if (end > max_end)
>> + end = max_end;
>> + length = end - offset;
>>
>> /*
>> - * For punch hole the length + offset needs to be within one block
>> - * before last range. Adjust the length if it goes beyond that limit.
>> + * Attach jinode to inode for jbd2 if we do any zeroing of partial
>> + * block.
>> */
>> - max_length = sbi->s_bitmap_maxbytes - inode->i_sb->s_blocksize;
>> - if (offset + length > max_length)
>> - length = max_length - offset;
>> -
>> - if (offset & (sb->s_blocksize - 1) ||
>> - (offset + length) & (sb->s_blocksize - 1)) {
>> - /*
>> - * Attach jinode to inode for jbd2 if we do any zeroing of
>> - * partial block
>> - */
>> + if (offset & (blocksize - 1) || end & (blocksize - 1)) {
>
> IS_ALIGNED(offset | end, blocksize) ?
Right, this helper looks better, thanks for pointing this out.
>
>> ret = ext4_inode_attach_jinode(inode);
>> if (ret < 0)
>> - goto out_mutex;
>> -
>> + goto out;
>> }
>>
>> /* Wait all existing dio workers, newcomers will block on i_rwsem */
>> @@ -4006,7 +3998,7 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
>>
>> ret = file_modified(file);
>> if (ret)
>> - goto out_mutex;
>> + goto out;
>>
>> /*
>> * Prevent page faults from reinstantiating pages we have released from
>> @@ -4016,34 +4008,24 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
>>
>> ret = ext4_break_layouts(inode);
>> if (ret)
>> - goto out_dio;
>> -
>> - first_block_offset = round_up(offset, sb->s_blocksize);
>> - last_block_offset = round_down((offset + length), sb->s_blocksize) - 1;
>> + goto out_invalidate_lock;
>>
>> - /* Now release the pages and zero block aligned part of pages*/
>> - if (last_block_offset > first_block_offset) {
>> + /*
>> + * For journalled data we need to write (and checkpoint) pages
>> + * before discarding page cache to avoid inconsitent data on
>
> inconsistent
Yeah.
>
>> + * disk in case of crash before punching trans is committed.
>> + */
>> + if (ext4_should_journal_data(inode)) {
>> + ret = filemap_write_and_wait_range(mapping, offset, end - 1);
>> + } else {
>> ret = ext4_update_disksize_before_punch(inode, offset, length);
>> - if (ret)
>> - goto out_dio;
>> -
>> - /*
>> - * For journalled data we need to write (and checkpoint) pages
>> - * before discarding page cache to avoid inconsitent data on
>> - * disk in case of crash before punching trans is committed.
>> - */
>> - if (ext4_should_journal_data(inode)) {
>> - ret = filemap_write_and_wait_range(mapping,
>> - first_block_offset, last_block_offset);
>> - if (ret)
>> - goto out_dio;
>> - }
>> -
>> - ext4_truncate_folios_range(inode, first_block_offset,
>> - last_block_offset + 1);
>> - truncate_pagecache_range(inode, first_block_offset,
>> - last_block_offset);
>> + ext4_truncate_folios_range(inode, offset, end);
>> }
>> + if (ret)
>> + goto out_invalidate_lock;
>> +
>> + /* Now release the pages and zero block aligned part of pages*/
>> + truncate_pagecache_range(inode, offset, end - 1);
>>
>> if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
>> credits = ext4_writepage_trans_blocks(inode);
>> @@ -4053,52 +4035,54 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
>> if (IS_ERR(handle)) {
>> ret = PTR_ERR(handle);
>> ext4_std_error(sb, ret);
>> - goto out_dio;
>> + goto out_invalidate_lock;
>> }
>>
>> - ret = ext4_zero_partial_blocks(handle, inode, offset,
>> - length);
>> + ret = ext4_zero_partial_blocks(handle, inode, offset, length);
>> if (ret)
>> - goto out_stop;
>> -
>> - first_block = (offset + sb->s_blocksize - 1) >>
>> - EXT4_BLOCK_SIZE_BITS(sb);
>> - stop_block = (offset + length) >> EXT4_BLOCK_SIZE_BITS(sb);
>> + goto out_handle;
>>
>> /* If there are blocks to remove, do it */
>> - if (stop_block > first_block) {
>> - ext4_lblk_t hole_len = stop_block - first_block;
>> + start_lblk = round_up(offset, blocksize) >> inode->i_blkbits;
>
> egad I wish ext4 had nicer unit conversion helpers.
>
> static inline ext4_lblk_t
> EXT4_B_TO_LBLK(struct ext4_sb_info *sbi, ..., loff_t offset)
> {
> return round_up(offset, blocksize) >> inode->i_blkbits;
> }
>
> start_lblk = EXT4_B_TO_LBLK(sbi, offset);
>
> ah well.
>
Sure, it looks clearer.
>> + end_lblk = end >> inode->i_blkbits;
>> +
>> + if (end_lblk > start_lblk) {
>> + ext4_lblk_t hole_len = end_lblk - start_lblk;
>>
>> down_write(&EXT4_I(inode)->i_data_sem);
>> ext4_discard_preallocations(inode);
>>
>> - ext4_es_remove_extent(inode, first_block, hole_len);
>> + ext4_es_remove_extent(inode, start_lblk, hole_len);
>>
>> if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
>> - ret = ext4_ext_remove_space(inode, first_block,
>> - stop_block - 1);
>> + ret = ext4_ext_remove_space(inode, start_lblk,
>> + end_lblk - 1);
>> else
>> - ret = ext4_ind_remove_space(handle, inode, first_block,
>> - stop_block);
>> + ret = ext4_ind_remove_space(handle, inode, start_lblk,
>> + end_lblk);
>> + if (ret) {
>> + up_write(&EXT4_I(inode)->i_data_sem);
>> + goto out_handle;
>> + }
>>
>> - ext4_es_insert_extent(inode, first_block, hole_len, ~0,
>> + ext4_es_insert_extent(inode, start_lblk, hole_len, ~0,
>> EXTENT_STATUS_HOLE, 0);
>> up_write(&EXT4_I(inode)->i_data_sem);
>> }
>> - ext4_fc_track_range(handle, inode, first_block, stop_block);
>> + ext4_fc_track_range(handle, inode, start_lblk, end_lblk);
>> +
>> + ret = ext4_mark_inode_dirty(handle, inode);
>> + if (unlikely(ret))
>> + goto out_handle;
>> +
>> + ext4_update_inode_fsync_trans(handle, inode, 1);
>> if (IS_SYNC(inode))
>> ext4_handle_sync(handle);
>> -
>> - ret2 = ext4_mark_inode_dirty(handle, inode);
>> - if (unlikely(ret2))
>> - ret = ret2;
>> - if (ret >= 0)
>> - ext4_update_inode_fsync_trans(handle, inode, 1);
>> -out_stop:
>> +out_handle:
>> ext4_journal_stop(handle);
>> -out_dio:
>> +out_invalidate_lock:
>> filemap_invalidate_unlock(mapping);
>> -out_mutex:
>> +out:
>
> Why drop "_mutex"? You're unlocking *something* on the way out.
>
"_mutex" is no longer accurate, as the inode lock has been changed to an rwsem.
But never mind, this "out" tag is also removed in patch 9.
Thanks,
Yi.
>
>> inode_unlock(inode);
>> return ret;
>> }
>> --
>> 2.46.1
>>
>>
* Re: [PATCH 01/27] ext4: remove writable userspace mappings before truncating page cache
2024-10-22 11:10 ` [PATCH 01/27] ext4: remove writable userspace mappings before truncating page cache Zhang Yi
@ 2024-12-04 11:13 ` Jan Kara
2024-12-06 7:59 ` Zhang Yi
0 siblings, 1 reply; 59+ messages in thread
From: Jan Kara @ 2024-12-04 11:13 UTC (permalink / raw)
To: Zhang Yi
Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
jack, ritesh.list, hch, djwong, david, zokeefe, yi.zhang,
chengzhihao1, yukuai3, yangerkun
I'm sorry for the huge delay here...
On Tue 22-10-24 19:10:32, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
>
> When zeroing a range of folios on the filesystem which block size is
> less than the page size, the file's mapped partial blocks within one
> page will be marked as unwritten, we should remove writable userspace
> mappings to ensure that ext4_page_mkwrite() can be called during
> subsequent write access to these folios. Otherwise, data written by
> subsequent mmap writes may not be saved to disk.
>
> $mkfs.ext4 -b 1024 /dev/vdb
> $mount /dev/vdb /mnt
> $xfs_io -t -f -c "pwrite -S 0x58 0 4096" -c "mmap -rw 0 4096" \
> -c "mwrite -S 0x5a 2048 2048" -c "fzero 2048 2048" \
> -c "mwrite -S 0x59 2048 2048" -c "close" /mnt/foo
>
> $od -Ax -t x1z /mnt/foo
> 000000 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58
> *
> 000800 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59
> *
> 001000
>
> $umount /mnt && mount /dev/vdb /mnt
> $od -Ax -t x1z /mnt/foo
> 000000 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58
> *
> 000800 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> *
> 001000
>
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
This is a great catch! I think this may be the source of the sporadic data
corruption issues we observe with blocksize < pagesize.
> +static inline void ext4_truncate_folio(struct inode *inode,
> + loff_t start, loff_t end)
> +{
> + unsigned long blocksize = i_blocksize(inode);
> + struct folio *folio;
> +
> + if (round_up(start, blocksize) >= round_down(end, blocksize))
> + return;
> +
> + folio = filemap_lock_folio(inode->i_mapping, start >> PAGE_SHIFT);
> + if (IS_ERR(folio))
> + return;
> +
> + if (folio_mkclean(folio))
> + folio_mark_dirty(folio);
> + folio_unlock(folio);
> + folio_put(folio);
I don't think this is enough. In your example from the changelog, this would
leave the page at index 0 dirty and still with 0x5a values in the 2048-4096 range.
Then truncate_pagecache_range() does nothing, ext4_alloc_file_blocks()
converts blocks under 2048-4096 to unwritten state. But what handles
zeroing of page cache in 2048-4096 range? ext4_zero_partial_blocks() zeroes
only partial blocks, not full blocks. Am I missing something?
If I'm right, I'd keep it simple and just write out these partial folios with
filemap_write_and_wait_range() and expand the range
truncate_pagecache_range() removes to include these partial folios. The
overhead won't be big and it isn't like this is some very performance
sensitive path.
> +}
> +
> +/*
> + * When truncating a range of folios, if the block size is less than the
> + * page size, the file's mapped partial blocks within one page could be
> + * freed or converted to unwritten. We should call this function to remove
> + * writable userspace mappings so that ext4_page_mkwrite() can be called
> + * during subsequent write access to these folios.
> + */
> +void ext4_truncate_folios_range(struct inode *inode, loff_t start, loff_t end)
Maybe call this ext4_truncate_page_cache_block_range()? And assert that
start & end are block aligned. Because this essentially prepares page cache
for manipulation with a block range.
> +{
> + unsigned long blocksize = i_blocksize(inode);
> +
> + if (end > inode->i_size)
> + end = inode->i_size;
> + if (start >= end || blocksize >= PAGE_SIZE)
> + return;
> +
> + ext4_truncate_folio(inode, start, min(round_up(start, PAGE_SIZE), end));
> + if (end > round_up(start, PAGE_SIZE))
> + ext4_truncate_folio(inode, round_down(end, PAGE_SIZE), end);
> +}
So I'd move the following truncate_pagecache_range() into
ext4_truncate_folios_range(). And also the preceding:
/*
* For journalled data we need to write (and checkpoint) pages
* before discarding page cache to avoid inconsitent data on
* disk in case of crash before zeroing trans is committed.
*/
if (ext4_should_journal_data(inode)) {
ret = filemap_write_and_wait_range(mapping, start,
end - 1);
...
into this function. So that it can be self-contained "do the right thing
with page cache to prepare for block range manipulations".
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH 03/27] ext4: don't write back data before punch hole in nojournal mode
2024-11-20 2:56 ` Zhang Yi
@ 2024-12-04 11:26 ` Jan Kara
0 siblings, 0 replies; 59+ messages in thread
From: Jan Kara @ 2024-12-04 11:26 UTC (permalink / raw)
To: Zhang Yi
Cc: Darrick J. Wong, linux-ext4, linux-fsdevel, linux-kernel, tytso,
adilger.kernel, jack, ritesh.list, hch, david, zokeefe, yi.zhang,
chengzhihao1, yukuai3, yangerkun
On Wed 20-11-24 10:56:53, Zhang Yi wrote:
> On 2024/11/19 7:15, Darrick J. Wong wrote:
> > On Tue, Oct 22, 2024 at 07:10:34PM +0800, Zhang Yi wrote:
> >> From: Zhang Yi <yi.zhang@huawei.com>
> >>
> >> There is no need to write back all data before punching a hole in
> >> data=ordered|writeback mode since it will be dropped soon after removing
> >> space, so just remove the filemap_write_and_wait_range() in these modes.
> >> However, in data=journal mode, we need to write dirty pages out before
> >> discarding page cache in case of crash before committing the freeing
> >> data transaction, which could expose old, stale data.
> >
> > Can't the same thing happen with non-journaled data writes?
> >
> > Say you write 1GB of "A"s to a file and fsync. Then you write "B"s to
> > the same 1GB of file and immediately start punching it. If the system
> > reboots before the mapping updates all get written to disk, won't you
> > risk seeing some of those "A" because we no longer flush the "B"s?
> >
> > Also, since the program didn't explicitly fsync the Bs, why bother
> > flushing the dirty data at all? Are data=journal writes supposed to be
> > synchronous flushing writes nowadays?
>
> Thank you for your reply.
>
> This case is not exactly the problem that can occur in data=journal
> mode, the problem is even if we fsync "B"s before punching the hole, we
> may still encounter old data ("A"s or even older) if the system reboots
> before the hole-punching process is completed.
>
> The details of this problem are that ext4_punch_hole()->
> truncate_pagecache_range()-> ..->journal_unmap_buffer() will drop the
> checkpoint transaction, which may contain B's journaled data. Consequently,
> the journal tail could advance beyond this point. If we do not flush
> the data before dropping the cache and a crash occurs before the punching
> transaction is committed, B's transaction will never be recovered, resulting
> in the loss of B's data. Therefore, this cannot happen in non-journaled
> data writes.
Yes, you're correct. The logic in journal_unmap_buffer() (used when freeing
journaled data blocks) assumes that if there's no running / committing
transaction, then orphan replay is going to fix up possible partial
operations and thus it simply discards the block that's being freed. That
works for truncate but not for hole punch or range zeroing.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH 03/27] ext4: don't write back data before punch hole in nojournal mode
2024-10-22 11:10 ` [PATCH 03/27] ext4: don't write back data before punch hole in nojournal mode Zhang Yi
2024-11-18 23:15 ` Darrick J. Wong
@ 2024-12-04 11:27 ` Jan Kara
1 sibling, 0 replies; 59+ messages in thread
From: Jan Kara @ 2024-12-04 11:27 UTC (permalink / raw)
To: Zhang Yi
Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
jack, ritesh.list, hch, djwong, david, zokeefe, yi.zhang,
chengzhihao1, yukuai3, yangerkun
On Tue 22-10-24 19:10:34, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
>
> There is no need to write back all data before punching a hole in
> data=ordered|writeback mode since it will be dropped soon after removing
> space, so just remove the filemap_write_and_wait_range() in these modes.
> However, in data=journal mode, we need to write dirty pages out before
> discarding page cache in case of crash before committing the freeing
> data transaction, which could expose old, stale data.
>
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
With the ext4_truncate_page_cache_block_range() function I propose, this
will get slightly simpler. But overall the patch looks good.
Honza
> ---
> fs/ext4/inode.c | 26 +++++++++++++++-----------
> 1 file changed, 15 insertions(+), 11 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index f8796f7b0f94..94b923afcd9c 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3965,17 +3965,6 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
>
> trace_ext4_punch_hole(inode, offset, length, 0);
>
> - /*
> - * Write out all dirty pages to avoid race conditions
> - * Then release them.
> - */
> - if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
> - ret = filemap_write_and_wait_range(mapping, offset,
> - offset + length - 1);
> - if (ret)
> - return ret;
> - }
> -
> inode_lock(inode);
>
> /* No need to punch hole beyond i_size */
> @@ -4037,6 +4026,21 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
> ret = ext4_update_disksize_before_punch(inode, offset, length);
> if (ret)
> goto out_dio;
> +
> + /*
> + * For journalled data we need to write (and checkpoint) pages
> + * before discarding page cache to avoid inconsitent data on
> + * disk in case of crash before punching trans is committed.
> + */
> + if (ext4_should_journal_data(inode)) {
> + ret = filemap_write_and_wait_range(mapping,
> + first_block_offset, last_block_offset);
> + if (ret)
> + goto out_dio;
> + }
> +
> + ext4_truncate_folios_range(inode, first_block_offset,
> + last_block_offset + 1);
> truncate_pagecache_range(inode, first_block_offset,
> last_block_offset);
> }
> --
> 2.46.1
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH 04/27] ext4: refactor ext4_punch_hole()
2024-10-22 11:10 ` [PATCH 04/27] ext4: refactor ext4_punch_hole() Zhang Yi
2024-11-18 23:27 ` Darrick J. Wong
@ 2024-12-04 11:36 ` Jan Kara
1 sibling, 0 replies; 59+ messages in thread
From: Jan Kara @ 2024-12-04 11:36 UTC (permalink / raw)
To: Zhang Yi
Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
jack, ritesh.list, hch, djwong, david, zokeefe, yi.zhang,
chengzhihao1, yukuai3, yangerkun
On Tue 22-10-24 19:10:35, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
>
> The current implementation of ext4_punch_hole() contains complex
> position calculations and stale error tags. To improve the code's
> clarity and maintainability, it is essential to clean up the code and
> improve its readability, this can be achieved by: a) simplifying and
> renaming variables; b) eliminating unnecessary position calculations;
> c) writing back all data in data=journal mode, and drop page cache from
> the original offset to the end, rather than using aligned blocks,
> d) renaming the stale error tags.
>
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Again, this should get slightly simplified with the new function (no need
for special data=journal handling) but overall it looks fine.
> -out_dio:
> +out_invalidate_lock:
> filemap_invalidate_unlock(mapping);
> -out_mutex:
> +out:
> inode_unlock(inode);
> return ret;
> }
I agree with Darrick that just 'out' is not a great name when we are
actually releasing inode->i_rwsem. So perhaps "out_inode_lock:"?
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH 05/27] ext4: refactor ext4_zero_range()
2024-10-22 11:10 ` [PATCH 05/27] ext4: refactor ext4_zero_range() Zhang Yi
@ 2024-12-04 11:52 ` Jan Kara
2024-12-06 8:09 ` Zhang Yi
0 siblings, 1 reply; 59+ messages in thread
From: Jan Kara @ 2024-12-04 11:52 UTC (permalink / raw)
To: Zhang Yi
Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
jack, ritesh.list, hch, djwong, david, zokeefe, yi.zhang,
chengzhihao1, yukuai3, yangerkun
On Tue 22-10-24 19:10:36, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
>
> The current implementation of ext4_zero_range() contains complex
> position calculations and stale error tags. To improve the code's
> clarity and maintainability, it is essential to clean up the code and
> improve its readability, this can be achieved by: a) simplifying and
> renaming variables, making the style the same as ext4_punch_hole(); b)
> eliminating unnecessary position calculations, writing back all data in
> data=journal mode, and drop page cache from the original offset to the
> end, rather than using aligned blocks; c) renaming the stale out_mutex
> tags.
>
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
...
> - goto out_mutex;
> -
> - /* Preallocate the range including the unaligned edges */
> - if (partial_begin || partial_end) {
> - ret = ext4_alloc_file_blocks(file,
> - round_down(offset, 1 << blkbits) >> blkbits,
> - (round_up((offset + len), 1 << blkbits) -
> - round_down(offset, 1 << blkbits)) >> blkbits,
> - new_size, flags);
> - if (ret)
> - goto out_mutex;
> -
> - }
So I think we should keep this first ext4_alloc_file_blocks() call before
we truncate the page cache. Otherwise if ext4_alloc_file_blocks() fails due
to ENOSPC, we have already lost the dirty data originally in the zeroed
range. All the other failure modes are kind of catastrophic anyway, so they
are fine after dropping the page cache. But this can be quite common and
should be handled more gracefully.
Honza
> -
> - /* Zero range excluding the unaligned edges */
> - if (max_blocks > 0) {
> - flags |= (EXT4_GET_BLOCKS_CONVERT_UNWRITTEN |
> - EXT4_EX_NOCACHE);
> + goto out;
>
> - /*
> - * Prevent page faults from reinstantiating pages we have
> - * released from page cache.
> - */
> - filemap_invalidate_lock(mapping);
> + /*
> + * Prevent page faults from reinstantiating pages we have released
> + * from page cache.
> + */
> + filemap_invalidate_lock(mapping);
>
> - ret = ext4_break_layouts(inode);
> - if (ret) {
> - filemap_invalidate_unlock(mapping);
> - goto out_mutex;
> - }
> + ret = ext4_break_layouts(inode);
> + if (ret)
> + goto out_invalidate_lock;
>
> + /*
> + * For journalled data we need to write (and checkpoint) pages before
> + * discarding page cache to avoid inconsitent data on disk in case of
> + * crash before zeroing trans is committed.
> + */
> + if (ext4_should_journal_data(inode)) {
> + ret = filemap_write_and_wait_range(mapping, offset, end - 1);
> + } else {
> ret = ext4_update_disksize_before_punch(inode, offset, len);
> - if (ret) {
> - filemap_invalidate_unlock(mapping);
> - goto out_mutex;
> - }
> + ext4_truncate_folios_range(inode, offset, end);
> + }
> + if (ret)
> + goto out_invalidate_lock;
>
> - /*
> - * For journalled data we need to write (and checkpoint) pages
> - * before discarding page cache to avoid inconsitent data on
> - * disk in case of crash before zeroing trans is committed.
> - */
> - if (ext4_should_journal_data(inode)) {
> - ret = filemap_write_and_wait_range(mapping, start,
> - end - 1);
> - if (ret) {
> - filemap_invalidate_unlock(mapping);
> - goto out_mutex;
> - }
> - }
> + /* Now release the pages and zero block aligned part of pages */
> + truncate_pagecache_range(inode, offset, end - 1);
>
> - /* Now release the pages and zero block aligned part of pages */
> - ext4_truncate_folios_range(inode, start, end);
> - truncate_pagecache_range(inode, start, end - 1);
> + flags = EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT;
> + /* Preallocate the range including the unaligned edges */
> + if (offset & (blocksize - 1) || end & (blocksize - 1)) {
> + ext4_lblk_t alloc_lblk = offset >> blkbits;
> + ext4_lblk_t len_lblk = EXT4_MAX_BLOCKS(len, offset, blkbits);
>
> - ret = ext4_alloc_file_blocks(file, lblk, max_blocks, new_size,
> - flags);
> - filemap_invalidate_unlock(mapping);
> + ret = ext4_alloc_file_blocks(file, alloc_lblk, len_lblk,
> + new_size, flags);
> if (ret)
> - goto out_mutex;
> + goto out_invalidate_lock;
> }
> - if (!partial_begin && !partial_end)
> - goto out_mutex;
> +
> + /* Zero range excluding the unaligned edges */
> + start_lblk = round_up(offset, blocksize) >> blkbits;
> + end_lblk = end >> blkbits;
> + if (end_lblk > start_lblk) {
> + ext4_lblk_t zero_blks = end_lblk - start_lblk;
> +
> + flags |= (EXT4_GET_BLOCKS_CONVERT_UNWRITTEN | EXT4_EX_NOCACHE);
> + ret = ext4_alloc_file_blocks(file, start_lblk, zero_blks,
> + new_size, flags);
> + if (ret)
> + goto out_invalidate_lock;
> + }
> + /* Finish zeroing out if it doesn't contain partial block */
> + if (!(offset & (blocksize - 1)) && !(end & (blocksize - 1)))
> + goto out_invalidate_lock;
>
> /*
> * In worst case we have to writeout two nonadjacent unwritten
> @@ -4700,25 +4665,29 @@ static long ext4_zero_range(struct file *file, loff_t offset,
> if (IS_ERR(handle)) {
> ret = PTR_ERR(handle);
> ext4_std_error(inode->i_sb, ret);
> - goto out_mutex;
> + goto out_invalidate_lock;
> }
>
> + /* Zero out partial block at the edges of the range */
> + ret = ext4_zero_partial_blocks(handle, inode, offset, len);
> + if (ret)
> + goto out_handle;
> +
> if (new_size)
> ext4_update_inode_size(inode, new_size);
> ret = ext4_mark_inode_dirty(handle, inode);
> if (unlikely(ret))
> goto out_handle;
> - /* Zero out partial block at the edges of the range */
> - ret = ext4_zero_partial_blocks(handle, inode, offset, len);
> - if (ret >= 0)
> - ext4_update_inode_fsync_trans(handle, inode, 1);
>
> + ext4_update_inode_fsync_trans(handle, inode, 1);
> if (file->f_flags & O_SYNC)
> ext4_handle_sync(handle);
>
> out_handle:
> ext4_journal_stop(handle);
> -out_mutex:
> +out_invalidate_lock:
> + filemap_invalidate_unlock(mapping);
> +out:
> inode_unlock(inode);
> return ret;
> }
> --
> 2.46.1
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 06/27] ext4: refactor ext4_collapse_range()
2024-10-22 11:10 ` [PATCH 06/27] ext4: refactor ext4_collapse_range() Zhang Yi
@ 2024-12-04 11:58 ` Jan Kara
0 siblings, 0 replies; 59+ messages in thread
From: Jan Kara @ 2024-12-04 11:58 UTC (permalink / raw)
To: Zhang Yi
Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
jack, ritesh.list, hch, djwong, david, zokeefe, yi.zhang,
chengzhihao1, yukuai3, yangerkun
On Tue 22-10-24 19:10:37, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
>
> Simplify ext4_collapse_range() and align its code style with that of
> ext4_zero_range() and ext4_punch_hole(). Refactor it by: a) renaming
> variables, b) removing redundant input parameter checks and moving
> the remaining checks under i_rwsem in preparation for future
> refactoring, and c) renaming the three stale error tags.
>
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Looks good. Feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
Just one nit below:
> -out_stop:
> +out_handle:
> ext4_journal_stop(handle);
> -out_mmap:
> +out_invalidate_lock:
> filemap_invalidate_unlock(mapping);
> -out_mutex:
> +out:
> inode_unlock(inode);
> return ret;
> }
Again, I think "out_inode_lock" would be a better name than just "out".
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH 07/27] ext4: refactor ext4_insert_range()
2024-10-22 11:10 ` [PATCH 07/27] ext4: refactor ext4_insert_range() Zhang Yi
@ 2024-12-04 12:02 ` Jan Kara
0 siblings, 0 replies; 59+ messages in thread
From: Jan Kara @ 2024-12-04 12:02 UTC (permalink / raw)
To: Zhang Yi
Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
jack, ritesh.list, hch, djwong, david, zokeefe, yi.zhang,
chengzhihao1, yukuai3, yangerkun
On Tue 22-10-24 19:10:38, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
>
> Simplify ext4_insert_range() and align its code style with that of
> ext4_collapse_range(). Refactor it by: a) renaming variables, b)
> removing redundant input parameter checks and moving the remaining
> checks under i_rwsem in preparation for future refactoring, and c)
> renaming the three stale error tags.
>
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Looks good. Feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
Same nit as in the previous patch:
> -out_stop:
> +out_handle:
> ext4_journal_stop(handle);
> -out_mmap:
> +out_invalidate_lock:
> filemap_invalidate_unlock(mapping);
> -out_mutex:
> +out:
> inode_unlock(inode);
> return ret;
> }
I think 'out_inode_lock' is better than plain 'out'.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH 09/27] ext4: move out inode_lock into ext4_fallocate()
2024-10-22 11:10 ` [PATCH 09/27] ext4: move out inode_lock into ext4_fallocate() Zhang Yi
@ 2024-12-04 12:05 ` Jan Kara
2024-12-06 8:13 ` Zhang Yi
0 siblings, 1 reply; 59+ messages in thread
From: Jan Kara @ 2024-12-04 12:05 UTC (permalink / raw)
To: Zhang Yi
Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
jack, ritesh.list, hch, djwong, david, zokeefe, yi.zhang,
chengzhihao1, yukuai3, yangerkun
On Tue 22-10-24 19:10:40, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
>
> Currently, all five sub-functions of ext4_fallocate() acquire the
> inode's i_rwsem at the beginning and release it before exiting. This
> process can be simplified by factoring out the management of i_rwsem
> into the ext4_fallocate() function.
>
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Ah, nice. Feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
and please ignore my comments about renaming 'out' labels :).
Honza
> ---
> fs/ext4/extents.c | 90 +++++++++++++++--------------------------------
> fs/ext4/inode.c | 13 +++----
> 2 files changed, 33 insertions(+), 70 deletions(-)
>
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index 2f727104f53d..a2db4e85790f 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -4573,23 +4573,18 @@ static long ext4_zero_range(struct file *file, loff_t offset,
> int ret, flags, credits;
>
> trace_ext4_zero_range(inode, offset, len, mode);
> + WARN_ON_ONCE(!inode_is_locked(inode));
>
> - inode_lock(inode);
> -
> - /*
> - * Indirect files do not support unwritten extents
> - */
> - if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))) {
> - ret = -EOPNOTSUPP;
> - goto out;
> - }
> + /* Indirect files do not support unwritten extents */
> + if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)))
> + return -EOPNOTSUPP;
>
> if (!(mode & FALLOC_FL_KEEP_SIZE) &&
> (end > inode->i_size || end > EXT4_I(inode)->i_disksize)) {
> new_size = end;
> ret = inode_newsize_ok(inode, new_size);
> if (ret)
> - goto out;
> + return ret;
> }
>
> /* Wait all existing dio workers, newcomers will block on i_rwsem */
> @@ -4597,7 +4592,7 @@ static long ext4_zero_range(struct file *file, loff_t offset,
>
> ret = file_modified(file);
> if (ret)
> - goto out;
> + return ret;
>
> /*
> * Prevent page faults from reinstantiating pages we have released
> @@ -4687,8 +4682,6 @@ static long ext4_zero_range(struct file *file, loff_t offset,
> ext4_journal_stop(handle);
> out_invalidate_lock:
> filemap_invalidate_unlock(mapping);
> -out:
> - inode_unlock(inode);
> return ret;
> }
>
> @@ -4702,12 +4695,11 @@ static long ext4_do_fallocate(struct file *file, loff_t offset,
> int ret;
>
> trace_ext4_fallocate_enter(inode, offset, len, mode);
> + WARN_ON_ONCE(!inode_is_locked(inode));
>
> start_lblk = offset >> inode->i_blkbits;
> len_lblk = EXT4_MAX_BLOCKS(len, offset, inode->i_blkbits);
>
> - inode_lock(inode);
> -
> /* We only support preallocation for extent-based files only. */
> if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))) {
> ret = -EOPNOTSUPP;
> @@ -4739,7 +4731,6 @@ static long ext4_do_fallocate(struct file *file, loff_t offset,
> EXT4_I(inode)->i_sync_tid);
> }
> out:
> - inode_unlock(inode);
> trace_ext4_fallocate_exit(inode, offset, len_lblk, ret);
> return ret;
> }
> @@ -4774,9 +4765,8 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
>
> inode_lock(inode);
> ret = ext4_convert_inline_data(inode);
> - inode_unlock(inode);
> if (ret)
> - return ret;
> + goto out;
>
> if (mode & FALLOC_FL_PUNCH_HOLE)
> ret = ext4_punch_hole(file, offset, len);
> @@ -4788,7 +4778,8 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
> ret = ext4_zero_range(file, offset, len, mode);
> else
> ret = ext4_do_fallocate(file, offset, len, mode);
> -
> +out:
> + inode_unlock(inode);
> return ret;
> }
>
> @@ -5298,36 +5289,27 @@ static int ext4_collapse_range(struct file *file, loff_t offset, loff_t len)
> int ret;
>
> trace_ext4_collapse_range(inode, offset, len);
> -
> - inode_lock(inode);
> + WARN_ON_ONCE(!inode_is_locked(inode));
>
> /* Currently just for extent based files */
> - if (!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) {
> - ret = -EOPNOTSUPP;
> - goto out;
> - }
> -
> + if (!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
> + return -EOPNOTSUPP;
> /* Collapse range works only on fs cluster size aligned regions. */
> - if (!IS_ALIGNED(offset | len, EXT4_CLUSTER_SIZE(sb))) {
> - ret = -EINVAL;
> - goto out;
> - }
> -
> + if (!IS_ALIGNED(offset | len, EXT4_CLUSTER_SIZE(sb)))
> + return -EINVAL;
> /*
> * There is no need to overlap collapse range with EOF, in which case
> * it is effectively a truncate operation
> */
> - if (end >= inode->i_size) {
> - ret = -EINVAL;
> - goto out;
> - }
> + if (end >= inode->i_size)
> + return -EINVAL;
>
> /* Wait for existing dio to complete */
> inode_dio_wait(inode);
>
> ret = file_modified(file);
> if (ret)
> - goto out;
> + return ret;
>
> /*
> * Prevent page faults from reinstantiating pages we have released from
> @@ -5402,8 +5384,6 @@ static int ext4_collapse_range(struct file *file, loff_t offset, loff_t len)
> ext4_journal_stop(handle);
> out_invalidate_lock:
> filemap_invalidate_unlock(mapping);
> -out:
> - inode_unlock(inode);
> return ret;
> }
>
> @@ -5429,39 +5409,27 @@ static int ext4_insert_range(struct file *file, loff_t offset, loff_t len)
> loff_t start;
>
> trace_ext4_insert_range(inode, offset, len);
> -
> - inode_lock(inode);
> + WARN_ON_ONCE(!inode_is_locked(inode));
>
> /* Currently just for extent based files */
> - if (!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) {
> - ret = -EOPNOTSUPP;
> - goto out;
> - }
> -
> + if (!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
> + return -EOPNOTSUPP;
> /* Insert range works only on fs cluster size aligned regions. */
> - if (!IS_ALIGNED(offset | len, EXT4_CLUSTER_SIZE(sb))) {
> - ret = -EINVAL;
> - goto out;
> - }
> -
> + if (!IS_ALIGNED(offset | len, EXT4_CLUSTER_SIZE(sb)))
> + return -EINVAL;
> /* Offset must be less than i_size */
> - if (offset >= inode->i_size) {
> - ret = -EINVAL;
> - goto out;
> - }
> -
> + if (offset >= inode->i_size)
> + return -EINVAL;
> /* Check whether the maximum file size would be exceeded */
> - if (len > inode->i_sb->s_maxbytes - inode->i_size) {
> - ret = -EFBIG;
> - goto out;
> - }
> + if (len > inode->i_sb->s_maxbytes - inode->i_size)
> + return -EFBIG;
>
> /* Wait for existing dio to complete */
> inode_dio_wait(inode);
>
> ret = file_modified(file);
> if (ret)
> - goto out;
> + return ret;
>
> /*
> * Prevent page faults from reinstantiating pages we have released from
> @@ -5562,8 +5530,6 @@ static int ext4_insert_range(struct file *file, loff_t offset, loff_t len)
> ext4_journal_stop(handle);
> out_invalidate_lock:
> filemap_invalidate_unlock(mapping);
> -out:
> - inode_unlock(inode);
> return ret;
> }
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 1d128333bd06..bea19cd6e676 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3962,15 +3962,14 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
> unsigned long blocksize = i_blocksize(inode);
> handle_t *handle;
> unsigned int credits;
> - int ret = 0;
> + int ret;
>
> trace_ext4_punch_hole(inode, offset, length, 0);
> -
> - inode_lock(inode);
> + WARN_ON_ONCE(!inode_is_locked(inode));
>
> /* No need to punch hole beyond i_size */
> if (offset >= inode->i_size)
> - goto out;
> + return 0;
>
> /*
> * If the hole extends beyond i_size, set the hole to end after
> @@ -3990,7 +3989,7 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
> if (offset & (blocksize - 1) || end & (blocksize - 1)) {
> ret = ext4_inode_attach_jinode(inode);
> if (ret < 0)
> - goto out;
> + return ret;
> }
>
> /* Wait all existing dio workers, newcomers will block on i_rwsem */
> @@ -3998,7 +3997,7 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
>
> ret = file_modified(file);
> if (ret)
> - goto out;
> + return ret;
>
> /*
> * Prevent page faults from reinstantiating pages we have released from
> @@ -4082,8 +4081,6 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
> ext4_journal_stop(handle);
> out_invalidate_lock:
> filemap_invalidate_unlock(mapping);
> -out:
> - inode_unlock(inode);
> return ret;
> }
>
> --
> 2.46.1
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH 10/27] ext4: move out common parts into ext4_fallocate()
2024-10-22 11:10 ` [PATCH 10/27] ext4: move out common parts " Zhang Yi
@ 2024-12-04 12:10 ` Jan Kara
0 siblings, 0 replies; 59+ messages in thread
From: Jan Kara @ 2024-12-04 12:10 UTC (permalink / raw)
To: Zhang Yi
Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
jack, ritesh.list, hch, djwong, david, zokeefe, yi.zhang,
chengzhihao1, yukuai3, yangerkun
On Tue 22-10-24 19:10:41, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
>
> Currently, all zeroing ranges, punch holes, collapse ranges, and insert
> ranges first wait for all existing direct I/O workers to complete, and
> then they acquire the mapping's invalidate lock before performing the
> actual work. These common components are nearly identical, so we can
> simplify the code by factoring them out into the ext4_fallocate().
>
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Nice. Feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
> ---
> fs/ext4/extents.c | 121 ++++++++++++++++------------------------------
> fs/ext4/inode.c | 23 +--------
> 2 files changed, 43 insertions(+), 101 deletions(-)
>
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index a2db4e85790f..d5067d5aa449 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -4587,23 +4587,6 @@ static long ext4_zero_range(struct file *file, loff_t offset,
> return ret;
> }
>
> - /* Wait all existing dio workers, newcomers will block on i_rwsem */
> - inode_dio_wait(inode);
> -
> - ret = file_modified(file);
> - if (ret)
> - return ret;
> -
> - /*
> - * Prevent page faults from reinstantiating pages we have released
> - * from page cache.
> - */
> - filemap_invalidate_lock(mapping);
> -
> - ret = ext4_break_layouts(inode);
> - if (ret)
> - goto out_invalidate_lock;
> -
> /*
> * For journalled data we need to write (and checkpoint) pages before
> * discarding page cache to avoid inconsitent data on disk in case of
> @@ -4616,7 +4599,7 @@ static long ext4_zero_range(struct file *file, loff_t offset,
> ext4_truncate_folios_range(inode, offset, end);
> }
> if (ret)
> - goto out_invalidate_lock;
> + return ret;
>
> /* Now release the pages and zero block aligned part of pages */
> truncate_pagecache_range(inode, offset, end - 1);
> @@ -4630,7 +4613,7 @@ static long ext4_zero_range(struct file *file, loff_t offset,
> ret = ext4_alloc_file_blocks(file, alloc_lblk, len_lblk,
> new_size, flags);
> if (ret)
> - goto out_invalidate_lock;
> + return ret;
> }
>
> /* Zero range excluding the unaligned edges */
> @@ -4643,11 +4626,11 @@ static long ext4_zero_range(struct file *file, loff_t offset,
> ret = ext4_alloc_file_blocks(file, start_lblk, zero_blks,
> new_size, flags);
> if (ret)
> - goto out_invalidate_lock;
> + return ret;
> }
> /* Finish zeroing out if it doesn't contain partial block */
> if (!(offset & (blocksize - 1)) && !(end & (blocksize - 1)))
> - goto out_invalidate_lock;
> + return ret;
>
> /*
> * In worst case we have to writeout two nonadjacent unwritten
> @@ -4660,7 +4643,7 @@ static long ext4_zero_range(struct file *file, loff_t offset,
> if (IS_ERR(handle)) {
> ret = PTR_ERR(handle);
> ext4_std_error(inode->i_sb, ret);
> - goto out_invalidate_lock;
> + return ret;
> }
>
> /* Zero out partial block at the edges of the range */
> @@ -4680,8 +4663,6 @@ static long ext4_zero_range(struct file *file, loff_t offset,
>
> out_handle:
> ext4_journal_stop(handle);
> -out_invalidate_lock:
> - filemap_invalidate_unlock(mapping);
> return ret;
> }
>
> @@ -4714,13 +4695,6 @@ static long ext4_do_fallocate(struct file *file, loff_t offset,
> goto out;
> }
>
> - /* Wait all existing dio workers, newcomers will block on i_rwsem */
> - inode_dio_wait(inode);
> -
> - ret = file_modified(file);
> - if (ret)
> - goto out;
> -
> ret = ext4_alloc_file_blocks(file, start_lblk, len_lblk, new_size,
> EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT);
> if (ret)
> @@ -4745,6 +4719,7 @@ static long ext4_do_fallocate(struct file *file, loff_t offset,
> long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
> {
> struct inode *inode = file_inode(file);
> + struct address_space *mapping = file->f_mapping;
> int ret;
>
> /*
> @@ -4768,6 +4743,29 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
> if (ret)
> goto out;
>
> + /* Wait all existing dio workers, newcomers will block on i_rwsem */
> + inode_dio_wait(inode);
> +
> + ret = file_modified(file);
> + if (ret)
> + return ret;
> +
> + if ((mode & FALLOC_FL_MODE_MASK) == FALLOC_FL_ALLOCATE_RANGE) {
> + ret = ext4_do_fallocate(file, offset, len, mode);
> + goto out;
> + }
> +
> + /*
> + * Follow-up operations will drop page cache, hold invalidate lock
> + * to prevent page faults from reinstantiating pages we have
> + * released from page cache.
> + */
> + filemap_invalidate_lock(mapping);
> +
> + ret = ext4_break_layouts(inode);
> + if (ret)
> + goto out_invalidate_lock;
> +
> if (mode & FALLOC_FL_PUNCH_HOLE)
> ret = ext4_punch_hole(file, offset, len);
> else if (mode & FALLOC_FL_COLLAPSE_RANGE)
> @@ -4777,7 +4775,10 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
> else if (mode & FALLOC_FL_ZERO_RANGE)
> ret = ext4_zero_range(file, offset, len, mode);
> else
> - ret = ext4_do_fallocate(file, offset, len, mode);
> + ret = -EOPNOTSUPP;
> +
> +out_invalidate_lock:
> + filemap_invalidate_unlock(mapping);
> out:
> inode_unlock(inode);
> return ret;
> @@ -5304,23 +5305,6 @@ static int ext4_collapse_range(struct file *file, loff_t offset, loff_t len)
> if (end >= inode->i_size)
> return -EINVAL;
>
> - /* Wait for existing dio to complete */
> - inode_dio_wait(inode);
> -
> - ret = file_modified(file);
> - if (ret)
> - return ret;
> -
> - /*
> - * Prevent page faults from reinstantiating pages we have released from
> - * page cache.
> - */
> - filemap_invalidate_lock(mapping);
> -
> - ret = ext4_break_layouts(inode);
> - if (ret)
> - goto out_invalidate_lock;
> -
> /*
> * Write tail of the last page before removed range and data that
> * will be shifted since they will get removed from the page cache
> @@ -5334,16 +5318,15 @@ static int ext4_collapse_range(struct file *file, loff_t offset, loff_t len)
> if (!ret)
> ret = filemap_write_and_wait_range(mapping, end, LLONG_MAX);
> if (ret)
> - goto out_invalidate_lock;
> + return ret;
>
> truncate_pagecache(inode, start);
>
> credits = ext4_writepage_trans_blocks(inode);
> handle = ext4_journal_start(inode, EXT4_HT_TRUNCATE, credits);
> - if (IS_ERR(handle)) {
> - ret = PTR_ERR(handle);
> - goto out_invalidate_lock;
> - }
> + if (IS_ERR(handle))
> + return PTR_ERR(handle);
> +
> ext4_fc_mark_ineligible(sb, EXT4_FC_REASON_FALLOC_RANGE, handle);
>
> start_lblk = offset >> inode->i_blkbits;
> @@ -5382,8 +5365,6 @@ static int ext4_collapse_range(struct file *file, loff_t offset, loff_t len)
>
> out_handle:
> ext4_journal_stop(handle);
> -out_invalidate_lock:
> - filemap_invalidate_unlock(mapping);
> return ret;
> }
>
> @@ -5424,23 +5405,6 @@ static int ext4_insert_range(struct file *file, loff_t offset, loff_t len)
> if (len > inode->i_sb->s_maxbytes - inode->i_size)
> return -EFBIG;
>
> - /* Wait for existing dio to complete */
> - inode_dio_wait(inode);
> -
> - ret = file_modified(file);
> - if (ret)
> - return ret;
> -
> - /*
> - * Prevent page faults from reinstantiating pages we have released from
> - * page cache.
> - */
> - filemap_invalidate_lock(mapping);
> -
> - ret = ext4_break_layouts(inode);
> - if (ret)
> - goto out_invalidate_lock;
> -
> /*
> * Write out all dirty pages. Need to round down to align start offset
> * to page size boundary for page size > block size.
> @@ -5448,16 +5412,15 @@ static int ext4_insert_range(struct file *file, loff_t offset, loff_t len)
> start = round_down(offset, PAGE_SIZE);
> ret = filemap_write_and_wait_range(mapping, start, LLONG_MAX);
> if (ret)
> - goto out_invalidate_lock;
> + return ret;
>
> truncate_pagecache(inode, start);
>
> credits = ext4_writepage_trans_blocks(inode);
> handle = ext4_journal_start(inode, EXT4_HT_TRUNCATE, credits);
> - if (IS_ERR(handle)) {
> - ret = PTR_ERR(handle);
> - goto out_invalidate_lock;
> - }
> + if (IS_ERR(handle))
> + return PTR_ERR(handle);
> +
> ext4_fc_mark_ineligible(sb, EXT4_FC_REASON_FALLOC_RANGE, handle);
>
> /* Expand file to avoid data loss if there is error while shifting */
> @@ -5528,8 +5491,6 @@ static int ext4_insert_range(struct file *file, loff_t offset, loff_t len)
>
> out_handle:
> ext4_journal_stop(handle);
> -out_invalidate_lock:
> - filemap_invalidate_unlock(mapping);
> return ret;
> }
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index bea19cd6e676..1ccf84a64b7b 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3992,23 +3992,6 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
> return ret;
> }
>
> - /* Wait all existing dio workers, newcomers will block on i_rwsem */
> - inode_dio_wait(inode);
> -
> - ret = file_modified(file);
> - if (ret)
> - return ret;
> -
> - /*
> - * Prevent page faults from reinstantiating pages we have released from
> - * page cache.
> - */
> - filemap_invalidate_lock(mapping);
> -
> - ret = ext4_break_layouts(inode);
> - if (ret)
> - goto out_invalidate_lock;
> -
> /*
> * For journalled data we need to write (and checkpoint) pages
> * before discarding page cache to avoid inconsitent data on
> @@ -4021,7 +4004,7 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
> ext4_truncate_folios_range(inode, offset, end);
> }
> if (ret)
> - goto out_invalidate_lock;
> + return ret;
>
> /* Now release the pages and zero block aligned part of pages*/
> truncate_pagecache_range(inode, offset, end - 1);
> @@ -4034,7 +4017,7 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
> if (IS_ERR(handle)) {
> ret = PTR_ERR(handle);
> ext4_std_error(sb, ret);
> - goto out_invalidate_lock;
> + return ret;
> }
>
> ret = ext4_zero_partial_blocks(handle, inode, offset, length);
> @@ -4079,8 +4062,6 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
> ext4_handle_sync(handle);
> out_handle:
> ext4_journal_stop(handle);
> -out_invalidate_lock:
> - filemap_invalidate_unlock(mapping);
> return ret;
> }
>
> --
> 2.46.1
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH 11/27] ext4: use reserved metadata blocks when splitting extent on endio
2024-10-22 11:10 ` [PATCH 11/27] ext4: use reserved metadata blocks when splitting extent on endio Zhang Yi
@ 2024-12-04 12:16 ` Jan Kara
0 siblings, 0 replies; 59+ messages in thread
From: Jan Kara @ 2024-12-04 12:16 UTC (permalink / raw)
To: Zhang Yi
Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
jack, ritesh.list, hch, djwong, david, zokeefe, yi.zhang,
chengzhihao1, yukuai3, yangerkun
On Tue 22-10-24 19:10:42, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
>
> When performing buffered writes, we may need to split and convert an
> unwritten extent into a written one during the end I/O process. However,
> we do not reserve space specifically for these metadata changes, we only
> reserve 2% of space or 4096 blocks. To address this, we use
> EXT4_GET_BLOCKS_PRE_IO to potentially split extents in advance and
> EXT4_GET_BLOCKS_METADATA_NOFAIL to utilize reserved space if necessary.
>
> These two approaches can reduce the likelihood of running out of space
> and losing data. However, these methods are merely best efforts, we
> could still run out of space, and there is not much difference between
> converting an extent during the writeback process and the end I/O
> process, it won't increase the risk of losing data if we postpone the
> conversion.
>
> Therefore, also use EXT4_GET_BLOCKS_METADATA_NOFAIL in
> ext4_convert_unwritten_extents_endio() to prepare for the buffered I/O
> iomap conversion, which may perform extent conversion during the end I/O
> process.
>
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Yeah, I agree. Feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
> ---
> fs/ext4/extents.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index d5067d5aa449..33bc2cc5aff4 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -3767,6 +3767,8 @@ ext4_convert_unwritten_extents_endio(handle_t *handle, struct inode *inode,
> * illegal.
> */
> if (ee_block != map->m_lblk || ee_len > map->m_len) {
> + int flags = EXT4_GET_BLOCKS_CONVERT |
> + EXT4_GET_BLOCKS_METADATA_NOFAIL;
> #ifdef CONFIG_EXT4_DEBUG
> ext4_warning(inode->i_sb, "Inode (%ld) finished: extent logical block %llu,"
> " len %u; IO logical block %llu, len %u",
> @@ -3774,7 +3776,7 @@ ext4_convert_unwritten_extents_endio(handle_t *handle, struct inode *inode,
> (unsigned long long)map->m_lblk, map->m_len);
> #endif
> path = ext4_split_convert_extents(handle, inode, map, path,
> - EXT4_GET_BLOCKS_CONVERT, NULL);
> + flags, NULL);
> if (IS_ERR(path))
> return path;
>
> --
> 2.46.1
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH 12/27] ext4: introduce seq counter for the extent status entry
2024-10-22 11:10 ` [PATCH 12/27] ext4: introduce seq counter for the extent status entry Zhang Yi
@ 2024-12-04 12:42 ` Jan Kara
2024-12-06 8:55 ` Zhang Yi
0 siblings, 1 reply; 59+ messages in thread
From: Jan Kara @ 2024-12-04 12:42 UTC (permalink / raw)
To: Zhang Yi
Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
jack, ritesh.list, hch, djwong, david, zokeefe, yi.zhang,
chengzhihao1, yukuai3, yangerkun
On Tue 22-10-24 19:10:43, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
>
> In the iomap_write_iter(), the iomap buffered write frame does not hold
> any locks between querying the inode extent mapping info and performing
> page cache writes. As a result, the extent mapping can be changed due to
> concurrent I/O in flight. Similarly, in the iomap_writepage_map(), the
> write-back process faces a similar problem: concurrent changes can
> invalidate the extent mapping before the I/O is submitted.
>
> Therefore, both of these processes must recheck the mapping info after
> acquiring the folio lock. To address this, similar to XFS, we propose
> introducing an extent sequence number to serve as a validity cookie for
> the extent. We will increment this number whenever the extent status
> tree changes, thereby preparing for the buffered write iomap conversion.
> Besides, it also changes the trace code style to make checkpatch.pl
> happy.
>
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Overall using some sequence counter makes sense.
> diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
> index c786691dabd3..bea4f87db502 100644
> --- a/fs/ext4/extents_status.c
> +++ b/fs/ext4/extents_status.c
> @@ -204,6 +204,13 @@ static inline ext4_lblk_t ext4_es_end(struct extent_status *es)
> return es->es_lblk + es->es_len - 1;
> }
>
> +static inline void ext4_es_inc_seq(struct inode *inode)
> +{
> + struct ext4_inode_info *ei = EXT4_I(inode);
> +
> + WRITE_ONCE(ei->i_es_seq, READ_ONCE(ei->i_es_seq) + 1);
> +}
This looks potentially dangerous because we can lose i_es_seq updates this
way. Like:
CPU1                                    CPU2
x = READ_ONCE(ei->i_es_seq)
                                        x = READ_ONCE(ei->i_es_seq)
                                        WRITE_ONCE(ei->i_es_seq, x + 1)
                                        ... potentially many times
WRITE_ONCE(ei->i_es_seq, x + 1)
        -> the counter goes back leading to possibly false equality checks
I think you'll need to use atomic_t and appropriate functions here.
> @@ -872,6 +879,7 @@ void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
> BUG_ON(end < lblk);
> WARN_ON_ONCE(status & EXTENT_STATUS_DELAYED);
>
> + ext4_es_inc_seq(inode);
I'm somewhat wondering: Are extent status tree modifications the right
place to advance the sequence counter? The counter needs to advance
whenever the mapping information changes. This means that we'd be
needlessly advancing the counter (and thus possibly forcing retries) when
we are just adding new information from ordinary extent tree into cache.
Also someone can be doing extent tree manipulations without touching extent
status tree (if the information was already pruned from there). So I think
this needs some very good documentation of what the expectations on the
sequence counter are, and an explanation of why they are satisfied, so that
we don't break this in the future.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH 01/27] ext4: remove writable userspace mappings before truncating page cache
2024-12-04 11:13 ` Jan Kara
@ 2024-12-06 7:59 ` Zhang Yi
2024-12-06 15:49 ` Jan Kara
0 siblings, 1 reply; 59+ messages in thread
From: Zhang Yi @ 2024-12-06 7:59 UTC (permalink / raw)
To: Jan Kara
Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
ritesh.list, hch, djwong, david, zokeefe, yi.zhang, chengzhihao1,
yukuai3, yangerkun
On 2024/12/4 19:13, Jan Kara wrote:
> I'm sorry for the huge delay here...
>
It's fine, I know you've probably been busy lately, and this series has
undergone significant modifications, which require considerable time to
review. Thanks a lot for taking the time to review this series!
> On Tue 22-10-24 19:10:32, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@huawei.com>
>>
>> When zeroing a range of folios on the filesystem which block size is
>> less than the page size, the file's mapped partial blocks within one
>> page will be marked as unwritten, we should remove writable userspace
>> mappings to ensure that ext4_page_mkwrite() can be called during
>> subsequent write access to these folios. Otherwise, data written by
>> subsequent mmap writes may not be saved to disk.
>>
>> $mkfs.ext4 -b 1024 /dev/vdb
>> $mount /dev/vdb /mnt
>> $xfs_io -t -f -c "pwrite -S 0x58 0 4096" -c "mmap -rw 0 4096" \
>> -c "mwrite -S 0x5a 2048 2048" -c "fzero 2048 2048" \
>> -c "mwrite -S 0x59 2048 2048" -c "close" /mnt/foo
>>
>> $od -Ax -t x1z /mnt/foo
>> 000000 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58
>> *
>> 000800 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59
>> *
>> 001000
>>
>> $umount /mnt && mount /dev/vdb /mnt
>> $od -Ax -t x1z /mnt/foo
>> 000000 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58
>> *
>> 000800 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> *
>> 001000
>>
>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>
> This is a great catch! I think this may be source of the sporadic data
> corruption issues we observe with blocksize < pagesize.
>
>> +static inline void ext4_truncate_folio(struct inode *inode,
>> + loff_t start, loff_t end)
>> +{
>> + unsigned long blocksize = i_blocksize(inode);
>> + struct folio *folio;
>> +
>> + if (round_up(start, blocksize) >= round_down(end, blocksize))
>> + return;
>> +
>> + folio = filemap_lock_folio(inode->i_mapping, start >> PAGE_SHIFT);
>> + if (IS_ERR(folio))
>> + return;
>> +
>> + if (folio_mkclean(folio))
>> + folio_mark_dirty(folio);
>> + folio_unlock(folio);
>> + folio_put(folio);
>
> I don't think this is enough. In your example from the changelog, this would
> leave the page at index 0 dirty and still with 0x5a values in 2048-4096 range.
> Then truncate_pagecache_range() does nothing, ext4_alloc_file_blocks()
> converts blocks under 2048-4096 to unwritten state. But what handles
> zeroing of page cache in 2048-4096 range? ext4_zero_partial_blocks() zeroes
> only partial blocks, not full blocks. Am I missing something?
>
Sorry, I don't understand why truncate_pagecache_range() would do nothing.
In my example, the variable 'start' is 2048, the variable 'end' is 4096, and
the call chain truncate_pagecache_range(inode, 2048, 4096-1)->..->
truncate_inode_partial_folio()->folio_zero_range() does zero the 2048-4096
range. I also tested it below, and it was zeroed.
xfs_io -t -f -c "pwrite -S 0x58 0 4096" -c "mmap -rw 0 4096" \
-c "mwrite -S 0x5a 2048 2048" \
-c "fzero 2048 2048" -c "close" /mnt/foo
od -Ax -t x1z /mnt/foo
000000 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 >XXXXXXXXXXXXXXXX<
*
000800 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
*
001000
> If I'm right, I'd keep it simple and just writeout these partial folios with
> filemap_write_and_wait_range() and expand the range
> truncate_pagecache_range() removes to include these partial folios. The
What I mean is that truncate_pagecache_range() has already covered the partial
folios, right?
> overhead won't be big and it isn't like this is some very performance
> sensitive path.
>
>> +}
>> +
>> +/*
>> + * When truncating a range of folios, if the block size is less than the
>> + * page size, the file's mapped partial blocks within one page could be
>> + * freed or converted to unwritten. We should call this function to remove
>> + * writable userspace mappings so that ext4_page_mkwrite() can be called
>> + * during subsequent write access to these folios.
>> + */
>> +void ext4_truncate_folios_range(struct inode *inode, loff_t start, loff_t end)
>
> Maybe call this ext4_truncate_page_cache_block_range()? And assert that
> start & end are block aligned. Because this essentially prepares page cache
> for manipulation with a block range.
Ha, it's a good idea. I agree with you that we should move
truncate_pagecache_range() and the hunk of flushing in journal data mode into
this function. But I don't understand why we should assert that 'start & end'
are block aligned. I think ext4_truncate_page_cache_block_range() should allow
passing unaligned input parameters and align them itself; especially after
patches 04 and 05, ext4_zero_range() and ext4_punch_hole() will pass offset
and offset+len directly, which may be block-unaligned.
Thanks,
Yi.
>
>> +{
>> + unsigned long blocksize = i_blocksize(inode);
>> +
>> + if (end > inode->i_size)
>> + end = inode->i_size;
>> + if (start >= end || blocksize >= PAGE_SIZE)
>> + return;
>> +
>> + ext4_truncate_folio(inode, start, min(round_up(start, PAGE_SIZE), end));
>> + if (end > round_up(start, PAGE_SIZE))
>> + ext4_truncate_folio(inode, round_down(end, PAGE_SIZE), end);
>> +}
>
> So I'd move the following truncate_pagecache_range() into
> ext4_truncate_folios_range(). And also the preceding:
>
> /*
> * For journalled data we need to write (and checkpoint) pages
> * before discarding page cache to avoid inconsitent data on
> * disk in case of crash before zeroing trans is committed.
> */
> if (ext4_should_journal_data(inode)) {
> ret = filemap_write_and_wait_range(mapping, start,
> end - 1);
> ...
>
> into this function. So that it can be self-contained "do the right thing
> with page cache to prepare for block range manipulations".
>
> Honza
* Re: [PATCH 05/27] ext4: refactor ext4_zero_range()
2024-12-04 11:52 ` Jan Kara
@ 2024-12-06 8:09 ` Zhang Yi
0 siblings, 0 replies; 59+ messages in thread
From: Zhang Yi @ 2024-12-06 8:09 UTC (permalink / raw)
To: Jan Kara
Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
ritesh.list, hch, djwong, david, zokeefe, yi.zhang, chengzhihao1,
yukuai3, yangerkun
On 2024/12/4 19:52, Jan Kara wrote:
> On Tue 22-10-24 19:10:36, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@huawei.com>
>>
>> The current implementation of ext4_zero_range() contains complex
>> position calculations and stale error tags. To improve the code's
>> clarity and maintainability, it is essential to clean up the code and
>> improve its readability, this can be achieved by: a) simplifying and
>> renaming variables, making the style the same as ext4_punch_hole(); b)
>> eliminating unnecessary position calculations, writing back all data in
>> data=journal mode, and drop page cache from the original offset to the
>> end, rather than using aligned blocks; c) renaming the stale out_mutex
>> tags.
>>
>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>
> ...
>
>> - goto out_mutex;
>> -
>> - /* Preallocate the range including the unaligned edges */
>> - if (partial_begin || partial_end) {
>> - ret = ext4_alloc_file_blocks(file,
>> - round_down(offset, 1 << blkbits) >> blkbits,
>> - (round_up((offset + len), 1 << blkbits) -
>> - round_down(offset, 1 << blkbits)) >> blkbits,
>> - new_size, flags);
>> - if (ret)
>> - goto out_mutex;
>> -
>> - }
>
> So I think we should keep this first ext4_alloc_file_blocks() call before
> we truncate the page cache. Otherwise if ext4_alloc_file_blocks() fails due
> to ENOSPC, we have already lost the dirty data originally in the zeroed
> range. All the other failure modes are kind of catastrophic anyway, so they
> are fine after dropping the page cache. But this is can be quite common and
> should be handled more gracefully.
>
Ha, right, I missed this error case; I will revise it.
Thanks,
Yi.
* Re: [PATCH 09/27] ext4: move out inode_lock into ext4_fallocate()
2024-12-04 12:05 ` Jan Kara
@ 2024-12-06 8:13 ` Zhang Yi
2024-12-06 15:51 ` Jan Kara
0 siblings, 1 reply; 59+ messages in thread
From: Zhang Yi @ 2024-12-06 8:13 UTC (permalink / raw)
To: Jan Kara
Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
ritesh.list, hch, djwong, david, zokeefe, yi.zhang, chengzhihao1,
yukuai3, yangerkun
On 2024/12/4 20:05, Jan Kara wrote:
> On Tue 22-10-24 19:10:40, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@huawei.com>
>>
>> Currently, all five sub-functions of ext4_fallocate() acquire the
>> inode's i_rwsem at the beginning and release it before exiting. This
>> process can be simplified by factoring out the management of i_rwsem
>> into the ext4_fallocate() function.
>>
>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>
> Ah, nice. Feel free to add:
>
> Reviewed-by: Jan Kara <jack@suse.cz>
>
> and please ignore my comments about renaming 'out' labels :).
>
> Honza
>
...
>> @@ -4774,9 +4765,8 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
>>
>> inode_lock(inode);
>> ret = ext4_convert_inline_data(inode);
>> - inode_unlock(inode);
>> if (ret)
>> - return ret;
>> + goto out;
>>
>> if (mode & FALLOC_FL_PUNCH_HOLE)
>> ret = ext4_punch_hole(file, offset, len);
>> @@ -4788,7 +4778,8 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
>> ret = ext4_zero_range(file, offset, len, mode);
>> else
>> ret = ext4_do_fallocate(file, offset, len, mode);
>> -
>> +out:
>> + inode_unlock(inode);
>> return ret;
>> }
>>
I guess you may want to suggest renaming this 'out' to 'out_inode_lock' as well.
Thanks,
Yi.
* Re: [PATCH 12/27] ext4: introduce seq counter for the extent status entry
2024-12-04 12:42 ` Jan Kara
@ 2024-12-06 8:55 ` Zhang Yi
2024-12-06 16:21 ` Jan Kara
0 siblings, 1 reply; 59+ messages in thread
From: Zhang Yi @ 2024-12-06 8:55 UTC (permalink / raw)
To: Jan Kara
Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
ritesh.list, hch, djwong, david, zokeefe, yi.zhang, chengzhihao1,
yukuai3, yangerkun
On 2024/12/4 20:42, Jan Kara wrote:
> On Tue 22-10-24 19:10:43, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@huawei.com>
>>
>> In the iomap_write_iter(), the iomap buffered write frame does not hold
>> any locks between querying the inode extent mapping info and performing
>> page cache writes. As a result, the extent mapping can be changed due to
>> concurrent I/O in flight. Similarly, in the iomap_writepage_map(), the
>> write-back process faces a similar problem: concurrent changes can
>> invalidate the extent mapping before the I/O is submitted.
>>
>> Therefore, both of these processes must recheck the mapping info after
>> acquiring the folio lock. To address this, similar to XFS, we propose
>> introducing an extent sequence number to serve as a validity cookie for
>> the extent. We will increment this number whenever the extent status
>> tree changes, thereby preparing for the buffered write iomap conversion.
>> Besides, it also changes the trace code style to make checkpatch.pl
>> happy.
>>
>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>
> Overall using some sequence counter makes sense.
>
>> diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
>> index c786691dabd3..bea4f87db502 100644
>> --- a/fs/ext4/extents_status.c
>> +++ b/fs/ext4/extents_status.c
>> @@ -204,6 +204,13 @@ static inline ext4_lblk_t ext4_es_end(struct extent_status *es)
>> return es->es_lblk + es->es_len - 1;
>> }
>>
>> +static inline void ext4_es_inc_seq(struct inode *inode)
>> +{
>> + struct ext4_inode_info *ei = EXT4_I(inode);
>> +
>> + WRITE_ONCE(ei->i_es_seq, READ_ONCE(ei->i_es_seq) + 1);
>> +}
>
> This looks potentially dangerous because we can loose i_es_seq updates this
> way. Like
>
> CPU1 CPU2
> x = READ_ONCE(ei->i_es_seq)
> x = READ_ONCE(ei->i_es_seq)
> WRITE_ONCE(ei->i_es_seq, x + 1)
> ...
> potentially many times
> WRITE_ONCE(ei->i_es_seq, x + 1)
> -> the counter goes back leading to possibly false equality checks
>
In my current implementation, I don't think this race condition can
happen, since all ext4_es_inc_seq() invocations are under
EXT4_I(inode)->i_es_lock. So I think it works fine now, or have I
missed something?
> I think you'll need to use atomic_t and appropriate functions here.
>
>> @@ -872,6 +879,7 @@ void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
>> BUG_ON(end < lblk);
>> WARN_ON_ONCE(status & EXTENT_STATUS_DELAYED);
>>
>> + ext4_es_inc_seq(inode);
>
> I'm somewhat wondering: Are extent status tree modifications the right
> place to advance the sequence counter? The counter needs to advance
> whenever the mapping information changes. This means that we'd be
> needlessly advancing the counter (and thus possibly forcing retries) when
> we are just adding new information from ordinary extent tree into cache.
> Also someone can be doing extent tree manipulations without touching extent
> status tree (if the information was already pruned from there).
Sorry, I don't quite understand here. IIUC, we can't modify the extent
tree without also touching the extent status tree; otherwise, the extent
status tree will become stale, potentially leading to undesirable and
unexpected outcomes later on, as the extent lookup paths rely on and
always trust the status tree. If this situation happens, would it be
considered a bug? Additionally, I have checked the code but didn't find
any concrete case where this could happen. Have I overlooked something?
> So I think
> needs some very good documentation what are the expectations from the
> sequence counter and explanations why they are satisfied so that we don't
> break this in the future.
>
Yeah, it's a good suggestion. Where do you suggest putting this
documentation? How about at the front of extents_status.c?
Thanks,
Yi.
* Re: [PATCH 01/27] ext4: remove writable userspace mappings before truncating page cache
2024-12-06 7:59 ` Zhang Yi
@ 2024-12-06 15:49 ` Jan Kara
0 siblings, 0 replies; 59+ messages in thread
From: Jan Kara @ 2024-12-06 15:49 UTC (permalink / raw)
To: Zhang Yi
Cc: Jan Kara, linux-ext4, linux-fsdevel, linux-kernel, tytso,
adilger.kernel, ritesh.list, hch, djwong, david, zokeefe,
yi.zhang, chengzhihao1, yukuai3, yangerkun
On Fri 06-12-24 15:59:44, Zhang Yi wrote:
> On 2024/12/4 19:13, Jan Kara wrote:
> > On Tue 22-10-24 19:10:32, Zhang Yi wrote:
> >> +static inline void ext4_truncate_folio(struct inode *inode,
> >> + loff_t start, loff_t end)
> >> +{
> >> + unsigned long blocksize = i_blocksize(inode);
> >> + struct folio *folio;
> >> +
> >> + if (round_up(start, blocksize) >= round_down(end, blocksize))
> >> + return;
> >> +
> >> + folio = filemap_lock_folio(inode->i_mapping, start >> PAGE_SHIFT);
> >> + if (IS_ERR(folio))
> >> + return;
> >> +
> >> + if (folio_mkclean(folio))
> >> + folio_mark_dirty(folio);
> >> + folio_unlock(folio);
> >> + folio_put(folio);
> >
> > I don't think this is enough. In your example from the changelog, this would
> > leave the page at index 0 dirty and still with 0x5a values in 2048-4096 range.
> > Then truncate_pagecache_range() does nothing, ext4_alloc_file_blocks()
> > converts blocks under 2048-4096 to unwritten state. But what handles
> > zeroing of page cache in 2048-4096 range? ext4_zero_partial_blocks() zeroes
> > only partial blocks, not full blocks. Am I missing something?
> >
>
> Sorry, I don't understand why truncate_pagecache_range() does nothing? In my
> example, the variable 'start' is 2048, the variable 'end' is 4096, and the
> call process truncate_pagecache_range(inode, 2048, 4096-1)->..->
> truncate_inode_partial_folio()->folio_zero_range() does zeroing the 2048-4096
> range. I also tested it below, it was zeroed.
>
> xfs_io -t -f -c "pwrite -S 0x58 0 4096" -c "mmap -rw 0 4096" \
> -c "mwrite -S 0x5a 2048 2048" \
> -c "fzero 2048 2048" -c "close" /mnt/foo
>
> od -Ax -t x1z /mnt/foo
> 000000 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 >XXXXXXXXXXXXXXXX<
> *
> 000800 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
> *
> 001000
Yeah, sorry, I got totally confused here. truncate_pagecache_range()
indeed does all the zeroing we need. Your version of ext4_truncate_folio()
should do the right thing.
> > If I'm right, I'd keep it simple and just writeout these partial folios with
> > filemap_write_and_wait_range() and expand the range
> > truncate_pagecache_range() removes to include these partial folios. The
>
> What I mean is the truncate_pagecache_range() has already covered the partial
> folios. right?
Right, it should cover the partial folios.
> > overhead won't be big and it isn't like this is some very performance
> > sensitive path.
> >
> >> +}
> >> +
> >> +/*
> >> + * When truncating a range of folios, if the block size is less than the
> >> + * page size, the file's mapped partial blocks within one page could be
> >> + * freed or converted to unwritten. We should call this function to remove
> >> + * writable userspace mappings so that ext4_page_mkwrite() can be called
> >> + * during subsequent write access to these folios.
> >> + */
> >> +void ext4_truncate_folios_range(struct inode *inode, loff_t start, loff_t end)
> >
> > Maybe call this ext4_truncate_page_cache_block_range()? And assert that
> > start & end are block aligned. Because this essentially prepares page cache
> > for manipulation with a block range.
>
> Ha, it's a good idea, I agree with you that move truncate_pagecache_range()
> and the hunk of flushing in journal data mode into this function. But I don't
> understand why assert that 'start & end' are block aligned?
Yes, that shouldn't be needed since truncate_pagecache_range() will do the
right thing.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH 09/27] ext4: move out inode_lock into ext4_fallocate()
2024-12-06 8:13 ` Zhang Yi
@ 2024-12-06 15:51 ` Jan Kara
0 siblings, 0 replies; 59+ messages in thread
From: Jan Kara @ 2024-12-06 15:51 UTC (permalink / raw)
To: Zhang Yi
Cc: Jan Kara, linux-ext4, linux-fsdevel, linux-kernel, tytso,
adilger.kernel, ritesh.list, hch, djwong, david, zokeefe,
yi.zhang, chengzhihao1, yukuai3, yangerkun
On Fri 06-12-24 16:13:14, Zhang Yi wrote:
> On 2024/12/4 20:05, Jan Kara wrote:
> > On Tue 22-10-24 19:10:40, Zhang Yi wrote:
> >> From: Zhang Yi <yi.zhang@huawei.com>
> >>
> >> Currently, all five sub-functions of ext4_fallocate() acquire the
> >> inode's i_rwsem at the beginning and release it before exiting. This
> >> process can be simplified by factoring out the management of i_rwsem
> >> into the ext4_fallocate() function.
> >>
> >> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> >
> > Ah, nice. Feel free to add:
> >
> > Reviewed-by: Jan Kara <jack@suse.cz>
> >
> > and please ignore my comments about renaming 'out' labels :).
> >
> > Honza
> >
>
> ...
>
> >> @@ -4774,9 +4765,8 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
> >>
> >> inode_lock(inode);
> >> ret = ext4_convert_inline_data(inode);
> >> - inode_unlock(inode);
> >> if (ret)
> >> - return ret;
> >> + goto out;
> >>
> >> if (mode & FALLOC_FL_PUNCH_HOLE)
> >> ret = ext4_punch_hole(file, offset, len);
> >> @@ -4788,7 +4778,8 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
> >> ret = ext4_zero_range(file, offset, len, mode);
> >> else
> >> ret = ext4_do_fallocate(file, offset, len, mode);
> >> -
> >> +out:
> >> + inode_unlock(inode);
> >> return ret;
> >> }
> >>
>
> I guess you may want to suggest rename this out to out_inode_lock as well.
Right. This one should better be out_inode_lock.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH 12/27] ext4: introduce seq counter for the extent status entry
2024-12-06 8:55 ` Zhang Yi
@ 2024-12-06 16:21 ` Jan Kara
2024-12-09 8:32 ` Zhang Yi
0 siblings, 1 reply; 59+ messages in thread
From: Jan Kara @ 2024-12-06 16:21 UTC (permalink / raw)
To: Zhang Yi
Cc: Jan Kara, linux-ext4, linux-fsdevel, linux-kernel, tytso,
adilger.kernel, ritesh.list, hch, djwong, david, zokeefe,
yi.zhang, chengzhihao1, yukuai3, yangerkun
On Fri 06-12-24 16:55:01, Zhang Yi wrote:
> On 2024/12/4 20:42, Jan Kara wrote:
> > On Tue 22-10-24 19:10:43, Zhang Yi wrote:
> >> From: Zhang Yi <yi.zhang@huawei.com>
> >>
> >> In the iomap_write_iter(), the iomap buffered write frame does not hold
> >> any locks between querying the inode extent mapping info and performing
> >> page cache writes. As a result, the extent mapping can be changed due to
> >> concurrent I/O in flight. Similarly, in the iomap_writepage_map(), the
> >> write-back process faces a similar problem: concurrent changes can
> >> invalidate the extent mapping before the I/O is submitted.
> >>
> >> Therefore, both of these processes must recheck the mapping info after
> >> acquiring the folio lock. To address this, similar to XFS, we propose
> >> introducing an extent sequence number to serve as a validity cookie for
> >> the extent. We will increment this number whenever the extent status
> >> tree changes, thereby preparing for the buffered write iomap conversion.
> >> Besides, it also changes the trace code style to make checkpatch.pl
> >> happy.
> >>
> >> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> >
> > Overall using some sequence counter makes sense.
> >
> >> diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
> >> index c786691dabd3..bea4f87db502 100644
> >> --- a/fs/ext4/extents_status.c
> >> +++ b/fs/ext4/extents_status.c
> >> @@ -204,6 +204,13 @@ static inline ext4_lblk_t ext4_es_end(struct extent_status *es)
> >> return es->es_lblk + es->es_len - 1;
> >> }
> >>
> >> +static inline void ext4_es_inc_seq(struct inode *inode)
> >> +{
> >> + struct ext4_inode_info *ei = EXT4_I(inode);
> >> +
> >> + WRITE_ONCE(ei->i_es_seq, READ_ONCE(ei->i_es_seq) + 1);
> >> +}
> >
> > This looks potentially dangerous because we can loose i_es_seq updates this
> > way. Like
> >
> > CPU1 CPU2
> > x = READ_ONCE(ei->i_es_seq)
> > x = READ_ONCE(ei->i_es_seq)
> > WRITE_ONCE(ei->i_es_seq, x + 1)
> > ...
> > potentially many times
> > WRITE_ONCE(ei->i_es_seq, x + 1)
> > -> the counter goes back leading to possibly false equality checks
> >
>
> In my current implementation, I don't think this race condition can
> happen since all ext4_es_inc_seq() invocations are under
> EXT4_I(inode)->i_es_lock. So I think it works fine now, or was I
> missed something?
Hum, as far as I've checked, at least the place in ext4_es_insert_extent()
where you call ext4_es_inc_seq() doesn't hold i_es_lock (yet). If you meant
to protect the updates by i_es_lock, then move the call sites and please
add a comment about it. Also it should be enough to do:
WRITE_ONCE(ei->i_es_seq, ei->i_es_seq + 1);
since we cannot be really racing with other writers.
> > I think you'll need to use atomic_t and appropriate functions here.
> >
> >> @@ -872,6 +879,7 @@ void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
> >> BUG_ON(end < lblk);
> >> WARN_ON_ONCE(status & EXTENT_STATUS_DELAYED);
> >>
> >> + ext4_es_inc_seq(inode);
> >
> > I'm somewhat wondering: Are extent status tree modifications the right
> > place to advance the sequence counter? The counter needs to advance
> > whenever the mapping information changes. This means that we'd be
> > needlessly advancing the counter (and thus possibly forcing retries) when
> > we are just adding new information from ordinary extent tree into cache.
> > Also someone can be doing extent tree manipulations without touching extent
> > status tree (if the information was already pruned from there).
>
> Sorry, I don't quite understand here. IIUC, we can't modify the extent
> tree without also touching extent status tree; otherwise, the extent
> status tree will become stale, potentially leading to undesirable and
> unexpected outcomes later on, as the extent lookup paths rely on and
> always trust the status tree. If this situation happens, would it be
> considered a bug? Additionally, I have checked the code but didn't find
> any concrete cases where this could happen. Was I overlooked something?
What I'm worried about is that this seems a bit fragile because e.g. in
ext4_collapse_range() we do:
ext4_es_remove_extent(inode, start, EXT_MAX_BLOCKS - start)
<now go and manipulate the extent tree>
So if somebody managed to sneak in between ext4_es_remove_extent() and
the extent tree manipulation, he could get a block mapping which is shortly
after invalidated by the extent tree changes. And as I'm checking now,
writeback code *can* sneak in there because during extent tree
manipulations we call ext4_datasem_ensure_credits() which can drop
i_data_sem to restart a transaction.
Now we do writeout & invalidate page cache before we start to do these
extent tree dances so I don't see how this could lead to *actual* use
after free issues but it makes me somewhat nervous. So that's why I'd like
to have some clear rules from which it is obvious that the counter makes
sure we do not use stale mappings.
> > So I think
> > needs some very good documentation what are the expectations from the
> > sequence counter and explanations why they are satisfied so that we don't
> > break this in the future.
>
> Yeah, it's a good suggestion, where do you suggest putting this
> documentation, how about in the front of extents_status.c?
I think at the function incrementing the counter would be fine.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH 12/27] ext4: introduce seq counter for the extent status entry
2024-12-06 16:21 ` Jan Kara
@ 2024-12-09 8:32 ` Zhang Yi
2024-12-10 12:57 ` Jan Kara
0 siblings, 1 reply; 59+ messages in thread
From: Zhang Yi @ 2024-12-09 8:32 UTC (permalink / raw)
To: Jan Kara
Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
ritesh.list, hch, djwong, david, zokeefe, yi.zhang, chengzhihao1,
yukuai3, yangerkun
On 2024/12/7 0:21, Jan Kara wrote:
> On Fri 06-12-24 16:55:01, Zhang Yi wrote:
>> On 2024/12/4 20:42, Jan Kara wrote:
>>> On Tue 22-10-24 19:10:43, Zhang Yi wrote:
>>>> From: Zhang Yi <yi.zhang@huawei.com>
>>>>
>>>> In the iomap_write_iter(), the iomap buffered write frame does not hold
>>>> any locks between querying the inode extent mapping info and performing
>>>> page cache writes. As a result, the extent mapping can be changed due to
>>>> concurrent I/O in flight. Similarly, in the iomap_writepage_map(), the
>>>> write-back process faces a similar problem: concurrent changes can
>>>> invalidate the extent mapping before the I/O is submitted.
>>>>
>>>> Therefore, both of these processes must recheck the mapping info after
>>>> acquiring the folio lock. To address this, similar to XFS, we propose
>>>> introducing an extent sequence number to serve as a validity cookie for
>>>> the extent. We will increment this number whenever the extent status
>>>> tree changes, thereby preparing for the buffered write iomap conversion.
>>>> Besides, it also changes the trace code style to make checkpatch.pl
>>>> happy.
>>>>
>>>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>>>
>>> Overall using some sequence counter makes sense.
>>>
>>>> diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
>>>> index c786691dabd3..bea4f87db502 100644
>>>> --- a/fs/ext4/extents_status.c
>>>> +++ b/fs/ext4/extents_status.c
>>>> @@ -204,6 +204,13 @@ static inline ext4_lblk_t ext4_es_end(struct extent_status *es)
>>>> return es->es_lblk + es->es_len - 1;
>>>> }
>>>>
>>>> +static inline void ext4_es_inc_seq(struct inode *inode)
>>>> +{
>>>> + struct ext4_inode_info *ei = EXT4_I(inode);
>>>> +
>>>> + WRITE_ONCE(ei->i_es_seq, READ_ONCE(ei->i_es_seq) + 1);
>>>> +}
>>>
>>> This looks potentially dangerous because we can loose i_es_seq updates this
>>> way. Like
>>>
>>> CPU1 CPU2
>>> x = READ_ONCE(ei->i_es_seq)
>>> x = READ_ONCE(ei->i_es_seq)
>>> WRITE_ONCE(ei->i_es_seq, x + 1)
>>> ...
>>> potentially many times
>>> WRITE_ONCE(ei->i_es_seq, x + 1)
>>> -> the counter goes back leading to possibly false equality checks
>>>
>>
>> In my current implementation, I don't think this race condition can
>> happen since all ext4_es_inc_seq() invocations are under
>> EXT4_I(inode)->i_es_lock. So I think it works fine now, or was I
>> missed something?
>
> Hum, as far as I've checked, at least the place in ext4_es_insert_extent()
> where you call ext4_es_inc_seq() doesn't hold i_es_lock (yet). If you meant
> to protect the updates by i_es_lock, then move the call sites and please
> add a comment about it. Also it should be enough to do:
>
> WRITE_ONCE(ei->i_es_seq, ei->i_es_seq + 1);
>
> since we cannot be really racing with other writers.
Oh, sorry, I mentioned the wrong lock. What I intended to say is
i_data_sem.
Currently, all instances where we update the extent status tree hold
i_data_sem in write mode, preventing any race conditions in those
scenarios. However, we may hold i_data_sem in read mode while loading
a new entry from the extent tree (e.g., ext4_map_query_blocks()). In
these cases, a race condition could occur, but it doesn't modify the
extents, and the newly loaded range should not be related to the
mapping range we obtained (if it overlaps with the range we have, it
must first remove the old extent status entry, which is protected by
i_data_sem, ensuring that i_es_seq increases by at least one).
Therefore, it should not use a stale mapping or trigger any real issues.
However, after thinking about it again, I agree with you that this
approach is subtle, fragile, and hard to understand, so now I think
we should move it under i_es_lock.
>
>>> I think you'll need to use atomic_t and appropriate functions here.
>>>
>>>> @@ -872,6 +879,7 @@ void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
>>>> BUG_ON(end < lblk);
>>>> WARN_ON_ONCE(status & EXTENT_STATUS_DELAYED);
>>>>
>>>> + ext4_es_inc_seq(inode);
>>>
>>> I'm somewhat wondering: Are extent status tree modifications the right
>>> place to advance the sequence counter? The counter needs to advance
>>> whenever the mapping information changes. This means that we'd be
>>> needlessly advancing the counter (and thus possibly forcing retries) when
>>> we are just adding new information from ordinary extent tree into cache.
>>> Also someone can be doing extent tree manipulations without touching extent
>>> status tree (if the information was already pruned from there).
>>
>> Sorry, I don't quite understand here. IIUC, we can't modify the extent
>> tree without also touching extent status tree; otherwise, the extent
>> status tree will become stale, potentially leading to undesirable and
>> unexpected outcomes later on, as the extent lookup paths rely on and
>> always trust the status tree. If this situation happens, would it be
>> considered a bug? Additionally, I have checked the code but didn't find
>> any concrete cases where this could happen. Was I overlooked something?
>
> What I'm worried about is that this seems a bit fragile because e.g. in
> ext4_collapse_range() we do:
>
> ext4_es_remove_extent(inode, start, EXT_MAX_BLOCKS - start)
> <now go and manipulate the extent tree>
>
> So if somebody managed to sneak in between ext4_es_remove_extent() and
> the extent tree manipulation, he could get a block mapping which is shortly
> after invalidated by the extent tree changes. And as I'm checking now,
> writeback code *can* sneak in there because during extent tree
> manipulations we call ext4_datasem_ensure_credits() which can drop
> i_data_sem to restart a transaction.
>
> Now we do writeout & invalidate page cache before we start to do these
> extent tree dances so I don't see how this could lead to *actual* use
> after free issues but it makes me somewhat nervous. So that's why I'd like
> to have some clear rules from which it is obvious that the counter makes
> sure we do not use stale mappings.
Yes, I see. I think the rules should be as follows:
First, when the iomap infrastructure is creating or querying file
mapping information, we must ensure that the mapping information
always passes through the extent status tree, which means
ext4_map_blocks(), ext4_map_query_blocks(), and
ext4_map_create_blocks() should cache the extent status entries that
we intend to use.
Second, when updating the extent tree, we must hold i_data_sem in
write mode and update the extent status tree atomically. Additionally,
if we cannot update the extent tree while holding a single i_data_sem,
we should first remove all related extent status entries within the
specified range, then manipulate the extent tree, ensuring that the
extent status entries are always up to date if they exist (as
ext4_collapse_range() does).
Finally, if we want to manipulate the extent tree without caching, we
should also remove the extent status entries first.
In summary, ensure that the extent status tree and the extent tree
stay consistent under one i_data_sem. If we can't, remove the extent
status entries before manipulating the extent tree.
Do you agree?
>
>>> So I think
>>> needs some very good documentation what are the expectations from the
>>> sequence counter and explanations why they are satisfied so that we don't
>>> break this in the future.
>>
>> Yeah, it's a good suggestion, where do you suggest putting this
>> documentation, how about in the front of extents_status.c?
>
> I think at the function incrementing the counter would be fine.
>
Sure, thanks for pointing this out.
Thanks,
Yi.
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 12/27] ext4: introduce seq counter for the extent status entry
2024-12-09 8:32 ` Zhang Yi
@ 2024-12-10 12:57 ` Jan Kara
2024-12-11 7:59 ` Zhang Yi
0 siblings, 1 reply; 59+ messages in thread
From: Jan Kara @ 2024-12-10 12:57 UTC (permalink / raw)
To: Zhang Yi
Cc: Jan Kara, linux-ext4, linux-fsdevel, linux-kernel, tytso,
adilger.kernel, ritesh.list, hch, djwong, david, zokeefe,
yi.zhang, chengzhihao1, yukuai3, yangerkun
On Mon 09-12-24 16:32:41, Zhang Yi wrote:
> On 2024/12/7 0:21, Jan Kara wrote:
> >>> I think you'll need to use atomic_t and appropriate functions here.
> >>>
> >>>> @@ -872,6 +879,7 @@ void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
> >>>> BUG_ON(end < lblk);
> >>>> WARN_ON_ONCE(status & EXTENT_STATUS_DELAYED);
> >>>>
> >>>> + ext4_es_inc_seq(inode);
> >>>
> >>> I'm somewhat wondering: Are extent status tree modifications the right
> >>> place to advance the sequence counter? The counter needs to advance
> >>> whenever the mapping information changes. This means that we'd be
> >>> needlessly advancing the counter (and thus possibly forcing retries) when
> >>> we are just adding new information from ordinary extent tree into cache.
> >>> Also someone can be doing extent tree manipulations without touching extent
> >>> status tree (if the information was already pruned from there).
> >>
> >> Sorry, I don't quite understand here. IIUC, we can't modify the extent
> >> tree without also touching extent status tree; otherwise, the extent
> >> status tree will become stale, potentially leading to undesirable and
> >> unexpected outcomes later on, as the extent lookup paths rely on and
> >> always trust the status tree. If this situation happens, would it be
> >> considered a bug? Additionally, I have checked the code but didn't find
> >> any concrete cases where this could happen. Have I overlooked something?
> >
> > What I'm worried about is that this seems a bit fragile because e.g. in
> > ext4_collapse_range() we do:
> >
> > ext4_es_remove_extent(inode, start, EXT_MAX_BLOCKS - start)
> > <now go and manipulate the extent tree>
> >
> > So if somebody managed to sneak in between ext4_es_remove_extent() and
> > the extent tree manipulation, he could get a block mapping which is shortly
> > after invalidated by the extent tree changes. And as I'm checking now,
> > writeback code *can* sneak in there because during extent tree
> > manipulations we call ext4_datasem_ensure_credits() which can drop
> > i_data_sem to restart a transaction.
> >
> > Now we do writeout & invalidate page cache before we start to do these
> > extent tree dances so I don't see how this could lead to *actual* use
> > after free issues but it makes me somewhat nervous. So that's why I'd like
> > to have some clear rules from which it is obvious that the counter makes
> > sure we do not use stale mappings.
>
> Yes, I see. I think the rule should be as follows:
>
> First, when the iomap infrastructure is creating or querying file
> mapping information, we must ensure that the mapping information
> always passes through the extent status tree, which means
> ext4_map_blocks(), ext4_map_query_blocks(), and
> ext4_map_create_blocks() should cache the extent status entries that
> we intend to use.
OK, this currently holds. There's just one snag that during fastcommit
replay ext4_es_insert_extent() doesn't do anything. I don't think there's
any race possible during that stage but it's another case to think about.
> Second, when updating the extent tree, we must hold the i_data_sem in
> write mode and update the extent status tree atomically.
Fine.
> Additionally,
> if we cannot update the extent tree while holding a single i_data_sem,
> we should first remove all related extent status entries within the
> specified range, then manipulate the extent tree, ensuring that the
> extent status entries are always up-to-date if they exist (as
> ext4_collapse_range() does).
In this case, I think we need to provide more details. In particular I
would require that in all such cases we must:
a) hold i_rwsem exclusively and hold invalidate_lock exclusively ->
provides exclusion against page faults, reads, writes
b) evict all page cache in the affected range -> should stop writeback -
*but* currently there's one case which could be problematic. Assume we
do punch hole 0..N and the page at N+1 is dirty. Punch hole does all of
the above and starts removing blocks, needs to restart transaction so it
drops i_data_sem. Writeback starts for page N+1, needs to load extent
block into memory, ext4_cache_extents() now loads back some extents
covering range 0..N into extent status tree. So the only protection
against using freed blocks is that nobody should be mapping anything in
the range 0..N because we hold those locks & have evicted page cache.
So I think we need to also document that anybody mapping blocks needs to
hold i_rwsem or invalidate_lock or a page lock, ideally asserting that in
ext4_map_blocks() to catch cases we missed. Asserting for the page lock will
not really be doable, but luckily only page writeback needs that, so that can
get some exemption from the assert.
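A toy model of the assertion proposed here, with booleans standing in for real lockdep state (all names are illustrative; an actual implementation would use lockdep_assert-style annotations in ext4_map_blocks()):

```c
/* Model of "anybody mapping blocks needs i_rwsem, invalidate_lock,
 * or a page lock": writeback is the exempted page-lock case. */
#include <stdbool.h>

struct map_caller {
        bool holds_i_rwsem;
        bool holds_invalidate_lock;
        bool is_writeback;      /* exempted: holds page lock(s) instead */
};

/* Would the proposed WARN_ON-style assert stay quiet for this caller? */
bool map_blocks_locking_ok(const struct map_caller *c)
{
        return c->holds_i_rwsem || c->holds_invalidate_lock ||
               c->is_writeback;
}
```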
> Finally, if we want to manipulate the extent tree without caching, we
> should also remove the extent status entries first.
Based on the above, I don't think this is really needed. We only must make
sure that after all extent tree updates are done and before we release
invalidate_lock, all extents from extent status tree in the modified range
must be evicted / replaced to match reality.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH 12/27] ext4: introduce seq counter for the extent status entry
2024-12-10 12:57 ` Jan Kara
@ 2024-12-11 7:59 ` Zhang Yi
2024-12-11 16:00 ` Jan Kara
0 siblings, 1 reply; 59+ messages in thread
From: Zhang Yi @ 2024-12-11 7:59 UTC (permalink / raw)
To: Jan Kara
Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
ritesh.list, hch, djwong, david, zokeefe, yi.zhang, chengzhihao1,
yukuai3, yangerkun
On 2024/12/10 20:57, Jan Kara wrote:
> On Mon 09-12-24 16:32:41, Zhang Yi wrote:
>> On 2024/12/7 0:21, Jan Kara wrote:
>>>>> I think you'll need to use atomic_t and appropriate functions here.
>>>>>
>>>>>> @@ -872,6 +879,7 @@ void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
>>>>>> BUG_ON(end < lblk);
>>>>>> WARN_ON_ONCE(status & EXTENT_STATUS_DELAYED);
>>>>>>
>>>>>> + ext4_es_inc_seq(inode);
>>>>>
>>>>> I'm somewhat wondering: Are extent status tree modifications the right
>>>>> place to advance the sequence counter? The counter needs to advance
>>>>> whenever the mapping information changes. This means that we'd be
>>>>> needlessly advancing the counter (and thus possibly forcing retries) when
>>>>> we are just adding new information from ordinary extent tree into cache.
>>>>> Also someone can be doing extent tree manipulations without touching extent
>>>>> status tree (if the information was already pruned from there).
>>>>
>>>> Sorry, I don't quite understand here. IIUC, we can't modify the extent
>>>> tree without also touching extent status tree; otherwise, the extent
>>>> status tree will become stale, potentially leading to undesirable and
>>>> unexpected outcomes later on, as the extent lookup paths rely on and
>>>> always trust the status tree. If this situation happens, would it be
>>>> considered a bug? Additionally, I have checked the code but didn't find
>>>> any concrete cases where this could happen. Have I overlooked something?
>>>
>>> What I'm worried about is that this seems a bit fragile because e.g. in
>>> ext4_collapse_range() we do:
>>>
>>> ext4_es_remove_extent(inode, start, EXT_MAX_BLOCKS - start)
>>> <now go and manipulate the extent tree>
>>>
>>> So if somebody managed to sneak in between ext4_es_remove_extent() and
>>> the extent tree manipulation, he could get a block mapping which is shortly
>>> after invalidated by the extent tree changes. And as I'm checking now,
>>> writeback code *can* sneak in there because during extent tree
>>> manipulations we call ext4_datasem_ensure_credits() which can drop
>>> i_data_sem to restart a transaction.
>>>
>>> Now we do writeout & invalidate page cache before we start to do these
>>> extent tree dances so I don't see how this could lead to *actual* use
>>> after free issues but it makes me somewhat nervous. So that's why I'd like
>>> to have some clear rules from which it is obvious that the counter makes
>>> sure we do not use stale mappings.
>>
>> Yes, I see. I think the rule should be as follows:
>>
>> First, when the iomap infrastructure is creating or querying file
>> mapping information, we must ensure that the mapping information
>> always passes through the extent status tree, which means
>> ext4_map_blocks(), ext4_map_query_blocks(), and
>> ext4_map_create_blocks() should cache the extent status entries that
>> we intend to use.
>
> OK, this currently holds. There's just one snag that during fastcommit
> replay ext4_es_insert_extent() doesn't do anything. I don't think there's
> any race possible during that stage but it's another case to think about.
OK.
>
>> Second, when updating the extent tree, we must hold the i_data_sem in
>> write mode and update the extent status tree atomically.
>
> Fine.
>
>> Additionally,
>> if we cannot update the extent tree while holding a single i_data_sem,
>> we should first remove all related extent status entries within the
>> specified range, then manipulate the extent tree, ensuring that the
>> extent status entries are always up-to-date if they exist (as
>> ext4_collapse_range() does).
>
> In this case, I think we need to provide more details. In particular I
> would require that in all such cases we must:
> a) hold i_rwsem exclusively and hold invalidate_lock exclusively ->
> provides exclusion against page faults, reads, writes
Yes.
> b) evict all page cache in the affected range -> should stop writeback -
> *but* currently there's one case which could be problematic. Assume we
> do punch hole 0..N and the page at N+1 is dirty. Punch hole does all of
> the above and starts removing blocks, needs to restart transaction so it
> drops i_data_sem. Writeback starts for page N+1, needs to load extent
> block into memory, ext4_cache_extents() now loads back some extents
> covering range 0..N into extent status tree.
This completely confuses me. Are you referring to the case below?
There are many extent entries in the page range 0..N+1, for example,
0 N N+1
| |/
[www][wwwwww][wwwwwwww]...[wwwww][wwww]...
| |
N-a N-b
Punch hole removes the extent entries from N down to 0
(ext4_ext_remove_space() removes blocks from end to start), and could
drop i_data_sem just after manipulating (removing) the extent entry
[N-a, N-b]. At the same time, a concurrent writeback starts writing back
page N+1; since writeback only holds the page lock, and holds neither
i_rwsem nor the invalidate_lock, it may load the extents 0..N-a back
into the extent status tree while finding the extent that contains N+1?
Finally, it may leave some stale extent status entries after punch hole
is done?
If my understanding is correct, isn't that a problem that exists now?
I mean without this patch series.
> So the only protection
> against using freed blocks is that nobody should be mapping anything in
> the range 0..N because we hold those locks & have evicted page cache.
>
> So I think we need to also document, that anybody mapping blocks needs to
> hold i_rwsem or invalidate_lock or a page lock, ideally asserting that in
> ext4_map_blocks() to catch cases we missed. Asserting for page lock will
> not be really doable but luckily only page writeback needs that so that can
> get some exemption from the assert.
In the case above, it seems that merely holding a page lock is
insufficient?
>
>> Finally, if we want to manipulate the extent tree without caching, we
>> should also remove the extent status entries first.
>
> Based on the above, I don't think this is really needed. We only must make
> sure that after all extent tree updates are done and before we release
> invalidate_lock, all extents from extent status tree in the modified range
> must be evicted / replaced to match reality.
>
Yeah, I agree with you.
Thanks,
Yi.
* Re: [PATCH 12/27] ext4: introduce seq counter for the extent status entry
2024-12-11 7:59 ` Zhang Yi
@ 2024-12-11 16:00 ` Jan Kara
2024-12-12 2:32 ` Zhang Yi
0 siblings, 1 reply; 59+ messages in thread
From: Jan Kara @ 2024-12-11 16:00 UTC (permalink / raw)
To: Zhang Yi
Cc: Jan Kara, linux-ext4, linux-fsdevel, linux-kernel, tytso,
adilger.kernel, ritesh.list, hch, djwong, david, zokeefe,
yi.zhang, chengzhihao1, yukuai3, yangerkun
On Wed 11-12-24 15:59:51, Zhang Yi wrote:
> On 2024/12/10 20:57, Jan Kara wrote:
> > On Mon 09-12-24 16:32:41, Zhang Yi wrote:
> > b) evict all page cache in the affected range -> should stop writeback -
> > *but* currently there's one case which could be problematic. Assume we
> > do punch hole 0..N and the page at N+1 is dirty. Punch hole does all of
> > the above and starts removing blocks, needs to restart transaction so it
> > drops i_data_sem. Writeback starts for page N+1, needs to load extent
> > block into memory, ext4_cache_extents() now loads back some extents
> > covering range 0..N into extent status tree.
>
> This completely confuses me. Are you referring to the case below?
>
> There are many extent entries in the page range 0..N+1, for example,
>
> 0 N N+1
> | |/
> [www][wwwwww][wwwwwwww]...[wwwww][wwww]...
> | |
> N-a N-b
>
> Punch hole removes the extent entries from N down to 0
> (ext4_ext_remove_space() removes blocks from end to start), and could
> drop i_data_sem just after manipulating (removing) the extent entry
> [N-a, N-b]. At the same time, a concurrent writeback starts writing back
> page N+1; since writeback only holds the page lock, and holds neither
> i_rwsem nor the invalidate_lock, it may load the extents 0..N-a back
> into the extent status tree while finding the extent that contains N+1?
Yes, because when we load extents from extent tree, we insert all extents
from the leaf of the extent tree into extent status tree. That's what
ext4_cache_extents() call does.
> Finally, it may leave some stale extent status entries after punch hole
> is done?
Yes, there may be stale extents in the extent status tree when
ext4_ext_remove_space() returns. But punch hole in particular then does:
ext4_es_insert_extent(inode, first_block, hole_len, ~0,
EXTENT_STATUS_HOLE, 0);
which overwrites these stale extents with appropriate information.
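That punch-hole sequence can be modeled with a block-granular array standing in for the extent status tree (names and granularity are illustrative, not the real API): stale entries cached back while i_data_sem was dropped are harmless because the final hole insertion overwrites the whole punched range.

```c
/* Toy model: punch hole 0..5 while writeback of block 6 reloads
 * stale extents, then the final hole insert overwrites them. */
#define NBLK 8

enum es_status { ES_NONE, ES_WRITTEN, ES_HOLE };

static enum es_status es[NBLK];

/* Like ext4_es_remove_extent(): drop cached entries in a range. */
void es_remove(int start, int len)
{
        for (int i = start; i < start + len && i < NBLK; i++)
                es[i] = ES_NONE;
}

/* Like ext4_es_insert_extent(): insert, overwriting existing entries. */
void es_insert(int start, int len, enum es_status st)
{
        for (int i = start; i < start + len && i < NBLK; i++)
                es[i] = st;
}
```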
> If my understanding is correct, isn't that a problem that exists now?
> I mean without this patch series.
Yes, the situation isn't really related to your patches. But with your
patches we are starting to rely even more on extent status tree vs extent
tree consistency. So I wanted to spell out this situation to verify that no
new problem is introduced and so that we create rules that handle this
situation well.
> > So the only protection
> > against using freed blocks is that nobody should be mapping anything in
> > the range 0..N because we hold those locks & have evicted page cache.
> >
> > So I think we need to also document, that anybody mapping blocks needs to
> > hold i_rwsem or invalidate_lock or a page lock, ideally asserting that in
> > ext4_map_blocks() to catch cases we missed. Asserting for page lock will
> > not be really doable but luckily only page writeback needs that so that can
> > get some exemption from the assert.
>
> In the case above, it seems that merely holding a page lock is
> insufficient?
Well, holding page lock(s) for the range you are operating on is enough to
make sure there cannot be parallel operations on that range like truncate,
punch hole or similar, because they always remove the page cache before
starting their work, and because they hold the invalidate_lock, new pages cannot
be created while they are working.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH 12/27] ext4: introduce seq counter for the extent status entry
2024-12-11 16:00 ` Jan Kara
@ 2024-12-12 2:32 ` Zhang Yi
0 siblings, 0 replies; 59+ messages in thread
From: Zhang Yi @ 2024-12-12 2:32 UTC (permalink / raw)
To: Jan Kara
Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
ritesh.list, hch, djwong, david, zokeefe, yi.zhang, chengzhihao1,
yukuai3, yangerkun
On 2024/12/12 0:00, Jan Kara wrote:
> On Wed 11-12-24 15:59:51, Zhang Yi wrote:
>> On 2024/12/10 20:57, Jan Kara wrote:
>>> On Mon 09-12-24 16:32:41, Zhang Yi wrote:
>>> b) evict all page cache in the affected range -> should stop writeback -
>>> *but* currently there's one case which could be problematic. Assume we
>>> do punch hole 0..N and the page at N+1 is dirty. Punch hole does all of
>>> the above and starts removing blocks, needs to restart transaction so it
>>> drops i_data_sem. Writeback starts for page N+1, needs to load extent
>>> block into memory, ext4_cache_extents() now loads back some extents
>>> covering range 0..N into extent status tree.
>>
>> This completely confuses me. Are you referring to the case below?
>>
>> There are many extent entries in the page range 0..N+1, for example,
>>
>> 0 N N+1
>> | |/
>> [www][wwwwww][wwwwwwww]...[wwwww][wwww]...
>> | |
>> N-a N-b
>>
>> Punch hole removes the extent entries from N down to 0
>> (ext4_ext_remove_space() removes blocks from end to start), and could
>> drop i_data_sem just after manipulating (removing) the extent entry
>> [N-a, N-b]. At the same time, a concurrent writeback starts writing back
>> page N+1; since writeback only holds the page lock, and holds neither
>> i_rwsem nor the invalidate_lock, it may load the extents 0..N-a back
>> into the extent status tree while finding the extent that contains N+1?
>
> Yes, because when we load extents from extent tree, we insert all extents
> from the leaf of the extent tree into extent status tree. That's what
> ext4_cache_extents() call does.
>
>> Finally, it may leave some stale extent status entries after punch hole
>> is done?
>
> Yes, there may be stale extents in the extent status tree when
> ext4_ext_remove_space() returns. But punch hole in particular then does:
>
> ext4_es_insert_extent(inode, first_block, hole_len, ~0,
> EXTENT_STATUS_HOLE, 0);
>
> which overwrites these stale extents with appropriate information.
>
Yes, that's correct! I missed this insert yesterday. It looks fine now,
as it holds the i_rwsem and invalidate_lock, and has evicted the page
cache in this case. Thanks a lot for your detailed explanation. I will
add this documentation in my next iteration.
Thanks!
Yi.
>> If my understanding is correct, isn't that a problem that exists now?
>> I mean without this patch series.
>
> Yes, the situation isn't really related to your patches. But with your
> patches we are starting to rely even more on extent status tree vs extent
> tree consistency. So I wanted to spell out this situation to verify that no
> new problem is introduced and so that we create rules that handle this
> situation well.
>
>>> So the only protection
>>> against using freed blocks is that nobody should be mapping anything in
>>> the range 0..N because we hold those locks & have evicted page cache.
>>>
>>> So I think we need to also document, that anybody mapping blocks needs to
>>> hold i_rwsem or invalidate_lock or a page lock, ideally asserting that in
>>> ext4_map_blocks() to catch cases we missed. Asserting for page lock will
>>> not be really doable but luckily only page writeback needs that so that can
>>> get some exemption from the assert.
>>
>> In the case above, it seems that merely holding a page lock is
>> insufficient?
>
> Well, holding page lock(s) for the range you are operating on is enough to
> make sure there cannot be parallel operations on that range like truncate,
> punch hole or similar, because they always remove the page cache before
> starting their work, and because they hold the invalidate_lock, new pages cannot
> be created while they are working.
>
> Honza
Thread overview: 59+ messages
2024-10-22 11:10 [PATCH 00/27] ext4: use iomap for regular file's buffered I/O path and enable large folio Zhang Yi
2024-10-22 6:59 ` Sedat Dilek
2024-10-22 9:22 ` Zhang Yi
2024-10-23 12:13 ` Sedat Dilek
2024-10-24 7:44 ` Zhang Yi
2024-10-22 11:10 ` [PATCH 01/27] ext4: remove writable userspace mappings before truncating page cache Zhang Yi
2024-12-04 11:13 ` Jan Kara
2024-12-06 7:59 ` Zhang Yi
2024-12-06 15:49 ` Jan Kara
2024-10-22 11:10 ` [PATCH 02/27] ext4: don't explicit update times in ext4_fallocate() Zhang Yi
2024-10-22 11:10 ` [PATCH 03/27] ext4: don't write back data before punch hole in nojournal mode Zhang Yi
2024-11-18 23:15 ` Darrick J. Wong
2024-11-20 2:56 ` Zhang Yi
2024-12-04 11:26 ` Jan Kara
2024-12-04 11:27 ` Jan Kara
2024-10-22 11:10 ` [PATCH 04/27] ext4: refactor ext4_punch_hole() Zhang Yi
2024-11-18 23:27 ` Darrick J. Wong
2024-11-20 3:18 ` Zhang Yi
2024-12-04 11:36 ` Jan Kara
2024-10-22 11:10 ` [PATCH 05/27] ext4: refactor ext4_zero_range() Zhang Yi
2024-12-04 11:52 ` Jan Kara
2024-12-06 8:09 ` Zhang Yi
2024-10-22 11:10 ` [PATCH 06/27] ext4: refactor ext4_collapse_range() Zhang Yi
2024-12-04 11:58 ` Jan Kara
2024-10-22 11:10 ` [PATCH 07/27] ext4: refactor ext4_insert_range() Zhang Yi
2024-12-04 12:02 ` Jan Kara
2024-10-22 11:10 ` [PATCH 08/27] ext4: factor out ext4_do_fallocate() Zhang Yi
2024-10-22 11:10 ` [PATCH 09/27] ext4: move out inode_lock into ext4_fallocate() Zhang Yi
2024-12-04 12:05 ` Jan Kara
2024-12-06 8:13 ` Zhang Yi
2024-12-06 15:51 ` Jan Kara
2024-10-22 11:10 ` [PATCH 10/27] ext4: move out common parts " Zhang Yi
2024-12-04 12:10 ` Jan Kara
2024-10-22 11:10 ` [PATCH 11/27] ext4: use reserved metadata blocks when splitting extent on endio Zhang Yi
2024-12-04 12:16 ` Jan Kara
2024-10-22 11:10 ` [PATCH 12/27] ext4: introduce seq counter for the extent status entry Zhang Yi
2024-12-04 12:42 ` Jan Kara
2024-12-06 8:55 ` Zhang Yi
2024-12-06 16:21 ` Jan Kara
2024-12-09 8:32 ` Zhang Yi
2024-12-10 12:57 ` Jan Kara
2024-12-11 7:59 ` Zhang Yi
2024-12-11 16:00 ` Jan Kara
2024-12-12 2:32 ` Zhang Yi
2024-10-22 11:10 ` [PATCH 13/27] ext4: add a new iomap aops for regular file's buffered IO path Zhang Yi
2024-10-22 11:10 ` [PATCH 14/27] ext4: implement buffered read iomap path Zhang Yi
2024-10-22 11:10 ` [PATCH 15/27] ext4: implement buffered write " Zhang Yi
2024-10-22 11:10 ` [PATCH 16/27] ext4: don't order data for inode with EXT4_STATE_BUFFERED_IOMAP Zhang Yi
2024-10-22 11:10 ` [PATCH 17/27] ext4: implement writeback iomap path Zhang Yi
2024-10-22 11:10 ` [PATCH 18/27] ext4: implement mmap " Zhang Yi
2024-10-22 11:10 ` [PATCH 19/27] ext4: do not always order data when partial zeroing out a block Zhang Yi
2024-10-22 11:10 ` [PATCH 20/27] ext4: do not start handle if unnecessary while " Zhang Yi
2024-10-22 11:10 ` [PATCH 21/27] ext4: implement zero_range iomap path Zhang Yi
2024-10-22 11:10 ` [PATCH 22/27] ext4: disable online defrag when inode using iomap buffered I/O path Zhang Yi
2024-10-22 11:10 ` [PATCH 23/27] ext4: disable inode journal mode when " Zhang Yi
2024-10-22 11:10 ` [PATCH 24/27] ext4: partially enable iomap for the buffered I/O path of regular files Zhang Yi
2024-10-22 11:10 ` [PATCH 25/27] ext4: enable large folio for regular file with iomap buffered I/O path Zhang Yi
2024-10-22 11:10 ` [PATCH 26/27] ext4: change mount options code style Zhang Yi
2024-10-22 11:10 ` [PATCH 27/27] ext4: introduce a mount option for iomap buffered I/O path Zhang Yi