linux-ext4.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: libaokun@huaweicloud.com
To: linux-ext4@vger.kernel.org
Cc: tytso@mit.edu, adilger.kernel@dilger.ca, jack@suse.cz,
	linux-kernel@vger.kernel.org, kernel@pankajraghav.com,
	mcgrof@kernel.org, ebiggers@kernel.org, willy@infradead.org,
	yi.zhang@huawei.com, yangerkun@huawei.com,
	chengzhihao1@huawei.com, libaokun1@huawei.com,
	libaokun@huaweicloud.com
Subject: [PATCH v3 00/24] ext4: enable block size larger than page size
Date: Tue, 11 Nov 2025 22:26:10 +0800	[thread overview]
Message-ID: <20251111142634.3301616-1-libaokun@huaweicloud.com> (raw)

From: Baokun Li <libaokun1@huawei.com>

Changes since v2:
 * Collect RVB from Jan Kara, Zhang Yi and Pankaj Raghav.
    (Thank you for your review!)
 * Patch 21: Before switching the inode journalling mode, drop all
    page cache of that inode and invoke filemap_write_and_wait()
    unconditionally. (Suggested by Jan Kara)
 * Patch 22: Extend fs-verity to support large folios in addition to
    large block size. (Suggested by Jan Kara)
 * Patch 24: Add a blocksize_gt_pagesize sysfs interface to help users
    (e.g., mke2fs) determine whether the current kernel supports bs > ps.
    In addition, remove the experimental tag. (Suggested by Theodore Ts'o)

[v2]: https://lore.kernel.org/r/20251107144249.435029-1-libaokun@huaweicloud.com

Changes since v1:
 * Collect RVB from Jan Kara and Zhang Yi. (Thanks for your review!)
 * Patch 4: Just use blocksize in the rounding.(Suggested by Jan Kara)
 * Patch 7: use kvmalloc() instead of allocating contiguous physical
    pages.(Suggested by Jan Kara)
 * Patch 12: Fix some typos.(Suggested by Jan Kara)
 * Use clearer naming: EXT4_LBLK_TO_PG() and EXT4_PG_TO_LBLK().
    (Suggested by Jan Kara)
 * Patch 21: removed. After rebasing on Ted’s latest dev branch, this
    patch is no longer needed.
 * Patch 22-23: removed. The issue was resolved by removing the WARN_ON
    in the MM code, so we now rely on patch [1].(Suggested by Matthew)
 * Add new Patch 21 to support data=journal under LBS. (Suggested by
    Jan Kara)
 * Add new Patch 22 to support fs verity under LBS.
 * New Patch 23: add the s_max_folio_order field instead of introducing
    the EXT4_MF_LARGE_FOLIO flag.
 * New Patch 24: rebase adaptation.

[v1]: https://lore.kernel.org/r/20251025032221.2905818-1-libaokun@huaweicloud.com

======

This series enables block size > page size (Large Block Size) in EXT4.

Since large folios are already supported for regular files, the required
changes are not substantial, but they are scattered across the code.
The changes primarily focus on cleaning up potential division-by-zero
errors, resolving negative left/right shifts, and correctly handling
mutually exclusive mount options.

One somewhat troublesome issue is that allocating page units greater than
order-1 with __GFP_NOFAIL in __alloc_pages_slowpath() can trigger an
unexpected WARN_ON. With LBS support, EXT4 and jbd2 may use __GFP_NOFAIL
to allocate large folios when reading metadata. The issue was resolved by
removing the WARN_ON in the MM code, so we now rely on patch [1].

[1]: https://lore.kernel.org/r/20251105085652.4081123-1-libaokun@huaweicloud.com

Patch series based on Ted’s latest dev branch.

`kvm-xfstests -c ext4/all -g auto` has been executed with no new failures.
`kvm-xfstests -c ext4/64k -g auto` has been executed and no Oops was
observed, but allocation failures for large folios may trigger warn_alloc()
warnings, tests with 32k or smaller block sizes have not exhibited any
page allocation failures.

Here are some performance test data for your reference:

Testing EXT4 filesystems with different block sizes, measuring
single-threaded dd bandwidth for BIO/DIO with varying bs values.

Before(PAGE_SIZE=4096):

      BIO     | bs=4k    | bs=8k    | bs=16k   | bs=32k   | bs=64k
--------------|----------|----------|----------|----------|------------
 4k           | 1.5 GB/s | 2.1 GB/s | 2.8 GB/s | 3.4 GB/s | 3.8 GB/s
 8k (bigalloc)| 1.4 GB/s | 2.0 GB/s | 2.6 GB/s | 3.1 GB/s | 3.4 GB/s
 16k(bigalloc)| 1.5 GB/s | 2.0 GB/s | 2.6 GB/s | 3.2 GB/s | 3.6 GB/s
 32k(bigalloc)| 1.5 GB/s | 2.1 GB/s | 2.7 GB/s | 3.3 GB/s | 3.7 GB/s
 64k(bigalloc)| 1.5 GB/s | 2.1 GB/s | 2.8 GB/s | 3.4 GB/s | 3.8 GB/s
              
      DIO     | bs=4k    | bs=8k    | bs=16k   | bs=32k   | bs=64k
--------------|----------|----------|----------|----------|------------
 4k           | 194 MB/s | 366 MB/s | 626 MB/s | 1.0 GB/s | 1.4 GB/s
 8k (bigalloc)| 188 MB/s | 359 MB/s | 612 MB/s | 996 MB/s | 1.4 GB/s
 16k(bigalloc)| 208 MB/s | 378 MB/s | 642 MB/s | 1.0 GB/s | 1.4 GB/s
 32k(bigalloc)| 184 MB/s | 368 MB/s | 637 MB/s | 995 MB/s | 1.4 GB/s
 64k(bigalloc)| 208 MB/s | 389 MB/s | 634 MB/s | 1.0 GB/s | 1.4 GB/s

Patched(PAGE_SIZE=4096):

   BIO   | bs=4k    | bs=8k    | bs=16k   | bs=32k   | bs=64k
---------|----------|----------|----------|----------|------------
 4k      | 1.5 GB/s | 2.1 GB/s | 2.8 GB/s | 3.4 GB/s | 3.8 GB/s
 8k (LBS)| 1.7 GB/s | 2.3 GB/s | 3.2 GB/s | 4.2 GB/s | 4.7 GB/s
 16k(LBS)| 2.0 GB/s | 2.7 GB/s | 3.6 GB/s | 4.7 GB/s | 5.4 GB/s
 32k(LBS)| 2.2 GB/s | 3.1 GB/s | 3.9 GB/s | 4.9 GB/s | 5.7 GB/s
 64k(LBS)| 2.4 GB/s | 3.3 GB/s | 4.2 GB/s | 5.1 GB/s | 6.0 GB/s

   DIO   | bs=4k    | bs=8k    | bs=16k   | bs=32k   | bs=64k
---------|----------|----------|----------|----------|------------
 4k      | 204 MB/s | 355 MB/s | 627 MB/s | 1.0 GB/s | 1.4 GB/s
 8k (LBS)| 210 MB/s | 356 MB/s | 602 MB/s | 997 MB/s | 1.4 GB/s
 16k(LBS)| 191 MB/s | 361 MB/s | 589 MB/s | 981 MB/s | 1.4 GB/s
 32k(LBS)| 181 MB/s | 330 MB/s | 581 MB/s | 951 MB/s | 1.3 GB/s
 64k(LBS)| 148 MB/s | 272 MB/s | 499 MB/s | 840 MB/s | 1.3 GB/s


The results show:

 * The code changes have almost no impact on the original 4k write
   performance of ext4.
 * Compared with bigalloc, LBS improves BIO write performance by about 50%
   on average.
 * Compared with bigalloc, LBS shows degradation in DIO write performance,
   which increases as the filesystem block size grows and the test bs
   decreases, with a maximum degradation of about 30%.

The DIO regression is primarily due to the increased time spent in
crc32c_arch() within ext4_block_bitmap_csum_set() during block allocation,
as the block size grows larger. This indicates that larger filesystem block
sizes are not always better; please choose an appropriate block size based
on your I/O workload characteristics.

We are also planning further optimizations for block allocation under LBS
in the future.

Comments and questions are, as always, welcome.

Thanks,
Baokun


Baokun Li (21):
  ext4: remove page offset calculation in ext4_block_truncate_page()
  ext4: remove PAGE_SIZE checks for rec_len conversion
  ext4: make ext4_punch_hole() support large block size
  ext4: enable DIOREAD_NOLOCK by default for BS > PS as well
  ext4: introduce s_min_folio_order for future BS > PS support
  ext4: support large block size in ext4_calculate_overhead()
  ext4: support large block size in ext4_readdir()
  ext4: add EXT4_LBLK_TO_B macro for logical block to bytes conversion
  ext4: add EXT4_LBLK_TO_PG and EXT4_PG_TO_LBLK for block/page
    conversion
  ext4: support large block size in ext4_mb_load_buddy_gfp()
  ext4: support large block size in ext4_mb_get_buddy_page_lock()
  ext4: support large block size in ext4_mb_init_cache()
  ext4: prepare buddy cache inode for BS > PS with large folios
  ext4: support large block size in ext4_mpage_readpages()
  ext4: support large block size in ext4_block_write_begin()
  ext4: support large block size in mpage_map_and_submit_buffers()
  ext4: support large block size in mpage_prepare_extent_to_map()
  ext4: make data=journal support large block size
  ext4: support verifying data from large folios with fs-verity
  ext4: add checks for large folio incompatibilities when BS > PS
  ext4: enable block size larger than page size

Zhihao Cheng (3):
  ext4: remove page offset calculation in ext4_block_zero_page_range()
  ext4: rename 'page' references to 'folio' in multi-block allocator
  ext4: support large block size in __ext4_block_zero_page_range()

 fs/ext4/dir.c       |   8 +--
 fs/ext4/ext4.h      |  26 ++++-----
 fs/ext4/ext4_jbd2.c |   3 +-
 fs/ext4/extents.c   |   2 +-
 fs/ext4/inode.c     | 111 ++++++++++++++---------------------
 fs/ext4/mballoc.c   | 137 +++++++++++++++++++++++---------------------
 fs/ext4/namei.c     |   8 +--
 fs/ext4/readpage.c  |   7 +--
 fs/ext4/super.c     |  61 ++++++++++++++++----
 fs/ext4/sysfs.c     |   6 ++
 fs/ext4/verity.c    |   2 +-
 11 files changed, 196 insertions(+), 175 deletions(-)

-- 
2.46.1


             reply	other threads:[~2025-11-11 14:35 UTC|newest]

Thread overview: 37+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-11-11 14:26 libaokun [this message]
2025-11-11 14:26 ` [PATCH v3 01/24] ext4: remove page offset calculation in ext4_block_zero_page_range() libaokun
2025-11-11 14:26 ` [PATCH v3 02/24] ext4: remove page offset calculation in ext4_block_truncate_page() libaokun
2025-11-11 14:26 ` [PATCH v3 03/24] ext4: remove PAGE_SIZE checks for rec_len conversion libaokun
2025-11-11 14:26 ` [PATCH v3 04/24] ext4: make ext4_punch_hole() support large block size libaokun
2025-11-11 14:26 ` [PATCH v3 05/24] ext4: enable DIOREAD_NOLOCK by default for BS > PS as well libaokun
2025-11-11 14:26 ` [PATCH v3 06/24] ext4: introduce s_min_folio_order for future BS > PS support libaokun
2025-11-11 14:26 ` [PATCH v3 07/24] ext4: support large block size in ext4_calculate_overhead() libaokun
2025-11-11 14:26 ` [PATCH v3 08/24] ext4: support large block size in ext4_readdir() libaokun
2025-11-11 14:26 ` [PATCH v3 09/24] ext4: add EXT4_LBLK_TO_B macro for logical block to bytes conversion libaokun
2025-11-11 14:26 ` [PATCH v3 10/24] ext4: add EXT4_LBLK_TO_PG and EXT4_PG_TO_LBLK for block/page conversion libaokun
2025-11-11 14:26 ` [PATCH v3 11/24] ext4: support large block size in ext4_mb_load_buddy_gfp() libaokun
2025-11-11 14:26 ` [PATCH v3 12/24] ext4: support large block size in ext4_mb_get_buddy_page_lock() libaokun
2025-11-11 14:26 ` [PATCH v3 13/24] ext4: support large block size in ext4_mb_init_cache() libaokun
2025-11-11 14:26 ` [PATCH v3 14/24] ext4: prepare buddy cache inode for BS > PS with large folios libaokun
2025-11-11 14:26 ` [PATCH v3 15/24] ext4: rename 'page' references to 'folio' in multi-block allocator libaokun
2025-11-11 14:26 ` [PATCH v3 16/24] ext4: support large block size in ext4_mpage_readpages() libaokun
2025-11-11 14:26 ` [PATCH v3 17/24] ext4: support large block size in ext4_block_write_begin() libaokun
2025-11-11 14:26 ` [PATCH v3 18/24] ext4: support large block size in mpage_map_and_submit_buffers() libaokun
2025-11-11 14:26 ` [PATCH v3 19/24] ext4: support large block size in mpage_prepare_extent_to_map() libaokun
2025-11-11 14:26 ` [PATCH v3 20/24] ext4: support large block size in __ext4_block_zero_page_range() libaokun
2025-11-11 14:26 ` [PATCH v3 21/24] ext4: make data=journal support large block size libaokun
2025-11-12  6:52   ` Zhang Yi
2025-11-12 15:56   ` Jan Kara
2025-11-19 12:41   ` Dan Carpenter
2025-11-20  1:21     ` Baokun Li
2025-11-20 15:41       ` Theodore Tso
2025-11-21  1:59         ` Baokun Li
2025-11-11 14:26 ` [PATCH v3 22/24] ext4: support verifying data from large folios with fs-verity libaokun
2025-11-12  6:54   ` Zhang Yi
2025-11-12 15:57   ` Jan Kara
2025-11-11 14:26 ` [PATCH v3 23/24] ext4: add checks for large folio incompatibilities when BS > PS libaokun
2025-11-12  6:56   ` Zhang Yi
2025-11-11 14:26 ` [PATCH v3 24/24] ext4: enable block size larger than page size libaokun
2025-11-11 18:01   ` Pankaj Raghav
2025-11-11 21:11     ` Theodore Ts'o
2025-11-12  1:20       ` Baokun Li

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20251111142634.3301616-1-libaokun@huaweicloud.com \
    --to=libaokun@huaweicloud.com \
    --cc=adilger.kernel@dilger.ca \
    --cc=chengzhihao1@huawei.com \
    --cc=ebiggers@kernel.org \
    --cc=jack@suse.cz \
    --cc=kernel@pankajraghav.com \
    --cc=libaokun1@huawei.com \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mcgrof@kernel.org \
    --cc=tytso@mit.edu \
    --cc=willy@infradead.org \
    --cc=yangerkun@huawei.com \
    --cc=yi.zhang@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).