* [PATCH 00/25] ext4: enable block size larger than page size
@ 2025-10-25 3:21 libaokun
2025-10-25 3:21 ` [PATCH 01/25] ext4: remove page offset calculation in ext4_block_zero_page_range() libaokun
` (24 more replies)
0 siblings, 25 replies; 68+ messages in thread
From: libaokun @ 2025-10-25 3:21 UTC (permalink / raw)
To: linux-ext4
Cc: tytso, adilger.kernel, jack, linux-kernel, kernel, mcgrof,
linux-fsdevel, linux-mm, yi.zhang, yangerkun, chengzhihao1,
libaokun1, libaokun
From: Baokun Li <libaokun1@huawei.com>
This series enables block size > page size (Large Block Size) in EXT4.
Since large folios are already supported for regular files, the required
changes are not substantial, but they are scattered across the code.
The changes primarily focus on cleaning up potential division-by-zero
errors, resolving negative left/right shifts, and correctly handling
mutually exclusive mount options.
One somewhat troublesome issue is that allocating pages of order greater
than 1 with __GFP_NOFAIL can trigger an unexpected WARN_ON in
__alloc_pages_slowpath(). With LBS support, EXT4 and jbd2 may use
__GFP_NOFAIL to allocate large folios when reading metadata.
To avoid this warning, when jbd2_alloc() and grow_dev_folio() attempt to
allocate with order greater than 1, the __GFP_NOFAIL flag is not passed
down; instead, the functions retry internally to satisfy the allocation.
The patch series is based on 6.18-rc2. `kvm-xfstests -c ext4/all -g auto`
has been executed with no new failures. `kvm-xfstests -c ext4/64k -g auto`
has also been executed, and no Oops was observed.
Here are some performance test data for your reference:
Testing EXT4 filesystems with different block sizes, measuring
single-threaded dd bandwidth for buffered I/O (BIO) and direct I/O (DIO)
with varying bs values.
Before(PAGE_SIZE=4096):
BIO | bs=4k | bs=8k | bs=16k | bs=32k | bs=64k
--------------|----------|----------|----------|----------|------------
4k | 1.5 GB/s | 2.1 GB/s | 2.8 GB/s | 3.4 GB/s | 3.8 GB/s
8k (bigalloc)| 1.4 GB/s | 2.0 GB/s | 2.6 GB/s | 3.1 GB/s | 3.4 GB/s
16k(bigalloc)| 1.5 GB/s | 2.0 GB/s | 2.6 GB/s | 3.2 GB/s | 3.6 GB/s
32k(bigalloc)| 1.5 GB/s | 2.1 GB/s | 2.7 GB/s | 3.3 GB/s | 3.7 GB/s
64k(bigalloc)| 1.5 GB/s | 2.1 GB/s | 2.8 GB/s | 3.4 GB/s | 3.8 GB/s
DIO | bs=4k | bs=8k | bs=16k | bs=32k | bs=64k
--------------|----------|----------|----------|----------|------------
4k | 194 MB/s | 366 MB/s | 626 MB/s | 1.0 GB/s | 1.4 GB/s
8k (bigalloc)| 188 MB/s | 359 MB/s | 612 MB/s | 996 MB/s | 1.4 GB/s
16k(bigalloc)| 208 MB/s | 378 MB/s | 642 MB/s | 1.0 GB/s | 1.4 GB/s
32k(bigalloc)| 184 MB/s | 368 MB/s | 637 MB/s | 995 MB/s | 1.4 GB/s
64k(bigalloc)| 208 MB/s | 389 MB/s | 634 MB/s | 1.0 GB/s | 1.4 GB/s
Patched(PAGE_SIZE=4096):
BIO | bs=4k | bs=8k | bs=16k | bs=32k | bs=64k
---------|----------|----------|----------|----------|------------
4k | 1.5 GB/s | 2.1 GB/s | 2.8 GB/s | 3.4 GB/s | 3.8 GB/s
8k (LBS)| 1.7 GB/s | 2.3 GB/s | 3.2 GB/s | 4.2 GB/s | 4.7 GB/s
16k(LBS)| 2.0 GB/s | 2.7 GB/s | 3.6 GB/s | 4.7 GB/s | 5.4 GB/s
32k(LBS)| 2.2 GB/s | 3.1 GB/s | 3.9 GB/s | 4.9 GB/s | 5.7 GB/s
64k(LBS)| 2.4 GB/s | 3.3 GB/s | 4.2 GB/s | 5.1 GB/s | 6.0 GB/s
DIO | bs=4k | bs=8k | bs=16k | bs=32k | bs=64k
---------|----------|----------|----------|----------|------------
4k | 204 MB/s | 355 MB/s | 627 MB/s | 1.0 GB/s | 1.4 GB/s
8k (LBS)| 210 MB/s | 356 MB/s | 602 MB/s | 997 MB/s | 1.4 GB/s
16k(LBS)| 191 MB/s | 361 MB/s | 589 MB/s | 981 MB/s | 1.4 GB/s
32k(LBS)| 181 MB/s | 330 MB/s | 581 MB/s | 951 MB/s | 1.3 GB/s
64k(LBS)| 148 MB/s | 272 MB/s | 499 MB/s | 840 MB/s | 1.3 GB/s
The results show:
* The code changes have almost no impact on the original 4k write
performance of ext4.
* Compared with bigalloc, LBS improves BIO write performance by about 50%
on average.
* Compared with bigalloc, LBS shows degradation in DIO write performance,
which increases as the filesystem block size grows and the test bs
decreases, with a maximum degradation of about 30%.
The DIO regression is primarily due to the increased time spent in
crc32c_arch() within ext4_block_bitmap_csum_set() during block allocation,
as the block size grows larger. This indicates that larger filesystem block
sizes are not always better; please choose an appropriate block size based
on your I/O workload characteristics.
We are also planning further optimizations for block allocation under LBS
in the future.
Comments and questions are, as always, welcome.
Thanks,
Baokun
Baokun Li (21):
ext4: remove page offset calculation in ext4_block_truncate_page()
ext4: remove PAGE_SIZE checks for rec_len conversion
ext4: make ext4_punch_hole() support large block size
ext4: enable DIOREAD_NOLOCK by default for BS > PS as well
ext4: introduce s_min_folio_order for future BS > PS support
ext4: support large block size in ext4_calculate_overhead()
ext4: support large block size in ext4_readdir()
ext4: add EXT4_LBLK_TO_B macro for logical block to bytes conversion
ext4: add EXT4_LBLK_TO_P and EXT4_P_TO_LBLK for block/page conversion
ext4: support large block size in ext4_mb_load_buddy_gfp()
ext4: support large block size in ext4_mb_get_buddy_page_lock()
ext4: support large block size in ext4_mb_init_cache()
ext4: prepare buddy cache inode for BS > PS with large folios
ext4: support large block size in ext4_mpage_readpages()
ext4: support large block size in ext4_block_write_begin()
ext4: support large block size in mpage_map_and_submit_buffers()
ext4: support large block size in mpage_prepare_extent_to_map()
fs/buffer: prevent WARN_ON in __alloc_pages_slowpath() when BS > PS
jbd2: prevent WARN_ON in __alloc_pages_slowpath() when BS > PS
ext4: add checks for large folio incompatibilities when BS > PS
ext4: enable block size larger than page size
Zhihao Cheng (4):
ext4: remove page offset calculation in ext4_block_zero_page_range()
ext4: rename 'page' references to 'folio' in multi-block allocator
ext4: support large block size in __ext4_block_zero_page_range()
ext4: make online defragmentation support large block size
fs/buffer.c | 33 +++++++++-
fs/ext4/dir.c | 8 +--
fs/ext4/ext4.h | 27 ++++-----
fs/ext4/extents.c | 2 +-
fs/ext4/inode.c | 69 ++++++++++-----------
fs/ext4/mballoc.c | 137 ++++++++++++++++++++++--------------------
fs/ext4/move_extent.c | 20 +++---
fs/ext4/namei.c | 8 +--
fs/ext4/readpage.c | 7 +--
fs/ext4/super.c | 52 ++++++++++++----
fs/ext4/verity.c | 2 +-
fs/jbd2/journal.c | 28 ++++++++-
12 files changed, 234 insertions(+), 159 deletions(-)
--
2.46.1
^ permalink raw reply [flat|nested] 68+ messages in thread
* [PATCH 01/25] ext4: remove page offset calculation in ext4_block_zero_page_range()
From: libaokun @ 2025-10-25 3:21 UTC (permalink / raw)
To: linux-ext4
Cc: tytso, adilger.kernel, jack, linux-kernel, kernel, mcgrof,
linux-fsdevel, linux-mm, yi.zhang, yangerkun, chengzhihao1,
libaokun1, libaokun
From: Zhihao Cheng <chengzhihao1@huawei.com>
For bs <= ps, calculating the offset within the block is sufficient. For
bs > ps, the initial page offset calculation can produce incorrect
results, so remove this redundant calculation.
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/ext4/inode.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index e99306a8f47c..0742039c53a7 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4157,9 +4157,8 @@ static int ext4_block_zero_page_range(handle_t *handle,
struct address_space *mapping, loff_t from, loff_t length)
{
struct inode *inode = mapping->host;
- unsigned offset = from & (PAGE_SIZE-1);
unsigned blocksize = inode->i_sb->s_blocksize;
- unsigned max = blocksize - (offset & (blocksize - 1));
+ unsigned int max = blocksize - (from & (blocksize - 1));
/*
* correct length if it does not fall between
--
2.46.1
* [PATCH 02/25] ext4: remove page offset calculation in ext4_block_truncate_page()
From: libaokun @ 2025-10-25 3:21 UTC (permalink / raw)
To: linux-ext4
Cc: tytso, adilger.kernel, jack, linux-kernel, kernel, mcgrof,
linux-fsdevel, linux-mm, yi.zhang, yangerkun, chengzhihao1,
libaokun1, libaokun
From: Baokun Li <libaokun1@huawei.com>
For bs <= ps, calculating the offset within the block is sufficient. For
bs > ps, the initial page offset calculation can produce incorrect
results, so remove this redundant calculation.
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/ext4/inode.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 0742039c53a7..4c04af7e51c9 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4183,7 +4183,6 @@ static int ext4_block_zero_page_range(handle_t *handle,
static int ext4_block_truncate_page(handle_t *handle,
struct address_space *mapping, loff_t from)
{
- unsigned offset = from & (PAGE_SIZE-1);
unsigned length;
unsigned blocksize;
struct inode *inode = mapping->host;
@@ -4192,8 +4191,8 @@ static int ext4_block_truncate_page(handle_t *handle,
if (IS_ENCRYPTED(inode) && !fscrypt_has_encryption_key(inode))
return 0;
- blocksize = inode->i_sb->s_blocksize;
- length = blocksize - (offset & (blocksize - 1));
+ blocksize = i_blocksize(inode);
+ length = blocksize - (from & (blocksize - 1));
return ext4_block_zero_page_range(handle, mapping, from, length);
}
--
2.46.1
* [PATCH 03/25] ext4: remove PAGE_SIZE checks for rec_len conversion
From: libaokun @ 2025-10-25 3:21 UTC (permalink / raw)
To: linux-ext4
Cc: tytso, adilger.kernel, jack, linux-kernel, kernel, mcgrof,
linux-fsdevel, linux-mm, yi.zhang, yangerkun, chengzhihao1,
libaokun1, libaokun
From: Baokun Li <libaokun1@huawei.com>
Previously, ext4_rec_len_(to|from)_disk only performed the complex rec_len
conversions when PAGE_SIZE >= 65536, to reduce complexity.
However, we will soon support filesystem block sizes greater than the page
size, which makes these conditional checks unnecessary. Thus, these
checks are now removed.
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/ext4/ext4.h | 12 ------------
1 file changed, 12 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 24c414605b08..93c2bf4d125a 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2464,28 +2464,19 @@ static inline unsigned int ext4_dir_rec_len(__u8 name_len,
return (rec_len & ~EXT4_DIR_ROUND);
}
-/*
- * If we ever get support for fs block sizes > page_size, we'll need
- * to remove the #if statements in the next two functions...
- */
static inline unsigned int
ext4_rec_len_from_disk(__le16 dlen, unsigned blocksize)
{
unsigned len = le16_to_cpu(dlen);
-#if (PAGE_SIZE >= 65536)
if (len == EXT4_MAX_REC_LEN || len == 0)
return blocksize;
return (len & 65532) | ((len & 3) << 16);
-#else
- return len;
-#endif
}
static inline __le16 ext4_rec_len_to_disk(unsigned len, unsigned blocksize)
{
BUG_ON((len > blocksize) || (blocksize > (1 << 18)) || (len & 3));
-#if (PAGE_SIZE >= 65536)
if (len < 65536)
return cpu_to_le16(len);
if (len == blocksize) {
@@ -2495,9 +2486,6 @@ static inline __le16 ext4_rec_len_to_disk(unsigned len, unsigned blocksize)
return cpu_to_le16(0);
}
return cpu_to_le16((len & 65532) | ((len >> 16) & 3));
-#else
- return cpu_to_le16(len);
-#endif
}
/*
--
2.46.1
* [PATCH 04/25] ext4: make ext4_punch_hole() support large block size
From: libaokun @ 2025-10-25 3:22 UTC (permalink / raw)
To: linux-ext4
Cc: tytso, adilger.kernel, jack, linux-kernel, kernel, mcgrof,
linux-fsdevel, linux-mm, yi.zhang, yangerkun, chengzhihao1,
libaokun1, libaokun
From: Baokun Li <libaokun1@huawei.com>
Since the block size may be greater than the page size, when a hole
extends beyond i_size, we need to align the hole's end upwards to the
larger of PAGE_SIZE and blocksize.
This is to prevent the issues seen in commit 2be4751b21ae ("ext4: fix
2nd xfstests 127 punch hole failure") from reappearing after BS > PS
is supported.
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/ext4/inode.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 4c04af7e51c9..a63513a3db53 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4401,7 +4401,8 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
* the page that contains i_size.
*/
if (end > inode->i_size)
- end = round_up(inode->i_size, PAGE_SIZE);
+ end = round_up(inode->i_size,
+ umax(PAGE_SIZE, sb->s_blocksize));
if (end > max_end)
end = max_end;
length = end - offset;
--
2.46.1
* [PATCH 05/25] ext4: enable DIOREAD_NOLOCK by default for BS > PS as well
From: libaokun @ 2025-10-25 3:22 UTC (permalink / raw)
To: linux-ext4
Cc: tytso, adilger.kernel, jack, linux-kernel, kernel, mcgrof,
linux-fsdevel, linux-mm, yi.zhang, yangerkun, chengzhihao1,
libaokun1, libaokun
From: Baokun Li <libaokun1@huawei.com>
The dioread_nolock code paths already support large folios, so enable
dioread_nolock by default regardless of whether the block size is less
than, equal to, or greater than PAGE_SIZE.
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/ext4/super.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 894529f9b0cc..aa5aee4d1b63 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -4383,8 +4383,7 @@ static void ext4_set_def_opts(struct super_block *sb,
((def_mount_opts & EXT4_DEFM_NODELALLOC) == 0))
set_opt(sb, DELALLOC);
- if (sb->s_blocksize <= PAGE_SIZE)
- set_opt(sb, DIOREAD_NOLOCK);
+ set_opt(sb, DIOREAD_NOLOCK);
}
static int ext4_handle_clustersize(struct super_block *sb)
--
2.46.1
* [PATCH 06/25] ext4: introduce s_min_folio_order for future BS > PS support
From: libaokun @ 2025-10-25 3:22 UTC (permalink / raw)
To: linux-ext4
Cc: tytso, adilger.kernel, jack, linux-kernel, kernel, mcgrof,
linux-fsdevel, linux-mm, yi.zhang, yangerkun, chengzhihao1,
libaokun1, libaokun
From: Baokun Li <libaokun1@huawei.com>
Introduce the s_min_folio_order field in the ext4_sb_info structure. This
field stores the minimum folio order required by the current filesystem,
laying the groundwork for future support of block sizes greater than
PAGE_SIZE.
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/ext4/ext4.h | 3 +++
fs/ext4/inode.c | 3 ++-
fs/ext4/super.c | 10 +++++-----
3 files changed, 10 insertions(+), 6 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 93c2bf4d125a..bca6c3709673 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1677,6 +1677,9 @@ struct ext4_sb_info {
/* record the last minlen when FITRIM is called. */
unsigned long s_last_trim_minblks;
+ /* minimum folio order of a page cache allocation */
+ unsigned int s_min_folio_order;
+
/* Precomputed FS UUID checksum for seeding other checksums */
__u32 s_csum_seed;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index a63513a3db53..889761ed51dd 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5174,7 +5174,8 @@ void ext4_set_inode_mapping_order(struct inode *inode)
if (!ext4_should_enable_large_folio(inode))
return;
- mapping_set_folio_order_range(inode->i_mapping, 0,
+ mapping_set_folio_order_range(inode->i_mapping,
+ EXT4_SB(inode->i_sb)->s_min_folio_order,
EXT4_MAX_PAGECACHE_ORDER(inode));
}
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index aa5aee4d1b63..d353e25a5b92 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -5100,11 +5100,8 @@ static int ext4_load_super(struct super_block *sb, ext4_fsblk_t *lsb,
* If the default block size is not the same as the real block size,
* we need to reload it.
*/
- if (sb->s_blocksize == blocksize) {
- *lsb = logical_sb_block;
- sbi->s_sbh = bh;
- return 0;
- }
+ if (sb->s_blocksize == blocksize)
+ goto success;
/*
* bh must be released before kill_bdev(), otherwise
@@ -5135,6 +5132,9 @@ static int ext4_load_super(struct super_block *sb, ext4_fsblk_t *lsb,
ext4_msg(sb, KERN_ERR, "Magic mismatch, very weird!");
goto out;
}
+
+success:
+ sbi->s_min_folio_order = get_order(blocksize);
*lsb = logical_sb_block;
sbi->s_sbh = bh;
return 0;
--
2.46.1
* [PATCH 07/25] ext4: support large block size in ext4_calculate_overhead()
From: libaokun @ 2025-10-25 3:22 UTC (permalink / raw)
To: linux-ext4
Cc: tytso, adilger.kernel, jack, linux-kernel, kernel, mcgrof,
linux-fsdevel, linux-mm, yi.zhang, yangerkun, chengzhihao1,
libaokun1, libaokun
From: Baokun Li <libaokun1@huawei.com>
ext4_calculate_overhead() used a single page for its bitmap buffer, which
worked fine when PAGE_SIZE >= block size. However, with block size greater
than page size (BS > PS) support, the bitmap can exceed a single page.
To address this, we now use __get_free_pages() to allocate multiple pages,
sized to the block size, to properly support BS > PS.
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/ext4/super.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index d353e25a5b92..7338c708ea1d 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -4182,7 +4182,8 @@ int ext4_calculate_overhead(struct super_block *sb)
unsigned int j_blocks, j_inum = le32_to_cpu(es->s_journal_inum);
ext4_group_t i, ngroups = ext4_get_groups_count(sb);
ext4_fsblk_t overhead = 0;
- char *buf = (char *) get_zeroed_page(GFP_NOFS);
+ gfp_t gfp = GFP_NOFS | __GFP_ZERO;
+ char *buf = (char *)__get_free_pages(gfp, sbi->s_min_folio_order);
if (!buf)
return -ENOMEM;
@@ -4207,7 +4208,7 @@ int ext4_calculate_overhead(struct super_block *sb)
blks = count_overhead(sb, i, buf);
overhead += blks;
if (blks)
- memset(buf, 0, PAGE_SIZE);
+ memset(buf, 0, sb->s_blocksize);
cond_resched();
}
@@ -4230,7 +4231,7 @@ int ext4_calculate_overhead(struct super_block *sb)
}
sbi->s_overhead = overhead;
smp_wmb();
- free_page((unsigned long) buf);
+ free_pages((unsigned long)buf, sbi->s_min_folio_order);
return 0;
}
--
2.46.1
^ permalink raw reply related [flat|nested] 68+ messages in thread
* [PATCH 08/25] ext4: support large block size in ext4_readdir()
From: libaokun @ 2025-10-25 3:22 UTC (permalink / raw)
To: linux-ext4
Cc: tytso, adilger.kernel, jack, linux-kernel, kernel, mcgrof,
linux-fsdevel, linux-mm, yi.zhang, yangerkun, chengzhihao1,
libaokun1, libaokun
From: Baokun Li <libaokun1@huawei.com>
In ext4_readdir(), page_cache_sync_readahead() is used to read ahead
mapped physical blocks. With LBS support, this can lead to a negative
right shift. To fix this, the page index is now calculated by first
converting the physical block number (pblk) to a file position and then
converting that to a page index. The correct number of pages to read
ahead is now passed as well.
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/ext4/dir.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c
index d4164c507a90..256fe2c1d4c1 100644
--- a/fs/ext4/dir.c
+++ b/fs/ext4/dir.c
@@ -192,13 +192,13 @@ static int ext4_readdir(struct file *file, struct dir_context *ctx)
continue;
}
if (err > 0) {
- pgoff_t index = map.m_pblk >>
- (PAGE_SHIFT - inode->i_blkbits);
+ pgoff_t index = map.m_pblk << inode->i_blkbits >>
+ PAGE_SHIFT;
if (!ra_has_index(&file->f_ra, index))
page_cache_sync_readahead(
sb->s_bdev->bd_mapping,
- &file->f_ra, file,
- index, 1);
+ &file->f_ra, file, index,
+ 1 << EXT4_SB(sb)->s_min_folio_order);
file->f_ra.prev_pos = (loff_t)index << PAGE_SHIFT;
bh = ext4_bread(NULL, inode, map.m_lblk, 0);
if (IS_ERR(bh)) {
--
2.46.1
* [PATCH 09/25] ext4: add EXT4_LBLK_TO_B macro for logical block to bytes conversion
From: libaokun @ 2025-10-25 3:22 UTC (permalink / raw)
To: linux-ext4
Cc: tytso, adilger.kernel, jack, linux-kernel, kernel, mcgrof,
linux-fsdevel, linux-mm, yi.zhang, yangerkun, chengzhihao1,
libaokun1, libaokun
From: Baokun Li <libaokun1@huawei.com>
No functional changes.
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/ext4/ext4.h | 1 +
fs/ext4/extents.c | 2 +-
fs/ext4/inode.c | 20 +++++++++-----------
fs/ext4/namei.c | 8 +++-----
fs/ext4/verity.c | 2 +-
5 files changed, 15 insertions(+), 18 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index bca6c3709673..9b236f620b3a 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -367,6 +367,7 @@ struct ext4_io_submit {
blkbits))
#define EXT4_B_TO_LBLK(inode, offset) \
(round_up((offset), i_blocksize(inode)) >> (inode)->i_blkbits)
+#define EXT4_LBLK_TO_B(inode, lblk) ((loff_t)(lblk) << (inode)->i_blkbits)
/* Translate a block number to a cluster number */
#define EXT4_B2C(sbi, blk) ((blk) >> (sbi)->s_cluster_bits)
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index ca5499e9412b..da640c88b863 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4562,7 +4562,7 @@ static int ext4_alloc_file_blocks(struct file *file, ext4_lblk_t offset,
* allow a full retry cycle for any remaining allocations
*/
retries = 0;
- epos = (loff_t)(map.m_lblk + ret) << blkbits;
+ epos = EXT4_LBLK_TO_B(inode, map.m_lblk + ret);
inode_set_ctime_current(inode);
if (new_size) {
if (epos > new_size)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 889761ed51dd..73c1da90b604 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -825,9 +825,8 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
!(flags & EXT4_GET_BLOCKS_ZERO) &&
!ext4_is_quota_file(inode) &&
ext4_should_order_data(inode)) {
- loff_t start_byte =
- (loff_t)map->m_lblk << inode->i_blkbits;
- loff_t length = (loff_t)map->m_len << inode->i_blkbits;
+ loff_t start_byte = EXT4_LBLK_TO_B(inode, map->m_lblk);
+ loff_t length = EXT4_LBLK_TO_B(inode, map->m_len);
if (flags & EXT4_GET_BLOCKS_IO_SUBMIT)
ret = ext4_jbd2_inode_add_wait(handle, inode,
@@ -2225,7 +2224,6 @@ static int mpage_process_folio(struct mpage_da_data *mpd, struct folio *folio,
ext4_lblk_t lblk = *m_lblk;
ext4_fsblk_t pblock = *m_pblk;
int err = 0;
- int blkbits = mpd->inode->i_blkbits;
ssize_t io_end_size = 0;
struct ext4_io_end_vec *io_end_vec = ext4_last_io_end_vec(io_end);
@@ -2251,7 +2249,8 @@ static int mpage_process_folio(struct mpage_da_data *mpd, struct folio *folio,
err = PTR_ERR(io_end_vec);
goto out;
}
- io_end_vec->offset = (loff_t)mpd->map.m_lblk << blkbits;
+ io_end_vec->offset = EXT4_LBLK_TO_B(mpd->inode,
+ mpd->map.m_lblk);
}
*map_bh = true;
goto out;
@@ -2261,7 +2260,7 @@ static int mpage_process_folio(struct mpage_da_data *mpd, struct folio *folio,
bh->b_blocknr = pblock++;
}
clear_buffer_unwritten(bh);
- io_end_size += (1 << blkbits);
+ io_end_size += i_blocksize(mpd->inode);
} while (lblk++, (bh = bh->b_this_page) != head);
io_end_vec->size += io_end_size;
@@ -2463,7 +2462,7 @@ static int mpage_map_and_submit_extent(handle_t *handle,
io_end_vec = ext4_alloc_io_end_vec(io_end);
if (IS_ERR(io_end_vec))
return PTR_ERR(io_end_vec);
- io_end_vec->offset = ((loff_t)map->m_lblk) << inode->i_blkbits;
+ io_end_vec->offset = EXT4_LBLK_TO_B(inode, map->m_lblk);
do {
err = mpage_map_one_extent(handle, mpd);
if (err < 0) {
@@ -3503,8 +3502,8 @@ static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
iomap->dax_dev = EXT4_SB(inode->i_sb)->s_daxdev;
else
iomap->bdev = inode->i_sb->s_bdev;
- iomap->offset = (u64) map->m_lblk << blkbits;
- iomap->length = (u64) map->m_len << blkbits;
+ iomap->offset = EXT4_LBLK_TO_B(inode, map->m_lblk);
+ iomap->length = EXT4_LBLK_TO_B(inode, map->m_len);
if ((map->m_flags & EXT4_MAP_MAPPED) &&
!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
@@ -3678,7 +3677,6 @@ static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
unsigned int flags)
{
handle_t *handle;
- u8 blkbits = inode->i_blkbits;
int ret, dio_credits, m_flags = 0, retries = 0;
bool force_commit = false;
@@ -3737,7 +3735,7 @@ static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
* i_disksize out to i_size. This could be beyond where direct I/O is
* happening and thus expose allocated blocks to direct I/O reads.
*/
- else if (((loff_t)map->m_lblk << blkbits) >= i_size_read(inode))
+ else if (EXT4_LBLK_TO_B(inode, map->m_lblk) >= i_size_read(inode))
m_flags = EXT4_GET_BLOCKS_CREATE;
else if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
m_flags = EXT4_GET_BLOCKS_IO_CREATE_EXT;
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 2cd36f59c9e3..78cefb7cc9a7 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -1076,7 +1076,7 @@ static int htree_dirblock_to_tree(struct file *dir_file,
for (; de < top; de = ext4_next_entry(de, dir->i_sb->s_blocksize)) {
if (ext4_check_dir_entry(dir, NULL, de, bh,
bh->b_data, bh->b_size,
- (block<<EXT4_BLOCK_SIZE_BITS(dir->i_sb))
+ EXT4_LBLK_TO_B(dir, block)
+ ((char *)de - bh->b_data))) {
/* silently ignore the rest of the block */
break;
@@ -1630,7 +1630,7 @@ static struct buffer_head *__ext4_find_entry(struct inode *dir,
}
set_buffer_verified(bh);
i = search_dirblock(bh, dir, fname,
- block << EXT4_BLOCK_SIZE_BITS(sb), res_dir);
+ EXT4_LBLK_TO_B(dir, block), res_dir);
if (i == 1) {
EXT4_I(dir)->i_dir_start_lookup = block;
ret = bh;
@@ -1710,7 +1710,6 @@ static struct buffer_head * ext4_dx_find_entry(struct inode *dir,
struct ext4_filename *fname,
struct ext4_dir_entry_2 **res_dir)
{
- struct super_block * sb = dir->i_sb;
struct dx_frame frames[EXT4_HTREE_LEVEL], *frame;
struct buffer_head *bh;
ext4_lblk_t block;
@@ -1729,8 +1728,7 @@ static struct buffer_head * ext4_dx_find_entry(struct inode *dir,
goto errout;
retval = search_dirblock(bh, dir, fname,
- block << EXT4_BLOCK_SIZE_BITS(sb),
- res_dir);
+ EXT4_LBLK_TO_B(dir, block), res_dir);
if (retval == 1)
goto success;
brelse(bh);
diff --git a/fs/ext4/verity.c b/fs/ext4/verity.c
index d9203228ce97..7a980a8059bd 100644
--- a/fs/ext4/verity.c
+++ b/fs/ext4/verity.c
@@ -302,7 +302,7 @@ static int ext4_get_verity_descriptor_location(struct inode *inode,
end_lblk = le32_to_cpu(last_extent->ee_block) +
ext4_ext_get_actual_len(last_extent);
- desc_size_pos = (u64)end_lblk << inode->i_blkbits;
+ desc_size_pos = EXT4_LBLK_TO_B(inode, end_lblk);
ext4_free_ext_path(path);
if (desc_size_pos < sizeof(desc_size_disk))
--
2.46.1
* [PATCH 10/25] ext4: add EXT4_LBLK_TO_P and EXT4_P_TO_LBLK for block/page conversion
From: libaokun @ 2025-10-25 3:22 UTC (permalink / raw)
To: linux-ext4
Cc: tytso, adilger.kernel, jack, linux-kernel, kernel, mcgrof,
linux-fsdevel, linux-mm, yi.zhang, yangerkun, chengzhihao1,
libaokun1, libaokun
From: Baokun Li <libaokun1@huawei.com>
As BS > PS support is coming, all block number to page index (and vice
versa) conversions must now go via bytes. Add the EXT4_LBLK_TO_P() and
EXT4_P_TO_LBLK() macros to simplify these conversions and to handle both
the BS <= PS and BS > PS cases cleanly.
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/ext4/ext4.h | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 9b236f620b3a..8223ed29b343 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -369,6 +369,12 @@ struct ext4_io_submit {
(round_up((offset), i_blocksize(inode)) >> (inode)->i_blkbits)
#define EXT4_LBLK_TO_B(inode, lblk) ((loff_t)(lblk) << (inode)->i_blkbits)
+/* Translate a block number to a page index */
+#define EXT4_LBLK_TO_P(inode, lblk) (EXT4_LBLK_TO_B((inode), (lblk)) >> \
+ PAGE_SHIFT)
+/* Translate a page index to a block number */
+#define EXT4_P_TO_LBLK(inode, pnum) (((loff_t)(pnum) << PAGE_SHIFT) >> \
+ (inode)->i_blkbits)
/* Translate a block number to a cluster number */
#define EXT4_B2C(sbi, blk) ((blk) >> (sbi)->s_cluster_bits)
/* Translate a cluster number to a block number */
--
2.46.1
^ permalink raw reply related [flat|nested] 68+ messages in thread
* [PATCH 11/25] ext4: support large block size in ext4_mb_load_buddy_gfp()
2025-10-25 3:21 [PATCH 00/25] ext4: enable block size larger than page size libaokun
` (9 preceding siblings ...)
2025-10-25 3:22 ` [PATCH 10/25] ext4: add EXT4_LBLK_TO_P and EXT4_P_TO_LBLK for block/page conversion libaokun
@ 2025-10-25 3:22 ` libaokun
2025-11-05 8:46 ` Jan Kara
2025-10-25 3:22 ` [PATCH 12/25] ext4: support large block size in ext4_mb_get_buddy_page_lock() libaokun
` (13 subsequent siblings)
24 siblings, 1 reply; 68+ messages in thread
From: libaokun @ 2025-10-25 3:22 UTC (permalink / raw)
To: linux-ext4
Cc: tytso, adilger.kernel, jack, linux-kernel, kernel, mcgrof,
linux-fsdevel, linux-mm, yi.zhang, yangerkun, chengzhihao1,
libaokun1, libaokun
From: Baokun Li <libaokun1@huawei.com>
Currently, ext4_mb_load_buddy_gfp() uses blocks_per_page to calculate the
folio index and offset. However, when blocksize is larger than PAGE_SIZE,
blocks_per_page becomes zero, leading to a potential division-by-zero bug.
To support BS > PS, use bytes to compute folio index and offset within
folio to get rid of blocks_per_page.
Also, if the buddy and the bitmap land in the same folio, take an extra
reference on that folio instead of looking it up again before updating
the buddy.
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/ext4/mballoc.c | 27 ++++++++++++++++-----------
1 file changed, 16 insertions(+), 11 deletions(-)
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 6070d3c86678..3494c6fe5bfb 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -1642,17 +1642,15 @@ int ext4_mb_init_group(struct super_block *sb, ext4_group_t group, gfp_t gfp)
/*
* Locking note: This routine calls ext4_mb_init_cache(), which takes the
- * block group lock of all groups for this page; do not hold the BG lock when
+ * block group lock of all groups for this folio; do not hold the BG lock when
* calling this routine!
*/
static noinline_for_stack int
ext4_mb_load_buddy_gfp(struct super_block *sb, ext4_group_t group,
struct ext4_buddy *e4b, gfp_t gfp)
{
- int blocks_per_page;
int block;
int pnum;
- int poff;
struct folio *folio;
int ret;
struct ext4_group_info *grp;
@@ -1662,7 +1660,6 @@ ext4_mb_load_buddy_gfp(struct super_block *sb, ext4_group_t group,
might_sleep();
mb_debug(sb, "load group %u\n", group);
- blocks_per_page = PAGE_SIZE / sb->s_blocksize;
grp = ext4_get_group_info(sb, group);
if (!grp)
return -EFSCORRUPTED;
@@ -1690,8 +1687,7 @@ ext4_mb_load_buddy_gfp(struct super_block *sb, ext4_group_t group,
* So for each group we need two blocks.
*/
block = group * 2;
- pnum = block / blocks_per_page;
- poff = block % blocks_per_page;
+ pnum = EXT4_LBLK_TO_P(inode, block);
/* Avoid locking the folio in the fast path ... */
folio = __filemap_get_folio(inode->i_mapping, pnum, FGP_ACCESSED, 0);
@@ -1723,7 +1719,8 @@ ext4_mb_load_buddy_gfp(struct super_block *sb, ext4_group_t group,
goto err;
}
mb_cmp_bitmaps(e4b, folio_address(folio) +
- (poff * sb->s_blocksize));
+ offset_in_folio(folio,
+ EXT4_LBLK_TO_B(inode, block)));
}
folio_unlock(folio);
}
@@ -1739,12 +1736,18 @@ ext4_mb_load_buddy_gfp(struct super_block *sb, ext4_group_t group,
/* Folios marked accessed already */
e4b->bd_bitmap_folio = folio;
- e4b->bd_bitmap = folio_address(folio) + (poff * sb->s_blocksize);
+ e4b->bd_bitmap = folio_address(folio) +
+ offset_in_folio(folio, EXT4_LBLK_TO_B(inode, block));
block++;
- pnum = block / blocks_per_page;
- poff = block % blocks_per_page;
+ pnum = EXT4_LBLK_TO_P(inode, block);
+ /* buddy and bitmap are on the same folio? */
+ if (folio_contains(folio, pnum)) {
+ folio_get(folio);
+ goto update_buddy;
+ }
+ /* we need another folio for the buddy */
folio = __filemap_get_folio(inode->i_mapping, pnum, FGP_ACCESSED, 0);
if (IS_ERR(folio) || !folio_test_uptodate(folio)) {
if (!IS_ERR(folio))
@@ -1779,9 +1782,11 @@ ext4_mb_load_buddy_gfp(struct super_block *sb, ext4_group_t group,
goto err;
}
+update_buddy:
/* Folios marked accessed already */
e4b->bd_buddy_folio = folio;
- e4b->bd_buddy = folio_address(folio) + (poff * sb->s_blocksize);
+ e4b->bd_buddy = folio_address(folio) +
+ offset_in_folio(folio, EXT4_LBLK_TO_B(inode, block));
return 0;
--
2.46.1
* [PATCH 12/25] ext4: support large block size in ext4_mb_get_buddy_page_lock()
2025-10-25 3:21 [PATCH 00/25] ext4: enable block size larger than page size libaokun
` (10 preceding siblings ...)
2025-10-25 3:22 ` [PATCH 11/25] ext4: support large block size in ext4_mb_load_buddy_gfp() libaokun
@ 2025-10-25 3:22 ` libaokun
2025-11-05 9:13 ` Jan Kara
2025-10-25 3:22 ` [PATCH 13/25] ext4: support large block size in ext4_mb_init_cache() libaokun
` (12 subsequent siblings)
24 siblings, 1 reply; 68+ messages in thread
From: libaokun @ 2025-10-25 3:22 UTC (permalink / raw)
To: linux-ext4
Cc: tytso, adilger.kernel, jack, linux-kernel, kernel, mcgrof,
linux-fsdevel, linux-mm, yi.zhang, yangerkun, chengzhihao1,
libaokun1, libaokun
From: Baokun Li <libaokun1@huawei.com>
Currently, ext4_mb_get_buddy_page_lock() uses blocks_per_page to calculate
folio index and offset. However, when blocksize is larger than PAGE_SIZE,
blocks_per_page becomes zero, leading to a potential division-by-zero bug.
To support BS > PS, use bytes to compute folio index and offset within
folio to get rid of blocks_per_page.
Also, since ext4_mb_get_buddy_page_lock() already fully supports folios,
rename it to ext4_mb_get_buddy_folio_lock().
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/ext4/mballoc.c | 42 ++++++++++++++++++++++--------------------
1 file changed, 22 insertions(+), 20 deletions(-)
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 3494c6fe5bfb..d42d768a705a 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -1510,50 +1510,52 @@ static int ext4_mb_init_cache(struct folio *folio, char *incore, gfp_t gfp)
}
/*
- * Lock the buddy and bitmap pages. This make sure other parallel init_group
- * on the same buddy page doesn't happen whild holding the buddy page lock.
- * Return locked buddy and bitmap pages on e4b struct. If buddy and bitmap
- * are on the same page e4b->bd_buddy_folio is NULL and return value is 0.
+ * Lock the buddy and bitmap folios. This makes sure other parallel init_group
+ * on the same buddy folio doesn't happen while holding the buddy folio lock.
+ * Return locked buddy and bitmap folios on e4b struct. If buddy and bitmap
+ * are on the same folio e4b->bd_buddy_folio is NULL and return value is 0.
*/
-static int ext4_mb_get_buddy_page_lock(struct super_block *sb,
+static int ext4_mb_get_buddy_folio_lock(struct super_block *sb,
ext4_group_t group, struct ext4_buddy *e4b, gfp_t gfp)
{
struct inode *inode = EXT4_SB(sb)->s_buddy_cache;
- int block, pnum, poff;
- int blocks_per_page;
+ int block, pnum;
struct folio *folio;
e4b->bd_buddy_folio = NULL;
e4b->bd_bitmap_folio = NULL;
- blocks_per_page = PAGE_SIZE / sb->s_blocksize;
/*
* the buddy cache inode stores the block bitmap
* and buddy information in consecutive blocks.
* So for each group we need two blocks.
*/
block = group * 2;
- pnum = block / blocks_per_page;
- poff = block % blocks_per_page;
+ pnum = EXT4_LBLK_TO_P(inode, block);
folio = __filemap_get_folio(inode->i_mapping, pnum,
FGP_LOCK | FGP_ACCESSED | FGP_CREAT, gfp);
if (IS_ERR(folio))
return PTR_ERR(folio);
BUG_ON(folio->mapping != inode->i_mapping);
+ WARN_ON_ONCE(folio_size(folio) < sb->s_blocksize);
e4b->bd_bitmap_folio = folio;
- e4b->bd_bitmap = folio_address(folio) + (poff * sb->s_blocksize);
+ e4b->bd_bitmap = folio_address(folio) +
+ offset_in_folio(folio, EXT4_LBLK_TO_B(inode, block));
- if (blocks_per_page >= 2) {
- /* buddy and bitmap are on the same page */
+ block++;
+ pnum = EXT4_LBLK_TO_P(inode, block);
+ if (folio_contains(folio, pnum)) {
+ /* buddy and bitmap are on the same folio */
return 0;
}
- /* blocks_per_page == 1, hence we need another page for the buddy */
- folio = __filemap_get_folio(inode->i_mapping, block + 1,
+ /* we need another folio for the buddy */
+ folio = __filemap_get_folio(inode->i_mapping, pnum,
FGP_LOCK | FGP_ACCESSED | FGP_CREAT, gfp);
if (IS_ERR(folio))
return PTR_ERR(folio);
BUG_ON(folio->mapping != inode->i_mapping);
+ WARN_ON_ONCE(folio_size(folio) < sb->s_blocksize);
e4b->bd_buddy_folio = folio;
return 0;
}
@@ -1592,14 +1594,14 @@ int ext4_mb_init_group(struct super_block *sb, ext4_group_t group, gfp_t gfp)
/*
* This ensures that we don't reinit the buddy cache
- * page which map to the group from which we are already
+ * folio which maps to the group from which we are already
* allocating. If we are looking at the buddy cache we would
* have taken a reference using ext4_mb_load_buddy and that
- * would have pinned buddy page to page cache.
- * The call to ext4_mb_get_buddy_page_lock will mark the
- * page accessed.
+ * would have pinned buddy folio to page cache.
+ * The call to ext4_mb_get_buddy_folio_lock will mark the
+ * folio accessed.
*/
- ret = ext4_mb_get_buddy_page_lock(sb, group, &e4b, gfp);
+ ret = ext4_mb_get_buddy_folio_lock(sb, group, &e4b, gfp);
if (ret || !EXT4_MB_GRP_NEED_INIT(this_grp)) {
/*
* somebody initialized the group
--
2.46.1
* [PATCH 13/25] ext4: support large block size in ext4_mb_init_cache()
2025-10-25 3:21 [PATCH 00/25] ext4: enable block size larger than page size libaokun
` (11 preceding siblings ...)
2025-10-25 3:22 ` [PATCH 12/25] ext4: support large block size in ext4_mb_get_buddy_page_lock() libaokun
@ 2025-10-25 3:22 ` libaokun
2025-11-05 9:18 ` Jan Kara
2025-10-25 3:22 ` [PATCH 14/25] ext4: prepare buddy cache inode for BS > PS with large folios libaokun
` (11 subsequent siblings)
24 siblings, 1 reply; 68+ messages in thread
From: libaokun @ 2025-10-25 3:22 UTC (permalink / raw)
To: linux-ext4
Cc: tytso, adilger.kernel, jack, linux-kernel, kernel, mcgrof,
linux-fsdevel, linux-mm, yi.zhang, yangerkun, chengzhihao1,
libaokun1, libaokun
From: Baokun Li <libaokun1@huawei.com>
Currently, ext4_mb_init_cache() uses blocks_per_page to calculate the
folio index and offset. However, when blocksize is larger than PAGE_SIZE,
blocks_per_page becomes zero, leading to a potential division-by-zero bug.
Since we now have the folio, we know its exact size. This allows us to
convert {blocks, groups}_per_page to {blocks, groups}_per_folio, thus
supporting block sizes greater than page size.
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/ext4/mballoc.c | 44 ++++++++++++++++++++------------------------
1 file changed, 20 insertions(+), 24 deletions(-)
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index d42d768a705a..31f4c7d65eb4 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -1329,26 +1329,25 @@ static void mb_regenerate_buddy(struct ext4_buddy *e4b)
* block bitmap and buddy information. The information are
* stored in the inode as
*
- * { page }
+ * { folio }
* [ group 0 bitmap][ group 0 buddy] [group 1][ group 1]...
*
*
* one block each for bitmap and buddy information.
- * So for each group we take up 2 blocks. A page can
- * contain blocks_per_page (PAGE_SIZE / blocksize) blocks.
- * So it can have information regarding groups_per_page which
- * is blocks_per_page/2
+ * So for each group we take up 2 blocks. A folio can
+ * contain blocks_per_folio (folio_size / blocksize) blocks.
+ * So it can have information regarding groups_per_folio which
+ * is blocks_per_folio/2
*
* Locking note: This routine takes the block group lock of all groups
- * for this page; do not hold this lock when calling this routine!
+ * for this folio; do not hold this lock when calling this routine!
*/
-
static int ext4_mb_init_cache(struct folio *folio, char *incore, gfp_t gfp)
{
ext4_group_t ngroups;
unsigned int blocksize;
- int blocks_per_page;
- int groups_per_page;
+ int blocks_per_folio;
+ int groups_per_folio;
int err = 0;
int i;
ext4_group_t first_group, group;
@@ -1365,27 +1364,24 @@ static int ext4_mb_init_cache(struct folio *folio, char *incore, gfp_t gfp)
sb = inode->i_sb;
ngroups = ext4_get_groups_count(sb);
blocksize = i_blocksize(inode);
- blocks_per_page = PAGE_SIZE / blocksize;
+ blocks_per_folio = folio_size(folio) / blocksize;
+ WARN_ON_ONCE(!blocks_per_folio);
+ groups_per_folio = DIV_ROUND_UP(blocks_per_folio, 2);
mb_debug(sb, "init folio %lu\n", folio->index);
- groups_per_page = blocks_per_page >> 1;
- if (groups_per_page == 0)
- groups_per_page = 1;
-
/* allocate buffer_heads to read bitmaps */
- if (groups_per_page > 1) {
- i = sizeof(struct buffer_head *) * groups_per_page;
+ if (groups_per_folio > 1) {
+ i = sizeof(struct buffer_head *) * groups_per_folio;
bh = kzalloc(i, gfp);
if (bh == NULL)
return -ENOMEM;
} else
bh = &bhs;
- first_group = folio->index * blocks_per_page / 2;
-
/* read all groups the folio covers into the cache */
- for (i = 0, group = first_group; i < groups_per_page; i++, group++) {
+ first_group = EXT4_P_TO_LBLK(inode, folio->index) / 2;
+ for (i = 0, group = first_group; i < groups_per_folio; i++, group++) {
if (group >= ngroups)
break;
@@ -1393,7 +1389,7 @@ static int ext4_mb_init_cache(struct folio *folio, char *incore, gfp_t gfp)
if (!grinfo)
continue;
/*
- * If page is uptodate then we came here after online resize
+ * If folio is uptodate then we came here after online resize
* which added some new uninitialized group info structs, so
* we must skip all initialized uptodate buddies on the folio,
* which may be currently in use by an allocating task.
@@ -1413,7 +1409,7 @@ static int ext4_mb_init_cache(struct folio *folio, char *incore, gfp_t gfp)
}
/* wait for I/O completion */
- for (i = 0, group = first_group; i < groups_per_page; i++, group++) {
+ for (i = 0, group = first_group; i < groups_per_folio; i++, group++) {
int err2;
if (!bh[i])
@@ -1423,8 +1419,8 @@ static int ext4_mb_init_cache(struct folio *folio, char *incore, gfp_t gfp)
err = err2;
}
- first_block = folio->index * blocks_per_page;
- for (i = 0; i < blocks_per_page; i++) {
+ first_block = EXT4_P_TO_LBLK(inode, folio->index);
+ for (i = 0; i < blocks_per_folio; i++) {
group = (first_block + i) >> 1;
if (group >= ngroups)
break;
@@ -1501,7 +1497,7 @@ static int ext4_mb_init_cache(struct folio *folio, char *incore, gfp_t gfp)
out:
if (bh) {
- for (i = 0; i < groups_per_page; i++)
+ for (i = 0; i < groups_per_folio; i++)
brelse(bh[i]);
if (bh != &bhs)
kfree(bh);
--
2.46.1
* [PATCH 14/25] ext4: prepare buddy cache inode for BS > PS with large folios
2025-10-25 3:21 [PATCH 00/25] ext4: enable block size larger than page size libaokun
` (12 preceding siblings ...)
2025-10-25 3:22 ` [PATCH 13/25] ext4: support large block size in ext4_mb_init_cache() libaokun
@ 2025-10-25 3:22 ` libaokun
2025-11-05 9:19 ` Jan Kara
2025-10-25 3:22 ` [PATCH 15/25] ext4: rename 'page' references to 'folio' in multi-block allocator libaokun
` (10 subsequent siblings)
24 siblings, 1 reply; 68+ messages in thread
From: libaokun @ 2025-10-25 3:22 UTC (permalink / raw)
To: linux-ext4
Cc: tytso, adilger.kernel, jack, linux-kernel, kernel, mcgrof,
linux-fsdevel, linux-mm, yi.zhang, yangerkun, chengzhihao1,
libaokun1, libaokun
From: Baokun Li <libaokun1@huawei.com>
We use EXT4_BAD_INO for the buddy cache inode number. This inode is not
accessed via __ext4_new_inode() or __ext4_iget(), meaning
ext4_set_inode_mapping_order() is not called to set its folio order range.
However, future block size greater than page size support requires this
inode to support large folios, and the buddy cache code already handles
BS > PS. Therefore, ext4_set_inode_mapping_order() is now explicitly
called for this specific inode to set its folio order range.
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/ext4/mballoc.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 31f4c7d65eb4..155c43ff2bc2 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -3493,6 +3493,8 @@ static int ext4_mb_init_backend(struct super_block *sb)
* this will avoid confusion if it ever shows up during debugging. */
sbi->s_buddy_cache->i_ino = EXT4_BAD_INO;
EXT4_I(sbi->s_buddy_cache)->i_disksize = 0;
+ ext4_set_inode_mapping_order(sbi->s_buddy_cache);
+
for (i = 0; i < ngroups; i++) {
cond_resched();
desc = ext4_get_group_desc(sb, i, NULL);
--
2.46.1
* [PATCH 15/25] ext4: rename 'page' references to 'folio' in multi-block allocator
2025-10-25 3:21 [PATCH 00/25] ext4: enable block size larger than page size libaokun
` (13 preceding siblings ...)
2025-10-25 3:22 ` [PATCH 14/25] ext4: prepare buddy cache inode for BS > PS with large folios libaokun
@ 2025-10-25 3:22 ` libaokun
2025-11-05 9:21 ` Jan Kara
2025-10-25 3:22 ` [PATCH 16/25] ext4: support large block size in ext4_mpage_readpages() libaokun
` (9 subsequent siblings)
24 siblings, 1 reply; 68+ messages in thread
From: libaokun @ 2025-10-25 3:22 UTC (permalink / raw)
To: linux-ext4
Cc: tytso, adilger.kernel, jack, linux-kernel, kernel, mcgrof,
linux-fsdevel, linux-mm, yi.zhang, yangerkun, chengzhihao1,
libaokun1, libaokun
From: Zhihao Cheng <chengzhihao1@huawei.com>
The ext4 multi-block allocator now fully supports folio objects. Update
all variable names, function names, and comments to replace legacy 'page'
terminology with 'folio', improving clarity and consistency.
No functional changes.
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/ext4/mballoc.c | 22 +++++++++++-----------
1 file changed, 11 insertions(+), 11 deletions(-)
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 155c43ff2bc2..cf07d1067f5f 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -98,14 +98,14 @@
* block bitmap and buddy information. The information are stored in the
* inode as:
*
- * { page }
+ * { folio }
* [ group 0 bitmap][ group 0 buddy] [group 1][ group 1]...
*
*
* one block each for bitmap and buddy information. So for each group we
- * take up 2 blocks. A page can contain blocks_per_page (PAGE_SIZE /
- * blocksize) blocks. So it can have information regarding groups_per_page
- * which is blocks_per_page/2
+ * take up 2 blocks. A folio can contain blocks_per_folio (folio_size /
+ * blocksize) blocks. So it can have information regarding groups_per_folio
+ * which is blocks_per_folio/2
*
* The buddy cache inode is not stored on disk. The inode is thrown
* away when the filesystem is unmounted.
@@ -1556,7 +1556,7 @@ static int ext4_mb_get_buddy_folio_lock(struct super_block *sb,
return 0;
}
-static void ext4_mb_put_buddy_page_lock(struct ext4_buddy *e4b)
+static void ext4_mb_put_buddy_folio_lock(struct ext4_buddy *e4b)
{
if (e4b->bd_bitmap_folio) {
folio_unlock(e4b->bd_bitmap_folio);
@@ -1570,7 +1570,7 @@ static void ext4_mb_put_buddy_page_lock(struct ext4_buddy *e4b)
/*
* Locking note: This routine calls ext4_mb_init_cache(), which takes the
- * block group lock of all groups for this page; do not hold the BG lock when
+ * block group lock of all groups for this folio; do not hold the BG lock when
* calling this routine!
*/
static noinline_for_stack
@@ -1618,7 +1618,7 @@ int ext4_mb_init_group(struct super_block *sb, ext4_group_t group, gfp_t gfp)
if (e4b.bd_buddy_folio == NULL) {
/*
* If both the bitmap and buddy are in
- * the same page we don't need to force
+ * the same folio we don't need to force
* init the buddy
*/
ret = 0;
@@ -1634,7 +1634,7 @@ int ext4_mb_init_group(struct super_block *sb, ext4_group_t group, gfp_t gfp)
goto err;
}
err:
- ext4_mb_put_buddy_page_lock(&e4b);
+ ext4_mb_put_buddy_folio_lock(&e4b);
return ret;
}
@@ -2227,7 +2227,7 @@ static void ext4_mb_use_best_found(struct ext4_allocation_context *ac,
ac->ac_buddy = ret >> 16;
/*
- * take the page reference. We want the page to be pinned
+ * take the folio reference. We want the folio to be pinned
* so that we don't get a ext4_mb_init_cache_call for this
* group until we update the bitmap. That would mean we
* double allocate blocks. The reference is dropped
@@ -2933,7 +2933,7 @@ static int ext4_mb_scan_group(struct ext4_allocation_context *ac,
if (cr < CR_ANY_FREE && spin_is_locked(ext4_group_lock_ptr(sb, group)))
return 0;
- /* This now checks without needing the buddy page */
+ /* This now checks without needing the buddy folio */
ret = ext4_mb_good_group_nolock(ac, group, cr);
if (ret <= 0) {
if (!ac->ac_first_err)
@@ -4725,7 +4725,7 @@ static void ext4_discard_allocated_blocks(struct ext4_allocation_context *ac)
"ext4: mb_load_buddy failed (%d)", err))
/*
* This should never happen since we pin the
- * pages in the ext4_allocation_context so
+ * folios in the ext4_allocation_context so
* ext4_mb_load_buddy() should never fail.
*/
return;
--
2.46.1
* [PATCH 16/25] ext4: support large block size in ext4_mpage_readpages()
2025-10-25 3:21 [PATCH 00/25] ext4: enable block size larger than page size libaokun
` (14 preceding siblings ...)
2025-10-25 3:22 ` [PATCH 15/25] ext4: rename 'page' references to 'folio' in multi-block allocator libaokun
@ 2025-10-25 3:22 ` libaokun
2025-11-05 9:26 ` Jan Kara
2025-10-25 3:22 ` [PATCH 17/25] ext4: support large block size in ext4_block_write_begin() libaokun
` (8 subsequent siblings)
24 siblings, 1 reply; 68+ messages in thread
From: libaokun @ 2025-10-25 3:22 UTC (permalink / raw)
To: linux-ext4
Cc: tytso, adilger.kernel, jack, linux-kernel, kernel, mcgrof,
linux-fsdevel, linux-mm, yi.zhang, yangerkun, chengzhihao1,
libaokun1, libaokun
From: Baokun Li <libaokun1@huawei.com>
Use the EXT4_P_TO_LBLK() macro to convert folio indexes to blocks to avoid
negative left shifts after supporting blocksize greater than PAGE_SIZE.
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/ext4/readpage.c | 7 ++-----
1 file changed, 2 insertions(+), 5 deletions(-)
diff --git a/fs/ext4/readpage.c b/fs/ext4/readpage.c
index f329daf6e5c7..8c8ec9d60b90 100644
--- a/fs/ext4/readpage.c
+++ b/fs/ext4/readpage.c
@@ -213,9 +213,7 @@ int ext4_mpage_readpages(struct inode *inode,
{
struct bio *bio = NULL;
sector_t last_block_in_bio = 0;
-
const unsigned blkbits = inode->i_blkbits;
- const unsigned blocks_per_page = PAGE_SIZE >> blkbits;
const unsigned blocksize = 1 << blkbits;
sector_t next_block;
sector_t block_in_file;
@@ -251,9 +249,8 @@ int ext4_mpage_readpages(struct inode *inode,
blocks_per_folio = folio_size(folio) >> blkbits;
first_hole = blocks_per_folio;
- block_in_file = next_block =
- (sector_t)folio->index << (PAGE_SHIFT - blkbits);
- last_block = block_in_file + nr_pages * blocks_per_page;
+ block_in_file = next_block = EXT4_P_TO_LBLK(inode, folio->index);
+ last_block = EXT4_P_TO_LBLK(inode, folio->index + nr_pages);
last_block_in_file = (ext4_readpage_limit(inode) +
blocksize - 1) >> blkbits;
if (last_block > last_block_in_file)
--
2.46.1
* [PATCH 17/25] ext4: support large block size in ext4_block_write_begin()
2025-10-25 3:21 [PATCH 00/25] ext4: enable block size larger than page size libaokun
` (15 preceding siblings ...)
2025-10-25 3:22 ` [PATCH 16/25] ext4: support large block size in ext4_mpage_readpages() libaokun
@ 2025-10-25 3:22 ` libaokun
2025-11-05 9:28 ` Jan Kara
2025-10-25 3:22 ` [PATCH 18/25] ext4: support large block size in mpage_map_and_submit_buffers() libaokun
` (7 subsequent siblings)
24 siblings, 1 reply; 68+ messages in thread
From: libaokun @ 2025-10-25 3:22 UTC (permalink / raw)
To: linux-ext4
Cc: tytso, adilger.kernel, jack, linux-kernel, kernel, mcgrof,
linux-fsdevel, linux-mm, yi.zhang, yangerkun, chengzhihao1,
libaokun1, libaokun
From: Baokun Li <libaokun1@huawei.com>
Use the EXT4_P_TO_LBLK() macro to convert folio indexes to blocks to avoid
negative left shifts after supporting blocksize greater than PAGE_SIZE.
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/ext4/inode.c | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 73c1da90b604..d97ce88d6e0a 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1162,8 +1162,7 @@ int ext4_block_write_begin(handle_t *handle, struct folio *folio,
unsigned block_start, block_end;
sector_t block;
int err = 0;
- unsigned blocksize = inode->i_sb->s_blocksize;
- unsigned bbits;
+ unsigned int blocksize = i_blocksize(inode);
struct buffer_head *bh, *head, *wait[2];
int nr_wait = 0;
int i;
@@ -1172,12 +1171,12 @@ int ext4_block_write_begin(handle_t *handle, struct folio *folio,
BUG_ON(!folio_test_locked(folio));
BUG_ON(to > folio_size(folio));
BUG_ON(from > to);
+ WARN_ON_ONCE(blocksize > folio_size(folio));
head = folio_buffers(folio);
if (!head)
head = create_empty_buffers(folio, blocksize, 0);
- bbits = ilog2(blocksize);
- block = (sector_t)folio->index << (PAGE_SHIFT - bbits);
+ block = EXT4_P_TO_LBLK(inode, folio->index);
for (bh = head, block_start = 0; bh != head || !block_start;
block++, block_start = block_end, bh = bh->b_this_page) {
--
2.46.1
* [PATCH 18/25] ext4: support large block size in mpage_map_and_submit_buffers()
2025-10-25 3:21 [PATCH 00/25] ext4: enable block size larger than page size libaokun
` (16 preceding siblings ...)
2025-10-25 3:22 ` [PATCH 17/25] ext4: support large block size in ext4_block_write_begin() libaokun
@ 2025-10-25 3:22 ` libaokun
2025-11-05 9:30 ` Jan Kara
2025-10-25 3:22 ` [PATCH 19/25] ext4: support large block size in mpage_prepare_extent_to_map() libaokun
` (6 subsequent siblings)
24 siblings, 1 reply; 68+ messages in thread
From: libaokun @ 2025-10-25 3:22 UTC (permalink / raw)
To: linux-ext4
Cc: tytso, adilger.kernel, jack, linux-kernel, kernel, mcgrof,
linux-fsdevel, linux-mm, yi.zhang, yangerkun, chengzhihao1,
libaokun1, libaokun
From: Baokun Li <libaokun1@huawei.com>
Use the EXT4_P_TO_LBLK/EXT4_LBLK_TO_P macros to complete the conversion
between folio indexes and blocks to avoid negative left/right shifts after
supporting blocksize greater than PAGE_SIZE.
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/ext4/inode.c | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index d97ce88d6e0a..cbf04b473ae7 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2289,15 +2289,14 @@ static int mpage_map_and_submit_buffers(struct mpage_da_data *mpd)
struct folio_batch fbatch;
unsigned nr, i;
struct inode *inode = mpd->inode;
- int bpp_bits = PAGE_SHIFT - inode->i_blkbits;
pgoff_t start, end;
ext4_lblk_t lblk;
ext4_fsblk_t pblock;
int err;
bool map_bh = false;
- start = mpd->map.m_lblk >> bpp_bits;
- end = (mpd->map.m_lblk + mpd->map.m_len - 1) >> bpp_bits;
+ start = EXT4_LBLK_TO_P(inode, mpd->map.m_lblk);
+ end = EXT4_LBLK_TO_P(inode, mpd->map.m_lblk + mpd->map.m_len - 1);
pblock = mpd->map.m_pblk;
folio_batch_init(&fbatch);
@@ -2308,7 +2307,7 @@ static int mpage_map_and_submit_buffers(struct mpage_da_data *mpd)
for (i = 0; i < nr; i++) {
struct folio *folio = fbatch.folios[i];
- lblk = folio->index << bpp_bits;
+ lblk = EXT4_P_TO_LBLK(inode, folio->index);
err = mpage_process_folio(mpd, folio, &lblk, &pblock,
&map_bh);
/*
--
2.46.1
* [PATCH 19/25] ext4: support large block size in mpage_prepare_extent_to_map()
2025-10-25 3:21 [PATCH 00/25] ext4: enable block size larger than page size libaokun
` (17 preceding siblings ...)
2025-10-25 3:22 ` [PATCH 18/25] ext4: support large block size in mpage_map_and_submit_buffers() libaokun
@ 2025-10-25 3:22 ` libaokun
2025-11-05 9:31 ` Jan Kara
2025-10-25 3:22 ` [PATCH 20/25] ext4: support large block size in __ext4_block_zero_page_range() libaokun
` (5 subsequent siblings)
24 siblings, 1 reply; 68+ messages in thread
From: libaokun @ 2025-10-25 3:22 UTC (permalink / raw)
To: linux-ext4
Cc: tytso, adilger.kernel, jack, linux-kernel, kernel, mcgrof,
linux-fsdevel, linux-mm, yi.zhang, yangerkun, chengzhihao1,
libaokun1, libaokun
From: Baokun Li <libaokun1@huawei.com>
Use the EXT4_P_TO_LBLK/EXT4_LBLK_TO_P macros to complete the conversion
between folio indexes and blocks to avoid negative left/right shifts after
supporting blocksize greater than PAGE_SIZE.
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/ext4/inode.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index cbf04b473ae7..ce48cc6780a3 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2610,7 +2610,6 @@ static int mpage_prepare_extent_to_map(struct mpage_da_data *mpd)
pgoff_t end = mpd->end_pos >> PAGE_SHIFT;
xa_mark_t tag;
int i, err = 0;
- int blkbits = mpd->inode->i_blkbits;
ext4_lblk_t lblk;
struct buffer_head *head;
handle_t *handle = NULL;
@@ -2649,7 +2648,7 @@ static int mpage_prepare_extent_to_map(struct mpage_da_data *mpd)
*/
if (mpd->wbc->sync_mode == WB_SYNC_NONE &&
mpd->wbc->nr_to_write <=
- mpd->map.m_len >> (PAGE_SHIFT - blkbits))
+ EXT4_LBLK_TO_P(mpd->inode, mpd->map.m_len))
goto out;
/* If we can't merge this page, we are done. */
@@ -2727,8 +2726,7 @@ static int mpage_prepare_extent_to_map(struct mpage_da_data *mpd)
mpage_folio_done(mpd, folio);
} else {
/* Add all dirty buffers to mpd */
- lblk = ((ext4_lblk_t)folio->index) <<
- (PAGE_SHIFT - blkbits);
+ lblk = EXT4_P_TO_LBLK(mpd->inode, folio->index);
head = folio_buffers(folio);
err = mpage_process_page_bufs(mpd, head, head,
lblk);
--
2.46.1
* [PATCH 20/25] ext4: support large block size in __ext4_block_zero_page_range()
2025-10-25 3:21 [PATCH 00/25] ext4: enable block size larger than page size libaokun
` (18 preceding siblings ...)
2025-10-25 3:22 ` [PATCH 19/25] ext4: support large block size in mpage_prepare_extent_to_map() libaokun
@ 2025-10-25 3:22 ` libaokun
2025-11-05 9:33 ` Jan Kara
2025-10-25 3:22 ` [PATCH 21/25] ext4: make online defragmentation support large block size libaokun
` (4 subsequent siblings)
24 siblings, 1 reply; 68+ messages in thread
From: libaokun @ 2025-10-25 3:22 UTC (permalink / raw)
To: linux-ext4
Cc: tytso, adilger.kernel, jack, linux-kernel, kernel, mcgrof,
linux-fsdevel, linux-mm, yi.zhang, yangerkun, chengzhihao1,
libaokun1, libaokun
From: Zhihao Cheng <chengzhihao1@huawei.com>
Use the EXT4_P_TO_LBLK() macro to convert folio indexes to blocks to avoid
negative left shifts after supporting blocksize greater than PAGE_SIZE.
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/ext4/inode.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index ce48cc6780a3..b3fa29923a1d 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4066,7 +4066,7 @@ static int __ext4_block_zero_page_range(handle_t *handle,
blocksize = inode->i_sb->s_blocksize;
- iblock = folio->index << (PAGE_SHIFT - inode->i_sb->s_blocksize_bits);
+ iblock = EXT4_P_TO_LBLK(inode, folio->index);
bh = folio_buffers(folio);
if (!bh)
--
2.46.1
* [PATCH 21/25] ext4: make online defragmentation support large block size
2025-10-25 3:21 [PATCH 00/25] ext4: enable block size larger than page size libaokun
` (19 preceding siblings ...)
2025-10-25 3:22 ` [PATCH 20/25] ext4: support large block size in __ext4_block_zero_page_range() libaokun
@ 2025-10-25 3:22 ` libaokun
2025-11-05 9:50 ` Jan Kara
2025-10-25 3:22 ` [PATCH 22/25] fs/buffer: prevent WARN_ON in __alloc_pages_slowpath() when BS > PS libaokun
` (3 subsequent siblings)
24 siblings, 1 reply; 68+ messages in thread
From: libaokun @ 2025-10-25 3:22 UTC (permalink / raw)
To: linux-ext4
Cc: tytso, adilger.kernel, jack, linux-kernel, kernel, mcgrof,
linux-fsdevel, linux-mm, yi.zhang, yangerkun, chengzhihao1,
libaokun1, libaokun
From: Zhihao Cheng <chengzhihao1@huawei.com>
Several places assume that block size <= PAGE_SIZE; modify them to
support large block sizes (bs > ps).
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
---
fs/ext4/move_extent.c | 20 ++++++++++----------
1 file changed, 10 insertions(+), 10 deletions(-)
diff --git a/fs/ext4/move_extent.c b/fs/ext4/move_extent.c
index 4b091c21908f..cb55cd9e7eeb 100644
--- a/fs/ext4/move_extent.c
+++ b/fs/ext4/move_extent.c
@@ -270,7 +270,6 @@ move_extent_per_page(struct file *o_filp, struct inode *donor_inode,
int i, err2, jblocks, retries = 0;
int replaced_count = 0;
int from;
- int blocks_per_page = PAGE_SIZE >> orig_inode->i_blkbits;
struct super_block *sb = orig_inode->i_sb;
struct buffer_head *bh = NULL;
@@ -288,11 +287,11 @@ move_extent_per_page(struct file *o_filp, struct inode *donor_inode,
return 0;
}
- orig_blk_offset = orig_page_offset * blocks_per_page +
- data_offset_in_page;
+ orig_blk_offset = EXT4_P_TO_LBLK(orig_inode, orig_page_offset) +
+ data_offset_in_page;
- donor_blk_offset = donor_page_offset * blocks_per_page +
- data_offset_in_page;
+ donor_blk_offset = EXT4_P_TO_LBLK(donor_inode, donor_page_offset) +
+ data_offset_in_page;
/* Calculate data_size */
if ((orig_blk_offset + block_len_in_page - 1) ==
@@ -565,7 +564,7 @@ ext4_move_extents(struct file *o_filp, struct file *d_filp, __u64 orig_blk,
struct inode *orig_inode = file_inode(o_filp);
struct inode *donor_inode = file_inode(d_filp);
struct ext4_ext_path *path = NULL;
- int blocks_per_page = PAGE_SIZE >> orig_inode->i_blkbits;
+ int blocks_per_page = 1;
ext4_lblk_t o_end, o_start = orig_blk;
ext4_lblk_t d_start = donor_blk;
int ret;
@@ -608,6 +607,9 @@ ext4_move_extents(struct file *o_filp, struct file *d_filp, __u64 orig_blk,
return -EOPNOTSUPP;
}
+ if (i_blocksize(orig_inode) < PAGE_SIZE)
+ blocks_per_page = PAGE_SIZE >> orig_inode->i_blkbits;
+
/* Protect orig and donor inodes against a truncate */
lock_two_nondirectories(orig_inode, donor_inode);
@@ -665,10 +667,8 @@ ext4_move_extents(struct file *o_filp, struct file *d_filp, __u64 orig_blk,
if (o_end - o_start < cur_len)
cur_len = o_end - o_start;
- orig_page_index = o_start >> (PAGE_SHIFT -
- orig_inode->i_blkbits);
- donor_page_index = d_start >> (PAGE_SHIFT -
- donor_inode->i_blkbits);
+ orig_page_index = EXT4_LBLK_TO_P(orig_inode, o_start);
+ donor_page_index = EXT4_LBLK_TO_P(donor_inode, d_start);
offset_in_page = o_start % blocks_per_page;
if (cur_len > blocks_per_page - offset_in_page)
cur_len = blocks_per_page - offset_in_page;
--
2.46.1
* [PATCH 22/25] fs/buffer: prevent WARN_ON in __alloc_pages_slowpath() when BS > PS
2025-10-25 3:21 [PATCH 00/25] ext4: enable block size larger than page size libaokun
` (20 preceding siblings ...)
2025-10-25 3:22 ` [PATCH 21/25] ext4: make online defragmentation support large block size libaokun
@ 2025-10-25 3:22 ` libaokun
2025-10-25 4:45 ` Matthew Wilcox
2025-10-25 3:22 ` [PATCH 23/25] jbd2: " libaokun
` (2 subsequent siblings)
24 siblings, 1 reply; 68+ messages in thread
From: libaokun @ 2025-10-25 3:22 UTC (permalink / raw)
To: linux-ext4
Cc: tytso, adilger.kernel, jack, linux-kernel, kernel, mcgrof,
linux-fsdevel, linux-mm, yi.zhang, yangerkun, chengzhihao1,
libaokun1, libaokun
From: Baokun Li <libaokun1@huawei.com>
In __alloc_pages_slowpath(), allocating page units greater than order-1
with the __GFP_NOFAIL flag may trigger an unexpected WARN_ON. To avoid
this, handle the case separately in grow_dev_folio(). This ensures that
buffer_head-based filesystems will not encounter the warning when using
__GFP_NOFAIL to read metadata after BS > PS support is enabled.
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/buffer.c | 33 +++++++++++++++++++++++++++++++--
1 file changed, 31 insertions(+), 2 deletions(-)
diff --git a/fs/buffer.c b/fs/buffer.c
index 6a8752f7bbed..2f5a7dd199b2 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1031,6 +1031,35 @@ static sector_t folio_init_buffers(struct folio *folio,
return end_block;
}
+static struct folio *blkdev_get_folio(struct address_space *mapping,
+ pgoff_t index, fgf_t fgp_flags, gfp_t gfp)
+{
+ struct folio *folio;
+ unsigned int min_order = mapping_min_folio_order(mapping);
+
+ /*
+ * Allocating page units greater than order-1 with __GFP_NOFAIL in
+ * __alloc_pages_slowpath() can trigger an unexpected WARN_ON.
+ * Handle this case separately to suppress the warning.
+ */
+ if (min_order <= 1)
+ return __filemap_get_folio(mapping, index, fgp_flags, gfp);
+
+ while (1) {
+ folio = __filemap_get_folio(mapping, index, fgp_flags,
+ gfp & ~__GFP_NOFAIL);
+ if (!IS_ERR(folio) || !(gfp & __GFP_NOFAIL))
+ return folio;
+
+ if (PTR_ERR(folio) != -ENOMEM && PTR_ERR(folio) != -EAGAIN)
+ return folio;
+
+ memalloc_retry_wait(gfp);
+ }
+
+ return folio;
+}
+
/*
* Create the page-cache folio that contains the requested block.
*
@@ -1047,8 +1076,8 @@ static bool grow_dev_folio(struct block_device *bdev, sector_t block,
struct buffer_head *bh;
sector_t end_block = 0;
- folio = __filemap_get_folio(mapping, index,
- FGP_LOCK | FGP_ACCESSED | FGP_CREAT, gfp);
+ folio = blkdev_get_folio(mapping, index,
+ FGP_LOCK | FGP_ACCESSED | FGP_CREAT, gfp);
if (IS_ERR(folio))
return false;
--
2.46.1
* [PATCH 23/25] jbd2: prevent WARN_ON in __alloc_pages_slowpath() when BS > PS
2025-10-25 3:21 [PATCH 00/25] ext4: enable block size larger than page size libaokun
` (21 preceding siblings ...)
2025-10-25 3:22 ` [PATCH 22/25] fs/buffer: prevent WARN_ON in __alloc_pages_slowpath() when BS > PS libaokun
@ 2025-10-25 3:22 ` libaokun
2025-10-25 3:22 ` [PATCH 24/25] ext4: add checks for large folio incompatibilities " libaokun
2025-10-25 3:22 ` [PATCH 25/25] ext4: enable block size larger than page size libaokun
24 siblings, 0 replies; 68+ messages in thread
From: libaokun @ 2025-10-25 3:22 UTC (permalink / raw)
To: linux-ext4
Cc: tytso, adilger.kernel, jack, linux-kernel, kernel, mcgrof,
linux-fsdevel, linux-mm, yi.zhang, yangerkun, chengzhihao1,
libaokun1, libaokun
From: Baokun Li <libaokun1@huawei.com>
In __alloc_pages_slowpath(), allocating page units larger than order-1
with __GFP_NOFAIL may trigger an unexpected WARN_ON. To prevent this,
handle the case explicitly in jbd2_alloc(), ensuring that the warning
does not occur after enabling BS > PS support.
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/jbd2/journal.c | 28 +++++++++++++++++++++++++---
1 file changed, 25 insertions(+), 3 deletions(-)
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index d480b94117cd..9185f9e2b201 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -2761,14 +2761,36 @@ static struct kmem_cache *get_slab(size_t size)
void *jbd2_alloc(size_t size, gfp_t flags)
{
void *ptr;
+ int order;
BUG_ON(size & (size-1)); /* Must be a power of 2 */
- if (size < PAGE_SIZE)
+ if (size < PAGE_SIZE) {
ptr = kmem_cache_alloc(get_slab(size), flags);
- else
- ptr = (void *)__get_free_pages(flags, get_order(size));
+ goto out;
+ }
+
+ /*
+ * Allocating page units greater than order-1 with __GFP_NOFAIL in
+ * __alloc_pages_slowpath() can trigger an unexpected WARN_ON.
+ * Handle this case separately to suppress the warning.
+ */
+ order = get_order(size);
+ if (order <= 1) {
+ ptr = (void *)__get_free_pages(flags, order);
+ goto out;
+ }
+ while (1) {
+ ptr = (void *)__get_free_pages(flags & ~__GFP_NOFAIL, order);
+ if (ptr)
+ break;
+ if (!(flags & __GFP_NOFAIL))
+ break;
+ memalloc_retry_wait(flags);
+ }
+
+out:
/* Check alignment; SLUB has gotten this wrong in the past,
* and this can lead to user data corruption! */
BUG_ON(((unsigned long) ptr) & (size-1));
--
2.46.1
* [PATCH 24/25] ext4: add checks for large folio incompatibilities when BS > PS
2025-10-25 3:21 [PATCH 00/25] ext4: enable block size larger than page size libaokun
` (22 preceding siblings ...)
2025-10-25 3:22 ` [PATCH 23/25] jbd2: " libaokun
@ 2025-10-25 3:22 ` libaokun
2025-11-05 9:59 ` Jan Kara
2025-10-25 3:22 ` [PATCH 25/25] ext4: enable block size larger than page size libaokun
24 siblings, 1 reply; 68+ messages in thread
From: libaokun @ 2025-10-25 3:22 UTC (permalink / raw)
To: linux-ext4
Cc: tytso, adilger.kernel, jack, linux-kernel, kernel, mcgrof,
linux-fsdevel, linux-mm, yi.zhang, yangerkun, chengzhihao1,
libaokun1, libaokun
From: Baokun Li <libaokun1@huawei.com>
Supporting a block size greater than the page size (BS > PS) requires
support for large folios. However, several features (e.g., verity, encrypt)
and mount options (e.g., data=journal) do not yet support large folios.
To prevent conflicts, this patch adds checks at mount time to prohibit
these features and options from being used when BS > PS. Since the data
mode cannot be changed on remount, there is no need to check on remount.
A new mount flag, EXT4_MF_LARGE_FOLIO, is introduced. This flag is set
after the checks pass, indicating that the filesystem has no features or
mount options incompatible with large folios. Subsequent checks can simply
test for this flag to avoid redundant verifications.
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/ext4/ext4.h | 3 ++-
fs/ext4/inode.c | 10 ++++------
fs/ext4/super.c | 26 ++++++++++++++++++++++++++
3 files changed, 32 insertions(+), 7 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 8223ed29b343..f1163deb0812 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1859,7 +1859,8 @@ static inline int ext4_get_resgid(struct ext4_super_block *es)
enum {
EXT4_MF_MNTDIR_SAMPLED,
EXT4_MF_FC_INELIGIBLE, /* Fast commit ineligible */
- EXT4_MF_JOURNAL_DESTROY /* Journal is in process of destroying */
+ EXT4_MF_JOURNAL_DESTROY,/* Journal is in process of destroying */
+ EXT4_MF_LARGE_FOLIO, /* large folios are supported */
};
static inline void ext4_set_mount_flag(struct super_block *sb, int bit)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index b3fa29923a1d..04f9380d4211 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5143,14 +5143,12 @@ static bool ext4_should_enable_large_folio(struct inode *inode)
{
struct super_block *sb = inode->i_sb;
- if (!S_ISREG(inode->i_mode))
- return false;
- if (test_opt(sb, DATA_FLAGS) == EXT4_MOUNT_JOURNAL_DATA ||
- ext4_test_inode_flag(inode, EXT4_INODE_JOURNAL_DATA))
+ if (!ext4_test_mount_flag(sb, EXT4_MF_LARGE_FOLIO))
return false;
- if (ext4_has_feature_verity(sb))
+
+ if (!S_ISREG(inode->i_mode))
return false;
- if (ext4_has_feature_encrypt(sb))
+ if (ext4_test_inode_flag(inode, EXT4_INODE_JOURNAL_DATA))
return false;
return true;
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 7338c708ea1d..fdc006a973aa 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -5034,6 +5034,28 @@ static const char *ext4_has_journal_option(struct super_block *sb)
return NULL;
}
+static int ext4_check_large_folio(struct super_block *sb)
+{
+ const char *err_str = NULL;
+
+ if (test_opt(sb, DATA_FLAGS) == EXT4_MOUNT_JOURNAL_DATA)
+ err_str = "data=journal";
+ else if (ext4_has_feature_verity(sb))
+ err_str = "verity";
+ else if (ext4_has_feature_encrypt(sb))
+ err_str = "encrypt";
+
+ if (!err_str) {
+ ext4_set_mount_flag(sb, EXT4_MF_LARGE_FOLIO);
+ } else if (sb->s_blocksize > PAGE_SIZE) {
+ ext4_msg(sb, KERN_ERR, "bs(%lu) > ps(%lu) unsupported for %s",
+ sb->s_blocksize, PAGE_SIZE, err_str);
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
static int ext4_load_super(struct super_block *sb, ext4_fsblk_t *lsb,
int silent)
{
@@ -5310,6 +5332,10 @@ static int __ext4_fill_super(struct fs_context *fc, struct super_block *sb)
ext4_apply_options(fc, sb);
+ err = ext4_check_large_folio(sb);
+ if (err < 0)
+ goto failed_mount;
+
err = ext4_encoding_init(sb, es);
if (err)
goto failed_mount;
--
2.46.1
* [PATCH 25/25] ext4: enable block size larger than page size
2025-10-25 3:21 [PATCH 00/25] ext4: enable block size larger than page size libaokun
` (23 preceding siblings ...)
2025-10-25 3:22 ` [PATCH 24/25] ext4: add checks for large folio incompatibilities " libaokun
@ 2025-10-25 3:22 ` libaokun
2025-11-05 10:14 ` Jan Kara
24 siblings, 1 reply; 68+ messages in thread
From: libaokun @ 2025-10-25 3:22 UTC (permalink / raw)
To: linux-ext4
Cc: tytso, adilger.kernel, jack, linux-kernel, kernel, mcgrof,
linux-fsdevel, linux-mm, yi.zhang, yangerkun, chengzhihao1,
libaokun1, libaokun
From: Baokun Li <libaokun1@huawei.com>
Since the block device layer (see commit 3c20917120ce ("block/bdev: enable
large folio support for large logical block sizes")) and the page cache
(see commit ab95d23bab220ef8 ("filemap: allocate mapping_min_order folios
in the page cache")) can enforce a minimum order when allocating folios,
and ext4 gained large folio support in commit 7ac67301e82f ("ext4: enable
large folio for regular file"), now add support for block_size > PAGE_SIZE
in ext4.
set_blocksize() -> bdev_validate_blocksize() already validates the block
size, so ext4_load_super() does not need to perform additional checks.
Here we only need to enable large folios by default when s_min_folio_order
is greater than 0 and add the FS_LBS bit to fs_flags.
In addition, mark this feature as experimental.
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/ext4/inode.c | 3 +++
fs/ext4/super.c | 6 +++++-
2 files changed, 8 insertions(+), 1 deletion(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 04f9380d4211..ba6cf05860ae 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5146,6 +5146,9 @@ static bool ext4_should_enable_large_folio(struct inode *inode)
if (!ext4_test_mount_flag(sb, EXT4_MF_LARGE_FOLIO))
return false;
+ if (EXT4_SB(sb)->s_min_folio_order)
+ return true;
+
if (!S_ISREG(inode->i_mode))
return false;
if (ext4_test_inode_flag(inode, EXT4_INODE_JOURNAL_DATA))
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index fdc006a973aa..4c0bd79bdf68 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -5053,6 +5053,9 @@ static int ext4_check_large_folio(struct super_block *sb)
return -EINVAL;
}
+ if (sb->s_blocksize > PAGE_SIZE)
+ ext4_msg(sb, KERN_NOTICE, "EXPERIMENTAL bs(%lu) > ps(%lu) enabled.",
+ sb->s_blocksize, PAGE_SIZE);
return 0;
}
@@ -7432,7 +7435,8 @@ static struct file_system_type ext4_fs_type = {
.init_fs_context = ext4_init_fs_context,
.parameters = ext4_param_specs,
.kill_sb = ext4_kill_sb,
- .fs_flags = FS_REQUIRES_DEV | FS_ALLOW_IDMAP | FS_MGTIME,
+ .fs_flags = FS_REQUIRES_DEV | FS_ALLOW_IDMAP | FS_MGTIME |
+ FS_LBS,
};
MODULE_ALIAS_FS("ext4");
--
2.46.1
* Re: [PATCH 22/25] fs/buffer: prevent WARN_ON in __alloc_pages_slowpath() when BS > PS
2025-10-25 3:22 ` [PATCH 22/25] fs/buffer: prevent WARN_ON in __alloc_pages_slowpath() when BS > PS libaokun
@ 2025-10-25 4:45 ` Matthew Wilcox
2025-10-25 5:13 ` Darrick J. Wong
` (2 more replies)
0 siblings, 3 replies; 68+ messages in thread
From: Matthew Wilcox @ 2025-10-25 4:45 UTC (permalink / raw)
To: libaokun
Cc: linux-ext4, tytso, adilger.kernel, jack, linux-kernel, kernel,
mcgrof, linux-fsdevel, linux-mm, yi.zhang, yangerkun,
chengzhihao1, libaokun1
On Sat, Oct 25, 2025 at 11:22:18AM +0800, libaokun@huaweicloud.com wrote:
> + while (1) {
> + folio = __filemap_get_folio(mapping, index, fgp_flags,
> + gfp & ~__GFP_NOFAIL);
> + if (!IS_ERR(folio) || !(gfp & __GFP_NOFAIL))
> + return folio;
> +
> + if (PTR_ERR(folio) != -ENOMEM && PTR_ERR(folio) != -EAGAIN)
> + return folio;
> +
> + memalloc_retry_wait(gfp);
> + }
No, absolutely not. We're not having open-coded GFP_NOFAIL semantics.
The right way forward is for ext4 to use iomap, not for buffer heads
to support large block sizes.
* Re: [PATCH 22/25] fs/buffer: prevent WARN_ON in __alloc_pages_slowpath() when BS > PS
2025-10-25 4:45 ` Matthew Wilcox
@ 2025-10-25 5:13 ` Darrick J. Wong
2025-10-25 6:32 ` Baokun Li
2025-10-25 6:34 ` Baokun Li
2 siblings, 0 replies; 68+ messages in thread
From: Darrick J. Wong @ 2025-10-25 5:13 UTC (permalink / raw)
To: Matthew Wilcox
Cc: libaokun, linux-ext4, tytso, adilger.kernel, jack, linux-kernel,
kernel, mcgrof, linux-fsdevel, linux-mm, yi.zhang, yangerkun,
chengzhihao1, libaokun1
On Sat, Oct 25, 2025 at 05:45:29AM +0100, Matthew Wilcox wrote:
> On Sat, Oct 25, 2025 at 11:22:18AM +0800, libaokun@huaweicloud.com wrote:
> > + while (1) {
> > + folio = __filemap_get_folio(mapping, index, fgp_flags,
> > + gfp & ~__GFP_NOFAIL);
> > + if (!IS_ERR(folio) || !(gfp & __GFP_NOFAIL))
> > + return folio;
> > +
> > + if (PTR_ERR(folio) != -ENOMEM && PTR_ERR(folio) != -EAGAIN)
> > + return folio;
> > +
> > + memalloc_retry_wait(gfp);
> > + }
>
> No, absolutely not. We're not having open-coded GFP_NOFAIL semantics.
> The right way forward is for ext4 to use iomap, not for buffer heads
> to support large block sizes.
Seconded.
--D
* Re: [PATCH 22/25] fs/buffer: prevent WARN_ON in __alloc_pages_slowpath() when BS > PS
2025-10-25 4:45 ` Matthew Wilcox
2025-10-25 5:13 ` Darrick J. Wong
@ 2025-10-25 6:32 ` Baokun Li
2025-10-25 7:01 ` Zhang Yi
` (2 more replies)
2025-10-25 6:34 ` Baokun Li
2 siblings, 3 replies; 68+ messages in thread
From: Baokun Li @ 2025-10-25 6:32 UTC (permalink / raw)
To: Matthew Wilcox, Darrick J. Wong
Cc: linux-ext4, tytso, adilger.kernel, jack, linux-kernel, kernel,
mcgrof, linux-fsdevel, linux-mm, yi.zhang, yangerkun,
chengzhihao1, libaokun1
On 2025-10-25 12:45, Matthew Wilcox wrote:
> On Sat, Oct 25, 2025 at 11:22:18AM +0800, libaokun@huaweicloud.com wrote:
>> + while (1) {
>> + folio = __filemap_get_folio(mapping, index, fgp_flags,
>> + gfp & ~__GFP_NOFAIL);
>> + if (!IS_ERR(folio) || !(gfp & __GFP_NOFAIL))
>> + return folio;
>> +
>> + if (PTR_ERR(folio) != -ENOMEM && PTR_ERR(folio) != -EAGAIN)
>> + return folio;
>> +
>> + memalloc_retry_wait(gfp);
>> + }
> No, absolutely not. We're not having open-coded GFP_NOFAIL semantics.
> The right way forward is for ext4 to use iomap, not for buffer heads
> to support large block sizes.
ext4 only calls getblk_unmovable or __getblk when reading critical
metadata. Both of these functions set __GFP_NOFAIL to ensure that
metadata reads do not fail due to memory pressure.
Both functions eventually call grow_dev_folio(), which is why we
handle the __GFP_NOFAIL logic there. xfs_buf_alloc_backing_mem()
has similar logic, but XFS manages its own metadata, allowing it
to use vmalloc for memory allocation.
ext4 Direct I/O has already switched to iomap, and patches to
support iomap for Buffered I/O are currently under iteration.
But as far as I know, iomap does not support metadata, and XFS does not
use iomap to read metadata either.
Am I missing something here?
--
With Best Regards,
Baokun Li
* Re: [PATCH 22/25] fs/buffer: prevent WARN_ON in __alloc_pages_slowpath() when BS > PS
2025-10-25 6:32 ` Baokun Li
@ 2025-10-25 7:01 ` Zhang Yi
2025-10-25 17:56 ` Matthew Wilcox
2025-10-30 21:25 ` Matthew Wilcox
2 siblings, 0 replies; 68+ messages in thread
From: Zhang Yi @ 2025-10-25 7:01 UTC (permalink / raw)
To: Baokun Li, Matthew Wilcox, Darrick J. Wong
Cc: linux-ext4, tytso, adilger.kernel, jack, linux-kernel, kernel,
mcgrof, linux-fsdevel, linux-mm, yangerkun, chengzhihao1,
libaokun1
On 10/25/2025 2:32 PM, Baokun Li wrote:
> On 2025-10-25 12:45, Matthew Wilcox wrote:
>> On Sat, Oct 25, 2025 at 11:22:18AM +0800, libaokun@huaweicloud.com wrote:
>>> + while (1) {
>>> + folio = __filemap_get_folio(mapping, index, fgp_flags,
>>> + gfp & ~__GFP_NOFAIL);
>>> + if (!IS_ERR(folio) || !(gfp & __GFP_NOFAIL))
>>> + return folio;
>>> +
>>> + if (PTR_ERR(folio) != -ENOMEM && PTR_ERR(folio) != -EAGAIN)
>>> + return folio;
>>> +
>>> + memalloc_retry_wait(gfp);
>>> + }
>> No, absolutely not. We're not having open-coded GFP_NOFAIL semantics.
>> The right way forward is for ext4 to use iomap, not for buffer heads
>> to support large block sizes.
>
> ext4 only calls getblk_unmovable or __getblk when reading critical
> metadata. Both of these functions set __GFP_NOFAIL to ensure that
> metadata reads do not fail due to memory pressure.
>
> Both functions eventually call grow_dev_folio(), which is why we
> handle the __GFP_NOFAIL logic there. xfs_buf_alloc_backing_mem()
> has similar logic, but XFS manages its own metadata, allowing it
> to use vmalloc for memory allocation.
>
> ext4 Direct I/O has already switched to iomap, and patches to
> support iomap for Buffered I/O are currently under iteration.
>
> But as far as I know, iomap does not support metadata, and XFS does not
> use iomap to read metadata either.
>
> Am I missing something here?
>
AFAIK, the only alternative is for ext4 to also manage metadata on its
own, like XFS does, instead of using the bdev buffer head interface.
However, this is currently difficult to achieve.
Best Regards,
Yi.
* Re: [PATCH 22/25] fs/buffer: prevent WARN_ON in __alloc_pages_slowpath() when BS > PS
2025-10-25 6:32 ` Baokun Li
2025-10-25 7:01 ` Zhang Yi
@ 2025-10-25 17:56 ` Matthew Wilcox
2025-10-27 2:57 ` Baokun Li
2025-10-27 7:40 ` Christoph Hellwig
2025-10-30 21:25 ` Matthew Wilcox
2 siblings, 2 replies; 68+ messages in thread
From: Matthew Wilcox @ 2025-10-25 17:56 UTC (permalink / raw)
To: Baokun Li
Cc: Darrick J. Wong, linux-ext4, tytso, adilger.kernel, jack,
linux-kernel, kernel, mcgrof, linux-fsdevel, linux-mm, yi.zhang,
yangerkun, chengzhihao1, libaokun1, catherine.hoang
On Sat, Oct 25, 2025 at 02:32:45PM +0800, Baokun Li wrote:
> On 2025-10-25 12:45, Matthew Wilcox wrote:
> > On Sat, Oct 25, 2025 at 11:22:18AM +0800, libaokun@huaweicloud.com wrote:
> >> + while (1) {
> >> + folio = __filemap_get_folio(mapping, index, fgp_flags,
> >> + gfp & ~__GFP_NOFAIL);
> >> + if (!IS_ERR(folio) || !(gfp & __GFP_NOFAIL))
> >> + return folio;
> >> +
> >> + if (PTR_ERR(folio) != -ENOMEM && PTR_ERR(folio) != -EAGAIN)
> >> + return folio;
> >> +
> >> + memalloc_retry_wait(gfp);
> >> + }
> > No, absolutely not. We're not having open-coded GFP_NOFAIL semantics.
> > The right way forward is for ext4 to use iomap, not for buffer heads
> > to support large block sizes.
>
> ext4 only calls getblk_unmovable or __getblk when reading critical
> metadata. Both of these functions set __GFP_NOFAIL to ensure that
> metadata reads do not fail due to memory pressure.
If filesystems actually require __GFP_NOFAIL for high-order allocations,
then this is a new requirement that needs to be communicated to the MM
developers, not hacked around in filesystems (or the VFS). And that
communication needs to be a separate thread with a clear subject line
to attract the right attention, not buried in patch 26/28.
For what it's worth, I think you have a good case. This really is
a new requirement (bs>PS) and in this scenario, we should be able to
reclaim page cache memory of the appropriate order to satisfy the NOFAIL
requirement. There will be concerns that other users will now be able to
use it without warning, but I think eventually this use case will prevail.
> Both functions eventually call grow_dev_folio(), which is why we
> handle the __GFP_NOFAIL logic there. xfs_buf_alloc_backing_mem()
> has similar logic, but XFS manages its own metadata, allowing it
> to use vmalloc for memory allocation.
The other possibility is that we switch ext4 away from the buffer cache
entirely. This is a big job! I know Catherine has been working on
a generic replacement for the buffer cache, but I'm not sure if it's
ready yet.
* Re: [PATCH 22/25] fs/buffer: prevent WARN_ON in __alloc_pages_slowpath() when BS > PS
2025-10-25 17:56 ` Matthew Wilcox
@ 2025-10-27 2:57 ` Baokun Li
2025-10-27 7:40 ` Christoph Hellwig
1 sibling, 0 replies; 68+ messages in thread
From: Baokun Li @ 2025-10-27 2:57 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Darrick J. Wong, linux-ext4, tytso, adilger.kernel, jack,
linux-kernel, kernel, mcgrof, linux-fsdevel, linux-mm, yi.zhang,
yangerkun, chengzhihao1, catherine.hoang, Baokun Li,
Linus Torvalds
On 2025-10-26 01:56, Matthew Wilcox wrote:
> On Sat, Oct 25, 2025 at 02:32:45PM +0800, Baokun Li wrote:
>> On 2025-10-25 12:45, Matthew Wilcox wrote:
>>> On Sat, Oct 25, 2025 at 11:22:18AM +0800, libaokun@huaweicloud.com wrote:
>>>> + while (1) {
>>>> + folio = __filemap_get_folio(mapping, index, fgp_flags,
>>>> + gfp & ~__GFP_NOFAIL);
>>>> + if (!IS_ERR(folio) || !(gfp & __GFP_NOFAIL))
>>>> + return folio;
>>>> +
>>>> + if (PTR_ERR(folio) != -ENOMEM && PTR_ERR(folio) != -EAGAIN)
>>>> + return folio;
>>>> +
>>>> + memalloc_retry_wait(gfp);
>>>> + }
>>> No, absolutely not. We're not having open-coded GFP_NOFAIL semantics.
>>> The right way forward is for ext4 to use iomap, not for buffer heads
>>> to support large block sizes.
>> ext4 only calls getblk_unmovable or __getblk when reading critical
>> metadata. Both of these functions set __GFP_NOFAIL to ensure that
>> metadata reads do not fail due to memory pressure.
> If filesystems actually require __GFP_NOFAIL for high-order allocations,
> then this is a new requirement that needs to be communicated to the MM
> developers, not hacked around in filesystems (or the VFS). And that
> communication needs to be a separate thread with a clear subject line
> to attract the right attention, not buried in patch 26/28.
EXT4 is not the first filesystem to support LBS. I believe other
filesystems that already support LBS, even if they manage their own
metadata, have similar requirements. A filesystem cannot afford to become
read-only, shut down, or enter an inconsistent state due to memory
allocation failures in critical paths. Large folios have been around for
some time, and the fact that this warning still exists shows that the
problem is not trivial to solve.
Therefore, following the approach of filesystems that already support LBS,
such as XFS and the soon-to-be-removed bcachefs, I avoid adding
__GFP_NOFAIL for large allocations and instead retry internally to prevent
failures.
I do not intend to hide this issue in Patch 22/25. I cc’d linux-mm@kvack.org
precisely to invite memory management experts to share their thoughts on
the current situation.
Here is my limited understanding of the history of __GFP_NOFAIL:
Originally, in commit 4923abf9f1a4 ("Don't warn about order-1 allocations
with __GFP_NOFAIL"), Linus Torvalds raised the warning order from 0 to 1,
and commented,
"Maybe we should remove this warning entirely."
We had considered removing this warning, but then saw the discussion below.
Previously we used WARN_ON_ONCE_GFP, which meant the warning could be
suppressed with __GFP_NOWARN. But with the introduction of large folios,
memory allocation and reclaim have become much more challenging.
__GFP_NOFAIL can still fail, and many callers do not check the return
value, leading to potential NULL pointer dereferences.
Linus also noted that __GFP_NOFAIL is heavily abused, and even said in [1]:
“Honestly, I'm perfectly fine with just removing that stupid useless flag
entirely.”
"Because the blame should go *there*, and it should not even remotely look
like "oh, the MM code failed". No. The caller was garbage."
[1]:
https://lore.kernel.org/linux-mm/CAHk-=wgv2-=Bm16Gtn5XHWj9J6xiqriV56yamU+iG07YrN28SQ@mail.gmail.com/
From this, my understanding is that handling or retrying large allocation
failures in the caller is the direction going forward.
As for why retries are done in the VFS, there are two reasons: first, both
ext4 and jbd2 read metadata through blkdev, so a unified change is simpler.
Second, retrying here allows other buffer-head-based filesystems to support
LBS more easily.
For now, until large memory allocation and reclaim are properly handled,
this approach serves as a practical workaround.
> For what it's worth, I think you have a good case. This really is
> a new requirement (bs>PS) and in this scenario, we should be able to
> reclaim page cache memory of the appropriate order to satisfy the NOFAIL
> requirement. There will be concerns that other users will now be able to
> use it without warning, but I think eventually this use case will prevail.
Yeah, it would be best if the memory subsystem could add a flag like
__GFP_LBS to suppress these warnings and guide allocation and reclaim to
perform optimizations suited for this scenario.
>> Both functions eventually call grow_dev_folio(), which is why we
>> handle the __GFP_NOFAIL logic there. xfs_buf_alloc_backing_mem()
>> has similar logic, but XFS manages its own metadata, allowing it
>> to use vmalloc for memory allocation.
> The other possibility is that we switch ext4 away from the buffer cache
> entirely. This is a big job! I know Catherine has been working on
> a generic replacement for the buffer cache, but I'm not sure if it's
> ready yet.
>
The key issue is not whether ext4 uses buffer heads; even using vmalloc
with __GFP_NOFAIL for large allocations faces the same problem.
As Linus also mentioned in the link[1] above:
"It has then expanded and is now a problem. The cases using GFP_NOFAIL
for things like vmalloc() - which is by definition not a small
allocation - should be just removed as outright bugs."
Thanks,
Baokun
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH 22/25] fs/buffer: prevent WARN_ON in __alloc_pages_slowpath() when BS > PS
2025-10-25 17:56 ` Matthew Wilcox
2025-10-27 2:57 ` Baokun Li
@ 2025-10-27 7:40 ` Christoph Hellwig
1 sibling, 0 replies; 68+ messages in thread
From: Christoph Hellwig @ 2025-10-27 7:40 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Baokun Li, Darrick J. Wong, linux-ext4, tytso, adilger.kernel,
jack, linux-kernel, kernel, mcgrof, linux-fsdevel, linux-mm,
yi.zhang, yangerkun, chengzhihao1, libaokun1, catherine.hoang
On Sat, Oct 25, 2025 at 06:56:57PM +0100, Matthew Wilcox wrote:
> If filesystems actually require __GFP_NOFAIL for high-order allocations,
> then this is a new requirement that needs to be communicated to the MM
> developers, not hacked around in filesystems (or the VFS). And that
> communication needs to be a separate thread with a clear subject line
> to attract the right attention, not buried in patch 26/28.
It's not really new. XFS has had this basically since day 1, but with
Linus having a religious aversion to __GFP_NOFAIL, most folks
have given up on trying to improve it, as it just ends up in shouting
matches on political grounds. XFS simply keeps its own fallback
in xfs_buf_alloc_backing_mem, which has survived the various rounds of
refactoring since XFS was merged. Given the weird behavior in some
of the memory allocators, where GFP_NOFAIL is simply ignored for
too-large allocations, that seems like by far the sanest option in the
current Linux environment, unfortunately.
* Re: [PATCH 22/25] fs/buffer: prevent WARN_ON in __alloc_pages_slowpath() when BS > PS
2025-10-25 6:32 ` Baokun Li
2025-10-25 7:01 ` Zhang Yi
2025-10-25 17:56 ` Matthew Wilcox
@ 2025-10-30 21:25 ` Matthew Wilcox
2025-10-31 1:47 ` Zhang Yi
2025-10-31 1:55 ` Baokun Li
2 siblings, 2 replies; 68+ messages in thread
From: Matthew Wilcox @ 2025-10-30 21:25 UTC (permalink / raw)
To: Baokun Li
Cc: Darrick J. Wong, linux-ext4, tytso, adilger.kernel, jack,
linux-kernel, kernel, mcgrof, linux-fsdevel, linux-mm, yi.zhang,
yangerkun, chengzhihao1, libaokun1
On Sat, Oct 25, 2025 at 02:32:45PM +0800, Baokun Li wrote:
> On 2025-10-25 12:45, Matthew Wilcox wrote:
> > No, absolutely not. We're not having open-coded GFP_NOFAIL semantics.
> > The right way forward is for ext4 to use iomap, not for buffer heads
> > to support large block sizes.
>
> ext4 only calls getblk_unmovable or __getblk when reading critical
> metadata. Both of these functions set __GFP_NOFAIL to ensure that
> metadata reads do not fail due to memory pressure.
>
> Both functions eventually call grow_dev_folio(), which is why we
> handle the __GFP_NOFAIL logic there. xfs_buf_alloc_backing_mem()
> has similar logic, but XFS manages its own metadata, allowing it
> to use vmalloc for memory allocation.
In today's ext4 call, we discussed various options:
1. Change folios to be potentially fragmented. This change would be
ridiculously large and nobody thinks this is a good idea. Included here
for completeness.
2. Separate the buffer cache from the page cache again. They were
unified about 25 years ago, and this also feels like a very big job.
3. Duplicate the buffer cache into ext4/jbd2, remove the functionality
not needed and make _this_ version of the buffer cache allocate
its own memory instead of aliasing into the page cache. More feasible
than 1 or 2; still quite a big job.
4. Pick up Catherine's work and make ext4/jbd2 use it. Seems to be
about an equivalent amount of work to option 3.
5. Make __GFP_NOFAIL work for allocations up to 64KiB (we decided this was
probably the practical limit of sector sizes that people actually want).
In terms of programming, it's a one-line change. But we need to sell
this change to the MM people. I think it's doable because if we have
a filesystem with 64KiB sectors, there will be many clean folios in the
pagecache which are 64KiB or larger.
So, we liked option 5 best.
* Re: [PATCH 22/25] fs/buffer: prevent WARN_ON in __alloc_pages_slowpath() when BS > PS
2025-10-30 21:25 ` Matthew Wilcox
@ 2025-10-31 1:47 ` Zhang Yi
2025-10-31 1:55 ` Baokun Li
1 sibling, 0 replies; 68+ messages in thread
From: Zhang Yi @ 2025-10-31 1:47 UTC (permalink / raw)
To: Matthew Wilcox, Baokun Li
Cc: Darrick J. Wong, linux-ext4, tytso, adilger.kernel, jack,
linux-kernel, kernel, mcgrof, linux-fsdevel, linux-mm, yangerkun,
chengzhihao1, libaokun1
Hi!
On 10/31/2025 5:25 AM, Matthew Wilcox wrote:
> On Sat, Oct 25, 2025 at 02:32:45PM +0800, Baokun Li wrote:
>> On 2025-10-25 12:45, Matthew Wilcox wrote:
>>> No, absolutely not. We're not having open-coded GFP_NOFAIL semantics.
>>> The right way forward is for ext4 to use iomap, not for buffer heads
>>> to support large block sizes.
>>
>> ext4 only calls getblk_unmovable or __getblk when reading critical
>> metadata. Both of these functions set __GFP_NOFAIL to ensure that
>> metadata reads do not fail due to memory pressure.
>>
>> Both functions eventually call grow_dev_folio(), which is why we
>> handle the __GFP_NOFAIL logic there. xfs_buf_alloc_backing_mem()
>> has similar logic, but XFS manages its own metadata, allowing it
>> to use vmalloc for memory allocation.
>
> In today's ext4 call, we discussed various options:
>
> 1. Change folios to be potentially fragmented. This change would be
> ridiculously large and nobody thinks this is a good idea. Included here
> for completeness.
>
> 2. Separate the buffer cache from the page cache again. They were
> unified about 25 years ago, and this also feels like a very big job.
>
> 3. Duplicate the buffer cache into ext4/jbd2, remove the functionality
> not needed and make _this_ version of the buffer cache allocate
> its own memory instead of aliasing into the page cache. More feasible
> than 1 or 2; still quite a big job.
>
> 4. Pick up Catherine's work and make ext4/jbd2 use it. Seems to be
> about an equivalent amount of work to option 3.
>
Regarding these two proposals, would you consider them for the long
term? Besides the currently discussed case, they offer additional
benefits, such as making ext4's metadata management more flexible and
secure, as well as enabling more robust error handling.
Thanks,
Yi.
> 5. Make __GFP_NOFAIL work for allocations up to 64KiB (we decided this was
> probably the practical limit of sector sizes that people actually want).
> In terms of programming, it's a one-line change. But we need to sell
> this change to the MM people. I think it's doable because if we have
> a filesystem with 64KiB sectors, there will be many clean folios in the
> pagecache which are 64KiB or larger.
>
> So, we liked option 5 best.
>
* Re: [PATCH 22/25] fs/buffer: prevent WARN_ON in __alloc_pages_slowpath() when BS > PS
2025-10-30 21:25 ` Matthew Wilcox
2025-10-31 1:47 ` Zhang Yi
@ 2025-10-31 1:55 ` Baokun Li
1 sibling, 0 replies; 68+ messages in thread
From: Baokun Li @ 2025-10-31 1:55 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Darrick J. Wong, linux-ext4, tytso, adilger.kernel, jack,
linux-kernel, kernel, mcgrof, linux-fsdevel, linux-mm, yi.zhang,
yangerkun, chengzhihao1, Baokun Li, Baokun Li
On 2025-10-31 05:25, Matthew Wilcox wrote:
> On Sat, Oct 25, 2025 at 02:32:45PM +0800, Baokun Li wrote:
>> On 2025-10-25 12:45, Matthew Wilcox wrote:
>>> No, absolutely not. We're not having open-coded GFP_NOFAIL semantics.
>>> The right way forward is for ext4 to use iomap, not for buffer heads
>>> to support large block sizes.
>> ext4 only calls getblk_unmovable or __getblk when reading critical
>> metadata. Both of these functions set __GFP_NOFAIL to ensure that
>> metadata reads do not fail due to memory pressure.
>>
>> Both functions eventually call grow_dev_folio(), which is why we
>> handle the __GFP_NOFAIL logic there. xfs_buf_alloc_backing_mem()
>> has similar logic, but XFS manages its own metadata, allowing it
>> to use vmalloc for memory allocation.
> In today's ext4 call, we discussed various options:
>
> 1. Change folios to be potentially fragmented. This change would be
> ridiculously large and nobody thinks this is a good idea. Included here
> for completeness.
>
> 2. Separate the buffer cache from the page cache again. They were
> unified about 25 years ago, and this also feels like a very big job.
>
> 3. Duplicate the buffer cache into ext4/jbd2, remove the functionality
> not needed and make _this_ version of the buffer cache allocate
> its own memory instead of aliasing into the page cache. More feasible
> than 1 or 2; still quite a big job.
>
> 4. Pick up Catherine's work and make ext4/jbd2 use it. Seems to be
> about an equivalent amount of work to option 3.
>
> 5. Make __GFP_NOFAIL work for allocations up to 64KiB (we decided this was
> probably the practical limit of sector sizes that people actually want).
> In terms of programming, it's a one-line change. But we need to sell
> this change to the MM people. I think it's doable because if we have
> a filesystem with 64KiB sectors, there will be many clean folios in the
> pagecache which are 64KiB or larger.
>
> So, we liked option 5 best.
>
Thank you for your suggestions!
Yes, options 1 and 2 don’t seem very feasible, and options 3 and 4 would
involve a significant amount of work. Option 5 is indeed the simplest and
most general solution at this point, and it makes a lot of sense.
I will send a separate RFC patch to the MM list to gather feedback from the
MM people. If this approach is accepted, we can drop patches 22 and 23 from
the current series.
Cheers,
Baokun
* Re: [PATCH 01/25] ext4: remove page offset calculation in ext4_block_zero_page_range()
2025-10-25 3:21 ` [PATCH 01/25] ext4: remove page offset calculation in ext4_block_zero_page_range() libaokun
@ 2025-11-03 7:41 ` Jan Kara
0 siblings, 0 replies; 68+ messages in thread
From: Jan Kara @ 2025-11-03 7:41 UTC (permalink / raw)
To: libaokun
Cc: linux-ext4, tytso, adilger.kernel, jack, linux-kernel, kernel,
mcgrof, linux-fsdevel, linux-mm, yi.zhang, yangerkun,
chengzhihao1, libaokun1
On Sat 25-10-25 11:21:57, libaokun@huaweicloud.com wrote:
> From: Zhihao Cheng <chengzhihao1@huawei.com>
>
> For bs <= ps scenarios, calculating the offset within the block is
> sufficient. For bs > ps, an initial page offset calculation can lead to
> incorrect behavior. Thus this redundant calculation has been removed.
>
> Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
> Signed-off-by: Baokun Li <libaokun1@huawei.com>
> Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Looks good. Feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
> ---
> fs/ext4/inode.c | 3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index e99306a8f47c..0742039c53a7 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -4157,9 +4157,8 @@ static int ext4_block_zero_page_range(handle_t *handle,
> struct address_space *mapping, loff_t from, loff_t length)
> {
> struct inode *inode = mapping->host;
> - unsigned offset = from & (PAGE_SIZE-1);
> unsigned blocksize = inode->i_sb->s_blocksize;
> - unsigned max = blocksize - (offset & (blocksize - 1));
> + unsigned int max = blocksize - (from & (blocksize - 1));
>
> /*
> * correct length if it does not fall between
> --
> 2.46.1
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH 02/25] ext4: remove page offset calculation in ext4_block_truncate_page()
2025-10-25 3:21 ` [PATCH 02/25] ext4: remove page offset calculation in ext4_block_truncate_page() libaokun
@ 2025-11-03 7:42 ` Jan Kara
0 siblings, 0 replies; 68+ messages in thread
From: Jan Kara @ 2025-11-03 7:42 UTC (permalink / raw)
To: libaokun
Cc: linux-ext4, tytso, adilger.kernel, jack, linux-kernel, kernel,
mcgrof, linux-fsdevel, linux-mm, yi.zhang, yangerkun,
chengzhihao1, libaokun1
On Sat 25-10-25 11:21:58, libaokun@huaweicloud.com wrote:
> From: Baokun Li <libaokun1@huawei.com>
>
> For bs <= ps scenarios, calculating the offset within the block is
> sufficient. For bs > ps, an initial page offset calculation can lead to
> incorrect behavior. Thus this redundant calculation has been removed.
>
> Signed-off-by: Baokun Li <libaokun1@huawei.com>
> Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Looks good. Feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
> ---
> fs/ext4/inode.c | 5 ++---
> 1 file changed, 2 insertions(+), 3 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 0742039c53a7..4c04af7e51c9 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -4183,7 +4183,6 @@ static int ext4_block_zero_page_range(handle_t *handle,
> static int ext4_block_truncate_page(handle_t *handle,
> struct address_space *mapping, loff_t from)
> {
> - unsigned offset = from & (PAGE_SIZE-1);
> unsigned length;
> unsigned blocksize;
> struct inode *inode = mapping->host;
> @@ -4192,8 +4191,8 @@ static int ext4_block_truncate_page(handle_t *handle,
> if (IS_ENCRYPTED(inode) && !fscrypt_has_encryption_key(inode))
> return 0;
>
> - blocksize = inode->i_sb->s_blocksize;
> - length = blocksize - (offset & (blocksize - 1));
> + blocksize = i_blocksize(inode);
> + length = blocksize - (from & (blocksize - 1));
>
> return ext4_block_zero_page_range(handle, mapping, from, length);
> }
> --
> 2.46.1
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH 03/25] ext4: remove PAGE_SIZE checks for rec_len conversion
2025-10-25 3:21 ` [PATCH 03/25] ext4: remove PAGE_SIZE checks for rec_len conversion libaokun
@ 2025-11-03 7:43 ` Jan Kara
0 siblings, 0 replies; 68+ messages in thread
From: Jan Kara @ 2025-11-03 7:43 UTC (permalink / raw)
To: libaokun
Cc: linux-ext4, tytso, adilger.kernel, jack, linux-kernel, kernel,
mcgrof, linux-fsdevel, linux-mm, yi.zhang, yangerkun,
chengzhihao1, libaokun1
On Sat 25-10-25 11:21:59, libaokun@huaweicloud.com wrote:
> From: Baokun Li <libaokun1@huawei.com>
>
> Previously, ext4_rec_len_(to|from)_disk only performed complex rec_len
> conversions when PAGE_SIZE >= 65536 to reduce complexity.
>
> However, we are soon to support file system block sizes greater than
> page size, which makes these conditional checks unnecessary. Thus, these
> checks are now removed.
>
> Signed-off-by: Baokun Li <libaokun1@huawei.com>
> Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Looks good. Feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
> ---
> fs/ext4/ext4.h | 12 ------------
> 1 file changed, 12 deletions(-)
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 24c414605b08..93c2bf4d125a 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -2464,28 +2464,19 @@ static inline unsigned int ext4_dir_rec_len(__u8 name_len,
> return (rec_len & ~EXT4_DIR_ROUND);
> }
>
> -/*
> - * If we ever get support for fs block sizes > page_size, we'll need
> - * to remove the #if statements in the next two functions...
> - */
> static inline unsigned int
> ext4_rec_len_from_disk(__le16 dlen, unsigned blocksize)
> {
> unsigned len = le16_to_cpu(dlen);
>
> -#if (PAGE_SIZE >= 65536)
> if (len == EXT4_MAX_REC_LEN || len == 0)
> return blocksize;
> return (len & 65532) | ((len & 3) << 16);
> -#else
> - return len;
> -#endif
> }
>
> static inline __le16 ext4_rec_len_to_disk(unsigned len, unsigned blocksize)
> {
> BUG_ON((len > blocksize) || (blocksize > (1 << 18)) || (len & 3));
> -#if (PAGE_SIZE >= 65536)
> if (len < 65536)
> return cpu_to_le16(len);
> if (len == blocksize) {
> @@ -2495,9 +2486,6 @@ static inline __le16 ext4_rec_len_to_disk(unsigned len, unsigned blocksize)
> return cpu_to_le16(0);
> }
> return cpu_to_le16((len & 65532) | ((len >> 16) & 3));
> -#else
> - return cpu_to_le16(len);
> -#endif
> }
>
> /*
> --
> 2.46.1
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH 04/25] ext4: make ext4_punch_hole() support large block size
2025-10-25 3:22 ` [PATCH 04/25] ext4: make ext4_punch_hole() support large block size libaokun
@ 2025-11-03 8:05 ` Jan Kara
2025-11-04 6:55 ` Baokun Li
0 siblings, 1 reply; 68+ messages in thread
From: Jan Kara @ 2025-11-03 8:05 UTC (permalink / raw)
To: libaokun
Cc: linux-ext4, tytso, adilger.kernel, jack, linux-kernel, kernel,
mcgrof, linux-fsdevel, linux-mm, yi.zhang, yangerkun,
chengzhihao1, libaokun1
On Sat 25-10-25 11:22:00, libaokun@huaweicloud.com wrote:
> From: Baokun Li <libaokun1@huawei.com>
>
> Since the block size may be greater than the page size, when a hole
> extends beyond i_size, we need to align the hole's end upwards to the
> larger of PAGE_SIZE and blocksize.
>
> This is to prevent the issues seen in commit 2be4751b21ae ("ext4: fix
> 2nd xfstests 127 punch hole failure") from reappearing after BS > PS
> is supported.
>
> Signed-off-by: Baokun Li <libaokun1@huawei.com>
> Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
When going for bs > ps support, I'm very suspicious of any code that keeps
using PAGE_SIZE, because it doesn't make much sense anymore. Usually that
should be the appropriate folio size or something similar. For example,
in this case, if we indeed rely on freeing some buffers, then with a 4k
block size in an order-2 folio things would already be broken.
As far as I can tell, truncate_inode_pages_range() already handles partial
folio invalidation fine, so I think we should just use blocksize in the
rounding (to avoid pointless tail block zeroing) and be done with it.
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 4c04af7e51c9..a63513a3db53 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -4401,7 +4401,8 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
> * the page that contains i_size.
> */
> if (end > inode->i_size)
BTW I think here we should have >= (not your fault but we can fix it when
changing the code).
> - end = round_up(inode->i_size, PAGE_SIZE);
> + end = round_up(inode->i_size,
> + umax(PAGE_SIZE, sb->s_blocksize));
> if (end > max_end)
> end = max_end;
> length = end - offset;
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH 05/25] ext4: enable DIOREAD_NOLOCK by default for BS > PS as well
2025-10-25 3:22 ` [PATCH 05/25] ext4: enable DIOREAD_NOLOCK by default for BS > PS as well libaokun
@ 2025-11-03 8:06 ` Jan Kara
0 siblings, 0 replies; 68+ messages in thread
From: Jan Kara @ 2025-11-03 8:06 UTC (permalink / raw)
To: libaokun
Cc: linux-ext4, tytso, adilger.kernel, jack, linux-kernel, kernel,
mcgrof, linux-fsdevel, linux-mm, yi.zhang, yangerkun,
chengzhihao1, libaokun1
On Sat 25-10-25 11:22:01, libaokun@huaweicloud.com wrote:
> From: Baokun Li <libaokun1@huawei.com>
>
> The dioread_nolock related processes already support large folio, so
> dioread_nolock is enabled by default regardless of whether the blocksize
> is less than, equal to, or greater than PAGE_SIZE.
>
> Signed-off-by: Baokun Li <libaokun1@huawei.com>
> Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Looks good. Feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
> ---
> fs/ext4/super.c | 3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 894529f9b0cc..aa5aee4d1b63 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -4383,8 +4383,7 @@ static void ext4_set_def_opts(struct super_block *sb,
> ((def_mount_opts & EXT4_DEFM_NODELALLOC) == 0))
> set_opt(sb, DELALLOC);
>
> - if (sb->s_blocksize <= PAGE_SIZE)
> - set_opt(sb, DIOREAD_NOLOCK);
> + set_opt(sb, DIOREAD_NOLOCK);
> }
>
> static int ext4_handle_clustersize(struct super_block *sb)
> --
> 2.46.1
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH 07/25] ext4: support large block size in ext4_calculate_overhead()
2025-10-25 3:22 ` [PATCH 07/25] ext4: support large block size in ext4_calculate_overhead() libaokun
@ 2025-11-03 8:14 ` Jan Kara
2025-11-03 14:37 ` Baokun Li
0 siblings, 1 reply; 68+ messages in thread
From: Jan Kara @ 2025-11-03 8:14 UTC (permalink / raw)
To: libaokun
Cc: linux-ext4, tytso, adilger.kernel, jack, linux-kernel, kernel,
mcgrof, linux-fsdevel, linux-mm, yi.zhang, yangerkun,
chengzhihao1, libaokun1
On Sat 25-10-25 11:22:03, libaokun@huaweicloud.com wrote:
> From: Baokun Li <libaokun1@huawei.com>
>
> ext4_calculate_overhead() used a single page for its bitmap buffer, which
> worked fine when PAGE_SIZE >= block size. However, with block size greater
> than page size (BS > PS) support, the bitmap can exceed a single page.
>
> To address this, we now use __get_free_pages() to allocate multiple pages,
> sized to the block size, to properly support BS > PS.
>
> Signed-off-by: Baokun Li <libaokun1@huawei.com>
> Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
One comment below:
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index d353e25a5b92..7338c708ea1d 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -4182,7 +4182,8 @@ int ext4_calculate_overhead(struct super_block *sb)
> unsigned int j_blocks, j_inum = le32_to_cpu(es->s_journal_inum);
> ext4_group_t i, ngroups = ext4_get_groups_count(sb);
> ext4_fsblk_t overhead = 0;
> - char *buf = (char *) get_zeroed_page(GFP_NOFS);
> + gfp_t gfp = GFP_NOFS | __GFP_ZERO;
> + char *buf = (char *)__get_free_pages(gfp, sbi->s_min_folio_order);
I think this should be using kvmalloc(). There's no reason to require
physically contiguous pages for this...
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH 06/25] ext4: introduce s_min_folio_order for future BS > PS support
2025-10-25 3:22 ` [PATCH 06/25] ext4: introduce s_min_folio_order for future BS > PS support libaokun
@ 2025-11-03 8:19 ` Jan Kara
0 siblings, 0 replies; 68+ messages in thread
From: Jan Kara @ 2025-11-03 8:19 UTC (permalink / raw)
To: libaokun
Cc: linux-ext4, tytso, adilger.kernel, jack, linux-kernel, kernel,
mcgrof, linux-fsdevel, linux-mm, yi.zhang, yangerkun,
chengzhihao1, libaokun1
On Sat 25-10-25 11:22:02, libaokun@huaweicloud.com wrote:
> From: Baokun Li <libaokun1@huawei.com>
>
> This commit introduces the s_min_folio_order field to the ext4_sb_info
> structure. This field will store the minimum folio order required by the
> current filesystem, laying groundwork for future support of block sizes
> greater than PAGE_SIZE.
>
> Signed-off-by: Baokun Li <libaokun1@huawei.com>
> Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Looks good. Feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
> ---
> fs/ext4/ext4.h | 3 +++
> fs/ext4/inode.c | 3 ++-
> fs/ext4/super.c | 10 +++++-----
> 3 files changed, 10 insertions(+), 6 deletions(-)
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 93c2bf4d125a..bca6c3709673 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1677,6 +1677,9 @@ struct ext4_sb_info {
> /* record the last minlen when FITRIM is called. */
> unsigned long s_last_trim_minblks;
>
> + /* minimum folio order of a page cache allocation */
> + unsigned int s_min_folio_order;
> +
> /* Precomputed FS UUID checksum for seeding other checksums */
> __u32 s_csum_seed;
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index a63513a3db53..889761ed51dd 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -5174,7 +5174,8 @@ void ext4_set_inode_mapping_order(struct inode *inode)
> if (!ext4_should_enable_large_folio(inode))
> return;
>
> - mapping_set_folio_order_range(inode->i_mapping, 0,
> + mapping_set_folio_order_range(inode->i_mapping,
> + EXT4_SB(inode->i_sb)->s_min_folio_order,
> EXT4_MAX_PAGECACHE_ORDER(inode));
> }
>
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index aa5aee4d1b63..d353e25a5b92 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -5100,11 +5100,8 @@ static int ext4_load_super(struct super_block *sb, ext4_fsblk_t *lsb,
> * If the default block size is not the same as the real block size,
> * we need to reload it.
> */
> - if (sb->s_blocksize == blocksize) {
> - *lsb = logical_sb_block;
> - sbi->s_sbh = bh;
> - return 0;
> - }
> + if (sb->s_blocksize == blocksize)
> + goto success;
>
> /*
> * bh must be released before kill_bdev(), otherwise
> @@ -5135,6 +5132,9 @@ static int ext4_load_super(struct super_block *sb, ext4_fsblk_t *lsb,
> ext4_msg(sb, KERN_ERR, "Magic mismatch, very weird!");
> goto out;
> }
> +
> +success:
> + sbi->s_min_folio_order = get_order(blocksize);
> *lsb = logical_sb_block;
> sbi->s_sbh = bh;
> return 0;
> --
> 2.46.1
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH 09/25] ext4: add EXT4_LBLK_TO_B macro for logical block to bytes conversion
2025-10-25 3:22 ` [PATCH 09/25] ext4: add EXT4_LBLK_TO_B macro for logical block to bytes conversion libaokun
@ 2025-11-03 8:21 ` Jan Kara
0 siblings, 0 replies; 68+ messages in thread
From: Jan Kara @ 2025-11-03 8:21 UTC (permalink / raw)
To: libaokun
Cc: linux-ext4, tytso, adilger.kernel, jack, linux-kernel, kernel,
mcgrof, linux-fsdevel, linux-mm, yi.zhang, yangerkun,
chengzhihao1, libaokun1
On Sat 25-10-25 11:22:05, libaokun@huaweicloud.com wrote:
> From: Baokun Li <libaokun1@huawei.com>
>
> No functional changes.
>
> Signed-off-by: Baokun Li <libaokun1@huawei.com>
> Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Looks good. Feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
> ---
> fs/ext4/ext4.h | 1 +
> fs/ext4/extents.c | 2 +-
> fs/ext4/inode.c | 20 +++++++++-----------
> fs/ext4/namei.c | 8 +++-----
> fs/ext4/verity.c | 2 +-
> 5 files changed, 15 insertions(+), 18 deletions(-)
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index bca6c3709673..9b236f620b3a 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -367,6 +367,7 @@ struct ext4_io_submit {
> blkbits))
> #define EXT4_B_TO_LBLK(inode, offset) \
> (round_up((offset), i_blocksize(inode)) >> (inode)->i_blkbits)
> +#define EXT4_LBLK_TO_B(inode, lblk) ((loff_t)(lblk) << (inode)->i_blkbits)
>
> /* Translate a block number to a cluster number */
> #define EXT4_B2C(sbi, blk) ((blk) >> (sbi)->s_cluster_bits)
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index ca5499e9412b..da640c88b863 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -4562,7 +4562,7 @@ static int ext4_alloc_file_blocks(struct file *file, ext4_lblk_t offset,
> * allow a full retry cycle for any remaining allocations
> */
> retries = 0;
> - epos = (loff_t)(map.m_lblk + ret) << blkbits;
> + epos = EXT4_LBLK_TO_B(inode, map.m_lblk + ret);
> inode_set_ctime_current(inode);
> if (new_size) {
> if (epos > new_size)
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 889761ed51dd..73c1da90b604 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -825,9 +825,8 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
> !(flags & EXT4_GET_BLOCKS_ZERO) &&
> !ext4_is_quota_file(inode) &&
> ext4_should_order_data(inode)) {
> - loff_t start_byte =
> - (loff_t)map->m_lblk << inode->i_blkbits;
> - loff_t length = (loff_t)map->m_len << inode->i_blkbits;
> + loff_t start_byte = EXT4_LBLK_TO_B(inode, map->m_lblk);
> + loff_t length = EXT4_LBLK_TO_B(inode, map->m_len);
>
> if (flags & EXT4_GET_BLOCKS_IO_SUBMIT)
> ret = ext4_jbd2_inode_add_wait(handle, inode,
> @@ -2225,7 +2224,6 @@ static int mpage_process_folio(struct mpage_da_data *mpd, struct folio *folio,
> ext4_lblk_t lblk = *m_lblk;
> ext4_fsblk_t pblock = *m_pblk;
> int err = 0;
> - int blkbits = mpd->inode->i_blkbits;
> ssize_t io_end_size = 0;
> struct ext4_io_end_vec *io_end_vec = ext4_last_io_end_vec(io_end);
>
> @@ -2251,7 +2249,8 @@ static int mpage_process_folio(struct mpage_da_data *mpd, struct folio *folio,
> err = PTR_ERR(io_end_vec);
> goto out;
> }
> - io_end_vec->offset = (loff_t)mpd->map.m_lblk << blkbits;
> + io_end_vec->offset = EXT4_LBLK_TO_B(mpd->inode,
> + mpd->map.m_lblk);
> }
> *map_bh = true;
> goto out;
> @@ -2261,7 +2260,7 @@ static int mpage_process_folio(struct mpage_da_data *mpd, struct folio *folio,
> bh->b_blocknr = pblock++;
> }
> clear_buffer_unwritten(bh);
> - io_end_size += (1 << blkbits);
> + io_end_size += i_blocksize(mpd->inode);
> } while (lblk++, (bh = bh->b_this_page) != head);
>
> io_end_vec->size += io_end_size;
> @@ -2463,7 +2462,7 @@ static int mpage_map_and_submit_extent(handle_t *handle,
> io_end_vec = ext4_alloc_io_end_vec(io_end);
> if (IS_ERR(io_end_vec))
> return PTR_ERR(io_end_vec);
> - io_end_vec->offset = ((loff_t)map->m_lblk) << inode->i_blkbits;
> + io_end_vec->offset = EXT4_LBLK_TO_B(inode, map->m_lblk);
> do {
> err = mpage_map_one_extent(handle, mpd);
> if (err < 0) {
> @@ -3503,8 +3502,8 @@ static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
> iomap->dax_dev = EXT4_SB(inode->i_sb)->s_daxdev;
> else
> iomap->bdev = inode->i_sb->s_bdev;
> - iomap->offset = (u64) map->m_lblk << blkbits;
> - iomap->length = (u64) map->m_len << blkbits;
> + iomap->offset = EXT4_LBLK_TO_B(inode, map->m_lblk);
> + iomap->length = EXT4_LBLK_TO_B(inode, map->m_len);
>
> if ((map->m_flags & EXT4_MAP_MAPPED) &&
> !ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
> @@ -3678,7 +3677,6 @@ static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
> unsigned int flags)
> {
> handle_t *handle;
> - u8 blkbits = inode->i_blkbits;
> int ret, dio_credits, m_flags = 0, retries = 0;
> bool force_commit = false;
>
> @@ -3737,7 +3735,7 @@ static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
> * i_disksize out to i_size. This could be beyond where direct I/O is
> * happening and thus expose allocated blocks to direct I/O reads.
> */
> - else if (((loff_t)map->m_lblk << blkbits) >= i_size_read(inode))
> + else if (EXT4_LBLK_TO_B(inode, map->m_lblk) >= i_size_read(inode))
> m_flags = EXT4_GET_BLOCKS_CREATE;
> else if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
> m_flags = EXT4_GET_BLOCKS_IO_CREATE_EXT;
> diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
> index 2cd36f59c9e3..78cefb7cc9a7 100644
> --- a/fs/ext4/namei.c
> +++ b/fs/ext4/namei.c
> @@ -1076,7 +1076,7 @@ static int htree_dirblock_to_tree(struct file *dir_file,
> for (; de < top; de = ext4_next_entry(de, dir->i_sb->s_blocksize)) {
> if (ext4_check_dir_entry(dir, NULL, de, bh,
> bh->b_data, bh->b_size,
> - (block<<EXT4_BLOCK_SIZE_BITS(dir->i_sb))
> + EXT4_LBLK_TO_B(dir, block)
> + ((char *)de - bh->b_data))) {
> /* silently ignore the rest of the block */
> break;
> @@ -1630,7 +1630,7 @@ static struct buffer_head *__ext4_find_entry(struct inode *dir,
> }
> set_buffer_verified(bh);
> i = search_dirblock(bh, dir, fname,
> - block << EXT4_BLOCK_SIZE_BITS(sb), res_dir);
> + EXT4_LBLK_TO_B(dir, block), res_dir);
> if (i == 1) {
> EXT4_I(dir)->i_dir_start_lookup = block;
> ret = bh;
> @@ -1710,7 +1710,6 @@ static struct buffer_head * ext4_dx_find_entry(struct inode *dir,
> struct ext4_filename *fname,
> struct ext4_dir_entry_2 **res_dir)
> {
> - struct super_block * sb = dir->i_sb;
> struct dx_frame frames[EXT4_HTREE_LEVEL], *frame;
> struct buffer_head *bh;
> ext4_lblk_t block;
> @@ -1729,8 +1728,7 @@ static struct buffer_head * ext4_dx_find_entry(struct inode *dir,
> goto errout;
>
> retval = search_dirblock(bh, dir, fname,
> - block << EXT4_BLOCK_SIZE_BITS(sb),
> - res_dir);
> + EXT4_LBLK_TO_B(dir, block), res_dir);
> if (retval == 1)
> goto success;
> brelse(bh);
> diff --git a/fs/ext4/verity.c b/fs/ext4/verity.c
> index d9203228ce97..7a980a8059bd 100644
> --- a/fs/ext4/verity.c
> +++ b/fs/ext4/verity.c
> @@ -302,7 +302,7 @@ static int ext4_get_verity_descriptor_location(struct inode *inode,
>
> end_lblk = le32_to_cpu(last_extent->ee_block) +
> ext4_ext_get_actual_len(last_extent);
> - desc_size_pos = (u64)end_lblk << inode->i_blkbits;
> + desc_size_pos = EXT4_LBLK_TO_B(inode, end_lblk);
> ext4_free_ext_path(path);
>
> if (desc_size_pos < sizeof(desc_size_disk))
> --
> 2.46.1
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH 10/25] ext4: add EXT4_LBLK_TO_P and EXT4_P_TO_LBLK for block/page conversion
2025-10-25 3:22 ` [PATCH 10/25] ext4: add EXT4_LBLK_TO_P and EXT4_P_TO_LBLK for block/page conversion libaokun
@ 2025-11-03 8:26 ` Jan Kara
2025-11-03 14:45 ` Baokun Li
0 siblings, 1 reply; 68+ messages in thread
From: Jan Kara @ 2025-11-03 8:26 UTC (permalink / raw)
To: libaokun
Cc: linux-ext4, tytso, adilger.kernel, jack, linux-kernel, kernel,
mcgrof, linux-fsdevel, linux-mm, yi.zhang, yangerkun,
chengzhihao1, libaokun1
On Sat 25-10-25 11:22:06, libaokun@huaweicloud.com wrote:
> From: Baokun Li <libaokun1@huawei.com>
>
> As BS > PS support is coming, all block number to page index (and
> vice-versa) conversions must now go via bytes. Added EXT4_LBLK_TO_P()
> and EXT4_P_TO_LBLK() macros to simplify these conversions and handle
> both BS <= PS and BS > PS scenarios cleanly.
>
> Signed-off-by: Baokun Li <libaokun1@huawei.com>
> Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
'P' in the macro names seems too terse :). I'd probably use PG to give a
better hint this is about pages? So EXT4_LBLK_TO_PG() and
EXT4_PG_TO_LBLK(). BTW, patch 8 could already use these macros...
Honza
> ---
> fs/ext4/ext4.h | 6 ++++++
> 1 file changed, 6 insertions(+)
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 9b236f620b3a..8223ed29b343 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -369,6 +369,12 @@ struct ext4_io_submit {
> (round_up((offset), i_blocksize(inode)) >> (inode)->i_blkbits)
> #define EXT4_LBLK_TO_B(inode, lblk) ((loff_t)(lblk) << (inode)->i_blkbits)
>
> +/* Translate a block number to a page index */
> +#define EXT4_LBLK_TO_P(inode, lblk) (EXT4_LBLK_TO_B((inode), (lblk)) >> \
> + PAGE_SHIFT)
> +/* Translate a page index to a block number */
> +#define EXT4_P_TO_LBLK(inode, pnum) (((loff_t)(pnum) << PAGE_SHIFT) >> \
> + (inode)->i_blkbits)
> /* Translate a block number to a cluster number */
> #define EXT4_B2C(sbi, blk) ((blk) >> (sbi)->s_cluster_bits)
> /* Translate a cluster number to a block number */
> --
> 2.46.1
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH 08/25] ext4: support large block size in ext4_readdir()
2025-10-25 3:22 ` [PATCH 08/25] ext4: support large block size in ext4_readdir() libaokun
@ 2025-11-03 8:27 ` Jan Kara
0 siblings, 0 replies; 68+ messages in thread
From: Jan Kara @ 2025-11-03 8:27 UTC (permalink / raw)
To: libaokun
Cc: linux-ext4, tytso, adilger.kernel, jack, linux-kernel, kernel,
mcgrof, linux-fsdevel, linux-mm, yi.zhang, yangerkun,
chengzhihao1, libaokun1
On Sat 25-10-25 11:22:04, libaokun@huaweicloud.com wrote:
> From: Baokun Li <libaokun1@huawei.com>
>
> In ext4_readdir(), page_cache_sync_readahead() is used to readahead mapped
> physical blocks. With LBS support, this can lead to a negative right shift.
>
> To fix this, the page index is now calculated by first converting the
> physical block number (pblk) to a file position (pos) before converting
> it to a page index. Also, the correct number of pages to readahead is now
> passed.
>
> Signed-off-by: Baokun Li <libaokun1@huawei.com>
> Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Looks good. Feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
> diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c
> index d4164c507a90..256fe2c1d4c1 100644
> --- a/fs/ext4/dir.c
> +++ b/fs/ext4/dir.c
> @@ -192,13 +192,13 @@ static int ext4_readdir(struct file *file, struct dir_context *ctx)
> continue;
> }
> if (err > 0) {
> - pgoff_t index = map.m_pblk >>
> - (PAGE_SHIFT - inode->i_blkbits);
> + pgoff_t index = map.m_pblk << inode->i_blkbits >>
> + PAGE_SHIFT;
> if (!ra_has_index(&file->f_ra, index))
> page_cache_sync_readahead(
> sb->s_bdev->bd_mapping,
> - &file->f_ra, file,
> - index, 1);
> + &file->f_ra, file, index,
> + 1 << EXT4_SB(sb)->s_min_folio_order);
> file->f_ra.prev_pos = (loff_t)index << PAGE_SHIFT;
> bh = ext4_bread(NULL, inode, map.m_lblk, 0);
> if (IS_ERR(bh)) {
> --
> 2.46.1
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH 07/25] ext4: support large block size in ext4_calculate_overhead()
2025-11-03 8:14 ` Jan Kara
@ 2025-11-03 14:37 ` Baokun Li
0 siblings, 0 replies; 68+ messages in thread
From: Baokun Li @ 2025-11-03 14:37 UTC (permalink / raw)
To: Jan Kara
Cc: linux-ext4, tytso, adilger.kernel, linux-kernel, kernel, mcgrof,
linux-fsdevel, linux-mm, yi.zhang, yangerkun, chengzhihao1,
libaokun1, Baokun Li
On 2025-11-03 16:14, Jan Kara wrote:
> On Sat 25-10-25 11:22:03, libaokun@huaweicloud.com wrote:
>> From: Baokun Li <libaokun1@huawei.com>
>>
>> ext4_calculate_overhead() used a single page for its bitmap buffer, which
>> worked fine when PAGE_SIZE >= block size. However, with block size greater
>> than page size (BS > PS) support, the bitmap can exceed a single page.
>>
>> To address this, we now use __get_free_pages() to allocate multiple pages,
>> sized to the block size, to properly support BS > PS.
>>
>> Signed-off-by: Baokun Li <libaokun1@huawei.com>
>> Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
> One comment below:
>
>> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
>> index d353e25a5b92..7338c708ea1d 100644
>> --- a/fs/ext4/super.c
>> +++ b/fs/ext4/super.c
>> @@ -4182,7 +4182,8 @@ int ext4_calculate_overhead(struct super_block *sb)
>> unsigned int j_blocks, j_inum = le32_to_cpu(es->s_journal_inum);
>> ext4_group_t i, ngroups = ext4_get_groups_count(sb);
>> ext4_fsblk_t overhead = 0;
>> - char *buf = (char *) get_zeroed_page(GFP_NOFS);
>> + gfp_t gfp = GFP_NOFS | __GFP_ZERO;
>> + char *buf = (char *)__get_free_pages(gfp, sbi->s_min_folio_order);
> I think this should be using kvmalloc(). There's no reason to require
> physically contiguous pages for this...
>
> Honza
Makes sense, I will use kvmalloc() in the next version.
Thanks,
Baokun
* Re: [PATCH 10/25] ext4: add EXT4_LBLK_TO_P and EXT4_P_TO_LBLK for block/page conversion
2025-11-03 8:26 ` Jan Kara
@ 2025-11-03 14:45 ` Baokun Li
2025-11-05 8:27 ` Jan Kara
0 siblings, 1 reply; 68+ messages in thread
From: Baokun Li @ 2025-11-03 14:45 UTC (permalink / raw)
To: Jan Kara
Cc: linux-ext4, tytso, adilger.kernel, linux-kernel, kernel, mcgrof,
linux-fsdevel, linux-mm, yi.zhang, yangerkun, chengzhihao1,
libaokun1, Baokun Li
On 2025-11-03 16:26, Jan Kara wrote:
> On Sat 25-10-25 11:22:06, libaokun@huaweicloud.com wrote:
>> From: Baokun Li <libaokun1@huawei.com>
>>
>> As BS > PS support is coming, all block number to page index (and
>> vice-versa) conversions must now go via bytes. Added EXT4_LBLK_TO_P()
>> and EXT4_P_TO_LBLK() macros to simplify these conversions and handle
>> both BS <= PS and BS > PS scenarios cleanly.
>>
>> Signed-off-by: Baokun Li <libaokun1@huawei.com>
>> Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
> 'P' in the macro names seems too terse :). I'd probably use PG to give a
> better hint this is about pages? So EXT4_LBLK_TO_PG() and
> EXT4_PG_TO_LBLK().
Indeed, EXT4_LBLK_TO_PG reads much clearer. I will use it in v2.
> BTW, patch 8 could already use these macros...
>
> Honza
In Patch 8, the conversion is for a physical block number, which has a
different variable type than lblk. Since this is the only place in the code
where that conversion is needed, I made a dedicated change there.
Thank you for your review!
Cheers,
Baokun
>> ---
>> fs/ext4/ext4.h | 6 ++++++
>> 1 file changed, 6 insertions(+)
>>
>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
>> index 9b236f620b3a..8223ed29b343 100644
>> --- a/fs/ext4/ext4.h
>> +++ b/fs/ext4/ext4.h
>> @@ -369,6 +369,12 @@ struct ext4_io_submit {
>> (round_up((offset), i_blocksize(inode)) >> (inode)->i_blkbits)
>> #define EXT4_LBLK_TO_B(inode, lblk) ((loff_t)(lblk) << (inode)->i_blkbits)
>>
>> +/* Translate a block number to a page index */
>> +#define EXT4_LBLK_TO_P(inode, lblk) (EXT4_LBLK_TO_B((inode), (lblk)) >> \
>> + PAGE_SHIFT)
>> +/* Translate a page index to a block number */
>> +#define EXT4_P_TO_LBLK(inode, pnum) (((loff_t)(pnum) << PAGE_SHIFT) >> \
>> + (inode)->i_blkbits)
>> /* Translate a block number to a cluster number */
>> #define EXT4_B2C(sbi, blk) ((blk) >> (sbi)->s_cluster_bits)
>> /* Translate a cluster number to a block number */
>> --
>> 2.46.1
>>
* Re: [PATCH 04/25] ext4: make ext4_punch_hole() support large block size
2025-11-03 8:05 ` Jan Kara
@ 2025-11-04 6:55 ` Baokun Li
0 siblings, 0 replies; 68+ messages in thread
From: Baokun Li @ 2025-11-04 6:55 UTC (permalink / raw)
To: Jan Kara
Cc: linux-ext4, tytso, adilger.kernel, linux-kernel, kernel, mcgrof,
linux-fsdevel, linux-mm, yi.zhang, yangerkun, chengzhihao1,
libaokun1, Baokun Li
On 2025-11-03 16:05, Jan Kara wrote:
> On Sat 25-10-25 11:22:00, libaokun@huaweicloud.com wrote:
>> From: Baokun Li <libaokun1@huawei.com>
>>
>> Since the block size may be greater than the page size, when a hole
>> extends beyond i_size, we need to align the hole's end upwards to the
>> larger of PAGE_SIZE and blocksize.
>>
>> This is to prevent the issues seen in commit 2be4751b21ae ("ext4: fix
>> 2nd xfstests 127 punch hole failure") from reappearing after BS > PS
>> is supported.
>>
>> Signed-off-by: Baokun Li <libaokun1@huawei.com>
>> Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
> When going for bs > ps support, I'm very suspicious of any code that keeps
> using PAGE_SIZE, because it doesn't make much sense anymore. Usually that
> should be the appropriate folio size or similar. For example, in this case,
> if we indeed rely on freeing some buffers, then with a 4k block size in an
> order-2 folio things would already be broken.
>
> As far as I can tell, truncate_inode_pages_range() already handles partial
> folio invalidation fine, so I think we should just use blocksize in the
> rounding (to save pointless tail block zeroing) and be done with it.
Right. I missed that truncate_inode_pages_range already handles this.
I will directly use the blocksize in v2.
Thank you for your review!
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index 4c04af7e51c9..a63513a3db53 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -4401,7 +4401,8 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
>> * the page that contains i_size.
>> */
>> if (end > inode->i_size)
> BTW I think here we should have >= (not your fault but we can fix it when
> changing the code).
Yes, I hadn't noticed this bug. I will fix it in v2 as well.
Cheers,
Baokun
>
>> - end = round_up(inode->i_size, PAGE_SIZE);
>> + end = round_up(inode->i_size,
>> + umax(PAGE_SIZE, sb->s_blocksize));
>> if (end > max_end)
>> end = max_end;
>> length = end - offset;
> Honza
* Re: [PATCH 10/25] ext4: add EXT4_LBLK_TO_P and EXT4_P_TO_LBLK for block/page conversion
2025-11-03 14:45 ` Baokun Li
@ 2025-11-05 8:27 ` Jan Kara
0 siblings, 0 replies; 68+ messages in thread
From: Jan Kara @ 2025-11-05 8:27 UTC (permalink / raw)
To: Baokun Li
Cc: Jan Kara, linux-ext4, tytso, adilger.kernel, linux-kernel, kernel,
mcgrof, linux-fsdevel, linux-mm, yi.zhang, yangerkun,
chengzhihao1, Baokun Li
On Mon 03-11-25 22:45:45, Baokun Li wrote:
> On 2025-11-03 16:26, Jan Kara wrote:
> > On Sat 25-10-25 11:22:06, libaokun@huaweicloud.com wrote:
> > BTW, patch 8 could already use these macros...
> >
> > Honza
>
> In Patch 8, the conversion is for a physical block number, which has a
> different variable type than lblk. Since this is the only location where
> this conversion is used in the code, I made a dedicated modification there.
Ok, fair.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH 11/25] ext4: support large block size in ext4_mb_load_buddy_gfp()
2025-10-25 3:22 ` [PATCH 11/25] ext4: support large block size in ext4_mb_load_buddy_gfp() libaokun
@ 2025-11-05 8:46 ` Jan Kara
0 siblings, 0 replies; 68+ messages in thread
From: Jan Kara @ 2025-11-05 8:46 UTC (permalink / raw)
To: libaokun
Cc: linux-ext4, tytso, adilger.kernel, jack, linux-kernel, kernel,
mcgrof, linux-fsdevel, linux-mm, yi.zhang, yangerkun,
chengzhihao1, libaokun1
On Sat 25-10-25 11:22:07, libaokun@huaweicloud.com wrote:
> From: Baokun Li <libaokun1@huawei.com>
>
> Currently, ext4_mb_load_buddy_gfp() uses blocks_per_page to calculate the
> folio index and offset. However, when blocksize is larger than PAGE_SIZE,
> blocks_per_page becomes zero, leading to a potential division-by-zero bug.
>
> To support BS > PS, use bytes to compute folio index and offset within
> folio to get rid of blocks_per_page.
>
> Also, if buddy and bitmap land in the same folio, we get that folio’s ref
> instead of looking it up again before updating the buddy.
>
> Signed-off-by: Baokun Li <libaokun1@huawei.com>
> Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Looks good! Feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH 12/25] ext4: support large block size in ext4_mb_get_buddy_page_lock()
2025-10-25 3:22 ` [PATCH 12/25] ext4: support large block size in ext4_mb_get_buddy_page_lock() libaokun
@ 2025-11-05 9:13 ` Jan Kara
2025-11-05 9:44 ` Baokun Li
0 siblings, 1 reply; 68+ messages in thread
From: Jan Kara @ 2025-11-05 9:13 UTC (permalink / raw)
To: libaokun
Cc: linux-ext4, tytso, adilger.kernel, jack, linux-kernel, kernel,
mcgrof, linux-fsdevel, linux-mm, yi.zhang, yangerkun,
chengzhihao1, libaokun1
On Sat 25-10-25 11:22:08, libaokun@huaweicloud.com wrote:
> From: Baokun Li <libaokun1@huawei.com>
>
> Currently, ext4_mb_get_buddy_page_lock() uses blocks_per_page to calculate
> folio index and offset. However, when blocksize is larger than PAGE_SIZE,
> blocks_per_page becomes zero, leading to a potential division-by-zero bug.
>
> To support BS > PS, use bytes to compute folio index and offset within
> folio to get rid of blocks_per_page.
>
> Also, since ext4_mb_get_buddy_page_lock() already fully supports folio,
> rename it to ext4_mb_get_buddy_folio_lock().
>
> Signed-off-by: Baokun Li <libaokun1@huawei.com>
> Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Looks good, just two typo fixes below. Feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index 3494c6fe5bfb..d42d768a705a 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -1510,50 +1510,52 @@ static int ext4_mb_init_cache(struct folio *folio, char *incore, gfp_t gfp)
> }
>
Let's fix some typos when updating the comment:
> /*
> - * Lock the buddy and bitmap pages. This make sure other parallel init_group
> - * on the same buddy page doesn't happen whild holding the buddy page lock.
> - * Return locked buddy and bitmap pages on e4b struct. If buddy and bitmap
> - * are on the same page e4b->bd_buddy_folio is NULL and return value is 0.
> + * Lock the buddy and bitmap folios. This make sure other parallel init_group
^^^ makes
> + * on the same buddy folio doesn't happen whild holding the buddy folio lock.
^^ while
> + * Return locked buddy and bitmap folios on e4b struct. If buddy and bitmap
> + * are on the same folio e4b->bd_buddy_folio is NULL and return value is 0.
> */
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH 13/25] ext4: support large block size in ext4_mb_init_cache()
2025-10-25 3:22 ` [PATCH 13/25] ext4: support large block size in ext4_mb_init_cache() libaokun
@ 2025-11-05 9:18 ` Jan Kara
0 siblings, 0 replies; 68+ messages in thread
From: Jan Kara @ 2025-11-05 9:18 UTC (permalink / raw)
To: libaokun
Cc: linux-ext4, tytso, adilger.kernel, jack, linux-kernel, kernel,
mcgrof, linux-fsdevel, linux-mm, yi.zhang, yangerkun,
chengzhihao1, libaokun1
On Sat 25-10-25 11:22:09, libaokun@huaweicloud.com wrote:
> From: Baokun Li <libaokun1@huawei.com>
>
> Currently, ext4_mb_init_cache() uses blocks_per_page to calculate the
> folio index and offset. However, when blocksize is larger than PAGE_SIZE,
> blocks_per_page becomes zero, leading to a potential division-by-zero bug.
>
> Since we now have the folio, we know its exact size. This allows us to
> convert {blocks, groups}_per_page to {blocks, groups}_per_folio, thus
> supporting block sizes greater than page size.
>
> Signed-off-by: Baokun Li <libaokun1@huawei.com>
> Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Looks good. Feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
> ---
> fs/ext4/mballoc.c | 44 ++++++++++++++++++++------------------------
> 1 file changed, 20 insertions(+), 24 deletions(-)
>
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index d42d768a705a..31f4c7d65eb4 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -1329,26 +1329,25 @@ static void mb_regenerate_buddy(struct ext4_buddy *e4b)
> * block bitmap and buddy information. The information are
> * stored in the inode as
> *
> - * { page }
> + * { folio }
> * [ group 0 bitmap][ group 0 buddy] [group 1][ group 1]...
> *
> *
> * one block each for bitmap and buddy information.
> - * So for each group we take up 2 blocks. A page can
> - * contain blocks_per_page (PAGE_SIZE / blocksize) blocks.
> - * So it can have information regarding groups_per_page which
> - * is blocks_per_page/2
> + * So for each group we take up 2 blocks. A folio can
> + * contain blocks_per_folio (folio_size / blocksize) blocks.
> + * So it can have information regarding groups_per_folio which
> + * is blocks_per_folio/2
> *
> * Locking note: This routine takes the block group lock of all groups
> - * for this page; do not hold this lock when calling this routine!
> + * for this folio; do not hold this lock when calling this routine!
> */
> -
> static int ext4_mb_init_cache(struct folio *folio, char *incore, gfp_t gfp)
> {
> ext4_group_t ngroups;
> unsigned int blocksize;
> - int blocks_per_page;
> - int groups_per_page;
> + int blocks_per_folio;
> + int groups_per_folio;
> int err = 0;
> int i;
> ext4_group_t first_group, group;
> @@ -1365,27 +1364,24 @@ static int ext4_mb_init_cache(struct folio *folio, char *incore, gfp_t gfp)
> sb = inode->i_sb;
> ngroups = ext4_get_groups_count(sb);
> blocksize = i_blocksize(inode);
> - blocks_per_page = PAGE_SIZE / blocksize;
> + blocks_per_folio = folio_size(folio) / blocksize;
> + WARN_ON_ONCE(!blocks_per_folio);
> + groups_per_folio = DIV_ROUND_UP(blocks_per_folio, 2);
>
> mb_debug(sb, "init folio %lu\n", folio->index);
>
> - groups_per_page = blocks_per_page >> 1;
> - if (groups_per_page == 0)
> - groups_per_page = 1;
> -
> /* allocate buffer_heads to read bitmaps */
> - if (groups_per_page > 1) {
> - i = sizeof(struct buffer_head *) * groups_per_page;
> + if (groups_per_folio > 1) {
> + i = sizeof(struct buffer_head *) * groups_per_folio;
> bh = kzalloc(i, gfp);
> if (bh == NULL)
> return -ENOMEM;
> } else
> bh = &bhs;
>
> - first_group = folio->index * blocks_per_page / 2;
> -
> /* read all groups the folio covers into the cache */
> - for (i = 0, group = first_group; i < groups_per_page; i++, group++) {
> + first_group = EXT4_P_TO_LBLK(inode, folio->index) / 2;
> + for (i = 0, group = first_group; i < groups_per_folio; i++, group++) {
> if (group >= ngroups)
> break;
>
> @@ -1393,7 +1389,7 @@ static int ext4_mb_init_cache(struct folio *folio, char *incore, gfp_t gfp)
> if (!grinfo)
> continue;
> /*
> - * If page is uptodate then we came here after online resize
> + * If folio is uptodate then we came here after online resize
> * which added some new uninitialized group info structs, so
> * we must skip all initialized uptodate buddies on the folio,
> * which may be currently in use by an allocating task.
> @@ -1413,7 +1409,7 @@ static int ext4_mb_init_cache(struct folio *folio, char *incore, gfp_t gfp)
> }
>
> /* wait for I/O completion */
> - for (i = 0, group = first_group; i < groups_per_page; i++, group++) {
> + for (i = 0, group = first_group; i < groups_per_folio; i++, group++) {
> int err2;
>
> if (!bh[i])
> @@ -1423,8 +1419,8 @@ static int ext4_mb_init_cache(struct folio *folio, char *incore, gfp_t gfp)
> err = err2;
> }
>
> - first_block = folio->index * blocks_per_page;
> - for (i = 0; i < blocks_per_page; i++) {
> + first_block = EXT4_P_TO_LBLK(inode, folio->index);
> + for (i = 0; i < blocks_per_folio; i++) {
> group = (first_block + i) >> 1;
> if (group >= ngroups)
> break;
> @@ -1501,7 +1497,7 @@ static int ext4_mb_init_cache(struct folio *folio, char *incore, gfp_t gfp)
>
> out:
> if (bh) {
> - for (i = 0; i < groups_per_page; i++)
> + for (i = 0; i < groups_per_folio; i++)
> brelse(bh[i]);
> if (bh != &bhs)
> kfree(bh);
> --
> 2.46.1
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH 14/25] ext4: prepare buddy cache inode for BS > PS with large folios
2025-10-25 3:22 ` [PATCH 14/25] ext4: prepare buddy cache inode for BS > PS with large folios libaokun
@ 2025-11-05 9:19 ` Jan Kara
0 siblings, 0 replies; 68+ messages in thread
From: Jan Kara @ 2025-11-05 9:19 UTC (permalink / raw)
To: libaokun
Cc: linux-ext4, tytso, adilger.kernel, jack, linux-kernel, kernel,
mcgrof, linux-fsdevel, linux-mm, yi.zhang, yangerkun,
chengzhihao1, libaokun1
On Sat 25-10-25 11:22:10, libaokun@huaweicloud.com wrote:
> From: Baokun Li <libaokun1@huawei.com>
>
> We use EXT4_BAD_INO for the buddy cache inode number. This inode is not
> accessed via __ext4_new_inode() or __ext4_iget(), meaning
> ext4_set_inode_mapping_order() is not called to set its folio order range.
>
> However, future block size greater than page size support requires this
> inode to support large folios, and the buddy cache code already handles
> BS > PS. Therefore, ext4_set_inode_mapping_order() is now explicitly
> called for this specific inode to set its folio order range.
>
> Signed-off-by: Baokun Li <libaokun1@huawei.com>
> Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Looks good. Feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
> ---
> fs/ext4/mballoc.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index 31f4c7d65eb4..155c43ff2bc2 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -3493,6 +3493,8 @@ static int ext4_mb_init_backend(struct super_block *sb)
> * this will avoid confusion if it ever shows up during debugging. */
> sbi->s_buddy_cache->i_ino = EXT4_BAD_INO;
> EXT4_I(sbi->s_buddy_cache)->i_disksize = 0;
> + ext4_set_inode_mapping_order(sbi->s_buddy_cache);
> +
> for (i = 0; i < ngroups; i++) {
> cond_resched();
> desc = ext4_get_group_desc(sb, i, NULL);
> --
> 2.46.1
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH 15/25] ext4: rename 'page' references to 'folio' in multi-block allocator
2025-10-25 3:22 ` [PATCH 15/25] ext4: rename 'page' references to 'folio' in multi-block allocator libaokun
@ 2025-11-05 9:21 ` Jan Kara
0 siblings, 0 replies; 68+ messages in thread
From: Jan Kara @ 2025-11-05 9:21 UTC (permalink / raw)
To: libaokun
Cc: linux-ext4, tytso, adilger.kernel, jack, linux-kernel, kernel,
mcgrof, linux-fsdevel, linux-mm, yi.zhang, yangerkun,
chengzhihao1, libaokun1
On Sat 25-10-25 11:22:11, libaokun@huaweicloud.com wrote:
> From: Zhihao Cheng <chengzhihao1@huawei.com>
>
> The ext4 multi-block allocator now fully supports folio objects. Update
> all variable names, function names, and comments to replace legacy 'page'
> terminology with 'folio', improving clarity and consistency.
>
> No functional changes.
>
> Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
> Signed-off-by: Baokun Li <libaokun1@huawei.com>
> Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Looks good. Feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
> ---
> fs/ext4/mballoc.c | 22 +++++++++++-----------
> 1 file changed, 11 insertions(+), 11 deletions(-)
>
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index 155c43ff2bc2..cf07d1067f5f 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -98,14 +98,14 @@
> * block bitmap and buddy information. The information are stored in the
> * inode as:
> *
> - * { page }
> + * { folio }
> * [ group 0 bitmap][ group 0 buddy] [group 1][ group 1]...
> *
> *
> * one block each for bitmap and buddy information. So for each group we
> - * take up 2 blocks. A page can contain blocks_per_page (PAGE_SIZE /
> - * blocksize) blocks. So it can have information regarding groups_per_page
> - * which is blocks_per_page/2
> + * take up 2 blocks. A folio can contain blocks_per_folio (folio_size /
> + * blocksize) blocks. So it can have information regarding groups_per_folio
> + * which is blocks_per_folio/2
> *
> * The buddy cache inode is not stored on disk. The inode is thrown
> * away when the filesystem is unmounted.
> @@ -1556,7 +1556,7 @@ static int ext4_mb_get_buddy_folio_lock(struct super_block *sb,
> return 0;
> }
>
> -static void ext4_mb_put_buddy_page_lock(struct ext4_buddy *e4b)
> +static void ext4_mb_put_buddy_folio_lock(struct ext4_buddy *e4b)
> {
> if (e4b->bd_bitmap_folio) {
> folio_unlock(e4b->bd_bitmap_folio);
> @@ -1570,7 +1570,7 @@ static void ext4_mb_put_buddy_page_lock(struct ext4_buddy *e4b)
>
> /*
> * Locking note: This routine calls ext4_mb_init_cache(), which takes the
> - * block group lock of all groups for this page; do not hold the BG lock when
> + * block group lock of all groups for this folio; do not hold the BG lock when
> * calling this routine!
> */
> static noinline_for_stack
> @@ -1618,7 +1618,7 @@ int ext4_mb_init_group(struct super_block *sb, ext4_group_t group, gfp_t gfp)
> if (e4b.bd_buddy_folio == NULL) {
> /*
> * If both the bitmap and buddy are in
> - * the same page we don't need to force
> + * the same folio we don't need to force
> * init the buddy
> */
> ret = 0;
> @@ -1634,7 +1634,7 @@ int ext4_mb_init_group(struct super_block *sb, ext4_group_t group, gfp_t gfp)
> goto err;
> }
> err:
> - ext4_mb_put_buddy_page_lock(&e4b);
> + ext4_mb_put_buddy_folio_lock(&e4b);
> return ret;
> }
>
> @@ -2227,7 +2227,7 @@ static void ext4_mb_use_best_found(struct ext4_allocation_context *ac,
> ac->ac_buddy = ret >> 16;
>
> /*
> - * take the page reference. We want the page to be pinned
> + * take the folio reference. We want the folio to be pinned
> * so that we don't get a ext4_mb_init_cache_call for this
> * group until we update the bitmap. That would mean we
> * double allocate blocks. The reference is dropped
> @@ -2933,7 +2933,7 @@ static int ext4_mb_scan_group(struct ext4_allocation_context *ac,
> if (cr < CR_ANY_FREE && spin_is_locked(ext4_group_lock_ptr(sb, group)))
> return 0;
>
> - /* This now checks without needing the buddy page */
> + /* This now checks without needing the buddy folio */
> ret = ext4_mb_good_group_nolock(ac, group, cr);
> if (ret <= 0) {
> if (!ac->ac_first_err)
> @@ -4725,7 +4725,7 @@ static void ext4_discard_allocated_blocks(struct ext4_allocation_context *ac)
> "ext4: mb_load_buddy failed (%d)", err))
> /*
> * This should never happen since we pin the
> - * pages in the ext4_allocation_context so
> + * folios in the ext4_allocation_context so
> * ext4_mb_load_buddy() should never fail.
> */
> return;
> --
> 2.46.1
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH 16/25] ext4: support large block size in ext4_mpage_readpages()
2025-10-25 3:22 ` [PATCH 16/25] ext4: support large block size in ext4_mpage_readpages() libaokun
@ 2025-11-05 9:26 ` Jan Kara
0 siblings, 0 replies; 68+ messages in thread
From: Jan Kara @ 2025-11-05 9:26 UTC (permalink / raw)
To: libaokun
Cc: linux-ext4, tytso, adilger.kernel, jack, linux-kernel, kernel,
mcgrof, linux-fsdevel, linux-mm, yi.zhang, yangerkun,
chengzhihao1, libaokun1
On Sat 25-10-25 11:22:12, libaokun@huaweicloud.com wrote:
> From: Baokun Li <libaokun1@huawei.com>
>
> Use the EXT4_P_TO_LBLK() macro to convert folio indexes to blocks to avoid
> negative left shifts after supporting blocksize greater than PAGE_SIZE.
>
> Signed-off-by: Baokun Li <libaokun1@huawei.com>
> Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Looks good. Feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
> ---
> fs/ext4/readpage.c | 7 ++-----
> 1 file changed, 2 insertions(+), 5 deletions(-)
>
> diff --git a/fs/ext4/readpage.c b/fs/ext4/readpage.c
> index f329daf6e5c7..8c8ec9d60b90 100644
> --- a/fs/ext4/readpage.c
> +++ b/fs/ext4/readpage.c
> @@ -213,9 +213,7 @@ int ext4_mpage_readpages(struct inode *inode,
> {
> struct bio *bio = NULL;
> sector_t last_block_in_bio = 0;
> -
> const unsigned blkbits = inode->i_blkbits;
> - const unsigned blocks_per_page = PAGE_SIZE >> blkbits;
> const unsigned blocksize = 1 << blkbits;
> sector_t next_block;
> sector_t block_in_file;
> @@ -251,9 +249,8 @@ int ext4_mpage_readpages(struct inode *inode,
>
> blocks_per_folio = folio_size(folio) >> blkbits;
> first_hole = blocks_per_folio;
> - block_in_file = next_block =
> - (sector_t)folio->index << (PAGE_SHIFT - blkbits);
> - last_block = block_in_file + nr_pages * blocks_per_page;
> + block_in_file = next_block = EXT4_P_TO_LBLK(inode, folio->index);
> + last_block = EXT4_P_TO_LBLK(inode, folio->index + nr_pages);
> last_block_in_file = (ext4_readpage_limit(inode) +
> blocksize - 1) >> blkbits;
> if (last_block > last_block_in_file)
> --
> 2.46.1
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH 17/25] ext4: support large block size in ext4_block_write_begin()
2025-10-25 3:22 ` [PATCH 17/25] ext4: support large block size in ext4_block_write_begin() libaokun
@ 2025-11-05 9:28 ` Jan Kara
0 siblings, 0 replies; 68+ messages in thread
From: Jan Kara @ 2025-11-05 9:28 UTC (permalink / raw)
To: libaokun
Cc: linux-ext4, tytso, adilger.kernel, jack, linux-kernel, kernel,
mcgrof, linux-fsdevel, linux-mm, yi.zhang, yangerkun,
chengzhihao1, libaokun1
On Sat 25-10-25 11:22:13, libaokun@huaweicloud.com wrote:
> From: Baokun Li <libaokun1@huawei.com>
>
> Use the EXT4_P_TO_LBLK() macro to convert folio indexes to blocks to avoid
> negative left shifts after supporting blocksize greater than PAGE_SIZE.
>
> Signed-off-by: Baokun Li <libaokun1@huawei.com>
> Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Looks good. Feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
> ---
> fs/ext4/inode.c | 7 +++----
> 1 file changed, 3 insertions(+), 4 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 73c1da90b604..d97ce88d6e0a 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -1162,8 +1162,7 @@ int ext4_block_write_begin(handle_t *handle, struct folio *folio,
> unsigned block_start, block_end;
> sector_t block;
> int err = 0;
> - unsigned blocksize = inode->i_sb->s_blocksize;
> - unsigned bbits;
> + unsigned int blocksize = i_blocksize(inode);
> struct buffer_head *bh, *head, *wait[2];
> int nr_wait = 0;
> int i;
> @@ -1172,12 +1171,12 @@ int ext4_block_write_begin(handle_t *handle, struct folio *folio,
> BUG_ON(!folio_test_locked(folio));
> BUG_ON(to > folio_size(folio));
> BUG_ON(from > to);
> + WARN_ON_ONCE(blocksize > folio_size(folio));
>
> head = folio_buffers(folio);
> if (!head)
> head = create_empty_buffers(folio, blocksize, 0);
> - bbits = ilog2(blocksize);
> - block = (sector_t)folio->index << (PAGE_SHIFT - bbits);
> + block = EXT4_P_TO_LBLK(inode, folio->index);
>
> for (bh = head, block_start = 0; bh != head || !block_start;
> block++, block_start = block_end, bh = bh->b_this_page) {
> --
> 2.46.1
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
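The index/block conversions that patches 17-20 keep replacing can be modeled in userspace C. This is a hedged sketch, not the kernel macros themselves: EXT4_P_TO_LBLK()/EXT4_LBLK_TO_P() are introduced in patch 10 (not quoted here), so the byte-based formulation below is an assumption about their shape.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12  /* 4k pages, for illustration */

typedef uint64_t lblk_t;

/* Hypothetical model of the patch-10 helpers. The old code shifted by
 * (PAGE_SHIFT - blkbits), which goes negative once blkbits > PAGE_SHIFT.
 * Going through a byte offset keeps every shift non-negative. */
static lblk_t p_to_lblk(uint64_t index, unsigned int blkbits)
{
	return (index << PAGE_SHIFT) >> blkbits; /* page index -> first block */
}

static uint64_t lblk_to_p(lblk_t lblk, unsigned int blkbits)
{
	return (lblk << blkbits) >> PAGE_SHIFT;  /* block -> page index */
}
```

For blocksize <= PAGE_SIZE this reduces to the old `index << (PAGE_SHIFT - bbits)` result; for a 64k block on a 4k page it maps page index 16 to block 1 instead of shifting by -4.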
* Re: [PATCH 18/25] ext4: support large block size in mpage_map_and_submit_buffers()
2025-10-25 3:22 ` [PATCH 18/25] ext4: support large block size in mpage_map_and_submit_buffers() libaokun
@ 2025-11-05 9:30 ` Jan Kara
0 siblings, 0 replies; 68+ messages in thread
From: Jan Kara @ 2025-11-05 9:30 UTC (permalink / raw)
To: libaokun
Cc: linux-ext4, tytso, adilger.kernel, jack, linux-kernel, kernel,
mcgrof, linux-fsdevel, linux-mm, yi.zhang, yangerkun,
chengzhihao1, libaokun1
On Sat 25-10-25 11:22:14, libaokun@huaweicloud.com wrote:
> From: Baokun Li <libaokun1@huawei.com>
>
> Use the EXT4_P_TO_LBLK/EXT4_LBLK_TO_P macros to complete the conversion
> between folio indexes and blocks to avoid negative left/right shifts after
> supporting blocksize greater than PAGE_SIZE.
>
> Signed-off-by: Baokun Li <libaokun1@huawei.com>
> Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Looks good. Feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
> ---
> fs/ext4/inode.c | 7 +++----
> 1 file changed, 3 insertions(+), 4 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index d97ce88d6e0a..cbf04b473ae7 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -2289,15 +2289,14 @@ static int mpage_map_and_submit_buffers(struct mpage_da_data *mpd)
> struct folio_batch fbatch;
> unsigned nr, i;
> struct inode *inode = mpd->inode;
> - int bpp_bits = PAGE_SHIFT - inode->i_blkbits;
> pgoff_t start, end;
> ext4_lblk_t lblk;
> ext4_fsblk_t pblock;
> int err;
> bool map_bh = false;
>
> - start = mpd->map.m_lblk >> bpp_bits;
> - end = (mpd->map.m_lblk + mpd->map.m_len - 1) >> bpp_bits;
> + start = EXT4_LBLK_TO_P(inode, mpd->map.m_lblk);
> + end = EXT4_LBLK_TO_P(inode, mpd->map.m_lblk + mpd->map.m_len - 1);
> pblock = mpd->map.m_pblk;
>
> folio_batch_init(&fbatch);
> @@ -2308,7 +2307,7 @@ static int mpage_map_and_submit_buffers(struct mpage_da_data *mpd)
> for (i = 0; i < nr; i++) {
> struct folio *folio = fbatch.folios[i];
>
> - lblk = folio->index << bpp_bits;
> + lblk = EXT4_P_TO_LBLK(inode, folio->index);
> err = mpage_process_folio(mpd, folio, &lblk, &pblock,
> &map_bh);
> /*
> --
> 2.46.1
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH 19/25] ext4: support large block size in mpage_prepare_extent_to_map()
2025-10-25 3:22 ` [PATCH 19/25] ext4: support large block size in mpage_prepare_extent_to_map() libaokun
@ 2025-11-05 9:31 ` Jan Kara
0 siblings, 0 replies; 68+ messages in thread
From: Jan Kara @ 2025-11-05 9:31 UTC (permalink / raw)
To: libaokun
Cc: linux-ext4, tytso, adilger.kernel, jack, linux-kernel, kernel,
mcgrof, linux-fsdevel, linux-mm, yi.zhang, yangerkun,
chengzhihao1, libaokun1
On Sat 25-10-25 11:22:15, libaokun@huaweicloud.com wrote:
> From: Baokun Li <libaokun1@huawei.com>
>
> Use the EXT4_P_TO_LBLK/EXT4_LBLK_TO_P macros to complete the conversion
> between folio indexes and blocks to avoid negative left/right shifts after
> supporting blocksize greater than PAGE_SIZE.
>
> Signed-off-by: Baokun Li <libaokun1@huawei.com>
> Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Looks good. Feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
> ---
> fs/ext4/inode.c | 6 ++----
> 1 file changed, 2 insertions(+), 4 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index cbf04b473ae7..ce48cc6780a3 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -2610,7 +2610,6 @@ static int mpage_prepare_extent_to_map(struct mpage_da_data *mpd)
> pgoff_t end = mpd->end_pos >> PAGE_SHIFT;
> xa_mark_t tag;
> int i, err = 0;
> - int blkbits = mpd->inode->i_blkbits;
> ext4_lblk_t lblk;
> struct buffer_head *head;
> handle_t *handle = NULL;
> @@ -2649,7 +2648,7 @@ static int mpage_prepare_extent_to_map(struct mpage_da_data *mpd)
> */
> if (mpd->wbc->sync_mode == WB_SYNC_NONE &&
> mpd->wbc->nr_to_write <=
> - mpd->map.m_len >> (PAGE_SHIFT - blkbits))
> + EXT4_LBLK_TO_P(mpd->inode, mpd->map.m_len))
> goto out;
>
> /* If we can't merge this page, we are done. */
> @@ -2727,8 +2726,7 @@ static int mpage_prepare_extent_to_map(struct mpage_da_data *mpd)
> mpage_folio_done(mpd, folio);
> } else {
> /* Add all dirty buffers to mpd */
> - lblk = ((ext4_lblk_t)folio->index) <<
> - (PAGE_SHIFT - blkbits);
> + lblk = EXT4_P_TO_LBLK(mpd->inode, folio->index);
> head = folio_buffers(folio);
> err = mpage_process_page_bufs(mpd, head, head,
> lblk);
> --
> 2.46.1
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH 20/25] ext4: support large block size in __ext4_block_zero_page_range()
2025-10-25 3:22 ` [PATCH 20/25] ext4: support large block size in __ext4_block_zero_page_range() libaokun
@ 2025-11-05 9:33 ` Jan Kara
0 siblings, 0 replies; 68+ messages in thread
From: Jan Kara @ 2025-11-05 9:33 UTC (permalink / raw)
To: libaokun
Cc: linux-ext4, tytso, adilger.kernel, jack, linux-kernel, kernel,
mcgrof, linux-fsdevel, linux-mm, yi.zhang, yangerkun,
chengzhihao1, libaokun1
On Sat 25-10-25 11:22:16, libaokun@huaweicloud.com wrote:
> From: Zhihao Cheng <chengzhihao1@huawei.com>
>
> Use the EXT4_P_TO_LBLK() macro to convert folio indexes to blocks to avoid
> negative left shifts after supporting blocksize greater than PAGE_SIZE.
>
> Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
> Signed-off-by: Baokun Li <libaokun1@huawei.com>
> Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Looks good. Feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
> ---
> fs/ext4/inode.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index ce48cc6780a3..b3fa29923a1d 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -4066,7 +4066,7 @@ static int __ext4_block_zero_page_range(handle_t *handle,
>
> blocksize = inode->i_sb->s_blocksize;
>
> - iblock = folio->index << (PAGE_SHIFT - inode->i_sb->s_blocksize_bits);
> + iblock = EXT4_P_TO_LBLK(inode, folio->index);
>
> bh = folio_buffers(folio);
> if (!bh)
> --
> 2.46.1
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH 12/25] ext4: support large block size in ext4_mb_get_buddy_page_lock()
2025-11-05 9:13 ` Jan Kara
@ 2025-11-05 9:44 ` Baokun Li
0 siblings, 0 replies; 68+ messages in thread
From: Baokun Li @ 2025-11-05 9:44 UTC (permalink / raw)
To: Jan Kara
Cc: linux-ext4, tytso, adilger.kernel, linux-kernel, kernel, mcgrof,
linux-fsdevel, linux-mm, yi.zhang, yangerkun, chengzhihao1,
libaokun, Baokun Li
On 2025-11-05 17:13, Jan Kara wrote:
> On Sat 25-10-25 11:22:08, libaokun@huaweicloud.com wrote:
>> From: Baokun Li <libaokun1@huawei.com>
>>
>> Currently, ext4_mb_get_buddy_page_lock() uses blocks_per_page to calculate
>> folio index and offset. However, when blocksize is larger than PAGE_SIZE,
>> blocks_per_page becomes zero, leading to a potential division-by-zero bug.
>>
>> To support BS > PS, use bytes to compute folio index and offset within
>> folio to get rid of blocks_per_page.
>>
>> Also, since ext4_mb_get_buddy_page_lock() already fully supports folio,
>> rename it to ext4_mb_get_buddy_folio_lock().
>>
>> Signed-off-by: Baokun Li <libaokun1@huawei.com>
>> Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
> Looks good, just two typo fixes below. Feel free to add:
>
> Reviewed-by: Jan Kara <jack@suse.cz>
>
>> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
>> index 3494c6fe5bfb..d42d768a705a 100644
>> --- a/fs/ext4/mballoc.c
>> +++ b/fs/ext4/mballoc.c
>> @@ -1510,50 +1510,52 @@ static int ext4_mb_init_cache(struct folio *folio, char *incore, gfp_t gfp)
>> }
>>
> Let's fix some typos when updating the comment:
I’ll fix these typos in the next update.
Thank you for your review!
Regards,
Baokun
>
>> /*
>> - * Lock the buddy and bitmap pages. This make sure other parallel init_group
>> - * on the same buddy page doesn't happen whild holding the buddy page lock.
>> - * Return locked buddy and bitmap pages on e4b struct. If buddy and bitmap
>> - * are on the same page e4b->bd_buddy_folio is NULL and return value is 0.
>> + * Lock the buddy and bitmap folios. This make sure other parallel init_group
> ^^^ makes
>
>> + * on the same buddy folio doesn't happen whild holding the buddy folio lock.
> ^^ while
>
>> + * Return locked buddy and bitmap folios on e4b struct. If buddy and bitmap
>> + * are on the same folio e4b->bd_buddy_folio is NULL and return value is 0.
>> */
> Honza
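The division-by-zero hazard that patch 12 removes, and the byte-based computation it switches to, can be sketched in userspace C. The helper name below is illustrative, not the kernel's:

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

/* With bs > ps, PAGE_SIZE >> blkbits truncates to 0, so any later
 * "block % blocks_per_page" would divide by zero. Computing the page-cache
 * index and in-page offset from the block's byte address sidesteps
 * blocks_per_page entirely. */
static void block_position(uint64_t block, unsigned int blkbits,
			   uint64_t *index, uint64_t *offset)
{
	uint64_t byte = block << blkbits;  /* byte address of the block */

	*index  = byte >> PAGE_SHIFT;      /* page-cache index */
	*offset = byte & (PAGE_SIZE - 1);  /* offset within that page */
}
```

With 64k blocks on 4k pages, block 3 lands at page index 48, offset 0; with 1k blocks, block 5 lands at page index 1, offset 1024, matching the old blocks_per_page arithmetic.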
* Re: [PATCH 21/25] ext4: make online defragmentation support large block size
2025-10-25 3:22 ` [PATCH 21/25] ext4: make online defragmentation support large block size libaokun
@ 2025-11-05 9:50 ` Jan Kara
2025-11-05 10:48 ` Zhang Yi
2025-11-05 11:28 ` Baokun Li
0 siblings, 2 replies; 68+ messages in thread
From: Jan Kara @ 2025-11-05 9:50 UTC (permalink / raw)
To: libaokun
Cc: linux-ext4, tytso, adilger.kernel, jack, linux-kernel, kernel,
mcgrof, linux-fsdevel, linux-mm, yi.zhang, yangerkun,
chengzhihao1, libaokun1
On Sat 25-10-25 11:22:17, libaokun@huaweicloud.com wrote:
> From: Zhihao Cheng <chengzhihao1@huawei.com>
>
> There are several places assuming that block size <= PAGE_SIZE, modify
> them to support large block size (bs > ps).
>
> Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
> Signed-off-by: Baokun Li <libaokun1@huawei.com>
...
> @@ -565,7 +564,7 @@ ext4_move_extents(struct file *o_filp, struct file *d_filp, __u64 orig_blk,
> struct inode *orig_inode = file_inode(o_filp);
> struct inode *donor_inode = file_inode(d_filp);
> struct ext4_ext_path *path = NULL;
> - int blocks_per_page = PAGE_SIZE >> orig_inode->i_blkbits;
> + int blocks_per_page = 1;
> ext4_lblk_t o_end, o_start = orig_blk;
> ext4_lblk_t d_start = donor_blk;
> int ret;
> @@ -608,6 +607,9 @@ ext4_move_extents(struct file *o_filp, struct file *d_filp, __u64 orig_blk,
> return -EOPNOTSUPP;
> }
>
> + if (i_blocksize(orig_inode) < PAGE_SIZE)
> + blocks_per_page = PAGE_SIZE >> orig_inode->i_blkbits;
> +
I think these are strange, and the only reason for them is that
ext4_move_extents() tries to make life easier for move_extent_per_page(),
and that doesn't really work with larger folios anymore. I think
ext4_move_extents() just shouldn't care about pages / folios at all and
should pass 'cur_len' as the length to the end of the extent / moved range;
move_extent_per_page() will trim the length based on the folios it has got.
Also then we can rename some of the variables and functions from 'page' to
'folio'.
Honza
> /* Protect orig and donor inodes against a truncate */
> lock_two_nondirectories(orig_inode, donor_inode);
>
> @@ -665,10 +667,8 @@ ext4_move_extents(struct file *o_filp, struct file *d_filp, __u64 orig_blk,
> if (o_end - o_start < cur_len)
> cur_len = o_end - o_start;
>
> - orig_page_index = o_start >> (PAGE_SHIFT -
> - orig_inode->i_blkbits);
> - donor_page_index = d_start >> (PAGE_SHIFT -
> - donor_inode->i_blkbits);
> + orig_page_index = EXT4_LBLK_TO_P(orig_inode, o_start);
> + donor_page_index = EXT4_LBLK_TO_P(donor_inode, d_start);
> offset_in_page = o_start % blocks_per_page;
> if (cur_len > blocks_per_page - offset_in_page)
> cur_len = blocks_per_page - offset_in_page;
> --
> 2.46.1
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH 24/25] ext4: add checks for large folio incompatibilities when BS > PS
2025-10-25 3:22 ` [PATCH 24/25] ext4: add checks for large folio incompatibilities " libaokun
@ 2025-11-05 9:59 ` Jan Kara
0 siblings, 0 replies; 68+ messages in thread
From: Jan Kara @ 2025-11-05 9:59 UTC (permalink / raw)
To: libaokun
Cc: linux-ext4, tytso, adilger.kernel, jack, linux-kernel, kernel,
mcgrof, linux-fsdevel, linux-mm, yi.zhang, yangerkun,
chengzhihao1, libaokun1
On Sat 25-10-25 11:22:20, libaokun@huaweicloud.com wrote:
> From: Baokun Li <libaokun1@huawei.com>
>
> Supporting a block size greater than the page size (BS > PS) requires
> support for large folios. However, several features (e.g., verity, encrypt)
> and mount options (e.g., data=journal) do not yet support large folios.
>
> To prevent conflicts, this patch adds checks at mount time to prohibit
> these features and options from being used when BS > PS. Since the data
> mode cannot be changed on remount, there is no need to check on remount.
>
> A new mount flag, EXT4_MF_LARGE_FOLIO, is introduced. This flag is set
> after the checks pass, indicating that the filesystem has no features or
> mount options incompatible with large folios. Subsequent checks can simply
> test for this flag to avoid redundant verifications.
>
> Signed-off-by: Baokun Li <libaokun1@huawei.com>
> Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Looks good. Feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
> ---
> fs/ext4/ext4.h | 3 ++-
> fs/ext4/inode.c | 10 ++++------
> fs/ext4/super.c | 26 ++++++++++++++++++++++++++
> 3 files changed, 32 insertions(+), 7 deletions(-)
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 8223ed29b343..f1163deb0812 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1859,7 +1859,8 @@ static inline int ext4_get_resgid(struct ext4_super_block *es)
> enum {
> EXT4_MF_MNTDIR_SAMPLED,
> EXT4_MF_FC_INELIGIBLE, /* Fast commit ineligible */
> - EXT4_MF_JOURNAL_DESTROY /* Journal is in process of destroying */
> + EXT4_MF_JOURNAL_DESTROY,/* Journal is in process of destroying */
> + EXT4_MF_LARGE_FOLIO, /* large folio is support */
> };
>
> static inline void ext4_set_mount_flag(struct super_block *sb, int bit)
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index b3fa29923a1d..04f9380d4211 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -5143,14 +5143,12 @@ static bool ext4_should_enable_large_folio(struct inode *inode)
> {
> struct super_block *sb = inode->i_sb;
>
> - if (!S_ISREG(inode->i_mode))
> - return false;
> - if (test_opt(sb, DATA_FLAGS) == EXT4_MOUNT_JOURNAL_DATA ||
> - ext4_test_inode_flag(inode, EXT4_INODE_JOURNAL_DATA))
> + if (!ext4_test_mount_flag(sb, EXT4_MF_LARGE_FOLIO))
> return false;
> - if (ext4_has_feature_verity(sb))
> +
> + if (!S_ISREG(inode->i_mode))
> return false;
> - if (ext4_has_feature_encrypt(sb))
> + if (ext4_test_inode_flag(inode, EXT4_INODE_JOURNAL_DATA))
> return false;
>
> return true;
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 7338c708ea1d..fdc006a973aa 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -5034,6 +5034,28 @@ static const char *ext4_has_journal_option(struct super_block *sb)
> return NULL;
> }
>
> +static int ext4_check_large_folio(struct super_block *sb)
> +{
> + const char *err_str = NULL;
> +
> + if (test_opt(sb, DATA_FLAGS) == EXT4_MOUNT_JOURNAL_DATA)
> + err_str = "data=journal";
> + else if (ext4_has_feature_verity(sb))
> + err_str = "verity";
> + else if (ext4_has_feature_encrypt(sb))
> + err_str = "encrypt";
> +
> + if (!err_str) {
> + ext4_set_mount_flag(sb, EXT4_MF_LARGE_FOLIO);
> + } else if (sb->s_blocksize > PAGE_SIZE) {
> + ext4_msg(sb, KERN_ERR, "bs(%lu) > ps(%lu) unsupported for %s",
> + sb->s_blocksize, PAGE_SIZE, err_str);
> + return -EINVAL;
> + }
> +
> + return 0;
> +}
> +
> static int ext4_load_super(struct super_block *sb, ext4_fsblk_t *lsb,
> int silent)
> {
> @@ -5310,6 +5332,10 @@ static int __ext4_fill_super(struct fs_context *fc, struct super_block *sb)
>
> ext4_apply_options(fc, sb);
>
> + err = ext4_check_large_folio(sb);
> + if (err < 0)
> + goto failed_mount;
> +
> err = ext4_encoding_init(sb, es);
> if (err)
> goto failed_mount;
> --
> 2.46.1
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH 25/25] ext4: enable block size larger than page size
2025-10-25 3:22 ` [PATCH 25/25] ext4: enable block size larger than page size libaokun
@ 2025-11-05 10:14 ` Jan Kara
2025-11-06 2:44 ` Baokun Li
0 siblings, 1 reply; 68+ messages in thread
From: Jan Kara @ 2025-11-05 10:14 UTC (permalink / raw)
To: libaokun
Cc: linux-ext4, tytso, adilger.kernel, jack, linux-kernel, kernel,
mcgrof, linux-fsdevel, linux-mm, yi.zhang, yangerkun,
chengzhihao1, libaokun1
On Sat 25-10-25 11:22:21, libaokun@huaweicloud.com wrote:
> From: Baokun Li <libaokun1@huawei.com>
>
> Since block device (See commit 3c20917120ce ("block/bdev: enable large
> folio support for large logical block sizes")) and page cache (See commit
> ab95d23bab220ef8 ("filemap: allocate mapping_min_order folios in the page
> cache")) has the ability to have a minimum order when allocating folio,
> and ext4 has supported large folio in commit 7ac67301e82f ("ext4: enable
> large folio for regular file"), now add support for block_size > PAGE_SIZE
> in ext4.
>
> set_blocksize() -> bdev_validate_blocksize() already validates the block
> size, so ext4_load_super() does not need to perform additional checks.
>
> Here we only need to enable large folio by default when s_min_folio_order
> is greater than 0 and add the FS_LBS bit to fs_flags.
>
> In addition, mark this feature as experimental.
>
> Signed-off-by: Baokun Li <libaokun1@huawei.com>
> Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
...
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 04f9380d4211..ba6cf05860ae 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -5146,6 +5146,9 @@ static bool ext4_should_enable_large_folio(struct inode *inode)
> if (!ext4_test_mount_flag(sb, EXT4_MF_LARGE_FOLIO))
> return false;
>
> + if (EXT4_SB(sb)->s_min_folio_order)
> + return true;
> +
But now files with the data journalling flag enabled will get large folios
possibly significantly greater than the blocksize. I don't think there's a
fundamental reason why data journalling doesn't work with large folios; the
only thing that's likely going to break is that credit estimates will go
through the roof if there are too many blocks per folio. But that can be
handled by setting the max folio order equal to the min folio order when
journalling data for the inode.
It is a bit scary to be modifying max folio order in
ext4_change_inode_journal_flag() but I guess less scary than setting new
aops and if we prune the whole page cache before touching the order and
inode flag, we should be safe (famous last words ;).
Honza
> if (!S_ISREG(inode->i_mode))
> return false;
> if (ext4_test_inode_flag(inode, EXT4_INODE_JOURNAL_DATA))
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index fdc006a973aa..4c0bd79bdf68 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -5053,6 +5053,9 @@ static int ext4_check_large_folio(struct super_block *sb)
> return -EINVAL;
> }
>
> + if (sb->s_blocksize > PAGE_SIZE)
> + ext4_msg(sb, KERN_NOTICE, "EXPERIMENTAL bs(%lu) > ps(%lu) enabled.",
> + sb->s_blocksize, PAGE_SIZE);
> return 0;
> }
>
> @@ -7432,7 +7435,8 @@ static struct file_system_type ext4_fs_type = {
> .init_fs_context = ext4_init_fs_context,
> .parameters = ext4_param_specs,
> .kill_sb = ext4_kill_sb,
> - .fs_flags = FS_REQUIRES_DEV | FS_ALLOW_IDMAP | FS_MGTIME,
> + .fs_flags = FS_REQUIRES_DEV | FS_ALLOW_IDMAP | FS_MGTIME |
> + FS_LBS,
> };
> MODULE_ALIAS_FS("ext4");
>
> --
> 2.46.1
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH 21/25] ext4: make online defragmentation support large block size
2025-11-05 9:50 ` Jan Kara
@ 2025-11-05 10:48 ` Zhang Yi
2025-11-05 11:28 ` Baokun Li
1 sibling, 0 replies; 68+ messages in thread
From: Zhang Yi @ 2025-11-05 10:48 UTC (permalink / raw)
To: Jan Kara, libaokun
Cc: linux-ext4, tytso, adilger.kernel, linux-kernel, kernel, mcgrof,
linux-fsdevel, linux-mm, yangerkun, chengzhihao1, libaokun1
On 11/5/2025 5:50 PM, Jan Kara wrote:
> On Sat 25-10-25 11:22:17, libaokun@huaweicloud.com wrote:
>> From: Zhihao Cheng <chengzhihao1@huawei.com>
>>
>> There are several places assuming that block size <= PAGE_SIZE, modify
>> them to support large block size (bs > ps).
>>
>> Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
>> Signed-off-by: Baokun Li <libaokun1@huawei.com>
>
> ...
>
>> @@ -565,7 +564,7 @@ ext4_move_extents(struct file *o_filp, struct file *d_filp, __u64 orig_blk,
>> struct inode *orig_inode = file_inode(o_filp);
>> struct inode *donor_inode = file_inode(d_filp);
>> struct ext4_ext_path *path = NULL;
>> - int blocks_per_page = PAGE_SIZE >> orig_inode->i_blkbits;
>> + int blocks_per_page = 1;
>> ext4_lblk_t o_end, o_start = orig_blk;
>> ext4_lblk_t d_start = donor_blk;
>> int ret;
>> @@ -608,6 +607,9 @@ ext4_move_extents(struct file *o_filp, struct file *d_filp, __u64 orig_blk,
>> return -EOPNOTSUPP;
>> }
>>
>> + if (i_blocksize(orig_inode) < PAGE_SIZE)
>> + blocks_per_page = PAGE_SIZE >> orig_inode->i_blkbits;
>> +
>
> I think these are strange and the only reason for this is that
> ext4_move_extents() tries to make life easier to move_extent_per_page() and
> that doesn't really work with larger folios anymore. I think
> ext4_move_extents() just shouldn't care about pages / folios at all and
> pass 'cur_len' as the length to the end of extent / moved range and
> move_extent_per_page() will trim the length based on the folios it has got.
>
> Also then we can rename some of the variables and functions from 'page' to
> 'folio'.
>
> Honza
Hi, Jan!
Thank you for the suggestion. However, once my online defragmentation
optimization series[1] is merged, this patch won't be needed at all. Baokun
will rebase onto my series in the next iteration.
[1] https://lore.kernel.org/linux-ext4/20251013015128.499308-1-yi.zhang@huaweicloud.com/
Thanks,
Yi.
>
>> /* Protect orig and donor inodes against a truncate */
>> lock_two_nondirectories(orig_inode, donor_inode);
>>
>> @@ -665,10 +667,8 @@ ext4_move_extents(struct file *o_filp, struct file *d_filp, __u64 orig_blk,
>> if (o_end - o_start < cur_len)
>> cur_len = o_end - o_start;
>>
>> - orig_page_index = o_start >> (PAGE_SHIFT -
>> - orig_inode->i_blkbits);
>> - donor_page_index = d_start >> (PAGE_SHIFT -
>> - donor_inode->i_blkbits);
>> + orig_page_index = EXT4_LBLK_TO_P(orig_inode, o_start);
>> + donor_page_index = EXT4_LBLK_TO_P(donor_inode, d_start);
>> offset_in_page = o_start % blocks_per_page;
>> if (cur_len > blocks_per_page - offset_in_page)
>> cur_len = blocks_per_page - offset_in_page;
>> --
>> 2.46.1
>>
* Re: [PATCH 21/25] ext4: make online defragmentation support large block size
2025-11-05 9:50 ` Jan Kara
2025-11-05 10:48 ` Zhang Yi
@ 2025-11-05 11:28 ` Baokun Li
1 sibling, 0 replies; 68+ messages in thread
From: Baokun Li @ 2025-11-05 11:28 UTC (permalink / raw)
To: Jan Kara
Cc: linux-ext4, tytso, adilger.kernel, linux-kernel, kernel, mcgrof,
linux-fsdevel, linux-mm, yi.zhang, yangerkun, chengzhihao1,
libaokun1, Baokun Li
On 2025-11-05 17:50, Jan Kara wrote:
> On Sat 25-10-25 11:22:17, libaokun@huaweicloud.com wrote:
>> From: Zhihao Cheng <chengzhihao1@huawei.com>
>>
>> There are several places assuming that block size <= PAGE_SIZE, modify
>> them to support large block size (bs > ps).
>>
>> Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
>> Signed-off-by: Baokun Li <libaokun1@huawei.com>
> ...
>
>> @@ -565,7 +564,7 @@ ext4_move_extents(struct file *o_filp, struct file *d_filp, __u64 orig_blk,
>> struct inode *orig_inode = file_inode(o_filp);
>> struct inode *donor_inode = file_inode(d_filp);
>> struct ext4_ext_path *path = NULL;
>> - int blocks_per_page = PAGE_SIZE >> orig_inode->i_blkbits;
>> + int blocks_per_page = 1;
>> ext4_lblk_t o_end, o_start = orig_blk;
>> ext4_lblk_t d_start = donor_blk;
>> int ret;
>> @@ -608,6 +607,9 @@ ext4_move_extents(struct file *o_filp, struct file *d_filp, __u64 orig_blk,
>> return -EOPNOTSUPP;
>> }
>>
>> + if (i_blocksize(orig_inode) < PAGE_SIZE)
>> + blocks_per_page = PAGE_SIZE >> orig_inode->i_blkbits;
>> +
> I think these are strange and the only reason for this is that
> ext4_move_extents() tries to make life easier to move_extent_per_page() and
> that doesn't really work with larger folios anymore. I think
> ext4_move_extents() just shouldn't care about pages / folios at all and
> pass 'cur_len' as the length to the end of extent / moved range and
> move_extent_per_page() will trim the length based on the folios it has got.
>
> Also then we can rename some of the variables and functions from 'page' to
> 'folio'.
Yes, the code here doesn't really support folios. Yi mentioned earlier
that he would make online defragmentation support large folios, so in
this patch I only avoided shifting by negative values, without doing a
deeper conversion.
Yi's conversion work looks nearly complete, so in the next version I will
rebase on top of his patches. Since his series already removes the
function modified here, the next version will likely drop this patch or
adapt it accordingly.
Thanks for your review!
Cheers,
Baokun
>> /* Protect orig and donor inodes against a truncate */
>> lock_two_nondirectories(orig_inode, donor_inode);
>>
>> @@ -665,10 +667,8 @@ ext4_move_extents(struct file *o_filp, struct file *d_filp, __u64 orig_blk,
>> if (o_end - o_start < cur_len)
>> cur_len = o_end - o_start;
>>
>> - orig_page_index = o_start >> (PAGE_SHIFT -
>> - orig_inode->i_blkbits);
>> - donor_page_index = d_start >> (PAGE_SHIFT -
>> - donor_inode->i_blkbits);
>> + orig_page_index = EXT4_LBLK_TO_P(orig_inode, o_start);
>> + donor_page_index = EXT4_LBLK_TO_P(donor_inode, d_start);
>> offset_in_page = o_start % blocks_per_page;
>> if (cur_len > blocks_per_page - offset_in_page)
>> cur_len = blocks_per_page - offset_in_page;
>> --
>> 2.46.1
>>
* Re: [PATCH 25/25] ext4: enable block size larger than page size
2025-11-05 10:14 ` Jan Kara
@ 2025-11-06 2:44 ` Baokun Li
0 siblings, 0 replies; 68+ messages in thread
From: Baokun Li @ 2025-11-06 2:44 UTC (permalink / raw)
To: Jan Kara
Cc: linux-ext4, tytso, adilger.kernel, linux-kernel, kernel, mcgrof,
linux-fsdevel, linux-mm, yi.zhang, yangerkun, chengzhihao1,
libaokun1, Baokun Li
On 2025-11-05 18:14, Jan Kara wrote:
> On Sat 25-10-25 11:22:21, libaokun@huaweicloud.com wrote:
>> From: Baokun Li <libaokun1@huawei.com>
>>
>> Since block device (See commit 3c20917120ce ("block/bdev: enable large
>> folio support for large logical block sizes")) and page cache (See commit
>> ab95d23bab220ef8 ("filemap: allocate mapping_min_order folios in the page
>> cache")) has the ability to have a minimum order when allocating folio,
>> and ext4 has supported large folio in commit 7ac67301e82f ("ext4: enable
>> large folio for regular file"), now add support for block_size > PAGE_SIZE
>> in ext4.
>>
>> set_blocksize() -> bdev_validate_blocksize() already validates the block
>> size, so ext4_load_super() does not need to perform additional checks.
>>
>> Here we only need to enable large folio by default when s_min_folio_order
>> is greater than 0 and add the FS_LBS bit to fs_flags.
>>
>> In addition, mark this feature as experimental.
>>
>> Signed-off-by: Baokun Li <libaokun1@huawei.com>
>> Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
> ...
>
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index 04f9380d4211..ba6cf05860ae 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -5146,6 +5146,9 @@ static bool ext4_should_enable_large_folio(struct inode *inode)
>> if (!ext4_test_mount_flag(sb, EXT4_MF_LARGE_FOLIO))
>> return false;
>>
>> + if (EXT4_SB(sb)->s_min_folio_order)
>> + return true;
>> +
> But now files with data journalling flag enabled will get large folios
> possibly significantly greater that blocksize. I don't think there's a
> fundamental reason why data journalling doesn't work with large folios, the
> only thing that's likely going to break is that credit estimates will go
> through the roof if there are too many blocks per folio. But that can be
> handled by setting max folio order to be equal to min folio order when
> journalling data for the inode.
>
> It is a bit scary to be modifying max folio order in
> ext4_change_inode_journal_flag() but I guess less scary than setting new
> aops and if we prune the whole page cache before touching the order and
> inode flag, we should be safe (famous last words ;).
>
Good point! This looks feasible.
We just need to adjust the folio order range based on the journal data,
and in ext4_inode_journal_mode only ignore the inode’s journal data flag
when max_order > min_order.
I’ll make the adaptation and run some tests.
Thank you for your review!
Cheers,
Baokun
>
>> if (!S_ISREG(inode->i_mode))
>> return false;
>> if (ext4_test_inode_flag(inode, EXT4_INODE_JOURNAL_DATA))
>> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
>> index fdc006a973aa..4c0bd79bdf68 100644
>> --- a/fs/ext4/super.c
>> +++ b/fs/ext4/super.c
>> @@ -5053,6 +5053,9 @@ static int ext4_check_large_folio(struct super_block *sb)
>> return -EINVAL;
>> }
>>
>> + if (sb->s_blocksize > PAGE_SIZE)
>> + ext4_msg(sb, KERN_NOTICE, "EXPERIMENTAL bs(%lu) > ps(%lu) enabled.",
>> + sb->s_blocksize, PAGE_SIZE);
>> return 0;
>> }
>>
>> @@ -7432,7 +7435,8 @@ static struct file_system_type ext4_fs_type = {
>> .init_fs_context = ext4_init_fs_context,
>> .parameters = ext4_param_specs,
>> .kill_sb = ext4_kill_sb,
>> - .fs_flags = FS_REQUIRES_DEV | FS_ALLOW_IDMAP | FS_MGTIME,
>> + .fs_flags = FS_REQUIRES_DEV | FS_ALLOW_IDMAP | FS_MGTIME |
>> + FS_LBS,
>> };
>> MODULE_ALIAS_FS("ext4");
>>
>> --
>> 2.46.1
>>
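The order-range clamp discussed for patch 25 (min order following the block size, max order clamped down to min for data-journalled inodes so credit estimates stay bounded) can be modeled in userspace C. All names and the cap value below are illustrative assumptions, not the kernel's:

```c
#include <assert.h>

#define PAGE_SHIFT 12
#define MAX_CACHE_ORDER 8  /* illustrative cap, not the kernel constant */

/* Hypothetical model: LBS requires folios at least one block large, so the
 * min order is blkbits - PAGE_SHIFT; enabling data journalling clamps the
 * max order to the min, limiting each folio to one block's worth of
 * journal credits. */
static void folio_order_range(unsigned int blkbits, int journal_data,
			      unsigned int *min, unsigned int *max)
{
	*min = blkbits > PAGE_SHIFT ? blkbits - PAGE_SHIFT : 0;
	*max = journal_data ? *min : MAX_CACHE_ORDER;
}
```

For a 64k blocksize on 4k pages this yields min order 4; turning on data journalling pins max order to 4 as well, while an ordinary inode may grow folios up to the cap.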
Thread overview: 68+ messages
2025-10-25 3:21 [PATCH 00/25] ext4: enable block size larger than page size libaokun
2025-10-25 3:21 ` [PATCH 01/25] ext4: remove page offset calculation in ext4_block_zero_page_range() libaokun
2025-11-03 7:41 ` Jan Kara
2025-10-25 3:21 ` [PATCH 02/25] ext4: remove page offset calculation in ext4_block_truncate_page() libaokun
2025-11-03 7:42 ` Jan Kara
2025-10-25 3:21 ` [PATCH 03/25] ext4: remove PAGE_SIZE checks for rec_len conversion libaokun
2025-11-03 7:43 ` Jan Kara
2025-10-25 3:22 ` [PATCH 04/25] ext4: make ext4_punch_hole() support large block size libaokun
2025-11-03 8:05 ` Jan Kara
2025-11-04 6:55 ` Baokun Li
2025-10-25 3:22 ` [PATCH 05/25] ext4: enable DIOREAD_NOLOCK by default for BS > PS as well libaokun
2025-11-03 8:06 ` Jan Kara
2025-10-25 3:22 ` [PATCH 06/25] ext4: introduce s_min_folio_order for future BS > PS support libaokun
2025-11-03 8:19 ` Jan Kara
2025-10-25 3:22 ` [PATCH 07/25] ext4: support large block size in ext4_calculate_overhead() libaokun
2025-11-03 8:14 ` Jan Kara
2025-11-03 14:37 ` Baokun Li
2025-10-25 3:22 ` [PATCH 08/25] ext4: support large block size in ext4_readdir() libaokun
2025-11-03 8:27 ` Jan Kara
2025-10-25 3:22 ` [PATCH 09/25] ext4: add EXT4_LBLK_TO_B macro for logical block to bytes conversion libaokun
2025-11-03 8:21 ` Jan Kara
2025-10-25 3:22 ` [PATCH 10/25] ext4: add EXT4_LBLK_TO_P and EXT4_P_TO_LBLK for block/page conversion libaokun
2025-11-03 8:26 ` Jan Kara
2025-11-03 14:45 ` Baokun Li
2025-11-05 8:27 ` Jan Kara
2025-10-25 3:22 ` [PATCH 11/25] ext4: support large block size in ext4_mb_load_buddy_gfp() libaokun
2025-11-05 8:46 ` Jan Kara
2025-10-25 3:22 ` [PATCH 12/25] ext4: support large block size in ext4_mb_get_buddy_page_lock() libaokun
2025-11-05 9:13 ` Jan Kara
2025-11-05 9:44 ` Baokun Li
2025-10-25 3:22 ` [PATCH 13/25] ext4: support large block size in ext4_mb_init_cache() libaokun
2025-11-05 9:18 ` Jan Kara
2025-10-25 3:22 ` [PATCH 14/25] ext4: prepare buddy cache inode for BS > PS with large folios libaokun
2025-11-05 9:19 ` Jan Kara
2025-10-25 3:22 ` [PATCH 15/25] ext4: rename 'page' references to 'folio' in multi-block allocator libaokun
2025-11-05 9:21 ` Jan Kara
2025-10-25 3:22 ` [PATCH 16/25] ext4: support large block size in ext4_mpage_readpages() libaokun
2025-11-05 9:26 ` Jan Kara
2025-10-25 3:22 ` [PATCH 17/25] ext4: support large block size in ext4_block_write_begin() libaokun
2025-11-05 9:28 ` Jan Kara
2025-10-25 3:22 ` [PATCH 18/25] ext4: support large block size in mpage_map_and_submit_buffers() libaokun
2025-11-05 9:30 ` Jan Kara
2025-10-25 3:22 ` [PATCH 19/25] ext4: support large block size in mpage_prepare_extent_to_map() libaokun
2025-11-05 9:31 ` Jan Kara
2025-10-25 3:22 ` [PATCH 20/25] ext4: support large block size in __ext4_block_zero_page_range() libaokun
2025-11-05 9:33 ` Jan Kara
2025-10-25 3:22 ` [PATCH 21/25] ext4: make online defragmentation support large block size libaokun
2025-11-05 9:50 ` Jan Kara
2025-11-05 10:48 ` Zhang Yi
2025-11-05 11:28 ` Baokun Li
2025-10-25 3:22 ` [PATCH 22/25] fs/buffer: prevent WARN_ON in __alloc_pages_slowpath() when BS > PS libaokun
2025-10-25 4:45 ` Matthew Wilcox
2025-10-25 5:13 ` Darrick J. Wong
2025-10-25 6:32 ` Baokun Li
2025-10-25 7:01 ` Zhang Yi
2025-10-25 17:56 ` Matthew Wilcox
2025-10-27 2:57 ` Baokun Li
2025-10-27 7:40 ` Christoph Hellwig
2025-10-30 21:25 ` Matthew Wilcox
2025-10-31 1:47 ` Zhang Yi
2025-10-31 1:55 ` Baokun Li
2025-10-25 6:34 ` Baokun Li
2025-10-25 3:22 ` [PATCH 23/25] jbd2: " libaokun
2025-10-25 3:22 ` [PATCH 24/25] ext4: add checks for large folio incompatibilities " libaokun
2025-11-05 9:59 ` Jan Kara
2025-10-25 3:22 ` [PATCH 25/25] ext4: enable block size larger than page size libaokun
2025-11-05 10:14 ` Jan Kara
2025-11-06 2:44 ` Baokun Li