Linux EXT4 FS development
* [PATCH 4/4] select: make select() and poll() waits freezable
From: Dai Junbing @ 2026-05-27  6:49 UTC (permalink / raw)
  To: linux-fsdevel, viro, brauner, tytso, jack, linux-ext4
  Cc: jack, linux-kernel, Dai Junbing
In-Reply-To: <20260527064912.1038-1-daijunbing@vivo.com>

Tasks blocked in select() or poll() may be woken during suspend and
resume due to freezer state transitions. This can cause avoidable
activity in the suspend/resume path and add unnecessary overhead.

Mark the waits in do_select() and do_poll() as freezable so these tasks
are not unnecessarily woken by the freezer.

Both functions are only used from their respective system call paths,
where the task sleeps without holding locks that would make freezing
unsafe.

Signed-off-by: Dai Junbing <daijunbing@vivo.com>
---
 fs/select.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/select.c b/fs/select.c
index bf71c9838dfe..b0b279748355 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -600,7 +600,7 @@ static noinline_for_stack int do_select(int n, fd_set_bits *fds, struct timespec
 			to = &expire;
 		}
 
-		if (!poll_schedule_timeout(&table, TASK_INTERRUPTIBLE,
+		if (!poll_schedule_timeout(&table, TASK_INTERRUPTIBLE|TASK_FREEZABLE,
 					   to, slack))
 			timed_out = 1;
 	}
@@ -962,7 +962,7 @@ static int do_poll(struct poll_list *list, struct poll_wqueues *wait,
 			to = &expire;
 		}
 
-		if (!poll_schedule_timeout(wait, TASK_INTERRUPTIBLE, to, slack))
+		if (!poll_schedule_timeout(wait, TASK_INTERRUPTIBLE|TASK_FREEZABLE, to, slack))
 			timed_out = 1;
 	}
 	return count;
-- 
2.25.1



* [PATCH v2 0/2] ext4: optimize ext4_mb_prefetch
From: Bohdan Trach @ 2026-05-27  9:03 UTC (permalink / raw)
  To: Theodore Ts'o, Andreas Dilger, Baokun Li, Jan Kara,
	Ojaswin Mujoo, Ritesh Harjani (IBM), Zhang Yi
  Cc: mchehab+huawei, bohdan.trach, lilith.oberhauser, Bohdan Trach,
	linux-ext4, linux-kernel

v2:
  Fix issues found by Jan Kara, add R-b for patch 2/2.
  Extend commit message of patch 1/2 a bit.
v1:
https://lore.kernel.org/linux-ext4/20260521125931.16474-1-bohdan.trach@huaweicloud.com/

Original cover letter below:

Dear Ted,

We have been profiling the scalability of some rocksdb-related workloads
on the ext4 file system and have found a case where significant time is
spent in the ext4_mb_prefetch() function. This happens because the
ext4_mb_scan_groups_linear() path is triggered in ext4_mb_scan_groups().
We have noticed that on larger, filled disks, this function can take
a lot of time.

We have added a test for this issue to our fork of will-it-scale [1],
which you can use to reproduce it. (The actual workload does a few
writes after fallocate; they have been dropped to better illustrate the
issue.)

[1] https://github.com/open-s4c/will-it-scale/blob/master/tests/fallocate3.c

In this series, we optimize this code path:
Patch 1: change EXT4_MB_GRP_TEST_AND_SET_READ() to reduce the rate of
         atomic RMW operations via test_and_set_bit, which have quite
         a high cost on large multicore CPUs, especially under
         contention for the group's flag cache lines.
         As this bit is only ever set, but never unset, it should be
         possible to reduce the cost of this check by calling
         test_bit[_acquire]() first.
Patch 2: restructure the ext4_mb_prefetch loop operations such that
         ext4_group_desc is fetched only after the checks based on
         ext4_group_info succeed.
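
For readers less familiar with the pattern behind patch 1, here is a
minimal user-space sketch using C11 atomics instead of the kernel bitops.
The names group_state, READ_BIT and test_and_set_read() are hypothetical
and only illustrate the idea; the real change is in patch 1/2.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-group state word. */
struct group_state {
	atomic_ulong bits;
};

#define READ_BIT 0UL	/* illustrative bit number, not the kernel's value */

/*
 * Return the previous value of READ_BIT and make sure it ends up set.
 * The plain load lets the common "already set" case finish without an
 * atomic read-modify-write. This is only valid because the bit is never
 * cleared once it has been set.
 */
static inline bool test_and_set_read(struct group_state *g)
{
	unsigned long mask = 1UL << READ_BIT;

	if (atomic_load_explicit(&g->bits, memory_order_acquire) & mask)
		return true;

	/* Slow path: the first caller (or a racing one) performs the RMW. */
	return atomic_fetch_or_explicit(&g->bits, mask,
					memory_order_acq_rel) & mask;
}
```

Only the first call per group (plus any callers racing with it) reaches
the atomic OR; every later call returns from the load.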

This series has been tested with
        kvm-xfstests -c ext4/all -g auto
and did not introduce any new issues.

Performance test: we used the will-it-scale drop-in test provided above
and ran it on three machines:
- Kunpeng 920 (arm64, 96 CPUs * 1 socket, 128G RAM, SAS HDD: Seagate
  Exos 10E2400 1.2TB)
- Kunpeng 920b (arm64, 80 CPUs * 2 sockets, 502G RAM, SATA SSD: Huawei
  ES3000 V6 0.96TB)
- AMD 9654 (x86_64, 96 CPUs * 2 sockets, 1.5T RAM, NVME SSD: Samsung SSD
  970 EVO Plus 1TB)
We have performed tests with existing file systems, as well as more limited
tests with fixed-size file systems.

Benchmark on an existing file system for Kunpeng 920 (842G FS, 31% space
used) with the patch based on kernel 7.0.6:
| thr. | base | patched |  improv. |
|      | perf |    perf |          |
|------|------|---------|----------|
|    1 | 1286 |    1608 |  +25.04% |
|    2 | 1673 |    1680 |   +0.42% |
|    4 | 1698 |    1712 |   +0.82% |
|    8 | 1721 |    1730 |   +0.52% |
|   16 | 1739 |    2313 |  +33.01% |
|   32 | 1742 |    3571 | +104.99% |
|   64 | 1735 |    3427 |  +97.52% |
|   96 | 1688 |    1814 |   +7.46% |
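
The improv. column in these tables is the relative change of the patched
result over the baseline, in percent. A minimal sketch of that arithmetic
(the helper name improvement_pct() is hypothetical and not part of
will-it-scale):

```c
/* Relative change of 'patched' over 'base', in percent. */
static double improvement_pct(double base, double patched)
{
	return (patched - base) / base * 100.0;
}
```

For the single-thread row above, improvement_pct(1286, 1608) gives roughly
25.04.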

Benchmark on an existing file system for Kunpeng 920b (802G ext4 FS, 68%
space used) with the patch based on kernel 6.6:
| thr. | base | patched |  improv. |
|      | perf |    perf |          |
|------|------|---------|----------|
|    1 | 1613 |   1625  |   +0.74% |
|    2 | 1620 |   2603  |  +60.67% |
|    4 | 1624 |   4894  | +201.35% |
|    8 | 2505 |   8328  | +232.45% |
|   16 | 4736 |  11632  | +145.60% |
|   32 | 7784 |  13124  |  +68.60% |
|   64 | 8094 |   8636  |   +6.69% |
|  128 | 6914 |   7890  |  +14.11% |

Benchmark on an existing file system for AMD 9654 (15T FS, 6% space
used), kernel 7.1-rc3. This shows the performance impact on a mostly
free file system.
| thr. |  base | patched | improv. |
|      |  perf |    perf |         |
|------|-------|---------|---------|
|    1 | 30901 |   31191 |  +0.94% |
|    2 | 50874 |   50504 |  -0.73% |
|    4 | 66068 |   64108 |  -2.97% |
|    8 | 63963 |   61927 |  -3.18% |
|   16 | 47809 |   47044 |  -1.60% |
|   32 | 42441 |   42326 |  -0.27% |
|   64 | 39773 |   39929 |  +0.39% |
|  128 | 37065 |   36413 |  -1.76% |

We have also performed the test with kernel 6.6 on both Kunpeng 920b and
AMD 9654 with a much smaller FS image (133G) to have a more controlled
benchmarking environment, although this also reduces the measured benefits
compared to a bigger FS with more groups to iterate over:

AMD 9654 performance:
| thr. |  base | patched |  improv. |
|      |  perf |    perf |          |
|------|----------------------------|
| 25% full file system:             |
|------|----------------------------|
|    1 |  5964 |    6778 |  +13.64% |
|    2 | 11811 |   13415 |  +13.58% |
|    4 | 20111 |   23570 |  +17.19% |
|    8 | 30083 |   36296 |  +20.65% |
|   16 | 27781 |   38302 |  +37.87% |
|   32 | 28325 |   36930 |  +30.37% |
|   64 | 26044 |   29952 |  +15.00% |
|  128 | 19969 |   20882 |   +4.57% |
|------|----------------------------|
| 50% full file system:             |
|------|----------------------------|
|    1 |  4093 |    7380 |  +80.30% |
|    2 | 13168 |   13906 |   +5.60% |
|    4 | 21440 |   22623 |   +5.51% |
|    8 | 30523 |   32360 |   +6.01% |
|   16 | 27502 |   34017 |  +23.68% |
|   32 | 27189 |   32480 |  +19.46% |
|   64 | 24146 |   26463 |   +9.59% |
|  128 | 18386 |   18631 |   +1.33% |
|------|----------------------------|
| 75% full file system:             |
|------|----------------------------|
|    1 |  5738 |    7208 |  +25.61% |
|    2 | 13869 |   15309 |  +10.38% |
|    4 | 21803 |   23447 |   +7.54% |
|    8 | 29004 |   30766 |   +6.07% |
|   16 | 25542 |   30584 |  +19.74% |
|   32 | 24242 |   28631 |  +18.10% |
|   64 | 20631 |   22833 |  +10.67% |
|  128 | 14603 |   15086 |   +3.30% |

Kunpeng 920b performance:
| thr. |  base | patched | improv. |
|      |  perf |    perf |         |
|------|---------------------------|
| 25% full file system:            |
|------|---------------------------|
|    1 |  5398 |    7025 | +30.14% |
|    2 |  7451 |   12299 | +65.06% |
|    4 | 12574 |   20899 | +66.20% |
|    8 | 18645 |   27694 | +48.53% |
|   16 | 25088 |   31739 | +26.51% |
|   32 | 26699 |   27632 |  +3.49% |
|   64 | 14943 |   19547 | +30.81% |
|  128 | 13047 |   14544 | +11.47% |
|------|---------------------------|
| 50% full file system:            |
|------|---------------------------|
|    1 |  4881 |    6618 | +35.58% |
|    2 |  6544 |   11660 | +78.17% |
|    4 | 11156 |   19506 | +74.84% |
|    8 | 16842 |   25835 | +53.39% |
|   16 | 23305 |   29260 | +25.55% |
|   32 | 24622 |   25303 |  +2.76% |
|   64 | 13814 |   17707 | +28.18% |
|  128 | 12061 |   13180 |  +9.27% |
|------|---------------------------|
| 75% full file system:            |
|------|---------------------------|
|    1 |  7037 |   10580 | +50.34% |
|    2 |  9216 |    9075 |  -1.52% |
|    4 | 14534 |   22076 | +51.89% |
|    8 | 19341 |   25936 | +34.09% |
|   16 | 23592 |   27409 | +16.17% |
|   32 | 23680 |   23078 |  -2.54% |
|   64 | 12836 |   15902 | +23.88% |
|  128 |  9614 |   10341 |  +7.56% |

Thanks,
Bohdan.

Bohdan Trach (2):
  ext4: avoid RWM atomic in EXT4_MB_GRP_TEST_AND_SET_READ
  ext4: get ext4_group_desc in ext4_mb_prefetch only when necessary

 fs/ext4/ext4.h    |  8 +++++++-
 fs/ext4/mballoc.c | 21 +++++++++++----------
 2 files changed, 18 insertions(+), 11 deletions(-)

-- 
2.43.0



* [PATCH v2 1/2] ext4: avoid RWM atomic in EXT4_MB_GRP_TEST_AND_SET_READ
From: Bohdan Trach @ 2026-05-27  9:03 UTC (permalink / raw)
  To: Theodore Ts'o, Andreas Dilger, Baokun Li, Jan Kara,
	Ojaswin Mujoo, Ritesh Harjani (IBM), Zhang Yi
  Cc: mchehab+huawei, bohdan.trach, lilith.oberhauser, Bohdan Trach,
	linux-ext4, linux-kernel
In-Reply-To: <20260527090329.2680170-1-bohdan.trach@huaweicloud.com>

EXT4_MB_GRP_TEST_AND_SET_READ uses the test_and_set_bit function, which
issues an atomic write. This can cause high overhead due to cache
contention when multiple threads iterate over groups in a tight loop,
as is the case for ext4_mb_prefetch(). We have seen this to be a
problem on Kunpeng 920b CPUs, which use a single ARM LSE instruction
for this purpose.

Avoid this unconditional atomic write by testing the bit first without
changing its value. This is OK for this use case as this bit is never
unset.

This change significantly reduces the cost of fallocate() operations that
trigger linear group scans on large multicore machines where
test_and_set_bit issues an atomic write operation unconditionally.

Signed-off-by: Bohdan Trach <bohdan.trach@huaweicloud.com>
---
 fs/ext4/ext4.h | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 56b82d4a15d7..f8eacf1375f8 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -3551,7 +3551,13 @@ struct ext4_group_info {
 #define EXT4_MB_GRP_CLEAR_TRIMMED(grp)	\
 	(clear_bit(EXT4_GROUP_INFO_WAS_TRIMMED_BIT, &((grp)->bb_state)))
 #define EXT4_MB_GRP_TEST_AND_SET_READ(grp)	\
-	(test_and_set_bit(EXT4_GROUP_INFO_BBITMAP_READ_BIT, &((grp)->bb_state)))
+	(ext4_mb_grp_test_and_set_read((grp)))
+
+static inline int ext4_mb_grp_test_and_set_read(struct ext4_group_info *grp)
+{
+	return (test_bit(EXT4_GROUP_INFO_BBITMAP_READ_BIT, &grp->bb_state) ||
+		test_and_set_bit(EXT4_GROUP_INFO_BBITMAP_READ_BIT, &grp->bb_state));
+}
 
 #define EXT4_MAX_CONTENTION		8
 #define EXT4_CONTENTION_THRESHOLD	2
-- 
2.43.0



* [PATCH v2 2/2] ext4: get ext4_group_desc in ext4_mb_prefetch only when necessary
From: Bohdan Trach @ 2026-05-27  9:03 UTC (permalink / raw)
  To: Theodore Ts'o, Andreas Dilger, Baokun Li, Jan Kara,
	Ojaswin Mujoo, Ritesh Harjani (IBM), Zhang Yi
  Cc: mchehab+huawei, bohdan.trach, lilith.oberhauser, Bohdan Trach,
	linux-ext4, linux-kernel
In-Reply-To: <20260527090329.2680170-1-bohdan.trach@huaweicloud.com>

Getting the ext4_group_desc structure can add to the cost of
ext4_mb_prefetch() needlessly, as most groups fail the
!EXT4_MB_GRP_TEST_AND_SET_READ check.

Optimize ext4_mb_prefetch by getting the group descriptor only when
necessary.

The result is a further increase in performance of the fallocate() system
call path that triggers ext4_mb_prefetch() via a linear group scan.
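
The reordering can be illustrated with a small user-space sketch. All
types and helpers here (grp_info, grp_desc, get_desc(), count_prefetch())
are hypothetical stand-ins, not the ext4 structures; the sketch only shows
that the descriptor lookup is reached after the cheap per-group checks
pass.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for in-memory group info and the descriptor. */
struct grp_info { bool already_read; bool need_init; };
struct grp_desc { unsigned int free_clusters; };

static unsigned int desc_lookups;	/* counts the "expensive" lookups */

static struct grp_desc *get_desc(struct grp_desc *table, size_t group)
{
	desc_lookups++;
	return &table[group];
}

/*
 * Count the groups worth prefetching. The descriptor is fetched only
 * for groups that were not read before and still need initialization.
 */
static size_t count_prefetch(struct grp_info *info, struct grp_desc *table,
			     size_t ngroups)
{
	size_t hits = 0;

	for (size_t i = 0; i < ngroups; i++) {
		/* Mimic test-and-set: remember the old value, then set it. */
		bool was_read = info[i].already_read;

		info[i].already_read = true;
		if (was_read || !info[i].need_init)
			continue;

		struct grp_desc *d = get_desc(table, i);
		if (d->free_clusters > 0)
			hits++;
	}
	return hits;
}
```

On a second pass over the same groups, desc_lookups does not grow, which
corresponds to the case where most groups fail the first check.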

Signed-off-by: Bohdan Trach <bohdan.trach@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/mballoc.c | 21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 25e3d9204233..907a209eb1e8 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2861,8 +2861,6 @@ ext4_group_t ext4_mb_prefetch(struct super_block *sb, ext4_group_t group,
 
 	blk_start_plug(&plug);
 	while (nr-- > 0) {
-		struct ext4_group_desc *gdp = ext4_get_group_desc(sb, group,
-								  NULL);
 		struct ext4_group_info *grp = ext4_get_group_info(sb, group);
 
 		/*
@@ -2872,14 +2870,17 @@ ext4_group_t ext4_mb_prefetch(struct super_block *sb, ext4_group_t group,
 		 * prefetch once, so we avoid getblk() call, which can
 		 * be expensive.
 		 */
-		if (gdp && grp && !EXT4_MB_GRP_TEST_AND_SET_READ(grp) &&
-		    EXT4_MB_GRP_NEED_INIT(grp) &&
-		    ext4_free_group_clusters(sb, gdp) > 0 ) {
-			bh = ext4_read_block_bitmap_nowait(sb, group, true);
-			if (!IS_ERR_OR_NULL(bh)) {
-				if (!buffer_uptodate(bh) && cnt)
-					(*cnt)++;
-				brelse(bh);
+		if (grp && !EXT4_MB_GRP_TEST_AND_SET_READ(grp) &&
+		    EXT4_MB_GRP_NEED_INIT(grp)) {
+			struct ext4_group_desc *gdp = ext4_get_group_desc(sb, group, NULL);
+
+			if (gdp && ext4_free_group_clusters(sb, gdp) > 0) {
+				bh = ext4_read_block_bitmap_nowait(sb, group, true);
+				if (!IS_ERR_OR_NULL(bh)) {
+					if (!buffer_uptodate(bh) && cnt)
+						(*cnt)++;
+					brelse(bh);
+				}
 			}
 		}
 		if (++group >= ngroups)
-- 
2.43.0



* Re: [PATCH 12/34] ext4; Convert __ext4_read_bh() to bh_submit()
From: Jan Kara @ 2026-05-27 10:38 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: Jan Kara, Christian Brauner, Christoph Hellwig, linux-fsdevel,
	linux-ext4
In-Reply-To: <20260525171931.4144395-13-willy@infradead.org>

On Mon 25-05-26 18:19:05, Matthew Wilcox (Oracle) wrote:
> Avoid an extra indirect function call by converting
> ext4_end_bitmap_read() from bh_end_io_t to bio_end_io_t and
> calling bh_submit().
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> Cc: linux-ext4@vger.kernel.org

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/ext4/ext4.h   | 10 +++++-----
>  fs/ext4/ialloc.c |  5 ++++-
>  fs/ext4/super.c  | 11 ++++++-----
>  3 files changed, 15 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 94283a991e5c..6af11f0ff1c5 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -2959,7 +2959,7 @@ extern unsigned long ext4_count_dirs(struct super_block *);
>  extern void ext4_mark_bitmap_end(int start_bit, int end_bit, char *bitmap);
>  extern int ext4_init_inode_table(struct super_block *sb,
>  				 ext4_group_t group, int barrier);
> -extern void ext4_end_bitmap_read(struct buffer_head *bh, int uptodate);
> +void ext4_end_bitmap_read(struct bio *bio);
>  
>  /* fast_commit.c */
>  int ext4_fc_info_show(struct seq_file *seq, void *v);
> @@ -3184,10 +3184,10 @@ extern struct buffer_head *ext4_sb_bread_unmovable(struct super_block *sb,
>  						   sector_t block);
>  extern struct buffer_head *ext4_sb_bread_nofail(struct super_block *sb,
>  						sector_t block);
> -extern void ext4_read_bh_nowait(struct buffer_head *bh, blk_opf_t op_flags,
> -				bh_end_io_t *end_io, bool simu_fail);
> -extern int ext4_read_bh(struct buffer_head *bh, blk_opf_t op_flags,
> -			bh_end_io_t *end_io, bool simu_fail);
> +void ext4_read_bh_nowait(struct buffer_head *bh, blk_opf_t op_flags,
> +		bio_end_io_t end_io, bool simu_fail);
> +int ext4_read_bh(struct buffer_head *bh, blk_opf_t op_flags,
> +		bio_end_io_t end_io, bool simu_fail);
>  extern int ext4_read_bh_lock(struct buffer_head *bh, blk_opf_t op_flags, bool wait);
>  extern void ext4_sb_breadahead_unmovable(struct super_block *sb, sector_t block);
>  extern int ext4_seq_options_show(struct seq_file *seq, void *offset);
> diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
> index 3fd8f0099852..2db68b1bf855 100644
> --- a/fs/ext4/ialloc.c
> +++ b/fs/ext4/ialloc.c
> @@ -66,8 +66,11 @@ void ext4_mark_bitmap_end(int start_bit, int end_bit, char *bitmap)
>  		memset(bitmap + (i >> 3), 0xff, (end_bit - i) >> 3);
>  }
>  
> -void ext4_end_bitmap_read(struct buffer_head *bh, int uptodate)
> +void ext4_end_bitmap_read(struct bio *bio)
>  {
> +	bool uptodate = bio->bi_status == BLK_STS_OK;
> +	struct buffer_head *bh = bio_endio_bh(bio);
> +
>  	if (uptodate) {
>  		set_buffer_uptodate(bh);
>  		set_bitmap_uptodate(bh);
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 6a77db4d3124..fbe175951e01 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -161,7 +161,7 @@ MODULE_ALIAS("ext3");
>  
>  
>  static inline void __ext4_read_bh(struct buffer_head *bh, blk_opf_t op_flags,
> -				  bh_end_io_t *end_io, bool simu_fail)
> +				  bio_end_io_t end_io, bool simu_fail)
>  {
>  	if (simu_fail) {
>  		clear_buffer_uptodate(bh);
> @@ -176,13 +176,14 @@ static inline void __ext4_read_bh(struct buffer_head *bh, blk_opf_t op_flags,
>  	 */
>  	clear_buffer_verified(bh);
>  
> -	bh->b_end_io = end_io ? end_io : end_buffer_read_sync;
> +	if (!end_io)
> +		end_io = bh_end_read;
>  	get_bh(bh);
> -	submit_bh(REQ_OP_READ | op_flags, bh);
> +	bh_submit(bh, REQ_OP_READ | op_flags, end_io);
>  }
>  
>  void ext4_read_bh_nowait(struct buffer_head *bh, blk_opf_t op_flags,
> -			 bh_end_io_t *end_io, bool simu_fail)
> +			 bio_end_io_t end_io, bool simu_fail)
>  {
>  	BUG_ON(!buffer_locked(bh));
>  
> @@ -194,7 +195,7 @@ void ext4_read_bh_nowait(struct buffer_head *bh, blk_opf_t op_flags,
>  }
>  
>  int ext4_read_bh(struct buffer_head *bh, blk_opf_t op_flags,
> -		 bh_end_io_t *end_io, bool simu_fail)
> +		 bio_end_io_t end_io, bool simu_fail)
>  {
>  	BUG_ON(!buffer_locked(bh));
>  
> -- 
> 2.47.3
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
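
As a rough illustration of the conversion pattern in the quoted patch,
here is a user-space sketch with mock types. Everything prefixed mock_ is
hypothetical; it mirrors only what the diff shows (the completion handler
receives the bio, derives uptodate from the status, and recovers the
buffer head), not how bh_submit() or bio_endio_bh() are implemented.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space mocks, not kernel structures. */
struct mock_bh {
	bool uptodate;
};

struct mock_bio {
	int status;				/* 0 means success */
	void *private;				/* carries the buffer head */
	void (*end_io)(struct mock_bio *bio);
};

/* Mock of recovering the buffer head inside the completion handler. */
static struct mock_bh *mock_bio_to_bh(struct mock_bio *bio)
{
	return bio->private;
}

/* New-style handler: takes the bio and derives 'uptodate' itself. */
static void mock_end_read(struct mock_bio *bio)
{
	bool uptodate = bio->status == 0;
	struct mock_bh *bh = mock_bio_to_bh(bio);

	bh->uptodate = uptodate;
}

/* Mock submit: installs the handler on the bio and "completes" it. */
static void mock_submit(struct mock_bh *bh, int status,
			void (*end_io)(struct mock_bio *bio))
{
	struct mock_bio bio = {
		.status = status,
		.private = bh,
		.end_io = end_io,
	};

	bio.end_io(&bio);	/* one indirect call on completion */
}
```

In this mock, completion goes straight from the bio to the filesystem
handler, with no second hop through a per-buffer callback pointer.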


* Re: [PATCH 13/34] ext4: Convert ext4_fc_submit_bh() to bh_submit()
From: Jan Kara @ 2026-05-27 10:41 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: Jan Kara, Christian Brauner, Christoph Hellwig, linux-fsdevel,
	linux-ext4
In-Reply-To: <20260525171931.4144395-14-willy@infradead.org>

On Mon 25-05-26 18:19:06, Matthew Wilcox (Oracle) wrote:
> Avoid an extra indirect function call by converting
> ext4_end_buffer_io_sync() from bh_end_io_t to bio_end_io_t and
> calling bh_submit().
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> Cc: linux-ext4@vger.kernel.org

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/ext4/fast_commit.c | 8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
> index b3c22636251d..d52c64adf416 100644
> --- a/fs/ext4/fast_commit.c
> +++ b/fs/ext4/fast_commit.c
> @@ -184,8 +184,11 @@
>  #include <trace/events/ext4.h>
>  static struct kmem_cache *ext4_fc_dentry_cachep;
>  
> -static void ext4_end_buffer_io_sync(struct buffer_head *bh, int uptodate)
> +static void ext4_end_buffer_io_sync(struct bio *bio)
>  {
> +	bool uptodate = bio->bi_status == BLK_STS_OK;
> +	struct buffer_head *bh = bio_endio_bh(bio);
> +
>  	BUFFER_TRACE(bh, "");
>  	if (uptodate) {
>  		ext4_debug("%s: Block %lld up-to-date",
> @@ -659,8 +662,7 @@ static void ext4_fc_submit_bh(struct super_block *sb, bool is_tail)
>  	lock_buffer(bh);
>  	set_buffer_dirty(bh);
>  	set_buffer_uptodate(bh);
> -	bh->b_end_io = ext4_end_buffer_io_sync;
> -	submit_bh(REQ_OP_WRITE | write_flags, bh);
> +	bh_submit(bh, REQ_OP_WRITE | write_flags, ext4_end_buffer_io_sync);
>  	EXT4_SB(sb)->s_fc_bh = NULL;
>  }
>  
> -- 
> 2.47.3
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH 14/34] ext4: Convert write_mmp_block_thawed() to bh_submit()
From: Jan Kara @ 2026-05-27 10:42 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: Jan Kara, Christian Brauner, Christoph Hellwig, linux-fsdevel,
	linux-ext4
In-Reply-To: <20260525171931.4144395-15-willy@infradead.org>

On Mon 25-05-26 18:19:07, Matthew Wilcox (Oracle) wrote:
> Avoid an extra indirect function call by using bh_submit() instead of
> submit_bh().
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> Cc: linux-ext4@vger.kernel.org

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/ext4/mmp.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/ext4/mmp.c b/fs/ext4/mmp.c
> index 6f57c181ff77..493528fbed75 100644
> --- a/fs/ext4/mmp.c
> +++ b/fs/ext4/mmp.c
> @@ -46,9 +46,9 @@ static int write_mmp_block_thawed(struct super_block *sb,
>  
>  	ext4_mmp_csum_set(sb, mmp);
>  	lock_buffer(bh);
> -	bh->b_end_io = end_buffer_write_sync;
>  	get_bh(bh);
> -	submit_bh(REQ_OP_WRITE | REQ_SYNC | REQ_META | REQ_PRIO, bh);
> +	bh_submit(bh, REQ_OP_WRITE | REQ_SYNC | REQ_META | REQ_PRIO,
> +			bh_end_write);
>  	wait_on_buffer(bh);
>  	if (unlikely(!buffer_uptodate(bh)))
>  		return -EIO;
> -- 
> 2.47.3
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH 15/34] ext4: Convert ext4_commit_super() to bh_submit()
From: Jan Kara @ 2026-05-27 10:42 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: Jan Kara, Christian Brauner, Christoph Hellwig, linux-fsdevel,
	linux-ext4
In-Reply-To: <20260525171931.4144395-16-willy@infradead.org>

On Mon 25-05-26 18:19:08, Matthew Wilcox (Oracle) wrote:
> Avoid an extra indirect function call by using bh_submit() instead of
> submit_bh().
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> Cc: linux-ext4@vger.kernel.org

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/ext4/super.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index fbe175951e01..905d66cbe3f2 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -6320,9 +6320,8 @@ static int ext4_commit_super(struct super_block *sb)
>  	get_bh(sbh);
>  	/* Clear potential dirty bit if it was journalled update */
>  	clear_buffer_dirty(sbh);
> -	sbh->b_end_io = end_buffer_write_sync;
> -	submit_bh(REQ_OP_WRITE | REQ_SYNC |
> -		  (test_opt(sb, BARRIER) ? REQ_FUA : 0), sbh);
> +	bh_submit(sbh, REQ_OP_WRITE | REQ_SYNC |
> +		  (test_opt(sb, BARRIER) ? REQ_FUA : 0), bh_end_write);
>  	wait_on_buffer(sbh);
>  	if (buffer_write_io_error(sbh)) {
>  		ext4_msg(sb, KERN_ERR, "I/O error while writing "
> -- 
> 2.47.3
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH v2] ext2: Remove deprecated DAX support
From: Ashwin Gundarapu @ 2026-05-27 10:53 UTC (permalink / raw)
  To: Jan Kara; +Cc: jack, linux-ext4, linux-kernel
In-Reply-To: <fxaiddid432ivmkcsqmzeovemmnkyh37nfgwn4xurxb2wx5u5y@lhkjib3tmd7e>

Thanks for the review, Jan. All the style issues you mentioned have
been addressed in v3 and v4:

v3: https://lore.kernel.org/linux-ext4/19e595ac3d0.1a0dcfbe128078.1031782761444069401@zohomail.in/
v4: https://lore.kernel.org/linux-ext4/19e5aa07c9b.3a2e576d130187.5289857983023045470@zohomail.in/

v3 fixed the spaces-to-tabs indentation issues in inode.c, super.c,
and file.c. It also restored Opt_dax for a graceful mount error message.

v4 changed Opt_xip and Opt_dax from -EINVAL to break with a warning,
per Sashiko AI review, to avoid potential boot failures on systems
with these options in /etc/fstab.
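
The behavioural difference between failing and warning can be sketched in
user space as follows. This is a hypothetical parser, not the ext2 code:
parse_opt() and its option strings are illustrative only.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical option parser. Returns 0 on success and -EINVAL for
 * unknown options. Removed options ("xip", "dax") produce a warning
 * and are otherwise ignored, so an old /etc/fstab entry does not make
 * the mount fail.
 */
static int parse_opt(const char *opt, unsigned int *warnings)
{
	if (!strcmp(opt, "xip") || !strcmp(opt, "dax")) {
		fprintf(stderr, "warning: \"%s\" is no longer supported, ignoring\n",
			opt);
		(*warnings)++;
		return 0;
	}
	if (!strcmp(opt, "ro") || !strcmp(opt, "rw"))
		return 0;
	return -EINVAL;
}
```

Returning -EINVAL from the first branch instead would correspond to the
earlier behaviour, where a stale option aborts the mount.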

The stray empty lines and tab indentation are all cleaned up in the
latest version (v4).

Thanks,
Ashwin


From: Jan Kara <jack@suse.cz>
To: "Ashwin Gundarapu"<linuxuser509@zohomail.in>
Cc: "jack"<jack@suse.com>, "linux-ext4"<linux-ext4@vger.kernel.org>, "linux-kernel"<linux-kernel@vger.kernel.org>
Date: Mon, 25 May 2026 22:00:40 +0530
Subject: Re: [PATCH v2] ext2: Remove deprecated DAX support

 > On Sun 24-05-26 11:08:53, Ashwin Gundarapu wrote: 
 > > 
 > > DAX support in ext2 was deprecated in commit d5a2693f93e4 
 > > ("ext2: Deprecate DAX") with a removal deadline of end of 2025. 
 > > Remove all DAX code from ext2 as scheduled. 
 > > 
 > > This removes the DAX mount option, IOMAP DAX support, DAX file 
 > > operations, DAX address_space_operations, and the DAX fault handler. 
 > > 
 > > Signed-off-by: Ashwin Gundarapu <linuxuser509@zohomail.in> 
 > > --- 
 > > v2: Removed unused sbi variable and fixed indentation as reported 
 > >     by kernel test robot. 
 >  
 > Thanks for the patch. Some style nits below. 
 >  
 > >  static ssize_t ext2_file_read_iter(struct kiocb *iocb, struct iov_iter *to) 
 > >  { 
 > > -#ifdef CONFIG_FS_DAX 
 > > -    if (IS_DAX(iocb->ki_filp->f_mapping->host)) 
 > > -        return ext2_dax_read_iter(iocb, to); 
 > > -#endif 
 > > + 
 >  
 > Stray empty line here. 
 >  
 > >      if (iocb->ki_flags & IOCB_DIRECT) 
 > >          return ext2_dio_read_iter(iocb, to); 
 > > 
 > > @@ -297,10 +188,7 @@ static ssize_t ext2_file_read_iter(struct kiocb *iocb, struct iov_iter *to) 
 > > 
 > >  static ssize_t ext2_file_write_iter(struct kiocb *iocb, struct iov_iter *from) 
 > >  { 
 > > -#ifdef CONFIG_FS_DAX 
 > > -    if (IS_DAX(iocb->ki_filp->f_mapping->host)) 
 > > -        return ext2_dax_write_iter(iocb, from); 
 > > -#endif 
 > > + 
 >  
 > ... and here. 
 >  
 > >      if (iocb->ki_flags & IOCB_DIRECT) 
 > >          return ext2_dio_write_iter(iocb, from); 
 > > 
 > > @@ -321,7 +209,7 @@ const struct file_operations ext2_file_operations = { 
 > >  #ifdef CONFIG_COMPAT 
 > >      .compat_ioctl    = ext2_compat_ioctl, 
 > >  #endif 
 > > -    .mmap_prepare    = ext2_file_mmap_prepare, 
 > > +    .mmap_prepare = generic_file_mmap_prepare, 
 >  
 > Please indent this with tab the same way as other methods. 
 >  
 > > @@ -841,10 +818,7 @@ static int ext2_iomap_begin(struct inode *inode, loff_t offset, loff_t length, 
 > > 
 > >      iomap->flags = 0; 
 > >      iomap->offset = (u64)first_block << blkbits; 
 > > -    if (flags & IOMAP_DAX) 
 > > -        iomap->dax_dev = sbi->s_daxdev; 
 > > -    else 
 > > -        iomap->bdev = inode->i_sb->s_bdev; 
 > > +        iomap->bdev = inode->i_sb->s_bdev; 
 >  
 > Indented with spaces instead of tabs. 
 >  
 > > @@ -1290,12 +1248,8 @@ static int ext2_setsize(struct inode *inode, loff_t newsize) 
 > > 
 > >      inode_dio_wait(inode); 
 > > 
 > > -    if (IS_DAX(inode)) 
 > > -        error = dax_truncate_page(inode, newsize, NULL, 
 > > -                      &ext2_iomap_ops); 
 > > -    else 
 > > -        error = block_truncate_page(inode->i_mapping, 
 > > -                newsize, ext2_get_block); 
 > > +        error = block_truncate_page(inode->i_mapping, 
 > > +                                newsize, ext2_get_block); 
 >  
 > Indented with spaces instead of tabs. 
 >  
 > >      if (error) 
 > >          return error; 
 > > 
 >  
 > ... 
 > > +        case Opt_xip: 
 > > +                ext2_msg_fc(fc, KERN_ERR, "DAX support has been removed. Please use ext4 instead."); 
 > > +                return -EINVAL; 
 >  
 > Indented with spaces instead of tabs. 
 >  
 > > @@ -992,16 +974,8 @@ static int ext2_fill_super(struct super_block *sb, struct fs_context *fc) 
 > >      } 
 > >      blocksize = BLOCK_SIZE << le32_to_cpu(sbi->s_es->s_log_block_size); 
 > > 
 > > -    if (test_opt(sb, DAX)) { 
 > > -        if (!sbi->s_daxdev) { 
 > > -            ext2_msg(sb, KERN_ERR, 
 > > -                "DAX unsupported by block device. Turning off DAX."); 
 > > -            clear_opt(sbi->s_mount_opt, DAX); 
 > > -        } else if (blocksize != PAGE_SIZE) { 
 > > -            ext2_msg(sb, KERN_ERR, "unsupported blocksize for DAX\n"); 
 > > -            clear_opt(sbi->s_mount_opt, DAX); 
 > > -        } 
 > > -    } 
 > > + 
 > > + 
 >  
 > Stray empty lines. 
 >  
 >                                 Honza 
 > -- 
 > Jan Kara <jack@suse.com> 
 > SUSE Labs, CR 
 >  
 > 



* Re: [PATCH 16/34] jbd2: Convert journal commit to bh_submit()
From: Jan Kara @ 2026-05-27 10:54 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: Jan Kara, Christian Brauner, Christoph Hellwig, linux-fsdevel,
	linux-ext4
In-Reply-To: <20260525171931.4144395-17-willy@infradead.org>

On Mon 25-05-26 18:19:09, Matthew Wilcox (Oracle) wrote:
> Avoid an extra indirect function call by using bh_submit()
> instead of submit_bh() in journal_submit_commit_record()
> and jbd2_journal_commit_transaction().  These both use
> journal_end_buffer_io_sync(), so it's more straightforward to do them
> both at once.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> Cc: linux-ext4@vger.kernel.org

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

Another note for future work here: The BH_Shadow handling looks like dead
code. We hold the buffer lock when writing out bh to the journal and we do
acquire the buffer lock in do_get_write_access() anyway (which is the only
place that checks for BH_Shadow), so the buffer_shadow() check should never
trigger. This needs checking, some more thought, and possibly slightly
expanding the area where the buffer lock is held in do_get_write_access(),
but it should be relatively low-hanging fruit. Then we can completely
remove BH_Shadow and use a generic IO completion function.

								Honza

> ---
>  fs/jbd2/commit.c | 13 +++++++------
>  1 file changed, 7 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
> index 8cf61e7185c4..38f318bb4279 100644
> --- a/fs/jbd2/commit.c
> +++ b/fs/jbd2/commit.c
> @@ -29,8 +29,10 @@
>  /*
>   * IO end handler for temporary buffer_heads handling writes to the journal.
>   */
> -static void journal_end_buffer_io_sync(struct buffer_head *bh, int uptodate)
> +static void journal_end_buffer_io_sync(struct bio *bio)
>  {
> +	bool uptodate = bio->bi_status == BLK_STS_OK;
> +	struct buffer_head *bh = bio_endio_bh(bio);
>  	struct buffer_head *orig_bh = bh->b_private;
>  
>  	BUFFER_TRACE(bh, "");
> @@ -147,13 +149,12 @@ static int journal_submit_commit_record(journal_t *journal,
>  	lock_buffer(bh);
>  	clear_buffer_dirty(bh);
>  	set_buffer_uptodate(bh);
> -	bh->b_end_io = journal_end_buffer_io_sync;
>  
>  	if (journal->j_flags & JBD2_BARRIER &&
>  	    !jbd2_has_feature_async_commit(journal))
>  		write_flags |= REQ_PREFLUSH | REQ_FUA;
>  
> -	submit_bh(write_flags, bh);
> +	bh_submit(bh, write_flags, journal_end_buffer_io_sync);
>  	*cbh = bh;
>  	return 0;
>  }
> @@ -751,9 +752,9 @@ void jbd2_journal_commit_transaction(journal_t *journal)
>  				lock_buffer(bh);
>  				clear_buffer_dirty(bh);
>  				set_buffer_uptodate(bh);
> -				bh->b_end_io = journal_end_buffer_io_sync;
> -				submit_bh(REQ_OP_WRITE | JBD2_JOURNAL_REQ_FLAGS,
> -					  bh);
> +				bh_submit(bh,
> +					REQ_OP_WRITE | JBD2_JOURNAL_REQ_FLAGS,
> +					journal_end_buffer_io_sync);
>  			}
>  			cond_resched();
>  
> -- 
> 2.47.3
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH 17/34] jbd2: Convert jbd2_write_superblock() to bh_submit()
From: Jan Kara @ 2026-05-27 10:54 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: Jan Kara, Christian Brauner, Christoph Hellwig, linux-fsdevel,
	linux-ext4
In-Reply-To: <20260525171931.4144395-18-willy@infradead.org>

On Mon 25-05-26 18:19:10, Matthew Wilcox (Oracle) wrote:
> Avoid an extra indirect function call by using bh_submit() instead of
> submit_bh().
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> Cc: linux-ext4@vger.kernel.org

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/jbd2/journal.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
> index 4f397fcdb13c..a6616380ce38 100644
> --- a/fs/jbd2/journal.c
> +++ b/fs/jbd2/journal.c
> @@ -1821,8 +1821,7 @@ static int jbd2_write_superblock(journal_t *journal, blk_opf_t write_flags)
>  	if (jbd2_journal_has_csum_v2or3(journal))
>  		sb->s_checksum = jbd2_superblock_csum(sb);
>  	get_bh(bh);
> -	bh->b_end_io = end_buffer_write_sync;
> -	submit_bh(REQ_OP_WRITE | write_flags, bh);
> +	bh_submit(bh, REQ_OP_WRITE | write_flags, bh_end_write);
>  	wait_on_buffer(bh);
>  	if (buffer_write_io_error(bh)) {
>  		clear_buffer_write_io_error(bh);
> -- 
> 2.47.3
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* Re: [PATCH 5/8] super: drop sb_lock from setup_bdev_super() tuple publication
From: Christian Brauner @ 2026-05-27 11:53 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Theodore Ts'o, Andreas Dilger, Jan Kara, Ritesh Harjani (IBM),
	linux-ext4, linux-cifs, Alexander Viro
In-Reply-To: <20260526-work-sget-v1-5-263f7025cedd@kernel.org>

>  	}
> -	spin_lock(&sb_lock);

Yeah, I failed to consider that we need to protect against a concurrent
sget_fc() call with a custom callback so we cannot reasonably drop this
lock.

> -	spin_unlock(&sb_lock);
> +		WRITE_ONCE(sb->s_iflags, sb->s_iflags | SB_I_STABLE_WRITES);

^ permalink raw reply

* Re: [PATCH 0/8] super: retire sget(), convert iterators to RCU
From: Christian Brauner @ 2026-05-27 11:54 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Theodore Ts'o, Andreas Dilger, Jan Kara, Ritesh Harjani (IBM),
	linux-ext4, linux-cifs, Alexander Viro
In-Reply-To: <20260526-work-sget-v1-0-263f7025cedd@kernel.org>

On Tue, May 26, 2026 at 05:09:02PM +0200, Christian Brauner wrote:
> * retire sget(): CIFS plus the two ext4 KUnit tests (extents-test,
> 
> * Walk @super_blocks and @type->fs_supers under RCU, pinned by

Can't work as I originally envisioned.

^ permalink raw reply

* Re: [PATCH 00/17] fs: replace __get_free_pages() call with kmalloc()
From: Christian Brauner @ 2026-05-27 12:05 UTC (permalink / raw)
  To: Jan Kara, Mark Fasheh, Joel Becker, Joseph Qi, Ryusuke Konishi,
	Viacheslav Dubeyko, Trond Myklebust, Anna Schumaker, Chuck Lever,
	Jeff Layton, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	Alexander Viro, Jan Kara, Dave Kleikamp, Theodore Ts'o,
	Miklos Szeredi, Andreas Hindborg, Breno Leitao, Kees Cook,
	Tigran A. Aivazian, Mike Rapoport (Microsoft)
  Cc: Christian Brauner, linux-kernel, linux-fsdevel, ocfs2-devel,
	linux-nilfs, linux-nfs, jfs-discussion, linux-ext4, linux-mm
In-Reply-To: <20260523-b4-fs-v1-0-275e36a83f0e@kernel.org>

On Sat, 23 May 2026 20:54:12 +0300, Mike Rapoport (Microsoft) wrote:
> This is a (small) part of larger work of replacing page allocator calls
> with kmalloc.
> 
> Also in git:
> https://git.kernel.org/pub/scm/linux/kernel/git/rppt/linux.git gfp-to-kmalloc/fs
> 
> 
> [...]

Applied to the vfs-7.2.misc branch of the vfs/vfs.git tree.
Patches in the vfs-7.2.misc branch should appear in linux-next soon.

Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.

It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.

Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs-7.2.misc

[01/17] quota: allocate dquot_hash with kmalloc()
        https://git.kernel.org/vfs/vfs/c/c94d1fa0af45
[02/17] proc: replace __get_free_page() with kmalloc()
        https://git.kernel.org/vfs/vfs/c/3c849e5fe1db
[03/17] ocfs2/dlm: replace __get_free_page() with kmalloc()
        https://git.kernel.org/vfs/vfs/c/40b7e5db6a25
[04/17] nilfs2: replace get_zeroed_page() with kzalloc()
        https://git.kernel.org/vfs/vfs/c/2abe95d9f56d
[05/17] NFS: replace __get_free_page() with kmalloc() in nfs_show_devname()
        https://git.kernel.org/vfs/vfs/c/75805c8f6d43
[06/17] NFS: remove unused page and page2 in nfs4_replace_transport()
        https://git.kernel.org/vfs/vfs/c/0d77bacd0eab
[07/17] NFSD: replace __get_free_page() with kmalloc() in nfsd_buffered_readdir()
        https://git.kernel.org/vfs/vfs/c/64f162f93a81
[08/17] libfs: simple_transaction_get(): replace get_zeroed_page() with kzalloc()
        https://git.kernel.org/vfs/vfs/c/5a3763a94e95
[09/17] jfs: replace __get_free_page() with kmalloc()
        https://git.kernel.org/vfs/vfs/c/d50250728dc1
[10/17] jbd2: replace __get_free_pages() with kmalloc()
        https://git.kernel.org/vfs/vfs/c/75c9377833a1
[11/17] isofs: replace __get_free_page() with kmalloc()
        https://git.kernel.org/vfs/vfs/c/95f2509040ac
[12/17] fuse: replace __get_free_page() with kmalloc()
        https://git.kernel.org/vfs/vfs/c/c78262429022
[13/17] fs/select: replace __get_free_page() with kmalloc()
        https://git.kernel.org/vfs/vfs/c/ac6aa4672cef
[14/17] fs/namespace: use __getname() to allocate mntpath buffer
        https://git.kernel.org/vfs/vfs/c/bd822134dcaf
[15/17] configfs: replace __get_free_pages() with kzalloc()
        https://git.kernel.org/vfs/vfs/c/32466534cba7
[16/17] binfmt_misc: replace __get_free_page() with kmalloc()
        https://git.kernel.org/vfs/vfs/c/df5f3ac3e999
[17/17] bfs: replace get_zeroed_page() with kzalloc()
        https://git.kernel.org/vfs/vfs/c/0a994e1ab090

^ permalink raw reply

* Re: [PATCH v4 09/23] ext4: implement writeback path using iomap
From: Ojaswin Mujoo @ 2026-05-27 12:49 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ritesh.list, djwong, hch, yi.zhang, yizhang089,
	yangerkun, yukuai
In-Reply-To: <20260511072344.191271-10-yi.zhang@huaweicloud.com>

On Mon, May 11, 2026 at 03:23:29PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> Add the iomap writeback path for ext4 buffered I/O. This introduces:
> 
>  - ext4_iomap_writepages(): the main writeback entry point.
>  - ext4_writeback_ops: a new iomap_writeback_ops instance to handle
>    block mapping and I/O submission.
>  - A new end I/O worker for converting unwritten extents, updating file
>    size, and handling DATA_ERR_ABORT after I/O completion.
> 
> Core implementation details:
> 
>  - ->writeback_range() callback
>    Calls ext4_iomap_map_writeback_range() to query the longest range of
>    existing mapped extents. For performance, when a block range is not
>    yet allocated, it allocates based on the writeback length and delalloc
>    extent length, rather than allocating for a single folio at a time.
>    The folio is then added to an iomap_ioend instance.
> 
>  - ->writeback_submit() callback
>    Registers ext4_iomap_end_bio() as the end bio callback. This callback
>    schedules a worker to handle:
>    - Unwritten extent conversion.
>    - i_disksize update after data is written back.
>    - Journal abort on writeback I/O failure.

Hi Zhang, the changes look good. I have a few comments below:
> 
> Key changes and considerations:
> 
> - Append write and unwritten extents
>   Since data=ordered mode is not used to prevent stale data exposure
>   during append writebacks, new blocks are always allocated as unwritten
>   extents (i.e. always enable dioread_nolock), and i_disksize update is
>   postponed until I/O completion. 

Makes sense.

>   Additionally, the deadlock that the
>   reserve handle was expected to resolve does not occur anymore.

I guess this is because we no longer use ordered data, so we can't block
on starting a txn in end io.

>   Therefore, the end I/O worker can start a normal journal handle
>   instead of a reserve handle when converting unwritten extents.
> 
> - Lock ordering
>   The ->writeback_range() callback runs under the folio lock, requiring
>   the journal handle to be started under that same lock. This reverses
>   the order compared to the buffer_head writeback path. The lock ordering
>   documentation in super.c has been updated accordingly.
> 
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> ---
>  fs/ext4/ext4.h        |   4 +
>  fs/ext4/inode.c       | 208 +++++++++++++++++++++++++++++++++++++++++-
>  fs/ext4/page-io.c     | 126 +++++++++++++++++++++++++
>  fs/ext4/super.c       |   7 +-
>  fs/iomap/ioend.c      |   3 +-
>  include/linux/iomap.h |   1 +
>  6 files changed, 346 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 4832e7f7db82..078feda47e36 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1173,6 +1173,8 @@ struct ext4_inode_info {
>  	 */
>  	struct list_head i_rsv_conversion_list;
>  	struct work_struct i_rsv_conversion_work;
> +	struct list_head i_iomap_ioend_list;
> +	struct work_struct i_iomap_ioend_work;
>  
>  	/*
>  	 * Transactions that contain inode's metadata needed to complete
> @@ -3870,6 +3872,8 @@ int ext4_bio_write_folio(struct ext4_io_submit *io, struct folio *page,
>  		size_t len);
>  extern struct ext4_io_end_vec *ext4_alloc_io_end_vec(ext4_io_end_t *io_end);
>  extern struct ext4_io_end_vec *ext4_last_io_end_vec(ext4_io_end_t *io_end);
> +extern void ext4_iomap_end_io(struct work_struct *work);
> +extern void ext4_iomap_end_bio(struct bio *bio);
>  
>  /* mmp.c */
>  extern int ext4_multi_mount_protect(struct super_block *, ext4_fsblk_t);
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 1ae7d3f4a1c8..a80195bd6f20 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -44,6 +44,7 @@
>  #include <linux/iversion.h>
>  
>  #include "ext4_jbd2.h"
> +#include "ext4_extents.h"
>  #include "xattr.h"
>  #include "acl.h"
>  #include "truncate.h"
> @@ -4120,10 +4121,215 @@ static void ext4_iomap_readahead(struct readahead_control *rac)
>  	iomap_bio_readahead(rac, &ext4_iomap_buffered_read_ops);
>  }
>  
> +static int ext4_iomap_map_one_extent(struct inode *inode,
> +				     struct ext4_map_blocks *map)
> +{
> +	struct extent_status es;
> +	handle_t *handle = NULL;
> +	int credits, map_flags;
> +	int retval;
> +
> +	credits = ext4_chunk_trans_blocks(inode, map->m_len);
> +	handle = ext4_journal_start(inode, EXT4_HT_WRITE_PAGE, credits);
> +	if (IS_ERR(handle))
> +		return PTR_ERR(handle);
> +
> +	map->m_flags = 0;
> +	/*
> +	 * It is necessary to look up extent and map blocks under i_data_sem
> +	 * in write mode, otherwise, the delalloc extent may become stale
> +	 * during concurrent truncate operations.
> +	 */
> +	ext4_fc_track_inode(handle, inode);
> +	down_write(&EXT4_I(inode)->i_data_sem);
> +	if (ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es, &map->m_seq)) {
> +		retval = es.es_len - (map->m_lblk - es.es_lblk);
> +		map->m_len = min_t(unsigned int, retval, map->m_len);
> +
> +		if (ext4_es_is_delayed(&es)) {

I understand that it is okay for us to rely on extent status ==
delayed here because we never reclaim delayed es entries and hence we
are sure to not skip any delayed block allocations here.

> +			map->m_flags |= EXT4_MAP_DELAYED;
> +			trace_ext4_da_write_pages_extent(inode, map);
> +			/*
> +			 * Call ext4_map_create_blocks() to allocate any
> +			 * delayed allocation blocks. It is possible that
> +			 * we're going to need more metadata blocks, however
> +			 * we must not fail because we're in writeback and
> +			 * there is nothing we can do so it might result in
> +			 * data loss. So use reserved blocks to allocate
> +			 * metadata if possible.
> +			 */
> +			map_flags = EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT |
> +				    EXT4_GET_BLOCKS_METADATA_NOFAIL |
> +				    EXT4_EX_NOCACHE;
> +
> +			retval = ext4_map_create_blocks(handle, inode, map,
> +							map_flags);
> +			if (retval > 0)
> +				ext4_fc_track_range(handle, inode, map->m_lblk,
> +						map->m_lblk + map->m_len - 1);
> +			goto out;
> +		} else if (unlikely(ext4_es_is_hole(&es)))

Now that you've fixed the partial invalidate in iomap (patch 12/23),
can we still hit this hole case?

> +			goto out;
> +
> +		/* Found written or unwritten extent. */
> +		map->m_pblk = ext4_es_pblock(&es) + map->m_lblk - es.es_lblk;
> +		map->m_flags = ext4_es_is_written(&es) ?
> +			       EXT4_MAP_MAPPED : EXT4_MAP_UNWRITTEN;
> +		goto out;
> +	}
> +
> +	retval = ext4_map_query_blocks(handle, inode, map, EXT4_EX_NOCACHE);
> +out:
> +	up_write(&EXT4_I(inode)->i_data_sem);
> +	ext4_journal_stop(handle);
> +	return retval < 0 ? retval : 0;
> +}
> +
> +static int ext4_iomap_map_writeback_range(struct iomap_writepage_ctx *wpc,
> +					  loff_t offset, unsigned int dirty_len)
> +{
> +	struct inode *inode = wpc->inode;
> +	struct super_block *sb = inode->i_sb;
> +	struct journal_s *journal = EXT4_SB(sb)->s_journal;
> +	struct ext4_map_blocks map;
> +	unsigned int blkbits = inode->i_blkbits;
> +	unsigned int index = offset >> blkbits;
> +	unsigned int blk_end, blk_len;
> +	int ret;
> +
> +	ret = ext4_emergency_state(sb);
> +	if (unlikely(ret))
> +		return ret;
> +
> +	/* Check validity of the cached writeback mapping. */
> +	if (offset >= wpc->iomap.offset &&
> +	    offset < wpc->iomap.offset + wpc->iomap.length &&
> +	    ext4_iomap_valid(inode, &wpc->iomap))
> +		return 0;
> +
> +	blk_len = dirty_len >> blkbits;
> +	blk_end = min_t(unsigned int, (wpc->wbc->range_end >> blkbits),
> +				      (UINT_MAX - 1));

This is an interesting idea. I'm just a bit worried about the case where
range_end == LLONG_MAX (bg flush): we will always be trying to allocate
MAX_WRITEPAGES_EXTENT_LEN blocks, and in case of a slightly fragmented FS
we might keep falling into the slower mballoc criteria and waste a lot of
time scanning the groups.
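
To make the concern concrete, here is a minimal userspace model of the
length computation in the quoted hunk. It is only a sketch, not kernel
code: writeback_map_len() and min_uint() are hypothetical helpers,
min_uint() stands in for min_t(unsigned int, ...), and the value 2048 for
MAX_WRITEPAGES_EXTENT_LEN is an assumption that should be checked against
the tree.

```c
#include <assert.h>
#include <limits.h>

/* Assumed value of the ext4 cap; verify against fs/ext4/inode.c. */
#define MAX_WRITEPAGES_EXTENT_LEN 2048u

/* Stand-in for min_t(unsigned int, a, b): both sides are truncated first. */
static unsigned int min_uint(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/*
 * Hypothetical model of how the quoted code picks map.m_len from the
 * dirty range and wbc->range_end. All names here are illustrative.
 */
static unsigned int writeback_map_len(long long offset, unsigned int dirty_len,
				      long long range_end, unsigned int blkbits)
{
	unsigned int index, blk_len, blk_end;

	assert(offset >= 0 && range_end >= 0 && blkbits < 32);

	index = (unsigned int)(offset >> blkbits);
	blk_len = dirty_len >> blkbits;
	blk_end = min_uint((unsigned int)(range_end >> blkbits), UINT_MAX - 1);

	/* Extend the request up to the end of the writeback range. */
	if (blk_end > index + blk_len)
		blk_len = blk_end - index + 1;

	/* The request is finally capped at the maximum extent length. */
	return min_uint(MAX_WRITEPAGES_EXTENT_LEN, blk_len);
}
```

In this model, a single dirty 4k block at offset 0 with range_end ==
LLONG_MAX yields a request of the full cap, while a bounded range_end
yields only the blocks up to that end. Note this is the requested length
before ext4_iomap_map_one_extent() trims it to the delalloc extent found
in the extent status tree.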

> +	if (blk_end > index + blk_len)
> +		blk_len = blk_end - index + 1;
> +
> +retry:
> +	map.m_lblk = index;
> +	map.m_len = min_t(unsigned int, MAX_WRITEPAGES_EXTENT_LEN, blk_len);
> +	ret = ext4_map_blocks(NULL, inode, &map,
> +			      EXT4_GET_BLOCKS_IO_SUBMIT | EXT4_EX_NOCACHE);

Do we really need the IO_SUBMIT flag here, given that:
1. We are not using ordered data.
2. We don't use it in ext4_iomap_map_one_extent() anyway.

I think we can drop it.

> +	if (ret < 0)
> +		return ret;
> +
> +	/*
> +	 * The map is not a delalloc extent, it must either be a hole
> +	 * or an extent which have already been allocated.
> +	 */
> +	if (!(map.m_flags & EXT4_MAP_DELAYED))
> +		goto out;
> +
> +	/* Map one delalloc extent. */
> +	ret = ext4_iomap_map_one_extent(inode, &map);
> +	if (ret < 0) {
> +		if (ext4_emergency_state(sb))
> +			return ret;
> +
> +		/*
> +		 * Retry transient ENOSPC errors, if
> +		 * ext4_count_free_blocks() is non-zero, a commit
> +		 * should free up blocks.
> +		 */
> +		if (ret == -ENOSPC && journal && ext4_count_free_clusters(sb)) {
> +			jbd2_journal_force_commit_nested(journal);
> +			goto retry;
> +		}
> +
> +		ext4_msg(sb, KERN_CRIT,
> +			 "Delayed block allocation failed for inode %llu at logical offset %llu with max blocks %u with error %d",
> +			 inode->i_ino, (unsigned long long)map.m_lblk,
> +			 (unsigned int)map.m_len, -ret);
> +		ext4_msg(sb, KERN_CRIT,
> +			 "This should not happen!! Data will be lost\n");
> +		if (ret == -ENOSPC)
> +			ext4_print_free_blocks(inode);
> +		return ret;
> +	}
> +out:
> +	ext4_set_iomap(inode, &wpc->iomap, &map, offset, dirty_len, 0);
> +	return 0;
> +}
> +

<snip>
> 

^ permalink raw reply

* Re: [PATCH v4 10/23] ext4: implement mmap path using iomap
From: Ojaswin Mujoo @ 2026-05-27 12:56 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ritesh.list, djwong, hch, yi.zhang, yizhang089,
	yangerkun, yukuai
In-Reply-To: <20260511072344.191271-11-yi.zhang@huaweicloud.com>

On Mon, May 11, 2026 at 03:23:30PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> Introduce ext4_iomap_page_mkwrite() to implement the mmap iomap path
> for ext4. The heavy lifting is delegated to iomap_page_mkwrite(), which
> only requires ext4_iomap_buffered_write_ops and
> ext4_iomap_buffered_da_write_ops to allocate and map blocks.
> 
> Note that the lock ordering between folio lock and transaction start in
> this path is reversed compared to the buffer_head buffered write path.
> The lock ordering documentation in super.c has been updated accordingly.
> 
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>

Looks good, feel free to add:
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>

Regards,
Ojaswin
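
For readers following the lock ordering change, the reversal described in
the commit message can be checked mechanically. The following is a small
userspace sketch, not kernel code: the two arrays copy the page fault
orderings from the super.c comment quoted below, while lock_pos() and
orders_conflict() are hypothetical helpers made up for illustration.
"transaction start" is treated as if it were a lock, which is a
simplification.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define N_FAULT_LOCKS 6

/* Page fault ordering of the buffer_head path (from the super.c comment). */
static const char *const bh_fault_order[N_FAULT_LOCKS] = {
	"mmap_lock", "sb_start_pagefault", "invalidate_lock",
	"transaction start", "folio lock", "i_data_sem",
};

/* Page fault ordering of the iomap path (from the super.c comment). */
static const char *const iomap_fault_order[N_FAULT_LOCKS] = {
	"mmap_lock", "sb_start_pagefault", "invalidate_lock",
	"folio lock", "transaction start", "i_data_sem",
};

/* Return the position of @name in @order, or -1 if it is not taken. */
static int lock_pos(const char *const *order, size_t n, const char *name)
{
	for (size_t i = 0; i < n; i++)
		if (strcmp(order[i], name) == 0)
			return (int)i;
	return -1;
}

/* True if some pair of locks is acquired in opposite order by @a and @b. */
static bool orders_conflict(const char *const *a, size_t na,
			    const char *const *b, size_t nb)
{
	assert(a && b);

	for (size_t i = 0; i < na; i++) {
		for (size_t j = i + 1; j < na; j++) {
			int pi = lock_pos(b, nb, a[i]);
			int pj = lock_pos(b, nb, a[j]);

			if (pi >= 0 && pj >= 0 && pi > pj)
				return true;
		}
	}
	return false;
}
```

Comparing the two arrays reports a conflict (folio lock vs. transaction
start), while comparing either ordering with itself does not.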

> ---
>  fs/ext4/inode.c | 32 +++++++++++++++++++++++++++++++-
>  fs/ext4/super.c |  8 ++++++--
>  2 files changed, 37 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index a80195bd6f20..c6fe42d012fc 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -4020,7 +4020,7 @@ static int ext4_iomap_buffered_do_write_begin(struct inode *inode,
>  		return -ERANGE;
>  	if (WARN_ON_ONCE(!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)))
>  		return -EINVAL;
> -	if (WARN_ON_ONCE(!(flags & IOMAP_WRITE)))
> +	if (WARN_ON_ONCE(!(flags & (IOMAP_WRITE | IOMAP_FAULT))))
>  		return -EINVAL;
>  
>  	if (delalloc)
> @@ -4080,6 +4080,14 @@ static int ext4_iomap_buffered_da_write_end(struct inode *inode, loff_t offset,
>  	if (iomap->type != IOMAP_DELALLOC || !(iomap->flags & IOMAP_F_NEW))
>  		return 0;
>  
> +	/*
> +	 * iomap_page_mkwrite() will never fail in a way that requires delalloc
> +	 * extents that it allocated to be revoked.  Hence never try to release
> +	 * them here.
> +	 */
> +	if (flags & IOMAP_FAULT)
> +		return 0;
> +
>  	/* Nothing to do if we've written the entire delalloc extent */
>  	start_byte = iomap_last_written_block(inode, offset, written);
>  	end_byte = round_up(offset + length, i_blocksize(inode));
> @@ -7191,6 +7199,23 @@ static int ext4_block_page_mkwrite(struct inode *inode, struct folio *folio,
>  	return ret;
>  }
>  
> +static vm_fault_t ext4_iomap_page_mkwrite(struct vm_fault *vmf)
> +{
> +	struct inode *inode = file_inode(vmf->vma->vm_file);
> +	const struct iomap_ops *iomap_ops;
> +
> +	/*
> +	 * ext4_nonda_switch() could writeback this folio, so have to
> +	 * call it before lock folio.
> +	 */
> +	if (test_opt(inode->i_sb, DELALLOC) && !ext4_nonda_switch(inode->i_sb))
> +		iomap_ops = &ext4_iomap_buffered_da_write_ops;
> +	else
> +		iomap_ops = &ext4_iomap_buffered_write_ops;
> +
> +	return iomap_page_mkwrite(vmf, iomap_ops, NULL);
> +}
> +
>  vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
>  {
>  	struct vm_area_struct *vma = vmf->vma;
> @@ -7213,6 +7238,11 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
>  
>  	filemap_invalidate_lock_shared(mapping);
>  
> +	if (ext4_inode_buffered_iomap(inode)) {
> +		ret = ext4_iomap_page_mkwrite(vmf);
> +		goto out;
> +	}
> +
>  	err = ext4_convert_inline_data(inode);
>  	if (err)
>  		goto out_ret;
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 51d87db53543..62bfe05a64bc 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -100,8 +100,12 @@ static const struct fs_parameter_spec ext4_param_specs[];
>   * Lock ordering
>   *
>   * page fault path:
> - * mmap_lock -> sb_start_pagefault -> invalidate_lock (r) -> transaction start
> - *   -> page lock -> i_data_sem (rw)
> + * - buffer_head path:
> + *   mmap_lock -> sb_start_pagefault -> invalidate_lock (r) ->
> + *     transaction start -> folio lock -> i_data_sem (rw)
> + * - iomap path:
> + *   mmap_lock -> sb_start_pagefault -> invalidate_lock (r) ->
> + *     folio lock -> transaction start -> i_data_sem (rw)
>   *
>   * buffered write path:
>   * sb_start_write -> i_rwsem (w) -> mmap_lock
> -- 
> 2.52.0
> 

^ permalink raw reply

* Re: [PATCH v4 14/23] ext4: implement partial block zero range path using iomap
From: Ojaswin Mujoo @ 2026-05-27 13:13 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ritesh.list, djwong, hch, yi.zhang, yizhang089,
	yangerkun, yukuai
In-Reply-To: <20260511072344.191271-15-yi.zhang@huaweicloud.com>

On Mon, May 11, 2026 at 03:23:34PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> Introduce a new iomap_ops instance, ext4_iomap_zero_ops, along with
> ext4_iomap_block_zero_range() to implement block zeroing via the iomap
> infrastructure for ext4.
> 
> ext4_iomap_block_zero_range() calls iomap_zero_range() with
> ext4_iomap_zero_begin() as the callback. The callback locates and zeros
> out either a mapped partial block or a dirty, unwritten partial block.
> 
> Important constraints:
> 
> Zeroing out under an active journal handle can cause deadlock, because
> the order of acquiring the folio lock and starting a handle is
> inconsistent with the iomap writeback path.
> 
> Therefore, ext4_iomap_block_zero_range():
> - Must NOT be called under an active handle.
> - Cannot rely on data=ordered mode to ensure zeroed data persistence
>   before updating i_disksize (for the cases of post-EOF append write,
>   post-EOF fallocate, and truncate up). In subsequent patches, we will
>   address this by synchronizing commit I/O but not waiting for
>   completion, and updating i_disksize to i_size only after the zeroed
>   data has been written back.
> 
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>

Looks good in itself. Feel free to add:

Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>

Regards,
Ojaswin
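
As a side note, the single-block constraint enforced by the
WARN_ON_ONCE() in ext4_block_iomap_zero_range() (quoted below) can be
modelled in userspace. This is only a sketch:
zero_range_in_one_block() is a hypothetical helper that returns true
exactly when the kernel check would not fire, and blkbits == 12 in the
examples assumes a 4k block size.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical model of the check: the byte range [from, from + length)
 * must start and end in the same filesystem block.
 */
static bool zero_range_in_one_block(long long from, long long length,
				    unsigned int blkbits)
{
	assert(from >= 0 && length > 0 && blkbits < 32);

	/* Compare the block index of the first and the last byte. */
	return (from >> blkbits) == ((from + length - 1) >> blkbits);
}
```

With 4k blocks, zeroing bytes 100..4095 stays within block 0 and passes,
whereas a two-byte range starting at offset 4095 spans blocks 0 and 1 and
would trip the warning.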

> ---
>  fs/ext4/inode.c | 92 +++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 92 insertions(+)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index c6fe42d012fc..e0dae2501292 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -4101,6 +4101,51 @@ static int ext4_iomap_buffered_da_write_end(struct inode *inode, loff_t offset,
>  	return 0;
>  }
>  
> +static int ext4_iomap_zero_begin(struct inode *inode,
> +		loff_t offset, loff_t length, unsigned int flags,
> +		struct iomap *iomap, struct iomap *srcmap)
> +{
> +	struct iomap_iter *iter = container_of(iomap, struct iomap_iter, iomap);
> +	struct ext4_map_blocks map;
> +	u8 blkbits = inode->i_blkbits;
> +	unsigned int iomap_flags = 0;
> +	int ret;
> +
> +	ret = ext4_emergency_state(inode->i_sb);
> +	if (unlikely(ret))
> +		return ret;
> +
> +	if (WARN_ON_ONCE(!(flags & IOMAP_ZERO)))
> +		return -EINVAL;
> +
> +	ret = ext4_iomap_map_blocks(inode, offset, length, NULL, &map);
> +	if (ret < 0)
> +		return ret;
> +
> +	/*
> +	 * Look up dirty folios for unwritten mappings within EOF. Providing
> +	 * this bypasses the flush iomap uses to trigger extent conversion
> +	 * when unwritten mappings have dirty pagecache in need of zeroing.
> +	 */
> +	if (map.m_flags & EXT4_MAP_UNWRITTEN) {
> +		loff_t start = ((loff_t)map.m_lblk) << blkbits;
> +		loff_t end = ((loff_t)map.m_lblk + map.m_len) << blkbits;
> +
> +		iomap_fill_dirty_folios(iter, &start, end, &iomap_flags);
> +		if ((start >> blkbits) < map.m_lblk + map.m_len)
> +			map.m_len = (start >> blkbits) - map.m_lblk;
> +	}
> +
> +	ext4_set_iomap(inode, iomap, &map, offset, length, flags);
> +	iomap->flags |= iomap_flags;
> +
> +	return 0;
> +}
> +
> +static const struct iomap_ops ext4_iomap_zero_ops = {
> +	.iomap_begin = ext4_iomap_zero_begin,
> +};
> +
>  /*
>   * Since we always allocate unwritten extents, there is no need for
>   * iomap_end to clean up allocated blocks on a short write.
> @@ -4616,6 +4661,47 @@ static int ext4_block_journalled_zero_range(struct inode *inode, loff_t from,
>  	return err;
>  }
>  
> +static int ext4_block_iomap_zero_range(struct inode *inode, loff_t from,
> +				       loff_t length, bool *did_zero,
> +				       bool *zero_written)
> +{
> +	int ret;
> +
> +	/*
> +	 * Zeroing out under an active handle can cause deadlock since
> +	 * the order of acquiring the folio lock and starting a handle is
> +	 * inconsistent with the iomap writeback procedure.
> +	 */
> +	if (WARN_ON_ONCE(ext4_handle_valid(journal_current_handle())))
> +		return -EINVAL;
> +
> +	/* The zeroing scope should not extend across a block. */
> +	if (WARN_ON_ONCE((from >> inode->i_blkbits) !=
> +			 ((from + length - 1) >> inode->i_blkbits)))
> +		return -EINVAL;
> +
> +	if (!(EXT4_SB(inode->i_sb)->s_mount_state & EXT4_ORPHAN_FS) &&
> +	    !(inode_state_read_once(inode) & (I_NEW | I_FREEING)))
> +		WARN_ON_ONCE(!inode_is_locked(inode) &&
> +			!rwsem_is_locked(&inode->i_mapping->invalidate_lock));
> +
> +	ret = iomap_zero_range(inode, from, length, did_zero,
> +			       &ext4_iomap_zero_ops, &ext4_iomap_write_ops,
> +			       NULL);
> +	if (ret)
> +		return ret;
> +
> +	/*
> +	 * TODO: The iomap does not distinguish between different types of
> +	 * zeroing and always sets zero_written if a zeroing operation is
> +	 * performed, which may result in unnecessary order operations.
> +	 */
> +	if (did_zero && zero_written)
> +		*zero_written = *did_zero;
> +
> +	return 0;
> +}
> +
>  /*
>   * Zeros out a mapping of length 'length' starting from file offset
>   * 'from'.  The range to be zero'd must be contained with in one block.
> @@ -4642,6 +4728,9 @@ static int ext4_block_zero_range(struct inode *inode,
>  	} else if (ext4_should_journal_data(inode)) {
>  		return ext4_block_journalled_zero_range(inode, from, length,
>  							did_zero);
> +	} else if (ext4_inode_buffered_iomap(inode)) {
> +		return ext4_block_iomap_zero_range(inode, from, length,
> +						   did_zero, zero_written);
>  	}
>  	return ext4_block_do_zero_range(inode, from, length, did_zero,
>  					zero_written);
> @@ -4682,6 +4771,9 @@ int ext4_block_zero_eof(struct inode *inode, loff_t from, loff_t end)
>  	 * truncating up or performing an append write, because there might be
>  	 * exposing stale on-disk data which may caused by concurrent post-EOF
>  	 * mmap write during folio writeback.
> +	 *
> +	 * TODO: In the iomap path, handle this by updating i_disksize to
> +	 * i_size after the zeroed data has been written back.
>  	 */
>  	if (ext4_should_order_data(inode) &&
>  	    did_zero && zero_written && !IS_DAX(inode)) {
> -- 
> 2.52.0
> 

^ permalink raw reply

* Re: [PATCH v4 15/23] ext4: add block mapping tracepoints for iomap buffered I/O path
From: Ojaswin Mujoo @ 2026-05-27 13:14 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ritesh.list, djwong, hch, yi.zhang, yizhang089,
	yangerkun, yukuai
In-Reply-To: <20260511072344.191271-16-yi.zhang@huaweicloud.com>

On Mon, May 11, 2026 at 03:23:35PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> Add tracepoints for iomap buffered read, write, partial block zeroing,
> and writeback operations to help debug the iomap buffered I/O path.
> 
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>

Looks good, feel free to add:

Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>

Regards,
Ojaswin

> ---
>  fs/ext4/inode.c             |  6 +++++
>  include/trace/events/ext4.h | 45 +++++++++++++++++++++++++++++++++++++
>  2 files changed, 51 insertions(+)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index e0dae2501292..239d387ffaf2 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3961,6 +3961,8 @@ static int ext4_iomap_buffered_read_begin(struct inode *inode, loff_t offset,
>  	if (ret < 0)
>  		return ret;
>  
> +	trace_ext4_iomap_buffered_read_begin(inode, &map, offset, length,
> +					     flags);
>  	ext4_set_iomap(inode, iomap, &map, offset, length, flags);
>  	return 0;
>  }
> @@ -4034,6 +4036,8 @@ static int ext4_iomap_buffered_do_write_begin(struct inode *inode,
>  	if (ret < 0)
>  		return ret;
>  
> +	trace_ext4_iomap_buffered_write_begin(inode, &map, offset, length,
> +					      flags);
>  	ext4_set_iomap(inode, iomap, &map, offset, length, flags);
>  	return 0;
>  }
> @@ -4136,6 +4140,7 @@ static int ext4_iomap_zero_begin(struct inode *inode,
>  			map.m_len = (start >> blkbits) - map.m_lblk;
>  	}
>  
> +	trace_ext4_iomap_zero_begin(inode, &map, offset, length, flags);
>  	ext4_set_iomap(inode, iomap, &map, offset, length, flags);
>  	iomap->flags |= iomap_flags;
>  
> @@ -4308,6 +4313,7 @@ static int ext4_iomap_map_writeback_range(struct iomap_writepage_ctx *wpc,
>  		return ret;
>  	}
>  out:
> +	trace_ext4_iomap_map_writeback_range(inode, &map, offset, dirty_len, 0);
>  	ext4_set_iomap(inode, &wpc->iomap, &map, offset, dirty_len, 0);
>  	return 0;
>  }
> diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
> index f493642cf121..ebafa06cd191 100644
> --- a/include/trace/events/ext4.h
> +++ b/include/trace/events/ext4.h
> @@ -3096,6 +3096,51 @@ TRACE_EVENT(ext4_move_extent_exit,
>  		  __entry->ret)
>  );
>  
> +DECLARE_EVENT_CLASS(ext4_set_iomap_class,
> +	TP_PROTO(struct inode *inode, struct ext4_map_blocks *map,
> +		 loff_t offset, loff_t length, unsigned int flags),
> +	TP_ARGS(inode, map, offset, length, flags),
> +	TP_STRUCT__entry(
> +		__field(dev_t, dev)
> +		__field(u64, ino)
> +		__field(ext4_lblk_t, m_lblk)
> +		__field(unsigned int, m_len)
> +		__field(unsigned int, m_flags)
> +		__field(u64, m_seq)
> +		__field(loff_t, offset)
> +		__field(loff_t, length)
> +		__field(unsigned int, iomap_flags)
> +	),
> +	TP_fast_assign(
> +		__entry->dev		= inode->i_sb->s_dev;
> +		__entry->ino		= inode->i_ino;
> +		__entry->m_lblk		= map->m_lblk;
> +		__entry->m_len		= map->m_len;
> +		__entry->m_flags	= map->m_flags;
> +		__entry->m_seq		= map->m_seq;
> +		__entry->offset		= offset;
> +		__entry->length		= length;
> +		__entry->iomap_flags	= flags;
> +
> +	),
> +	TP_printk("dev %d:%d ino %llu m_lblk %u m_len %u m_flags %s m_seq %llu orig_off 0x%llx orig_len 0x%llx iomap_flags 0x%x",
> +		  MAJOR(__entry->dev), MINOR(__entry->dev),
> +		  __entry->ino, __entry->m_lblk, __entry->m_len,
> +		  show_mflags(__entry->m_flags), __entry->m_seq,
> +		  __entry->offset, __entry->length, __entry->iomap_flags)
> +)
> +
> +#define DEFINE_SET_IOMAP_EVENT(name) \
> +DEFINE_EVENT(ext4_set_iomap_class, name, \
> +	TP_PROTO(struct inode *inode, struct ext4_map_blocks *map, \
> +		 loff_t offset, loff_t length, unsigned int flags), \
> +	TP_ARGS(inode, map, offset, length, flags))
> +
> +DEFINE_SET_IOMAP_EVENT(ext4_iomap_buffered_read_begin);
> +DEFINE_SET_IOMAP_EVENT(ext4_iomap_buffered_write_begin);
> +DEFINE_SET_IOMAP_EVENT(ext4_iomap_map_writeback_range);
> +DEFINE_SET_IOMAP_EVENT(ext4_iomap_zero_begin);
> +
>  #endif /* _TRACE_EXT4_H */
>  
>  /* This part must be outside protection */
> -- 
> 2.52.0
> 

^ permalink raw reply

* Re: [PATCH v4 16/23] ext4: disable online defrag when inode using iomap buffered I/O path
From: Ojaswin Mujoo @ 2026-05-27 13:14 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ritesh.list, djwong, hch, yi.zhang, yizhang089,
	yangerkun, yukuai
In-Reply-To: <20260511072344.191271-17-yi.zhang@huaweicloud.com>

On Mon, May 11, 2026 at 03:23:36PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> Online defragmentation does not currently support inodes using the
> iomap buffered I/O path. The existing implementation relies on
> buffer_head for sub-folio block management and data=ordered mode for
> data consistency, both of which are incompatible with the iomap path.
> 
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>

Looks good, feel free to add:
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>

Regards,
Ojaswin

> ---
>  fs/ext4/move_extent.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/fs/ext4/move_extent.c b/fs/ext4/move_extent.c
> index 3329b7ad5dbd..f707a1096544 100644
> --- a/fs/ext4/move_extent.c
> +++ b/fs/ext4/move_extent.c
> @@ -476,6 +476,17 @@ static int mext_check_validity(struct inode *orig_inode,
>  		return -EOPNOTSUPP;
>  	}
>  
> +	/*
> +	 * TODO: support online defrag for inodes that using the buffered
> +	 * I/O iomap path.
> +	 */
> +	if (ext4_inode_buffered_iomap(orig_inode) ||
> +	    ext4_inode_buffered_iomap(donor_inode)) {
> +		ext4_msg(sb, KERN_ERR,
> +			 "Online defrag not supported for inode with iomap buffered IO path");
> +		return -EOPNOTSUPP;
> +	}
> +
>  	if (donor_inode->i_mode & (S_ISUID|S_ISGID)) {
>  		ext4_debug("ext4 move extent: suid or sgid is set to donor file [ino:orig %llu, donor %llu]\n",
>  			   orig_inode->i_ino, donor_inode->i_ino);
> -- 
> 2.52.0
> 

^ permalink raw reply

* Re: [PATCH v4 17/23] ext4: submit zeroed post-EOF data immediately in the iomap buffered I/O path
From: Ojaswin Mujoo @ 2026-05-27 13:41 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ritesh.list, djwong, hch, yi.zhang, yizhang089,
	yangerkun, yukuai
In-Reply-To: <20260511072344.191271-18-yi.zhang@huaweicloud.com>

On Mon, May 11, 2026 at 03:23:37PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> In the generic buffer_head I/O path, we rely on the data=ordered mode to
> ensure that the zeroed EOF block data is written before updating
> i_disksize, thus preventing stale data from being exposed.
> 
> However, the iomap buffered I/O path cannot use this mechanism. Instead,
> we issue the I/O immediately after performing the zero operation
> (without synchronous waiting for performance). This can reduce the risk
> of exposing stale data, but it does not guarantee that the zero data
> will be flushed to disk before the metadata of i_disksize is updated.
> The subsequent patches will wait for this I/O to complete before
> updating i_disksize.
> 
> Suggested-by: Jan Kara <jack@suse.cz>
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>

I think we discussed that we may not need to do this [1], but I guess
you've decided to make the tradeoff of issuing the IO to avoid having to
wait for bg flush to complete the tail page zeroing.

However, one side effect might be many threads calling into the
writeback mechanism to issue zero IOs, which might not scale well. I
don't know if it'll be a huge problem, though; I guess it's the sort of
thing we will have to deal with if we see it in real-world workloads.

[1] https://lore.kernel.org/linux-ext4/yhy4cgc4fnk7tzfejuhy6m6ljo425ebpg6khss6vtvpidg6lyp@5xcyabxrl6zm/

> ---
>  fs/ext4/inode.c | 66 ++++++++++++++++++++++++++++++++++++++++---------
>  1 file changed, 55 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 239d387ffaf2..e013aeb03d7b 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -4742,6 +4742,32 @@ static int ext4_block_zero_range(struct inode *inode,
>  					zero_written);
>  }
>  
> +static int ext4_iomap_submit_zero_block(struct inode *inode,
> +					loff_t from, loff_t end)
> +{
> +	struct address_space *mapping = inode->i_mapping;
> +	struct folio *folio;
> +	bool do_submit = false;
> +
> +	folio = filemap_lock_folio(mapping, from >> PAGE_SHIFT);
> +	if (IS_ERR(folio))
> +		/* Already writeback and clear? */
> +		return PTR_ERR(folio) == -ENOENT ? 0 : PTR_ERR(folio);
> +
> +	folio_wait_writeback(folio);
> +	WARN_ON_ONCE(folio_test_writeback(folio));
> +
> +	if (likely(folio_test_dirty(folio)))
> +		do_submit = true;
> +	folio_unlock(folio);
> +	folio_put(folio);
> +
> +	/* Submit zeroed block. */
> +	if (do_submit)
> +		return filemap_fdatawrite_range(mapping, from, end - 1);
> +	return 0;
> +}
> +
>  /*
>   * Zero out a mapping from file offset 'from' up to the end of the block
>   * which corresponds to 'from' or to the given 'end' inside this block.
> @@ -4765,8 +4791,10 @@ int ext4_block_zero_eof(struct inode *inode, loff_t from, loff_t end)
>  	if (IS_ENCRYPTED(inode) && !fscrypt_has_encryption_key(inode))
>  		return 0;
>  
> -	if (length > blocksize - offset)
> +	if (length > blocksize - offset) {
>  		length = blocksize - offset;
> +		end = from + length;
> +	}
>  
>  	err = ext4_block_zero_range(inode, from, length,
>  				    &did_zero, &zero_written);
> @@ -4781,18 +4809,34 @@ int ext4_block_zero_eof(struct inode *inode, loff_t from, loff_t end)
>  	 * TODO: In the iomap path, handle this by updating i_disksize to
>  	 * i_size after the zeroed data has been written back.
>  	 */
> -	if (ext4_should_order_data(inode) &&
> -	    did_zero && zero_written && !IS_DAX(inode)) {
> -		handle_t *handle;
> +	if (did_zero && zero_written && !IS_DAX(inode)) {
> +		if (ext4_should_order_data(inode)) {
> +			handle_t *handle;
>  
> -		handle = ext4_journal_start(inode, EXT4_HT_MISC, 1);
> -		if (IS_ERR(handle))
> -			return PTR_ERR(handle);
> +			handle = ext4_journal_start(inode, EXT4_HT_MISC, 1);
> +			if (IS_ERR(handle))
> +				return PTR_ERR(handle);
>  
> -		err = ext4_jbd2_inode_add_write(handle, inode, from, length);
> -		ext4_journal_stop(handle);
> -		if (err)
> -			return err;
> +			err = ext4_jbd2_inode_add_write(handle, inode, from,
> +							length);
> +			ext4_journal_stop(handle);
> +			if (err)
> +				return err;
> +		/*
> +		 * inodes using the iomap buffered I/O path do not use the
> +		 * data=ordered mode. We submit zeroed range directly here.
> +		 * Do not wait for I/O completion for performance.
> +		 *
> +		 * TODO: Any operation that extends i_disksize (including
> +		 * append write end io past the zeroed boundary, truncate up,
> +		 * and append fallocate) must wait for the relevant I/O to
> +		 * complete before updating i_disksize.
> +		 */
> +		} else if (ext4_inode_buffered_iomap(inode)) {
> +			err = ext4_iomap_submit_zero_block(inode, from, end);
> +			if (err)
> +				return err;
> +		}
>  	}
>  
>  	return 0;
> -- 
> 2.52.0
> 

^ permalink raw reply

* Re: [PATCH v4 18/23] ext4: wait for ordered I/O in the iomap buffered I/O path
From: Ojaswin Mujoo @ 2026-05-27 15:58 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ritesh.list, djwong, hch, yi.zhang, yizhang089,
	yangerkun, yukuai
In-Reply-To: <20260511072344.191271-19-yi.zhang@huaweicloud.com>

On Mon, May 11, 2026 at 03:23:38PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> For append writes, wait for ordered I/O to complete before updating
> i_disksize. This ensures that zeroed data is flushed to disk before the
> metadata update, preventing stale data from being exposed during
> unaligned post-EOF append writes.
> 
> Suggested-by: Jan Kara <jack@suse.cz>
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> ---
>  fs/ext4/ext4.h    | 11 +++++++
>  fs/ext4/inode.c   | 80 ++++++++++++++++++++++++++++++++++++++++++-----
>  fs/ext4/page-io.c | 60 +++++++++++++++++++++++++++++++++++
>  fs/ext4/super.c   | 23 ++++++++++----
>  4 files changed, 161 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 078feda47e36..9ce2128eea3e 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1195,6 +1195,15 @@ struct ext4_inode_info {
>  #ifdef CONFIG_FS_ENCRYPTION
>  	struct fscrypt_inode_info *i_crypt_info;
>  #endif
> +
> +	/*
> +	 * Track ordered zeroed data during post-EOF append writes, fallocate,
> +	 * and truncate-up operations. These parameters are used only in the
> +	 * iomap buffered I/O path.
> +	 */
> +	ext4_lblk_t i_ordered_lblk;
> +	ext4_lblk_t i_ordered_len;
> +	wait_queue_head_t i_ordered_wq;
>  };
>  
>  /*
> @@ -3858,6 +3867,8 @@ extern int ext4_move_extents(struct file *o_filp, struct file *d_filp,
>  			     __u64 len, __u64 *moved_len);
>  
>  /* page-io.c */
> +#define EXT4_IOMAP_IOEND_ORDER_IO	1UL	/* This I/O is an ordered one */
> +
>  extern int __init ext4_init_pageio(void);
>  extern void ext4_exit_pageio(void);
>  extern ext4_io_end_t *ext4_init_io_end(struct inode *inode, gfp_t flags);
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index e013aeb03d7b..11fb369efeb1 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -4345,6 +4345,7 @@ static int ext4_iomap_writeback_submit(struct iomap_writepage_ctx *wpc,
>  {
>  	struct iomap_ioend *ioend = wpc->wb_ctx;
>  	struct ext4_inode_info *ei = EXT4_I(ioend->io_inode);
> +	ext4_lblk_t start, end, order_lblk, order_len;
>  
>  	/*
>  	 * After I/O completion, a worker needs to be scheduled when:
> @@ -4357,6 +4358,30 @@ static int ext4_iomap_writeback_submit(struct iomap_writepage_ctx *wpc,
>  	    test_opt(ioend->io_inode->i_sb, DATA_ERR_ABORT))
>  		ioend->io_bio.bi_end_io = ext4_iomap_end_bio;
>  
> +	/*
> +	 * Mark the I/O as ordered. Ordered I/O requires separate endio
> +	 * handling and must not be merged with regular I/O operations.
> +	 */
> +	order_len = READ_ONCE(ei->i_ordered_len);
> +	if (order_len) {
> +		/*
> +		 * Pair with smp_store_release() in ext4_block_zero_eof().
> +		 * Ensure we see the updated i_ordered_lblk that was written
> +		 * before the release store to i_ordered_len.
> +		 */
> +		smp_rmb();
> +		order_lblk = READ_ONCE(ei->i_ordered_lblk);
> +		start = ioend->io_offset >> ioend->io_inode->i_blkbits;
> +		end = EXT4_B_TO_LBLK(ioend->io_inode,
> +				     ioend->io_offset + ioend->io_size);
> +
> +		if (start <= order_lblk && end >= order_lblk + order_len) {

Hi Zhang,

I guess this check is enough because ordered_lblk and ordered_len will
always be contained in a single block.
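
As a rough userspace illustration of the containment test quoted above
(not the kernel code itself): with a one-block ordered range, the check
reduces to "does the ioend's block range include that block". The names
lblk_t and range_covers_ordered() are hypothetical stand-ins, and `end`
is assumed to be an exclusive block index, as the round-up in
EXT4_B_TO_LBLK() suggests.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t lblk_t;	/* hypothetical stand-in for ext4_lblk_t */

/*
 * Return true if the block range [start, end) fully contains the
 * ordered range [lblk, lblk + len). The sum is widened to 64 bits so
 * that lblk + len cannot wrap in this sketch.
 */
static bool range_covers_ordered(lblk_t start, lblk_t end,
				 lblk_t lblk, lblk_t len)
{
	assert(len > 0);	/* callers only check a non-empty range */
	return start <= lblk && (uint64_t)end >= (uint64_t)lblk + len;
}
```

With len == 1, an ioend covering blocks [2, 3) matches an ordered block
at 2, while one covering [3, 4) or [0, 2) does not.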

> +			ioend->io_bio.bi_end_io = ext4_iomap_end_bio;
> +			ioend->io_private = (void *)EXT4_IOMAP_IOEND_ORDER_IO;
> +			ioend->io_flags |= IOMAP_IOEND_BOUNDARY;

FWIU, we want the ordered IO to not be merged and to be submitted ASAP
since we want to wake up the waiters. Is there any other reason?

Adding the boundary flag in ->writeback_submit() only affects
iomap_ioend_can_merge(), which happens after we have woken up the waiters
and deferred the IO to the wq. Ideally we want it to affect
iomap_can_add_to_ioend(), i.e. we need to add IOMAP_F_BOUNDARY in
->writeback_range().

Secondly, I don't think boundary is the right flag here. It ensures
that everything before the ordered iomap gets submitted and that the
ordered iomap starts a new ioend. That ioend can still keep getting
merged with newer ioends until we decide to submit the IO, which can
delay waking up the waiters. If we really want the "no merge" behavior,
we'll have to do something like [1] (check the 2 NOMERGE flag patches).

> +		}
> +	}
> +
>  	return iomap_ioend_writeback_submit(wpc, error);
>  }
>  
> @@ -4746,8 +4771,10 @@ static int ext4_iomap_submit_zero_block(struct inode *inode,
>  					loff_t from, loff_t end)
>  {
>  	struct address_space *mapping = inode->i_mapping;
> +	struct ext4_inode_info *ei = EXT4_I(inode);
>  	struct folio *folio;
>  	bool do_submit = false;
> +	int ret;
>  
>  	folio = filemap_lock_folio(mapping, from >> PAGE_SHIFT);
>  	if (IS_ERR(folio))
> @@ -4757,14 +4784,50 @@ static int ext4_iomap_submit_zero_block(struct inode *inode,
>  	folio_wait_writeback(folio);
>  	WARN_ON_ONCE(folio_test_writeback(folio));
>  
> -	if (likely(folio_test_dirty(folio)))
> +	/*
> +	 * Mark the ordered range. It will be cleared upon I/O completion
> +	 * in ext4_iomap_end_bio(). Any operation that extends i_disksize
> +	 * (including append write end io past the zeroed boundary,
> +	 * truncate up and append fallocate) must wait for this I/O to
> +	 * complete before updating i_disksize.
> +	 *
> +	 * When multiple overlapping unaligned EOF writes are in flight, we
> +	 * only need to track and wait for the first one. Subsequent writes
> +	 * will zero the gap in memory and ensure that the zeroed data is
> +	 * written out along with the valid data in the same block before
> +	 * i_disksize is updated.
> +	 */
> +	if (likely(folio_test_dirty(folio) &&
> +		   READ_ONCE(ei->i_ordered_len) == 0)) {
> +		WRITE_ONCE(ei->i_ordered_lblk,
> +			   from >> inode->i_blkbits);
> +		/*
> +		 * Pairs with smp_rmb() in ext4_iomap_writeback_submit()
> +		 * and ext4_iomap_wb_ordered_wait(). Ensure the updated
> +		 * i_ordered_lblk is visible when i_ordered_len becomes
> +		 * non-zero.
> +		 */
> +		smp_store_release(&ei->i_ordered_len, 1);
>  		do_submit = true;
> +	}
>  	folio_unlock(folio);
>  	folio_put(folio);
>  
>  	/* Submit zeroed block. */
> -	if (do_submit)
> -		return filemap_fdatawrite_range(mapping, from, end - 1);
> +	if (do_submit) {
> +		ret = filemap_fdatawrite_range(mapping, from, end - 1);
> +		if (ret) {
> +			/*
> +			 * Pairs with wait_event() in
> +			 * ext4_iomap_wb_ordered_wait(). Ensure
> +			 * i_ordered_len = 0 is visible before waking up
> +			 * waiters.
> +			 */
> +			smp_store_release(&ei->i_ordered_len, 0);
> +			wake_up_all(&ei->i_ordered_wq);
> +			return ret;
> +		}
> +	}
>  	return 0;
>  }
>  
> @@ -4827,10 +4890,13 @@ int ext4_block_zero_eof(struct inode *inode, loff_t from, loff_t end)
>  		 * data=ordered mode. We submit zeroed range directly here.
>  		 * Do not wait for I/O completion for performance.
>  		 *
> -		 * TODO: Any operation that extends i_disksize (including
> -		 * append write end io past the zeroed boundary, truncate up,
> -		 * and append fallocate) must wait for the relevant I/O to
> -		 * complete before updating i_disksize.
> +		 * The end_io handler ext4_iomap_wb_ordered_wait() will wait
> +		 * for I/O completion before updating i_disksize if the write
> +		 * extends beyond the zeroed boundary.
> +		 *
> +		 * TODO: Any other operation that extends i_disksize
> +		 * (including truncate up and append fallocate) must wait for
> +		 * the relevant I/O to complete before updating i_disksize.
>  		 */
>  		} else if (ext4_inode_buffered_iomap(inode)) {
>  			err = ext4_iomap_submit_zero_block(inode, from, end);
> diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
> index 3050c887329f..ad05ebb49bf6 100644
> --- a/fs/ext4/page-io.c
> +++ b/fs/ext4/page-io.c
> @@ -613,6 +613,46 @@ int ext4_bio_write_folio(struct ext4_io_submit *io, struct folio *folio,
>  	return 0;
>  }
>  
> +/*
> + * If the old disk size is not block size aligned and the current
> + * writeback range is entirely beyond the old EOF block, we should
> + * wait for the zeroed data written in ext4_block_zero_eof() to be
> + * written out, otherwise, it may expose stale data in that block.
> + */
> +static void ext4_iomap_wb_ordered_wait(struct inode *inode,
> +				       loff_t pos, loff_t end)
> +{
> +	struct ext4_inode_info *ei = EXT4_I(inode);
> +	unsigned int blocksize = i_blocksize(inode);
> +	loff_t disksize = READ_ONCE(ei->i_disksize);
> +	ext4_lblk_t order_lblk, order_len;
> +
> +	/*
> +	 * Waiting for ordered I/O is unnecessary when:
> +	 * - The on-disk size is block-aligned (no stale data exists).
> +	 * - The write start is within the block of the old EOF
> +	 *   (overwriting, or appending to a block that already contains
> +	 *   valid data).
> +	 */
> +	if (!(disksize & (blocksize - 1)) ||
> +	    pos < round_up(disksize, blocksize))
> +		return;
> +
> +	order_len = READ_ONCE(ei->i_ordered_len);
> +	if (!order_len)
> +		return;
> +
> +	/*
> +	 * Pair with smp_store_release() in ext4_iomap_end_bio() and
> +	 * ext4_block_zero_eof(). Ensure we see the updated i_ordered_lblk
> +	 * that was written before the release store to i_ordered_len.
> +	 */
> +	smp_rmb();
> +	order_lblk = READ_ONCE(ei->i_ordered_lblk);
> +	if ((pos >> inode->i_blkbits) >= order_lblk + order_len)
> +		wait_event(ei->i_ordered_wq, READ_ONCE(ei->i_ordered_len) == 0);
> +}
> +
>  static int ext4_iomap_wb_update_disksize(handle_t *handle, struct inode *inode,
>  					 loff_t end)
>  {
> @@ -656,6 +696,9 @@ static void ext4_iomap_finish_ioend(struct iomap_ioend *ioend)
>  		goto out;
>  	}
>  
> +	/* Wait ordered zero data to be written out. */
> +	ext4_iomap_wb_ordered_wait(inode, pos, pos + size);
> +
>  	/* We may need to convert one extent and dirty the inode. */
>  	credits = ext4_chunk_trans_blocks(inode,
>  			EXT4_MAX_BLOCKS(size, pos, inode->i_blkbits));
> @@ -717,8 +760,25 @@ void ext4_iomap_end_bio(struct bio *bio)
>  	struct inode *inode = ioend->io_inode;
>  	struct ext4_inode_info *ei = EXT4_I(inode);
>  	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
> +	unsigned long io_mode = (unsigned long)ioend->io_private;
>  	unsigned long flags;
>  
> +	/*
> +	 * This is an ordered I/O, clear the ordered range set in
> +	 * ext4_block_zero_eof() and wake up all waiters that will update
> +	 * the inode i_disksize.
> +	 */
> +	if (io_mode == EXT4_IOMAP_IOEND_ORDER_IO) {
> +		/*
> +		 * Pairs with wait_event() in ext4_iomap_wb_ordered_wait().
> +		 * Ensure i_ordered_len = 0 is visible before waking up
> +		 * waiters.
> +		 */
> +		smp_store_release(&ei->i_ordered_len, 0);
> +		wake_up_all(&ei->i_ordered_wq);
> +		goto defer;
> +	}
> +
>  	/* Needs to convert unwritten extents or update the i_disksize. */
>  	if ((ioend->io_flags & IOMAP_IOEND_UNWRITTEN) ||
>  	    ioend->io_offset + ioend->io_size > READ_ONCE(ei->i_disksize))
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 62bfe05a64bc..9c0a00e716f3 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -1444,6 +1444,9 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
>  	ext4_fc_init_inode(&ei->vfs_inode);
>  	spin_lock_init(&ei->i_fc_lock);
>  	mmb_init(&ei->i_metadata_bhs, &ei->vfs_inode.i_data);
> +	ei->i_ordered_lblk = 0;
> +	ei->i_ordered_len = 0;
> +	init_waitqueue_head(&ei->i_ordered_wq);
>  	return &ei->vfs_inode;
>  }
>  
> @@ -1480,12 +1483,20 @@ static void ext4_destroy_inode(struct inode *inode)
>  		dump_stack();
>  	}
>  
> -	if (!(EXT4_SB(inode->i_sb)->s_mount_state & EXT4_ERROR_FS) &&
> -	    WARN_ON_ONCE(EXT4_I(inode)->i_reserved_data_blocks))
> -		ext4_msg(inode->i_sb, KERN_ERR,
> -			 "Inode %llu (%p): i_reserved_data_blocks (%u) not cleared!",
> -			 inode->i_ino, EXT4_I(inode),
> -			 EXT4_I(inode)->i_reserved_data_blocks);
> +	if (!(EXT4_SB(inode->i_sb)->s_mount_state & EXT4_ERROR_FS)) {
> +		if (WARN_ON_ONCE(EXT4_I(inode)->i_reserved_data_blocks))
> +			ext4_msg(inode->i_sb, KERN_ERR,
> +				 "Inode %llu (%p): i_reserved_data_blocks (%u) not cleared!",
> +				 inode->i_ino, EXT4_I(inode),
> +				 EXT4_I(inode)->i_reserved_data_blocks);
> +
> +		if (WARN_ON_ONCE(EXT4_I(inode)->i_ordered_len))
> +			ext4_msg(inode->i_sb, KERN_ERR,
> +				 "Inode %llu (%p): i_ordered_lblk (%u) and i_ordered_len (%u) not cleared!",
> +				 inode->i_ino, EXT4_I(inode),
> +				 EXT4_I(inode)->i_ordered_lblk,
> +				 EXT4_I(inode)->i_ordered_len);
> +	}
>  }
>  
>  static void ext4_shutdown(struct super_block *sb)
> -- 
> 2.52.0
> 

^ permalink raw reply

* Re: [PATCH 04/17] nilfs2: replace get_zeroed_page() with kzalloc()
From: Ryusuke Konishi @ 2026-05-27 16:02 UTC (permalink / raw)
  To: Mike Rapoport (Microsoft)
  Cc: Viacheslav Dubeyko, Jan Kara, Mark Fasheh, Joel Becker, Joseph Qi,
	Viacheslav Dubeyko, Trond Myklebust, Anna Schumaker, Chuck Lever,
	Jeff Layton, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	Alexander Viro, Christian Brauner, Jan Kara, Dave Kleikamp,
	Theodore Ts'o, Miklos Szeredi, Andreas Hindborg, Breno Leitao,
	Kees Cook, Tigran A. Aivazian, linux-kernel, linux-fsdevel,
	ocfs2-devel, linux-nilfs, linux-nfs, jfs-discussion, linux-ext4,
	linux-mm
In-Reply-To: <1bb537f6dc36b00788b613fb8f71579478418457.camel@redhat.com>

On Tue, May 26, 2026 at 2:07 AM Viacheslav Dubeyko wrote:
>
> On Sat, 2026-05-23 at 20:54 +0300, Mike Rapoport (Microsoft) wrote:
> > nilfs_ioctl_wrap_copy() allocates a temporary buffer with
> > get_zeroed_page().
> >
> > kzalloc() is a better API for such use and it also provides better
> > scalability and more debugging possibilities.
> >
> > Replace use of get_zeroed_page() with kzalloc().
> >
> > Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> > ---
> >  fs/nilfs2/ioctl.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/fs/nilfs2/ioctl.c b/fs/nilfs2/ioctl.c
> > index e0a606643e87..b73f2c5d10f0 100644
> > --- a/fs/nilfs2/ioctl.c
> > +++ b/fs/nilfs2/ioctl.c
> > @@ -69,7 +69,7 @@ static int nilfs_ioctl_wrap_copy(struct the_nilfs *nilfs,
> >       if (argv->v_index > ~(__u64)0 - argv->v_nmembs)
> >               return -EINVAL;
> >
> > -     buf = (void *)get_zeroed_page(GFP_NOFS);
> > +     buf = kzalloc(PAGE_SIZE, GFP_NOFS);
> >       if (unlikely(!buf))
> >               return -ENOMEM;
> >       maxmembs = PAGE_SIZE / argv->v_size;
> > @@ -107,7 +107,7 @@ static int nilfs_ioctl_wrap_copy(struct the_nilfs *nilfs,
> >       }
> >       argv->v_nmembs = total;
> >
> > -     free_pages((unsigned long)buf, 0);
> > +     kfree(buf);
> >       return ret;
> >  }
> >
>
> Makes sense to me.
>
> Reviewed-by: Viacheslav Dubeyko <slava@dubeyko.com>
>
> Thanks,
> Slava.

Acked-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>

This conversion looks reasonable and won't affect the behavior of the
ioctls that use the modified function.

Thanks,
Ryusuke Konishi

^ permalink raw reply

* Re: [PATCH v2 1/2] ext4: avoid RWM atomic in EXT4_MB_GRP_TEST_AND_SET_READ
From: Andreas Dilger @ 2026-05-27 19:46 UTC (permalink / raw)
  To: Bohdan Trach
  Cc: Theodore Ts'o, Baokun Li, Jan Kara, Ojaswin Mujoo,
	Ritesh Harjani (IBM), Zhang Yi, mchehab+huawei, bohdan.trach,
	lilith.oberhauser, linux-ext4, linux-kernel
In-Reply-To: <20260527090329.2680170-2-bohdan.trach@huaweicloud.com>

On May 27, 2026, at 03:03, Bohdan Trach <bohdan.trach@huaweicloud.com> wrote:
> 
> EXT4_MB_GRP_TEST_AND_SET_READ uses test_and_set_bit function which
> issues an atomic write. This can cause high overhead due to cache
> contention when multiple threads iterate over groups in a tight loop,
> as is the case for ext4_mb_prefetch(). We have seen this to be a
> problem for Kunpeng 920b CPUs which uses a single ARM LSE instruction
> for this purpose.
> 
> Avoid this unconditional atomic write by testing the bit first without
> changing its value. This is OK for this use case as this bit is never
> unset.
> 
> This change significantly reduces costs of fallocate() operations which
> trigger linear group scans on large multicore machines where
> test_and_set_bit issues an atomic write operation unconditionally.
> 
> Signed-off-by: Bohdan Trach <bohdan.trach@huaweicloud.com>

Thanks for the patch.  Definitely the benchmarks in the 0/2 email show
significant gains for the Kunpeng system, and reducing contention makes sense
as core counts increase and the likely case is that the bit is already set.

That said, I wonder if this should (also/instead) be put into test_and_set_bit()
itself, or whether we should add test_and_unlikely_set_bit() or
test_and_rarely_set_bit() (or similar) optimized for the case where the bit is
likely to already be set.
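
To make the idea concrete, here is a minimal userspace sketch of such a
helper using C11 atomics rather than the kernel bitops. The name
test_and_rarely_set_bit() is only the one floated above, not an existing
kernel API, and the memory orderings chosen here are assumptions for
illustration.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/*
 * Hypothetical helper: return the previous value of bit nr in *word,
 * setting the bit if it was clear.
 */
static bool test_and_rarely_set_bit(unsigned int nr,
				    _Atomic unsigned long *word)
{
	unsigned long mask;

	assert(nr < BITS_PER_WORD);
	mask = 1UL << nr;

	/* Fast path: a plain load, so the cache line can stay shared. */
	if (atomic_load_explicit(word, memory_order_relaxed) & mask)
		return true;

	/* Slow path: atomic read-modify-write, taken only while clear. */
	return atomic_fetch_or_explicit(word, mask,
					memory_order_acq_rel) & mask;
}
```

One design caveat: the fast path provides no ordering when the bit is
already set, so a generic helper of this shape would be weaker than an
unconditional atomic read-modify-write. That seems acceptable only for
callers, like this one, where the bit is never cleared and no ordering
is expected from the already-set case.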

I see in your benchmarking that there are no "apples-to-apples" comparisons for
ARM (Kunpeng) vs. AMD on the same storage.  The storage hardware and space usage
are different for each test run, and the AMD numbers show only marginal gains,
with more negative than positive results across the thread counts:

> Benchmark on an existing file system for AMD 9654 (15T FS, 6% space
> used), kernel 7.1-rc3. This shows the performance impact on a mostly
> free file system.
> | thr. |  base | patched |    improv. |
> |      |  perf |    perf |        (%) |
> |------|-------|---------|------------|
> |    1 | 30901 |   31191 | +0.9384810 |
> |    2 | 50874 |   50504 | -0.7272870 |
> |    4 | 66068 |   64108 | -2.9666404 |
> |    8 | 63963 |   61927 | -3.1830902 |
> |   16 | 47809 |   47044 | -1.6001171 |
> |   32 | 42441 |   42326 | -0.2709644 |
> |   64 | 39773 |   39929 | +0.3922259 |
> |  128 | 37065 |   36413 | -1.7590719 |


The performance reduction on AMD might be caused by the extra memory access,
which only adds overhead on that CPU implementation.  It would be useful
to see the testing on Kunpeng vs. AMD/Intel on the same storage device/usage.

That would tell us if it is more appropriate to optimize this in the aarch64
test_and_set_bit() rather than in ext4.

Cheers, Andreas


> ---
> fs/ext4/ext4.h | 8 +++++++-
> 1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 56b82d4a15d7..f8eacf1375f8 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -3551,7 +3551,13 @@ struct ext4_group_info {
>  #define EXT4_MB_GRP_CLEAR_TRIMMED(grp) \
>  	(clear_bit(EXT4_GROUP_INFO_WAS_TRIMMED_BIT, &((grp)->bb_state)))
>  #define EXT4_MB_GRP_TEST_AND_SET_READ(grp) \
> -	(test_and_set_bit(EXT4_GROUP_INFO_BBITMAP_READ_BIT, &((grp)->bb_state)))
> +	(ext4_mb_grp_test_and_set_read((grp)))
> +
> +static inline int ext4_mb_grp_test_and_set_read(struct ext4_group_info *grp)
> +{
> + 	return (test_bit(EXT4_GROUP_INFO_BBITMAP_READ_BIT, &grp->bb_state) ||
> + 		test_and_set_bit(EXT4_GROUP_INFO_BBITMAP_READ_BIT, &grp->bb_state));
> +}
> 
> #define EXT4_MAX_CONTENTION 8
> #define EXT4_CONTENTION_THRESHOLD 2
> -- 
> 2.43.0
> 


^ permalink raw reply

* Re: [PATCH v2 2/2] ext4: get ext4_group_desc in ext4_mb_prefetch only when necessary
From: Andreas Dilger @ 2026-05-27 19:49 UTC (permalink / raw)
  To: Bohdan Trach
  Cc: Theodore Ts'o, Baokun Li, Jan Kara, Ojaswin Mujoo,
	Ritesh Harjani (IBM), Zhang Yi, mchehab+huawei, bohdan.trach,
	lilith.oberhauser, linux-ext4, linux-kernel
In-Reply-To: <20260527090329.2680170-3-bohdan.trach@huaweicloud.com>

On May 27, 2026, at 03:03, Bohdan Trach <bohdan.trach@huaweicloud.com> wrote:
> 
> Getting ext4_group_desc structure can contribute to the cost of
> ext4_mb_prefetch() without any need, as most groups fail the
> !EXT4_MB_GRP_TEST_AND_SET_READ check.
> 
> Optimize ext4_mb_prefetch by getting the group description only when
> necessary.
> 
> The result is further increase in performance of fallocate() system call
> path that triggers ext4_mb_prefetch() via a linear group scan.
> 
> Signed-off-by: Bohdan Trach <bohdan.trach@huaweicloud.com>
> Reviewed-by: Jan Kara <jack@suse.cz>

This looks reasonable, and is independent of the EXT4_MB_GRP_TEST_AND_SET_READ()
micro-optimization in the 1/2 patch.

Reviewed-by: Andreas Dilger <adilger@dilger.ca>

Cheers, Andreas

^ permalink raw reply

* Re: [PATCH] jbd2: Remove special jbd2 slabs
From: Tal Zussman @ 2026-05-27 20:33 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle), Theodore Ts'o
  Cc: Jan Kara, linux-ext4, linux-fsdevel, Mike Rapoport (Microsoft),
	Vlastimil Babka
In-Reply-To: <20260525201321.21717-1-willy@infradead.org>

On 5/25/26 4:13 PM, Matthew Wilcox (Oracle) wrote:

Hi,

One small comment below.

> When jbd2 was originally written, kmalloc() would not guarantee alignment
> for the requested memory.  Since commit 59bb47985c1d in 2019, kmalloc
> has guaranteed natural alignment for power-of-two allocations.  We can
> now remove the jbd2 special slabs and just use kmalloc() directly.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  fs/jbd2/commit.c      |   8 ++-
>  fs/jbd2/journal.c     | 121 ++----------------------------------------
>  fs/jbd2/transaction.c |   8 +--
>  include/linux/jbd2.h  |   3 --
>  4 files changed, 11 insertions(+), 129 deletions(-)
> 
> diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
> index 38f318bb4279..2e8dbc4547bb 100644
> --- a/fs/jbd2/commit.c
> +++ b/fs/jbd2/commit.c
> @@ -514,10 +514,8 @@ void jbd2_journal_commit_transaction(journal_t *journal)
>  		 * leave undo-committed data.
>  		 */
>  		if (jh->b_committed_data) {
> -			struct buffer_head *bh = jh2bh(jh);
> -
>  			spin_lock(&jh->b_state_lock);
> -			jbd2_free(jh->b_committed_data, bh->b_size);
> +			kfree(jh->b_committed_data);
>  			jh->b_committed_data = NULL;
>  			spin_unlock(&jh->b_state_lock);
>  		}
> @@ -978,7 +976,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
>  		 * its triggers if they exist, so we can clear that too.
>  		 */
>  		if (jh->b_committed_data) {
> -			jbd2_free(jh->b_committed_data, bh->b_size);
> +			kfree(jh->b_committed_data);
>  			jh->b_committed_data = NULL;
>  			if (jh->b_frozen_data) {
>  				jh->b_committed_data = jh->b_frozen_data;
> @@ -986,7 +984,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
>  				jh->b_frozen_triggers = NULL;
>  			}
>  		} else if (jh->b_frozen_data) {
> -			jbd2_free(jh->b_frozen_data, bh->b_size);
> +			kfree(jh->b_frozen_data);
>  			jh->b_frozen_data = NULL;
>  			jh->b_frozen_triggers = NULL;
>  		}
> diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
> index a6616380ce38..ad10c8a92fa0 100644
> --- a/fs/jbd2/journal.c
> +++ b/fs/jbd2/journal.c
> @@ -95,8 +95,6 @@ EXPORT_SYMBOL(jbd2_journal_release_jbd_inode);
>  EXPORT_SYMBOL(jbd2_journal_begin_ordered_truncate);
>  EXPORT_SYMBOL(jbd2_inode_cache);
>  
> -static int jbd2_journal_create_slab(size_t slab_size);
> -
>  #ifdef CONFIG_JBD2_DEBUG
>  void __jbd2_debug(int level, const char *file, const char *func,
>  		  unsigned int line, const char *fmt, ...)
> @@ -385,10 +383,10 @@ int jbd2_journal_write_metadata_buffer(transaction_t *transaction,
>  			goto escape_done;
>  
>  		spin_unlock(&jh_in->b_state_lock);
> -		tmp = jbd2_alloc(bh_in->b_size, GFP_NOFS | __GFP_NOFAIL);
> +		tmp = kmalloc(bh_in->b_size, GFP_NOFS | __GFP_NOFAIL);
>  		spin_lock(&jh_in->b_state_lock);
>  		if (jh_in->b_frozen_data) {
> -			jbd2_free(tmp, bh_in->b_size);
> +			kfree(tmp);
>  			goto copy_done;
>  		}
>  
> @@ -2063,14 +2061,6 @@ EXPORT_SYMBOL(jbd2_journal_update_sb_errno);
>  int jbd2_journal_load(journal_t *journal)
>  {
>  	int err;
> -	journal_superblock_t *sb = journal->j_superblock;
> -
> -	/*
> -	 * Create a slab for this blocksize
> -	 */
> -	err = jbd2_journal_create_slab(be32_to_cpu(sb->s_blocksize));
> -	if (err)
> -		return err;
>  
>  	/* Let the recovery code check whether it needs to recover any
>  	 * data from the journal. */
> @@ -2698,108 +2688,6 @@ size_t journal_tag_bytes(journal_t *journal)
>  		return sz - sizeof(__u32);
>  }
>  
> -/*
> - * JBD memory management
> - *
> - * These functions are used to allocate block-sized chunks of memory
> - * used for making copies of buffer_head data.  Very often it will be
> - * page-sized chunks of data, but sometimes it will be in
> - * sub-page-size chunks.  (For example, 16k pages on Power systems
> - * with a 4k block file system.)  For blocks smaller than a page, we
> - * use a SLAB allocator.  There are slab caches for each block size,
> - * which are allocated at mount time, if necessary, and we only free
> - * (all of) the slab caches when/if the jbd2 module is unloaded.  For
> - * this reason we don't need to a mutex to protect access to
> - * jbd2_slab[] allocating or releasing memory; only in
> - * jbd2_journal_create_slab().
> - */
> -#define JBD2_MAX_SLABS 8
> -static struct kmem_cache *jbd2_slab[JBD2_MAX_SLABS];
> -
> -static const char *jbd2_slab_names[JBD2_MAX_SLABS] = {
> -	"jbd2_1k", "jbd2_2k", "jbd2_4k", "jbd2_8k",
> -	"jbd2_16k", "jbd2_32k", "jbd2_64k", "jbd2_128k"
> -};
> -
> -
> -static void jbd2_journal_destroy_slabs(void)
> -{
> -	int i;
> -
> -	for (i = 0; i < JBD2_MAX_SLABS; i++) {
> -		kmem_cache_destroy(jbd2_slab[i]);
> -		jbd2_slab[i] = NULL;
> -	}
> -}
> -
> -static int jbd2_journal_create_slab(size_t size)
> -{
> -	static DEFINE_MUTEX(jbd2_slab_create_mutex);
> -	int i = order_base_2(size) - 10;
> -	size_t slab_size;
> -
> -	if (size == PAGE_SIZE)
> -		return 0;
> -
> -	if (i >= JBD2_MAX_SLABS)
> -		return -EINVAL;
> -
> -	if (unlikely(i < 0))
> -		i = 0;
> -	mutex_lock(&jbd2_slab_create_mutex);
> -	if (jbd2_slab[i]) {
> -		mutex_unlock(&jbd2_slab_create_mutex);
> -		return 0;	/* Already created */
> -	}
> -
> -	slab_size = 1 << (i+10);
> -	jbd2_slab[i] = kmem_cache_create(jbd2_slab_names[i], slab_size,
> -					 slab_size, 0, NULL);
> -	mutex_unlock(&jbd2_slab_create_mutex);
> -	if (!jbd2_slab[i]) {
> -		printk(KERN_EMERG "JBD2: no memory for jbd2_slab cache\n");
> -		return -ENOMEM;
> -	}
> -	return 0;
> -}
> -
> -static struct kmem_cache *get_slab(size_t size)
> -{
> -	int i = order_base_2(size) - 10;
> -
> -	BUG_ON(i >= JBD2_MAX_SLABS);
> -	if (unlikely(i < 0))
> -		i = 0;
> -	BUG_ON(jbd2_slab[i] == NULL);
> -	return jbd2_slab[i];
> -}
> -
> -void *jbd2_alloc(size_t size, gfp_t flags)
> -{
> -	void *ptr;
> -
> -	BUG_ON(size & (size-1)); /* Must be a power of 2 */
> -
> -	if (size < PAGE_SIZE)
> -		ptr = kmem_cache_alloc(get_slab(size), flags);
> -	else
> -		ptr = (void *)__get_free_pages(flags, get_order(size));
> -
> -	/* Check alignment; SLUB has gotten this wrong in the past,
> -	 * and this can lead to user data corruption! */
> -	BUG_ON(((unsigned long) ptr) & (size-1));
> -
> -	return ptr;
> -}
> -
> -void jbd2_free(void *ptr, size_t size)
> -{
> -	if (size < PAGE_SIZE)
> -		kmem_cache_free(get_slab(size), ptr);
> -	else
> -		free_pages((unsigned long)ptr, get_order(size));
> -};
> -
>  /*
>   * Journal_head storage management
>   */
> @@ -2977,11 +2865,11 @@ static void journal_release_journal_head(struct journal_head *jh, size_t b_size)

I think the b_size parameter can be removed from journal_release_journal_head()
and its single caller now.
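
A rough userspace sketch of what I mean, not the kernel code itself: free()
stands in for kfree(), and the struct below is a trimmed, hypothetical model
holding only the two fields touched here. Since neither kfree() nor free()
needs the allocation size, nothing in the function uses b_size any more.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical, trimmed stand-in for the kernel's struct journal_head. */
struct journal_head {
	char *b_frozen_data;
	char *b_committed_data;
};

/* Stand-in for journal_free_journal_head(); here it just frees the struct. */
static void journal_free_journal_head(struct journal_head *jh)
{
	free(jh);
}

/* Same shape as the patched function, minus the unused b_size argument. */
static void journal_release_journal_head(struct journal_head *jh)
{
	if (jh->b_frozen_data) {
		fprintf(stderr, "%s: freeing b_frozen_data\n", __func__);
		free(jh->b_frozen_data);	/* models kfree(): no size needed */
	}
	if (jh->b_committed_data) {
		fprintf(stderr, "%s: freeing b_committed_data\n", __func__);
		free(jh->b_committed_data);	/* models kfree(): no size needed */
	}
	journal_free_journal_head(jh);
}
```

The caller would then drop its size argument as well and pass only jh.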

>  {
>  	if (jh->b_frozen_data) {
>  		printk(KERN_WARNING "%s: freeing b_frozen_data\n", __func__);
> -		jbd2_free(jh->b_frozen_data, b_size);
> +		kfree(jh->b_frozen_data);
>  	}
>  	if (jh->b_committed_data) {
>  		printk(KERN_WARNING "%s: freeing b_committed_data\n", __func__);
> -		jbd2_free(jh->b_committed_data, b_size);
> +		kfree(jh->b_committed_data);
>  	}
>  	journal_free_journal_head(jh);
>  }
> @@ -3142,7 +3030,6 @@ static void jbd2_journal_destroy_caches(void)
>  	jbd2_journal_destroy_handle_cache();
>  	jbd2_journal_destroy_inode_cache();
>  	jbd2_journal_destroy_transaction_cache();
> -	jbd2_journal_destroy_slabs();
>  }
>  
>  static int __init journal_init(void)
> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
> index 4885903bbd10..48ddb566d12d 100644
> --- a/fs/jbd2/transaction.c
> +++ b/fs/jbd2/transaction.c
> @@ -1131,7 +1131,7 @@ do_get_write_access(handle_t *handle, struct journal_head *jh,
>  		if (!frozen_buffer) {
>  			JBUFFER_TRACE(jh, "allocate memory for buffer");
>  			spin_unlock(&jh->b_state_lock);
> -			frozen_buffer = jbd2_alloc(jh2bh(jh)->b_size,
> +			frozen_buffer = kmalloc(jh2bh(jh)->b_size,
>  						   GFP_NOFS | __GFP_NOFAIL);
>  			goto repeat;
>  		}
> @@ -1159,7 +1159,7 @@ do_get_write_access(handle_t *handle, struct journal_head *jh,
>  
>  out:
>  	if (unlikely(frozen_buffer))	/* It's usually NULL */
> -		jbd2_free(frozen_buffer, bh->b_size);
> +		kfree(frozen_buffer);
>  
>  	JBUFFER_TRACE(jh, "exit");
>  	return error;
> @@ -1424,7 +1424,7 @@ int jbd2_journal_get_undo_access(handle_t *handle, struct buffer_head *bh)
>  
>  repeat:
>  	if (!jh->b_committed_data)
> -		committed_data = jbd2_alloc(jh2bh(jh)->b_size,
> +		committed_data = kmalloc(jh2bh(jh)->b_size,
>  					    GFP_NOFS|__GFP_NOFAIL);
>  
>  	spin_lock(&jh->b_state_lock);
> @@ -1445,7 +1445,7 @@ int jbd2_journal_get_undo_access(handle_t *handle, struct buffer_head *bh)
>  out:
>  	jbd2_journal_put_journal_head(jh);
>  	if (unlikely(committed_data))
> -		jbd2_free(committed_data, bh->b_size);
> +		kfree(committed_data);
>  	return err;
>  }
>  
> diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
> index 7e785aa6d35d..b68561187e90 100644
> --- a/include/linux/jbd2.h
> +++ b/include/linux/jbd2.h
> @@ -63,9 +63,6 @@ void __jbd2_debug(int level, const char *file, const char *func,
>  #define jbd2_debug(n, fmt, a...)  no_printk(fmt, ##a)
>  #endif
>  
> -extern void *jbd2_alloc(size_t size, gfp_t flags);
> -extern void jbd2_free(void *ptr, size_t size);
> -
>  #define JBD2_MIN_JOURNAL_BLOCKS 1024
>  #define JBD2_DEFAULT_FAST_COMMIT_BLOCKS 256
>  


^ permalink raw reply

