Linux EXT4 FS development
* [PATCH] ext4: avoid full buffer walks for large folio partial writes
From: Jia Zhu @ 2026-06-03 13:48 UTC (permalink / raw)
  To: Theodore Ts'o, Andreas Dilger
  Cc: Alexander Viro, Christian Brauner, Jan Kara, Baokun Li,
	Ojaswin Mujoo, Ritesh Harjani, Zhang Yi, linux-ext4,
	linux-fsdevel, linux-kernel, Jia Zhu

Ext4 buffered writes into large folios still walk every buffer_head in the
folio in ext4_block_write_begin() and again in block_commit_write(). Before
regular files used large folios this was cheap, but a large folio can
contain hundreds of buffer_heads. Small overwrites of an existing large
folio therefore pay work proportional to the folio size instead of the
write size.

This is visible when the page cache is first populated with large folios
and then a small range is overwritten. The numbers below come from a local
libMicro-based microbenchmark. Each round first drops caches, writes a
10 MiB file with dd to instantiate large page-cache folios, and then runs
libMicro's write, pwrite, or writev benchmark for a small buffered
overwrite. The writev cases use libMicro's default vector count of 10.

A representative pwrite round is:

	sync
	echo 3 > /proc/sys/vm/drop_caches
	dd if=/dev/zero of=$file bs=1024k count=10
	taskset -c 0 ./bin/pwrite -H -C 50 -D 3 -S -N pwrite_u1k \
		-s 1k -f $file

To avoid comparing this change with an older kernel, the benchmark uses two
kernels built from the same master tree: one with this change and one with
only this change reverted. With THP=always and 10 dd-prefill rounds, median
latencies were:

			nofix		patched		improvement
	write_u1k	1.418 usec	0.342 usec	75.9%
	write_u10k	1.887 usec	0.409 usec	78.3%
	write_u100k	4.114 usec	2.554 usec	37.9%
	pwrite_u1k	1.677 usec	0.335 usec	80.1%
	pwrite_u10k	1.903 usec	0.410 usec	78.5%
	pwrite_u100k	4.101 usec	2.563 usec	37.5%
	writev_u1k	2.285 usec	0.756 usec	66.9%
	writev_u10k	4.655 usec	3.025 usec	35.0%

Start the ext4 write_begin walk at the first buffer that overlaps the
write. For already-uptodate large folio overwrites, add a partial commit
path which marks only the written buffers uptodate and dirty. Leave
non-uptodate folios on the old full-buffer commit path so BH_New cleanup
and folio-uptodate discovery are preserved.

Partially uptodate large folios remain described by per-buffer state, which
is what block_is_partially_uptodate() and read_folio use for later reads.

Signed-off-by: Jia Zhu <zhujia.zj@bytedance.com>
---
 fs/buffer.c     | 51 +++++++++++++++++++++++++++++++++++++++++++++++++
 fs/ext4/inode.c | 21 ++++++++++----------
 2 files changed, 62 insertions(+), 10 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index b0b3792b1496e..e0c5868b088be 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2092,6 +2092,44 @@ int __block_write_begin(struct folio *folio, loff_t pos, unsigned len,
 }
 EXPORT_SYMBOL(__block_write_begin);
 
+static struct buffer_head *folio_buffer_seek(struct buffer_head *head,
+					     unsigned int blocksize,
+					     size_t offset,
+					     size_t *block_start)
+{
+	size_t nr = offset / blocksize;
+
+	*block_start = nr * blocksize;
+	while (nr--)
+		head = head->b_this_page;
+	return head;
+}
+
+static void block_commit_write_range(struct buffer_head *head,
+				     unsigned int blocksize, size_t from,
+				     size_t to)
+{
+	size_t block_start, block_end;
+	struct buffer_head *bh;
+
+	if (from == to)
+		return;
+	if (WARN_ON_ONCE(to > folio_size(head->b_folio)))
+		return;
+
+	bh = folio_buffer_seek(head, blocksize, from, &block_start);
+	do {
+		block_end = block_start + blocksize;
+		set_buffer_uptodate(bh);
+		mark_buffer_dirty(bh);
+		if (buffer_new(bh))
+			clear_buffer_new(bh);
+
+		block_start = block_end;
+		bh = bh->b_this_page;
+	} while (block_start < to && bh != head);
+}
+
 void block_commit_write(struct folio *folio, size_t from, size_t to)
 {
 	size_t block_start, block_end;
@@ -2104,6 +2142,19 @@ void block_commit_write(struct folio *folio, size_t from, size_t to)
 		return;
 	blocksize = bh->b_size;
 
+	/*
+	 * Large folios can carry hundreds of buffer_heads.  For partial writes,
+	 * keep commit work local to the written range; partially uptodate
+	 * reads remain governed by the buffer state.
+	 */
+	if (folio_test_large(folio) && from < to &&
+	    folio_test_uptodate(folio) &&
+	    to <= folio_size(folio) &&
+	    (from != 0 || to != folio_size(folio))) {
+		block_commit_write_range(head, blocksize, from, to);
+		return;
+	}
+
 	block_start = 0;
 	do {
 		block_end = block_start + blocksize;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index c2c2d6ac7f3d1..e58bba0289eba 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1180,7 +1180,7 @@ int ext4_block_write_begin(handle_t *handle, struct folio *folio,
 	unsigned int blocksize = i_blocksize(inode);
 	struct buffer_head *bh, *head, *wait[2];
 	int nr_wait = 0;
-	int i;
+	unsigned int i;
 	bool should_journal_data = ext4_should_journal_data(inode);
 
 	BUG_ON(!folio_test_locked(folio));
@@ -1191,17 +1191,18 @@ int ext4_block_write_begin(handle_t *handle, struct folio *folio,
 	head = folio_buffers(folio);
 	if (!head)
 		head = create_empty_buffers(folio, blocksize, 0);
-	block = EXT4_PG_TO_LBLK(inode, folio->index);
+	if (from == to)
+		return 0;
+	block_start = round_down(from, blocksize);
+	block = EXT4_PG_TO_LBLK(inode, folio->index) +
+		(block_start >> inode->i_blkbits);
+	bh = head;
+	for (i = 0; i < block_start; i += blocksize)
+		bh = bh->b_this_page;
 
-	for (bh = head, block_start = 0; bh != head || !block_start;
-	    block++, block_start = block_end, bh = bh->b_this_page) {
+	for (; block_start < to;
+	     block++, block_start = block_end, bh = bh->b_this_page) {
 		block_end = block_start + blocksize;
-		if (block_end <= from || block_start >= to) {
-			if (folio_test_uptodate(folio)) {
-				set_buffer_uptodate(bh);
-			}
-			continue;
-		}
 		if (WARN_ON_ONCE(buffer_new(bh)))
 			clear_buffer_new(bh);
 		if (!buffer_mapped(bh)) {

base-commit: e43ffb69e0438cddd72aaa30898b4dc446f664f8
-- 
2.39.5 (Apple Git-154)
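
Note: the difference between the full walk and the range walk described in
the commit message can be illustrated with a small userspace model. This is
only a sketch, not kernel code: struct toy_bh and the toy_* helpers are
hypothetical stand-ins for buffer_head and the functions in the patch, and
the model tracks a single "dirty" flag instead of the real buffer state bits.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct buffer_head: circular link plus one flag. */
struct toy_bh {
	struct toy_bh *b_this_page;
	int dirty;
};

/* Link n buffers into a circular list, as the buffers of one folio would be. */
static struct toy_bh *toy_link(struct toy_bh *bhs, size_t n)
{
	assert(n > 0);
	for (size_t i = 0; i < n; i++) {
		bhs[i].b_this_page = &bhs[(i + 1) % n];
		bhs[i].dirty = 0;
	}
	return &bhs[0];
}

/* Old shape: visit every buffer, act only on those overlapping [from, to). */
static size_t toy_commit_full(struct toy_bh *head, size_t blocksize,
			      size_t from, size_t to)
{
	size_t block_start = 0, visited = 0;
	struct toy_bh *bh = head;

	do {
		size_t block_end = block_start + blocksize;

		if (block_end > from && block_start < to)
			bh->dirty = 1;
		visited++;
		block_start = block_end;
		bh = bh->b_this_page;
	} while (bh != head);
	return visited;
}

/* Patched shape: seek to the first overlapping buffer, then walk the range. */
static size_t toy_commit_range(struct toy_bh *head, size_t blocksize,
			       size_t from, size_t to)
{
	size_t nr = from / blocksize;
	size_t block_start = nr * blocksize;
	size_t touched = 0;
	struct toy_bh *bh = head;

	if (from == to)
		return 0;
	while (nr--)			/* pointer hops only, no flag work */
		bh = bh->b_this_page;
	do {
		bh->dirty = 1;
		touched++;
		block_start += blocksize;
		bh = bh->b_this_page;
	} while (block_start < to && bh != head);
	return touched;
}
```

With 512 buffers of 4096 bytes (a 2 MiB folio) and a 1 KiB write at offset
8704, the full walk visits all 512 buffers while the range walk touches one.
The seek still follows from / blocksize links, so in this model the saving
is in per-buffer state work rather than in pointer hops.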


* Re: [PATCH 10/17] jbd2: replace __get_free_pages() with kmalloc()
From: Theodore Tso @ 2026-06-03 13:50 UTC (permalink / raw)
  To: Mike Rapoport (Microsoft)
  Cc: Jan Kara, Mark Fasheh, Joel Becker, Joseph Qi, Ryusuke Konishi,
	Viacheslav Dubeyko, Trond Myklebust, Anna Schumaker, Chuck Lever,
	Jeff Layton, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	Alexander Viro, Christian Brauner, Jan Kara, Dave Kleikamp,
	Miklos Szeredi, Andreas Hindborg, Breno Leitao, Kees Cook,
	Tigran A. Aivazian, linux-kernel, linux-fsdevel, ocfs2-devel,
	linux-nilfs, linux-nfs, jfs-discussion, linux-ext4, linux-mm
In-Reply-To: <20260523-b4-fs-v1-10-275e36a83f0e@kernel.org>

On Sat, May 23, 2026 at 08:54:22PM +0300, Mike Rapoport (Microsoft) wrote:
> jbd2_alloc() falls back from kmem_cache_alloc() to __get_free_pages() for
> allocations larger than PAGE_SIZE.
> But kmalloc() can handle such cases with essentially the same fallback.
> 
> Replace use of __get_free_pages() with kmalloc() and simplify
> jbd2_free() as both kmem_cache_alloc() and kmalloc() allocations can be
> freed with kfree().
> 
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>

So historically __get_free_pages() was more efficient than kmalloc
since previously the kmalloc overhead meant that a single 4k
allocation would take two pages instead of one.  I'm guessing that has
since changed?

Can you explain to someone who hasn't been tracking the changes in
kmalloc over time:

  * How does the efficiency of kmalloc compare to __get_free_page when
    order == 1?  What is the overhead in terms of memory overhead?
    I'm a bit less concerned about CPU overhead, but it would be good
    to know that?

  * What does kmalloc() do when a size > PAGE_SIZE is passed?  Will it
    return contiguous memory, or return an error or worse, BUG?  And
    same question as above; what is the overhead of kmalloc() when
    size is 2*PAGE_SIZE?  8*PAGE_SIZE?

Thanks,

						- Ted


* Re: [PATCH 2/8] ext4: convert mballoc KUnit test to sget_fc()
From: Theodore Tso @ 2026-06-03 13:52 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, Andreas Dilger, Jan Kara, Ritesh Harjani (IBM),
	linux-ext4, linux-cifs, Alexander Viro
In-Reply-To: <20260528-pailletten-gitter-hiermit-5198ec556b28@brauner>

On Thu, May 28, 2026 at 02:02:50PM +0200, Christian Brauner wrote:
> 
> In these two cases it's fine. Because you're just using the allocation
> and deallocation functions to get a fs_context that's basically just an
> empty vessel to get at a superblock via sget_fc() but you're not really
> doing anything with it.

If you're OK with it, I have no objections, but...

I'm sure it's fine today.  But is this something which is documented
to be fine in the future?  It just seems a little fragile and is
contrary to the documentation.

Thanks,

						- Ted


* Re: [PATCH RFC 10/17] ext4: use fast incremental CRC update in ext4_free_inode()
From: Theodore Tso @ 2026-06-03 14:17 UTC (permalink / raw)
  To: sashiko; +Cc: linux-ext4, Baokun Li
In-Reply-To: <20260508233305.EB600C2BCB0@smtp.kernel.org>

On Fri, May 08, 2026 at 11:33:05PM +0000, sashiko-bot@kernel.org wrote:
> Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
> - [High] ext4_inode_bitmap_csum_set_fast() corrupts the inode bitmap checksum if EXT4_INODES_PER_GROUP(sb) is not a multiple of 8.
>
> > -	ext4_inode_bitmap_csum_set(sb, gdp, bitmap_bh);
> > +	ext4_inode_bitmap_csum_set_fast(sb, gdp, bit);
> 
> Can this corrupt the inode bitmap checksum if EXT4_INODES_PER_GROUP(sb) is not
> a multiple of 8?
> 
> In ext4_inode_bitmap_csum_set(), the CRC32c checksum is computed over exactly
> EXT4_INODES_PER_GROUP(sb) >> 3 bytes. If s_inodes_per_group is not a multiple
> of 8, any trailing fractional bits are strictly excluded from the checksum
> calculation.
> 
> Because ext4_fill_super() does not enforce that s_inodes_per_group is a
> multiple of 8, a crafted filesystem can have an unaligned s_inodes_per_group.

The reason ext4_fill_super() doesn't enforce that s_inodes_per_group
is a multiple of 8 is historical.  A long time ago, back when Android
was allergic to GPLv2 in userspace, they implemented their own version
of mke2fs (and didn't run fsck on the file system, sigh).  Their MIT
licensed version, make_ext4fs, would occasionally create file systems
whose s_inodes_per_group was not a multiple of 8, and this ran afoul
of e2fsck[1] if someone actually tried to repair a corrupted Android
user data file system (as opposed to just wiping the flash and
starting from scratch).

[1] https://sourceforge.net/p/e2fsprogs/bugs/292/

This was fixed over a decade ago, so at this point I'm pretty sure
any such mobile handsets are in the landfill.  We should probably fix
this by adding a check in ext4_fill_super() and a corresponding check
in e2fsck.

					- Ted
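
Note: the alignment issue in the quoted review can be made concrete with a
little arithmetic. The sketch below is a userspace model, assuming only what
the review states (the full checksum covers EXT4_INODES_PER_GROUP(sb) >> 3
bytes of the bitmap); the function names are hypothetical and no CRC is
computed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bytes of the inode bitmap covered by the full checksum, per the review. */
static uint32_t csum_covered_bytes(uint32_t inodes_per_group)
{
	return inodes_per_group >> 3;
}

/* Does the byte holding this inode bit fall inside the checksummed range? */
static bool bit_is_covered(uint32_t inodes_per_group, uint32_t bit)
{
	assert(bit < inodes_per_group);
	return (bit >> 3) < csum_covered_bytes(inodes_per_group);
}
```

For s_inodes_per_group = 8192 every bit is covered. For an unaligned value
such as 8196, bits 8192..8195 live in a byte the full computation never
reads, which is presumably where an incremental per-bit update could diverge
from the full recomputation.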


* Re: [PATCH] common/defrag: Skip defrag tests on DAX-enabled filesystems
From: Disha Goel @ 2026-06-03 14:20 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: fstests, linux-ext4, linux-fsdevel, ritesh.list, ojaswin, xfs
In-Reply-To: <20260602200930.GB6054@frogsfrogsfrogs>

On 03/06/26 1:39 am, Darrick J. Wong wrote:
> On Tue, Jun 02, 2026 at 03:44:18PM +0530, Disha Goel wrote:
>> Online defragmentation is not supported on DAX-enabled filesystems
>> because DAX bypasses the page cache required for defrag operations.
>>
>> Add check in _require_defrag() to skip tests when DAX is enabled,
>> avoiding false failures on ext4/301-304, ext4/308 and generic/018.
>>
>> Signed-off-by: Disha Goel <disgoel@linux.ibm.com>
>> ---
>>   common/defrag | 4 ++++
>>   1 file changed, 4 insertions(+)
>>
>> diff --git a/common/defrag b/common/defrag
>> index 055d0d0e..28db2f7a 100644
>> --- a/common/defrag
>> +++ b/common/defrag
>> @@ -6,6 +6,10 @@
>>   
>>   _require_defrag()
>>   {
>> +    # Defragmentation is not supported on DAX-enabled filesystems
>> +    if echo "$MOUNT_OPTIONS" | grep -qw "dax"; then
>> +        _notrun "Defragmentation not supported on DAX-enabled filesystem"
>> +    fi
> 
> Defrag doesn't work on XFS on DAX as well?  It seems to work fine on my
> VMs...
> 

Thank you for pointing this out. You're right — I missed that xfs defrag
works fine with dax.

I'll fix this in v2 to only skip for ext4.

> <confused>
> 
> --D
> 
>>       case "$FSTYP" in
>>       xfs)
>>           # xfs_fsr does preallocates, require "falloc"
>> -- 
>> 2.45.1
>>
>>

-- 
Regards,
Disha



* Re: [PATCH] common/defrag: Skip defrag tests on DAX-enabled filesystems
From: Disha Goel @ 2026-06-03 14:22 UTC (permalink / raw)
  To: Ojaswin Mujoo; +Cc: fstests, linux-ext4, linux-fsdevel, ritesh.list
In-Reply-To: <ah6yl1T9jnN0wH6d@li-dc0c254c-257c-11b2-a85c-98b6c1322444.ibm.com>

On 02/06/26 4:08 pm, Ojaswin Mujoo wrote:
> On Tue, Jun 02, 2026 at 03:44:18PM +0530, Disha Goel wrote:
>> Online defragmentation is not supported on DAX-enabled filesystems
>> because DAX bypasses the page cache required for defrag operations.
>>
>> Add check in _require_defrag() to skip tests when DAX is enabled,
>> avoiding false failures on ext4/301-304, ext4/308 and generic/018.
>>
>> Signed-off-by: Disha Goel <disgoel@linux.ibm.com>
> 
> Looks good Disha, feel free to add:
> 
> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> 
> One small comment:
>> ---
>>   common/defrag | 4 ++++
>>   1 file changed, 4 insertions(+)
>>
>> diff --git a/common/defrag b/common/defrag
>> index 055d0d0e..28db2f7a 100644
>> --- a/common/defrag
>> +++ b/common/defrag
>> @@ -6,6 +6,10 @@
>>   
>>   _require_defrag()
>>   {
>> +    # Defragmentation is not supported on DAX-enabled filesystems
> 
> I think this comment is not needed as _notrun explains it already

Thanks, I'll fix this in v2.

> 
>> +    if echo "$MOUNT_OPTIONS" | grep -qw "dax"; then
>> +        _notrun "Defragmentation not supported on DAX-enabled filesystem"
>> +    fi
>>       case "$FSTYP" in
>>       xfs)
>>           # xfs_fsr does preallocates, require "falloc"
>> -- 
>> 2.45.1
>>

-- 
Regards,
Disha



* Re: [PATCH] ext4: avoid full buffer walks for large folio partial writes
From: Matthew Wilcox @ 2026-06-03 18:11 UTC (permalink / raw)
  To: Jia Zhu
  Cc: Theodore Ts'o, Andreas Dilger, Alexander Viro,
	Christian Brauner, Jan Kara, Baokun Li, Ojaswin Mujoo,
	Ritesh Harjani, Zhang Yi, linux-ext4, linux-fsdevel, linux-kernel
In-Reply-To: <20260603134800.25155-1-zhujia.zj@bytedance.com>

On Wed, Jun 03, 2026 at 09:48:00PM +0800, Jia Zhu wrote:
> Ext4 buffered writes into large folios still walk every buffer_head in the
> folio in ext4_block_write_begin() and again in block_commit_write(). Before
> regular files used large folios this was cheap, but a large folio can
> contain hundreds of buffer_heads. Small overwrites of an existing large
> folio therefore pay work proportional to the folio size instead of the
> write size.

Is this a common case for you, or is this something you noticed by
inspection?

> Start the ext4 write_begin walk at the first buffer that overlaps the
> write. For already-uptodate large folio overwrites, add a partial commit
> path which marks only the written buffers uptodate and dirty. Leave
> non-uptodate folios on the old full-buffer commit path so BH_New cleanup
> and folio-uptodate discovery are preserved.

Wouldn't you get just as much benefit from this?

+++ b/fs/buffer.c
@@ -2096,6 +2096,7 @@ void block_commit_write(struct folio *folio, size_t from,
size_t to)
 {
        size_t block_start, block_end;
        bool partial = false;
+       bool uptodate = folio_test_uptodate(folio);
        unsigned blocksize;
        struct buffer_head *bh, *head;

@@ -2118,6 +2119,8 @@ void block_commit_write(struct folio *folio, size_t from, size_t to)
                        clear_buffer_new(bh);

                block_start = block_end;
+               if (uptodate && block_start >= to)
+                       break;
                bh = bh->b_this_page;
        } while (bh != head);

> @@ -1191,17 +1191,18 @@ int ext4_block_write_begin(handle_t *handle, struct folio *folio,
>  	head = folio_buffers(folio);
>  	if (!head)
>  		head = create_empty_buffers(folio, blocksize, 0);
> -	block = EXT4_PG_TO_LBLK(inode, folio->index);
> +	if (from == to)
> +		return 0;
> +	block_start = round_down(from, blocksize);
> +	block = EXT4_PG_TO_LBLK(inode, folio->index) +
> +		(block_start >> inode->i_blkbits);
> +	bh = head;
> +	for (i = 0; i < block_start; i += blocksize)
> +		bh = bh->b_this_page;
>  
> -	for (bh = head, block_start = 0; bh != head || !block_start;
> -	    block++, block_start = block_end, bh = bh->b_this_page) {
> +	for (; block_start < to;
> +	     block++, block_start = block_end, bh = bh->b_this_page) {
>  		block_end = block_start + blocksize;
> -		if (block_end <= from || block_start >= to) {
> -			if (folio_test_uptodate(folio)) {
> -				set_buffer_uptodate(bh);
> -			}
> -			continue;
> -		}
>  		if (WARN_ON_ONCE(buffer_new(bh)))
>  			clear_buffer_new(bh);
>  		if (!buffer_mapped(bh)) {
> 

I'm unconvinced that this is safe ... but all of this is a distraction
from what we should really be doing, which is converting ext4 to use
iomap instead of buffer heads.
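
Note: the early-break variant suggested above can be modelled in userspace
by counting how many buffers the commit loop visits. This is a sketch of the
loop shape only; visited_with_early_break() is a hypothetical name and the
model ignores all buffer state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Count buffers visited by a commit loop that starts at the head of the
 * folio and breaks once it has passed `to`, but only when the folio was
 * already uptodate on entry.
 */
static size_t visited_with_early_break(size_t nr_buffers, size_t blocksize,
				       size_t to, bool folio_uptodate)
{
	size_t block_start = 0, visited = 0;

	assert(blocksize > 0 && nr_buffers > 0);
	for (size_t i = 0; i < nr_buffers; i++) {
		visited++;
		block_start += blocksize;
		if (folio_uptodate && block_start >= to)
			break;
	}
	return visited;
}
```

In this model a 1 KiB overwrite at the start of a 512-buffer folio visits
one buffer, while the same overwrite in the last block still visits all 512,
because the loop always begins at the head. Whether that matters depends on
where in the folio the small writes land.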


* Re: [PATCH 10/17] jbd2: replace __get_free_pages() with kmalloc()
From: Mike Rapoport @ 2026-06-04  6:14 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Jan Kara, Mark Fasheh, Joel Becker, Joseph Qi, Ryusuke Konishi,
	Viacheslav Dubeyko, Trond Myklebust, Anna Schumaker, Chuck Lever,
	Jeff Layton, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	Alexander Viro, Christian Brauner, Jan Kara, Dave Kleikamp,
	Miklos Szeredi, Andreas Hindborg, Breno Leitao, Kees Cook,
	Tigran A. Aivazian, linux-kernel, linux-fsdevel, ocfs2-devel,
	linux-nilfs, linux-nfs, jfs-discussion, linux-ext4, linux-mm
In-Reply-To: <yfzx3jgzwesernofl7mzixa2mhjfii5v3o7yapghtmozixrpfu@6bsh7iixyiov>

Hi Ted,

On Wed, Jun 03, 2026 at 09:50:15AM -0400, Theodore Tso wrote:
> On Sat, May 23, 2026 at 08:54:22PM +0300, Mike Rapoport (Microsoft) wrote:
> > jbd2_alloc() falls back from kmem_cache_alloc() to __get_free_pages() for
> > allocations larger than PAGE_SIZE.
> > But kmalloc() can handle such cases with essentially the same fallback.
> > 
> > Replace use of __get_free_pages() with kmalloc() and simplify
> > jbd2_free() as both kmem_cache_alloc() and kmalloc() allocations can be
> > freed with kfree().
> > 
> > Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> 
> So historically __get_free_pages() was more efficient than kmalloc
> since previously the kmalloc overhead meant that a single 4k
> allocation would take two pages instead of one.  I'm guessing that has
> since changed?

Today there's no memory overhead for kmalloc(PAGE_SIZE). Cache refill takes
more pages of course, but they will be handed over to the next
kmalloc(PAGE_SIZE).
 
> Can you explain to someone who hasn't been tracking the changes in
> kmalloc over time:
> 
>   * How does the efficiency of kmalloc compare to __get_free_page when
>     order == 1?  What is the overhead in terms of memory overhead?
>     I'm a bit less concerned about CPU overhead, but it would be good
>     to know that?

There's no memory overhead when order == 1.
As for the CPU overhead, the difference for the fast path allocations is
not measurable and for the slow path it is anyway determined by the amount
of reclaim involved rather than by what allocator is used.
 
>   * What does kmalloc() do when a size > PAGE_SIZE is passed?  Will it
>     return contiguous memory, or return an error or worse, BUG?  And
>     same question as above; what is the overhead of kmalloc() when
>     size is 2*PAGE_SIZE?  8*PAGE_SIZE?

For size >= PAGE_SIZE kmalloc() always returns contiguous page aligned
memory.

Larger allocations (> PAGE_SIZE * 2) go straight to the page allocator. 

> Thanks,
> 
> 						- Ted

-- 
Sincerely yours,
Mike.
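
Note: for requests that reach the page allocator, the memory cost follows
from the allocator handing out power-of-two groups of pages. The sketch
below models that rounding in userspace. TOY_PAGE_SIZE (4096) and the toy_*
names are assumptions for illustration; toy_get_order() mirrors what the
kernel's get_order() computes but is not the kernel implementation.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SIZE 4096u	/* assumed page size for this sketch */

/* Smallest order such that (TOY_PAGE_SIZE << order) >= size. */
static unsigned int toy_get_order(size_t size)
{
	unsigned int order = 0;

	assert(size > 0);
	while (((size_t)TOY_PAGE_SIZE << order) < size)
		order++;
	return order;
}

/* Number of pages backing an allocation of this size at that order. */
static size_t toy_pages_backing(size_t size)
{
	return (size_t)1 << toy_get_order(size);
}
```

Under these assumptions 2*PAGE_SIZE is backed by 2 pages and 8*PAGE_SIZE by
8, so power-of-two sizes (as filesystem block sizes are) see no rounding
waste; a 3*PAGE_SIZE request would be backed by 4 pages.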


* [PATCH v2] ext4: Fix ERR_PTR(0) in ext4_mkdir()
From: Hongling Zeng @ 2026-06-04  7:36 UTC (permalink / raw)
  To: tytso, adilger.kernel, libaokun, jack, ojaswin, ritesh.list,
	yi.zhang, neil, brauner, jlayton
  Cc: linux-ext4, linux-kernel, zhongling0719, Hongling Zeng

When mkdir succeeds, ext4_mkdir() returns ERR_PTR(0) which is incorrect.
It should return NULL instead for success and ERR_PTR() only with
negative error codes for failure.

Fixes: 88d5baf69082 ("Change inode_operations.mkdir to return struct dentry *")
Signed-off-by: Hongling Zeng <zenghongling@kylinos.cn>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Baokun Li <libaokun@linux.alibaba.com>

---
Change in v2:
 -Add pre-reviewer
---
 fs/ext4/namei.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 4a47fbd8dd30..8cadaeb15b2b 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -3054,7 +3054,7 @@ static struct dentry *ext4_mkdir(struct mnt_idmap *idmap, struct inode *dir,
 out_retry:
 	if (err == -ENOSPC && ext4_should_retry_alloc(dir->i_sb, &retries))
 		goto retry;
-	return ERR_PTR(err);
+	return err ? ERR_PTR(err) : NULL;
 }
 
 /*
-- 
2.25.1
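
Note: the return convention the patch spells out can be exercised with a
userspace model of the error-pointer helpers. The toy_* functions below are
hypothetical re-implementations patterned on the kernel's ERR_PTR(),
IS_ERR() and PTR_ERR(); TOY_MAX_ERRNO mirrors the kernel's MAX_ERRNO of
4095. This is a sketch, not the kernel header.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_ERRNO 4095

/* Encode a zero or negative errno value in a pointer. */
static void *toy_err_ptr(long error)
{
	assert(error <= 0 && error >= -TOY_MAX_ERRNO);
	return (void *)(intptr_t)error;
}

/* True only for pointers in the top TOY_MAX_ERRNO values of the space. */
static bool toy_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-TOY_MAX_ERRNO;
}

/* Recover the errno value from an error pointer. */
static long toy_ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* The patched convention: NULL on success, an error pointer on failure. */
static void *toy_mkdir_result(int err)
{
	return err ? toy_err_ptr(err) : NULL;
}
```

In this model toy_mkdir_result(0) is NULL and is not an error pointer, while
toy_mkdir_result(-28) round-trips to -28. On the usual ABIs an encoded 0 has
the same bit pattern as NULL, so the change reads as making the convention
explicit; that is an inference from the helper definitions, not a statement
made in the patch.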



* Re: [PATCH v10 00/22] fs-verity support for XFS with post EOF merkle tree
From: Andrey Albershteyn @ 2026-06-04  9:04 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Carlos Maiolino, Christoph Hellwig, Andrey Albershteyn, linux-xfs,
	fsverity, linux-fsdevel, ebiggers, linux-ext4, linux-f2fs-devel,
	linux-btrfs, linux-unionfs, djwong, david
In-Reply-To: <fjdfwhwi4aogyiaoijwvw6w4npuu5mbt6ua6fkhwcp5ajlm543@ume2fkxg36cb>

On 2026-05-28 16:50:45, Andrey Albershteyn wrote:
> On 2026-05-28 14:20:08, Christian Brauner wrote:
> > On Tue, May 26, 2026 at 12:19:43PM +0200, Carlos Maiolino wrote:
> > > On Fri, May 22, 2026 at 02:07:57PM +0200, Christoph Hellwig wrote:
> > > > On Fri, May 22, 2026 at 12:03:20PM +0200, Christian Brauner wrote:
> > > > > > I was expecting this to come through xfs tree too if Eric and Christian
> > > > > > agree.
> > > > > 
> > > > > You may take it through the xfs tree if there are no conflicts with
> > > > > vfs-7.2.iomap. If there are I want to add the iomap changes into
> > > > > vfs-7.2.iomap that you can pull in.
> > > > 
> > > > Merging the iomap bits through the iomap branch might make sense, given
> > > > that iomap usually tends to see quite a bit of activity.
> > > > 
> > > 
> > > That sounds good to me. If you want to go ahead and pull in the iomap
> > > bits, do so, and give me a heads up when you do it so I'll pull your
> > > branch locally.
> > 
> > Great, can the series please be resent based on current vfs-7.2.iomap
> > then please? Because the iomap changes in this series don't apply
> > cleanly on vfs-7.2.iomap so we already have merge conflicts...
> > 
> 
> hmm do you mean this branch?
> https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=vfs-7.2.iomap
> 
> patches 07..09 seems to apply cleanly. The only conflict I see is in
> the overlayfs patch 03. This is because [1] (is in -rc5) is missing
> in vfs-7.2.iomap.
> 
> [1]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/fs/overlayfs/util.c?h=v7.1-rc5&id=0c8c88b8eb82a2a41bec5f17c076d6312dc40316

Christian, ping

Would be nice to have iomap in vfs, so Carlos can pull and test the
rest

-- 
- Andrey



* Re: [PATCH v10 00/22] fs-verity support for XFS with post EOF merkle tree
From: Christian Brauner @ 2026-06-04 12:00 UTC (permalink / raw)
  To: Andrey Albershteyn
  Cc: Carlos Maiolino, Christoph Hellwig, Andrey Albershteyn, linux-xfs,
	fsverity, linux-fsdevel, ebiggers, linux-ext4, linux-f2fs-devel,
	linux-btrfs, linux-unionfs, djwong, david
In-Reply-To: <aiE_YQc6SGSdWlcE@aalbersh-thinkpadx1carbongen13.rmtcz.csb>

On Thu, Jun 04, 2026 at 11:04:32AM +0200, Andrey Albershteyn wrote:
> On 2026-05-28 16:50:45, Andrey Albershteyn wrote:
> > On 2026-05-28 14:20:08, Christian Brauner wrote:
> > > On Tue, May 26, 2026 at 12:19:43PM +0200, Carlos Maiolino wrote:
> > > > On Fri, May 22, 2026 at 02:07:57PM +0200, Christoph Hellwig wrote:
> > > > > On Fri, May 22, 2026 at 12:03:20PM +0200, Christian Brauner wrote:
> > > > > > > I was expecting this to come through xfs tree too if Eric and Christian
> > > > > > > agree.
> > > > > > 
> > > > > > You may take it through the xfs tree if there are no conflicts with
> > > > > > vfs-7.2.iomap. If there are I want to add the iomap changes into
> > > > > > vfs-7.2.iomap that you can pull in.
> > > > > 
> > > > > Merging the iomap bits through the iomap branch might make sense, given
> > > > > that iomap usually tends to see quite a bit of activity.
> > > > > 
> > > > 
> > > > That sounds good to me. If you want to go ahead and pull in the iomap
> > > > bits, do so, and give me a heads up when you do it so I'll pull your
> > > > branch locally.
> > > 
> > > Great, can the series please be resent based on current vfs-7.2.iomap
> > > then please? Because the iomap changes in this series don't apply
> > > cleanly on vfs-7.2.iomap so we already have merge conflicts...
> > > 
> > 
> > hmm do you mean this branch?
> > https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=vfs-7.2.iomap
> > 
> > patches 07..09 seems to apply cleanly. The only conflict I see is in
> > the overlayfs patch 03. This is because [1] (is in -rc5) is missing
> > in vfs-7.2.iomap.
> > 
> > [1]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/fs/overlayfs/util.c?h=v7.1-rc5&id=0c8c88b8eb82a2a41bec5f17c076d6312dc40316
> 
> Christian, ping
> 
> Would be nice to have iomap in vfs, so Carlos can pull and test the
> rest

Applied but note IOMAP_F_FSVERITY
changed from (1U << 10) to (1U << 11) since we have another flag
addition this cycle.



* Re: [PATCH v10 00/22] fs-verity support for XFS with post EOF merkle tree
From: Andrey Albershteyn @ 2026-06-04 12:07 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Carlos Maiolino, Christoph Hellwig, Andrey Albershteyn, linux-xfs,
	fsverity, linux-fsdevel, ebiggers, linux-ext4, linux-f2fs-devel,
	linux-btrfs, linux-unionfs, djwong, david
In-Reply-To: <20260604-holen-rundum-ausfechten-2193b39da363@brauner>

On 2026-06-04 14:00:07, Christian Brauner wrote:
> On Thu, Jun 04, 2026 at 11:04:32AM +0200, Andrey Albershteyn wrote:
> > On 2026-05-28 16:50:45, Andrey Albershteyn wrote:
> > > On 2026-05-28 14:20:08, Christian Brauner wrote:
> > > > On Tue, May 26, 2026 at 12:19:43PM +0200, Carlos Maiolino wrote:
> > > > > On Fri, May 22, 2026 at 02:07:57PM +0200, Christoph Hellwig wrote:
> > > > > > On Fri, May 22, 2026 at 12:03:20PM +0200, Christian Brauner wrote:
> > > > > > > > I was expecting this to come through xfs tree too if Eric and Christian
> > > > > > > > agree.
> > > > > > > 
> > > > > > > You may take it through the xfs tree if there are no conflicts with
> > > > > > > vfs-7.2.iomap. If there are I want to add the iomap changes into
> > > > > > > vfs-7.2.iomap that you can pull in.
> > > > > > 
> > > > > > Merging the iomap bits through the iomap branch might make sense, given
> > > > > > that iomap usually tends to see quite a bit of activity.
> > > > > > 
> > > > > 
> > > > > That sounds good to me. If you want to go ahead and pull in the iomap
> > > > > bits, do so, and give me a heads up when you do it so I'll pull your
> > > > > branch locally.
> > > > 
> > > > Great, can the series please be resent based on current vfs-7.2.iomap
> > > > then please? Because the iomap changes in this series don't apply
> > > > cleanly on vfs-7.2.iomap so we already have merge conflicts...
> > > > 
> > > 
> > > hmm do you mean this branch?
> > > https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=vfs-7.2.iomap
> > > 
> > > patches 07..09 seems to apply cleanly. The only conflict I see is in
> > > the overlayfs patch 03. This is because [1] (is in -rc5) is missing
> > > in vfs-7.2.iomap.
> > > 
> > > [1]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/fs/overlayfs/util.c?h=v7.1-rc5&id=0c8c88b8eb82a2a41bec5f17c076d6312dc40316
> > 
> > Christian, ping
> > 
> > Would be nice to have iomap in vfs, so Carlos can pull and test the
> > rest
> 
> Applied but note IOMAP_F_FSVERITY
> changed from (1U << 10) to (1U << 11) since we have another flag
> addition this cycle.

Oh, I hadn't noticed that, thanks!

-- 
- Andrey



* Re: [PATCH v10 00/22] fs-verity support for XFS with post EOF merkle tree
From: Carlos Maiolino @ 2026-06-04 12:08 UTC (permalink / raw)
  To: Andrey Albershteyn
  Cc: Christian Brauner, Christoph Hellwig, Andrey Albershteyn,
	linux-xfs, fsverity, linux-fsdevel, ebiggers, linux-ext4,
	linux-f2fs-devel, linux-btrfs, linux-unionfs, djwong, david
In-Reply-To: <yezqqgowgmbn2z42zvha7cfcprym5vnurb33brdmooab6csdks@a76a7v6rtywn>

On Thu, Jun 04, 2026 at 02:07:05PM +0200, Andrey Albershteyn wrote:
> On 2026-06-04 14:00:07, Christian Brauner wrote:
> > On Thu, Jun 04, 2026 at 11:04:32AM +0200, Andrey Albershteyn wrote:
> > > On 2026-05-28 16:50:45, Andrey Albershteyn wrote:
> > > > On 2026-05-28 14:20:08, Christian Brauner wrote:
> > > > > On Tue, May 26, 2026 at 12:19:43PM +0200, Carlos Maiolino wrote:
> > > > > > On Fri, May 22, 2026 at 02:07:57PM +0200, Christoph Hellwig wrote:
> > > > > > > On Fri, May 22, 2026 at 12:03:20PM +0200, Christian Brauner wrote:
> > > > > > > > > I was expecting this to come through xfs tree too if Eric and Christian
> > > > > > > > > agree.
> > > > > > > > 
> > > > > > > > You may take it through the xfs tree if there are no conflicts with
> > > > > > > > vfs-7.2.iomap. If there are I want to add the iomap changes into
> > > > > > > > vfs-7.2.iomap that you can pull in.
> > > > > > > 
> > > > > > > Merging the iomap bits through the iomap branch might make sense, given
> > > > > > > that iomap usually tends to see quite a bit of activity.
> > > > > > > 
> > > > > > 
> > > > > > That sounds good to me. If you want to go ahead and pull in the iomap
> > > > > > bits, do so, and give me a heads up when you do it so I'll pull your
> > > > > > branch locally.
> > > > > 
> > > > > Great, can the series please be resent based on current vfs-7.2.iomap
> > > > > then please? Because the iomap changes in this series don't apply
> > > > > cleanly on vfs-7.2.iomap so we already have merge conflicts...
> > > > > 
> > > > 
> > > > hmm do you mean this branch?
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=vfs-7.2.iomap
> > > > 
> > > > patches 07..09 seems to apply cleanly. The only conflict I see is in
> > > > the overlayfs patch 03. This is because [1] (is in -rc5) is missing
> > > > in vfs-7.2.iomap.
> > > > 
> > > > [1]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/fs/overlayfs/util.c?h=v7.1-rc5&id=0c8c88b8eb82a2a41bec5f17c076d6312dc40316
> > > 
> > > Christian, ping
> > > 
> > > Would be nice to have iomap in vfs, so Carlos can pull and test the
> > > rest
> > 
> > Applied but note IOMAP_F_FSVERITY
> > changed from (1U << 10) to (1U << 11) since we have another flag
> > addition this cycle.
> 
> Oh I haven't noticed that, thanks!

Thanks Christian. I'll deal with the rest of the series next week!

> 
> -- 
> - Andrey
> 


* [PATCH v2] common/defrag: skip defrag tests on DAX-enabled filesystems
From: Disha Goel @ 2026-06-04 12:23 UTC (permalink / raw)
  To: fstests
  Cc: linux-ext4, linux-fsdevel, linux-xfs, ritesh.list, ojaswin,
	djwong, Disha Goel

Online defragmentation is not supported on DAX-enabled ext4 filesystems.
The ext4 defrag ioctl (EXT4_IOC_MOVE_EXT) returns EOPNOTSUPP when used
on DAX files.

Add an ext4-specific check in _require_defrag() to skip tests when DAX
is enabled, avoiding false failures on ext4/301-304, ext4/308, and
generic/018.

XFS defrag works with DAX, so this check is ext4-specific.

Signed-off-by: Disha Goel <disgoel@linux.ibm.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
---
Changes in v2:
- Made the check ext4-specific as XFS defrag works with DAX
  (feedback from Darrick)
- Use __scratch_uses_fsdax() instead of grepping MOUNT_OPTIONS
- Removed unnecessary comment as _notrun message is self-explanatory

 common/defrag | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/common/defrag b/common/defrag
index 055d0d0e..f17271cd 100644
--- a/common/defrag
+++ b/common/defrag
@@ -6,6 +6,10 @@
 
 _require_defrag()
 {
+    if [ "$FSTYP" = "ext4" ] && __scratch_uses_fsdax; then
+        _notrun "ext4 online defrag not supported with DAX"
+    fi
+
     case "$FSTYP" in
     xfs)
         # xfs_fsr does preallocates, require "falloc"
-- 
2.45.1


^ permalink raw reply related

* Re: [PATCH 10/17] jbd2: replace __get_free_pages() with kmalloc()
From: Theodore Tso @ 2026-06-04 14:05 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Jan Kara, Mark Fasheh, Joel Becker, Joseph Qi, Ryusuke Konishi,
	Viacheslav Dubeyko, Trond Myklebust, Anna Schumaker, Chuck Lever,
	Jeff Layton, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	Alexander Viro, Christian Brauner, Jan Kara, Dave Kleikamp,
	Miklos Szeredi, Andreas Hindborg, Breno Leitao, Kees Cook,
	Tigran A. Aivazian, linux-kernel, linux-fsdevel, ocfs2-devel,
	linux-nilfs, linux-nfs, jfs-discussion, linux-ext4, linux-mm
In-Reply-To: <aiEX4UTxEnBTjVKo@kernel.org>

On Thu, Jun 04, 2026 at 09:14:57AM +0300, Mike Rapoport wrote:
> There's no memory overhead when order == 1.
> As for the CPU overhead, the difference for the fast path allocations is
> not measurable and for the slow path it is anyway determined by the amount
> of reclaim involved rather than by what allocator is used.

Thanks for confirming!

> Larger allocations (> PAGE_SIZE * 2) go straight to the page allocator.

Another question: Today, we can either use kmalloc() (or
__get_free_pages, previously) or vmalloc().  Is there a way a file
system can say, "give me physically contiguous pages if possible, but
if it's too hard --- with some TBD to specify what 'too hard' means or
can be specified --- fall back to a vmalloc-style approach, with the
page table / TLB overhead that this might imply"?

I suppose we could do it with kmalloc() with some flags to prevent
forced reclaim / compaction, and if that fails, then fall back to
vmalloc().  Is there a better way?

Thanks,

					- Ted

^ permalink raw reply

* Re: [PATCH] jbd2: check for aborted handle in jbd2_journal_dirty_metadata()
From: Theodore Ts'o @ 2026-06-04 14:45 UTC (permalink / raw)
  To: jack, Deepanshu Kartikey
  Cc: Theodore Ts'o, linux-ext4, linux-kernel,
	syzbot+98f651460e558a21baae
In-Reply-To: <20260507050605.50081-1-kartikey406@gmail.com>


On Thu, 07 May 2026 10:36:05 +0530, Deepanshu Kartikey wrote:
> jbd2_journal_dirty_metadata() unconditionally dereferences
> handle->h_transaction at function entry to obtain the journal pointer:
> 
> 	transaction_t *transaction = handle->h_transaction;
> 	journal_t *journal = transaction->t_journal;
> 
> However, h_transaction may legitimately be NULL for an aborted handle.
> The is_handle_aborted() helper in include/linux/jbd2.h explicitly
> treats !h_transaction as one of the aborted states:
> 
> [...]

Applied, thanks!

[1/1] jbd2: check for aborted handle in jbd2_journal_dirty_metadata()
      commit: 8fc197cf366beaabaeb46575c8cf46fe5076b943

Best regards,
-- 
Theodore Ts'o <tytso@mit.edu>

^ permalink raw reply

* Re: [PATCH v2] jbd2: fix integer underflow in jbd2_journal_initialize_fast_commit()
From: Theodore Ts'o @ 2026-06-04 14:45 UTC (permalink / raw)
  To: Jan Kara, Harshad Shirwadkar, Junrui Luo
  Cc: Theodore Ts'o, linux-ext4, linux-kernel, Yuhao Jiang, stable
In-Reply-To: <SYBPR01MB7881663C927DE9D7BBF4D1DFAF062@SYBPR01MB7881.ausprd01.prod.outlook.com>


On Wed, 13 May 2026 17:28:40 +0800, Junrui Luo wrote:
> jbd2_journal_initialize_fast_commit() validates journal capacity by
> checking (journal->j_last - num_fc_blks < JBD2_MIN_JOURNAL_BLOCKS).
> Both j_last and num_fc_blks are unsigned, so when num_fc_blks exceeds
> j_last the subtraction wraps to a large value, bypassing the bounds
> check.
> 
> The resulting underflow corrupts j_last, j_fc_first, and j_free,
> leading to journal abort.
> 
> [...]

Applied, thanks!

[1/1] jbd2: fix integer underflow in jbd2_journal_initialize_fast_commit()
      commit: 289a2ca0c9b7eae74f93fc213b0b971669b8683d

Best regards,
-- 
Theodore Ts'o <tytso@mit.edu>

^ permalink raw reply
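
The wrap described in the commit message above can be reproduced in
user space. This is a minimal sketch, not the kernel code and not
necessarily the fix that was applied: the function names are
hypothetical, uint32_t stands in for the kernel's unsigned types, and
MIN_JOURNAL_BLOCKS is a stand-in for JBD2_MIN_JOURNAL_BLOCKS.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for JBD2_MIN_JOURNAL_BLOCKS; the value is an assumption. */
#define MIN_JOURNAL_BLOCKS 1024u

/*
 * The pattern from the commit message: with unsigned operands the
 * subtraction wraps to a huge value when num_fc_blks > j_last, so the
 * "too small" test is never true and the capacity check passes.
 */
static bool capacity_ok_wrapping(uint32_t j_last, uint32_t num_fc_blks)
{
	return !(j_last - num_fc_blks < MIN_JOURNAL_BLOCKS);
}

/*
 * One wrap-free formulation (illustrative only): reject the oversized
 * request before subtracting.
 */
static bool capacity_ok_checked(uint32_t j_last, uint32_t num_fc_blks)
{
	if (num_fc_blks > j_last)
		return false;
	return j_last - num_fc_blks >= MIN_JOURNAL_BLOCKS;
}
```

With j_last = 100 and num_fc_blks = 200, the first helper accepts the
request because 100 - 200 wraps to a value far above the minimum, while
the second rejects it.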

* Re: [PATCH] ext4: fix fast commit wait/wake bit mapping on 64-bit
From: Theodore Ts'o @ 2026-06-04 14:45 UTC (permalink / raw)
  To: Li Chen
  Cc: Theodore Ts'o, Andreas Dilger, Baokun Li, Jan Kara,
	Ojaswin Mujoo, Ritesh Harjani, Zhang Yi, linux-ext4, linux-kernel,
	Sashiko AI review
In-Reply-To: <20260513085818.552432-1-me@linux.beauty>


On Wed, 13 May 2026 16:58:17 +0800, Li Chen wrote:
> On 64-bit, ext4 dynamic inode states live in the upper half of i_flags,
> and ext4_test_inode_state() applies the corresponding +32 offset.
> 
> The fast-commit wait and wake paths open-coded the wait key with the raw
> EXT4_STATE_* value. Add small helpers for the state wait word and bit,
> and use them for the FC_COMMITTING and FC_FLUSHING_DATA waits so the wait
> key follows the same mapping as the state helpers.
> 
> [...]

Applied, thanks!

[1/1] ext4: fix fast commit wait/wake bit mapping on 64-bit
      commit: 8b3bc93fee6771775243665a0cf31857d6659775

Best regards,
-- 
Theodore Ts'o <tytso@mit.edu>

^ permalink raw reply
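
The mismatch described in the commit message above can be sketched in
user space. This is only an illustration under stated assumptions: a
64-bit flags word whose upper half holds the dynamic state bits, and
hypothetical helper names that are not the ones added by the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* On 64-bit, state bits live in the upper half of the flags word. */
#define STATE_BIT_OFFSET 32u

/* Map a state number to the bit actually used in the 64-bit word. */
static inline unsigned int state_to_bit(unsigned int state)
{
	return state + STATE_BIT_OFFSET;
}

static inline void set_state(uint64_t *flags, unsigned int state)
{
	*flags |= UINT64_C(1) << state_to_bit(state);
}

static inline bool test_state(uint64_t flags, unsigned int state)
{
	return (flags >> state_to_bit(state)) & 1;
}

/* What an open-coded user of the raw state number would look at. */
static inline bool test_bit64(uint64_t word, unsigned int bit)
{
	return (word >> bit) & 1;
}
```

Setting state 3 through the helper sets bit 35 of the word. Code that
uses the raw value 3 as its key refers to a different bit, which is the
kind of disagreement between waiter and waker that the patch removes by
routing both through the same mapping.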

* Re: [PATCH v6 0/2] ext4: add hash Kunit tests and optimize str2hashbuf
From: Theodore Ts'o @ 2026-06-04 14:45 UTC (permalink / raw)
  To: Andreas Dilger, Baokun Li, Jan Kara, Ojaswin Mujoo,
	Ritesh Harjani, Zhang Yi, Guan-Chun Wu
  Cc: Theodore Ts'o, linux-ext4, linux-kernel, edward062254,
	visitorckw, david.laight.linux
In-Reply-To: <20260531080019.3794809-1-409411716@gms.tku.edu.tw>


On Sun, 31 May 2026 16:00:17 +0800, Guan-Chun Wu wrote:
> This series adds Kunit tests for fs/ext4/hash.c and refactors
> the str2hashbuf_{signed,unsigned}() helpers.
> 
> Patch 1 adds test coverage for ext4fs_dirhash(), including the main
> hash variants and relevant edge cases.
> 
> Patch 2 simplifies the str2hashbuf helper implementation by processing
> input in 4-byte chunks and removing function-pointer dispatch. This also
> reduces overhead and shows roughly 2x improvement on longer inputs in
> local testing.
> 
> [...]

Applied, thanks!

[1/2] ext4: add Kunit coverage for directory hash computation
      commit: 3147cac6c1929f26b4687993b8c7af5b7b34496d
[2/2] ext4: improve str2hashbuf by processing 4-byte chunks and removing function pointers
      commit: 3ca1d19c1971ac4f25478eafb741e726bf2d5954

Best regards,
-- 
Theodore Ts'o <tytso@mit.edu>

^ permalink raw reply

* Re: [PATCH 10/17] jbd2: replace __get_free_pages() with kmalloc()
From: Matthew Wilcox @ 2026-06-04 14:46 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Mike Rapoport, Jan Kara, Mark Fasheh, Joel Becker, Joseph Qi,
	Ryusuke Konishi, Viacheslav Dubeyko, Trond Myklebust,
	Anna Schumaker, Chuck Lever, Jeff Layton, NeilBrown,
	Olga Kornievskaia, Dai Ngo, Tom Talpey, Alexander Viro,
	Christian Brauner, Jan Kara, Dave Kleikamp, Miklos Szeredi,
	Andreas Hindborg, Breno Leitao, Kees Cook, Tigran A. Aivazian,
	linux-kernel, linux-fsdevel, ocfs2-devel, linux-nilfs, linux-nfs,
	jfs-discussion, linux-ext4, linux-mm
In-Reply-To: <ximvn6jwgtam665a4droqkp73o55kwvd5uukyidwjesmysobth@oe7rigpsjfkz>

I'm hoping you'll take my "Remove special jbd2 slabs" patch instead of
this one, but answering here anyway ...

On Thu, Jun 04, 2026 at 10:05:52AM -0400, Theodore Tso wrote:
> On Thu, Jun 04, 2026 at 09:14:57AM +0300, Mike Rapoport wrote:
> > There's no memory overhead when order == 1.
> > As for the CPU overhead, the difference for the fast path allocations is
> > not measurable and for the slow path it is anyway determined by the amount
> > of reclaim involved rather than by what allocator is used.
> 
> Thanks for confirming!
> 
> > Larger allocations (> PAGE_SIZE * 2) go straight to the page allocator.

That is a detail subject to change.  I have some ideas ...

What users are guaranteed is that kmalloc returns physically contiguous
memory.  And that if the size is a power of two, it's naturally aligned.

> Another question: Today, we can either use kmalloc() (or
> __get_free_pages, previously) or vmalloc().  Is there a way a file
> system can say, "give me physically contiguous pages if possible, but
> if it's too hard --- with some TBD to specify what 'too hard' means or
> can be specified --- fall back to a vmalloc-style approach, with the
> page table / TLB overhead that this might imply"?
> 
> I suppose we could do it with kmalloc() with some flags to prevent
> forced reclaim / compaction, and if that fails, then fall back to
> vmalloc().  Is there a better way?

I think we'd like to avoid doing that.  A lot of code has various
workarounds for deficiencies in the memory allocator (some of which have
been fixed and thus the workarounds only complicate matters).  If the
memory allocator(s) aren't providing what you need (be it performance
under load, fragmentation avoidance or whatever), it's best to get that
fixed rather than having fallback paths.

There have been people who have suggested "What if folios could be
physically discontiguous", and sometimes I've humoured them, but the
simplifications enabled by requiring folios to be contiguous are quite
immense.

We've been trying to move in the direction of exposing more high-level
APIs so people can say "I want to allocate 10MB of memory but it doesn't
need to be contiguous" and have the allocator either fail the whole
thing up front or make efforts to ensure that you get the whole 10MB.
It's a lot more efficient than calling get_free_page() 2500 times
and possibly having reclaim run a dozen different times.

(anyone else try to create a brd that's actually larger than system ram?
;-)

^ permalink raw reply
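
The "2500 times" figure in the message above is consistent with
covering about 10MB one page at a time. A small sketch of the
arithmetic, assuming 4 KiB pages (the page size is not stated in the
message) and a hypothetical helper name:

```c
#include <stddef.h>

/* Number of page_size-sized allocations needed to cover bytes (rounds up). */
static size_t pages_needed(size_t bytes, size_t page_size)
{
	return (bytes + page_size - 1) / page_size;
}
```

Under that assumption, 10 MiB needs 2560 single-page calls and
10,000,000 bytes needs 2442, so either reading of "10MB" lands near
2500 separate trips into the allocator.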

* Re: [PATCH] ext4: fix LOGFLUSH shutdown ordering to allow ordered-mode data writeback
From: Theodore Ts'o @ 2026-06-04 14:45 UTC (permalink / raw)
  To: linux-ext4, Zhang Yi
  Cc: Theodore Ts'o, linux-fsdevel, linux-kernel, adilger.kernel,
	libaokun, jack, ojaswin, ritesh.list, yi.zhang, yizhang089,
	yangerkun, yukuai
In-Reply-To: <20260424104201.1930823-1-yi.zhang@huaweicloud.com>


On Fri, 24 Apr 2026 18:42:01 +0800, Zhang Yi wrote:
> In EXT4_GOING_FLAGS_LOGFLUSH mode, the EXT4_FLAGS_SHUTDOWN flag was set
> before calling ext4_force_commit().  This caused ordered-mode data
> writeback (triggered by journal commit) to fail with -EIO, since
> ext4_do_writepages() checks for the shutdown flag.  The journal would
> then be aborted prematurely before the commit could succeed.
> 
> Fix this by calling ext4_force_commit() first, then setting the
> shutdown flag, so that pending data can be written back correctly.
> 
> [...]

Applied, thanks!

[1/1] ext4: fix LOGFLUSH shutdown ordering to allow ordered-mode data writeback
      commit: d99748ef1695ce17eaf51c64b7a06952fa7cddab

Best regards,
-- 
Theodore Ts'o <tytso@mit.edu>

^ permalink raw reply
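
The ordering problem described in the commit message above can be
modelled with a toy in user space. This is a sketch under stated
assumptions, not ext4 code: every name is hypothetical, the shutdown
field stands in for EXT4_FLAGS_SHUTDOWN, and the commit is reduced to
the data writeback it triggers in ordered mode.

```c
#include <errno.h>
#include <stdbool.h>

struct toy_fs {
	bool shutdown;		/* stand-in for EXT4_FLAGS_SHUTDOWN */
	bool data_on_disk;	/* set once ordered data has been written back */
};

/* Writeback refuses to run once the shutdown flag is set. */
static int toy_writepages(struct toy_fs *fs)
{
	if (fs->shutdown)
		return -EIO;
	fs->data_on_disk = true;
	return 0;
}

/* In ordered mode the commit has to write back data first. */
static int toy_force_commit(struct toy_fs *fs)
{
	return toy_writepages(fs);
}

/* Old order: flag first, so the commit's writeback fails. */
static int shutdown_flag_first(struct toy_fs *fs)
{
	fs->shutdown = true;
	return toy_force_commit(fs);
}

/* New order: commit first, then set the flag. */
static int shutdown_commit_first(struct toy_fs *fs)
{
	int err = toy_force_commit(fs);

	fs->shutdown = true;
	return err;
}
```

In this model the flag-first order returns -EIO and leaves the data
unwritten, while the commit-first order writes the data and still ends
with the shutdown flag set.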

* Re: [RFC v8 0/7] ext4: fast commit: snapshot inode state for FC log
From: Theodore Ts'o @ 2026-06-04 14:45 UTC (permalink / raw)
  To: Zhang Yi, Andreas Dilger, Li Chen
  Cc: Theodore Ts'o, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, linux-ext4, linux-trace-kernel, linux-kernel
In-Reply-To: <20260515091829.194810-1-me@linux.beauty>


On Fri, 15 May 2026 17:18:20 +0800, Li Chen wrote:
> (This RFC v8 series is rebased onto linux-next master as of 2026-05-09,
> commit e98d21c170b0 ("Add linux-next specific files for 20260508"), and
> depends on patch "ext4: fix fast commit wait/wake bit mapping on
> 64-bit" [0]).
> 
> Zhang Yi in RFC v3 review pointed out that postponing lockdep assertions only
> masks the issue, and that sleeping in ext4_fc_track_inode() while holding
> i_data_sem can form a real ABBA deadlock if the fast commit writer also needs
> i_data_sem while the inode is in FC_COMMITTING.
> 
> [...]

Applied, thanks!

[1/7] ext4: fast commit: snapshot inode state before writing log
      commit: e9c6e0b8e096255feb71ec996c77bdfbe9c36e91
[2/7] ext4: lockdep: handle i_data_sem subclassing for special inodes
      commit: 7f473f971382d73a58e386afa7efdaac294b89f0
[3/7] ext4: fast commit: avoid waiting for FC_COMMITTING
      commit: b3060e96533dc3157fc6d3d45dc19927c566977b
[4/7] ext4: fast commit: avoid self-deadlock in inode snapshotting
      commit: 2b9b216628fd9352f9c791701c8990d05736aa90
[5/7] ext4: fast commit: avoid i_data_sem by dropping ext4_map_blocks() in snapshots
      commit: 22d887e06a57261df58404c8dce50c4ef37549ed
[6/7] ext4: fast commit: add lock_updates tracepoint
      commit: d2f6e83bbbef31169ea363af4277f5c09c914eda
[7/7] ext4: fast commit: export snapshot stats in fc_info
      commit: 56bb0b64f4b198bad5ce674509c10793d471148f

Best regards,
-- 
Theodore Ts'o <tytso@mit.edu>

^ permalink raw reply

* Re: [PATCH v2] common/defrag: skip defrag tests on DAX-enabled filesystems
From: Darrick J. Wong @ 2026-06-04 14:54 UTC (permalink / raw)
  To: Disha Goel
  Cc: fstests, linux-ext4, linux-fsdevel, linux-xfs, ritesh.list,
	ojaswin
In-Reply-To: <20260604122305.39805-1-disgoel@linux.ibm.com>

On Thu, Jun 04, 2026 at 05:53:05PM +0530, Disha Goel wrote:
> Online defragmentation is not supported on ext4 DAX-enabled filesystems.
> The ext4 defrag ioctl (EXT4_IOC_MOVE_EXT) returns EOPNOTSUPP when used
> on DAX files.
> 
> Add an ext4-specific check in _require_defrag() to skip tests when DAX
> is enabled, avoiding false failures on ext4/301-304, ext4/308, and
> generic/018.
> 
> XFS defrag works with DAX, so this check is ext4-specific.
> 
> Signed-off-by: Disha Goel <disgoel@linux.ibm.com>
> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> ---
> Changes in v2:
> - Made the check ext4-specific as XFS defrag works with DAX
>   (feedback from Darrick)
> - Use __scratch_uses_fsdax() instead of grepping MOUNT_OPTIONS
> - Removed unnecessary comment as _notrun message is self-explanatory
> 
>  common/defrag | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/common/defrag b/common/defrag
> index 055d0d0e..f17271cd 100644
> --- a/common/defrag
> +++ b/common/defrag
> @@ -6,6 +6,10 @@
>  
>  _require_defrag()
>  {
> +    if [ "$FSTYP" = "ext4" ] && __scratch_uses_fsdax; then

Shouldn't this be:

	ext4)
		__scratch_uses_fsdax && _notrun "..."
		;;

in the case statement below?

--D

> +        _notrun "ext4 online defrag not supported with DAX"
> +    fi
> +
>      case "$FSTYP" in
>      xfs)
>          # xfs_fsr does preallocates, require "falloc"
> -- 
> 2.45.1
> 

^ permalink raw reply

* [syzbot] [overlayfs?] [ext4?] possible deadlock in lock_two_nondirectories (2)
From: syzbot @ 2026-06-04 21:33 UTC (permalink / raw)
  To: amir73il, linux-ext4, linux-kernel, linux-unionfs, miklos,
	syzkaller-bugs

Hello,

syzbot found the following issue on:

HEAD commit:    ba3e43a9e601 Merge tag 'soc-fixes-7.1-2' of git://git.kern..
git tree:       upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=1033aa56580000
kernel config:  https://syzkaller.appspot.com/x/.config?x=bd38685893011045
dashboard link: https://syzkaller.appspot.com/bug?extid=ad6118a7584b607c67f2
compiler:       Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=17e2f3ec580000
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=174c2a66580000

Downloadable assets:
disk image: https://storage.googleapis.com/syzbot-assets/8759ddf1bfa7/disk-ba3e43a9.raw.xz
vmlinux: https://storage.googleapis.com/syzbot-assets/e2f0e563c705/vmlinux-ba3e43a9.xz
kernel image: https://storage.googleapis.com/syzbot-assets/b40bdb37a0d7/bzImage-ba3e43a9.xz
mounted in repro: https://storage.googleapis.com/syzbot-assets/4074e1f6d9f8/mount_0.gz
  fsck result: failed (log: https://syzkaller.appspot.com/x/fsck.log?x=1103db7e580000)

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+ad6118a7584b607c67f2@syzkaller.appspotmail.com

EXT4-fs: Ignoring removed bh option
EXT4-fs (loop0): stripe (5) is not aligned with cluster size (16), stripe is disabled
EXT4-fs (loop0): mounted filesystem 00000000-0000-0000-0000-000000000000 r/w without journal. Quota mode: none.
======================================================
WARNING: possible circular locking dependency detected
syzkaller #0 Not tainted
------------------------------------------------------
syz.0.22/5968 is trying to acquire lock:
ffff88805aab44a0 (&ovl_i_mutex_key[depth]){+.+.}-{4:4}, at: inode_lock include/linux/fs.h:1029 [inline]
ffff88805aab44a0 (&ovl_i_mutex_key[depth]){+.+.}-{4:4}, at: lock_two_nondirectories+0xe7/0x180 fs/inode.c:1254

but task is already holding lock:
ffff88803ea9c480 (sb_writers#4){.+.+}-{0:0}, at: mnt_want_write_file+0x63/0x210 fs/namespace.c:537

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #1 (sb_writers#4){.+.+}-{0:0}:
       percpu_down_read_internal include/linux/percpu-rwsem.h:53 [inline]
       percpu_down_read_freezable include/linux/percpu-rwsem.h:83 [inline]
       __sb_start_write include/linux/fs/super.h:19 [inline]
       sb_start_write+0x4d/0x1c0 include/linux/fs/super.h:125
       file_start_write include/linux/fs.h:2724 [inline]
       vfs_iter_write+0x1f8/0x610 fs/read_write.c:982
       do_backing_file_write_iter fs/backing-file.c:226 [inline]
       backing_file_write_iter+0x5e7/0x950 fs/backing-file.c:274
       ovl_write_iter+0x2fd/0x3d0 fs/overlayfs/file.c:370
       new_sync_write fs/read_write.c:595 [inline]
       vfs_write+0x629/0xba0 fs/read_write.c:688
       ksys_pwrite64 fs/read_write.c:795 [inline]
       __do_sys_pwrite64 fs/read_write.c:803 [inline]
       __se_sys_pwrite64 fs/read_write.c:800 [inline]
       __x64_sys_pwrite64+0x19c/0x230 fs/read_write.c:800
       do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
       do_syscall_64+0x174/0x580 arch/x86/entry/syscall_64.c:94
       entry_SYSCALL_64_after_hwframe+0x77/0x7f

-> #0 (&ovl_i_mutex_key[depth]){+.+.}-{4:4}:
       check_prev_add kernel/locking/lockdep.c:3165 [inline]
       check_prevs_add kernel/locking/lockdep.c:3284 [inline]
       validate_chain kernel/locking/lockdep.c:3908 [inline]
       __lock_acquire+0x15a5/0x2cf0 kernel/locking/lockdep.c:5237
       lock_acquire+0x106/0x350 kernel/locking/lockdep.c:5868
       down_write+0x3a/0x50 kernel/locking/rwsem.c:1625
       inode_lock include/linux/fs.h:1029 [inline]
       lock_two_nondirectories+0xe7/0x180 fs/inode.c:1254
       ext4_move_extents+0x20f/0x3950 fs/ext4/move_extent.c:589
       __ext4_ioctl fs/ext4/ioctl.c:1657 [inline]
       ext4_ioctl+0x3092/0x4b40 fs/ext4/ioctl.c:1922
       vfs_ioctl fs/ioctl.c:51 [inline]
       __do_sys_ioctl fs/ioctl.c:597 [inline]
       __se_sys_ioctl+0xff/0x170 fs/ioctl.c:583
       do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
       do_syscall_64+0x174/0x580 arch/x86/entry/syscall_64.c:94
       entry_SYSCALL_64_after_hwframe+0x77/0x7f

other info that might help us debug this:

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  rlock(sb_writers#4);
                               lock(&ovl_i_mutex_key[depth]);
                               lock(sb_writers#4);
  lock(&ovl_i_mutex_key[depth]);

 *** DEADLOCK ***

1 lock held by syz.0.22/5968:
 #0: ffff88803ea9c480 (sb_writers#4){.+.+}-{0:0}, at: mnt_want_write_file+0x63/0x210 fs/namespace.c:537

stack backtrace:
CPU: 0 UID: 0 PID: 5968 Comm: syz.0.22 Not tainted syzkaller #0 PREEMPT_{RT,(full)} 
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 04/18/2026
Call Trace:
 <TASK>
 dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120
 print_circular_bug+0x2e1/0x300 kernel/locking/lockdep.c:2043
 check_noncircular+0x12e/0x150 kernel/locking/lockdep.c:2175
 check_prev_add kernel/locking/lockdep.c:3165 [inline]
 check_prevs_add kernel/locking/lockdep.c:3284 [inline]
 validate_chain kernel/locking/lockdep.c:3908 [inline]
 __lock_acquire+0x15a5/0x2cf0 kernel/locking/lockdep.c:5237
 lock_acquire+0x106/0x350 kernel/locking/lockdep.c:5868
 down_write+0x3a/0x50 kernel/locking/rwsem.c:1625
 inode_lock include/linux/fs.h:1029 [inline]
 lock_two_nondirectories+0xe7/0x180 fs/inode.c:1254
 ext4_move_extents+0x20f/0x3950 fs/ext4/move_extent.c:589
 __ext4_ioctl fs/ext4/ioctl.c:1657 [inline]
 ext4_ioctl+0x3092/0x4b40 fs/ext4/ioctl.c:1922
 vfs_ioctl fs/ioctl.c:51 [inline]
 __do_sys_ioctl fs/ioctl.c:597 [inline]
 __se_sys_ioctl+0xff/0x170 fs/ioctl.c:583
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x174/0x580 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7fb592a3ce59
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007ffc3443e838 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007fb592cb5fa0 RCX: 00007fb592a3ce59
RDX: 0000200000000040 RSI: 00000000c028660f RDI: 0000000000000005
RBP: 00007fb592ad2d6f R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007fb592cb5fac R14: 00007fb592cb5fa0 R15: 00007fb592cb5fa0
 </TASK>


---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.

If the report is already addressed, let syzbot know by replying with:
#syz fix: exact-commit-title

If you want syzbot to run the reproducer, reply with:
#syz test: git://repo/address.git branch-or-commit-hash
If you attach or paste a git patch, syzbot will apply it before testing.

If you want to overwrite report's subsystems, reply with:
#syz set subsystems: new-subsystem
(See the list of subsystem names on the web dashboard)

If the report is a duplicate of another one, reply with:
#syz dup: exact-subject-of-another-report

If you want to undo deduplication, reply with:
#syz undup

^ permalink raw reply

* Re: [PATCH v2] common/defrag: skip defrag tests on DAX-enabled filesystems
From: Disha Goel @ 2026-06-05  7:02 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: fstests, linux-ext4, linux-fsdevel, linux-xfs, ritesh.list,
	ojaswin
In-Reply-To: <20260604145434.GG6095@frogsfrogsfrogs>

On 04/06/26 8:24 pm, Darrick J. Wong wrote:
> On Thu, Jun 04, 2026 at 05:53:05PM +0530, Disha Goel wrote:
>> Online defragmentation is not supported on ext4 DAX-enabled filesystems.
>> The ext4 defrag ioctl (EXT4_IOC_MOVE_EXT) returns EOPNOTSUPP when used
>> on DAX files.
>>
>> Add an ext4-specific check in _require_defrag() to skip tests when DAX
>> is enabled, avoiding false failures on ext4/301-304, ext4/308, and
>> generic/018.
>>
>> XFS defrag works with DAX, so this check is ext4-specific.
>>
>> Signed-off-by: Disha Goel <disgoel@linux.ibm.com>
>> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
>> ---
>> Changes in v2:
>> - Made the check ext4-specific as XFS defrag works with DAX
>>    (feedback from Darrick)
>> - Use __scratch_uses_fsdax() instead of grepping MOUNT_OPTIONS
>> - Removed unnecessary comment as _notrun message is self-explanatory
>>
>>   common/defrag | 4 ++++
>>   1 file changed, 4 insertions(+)
>>
>> diff --git a/common/defrag b/common/defrag
>> index 055d0d0e..f17271cd 100644
>> --- a/common/defrag
>> +++ b/common/defrag
>> @@ -6,6 +6,10 @@
>>   
>>   _require_defrag()
>>   {
>> +    if [ "$FSTYP" = "ext4" ] && __scratch_uses_fsdax; then
> 
> Shouldn't this be:
> 
> 	ext4)
> 		__scratch_uses_fsdax && _notrun "..."
> 		;;
> 
> in the case statement below?
> 
> --D

Yes, that makes more sense. Keeping the ext4-specific check inside the
ext4 case is cleaner and more consistent with the existing structure.

I'll send v3 with this change.

> 
>> +        _notrun "ext4 online defrag not supported with DAX"
>> +    fi
>> +
>>       case "$FSTYP" in
>>       xfs)
>>           # xfs_fsr does preallocates, require "falloc"
>> -- 
>> 2.45.1
>>

-- 
Regards,
Disha


^ permalink raw reply

