Linux EXT4 FS development

* Re: [PATCH] jbd2: check for aborted handle in jbd2_journal_dirty_metadata()
From: Theodore Ts'o @ 2026-06-04 14:45 UTC (permalink / raw)
  To: jack, Deepanshu Kartikey
  Cc: Theodore Ts'o, linux-ext4, linux-kernel,
	syzbot+98f651460e558a21baae
In-Reply-To: <20260507050605.50081-1-kartikey406@gmail.com>


On Thu, 07 May 2026 10:36:05 +0530, Deepanshu Kartikey wrote:
> jbd2_journal_dirty_metadata() unconditionally dereferences
> handle->h_transaction at function entry to obtain the journal pointer:
> 
> 	transaction_t *transaction = handle->h_transaction;
> 	journal_t *journal = transaction->t_journal;
> 
> However, h_transaction may legitimately be NULL for an aborted handle.
> The is_handle_aborted() helper in include/linux/jbd2.h explicitly
> treats !h_transaction as one of the aborted states:
> 
> [...]

Applied, thanks!

[1/1] jbd2: check for aborted handle in jbd2_journal_dirty_metadata()
      commit: 8fc197cf366beaabaeb46575c8cf46fe5076b943

Best regards,
-- 
Theodore Ts'o <tytso@mit.edu>

* Re: [PATCH 10/17] jbd2: replace __get_free_pages() with kmalloc()
From: Theodore Tso @ 2026-06-04 14:05 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Jan Kara, Mark Fasheh, Joel Becker, Joseph Qi, Ryusuke Konishi,
	Viacheslav Dubeyko, Trond Myklebust, Anna Schumaker, Chuck Lever,
	Jeff Layton, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	Alexander Viro, Christian Brauner, Jan Kara, Dave Kleikamp,
	Miklos Szeredi, Andreas Hindborg, Breno Leitao, Kees Cook,
	Tigran A. Aivazian, linux-kernel, linux-fsdevel, ocfs2-devel,
	linux-nilfs, linux-nfs, jfs-discussion, linux-ext4, linux-mm
In-Reply-To: <aiEX4UTxEnBTjVKo@kernel.org>

On Thu, Jun 04, 2026 at 09:14:57AM +0300, Mike Rapoport wrote:
> There's no memory overhead when order == 1.
> As for the CPU overhead, the difference for the fast path allocations is
> not measurable and for the slow path it is anyway determined by the amount
> of reclaim involved rather than by what allocator is used.

Thanks for confirming!

> Larger allocations (> PAGE_SIZE * 2) go straight to the page allocator.

Another question: Today, we can either use kmalloc() (or
__get_free_pages, previously) or vmalloc().  Is there a way a file
system can say, "give me physically contiguous pages if possible, but
if it's too hard --- with some TBD to specify what 'too hard' means or
can be specified --- fall back to a vmalloc-style approach, with the
page table / TLB overhead that this might imply"?

I suppose we could do it with kmalloc() with some flags to prevent
forced reclaim / compaction, and if that fails, then fall back to
vmalloc().  Is there a better way?
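
For illustration only, here is a userspace model of the fallback shape
described above.  Every name in it (toy_alloc, toy_alloc_contig_cheap,
easy_limit) is hypothetical; malloc() merely stands in for both
allocators, and easy_limit stands in for whatever "too hard" ends up
meaning.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Which (modeled) allocator ended up backing the buffer. */
enum toy_backing { TOY_NONE, TOY_CONTIG, TOY_VIRTUAL };

struct toy_buf {
	void *ptr;
	enum toy_backing backing;
};

/*
 * Hypothetical "cheap" contiguous allocator: it gives up instead of
 * working hard.  easy_limit is an assumption that models the point at
 * which a contiguous allocation is considered too hard.
 */
static void *toy_alloc_contig_cheap(size_t size, size_t easy_limit)
{
	if (size > easy_limit)
		return NULL;
	return malloc(size);
}

/* Try the cheap contiguous path first, then fall back. */
static struct toy_buf toy_alloc(size_t size, size_t easy_limit)
{
	struct toy_buf buf = { NULL, TOY_NONE };

	assert(size > 0);

	buf.ptr = toy_alloc_contig_cheap(size, easy_limit);
	if (buf.ptr) {
		buf.backing = TOY_CONTIG;
		return buf;
	}

	/* Stand-in for a vmalloc-style, virtually contiguous mapping. */
	buf.ptr = malloc(size);
	if (buf.ptr)
		buf.backing = TOY_VIRTUAL;
	return buf;
}

/* In this model both paths free the same way; a real caller may not. */
static void toy_free(struct toy_buf *buf)
{
	free(buf->ptr);
	buf->ptr = NULL;
	buf->backing = TOY_NONE;
}
```

With easy_limit set to 8192, toy_alloc(4096, 8192) reports TOY_CONTIG
and toy_alloc(65536, 8192) reports TOY_VIRTUAL.  The model records the
backing because a real caller would need to free through the matching
path.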

Thanks,

					- Ted

* [PATCH v2] common/defrag: skip defrag tests on DAX-enabled filesystems
From: Disha Goel @ 2026-06-04 12:23 UTC (permalink / raw)
  To: fstests
  Cc: linux-ext4, linux-fsdevel, linux-xfs, ritesh.list, ojaswin,
	djwong, Disha Goel

Online defragmentation is not supported on ext4 DAX-enabled filesystems.
The ext4 defrag ioctl (EXT4_IOC_MOVE_EXT) returns EOPNOTSUPP when used
on DAX files.

Add an ext4-specific check in _require_defrag() to skip tests when DAX
is enabled, avoiding false failures on ext4/301-304, ext4/308, and
generic/018.

XFS defrag works with DAX, so this check is ext4-specific.

Signed-off-by: Disha Goel <disgoel@linux.ibm.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
---
Changes in v2:
- Made the check ext4-specific as XFS defrag works with DAX
  (feedback from Darrick)
- Use __scratch_uses_fsdax() instead of grepping MOUNT_OPTIONS
- Removed unnecessary comment as _notrun message is self-explanatory

 common/defrag | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/common/defrag b/common/defrag
index 055d0d0e..f17271cd 100644
--- a/common/defrag
+++ b/common/defrag
@@ -6,6 +6,10 @@
 
 _require_defrag()
 {
+    if [ "$FSTYP" = "ext4" ] && __scratch_uses_fsdax; then
+        _notrun "ext4 online defrag not supported with DAX"
+    fi
+
     case "$FSTYP" in
     xfs)
         # xfs_fsr does preallocates, require "falloc"
-- 
2.45.1


* Re: [PATCH v10 00/22] fs-verity support for XFS with post EOF merkle tree
From: Carlos Maiolino @ 2026-06-04 12:08 UTC (permalink / raw)
  To: Andrey Albershteyn
  Cc: Christian Brauner, Christoph Hellwig, Andrey Albershteyn,
	linux-xfs, fsverity, linux-fsdevel, ebiggers, linux-ext4,
	linux-f2fs-devel, linux-btrfs, linux-unionfs, djwong, david
In-Reply-To: <yezqqgowgmbn2z42zvha7cfcprym5vnurb33brdmooab6csdks@a76a7v6rtywn>

On Thu, Jun 04, 2026 at 02:07:05PM +0200, Andrey Albershteyn wrote:
> On 2026-06-04 14:00:07, Christian Brauner wrote:
> > On Thu, Jun 04, 2026 at 11:04:32AM +0200, Andrey Albershteyn wrote:
> > > On 2026-05-28 16:50:45, Andrey Albershteyn wrote:
> > > > On 2026-05-28 14:20:08, Christian Brauner wrote:
> > > > > On Tue, May 26, 2026 at 12:19:43PM +0200, Carlos Maiolino wrote:
> > > > > > On Fri, May 22, 2026 at 02:07:57PM +0200, Christoph Hellwig wrote:
> > > > > > > On Fri, May 22, 2026 at 12:03:20PM +0200, Christian Brauner wrote:
> > > > > > > > > I was expecting this to come through xfs tree too if Eric and Christian
> > > > > > > > > agree.
> > > > > > > > 
> > > > > > > > You may take it through the xfs tree if there are no conflicts with
> > > > > > > > vfs-7.2.iomap. If there are I want to add the iomap changes into
> > > > > > > > vfs-7.2.iomap that you can pull in.
> > > > > > > 
> > > > > > > Merging the iomap bits through the iomap branch might make sense, given
> > > > > > > that iomap usually tends to see quite a bit of activity.
> > > > > > > 
> > > > > > 
> > > > > > That sounds good to me. If you want to go ahead and pull in the iomap
> > > > > > bits, do so, and give me a heads up when you do it so I'll pull your
> > > > > > branch locally.
> > > > > 
> > > > > Great, can the series please be resent based on current vfs-7.2.iomap
> > > > > then please? Because the iomap changes in this series don't apply
> > > > > cleanly on vfs-7.2.iomap so we already have merge conflicts...
> > > > > 
> > > > 
> > > > hmm do you mean this branch?
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=vfs-7.2.iomap
> > > > 
> > > > patches 07..09 seems to apply cleanly. The only conflict I see is in
> > > > the overlayfs patch 03. This is because [1] (is in -rc5) is missing
> > > > in vfs-7.2.iomap.
> > > > 
> > > > [1]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/fs/overlayfs/util.c?h=v7.1-rc5&id=0c8c88b8eb82a2a41bec5f17c076d6312dc40316
> > > 
> > > Christian, ping
> > > 
> > > Would be nice to have iomap in vfs, so Carlos can pull and test the
> > > rest
> > 
> > Applied but note IOMAP_F_FSVERITY
> > changed from (1U << 10) to (1U << 11) since we have another flag
> > addition this cycle.
> 
> Oh I haven't noticed that, thanks!

Thanks Christian. I'll deal with the rest of the series next week!

> 
> -- 
> - Andrey
> 

* Re: [PATCH v10 00/22] fs-verity support for XFS with post EOF merkle tree
From: Andrey Albershteyn @ 2026-06-04 12:07 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Carlos Maiolino, Christoph Hellwig, Andrey Albershteyn, linux-xfs,
	fsverity, linux-fsdevel, ebiggers, linux-ext4, linux-f2fs-devel,
	linux-btrfs, linux-unionfs, djwong, david
In-Reply-To: <20260604-holen-rundum-ausfechten-2193b39da363@brauner>

On 2026-06-04 14:00:07, Christian Brauner wrote:
> On Thu, Jun 04, 2026 at 11:04:32AM +0200, Andrey Albershteyn wrote:
> > On 2026-05-28 16:50:45, Andrey Albershteyn wrote:
> > > On 2026-05-28 14:20:08, Christian Brauner wrote:
> > > > On Tue, May 26, 2026 at 12:19:43PM +0200, Carlos Maiolino wrote:
> > > > > On Fri, May 22, 2026 at 02:07:57PM +0200, Christoph Hellwig wrote:
> > > > > > On Fri, May 22, 2026 at 12:03:20PM +0200, Christian Brauner wrote:
> > > > > > > > I was expecting this to come through xfs tree too if Eric and Christian
> > > > > > > > agree.
> > > > > > > 
> > > > > > > You may take it through the xfs tree if there are no conflicts with
> > > > > > > vfs-7.2.iomap. If there are I want to add the iomap changes into
> > > > > > > vfs-7.2.iomap that you can pull in.
> > > > > > 
> > > > > > Merging the iomap bits through the iomap branch might make sense, given
> > > > > > that iomap usually tends to see quite a bit of activity.
> > > > > > 
> > > > > 
> > > > > That sounds good to me. If you want to go ahead and pull in the iomap
> > > > > bits, do so, and give me a heads up when you do it so I'll pull your
> > > > > branch locally.
> > > > 
> > > > Great, can the series please be resent based on current vfs-7.2.iomap
> > > > then please? Because the iomap changes in this series don't apply
> > > > cleanly on vfs-7.2.iomap so we already have merge conflicts...
> > > > 
> > > 
> > > hmm do you mean this branch?
> > > https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=vfs-7.2.iomap
> > > 
> > > patches 07..09 seems to apply cleanly. The only conflict I see is in
> > > the overlayfs patch 03. This is because [1] (is in -rc5) is missing
> > > in vfs-7.2.iomap.
> > > 
> > > [1]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/fs/overlayfs/util.c?h=v7.1-rc5&id=0c8c88b8eb82a2a41bec5f17c076d6312dc40316
> > 
> > Christian, ping
> > 
> > Would be nice to have iomap in vfs, so Carlos can pull and test the
> > rest
> 
> Applied but note IOMAP_F_FSVERITY
> changed from (1U << 10) to (1U << 11) since we have another flag
> addition this cycle.

Oh I haven't noticed that, thanks!

-- 
- Andrey


* Re: [PATCH v10 00/22] fs-verity support for XFS with post EOF merkle tree
From: Christian Brauner @ 2026-06-04 12:00 UTC (permalink / raw)
  To: Andrey Albershteyn
  Cc: Carlos Maiolino, Christoph Hellwig, Andrey Albershteyn, linux-xfs,
	fsverity, linux-fsdevel, ebiggers, linux-ext4, linux-f2fs-devel,
	linux-btrfs, linux-unionfs, djwong, david
In-Reply-To: <aiE_YQc6SGSdWlcE@aalbersh-thinkpadx1carbongen13.rmtcz.csb>

On Thu, Jun 04, 2026 at 11:04:32AM +0200, Andrey Albershteyn wrote:
> On 2026-05-28 16:50:45, Andrey Albershteyn wrote:
> > On 2026-05-28 14:20:08, Christian Brauner wrote:
> > > On Tue, May 26, 2026 at 12:19:43PM +0200, Carlos Maiolino wrote:
> > > > On Fri, May 22, 2026 at 02:07:57PM +0200, Christoph Hellwig wrote:
> > > > > On Fri, May 22, 2026 at 12:03:20PM +0200, Christian Brauner wrote:
> > > > > > > I was expecting this to come through xfs tree too if Eric and Christian
> > > > > > > agree.
> > > > > > 
> > > > > > You may take it through the xfs tree if there are no conflicts with
> > > > > > vfs-7.2.iomap. If there are I want to add the iomap changes into
> > > > > > vfs-7.2.iomap that you can pull in.
> > > > > 
> > > > > Merging the iomap bits through the iomap branch might make sense, given
> > > > > that iomap usually tends to see quite a bit of activity.
> > > > > 
> > > > 
> > > > That sounds good to me. If you want to go ahead and pull in the iomap
> > > > bits, do so, and give me a heads up when you do it so I'll pull your
> > > > branch locally.
> > > 
> > > Great, can the series please be resent based on current vfs-7.2.iomap
> > > then please? Because the iomap changes in this series don't apply
> > > cleanly on vfs-7.2.iomap so we already have merge conflicts...
> > > 
> > 
> > hmm do you mean this branch?
> > https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=vfs-7.2.iomap
> > 
> > patches 07..09 seems to apply cleanly. The only conflict I see is in
> > the overlayfs patch 03. This is because [1] (is in -rc5) is missing
> > in vfs-7.2.iomap.
> > 
> > [1]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/fs/overlayfs/util.c?h=v7.1-rc5&id=0c8c88b8eb82a2a41bec5f17c076d6312dc40316
> 
> Christian, ping
> 
> Would be nice to have iomap in vfs, so Carlos can pull and test the
> rest

Applied but note IOMAP_F_FSVERITY
changed from (1U << 10) to (1U << 11) since we have another flag
addition this cycle.


* Re: [PATCH v10 00/22] fs-verity support for XFS with post EOF merkle tree
From: Andrey Albershteyn @ 2026-06-04  9:04 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Carlos Maiolino, Christoph Hellwig, Andrey Albershteyn, linux-xfs,
	fsverity, linux-fsdevel, ebiggers, linux-ext4, linux-f2fs-devel,
	linux-btrfs, linux-unionfs, djwong, david
In-Reply-To: <fjdfwhwi4aogyiaoijwvw6w4npuu5mbt6ua6fkhwcp5ajlm543@ume2fkxg36cb>

On 2026-05-28 16:50:45, Andrey Albershteyn wrote:
> On 2026-05-28 14:20:08, Christian Brauner wrote:
> > On Tue, May 26, 2026 at 12:19:43PM +0200, Carlos Maiolino wrote:
> > > On Fri, May 22, 2026 at 02:07:57PM +0200, Christoph Hellwig wrote:
> > > > On Fri, May 22, 2026 at 12:03:20PM +0200, Christian Brauner wrote:
> > > > > > I was expecting this to come through xfs tree too if Eric and Christian
> > > > > > agree.
> > > > > 
> > > > > You may take it through the xfs tree if there are no conflicts with
> > > > > vfs-7.2.iomap. If there are I want to add the iomap changes into
> > > > > vfs-7.2.iomap that you can pull in.
> > > > 
> > > > Merging the iomap bits through the iomap branch might make sense, given
> > > > that iomap usually tends to see quite a bit of activity.
> > > > 
> > > 
> > > That sounds good to me. If you want to go ahead and pull in the iomap
> > > bits, do so, and give me a heads up when you do it so I'll pull your
> > > branch locally.
> > 
> > Great, can the series please be resent based on current vfs-7.2.iomap
> > then please? Because the iomap changes in this series don't apply
> > cleanly on vfs-7.2.iomap so we already have merge conflicts...
> > 
> 
> hmm do you mean this branch?
> https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=vfs-7.2.iomap
> 
> patches 07..09 seems to apply cleanly. The only conflict I see is in
> the overlayfs patch 03. This is because [1] (is in -rc5) is missing
> in vfs-7.2.iomap.
> 
> [1]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/fs/overlayfs/util.c?h=v7.1-rc5&id=0c8c88b8eb82a2a41bec5f17c076d6312dc40316

Christian, ping

Would be nice to have iomap in vfs, so Carlos can pull and test the
rest

-- 
- Andrey


* [PATCH v2] ext4: Fix ERR_PTR(0) in ext4_mkdir()
From: Hongling Zeng @ 2026-06-04  7:36 UTC (permalink / raw)
  To: tytso, adilger.kernel, libaokun, jack, ojaswin, ritesh.list,
	yi.zhang, neil, brauner, jlayton
  Cc: linux-ext4, linux-kernel, zhongling0719, Hongling Zeng

When mkdir succeeds, ext4_mkdir() returns ERR_PTR(0), which is incorrect.
It should instead return NULL on success and ERR_PTR() only with a
negative error code on failure.

Fixes: 88d5baf69082 ("Change inode_operations.mkdir to return struct dentry *")
Signed-off-by: Hongling Zeng <zenghongling@kylinos.cn>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Baokun Li <libaokun@linux.alibaba.com>

---
Changes in v2:
 - Add Reviewed-by tags from pre-reviewers
---
 fs/ext4/namei.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 4a47fbd8dd30..8cadaeb15b2b 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -3054,7 +3054,7 @@ static struct dentry *ext4_mkdir(struct mnt_idmap *idmap, struct inode *dir,
 out_retry:
 	if (err == -ENOSPC && ext4_should_retry_alloc(dir->i_sb, &retries))
 		goto retry;
-	return ERR_PTR(err);
+	return err ? ERR_PTR(err) : NULL;
 }
 
 /*
-- 
2.25.1
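
As a rough illustration of the return convention this patch relies on,
here is a userspace imitation of the error-pointer helpers.  The
MAX_ERRNO value and the helper bodies are written from memory of the
kernel's err.h and should be treated as assumptions; mkdir_result() is
a hypothetical stand-in for the tail of ext4_mkdir().  In this
imitation ERR_PTR(0) and NULL compare equal, so the sketch is about
making the success path explicit rather than showing a behavioral
difference.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed size of the errno range encoded at the top of the address space. */
#define MAX_ERRNO 4095

/* Encode a negative errno as a pointer value. */
static inline void *ERR_PTR(long error)
{
	return (void *)error;
}

/* Recover the errno from an error pointer. */
static inline long PTR_ERR(const void *ptr)
{
	return (long)ptr;
}

/* True only for pointers in the last MAX_ERRNO bytes of the address space. */
static inline bool IS_ERR(const void *ptr)
{
	return (unsigned long)ptr >= (unsigned long)-MAX_ERRNO;
}

/*
 * Hypothetical stand-in for the end of a mkdir-style operation:
 * NULL on success, an error pointer only for a negative error code.
 */
static void *mkdir_result(int err)
{
	assert(err <= 0);
	return err ? ERR_PTR(err) : NULL;
}
```

A caller would check IS_ERR() on the result and use PTR_ERR() to get
the error code back; a NULL result is not an error under IS_ERR().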


* Re: [PATCH 10/17] jbd2: replace __get_free_pages() with kmalloc()
From: Mike Rapoport @ 2026-06-04  6:14 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Jan Kara, Mark Fasheh, Joel Becker, Joseph Qi, Ryusuke Konishi,
	Viacheslav Dubeyko, Trond Myklebust, Anna Schumaker, Chuck Lever,
	Jeff Layton, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	Alexander Viro, Christian Brauner, Jan Kara, Dave Kleikamp,
	Miklos Szeredi, Andreas Hindborg, Breno Leitao, Kees Cook,
	Tigran A. Aivazian, linux-kernel, linux-fsdevel, ocfs2-devel,
	linux-nilfs, linux-nfs, jfs-discussion, linux-ext4, linux-mm
In-Reply-To: <yfzx3jgzwesernofl7mzixa2mhjfii5v3o7yapghtmozixrpfu@6bsh7iixyiov>

Hi Ted,

On Wed, Jun 03, 2026 at 09:50:15AM -0400, Theodore Tso wrote:
> On Sat, May 23, 2026 at 08:54:22PM +0300, Mike Rapoport (Microsoft) wrote:
> > jbd2_alloc() falls back from kmem_cache_alloc() to __get_free_pages() for
> > allocations larger than PAGE_SIZE.
> > But kmalloc() can handle such cases with essentially the same fallback.
> > 
> > Replace use of __get_free_pages() with kmalloc() and simplify
> > jbd2_free() as both kmem_cache_alloc() and kmalloc() allocations can be
> > freed with kfree().
> > 
> > Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> 
> So historically __get_free_pages() was more efficient than kmalloc
> since previously the kmalloc overhead meant that a single 4k
> allocation would take two pages instead of one.  I'm guessing that has
> since changed?

Today there's no memory overhead for kmalloc(PAGE_SIZE). Cache refill takes
more pages of course, but they will be handed over to the next
kmalloc(PAGE_SIZE).
 
> Can you explain to someone who hasn't been tracking the changes in
> kmalloc over time:
> 
>   * How does the efficiency of kmalloc compare to __get_free_page when
>     order == 1?  What is the overhead in terms of memory overhead?
>     I'm a bit less concerned about CPU overhead, but it would be good
>     to know that?

There's no memory overhead when order == 1.
As for the CPU overhead, the difference for the fast path allocations is
not measurable and for the slow path it is anyway determined by the amount
of reclaim involved rather than by what allocator is used.
 
>   * What does kmalloc() do when a size > PAGE_SIZE is passed?  Will it
>     return contiguous memory, or return an error or worse, BUG?  And
>     same question as above; what is the overhead of kmalloc() when
>     size is 2*PAGE_SIZE?  8*PAGE_SIZE?

For size >= PAGE_SIZE kmalloc() always returns contiguous, page-aligned
memory.

Larger allocations (> PAGE_SIZE * 2) go straight to the page allocator. 

> Thanks,
> 
> 						- Ted

-- 
Sincerely yours,
Mike.

* Re: [PATCH] ext4: avoid full buffer walks for large folio partial writes
From: Matthew Wilcox @ 2026-06-03 18:11 UTC (permalink / raw)
  To: Jia Zhu
  Cc: Theodore Ts'o, Andreas Dilger, Alexander Viro,
	Christian Brauner, Jan Kara, Baokun Li, Ojaswin Mujoo,
	Ritesh Harjani, Zhang Yi, linux-ext4, linux-fsdevel, linux-kernel
In-Reply-To: <20260603134800.25155-1-zhujia.zj@bytedance.com>

On Wed, Jun 03, 2026 at 09:48:00PM +0800, Jia Zhu wrote:
> Ext4 buffered writes into large folios still walk every buffer_head in the
> folio in ext4_block_write_begin() and again in block_commit_write(). Before
> regular files used large folios this was cheap, but a large folio can
> contain hundreds of buffer_heads. Small overwrites of an existing large
> folio therefore pay work proportional to the folio size instead of the
> write size.

Is this a common case for you, or is this something you noticed by
inspection?

> Start the ext4 write_begin walk at the first buffer that overlaps the
> write. For already-uptodate large folio overwrites, add a partial commit
> path which marks only the written buffers uptodate and dirty. Leave
> non-uptodate folios on the old full-buffer commit path so BH_New cleanup
> and folio-uptodate discovery are preserved.

Wouldn't you get just as much benefit from this?

+++ b/fs/buffer.c
@@ -2096,6 +2096,7 @@ void block_commit_write(struct folio *folio, size_t from, size_t to)
 {
        size_t block_start, block_end;
        bool partial = false;
+       bool uptodate = folio_test_uptodate(folio);
        unsigned blocksize;
        struct buffer_head *bh, *head;

@@ -2118,6 +2119,8 @@ void block_commit_write(struct folio *folio, size_t from, size_t to)
                        clear_buffer_new(bh);

                block_start = block_end;
+               if (uptodate && block_start >= to)
+                       break;
                bh = bh->b_this_page;
        } while (bh != head);

> @@ -1191,17 +1191,18 @@ int ext4_block_write_begin(handle_t *handle, struct folio *folio,
>  	head = folio_buffers(folio);
>  	if (!head)
>  		head = create_empty_buffers(folio, blocksize, 0);
> -	block = EXT4_PG_TO_LBLK(inode, folio->index);
> +	if (from == to)
> +		return 0;
> +	block_start = round_down(from, blocksize);
> +	block = EXT4_PG_TO_LBLK(inode, folio->index) +
> +		(block_start >> inode->i_blkbits);
> +	bh = head;
> +	for (i = 0; i < block_start; i += blocksize)
> +		bh = bh->b_this_page;
>  
> -	for (bh = head, block_start = 0; bh != head || !block_start;
> -	    block++, block_start = block_end, bh = bh->b_this_page) {
> +	for (; block_start < to;
> +	     block++, block_start = block_end, bh = bh->b_this_page) {
>  		block_end = block_start + blocksize;
> -		if (block_end <= from || block_start >= to) {
> -			if (folio_test_uptodate(folio)) {
> -				set_buffer_uptodate(bh);
> -			}
> -			continue;
> -		}
>  		if (WARN_ON_ONCE(buffer_new(bh)))
>  			clear_buffer_new(bh);
>  		if (!buffer_mapped(bh)) {
> 

I'm unconvinced that this is safe ... but all of this is a distraction
from what we should really be doing, which is converting ext4 to use
iomap instead of buffer heads.

* Re: [PATCH] common/defrag: Skip defrag tests on DAX-enabled filesystems
From: Disha Goel @ 2026-06-03 14:22 UTC (permalink / raw)
  To: Ojaswin Mujoo; +Cc: fstests, linux-ext4, linux-fsdevel, ritesh.list
In-Reply-To: <ah6yl1T9jnN0wH6d@li-dc0c254c-257c-11b2-a85c-98b6c1322444.ibm.com>

On 02/06/26 4:08 pm, Ojaswin Mujoo wrote:
> On Tue, Jun 02, 2026 at 03:44:18PM +0530, Disha Goel wrote:
>> Online defragmentation is not supported on DAX-enabled filesystems
>> because DAX bypasses the page cache required for defrag operations.
>>
>> Add check in _require_defrag() to skip tests when DAX is enabled,
>> avoiding false failures on ext4/301-304, ext4/308 and generic/018.
>>
>> Signed-off-by: Disha Goel <disgoel@linux.ibm.com>
> 
> Looks good Disha, feel free to add:
> 
> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> 
> One small comment:
>> ---
>>   common/defrag | 4 ++++
>>   1 file changed, 4 insertions(+)
>>
>> diff --git a/common/defrag b/common/defrag
>> index 055d0d0e..28db2f7a 100644
>> --- a/common/defrag
>> +++ b/common/defrag
>> @@ -6,6 +6,10 @@
>>   
>>   _require_defrag()
>>   {
>> +    # Defragmentation is not supported on DAX-enabled filesystems
> 
> I think this comment is not needed as _notrun explains it already

Thanks, I'll fix this in v2.

> 
>> +    if echo "$MOUNT_OPTIONS" | grep -qw "dax"; then
>> +        _notrun "Defragmentation not supported on DAX-enabled filesystem"
>> +    fi
>>       case "$FSTYP" in
>>       xfs)
>>           # xfs_fsr does preallocates, require "falloc"
>> -- 
>> 2.45.1
>>

-- 
Regards,
Disha


* Re: [PATCH] common/defrag: Skip defrag tests on DAX-enabled filesystems
From: Disha Goel @ 2026-06-03 14:20 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: fstests, linux-ext4, linux-fsdevel, ritesh.list, ojaswin, xfs
In-Reply-To: <20260602200930.GB6054@frogsfrogsfrogs>

On 03/06/26 1:39 am, Darrick J. Wong wrote:
> On Tue, Jun 02, 2026 at 03:44:18PM +0530, Disha Goel wrote:
>> Online defragmentation is not supported on DAX-enabled filesystems
>> because DAX bypasses the page cache required for defrag operations.
>>
>> Add check in _require_defrag() to skip tests when DAX is enabled,
>> avoiding false failures on ext4/301-304, ext4/308 and generic/018.
>>
>> Signed-off-by: Disha Goel <disgoel@linux.ibm.com>
>> ---
>>   common/defrag | 4 ++++
>>   1 file changed, 4 insertions(+)
>>
>> diff --git a/common/defrag b/common/defrag
>> index 055d0d0e..28db2f7a 100644
>> --- a/common/defrag
>> +++ b/common/defrag
>> @@ -6,6 +6,10 @@
>>   
>>   _require_defrag()
>>   {
>> +    # Defragmentation is not supported on DAX-enabled filesystems
>> +    if echo "$MOUNT_OPTIONS" | grep -qw "dax"; then
>> +        _notrun "Defragmentation not supported on DAX-enabled filesystem"
>> +    fi
> 
> Defrag doesn't work on XFS on DAX as well?  It seems to work fine on my
> VMs...
> 

Thank you for pointing this out. You're right, I missed that XFS defrag
works fine with DAX.

I'll fix this in v2 to only skip for ext4.

> <confused>
> 
> --D
> 
>>       case "$FSTYP" in
>>       xfs)
>>           # xfs_fsr does preallocates, require "falloc"
>> -- 
>> 2.45.1
>>
>>

-- 
Regards,
Disha


* Re: [PATCH RFC 10/17] ext4: use fast incremental CRC update in ext4_free_inode()
From: Theodore Tso @ 2026-06-03 14:17 UTC (permalink / raw)
  To: sashiko; +Cc: linux-ext4, Baokun Li
In-Reply-To: <20260508233305.EB600C2BCB0@smtp.kernel.org>

On Fri, May 08, 2026 at 11:33:05PM +0000, sashiko-bot@kernel.org wrote:
> Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
> - [High] ext4_inode_bitmap_csum_set_fast() corrupts the inode bitmap checksum if EXT4_INODES_PER_GROUP(sb) is not a multiple of 8.
>
> > -	ext4_inode_bitmap_csum_set(sb, gdp, bitmap_bh);
> > +	ext4_inode_bitmap_csum_set_fast(sb, gdp, bit);
> 
> Can this corrupt the inode bitmap checksum if EXT4_INODES_PER_GROUP(sb) is not
> a multiple of 8?
> 
> In ext4_inode_bitmap_csum_set(), the CRC32c checksum is computed over exactly
> EXT4_INODES_PER_GROUP(sb) >> 3 bytes. If s_inodes_per_group is not a multiple
> of 8, any trailing fractional bits are strictly excluded from the checksum
> calculation.
> 
> Because ext4_fill_super() does not enforce that s_inodes_per_group is a
> multiple of 8, a crafted filesystem can have an unaligned s_inodes_per_group.

The reason ext4_fill_super() doesn't enforce that s_inodes_per_group
is a multiple of 8 is that a long time ago, back when Android was
allergic to GPLv2 in userspace, they implemented their own version of
mke2fs (and didn't run fsck on the file system, sigh).  Their MIT
licensed make_ext4fs would occasionally make file systems whose
s_inodes_per_group was not a multiple of 8, and this ran afoul of
e2fsck[1] if someone actually tried to repair a corrupted Android user
data file system (as opposed to just wiping the flash and starting
from scratch).

[1] https://sourceforge.net/p/e2fsprogs/bugs/292/

This was fixed over a decade ago, so at this point I'm pretty sure any
such mobile handsets are in the landfill, and we probably should fix
this by adding a check in ext4_fill_super() and a corresponding check
in e2fsck.
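
To make the quoted concern concrete, here is a small userspace sketch
of the coverage arithmetic.  The helper names are hypothetical, and it
assumes, per the review quoted above, that the checksum spans
EXT4_INODES_PER_GROUP(sb) >> 3 bytes of the bitmap.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Number of whole bitmap bytes covered when the checksum spans
 * inodes_per_group >> 3 bytes (assumption taken from the review).
 */
static unsigned int csum_covered_bytes(unsigned int inodes_per_group)
{
	return inodes_per_group >> 3;
}

/* True if the byte holding this bit falls inside the checksummed span. */
static bool bit_is_covered(unsigned int inodes_per_group, unsigned int bit)
{
	assert(bit < inodes_per_group);
	return (bit >> 3) < csum_covered_bytes(inodes_per_group);
}
```

With 100 inodes per group, 12 bytes are covered, so bits 0 through 95
are checksummed and bits 96 through 99 are not.  With 128 inodes per
group every bit is covered, which is the property a multiple-of-8
check would guarantee.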

					- Ted

* Re: [PATCH 2/8] ext4: convert mballoc KUnit test to sget_fc()
From: Theodore Tso @ 2026-06-03 13:52 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, Andreas Dilger, Jan Kara, Ritesh Harjani (IBM),
	linux-ext4, linux-cifs, Alexander Viro
In-Reply-To: <20260528-pailletten-gitter-hiermit-5198ec556b28@brauner>

On Thu, May 28, 2026 at 02:02:50PM +0200, Christian Brauner wrote:
> 
> In these two cases it's fine. Because you're just using the allocation
> and deallocation functions to get a fs_context that's basically just an
> empty vessel to get at a superblock via sget_fc() but you're not really
> doing anything with it.

If you're OK with it, I have no objections, but...

I'm sure it's fine today.  But is this something which is documented
to be fine in the future?  It just seems a little fragile and is
contrary to the documentation.

Thanks,

						- Ted

* Re: [PATCH 10/17] jbd2: replace __get_free_pages() with kmalloc()
From: Theodore Tso @ 2026-06-03 13:50 UTC (permalink / raw)
  To: Mike Rapoport (Microsoft)
  Cc: Jan Kara, Mark Fasheh, Joel Becker, Joseph Qi, Ryusuke Konishi,
	Viacheslav Dubeyko, Trond Myklebust, Anna Schumaker, Chuck Lever,
	Jeff Layton, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	Alexander Viro, Christian Brauner, Jan Kara, Dave Kleikamp,
	Miklos Szeredi, Andreas Hindborg, Breno Leitao, Kees Cook,
	Tigran A. Aivazian, linux-kernel, linux-fsdevel, ocfs2-devel,
	linux-nilfs, linux-nfs, jfs-discussion, linux-ext4, linux-mm
In-Reply-To: <20260523-b4-fs-v1-10-275e36a83f0e@kernel.org>

On Sat, May 23, 2026 at 08:54:22PM +0300, Mike Rapoport (Microsoft) wrote:
> jbd2_alloc() falls back from kmem_cache_alloc() to __get_free_pages() for
> allocations larger than PAGE_SIZE.
> But kmalloc() can handle such cases with essentially the same fallback.
> 
> Replace use of __get_free_pages() with kmalloc() and simplify
> jbd2_free() as both kmem_cache_alloc() and kmalloc() allocations can be
> freed with kfree().
> 
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>

So historically __get_free_pages() was more efficient than kmalloc
since previously the kmalloc overhead meant that a single 4k
allocation would take two pages instead of one.  I'm guessing that has
since changed?

Can you explain to someone who hasn't been tracking the changes in
kmalloc over time:

  * How does the efficiency of kmalloc compare to __get_free_page when
    order == 1?  What is the overhead in terms of memory overhead?
    I'm a bit less concerned about CPU overhead, but it would be good
    to know that?

  * What does kmalloc() do when a size > PAGE_SIZE is passed?  Will it
    return contiguous memory, or return an error or worse, BUG?  And
    same question as above; what is the overhead of kmalloc() when
    size is 2*PAGE_SIZE?  8*PAGE_SIZE?

Thanks,

						- Ted

* [PATCH] ext4: avoid full buffer walks for large folio partial writes
From: Jia Zhu @ 2026-06-03 13:48 UTC (permalink / raw)
  To: Theodore Ts'o, Andreas Dilger
  Cc: Alexander Viro, Christian Brauner, Jan Kara, Baokun Li,
	Ojaswin Mujoo, Ritesh Harjani, Zhang Yi, linux-ext4,
	linux-fsdevel, linux-kernel, Jia Zhu

Ext4 buffered writes into large folios still walk every buffer_head in the
folio in ext4_block_write_begin() and again in block_commit_write(). Before
regular files used large folios this was cheap, but a large folio can
contain hundreds of buffer_heads. Small overwrites of an existing large
folio therefore pay work proportional to the folio size instead of the
write size.

This is visible when the page cache is first populated with large folios
and then a small range is overwritten. The numbers below come from a local
libMicro-based microbenchmark. Each round first drops caches, writes a
10 MiB file with dd to instantiate large page-cache folios, and then runs
libMicro's write, pwrite, or writev benchmark for a small buffered
overwrite. The writev cases use libMicro's default vector count of 10.

A representative pwrite round is:

	sync
	echo 3 > /proc/sys/vm/drop_caches
	dd if=/dev/zero of=$file bs=1024k count=10
	taskset -c 0 ./bin/pwrite -H -C 50 -D 3 -S -N pwrite_u1k \
		-s 1k -f $file

To avoid comparing this change with an older kernel, the benchmark uses two
kernels built from the same master tree: one with this change and one with
only this change reverted. With THP=always and 10 dd-prefill rounds, median
latencies were:

			nofix		patched		improvement
	write_u1k	1.418 usec	0.342 usec	75.9%
	write_u10k	1.887 usec	0.409 usec	78.3%
	write_u100k	4.114 usec	2.554 usec	37.9%
	pwrite_u1k	1.677 usec	0.335 usec	80.1%
	pwrite_u10k	1.903 usec	0.410 usec	78.5%
	pwrite_u100k	4.101 usec	2.563 usec	37.5%
	writev_u1k	2.285 usec	0.756 usec	66.9%
	writev_u10k	4.655 usec	3.025 usec	35.0%

Start the ext4 write_begin walk at the first buffer that overlaps the
write. For already-uptodate large folio overwrites, add a partial commit
path which marks only the written buffers uptodate and dirty. Leave
non-uptodate folios on the old full-buffer commit path so BH_New cleanup
and folio-uptodate discovery are preserved.
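
The difference between the two walks can be modeled in userspace as
follows.  This is only a sketch: struct toy_bh and the toy_* helpers
are hypothetical stand-ins for buffer_head and the commit paths, and
the single "dirty" flag stands in for the uptodate/dirty state.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct buffer_head on a circular list. */
struct toy_bh {
	struct toy_bh *b_this_page;
	int dirty;
};

/* Link n buffers into a ring and clear their state. */
static void toy_link(struct toy_bh *bhs, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		bhs[i].b_this_page = &bhs[(i + 1) % n];
		bhs[i].dirty = 0;
	}
}

/* Full walk: test every buffer against [from, to); returns buffers tested. */
static size_t toy_commit_full(struct toy_bh *head, size_t blocksize,
			      size_t from, size_t to)
{
	size_t block_start = 0, visited = 0;
	struct toy_bh *bh = head;

	do {
		size_t block_end = block_start + blocksize;

		if (block_end > from && block_start < to)
			bh->dirty = 1;
		visited++;
		block_start = block_end;
		bh = bh->b_this_page;
	} while (bh != head);
	return visited;
}

/*
 * Ranged walk: hop to the first overlapping buffer, then process only
 * the written range.  The seek still follows b_this_page pointers but
 * changes no state.  Returns the number of buffers processed.
 */
static size_t toy_commit_range(struct toy_bh *head, size_t blocksize,
			       size_t from, size_t to)
{
	size_t nr = from / blocksize;
	size_t block_start = nr * blocksize;
	size_t processed = 0;
	struct toy_bh *bh = head;

	if (from == to)
		return 0;
	while (nr--)
		bh = bh->b_this_page;
	do {
		bh->dirty = 1;
		processed++;
		block_start += blocksize;
		bh = bh->b_this_page;
	} while (block_start < to && bh != head);
	return processed;
}
```

For a ring of 512 buffers of 4096 bytes (an assumed folio geometry), a
1 KiB write inside the third block makes the full walk test all 512
buffers while the ranged walk processes one.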

Partially uptodate large folios remain described by per-buffer state, which
is what block_is_partially_uptodate() and read_folio use for later reads.

Signed-off-by: Jia Zhu <zhujia.zj@bytedance.com>
---
 fs/buffer.c     | 51 +++++++++++++++++++++++++++++++++++++++++++++++++
 fs/ext4/inode.c | 21 ++++++++++----------
 2 files changed, 62 insertions(+), 10 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index b0b3792b1496e..e0c5868b088be 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2092,6 +2092,44 @@ int __block_write_begin(struct folio *folio, loff_t pos, unsigned len,
 }
 EXPORT_SYMBOL(__block_write_begin);
 
+static struct buffer_head *folio_buffer_seek(struct buffer_head *head,
+					     unsigned int blocksize,
+					     size_t offset,
+					     size_t *block_start)
+{
+	size_t nr = offset / blocksize;
+
+	*block_start = nr * blocksize;
+	while (nr--)
+		head = head->b_this_page;
+	return head;
+}
+
+static void block_commit_write_range(struct buffer_head *head,
+				     unsigned int blocksize, size_t from,
+				     size_t to)
+{
+	size_t block_start, block_end;
+	struct buffer_head *bh;
+
+	if (from == to)
+		return;
+	if (WARN_ON_ONCE(to > folio_size(head->b_folio)))
+		return;
+
+	bh = folio_buffer_seek(head, blocksize, from, &block_start);
+	do {
+		block_end = block_start + blocksize;
+		set_buffer_uptodate(bh);
+		mark_buffer_dirty(bh);
+		if (buffer_new(bh))
+			clear_buffer_new(bh);
+
+		block_start = block_end;
+		bh = bh->b_this_page;
+	} while (block_start < to && bh != head);
+}
+
 void block_commit_write(struct folio *folio, size_t from, size_t to)
 {
 	size_t block_start, block_end;
@@ -2104,6 +2142,19 @@ void block_commit_write(struct folio *folio, size_t from, size_t to)
 		return;
 	blocksize = bh->b_size;
 
+	/*
+	 * Large folios can carry hundreds of buffer_heads.  For partial writes,
+	 * keep commit work local to the written range; partially uptodate
+	 * reads remain governed by the buffer state.
+	 */
+	if (folio_test_large(folio) && from < to &&
+	    folio_test_uptodate(folio) &&
+	    to <= folio_size(folio) &&
+	    (from != 0 || to != folio_size(folio))) {
+		block_commit_write_range(head, blocksize, from, to);
+		return;
+	}
+
 	block_start = 0;
 	do {
 		block_end = block_start + blocksize;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index c2c2d6ac7f3d1..e58bba0289eba 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1180,7 +1180,7 @@ int ext4_block_write_begin(handle_t *handle, struct folio *folio,
 	unsigned int blocksize = i_blocksize(inode);
 	struct buffer_head *bh, *head, *wait[2];
 	int nr_wait = 0;
-	int i;
+	unsigned int i;
 	bool should_journal_data = ext4_should_journal_data(inode);
 
 	BUG_ON(!folio_test_locked(folio));
@@ -1191,17 +1191,18 @@ int ext4_block_write_begin(handle_t *handle, struct folio *folio,
 	head = folio_buffers(folio);
 	if (!head)
 		head = create_empty_buffers(folio, blocksize, 0);
-	block = EXT4_PG_TO_LBLK(inode, folio->index);
+	if (from == to)
+		return 0;
+	block_start = round_down(from, blocksize);
+	block = EXT4_PG_TO_LBLK(inode, folio->index) +
+		(block_start >> inode->i_blkbits);
+	bh = head;
+	for (i = 0; i < block_start; i += blocksize)
+		bh = bh->b_this_page;
 
-	for (bh = head, block_start = 0; bh != head || !block_start;
-	    block++, block_start = block_end, bh = bh->b_this_page) {
+	for (; block_start < to;
+	     block++, block_start = block_end, bh = bh->b_this_page) {
 		block_end = block_start + blocksize;
-		if (block_end <= from || block_start >= to) {
-			if (folio_test_uptodate(folio)) {
-				set_buffer_uptodate(bh);
-			}
-			continue;
-		}
 		if (WARN_ON_ONCE(buffer_new(bh)))
 			clear_buffer_new(bh);
 		if (!buffer_mapped(bh)) {

base-commit: e43ffb69e0438cddd72aaa30898b4dc446f664f8
-- 
2.39.5 (Apple Git-154)
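
As a rough illustration of the range-limited walk described in the
commit message, here is a minimal user-space sketch. It is not the
kernel code: struct model_bh, model_ring_init() and model_walk_range()
are hypothetical stand-ins, and only the ring link (b_this_page in the
real buffer_head) is modelled. It seeks to the first buffer overlapping
the write and visits buffers only until the end of the written range.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct buffer_head: only the ring link. */
struct model_bh {
	struct model_bh *next;	/* plays the role of b_this_page */
	int touched;		/* set when the walk visits this buffer */
};

/* Link n buffers into a ring, as the per-folio buffer list is linked. */
static void model_ring_init(struct model_bh *bhs, size_t n)
{
	assert(n > 0);
	for (size_t i = 0; i < n; i++) {
		bhs[i].next = &bhs[(i + 1) % n];
		bhs[i].touched = 0;
	}
}

/*
 * Visit only the buffers overlapping [from, to) and return how many
 * were visited, not counting the pointer hops used to seek.
 */
static size_t model_walk_range(struct model_bh *head, size_t blocksize,
			       size_t from, size_t to)
{
	size_t block_start, visited = 0;
	struct model_bh *bh = head;

	if (from == to)
		return 0;
	block_start = from / blocksize * blocksize;	/* round_down() */
	for (size_t i = 0; i < block_start; i += blocksize)
		bh = bh->next;
	for (; block_start < to; block_start += blocksize, bh = bh->next) {
		bh->touched = 1;
		visited++;
	}
	return visited;
}
```

Assuming, for example, a 2 MiB folio with 4 KiB blocks (512 buffers), a
1 KiB overwrite at offset 8192 visits a single buffer after two seek
hops, instead of running the per-buffer checks on all 512.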

* Re: [PATCH RFC 7/8] erofs: open via dedicated fs bdev helpers
From: Christian Brauner @ 2026-06-03 13:42 UTC (permalink / raw)
  To: Gao Xiang
  Cc: Jens Axboe, Alexander Viro, linux-block, linux-kernel,
	linux-fsdevel, Carlos Maiolino, linux-xfs, Chris Mason,
	David Sterba, linux-btrfs, Theodore Ts'o, linux-ext4,
	Gao Xiang, linux-erofs, Christoph Hellwig, Jan Kara
In-Reply-To: <7c5bfcf0-36a3-4cc6-bf31-6af4fc901c37@linux.alibaba.com>

> May I ask if this is urgent 7.2 work? If not, I could

No no, it's way too late for that this cycle.

> make a preparation patch for the upcoming 7.2 cycle
> to handle erofs_map_dev() failure here so you don't
> need to bother with this in this patchset.

Sounds good. I take it you can just do this yourself without me.

> I will try to find more time to resolve the recent todos

Thanks!

> but I keep getting interrupted by other unrelated stuff.

:)

* Re: [PATCH v2 09/10] ext4: Use mmb infrastructure for inode buffer writeout
From: Theodore Tso @ 2026-06-03 13:41 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Christian Brauner, aivazian.tigran, linux-ext4,
	OGAWA Hirofumi
In-Reply-To: <20260525085821.769119-19-jack@suse.cz>

On Mon, May 25, 2026 at 10:58:15AM +0200, Jan Kara wrote:
> Use mmb inode buffer writeout infrastructure to reliably write out
> inode's inode table block on fsync(2) in nojournal mode (from
> ext4_sync_parent() and ext4_fsync_nojournal()). This significantly
> simplifies the code as we don't have to explicitly handle inode buffer
> writeback in ext4_write_inode() and thus we can also remove
> sync_inode_metadata() calls from ext4_sync_parent() and
> ext4_write_inode() call from ext4_fsync_nojournal().
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

Acked-by: Theodore Ts'o <tytso@mit.edu>

* Re: [PATCH v2 02/10] ext4: Allocate mapping_metadata_bhs struct on demand
From: Theodore Tso @ 2026-06-03 13:41 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Christian Brauner, aivazian.tigran, linux-ext4,
	OGAWA Hirofumi
In-Reply-To: <20260525085821.769119-12-jack@suse.cz>

On Mon, May 25, 2026 at 10:58:08AM +0200, Jan Kara wrote:
> Currently every ext4 inode gets mapping_metadata_bhs struct although it
> is only needed when running without a journal and only for inodes where
> any metadata was dirtied. Allocate mapping_metadata_bhs struct on demand
> when dirtying the first metadata buffer for the inode.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

Acked-by: Theodore Ts'o <tytso@mit.edu>

* Re: [PATCH v4 08/23] ext4: implement buffered write path using iomap
From: Ojaswin Mujoo @ 2026-06-03 11:08 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ritesh.list, djwong, hch, yi.zhang, yizhang089,
	yangerkun, yukuai
In-Reply-To: <cc05c17d-163e-4251-b2c9-aa3a6f9555d7@huaweicloud.com>

On Wed, Jun 03, 2026 at 10:56:34AM +0800, Zhang Yi wrote:
> On 6/2/2026 6:26 PM, Ojaswin Mujoo wrote:
> > On Mon, May 11, 2026 at 03:23:28PM +0800, Zhang Yi wrote:
> >> From: Zhang Yi <yi.zhang@huawei.com>
> >>
> >> Introduce two new iomap_ops instances for ext4 buffered writes:
> >>
> >>  - ext4_iomap_buffered_da_write_ops: for delayed allocation mode, using
> >>    ext4_da_map_blocks() to map delalloc extents.
> >>  - ext4_iomap_buffered_write_ops: for non-delayed allocation mode, using
> >>    ext4_iomap_get_blocks() to directly allocate blocks.
> >>
> >> Also add ext4_iomap_valid() for the iomap infrastructure to check extent
> >> validity.
> >>
> >> Key changes and considerations:
> >>
> >>  - Unwritten extents for new blocks (dioread_nolock always on)
> >>    Since data=ordered mode is not used to prevent stale data exposure in
> >>    the non-delayed allocation path, new blocks are always allocated as
> >>    unwritten extents.
> >>
> >>  - Short write and write failure handling
> >>    a. Delalloc path: On short write or failure, the stale delalloc range
> >>       must be dropped and its space reservation released. Otherwise, a
> >>       clean folio may cover leftover delalloc extents, causing
> >>       inaccurate space reservation accounting.
> >>    b. Non-delalloc path: No cleanup of allocated blocks is needed on
> >>       short write.
> >>
> >>  - Lock ordering reversal
> >>    The folio lock and transaction start ordering is reversed compared to
> >>    the buffer_head buffered write path. To handle this, the journal
> >>    handle must be stopped in iomap_begin() callbacks. The lock ordering
> >>    documentation in super.c has been updated accordingly.
> >>
> >> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> > 
> > I went through this again and after our discussion the changes look
> > okay. Just a small question below but otherwise feel free to add:
> > 
> > Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> 
> Thank you a lot for your careful review!
> 
> > 
> >> ---
> >>  fs/ext4/ext4.h  |   4 ++
> >>  fs/ext4/file.c  |  20 +++++-
> >>  fs/ext4/inode.c | 165 +++++++++++++++++++++++++++++++++++++++++++++++-
> >>  fs/ext4/super.c |  10 ++-
> >>  4 files changed, 192 insertions(+), 7 deletions(-)
> >>
> >> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> >> index 1e27d73d7427..4832e7f7db82 100644
> >> --- a/fs/ext4/ext4.h
> >> +++ b/fs/ext4/ext4.h
> >> @@ -3057,6 +3057,7 @@ int ext4_walk_page_buffers(handle_t *handle,
> >>  int do_journal_get_write_access(handle_t *handle, struct inode *inode,
> >>  				struct buffer_head *bh);
> >>  void ext4_set_inode_mapping_order(struct inode *inode);
> >> +int ext4_nonda_switch(struct super_block *sb);
> >>  #define FALL_BACK_TO_NONDELALLOC 1
> >>  #define CONVERT_INLINE_DATA	 2
> > 
> > <snip>
> > 
> >> +	ext4_set_iomap(inode, iomap, &map, offset, length, flags);
> >> +	return 0;
> >> +}
> >> +
> >> +static int ext4_iomap_buffered_write_begin(struct inode *inode,
> >> +		loff_t offset, loff_t length, unsigned int flags,
> >> +		struct iomap *iomap, struct iomap *srcmap)
> >> +{
> >> +	return ext4_iomap_buffered_do_write_begin(inode, offset, length, flags,
> >> +						  iomap, srcmap, false);
> >> +}
> >> +
> >> +static int ext4_iomap_buffered_da_write_begin(struct inode *inode,
> >> +		loff_t offset, loff_t length, unsigned int flags,
> >> +		struct iomap *iomap, struct iomap *srcmap)
> >> +{
> >> +	return ext4_iomap_buffered_do_write_begin(inode, offset, length, flags,
> >> +						  iomap, srcmap, true);
> >> +}
> >> +
> >> +/*
> >> + * On write failure, drop the stale delayed allocation range and release
> >> + * its reserved space for both start and end blocks. Otherwise, we may
> >> + * leave a range of delayed extents covered by a clean folio, which can
> >> + * result in inaccurate space reservation accounting.
> >> + */
> >> +static void ext4_iomap_punch_delalloc(struct inode *inode, loff_t offset,
> >> +				     loff_t length, struct iomap *iomap)
> >> +{
> >> +	down_write(&EXT4_I(inode)->i_data_sem);
> >> +	ext4_es_remove_extent(inode, offset >> inode->i_blkbits,
> >> +			DIV_ROUND_UP_ULL(length, EXT4_BLOCK_SIZE(inode->i_sb)));
> >> +	up_write(&EXT4_I(inode)->i_data_sem);
> >> +}
> >> +
> >> +static int ext4_iomap_buffered_da_write_end(struct inode *inode, loff_t offset,
> >> +					    loff_t length, ssize_t written,
> >> +					    unsigned int flags,
> >> +					    struct iomap *iomap)
> >> +{
> >> +	loff_t start_byte, end_byte;
> >> +
> >> +	/* If we didn't reserve the blocks, we're not allowed to punch them. */
> >> +	if (iomap->type != IOMAP_DELALLOC || !(iomap->flags & IOMAP_F_NEW))
> > 
> > Will we ever get IOMAP_F_NEW here? I think the da_write_begin() call
> > either creates a new IOMAP_DELALLOC extent or finds older ones which
> > won't have EXT4_MAP_NEW set
> > 
> 
> Oops. This is a bug! In ext4_da_map_blocks(), when allocating a new
> delalloc extent, the EXT4_MAP_NEW flag should be set. If this flag is
> not set, then when a short write occurs, we cannot distinguish whether
> an extent is a pre-existing delalloc extent or a newly allocated one.
> This prevents the subsequent truncate operation from being executed,
> leaving the newly allocated delalloc extent behind. I will fix this in
> the next iteration.

Yes, that's true. I misread the condition and missed that we will always
exit early here :/

Regards,
ojaswin

> 
> Thanks,
> Yi.
> 

* [syzbot ci] Re: fs: support freeze/thaw/mark_dead/sync with shared devices
From: syzbot ci @ 2026-06-03  6:43 UTC (permalink / raw)
  To: axboe, brauner, cem, clm, dsterba, hch, jack, linux-block,
	linux-btrfs, linux-erofs, linux-ext4, linux-fsdevel, linux-kernel,
	linux-xfs, tytso, viro, xiang
  Cc: syzbot, syzkaller-bugs
In-Reply-To: <20260602-work-super-bdev_holder_global-v1-0-bb0fd82f3861@kernel.org>

syzbot ci has tested the following series

[v1] fs: support freeze/thaw/mark_dead/sync with shared devices
https://lore.kernel.org/all/20260602-work-super-bdev_holder_global-v1-0-bb0fd82f3861@kernel.org
* [PATCH RFC 1/8] fs, block: move blk_mode_t and fop_flags_t into <linux/types.h>
* [PATCH RFC 2/8] fs: add a global device to super block hash table
* [PATCH RFC 3/8] fs: refuse to claim any frozen block device
* [PATCH RFC 4/8] xfs: port to fs_bdev_file_open_by_path()
* [PATCH RFC 5/8] btrfs: open via dedicated fs bdev helpers
* [PATCH RFC 6/8] ext4: open via dedicated fs bdev helpers
* [PATCH RFC 7/8] erofs: open via dedicated fs bdev helpers
* [PATCH RFC 8/8] super: make fs_holder_ops private

and found the following issue:
general protection fault in close_fs_devices

Full report is available here:
https://ci.syzbot.org/series/9511f00a-a3c2-44ab-9a0b-2d65de5bbd49

***

general protection fault in close_fs_devices

tree:      bpf-next
URL:       https://kernel.googlesource.com/pub/scm/linux/kernel/git/bpf/bpf-next.git
base:      254f49634ee16a731174d2ae34bc50bd5f45e731
arch:      amd64
compiler:  Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8
config:    https://ci.syzbot.org/builds/4af26755-5773-453e-807d-ee451d2fdec5/config
syz repro: https://ci.syzbot.org/findings/2d8d96f7-d133-47dc-b4ca-5c0c65e1b6c9/syz_repro

btrfs: Deprecated parameter 'usebackuproot'
BTRFS warning: 'usebackuproot' is deprecated, use 'rescue=usebackuproot' instead
BTRFS: device fsid ed167579-eb65-4e76-9a50-61ac97e9b59d devid 1281 transid 8 /dev/loop1 (7:1) scanned by syz.1.18 (5863)
Oops: general protection fault, probably for non-canonical address 0xdffffc00000000f8: 0000 [#1] SMP KASAN PTI
KASAN: null-ptr-deref in range [0x00000000000007c0-0x00000000000007c7]
CPU: 1 UID: 0 PID: 5863 Comm: syz.1.18 Not tainted syzkaller #0 PREEMPT(full) 
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
RIP: 0010:btrfs_close_bdev fs/btrfs/volumes.c:1140 [inline]
RIP: 0010:btrfs_close_one_device fs/btrfs/volumes.c:1161 [inline]
RIP: 0010:close_fs_devices+0x47c/0x860 fs/btrfs/volumes.c:1204
Code: 3c 08 00 74 08 48 89 ef e8 b1 95 38 fe 48 8b 6d 00 b8 c0 07 00 00 48 01 c5 48 89 e8 48 c1 e8 03 48 b9 00 00 00 00 00 fc ff df <80> 3c 08 00 74 08 48 89 ef e8 86 95 38 fe 48 8b 75 00 4c 89 ff e8
RSP: 0018:ffffc90004007a48 EFLAGS: 00010202
RAX: 00000000000000f8 RBX: 1ffff110368c440b RCX: dffffc0000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: 00000000000007c0 R08: ffff8881b462206f R09: 1ffff110368c440d
R10: dffffc0000000000 R11: ffffed10368c440e R12: ffff8881b4622000
R13: ffff8881b4622068 R14: ffff8881b4622058 R15: ffff8881707b7a00
FS:  00007f849d6ce6c0(0000) GS:ffff8882a9292000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f849c786a00 CR3: 00000001bbbcc000 CR4: 00000000000006f0
Call Trace:
 <TASK>
 btrfs_close_devices+0xcd/0x570 fs/btrfs/volumes.c:1219
 btrfs_free_fs_info+0x4f/0x360 fs/btrfs/disk-io.c:1205
 deactivate_locked_super+0xbc/0x130 fs/super.c:477
 btrfs_get_tree_super fs/btrfs/super.c:-1 [inline]
 btrfs_get_tree_subvol fs/btrfs/super.c:2087 [inline]
 btrfs_get_tree+0xca6/0x1910 fs/btrfs/super.c:2121
 vfs_get_tree+0x92/0x2a0 fs/super.c:1928
 fc_mount fs/namespace.c:1193 [inline]
 do_new_mount_fc fs/namespace.c:3758 [inline]
 do_new_mount+0x341/0xd30 fs/namespace.c:3834
 do_mount fs/namespace.c:4167 [inline]
 __do_sys_mount fs/namespace.c:4383 [inline]
 __se_sys_mount+0x31d/0x420 fs/namespace.c:4360
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x15f/0xf80 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f849c79e0ca
Code: 48 c7 c2 e8 ff ff ff f7 d8 64 89 02 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f849d6cde58 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
RAX: ffffffffffffffda RBX: 00007f849d6cdee0 RCX: 00007f849c79e0ca
RDX: 00002000000055c0 RSI: 0000200000000340 RDI: 00007f849d6cdea0
RBP: 00002000000055c0 R08: 00007f849d6cdee0 R09: 0000000000000408
R10: 0000000000000408 R11: 0000000000000246 R12: 0000200000000340
R13: 00007f849d6cdea0 R14: 00000000000055f5 R15: 0000200000000380
 </TASK>
Modules linked in:
---[ end trace 0000000000000000 ]---
RIP: 0010:btrfs_close_bdev fs/btrfs/volumes.c:1140 [inline]
RIP: 0010:btrfs_close_one_device fs/btrfs/volumes.c:1161 [inline]
RIP: 0010:close_fs_devices+0x47c/0x860 fs/btrfs/volumes.c:1204
Code: 3c 08 00 74 08 48 89 ef e8 b1 95 38 fe 48 8b 6d 00 b8 c0 07 00 00 48 01 c5 48 89 e8 48 c1 e8 03 48 b9 00 00 00 00 00 fc ff df <80> 3c 08 00 74 08 48 89 ef e8 86 95 38 fe 48 8b 75 00 4c 89 ff e8
RSP: 0018:ffffc90004007a48 EFLAGS: 00010202

RAX: 00000000000000f8 RBX: 1ffff110368c440b RCX: dffffc0000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: 00000000000007c0 R08: ffff8881b462206f R09: 1ffff110368c440d
R10: dffffc0000000000 R11: ffffed10368c440e R12: ffff8881b4622000
R13: ffff8881b4622068 R14: ffff8881b4622058 R15: ffff8881707b7a00
FS:  00007f849d6ce6c0(0000) GS:ffff8882a9292000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000557941c2b058 CR3: 00000001bbbcc000 CR4: 00000000000006f0
----------------
Code disassembly (best guess):
   0:	3c 08                	cmp    $0x8,%al
   2:	00 74 08 48          	add    %dh,0x48(%rax,%rcx,1)
   6:	89 ef                	mov    %ebp,%edi
   8:	e8 b1 95 38 fe       	call   0xfe3895be
   d:	48 8b 6d 00          	mov    0x0(%rbp),%rbp
  11:	b8 c0 07 00 00       	mov    $0x7c0,%eax
  16:	48 01 c5             	add    %rax,%rbp
  19:	48 89 e8             	mov    %rbp,%rax
  1c:	48 c1 e8 03          	shr    $0x3,%rax
  20:	48 b9 00 00 00 00 00 	movabs $0xdffffc0000000000,%rcx
  27:	fc ff df
* 2a:	80 3c 08 00          	cmpb   $0x0,(%rax,%rcx,1) <-- trapping instruction
  2e:	74 08                	je     0x38
  30:	48 89 ef             	mov    %rbp,%rdi
  33:	e8 86 95 38 fe       	call   0xfe3895be
  38:	48 8b 75 00          	mov    0x0(%rbp),%rsi
  3c:	4c 89 ff             	mov    %r15,%rdi
  3f:	e8                   	.byte 0xe8


***

If these findings have caused you to resend the series or submit a
separate fix, please add the following tag to your commit message:
  Tested-by: syzbot@syzkaller.appspotmail.com

---
This report is generated by a bot. It may contain errors.
syzbot ci engineers can be reached at syzkaller@googlegroups.com.

To test a patch for this bug, please reply with `#syz test`
(should be on a separate line).

The patch should be attached to the email.
Note: arguments like custom git repos and branches are not supported.
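
The faulting address in the report above can be cross-checked by hand.
The sketch below assumes generic KASAN on x86-64, where one shadow byte
covers 8 bytes of memory and the shadow offset is the value loaded into
RCX before the trapping instruction. The helper names are hypothetical
and exist only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Generic KASAN: one shadow byte describes 8 bytes of memory. */
#define KASAN_SHADOW_SCALE_SHIFT 3
/* Shadow offset, as seen in RCX in the register dump. */
#define KASAN_SHADOW_OFFSET 0xdffffc0000000000ULL

/* Address of the shadow byte that describes addr. */
static uint64_t kasan_shadow_addr(uint64_t addr)
{
	return (addr >> KASAN_SHADOW_SCALE_SHIFT) + KASAN_SHADOW_OFFSET;
}

/* Inverse mapping: first memory address covered by a shadow byte. */
static uint64_t kasan_shadow_to_mem(uint64_t shadow)
{
	assert(shadow >= KASAN_SHADOW_OFFSET);
	return (shadow - KASAN_SHADOW_OFFSET) << KASAN_SHADOW_SCALE_SHIFT;
}
```

With RBP = 0x7c0, the shadow address is 0xdffffc00000000f8, which is the
non-canonical address in the Oops line, and the covered range
0x7c0-0x7c7 matches the KASAN null-ptr-deref range. This is consistent
with a member at offset 0x7c0 being read through a NULL pointer.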

* Re: [PATCH v4 08/23] ext4: implement buffered write path using iomap
From: Zhang Yi @ 2026-06-03  2:56 UTC (permalink / raw)
  To: Ojaswin Mujoo
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ritesh.list, djwong, hch, yi.zhang, yizhang089,
	yangerkun, yukuai
In-Reply-To: <ah6vwa_MytxBN-Z8@li-dc0c254c-257c-11b2-a85c-98b6c1322444.ibm.com>

On 6/2/2026 6:26 PM, Ojaswin Mujoo wrote:
> On Mon, May 11, 2026 at 03:23:28PM +0800, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@huawei.com>
>>
>> Introduce two new iomap_ops instances for ext4 buffered writes:
>>
>>  - ext4_iomap_buffered_da_write_ops: for delayed allocation mode, using
>>    ext4_da_map_blocks() to map delalloc extents.
>>  - ext4_iomap_buffered_write_ops: for non-delayed allocation mode, using
>>    ext4_iomap_get_blocks() to directly allocate blocks.
>>
>> Also add ext4_iomap_valid() for the iomap infrastructure to check extent
>> validity.
>>
>> Key changes and considerations:
>>
>>  - Unwritten extents for new blocks (dioread_nolock always on)
>>    Since data=ordered mode is not used to prevent stale data exposure in
>>    the non-delayed allocation path, new blocks are always allocated as
>>    unwritten extents.
>>
>>  - Short write and write failure handling
>>    a. Delalloc path: On short write or failure, the stale delalloc range
>>       must be dropped and its space reservation released. Otherwise, a
>>       clean folio may cover leftover delalloc extents, causing
>>       inaccurate space reservation accounting.
>>    b. Non-delalloc path: No cleanup of allocated blocks is needed on
>>       short write.
>>
>>  - Lock ordering reversal
>>    The folio lock and transaction start ordering is reversed compared to
>>    the buffer_head buffered write path. To handle this, the journal
>>    handle must be stopped in iomap_begin() callbacks. The lock ordering
>>    documentation in super.c has been updated accordingly.
>>
>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> 
> I went through this again and after our discussion the changes look
> okay. Just a small question below but otherwise feel free to add:
> 
> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>

Thank you a lot for your careful review!

> 
>> ---
>>  fs/ext4/ext4.h  |   4 ++
>>  fs/ext4/file.c  |  20 +++++-
>>  fs/ext4/inode.c | 165 +++++++++++++++++++++++++++++++++++++++++++++++-
>>  fs/ext4/super.c |  10 ++-
>>  4 files changed, 192 insertions(+), 7 deletions(-)
>>
>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
>> index 1e27d73d7427..4832e7f7db82 100644
>> --- a/fs/ext4/ext4.h
>> +++ b/fs/ext4/ext4.h
>> @@ -3057,6 +3057,7 @@ int ext4_walk_page_buffers(handle_t *handle,
>>  int do_journal_get_write_access(handle_t *handle, struct inode *inode,
>>  				struct buffer_head *bh);
>>  void ext4_set_inode_mapping_order(struct inode *inode);
>> +int ext4_nonda_switch(struct super_block *sb);
>>  #define FALL_BACK_TO_NONDELALLOC 1
>>  #define CONVERT_INLINE_DATA	 2
> 
> <snip>
> 
>> +	ext4_set_iomap(inode, iomap, &map, offset, length, flags);
>> +	return 0;
>> +}
>> +
>> +static int ext4_iomap_buffered_write_begin(struct inode *inode,
>> +		loff_t offset, loff_t length, unsigned int flags,
>> +		struct iomap *iomap, struct iomap *srcmap)
>> +{
>> +	return ext4_iomap_buffered_do_write_begin(inode, offset, length, flags,
>> +						  iomap, srcmap, false);
>> +}
>> +
>> +static int ext4_iomap_buffered_da_write_begin(struct inode *inode,
>> +		loff_t offset, loff_t length, unsigned int flags,
>> +		struct iomap *iomap, struct iomap *srcmap)
>> +{
>> +	return ext4_iomap_buffered_do_write_begin(inode, offset, length, flags,
>> +						  iomap, srcmap, true);
>> +}
>> +
>> +/*
>> + * On write failure, drop the stale delayed allocation range and release
>> + * its reserved space for both start and end blocks. Otherwise, we may
>> + * leave a range of delayed extents covered by a clean folio, which can
>> + * result in inaccurate space reservation accounting.
>> + */
>> +static void ext4_iomap_punch_delalloc(struct inode *inode, loff_t offset,
>> +				     loff_t length, struct iomap *iomap)
>> +{
>> +	down_write(&EXT4_I(inode)->i_data_sem);
>> +	ext4_es_remove_extent(inode, offset >> inode->i_blkbits,
>> +			DIV_ROUND_UP_ULL(length, EXT4_BLOCK_SIZE(inode->i_sb)));
>> +	up_write(&EXT4_I(inode)->i_data_sem);
>> +}
>> +
>> +static int ext4_iomap_buffered_da_write_end(struct inode *inode, loff_t offset,
>> +					    loff_t length, ssize_t written,
>> +					    unsigned int flags,
>> +					    struct iomap *iomap)
>> +{
>> +	loff_t start_byte, end_byte;
>> +
>> +	/* If we didn't reserve the blocks, we're not allowed to punch them. */
>> +	if (iomap->type != IOMAP_DELALLOC || !(iomap->flags & IOMAP_F_NEW))
> 
> Will we ever get IOMAP_F_NEW here? I think the da_write_begin() call
> either creates a new IOMAP_DELALLOC extent or finds older ones which
> won't have EXT4_MAP_NEW set
> 

Oops. This is a bug! In ext4_da_map_blocks(), when allocating a new
delalloc extent, the EXT4_MAP_NEW flag should be set. If this flag is
not set, then when a short write occurs, we cannot distinguish whether
an extent is a pre-existing delalloc extent or a newly allocated one.
This prevents the subsequent truncate operation from being executed,
leaving the newly allocated delalloc extent behind. I will fix this in
the next iteration.

Thanks,
Yi.
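
To make the early-exit problem discussed above concrete, here is a small
user-space model of the check in the write_end path. It is only a
sketch: the model_* type, flag value and function are hypothetical
stand-ins, not the definitions from <linux/iomap.h>.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the iomap type and flag. */
enum model_iomap_type { MODEL_IOMAP_MAPPED, MODEL_IOMAP_DELALLOC };
#define MODEL_IOMAP_F_NEW 0x1u

struct model_iomap {
	enum model_iomap_type type;
	unsigned int flags;
};

/*
 * Mirrors the quoted condition: only a delalloc extent that was newly
 * reserved by this write may be punched after a short write.
 */
static bool model_may_punch(const struct model_iomap *iomap)
{
	assert(iomap);
	if (iomap->type != MODEL_IOMAP_DELALLOC ||
	    !(iomap->flags & MODEL_IOMAP_F_NEW))
		return false;
	return true;
}
```

If the mapping code never sets the NEW flag on a freshly reserved
delalloc extent, model_may_punch() returns false for every extent, so
the cleanup after a short write is never reached. That is the behaviour
the fix in the next iteration is meant to address.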


* Re: [PATCH v4 08/23] ext4: implement buffered write path using iomap
From: Zhang Yi @ 2026-06-03  1:44 UTC (permalink / raw)
  To: Ojaswin Mujoo
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ritesh.list, djwong, hch, yi.zhang, yizhang089,
	yangerkun, yukuai
In-Reply-To: <ah6q3bx27P1wptTg@li-dc0c254c-257c-11b2-a85c-98b6c1322444.ibm.com>

On 6/2/2026 6:05 PM, Ojaswin Mujoo wrote:
> On Fri, May 29, 2026 at 05:13:55PM +0800, Zhang Yi wrote:
>> Hi, Ojaswin!
>>
>> On 5/27/2026 1:10 AM, Ojaswin Mujoo wrote:
>>> On Mon, May 11, 2026 at 03:23:28PM +0800, Zhang Yi wrote:
>>>> From: Zhang Yi <yi.zhang@huawei.com>
>>>>
>>>> Introduce two new iomap_ops instances for ext4 buffered writes:
>>>>
>>>>  - ext4_iomap_buffered_da_write_ops: for delayed allocation mode, using
>>>>    ext4_da_map_blocks() to map delalloc extents.
>>>>  - ext4_iomap_buffered_write_ops: for non-delayed allocation mode, using
>>>>    ext4_iomap_get_blocks() to directly allocate blocks.
>>>>
>>>> Also add ext4_iomap_valid() for the iomap infrastructure to check extent
>>>> validity.
>>>>
>>>> Key changes and considerations:
>>>>
>>>>  - Unwritten extents for new blocks (dioread_nolock always on)
>>>>    Since data=ordered mode is not used to prevent stale data exposure in
>>>>    the non-delayed allocation path, new blocks are always allocated as
>>>>    unwritten extents.
>>>
>>> Okay makes sense.
>>>
>>>>
>>>>  - Short write and write failure handling
>>>>    a. Delalloc path: On short write or failure, the stale delalloc range
>>>>       must be dropped and its space reservation released. Otherwise, a
>>>>       clean folio may cover leftover delalloc extents, causing
>>>>       inaccurate space reservation accounting.
>>>
>>> Hmm, okay so in the usual buffer head path, seems like during a short
>>> write we still zero the new buffers we couldn't write and keep it dirty
>>> (folio_zero_new_buffers()). This way they are still written back and
>>> the delalloc reservations are used up.
>>>
>>
>> In fact, in the normal buffer head path, writeback does not consume
>> delalloc reservations. Instead, the reservations are retained until the
>> inode is released or the area is written again using delalloc. This is
>> because i_size is not updated during short writes. Therefore, when a
>> zeroed dirty folio is written back, no block mapping is created for it.
>> For details, please see the lblk >= blocks check in
>> mpage_process_page_bufs().
> 
> Oh okay, I see. I'm not very clear on the code path, but what about a case
> where i_size is beyond the short write range?
> 

Yeah, you're right. When i_size extends beyond the short write range,
the delalloc reservation will be consumed during writeback.

Thanks,
Yi.




* Re: [PATCH] common/defrag: Skip defrag tests on DAX-enabled filesystems
From: Darrick J. Wong @ 2026-06-02 20:09 UTC (permalink / raw)
  To: Disha Goel; +Cc: fstests, linux-ext4, linux-fsdevel, ritesh.list, ojaswin, xfs
In-Reply-To: <20260602101418.55131-1-disgoel@linux.ibm.com>

On Tue, Jun 02, 2026 at 03:44:18PM +0530, Disha Goel wrote:
> Online defragmentation is not supported on DAX-enabled filesystems
> because DAX bypasses the page cache required for defrag operations.
> 
> Add check in _require_defrag() to skip tests when DAX is enabled,
> avoiding false failures on ext4/301-304, ext4/308 and generic/018.
> 
> Signed-off-by: Disha Goel <disgoel@linux.ibm.com>
> ---
>  common/defrag | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/common/defrag b/common/defrag
> index 055d0d0e..28db2f7a 100644
> --- a/common/defrag
> +++ b/common/defrag
> @@ -6,6 +6,10 @@
>  
>  _require_defrag()
>  {
> +    # Defragmentation is not supported on DAX-enabled filesystems
> +    if echo "$MOUNT_OPTIONS" | grep -qw "dax"; then
> +        _notrun "Defragmentation not supported on DAX-enabled filesystem"
> +    fi

Defrag doesn't work on XFS on DAX as well?  It seems to work fine on my
VMs...

<confused>

--D

>      case "$FSTYP" in
>      xfs)
>          # xfs_fsr does preallocates, require "falloc"
> -- 
> 2.45.1
> 
> 

* Re: [PATCH RFC 7/8] erofs: open via dedicated fs bdev helpers
From: Gao Xiang @ 2026-06-02 16:25 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jens Axboe, Alexander Viro, linux-block, linux-kernel,
	linux-fsdevel, Carlos Maiolino, linux-xfs, Chris Mason,
	David Sterba, linux-btrfs, Theodore Ts'o, linux-ext4,
	Gao Xiang, linux-erofs, Christoph Hellwig, Jan Kara
In-Reply-To: <20260602-work-super-bdev_holder_global-v1-7-bb0fd82f3861@kernel.org>



On 2026/6/2 18:10, Christian Brauner wrote:
> Route opens through fs_bdev_file_open_by_path() so each external device
> is registered against the correct superblock, and convert the matching
> releases.
> 
> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
> ---
>   fs/erofs/data.c     |  6 +++++
>   fs/erofs/internal.h | 10 ++++++++
>   fs/erofs/super.c    | 66 +++++++++++++++++++++++++++++++++++++++++++----------
>   fs/erofs/zdata.c    | 10 +++++---
>   4 files changed, 77 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/erofs/data.c b/fs/erofs/data.c
> index 44da21c9d777..5220585293df 100644
> --- a/fs/erofs/data.c
> +++ b/fs/erofs/data.c
> @@ -69,6 +69,9 @@ int erofs_init_metabuf(struct erofs_buf *buf, struct super_block *sb,
>   {
>   	struct erofs_sb_info *sbi = EROFS_SB(sb);
>   
> +	if (erofs_is_shutdown(sb))
> +		return -EIO;
> +
>   	buf->file = NULL;
>   	if (in_metabox) {
>   		if (unlikely(!sbi->metabox_inode))
> @@ -236,6 +239,9 @@ int erofs_map_dev(struct super_block *sb, struct erofs_map_dev *map)
>   		}
>   		up_read(&devs->rwsem);
>   	}
> +	if (erofs_is_shutdown(sb) ||
> +	    (map->m_dif && READ_ONCE(map->m_dif->dead)))
> +		return -EIO;

From a quick look at the code, maybe we can add only
the SHUTDOWN status, since I don't think removing an
individual blob device is useful for typical image
use cases. Then there is no need to add `dead` for each
individual extra device.

And could we just bail out if erofs_is_shutdown() at the
very beginning of erofs_map_dev()?

>   	return 0;
>   }
>   

...

> diff --git a/fs/erofs/zdata.c b/fs/erofs/zdata.c
> index 43bb5a6a9924..89ae91935364 100644
> --- a/fs/erofs/zdata.c
> +++ b/fs/erofs/zdata.c
> @@ -1697,11 +1697,15 @@ static void z_erofs_submit_queue(struct z_erofs_frontend *f,
>   			continue;
>   		}
>   
> -		/* no device id here, thus it will always succeed */
>   		mdev = (struct erofs_map_dev) {
>   			.m_pa = round_down(pcl->pos, sb->s_blocksize),
>   		};
> -		(void)erofs_map_dev(sb, &mdev);
> +		if (erofs_map_dev(sb, &mdev)) {
> +			/* the backing device is gone; fail the batch */
> +			q[JQ_SUBMIT]->eio = true;
> +			qtail[JQ_SUBMIT] = &pcl->next;
> +			continue;
> +		}

It needs some injection tests anyway.

May I ask if this is urgent 7.2 work? If not, I could
make a preparation patch for the upcoming 7.2 cycle
to handle erofs_map_dev() failure here so you don't
need to bother with this in this patchset.

I will try to find more time to resolve the recent todos,
but I keep getting interrupted by other unrelated stuff.

Thanks,
Gao Xiang
