Linux EXT4 FS development
 help / color / mirror / Atom feed
* Re: [PATCH v2 2/2] ext4: avoid tail write_begin walk for uptodate folios
From: Jia Zhu @ 2026-06-09  3:54 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jia Zhu, Theodore Ts'o, Andreas Dilger, Matthew Wilcox,
	Alexander Viro, Christian Brauner, Baokun Li, Ojaswin Mujoo,
	Ritesh Harjani, Zhang Yi, linux-ext4, linux-fsdevel, linux-kernel
In-Reply-To: <slupx7cfhq4uuk5bd3xnzu7ybnnqghzrseg3xjao6x7tbj75v3@jxv32j4em2cu>

On Mon, Jun 08, 2026 at 04:29:59PM +0200, Jan Kara wrote:
> You simplify this condition to:
> 
>   block_start < to || (!folio_uptodate && bh != head)
> 
> With this updated feel free to add:
> 
> Reviewed-by: Jan Kara <jack@suse.cz>

Thanks Jan.  I've applied this simplification in v3 and added your
Reviewed-by tag to both patches.

^ permalink raw reply

* [PATCH v3 2/2] ext4: avoid tail write_begin walk for uptodate folios
From: Jia Zhu @ 2026-06-09  3:52 UTC (permalink / raw)
  To: Theodore Ts'o, Andreas Dilger
  Cc: Matthew Wilcox, Alexander Viro, Christian Brauner, Jan Kara,
	Baokun Li, Ojaswin Mujoo, Ritesh Harjani, Zhang Yi, linux-ext4,
	linux-fsdevel, linux-kernel, Jia Zhu, stable
In-Reply-To: <20260609035202.90669-1-zhujia.zj@bytedance.com>

Ext4 buffered writes into large folios also pay a full buffer_head
walk in ext4_block_write_begin().  For a small overwrite of an existing
cached folio, the folio is already uptodate and the write only needs to
prepare the buffers through the written range.  Walking the suffix still
makes the write_begin cost proportional to the folio size.

Before ext4 enabled large folios for regular files, the same loop was
bounded by a single page of buffers.  That commit made the existing
full-folio walk visible as a regression for cached small overwrites.

The suffix walk is needed for non-uptodate folios, where ext4 may have
to submit reads for partial blocks, preserve new-buffer cleanup, and run
error zeroing.  Keep those folios on the old full walk.

For already-uptodate folios, keep the walk starting at the first buffer
rather than seeking directly to from.  This preserves the existing prefix
buffer state handling.  Stop once block_start reaches the end of the
write range, because the skipped suffix would only repeat the
outside-range uptodate handling for buffers beyond @to.

On current master, the libMicro ext4 large-folio overwrite test shows
the following full-series result.  Results are median usecs/call over 10
runs, lower is better:

case        nofix     this series   improvement
write_u1k   1.418     0.3405        76.0%
write_u10k  1.887     0.4175        77.9%
pwrite_u1k  1.6775    0.3390        79.8%
pwrite_u10k 1.9035    0.4130        78.3%

Fixes: 7ac67301e82f0 ("ext4: enable large folio for regular file")
Cc: stable@vger.kernel.org # v6.16+
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jia Zhu <zhujia.zj@bytedance.com>
---
 fs/ext4/inode.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index c2c2d6ac7f3d1..0fccb8f6a2116 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1182,6 +1182,7 @@ int ext4_block_write_begin(handle_t *handle, struct folio *folio,
 	int nr_wait = 0;
 	int i;
 	bool should_journal_data = ext4_should_journal_data(inode);
+	bool folio_uptodate = folio_test_uptodate(folio);
 
 	BUG_ON(!folio_test_locked(folio));
 	BUG_ON(to > folio_size(folio));
@@ -1193,13 +1194,13 @@ int ext4_block_write_begin(handle_t *handle, struct folio *folio,
 		head = create_empty_buffers(folio, blocksize, 0);
 	block = EXT4_PG_TO_LBLK(inode, folio->index);
 
-	for (bh = head, block_start = 0; bh != head || !block_start;
+	for (bh = head, block_start = 0;
+	     block_start < to || (!folio_uptodate && bh != head);
 	    block++, block_start = block_end, bh = bh->b_this_page) {
 		block_end = block_start + blocksize;
 		if (block_end <= from || block_start >= to) {
-			if (folio_test_uptodate(folio)) {
+			if (folio_uptodate)
 				set_buffer_uptodate(bh);
-			}
 			continue;
 		}
 		if (WARN_ON_ONCE(buffer_new(bh)))
@@ -1220,7 +1221,7 @@ int ext4_block_write_begin(handle_t *handle, struct folio *folio,
 				if (should_journal_data)
 					do_journal_get_write_access(handle,
 								    inode, bh);
-				if (folio_test_uptodate(folio)) {
+				if (folio_uptodate) {
 					/*
 					 * Unlike __block_write_begin() we leave
 					 * dirtying of new uptodate buffers to
@@ -1237,7 +1238,7 @@ int ext4_block_write_begin(handle_t *handle, struct folio *folio,
 				continue;
 			}
 		}
-		if (folio_test_uptodate(folio)) {
+		if (folio_uptodate) {
 			set_buffer_uptodate(bh);
 			continue;
 		}
-- 
2.20.1

^ permalink raw reply related

* [PATCH v3 1/2] fs/buffer: avoid tail commit walk for uptodate folios
From: Jia Zhu @ 2026-06-09  3:52 UTC (permalink / raw)
  To: Theodore Ts'o, Andreas Dilger
  Cc: Matthew Wilcox, Alexander Viro, Christian Brauner, Jan Kara,
	Baokun Li, Ojaswin Mujoo, Ritesh Harjani, Zhang Yi, linux-ext4,
	linux-fsdevel, linux-kernel, Jia Zhu, stable
In-Reply-To: <20260609035202.90669-1-zhujia.zj@bytedance.com>

block_commit_write() always walks every buffer_head attached to the
folio.  That was cheap for order-0 folios, but large folios can contain
hundreds of buffer_heads.  For a small buffered overwrite of an
already-uptodate large folio, the commit work is therefore proportional
to the folio size rather than the copied range.

This became visible with ext4 regular-file large folios, where cached
small overwrites reach block_commit_write() through block_write_end().
Before ext4 enabled large folios for regular files, this path was only
hit with order-0 folios for normal ext4 buffered writes, so the full walk
was bounded.  The ext4 large-folio commit is therefore the regression
point for this generic helper cost.

The full walk is still needed when the folio is not uptodate, because
block_commit_write() uses per-buffer uptodate state to decide whether
the whole folio can be marked uptodate.  Keep those folios on the old
full-buffer path.

For a folio that was already uptodate on entry, the commit no longer
needs tail buffers for folio-uptodate discovery.  The copied range has
already been processed once block_start reaches @to, so stop there and
avoid the suffix walk.

Fixes: 7ac67301e82f0 ("ext4: enable large folio for regular file")
Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: stable@vger.kernel.org # v6.16+
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jia Zhu <zhujia.zj@bytedance.com>
---
 fs/buffer.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/buffer.c b/fs/buffer.c
index b0b3792b1496e..c8c41c799030d 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2096,6 +2096,7 @@ void block_commit_write(struct folio *folio, size_t from, size_t to)
 {
 	size_t block_start, block_end;
 	bool partial = false;
+	bool uptodate = folio_test_uptodate(folio);
 	unsigned blocksize;
 	struct buffer_head *bh, *head;
 
@@ -2118,6 +2119,8 @@ void block_commit_write(struct folio *folio, size_t from, size_t to)
 			clear_buffer_new(bh);
 
 		block_start = block_end;
+		if (uptodate && block_start >= to)
+			break;
 		bh = bh->b_this_page;
 	} while (bh != head);
 
-- 
2.20.1

^ permalink raw reply related

* [PATCH v3 0/2] ext4: avoid tail walks for cached large-folio writes
From: Jia Zhu @ 2026-06-09  3:52 UTC (permalink / raw)
  To: Theodore Ts'o, Andreas Dilger
  Cc: Matthew Wilcox, Alexander Viro, Christian Brauner, Jan Kara,
	Baokun Li, Ojaswin Mujoo, Ritesh Harjani, Zhang Yi, linux-ext4,
	linux-fsdevel, linux-kernel, Jia Zhu

Hi,

This series addresses a buffered-write regression we found during our
v6.12 -> v6.18 LTS upgrade testing on ext4.

The regression is in the remaining buffer_head path.  A small overwrite
of an already cached, uptodate large folio still walks every buffer_head
attached to the folio in both write_begin and write_end.  With order-0
folios this was bounded by the page size.  After ext4 enabled large
folios for regular files, the same loops became proportional to the
folio size.

I agree that converting ext4 buffered I/O to iomap is the right long-term
direction, and that would avoid this problem.  This series is meant as a
small fix for current and LTS kernels that still use the buffer_head path.

Patch 1 follows Willy's suggestion for block_commit_write(): if the folio
was already uptodate on entry, stop the commit walk once the copied range
has been processed.

Patch 2 applies the same conservative shape to ext4_block_write_begin().
It keeps walking from the first buffer, so prefix buffer state handling is
unchanged, and only skips the suffix for folios that were already
uptodate on entry.

The workload is from libMicro, which we use in kernel release testing:

  https://github.com/rzezeski/libMicro

The table below includes the v6.12 baseline from the same release
benchmark.  The v6.12 and v6.18 columns were run with THP=always.  The
last column is v6.18 with this series applied.  Results are usecs/call,
lower is better, and the improvement is relative to unpatched v6.18.

case           v6.12    v6.18   v6.18 + series   improvement
write_u1k      0.609    4.659       0.528           88.7%
write_u10k     1.408    4.869       0.809           83.4%
pwrite_u1k     0.609    4.659       0.538           88.5%
pwrite_u10k    1.399    4.889       0.819           83.2%
writev_u1k     2.238    5.277       1.179           77.7%
writev_u10k   11.057    8.029       4.219           47.5%

For the cases that regressed from v6.12 to v6.18 in this test, this
series brings the v6.18 numbers back below the v6.12 cost.

Previous versions:
v2:
  https://lore.kernel.org/r/20260608120131.45146-1-zhujia.zj@bytedance.com
v1:
  https://lore.kernel.org/r/20260603134800.25155-1-zhujia.zj@bytedance.com

Changes since v2:
- simplify the ext4 loop condition as suggested by Jan;
- add Reviewed-by tags from Jan;
- add stable Cc tags.

Changes since v1:
- replace the ext4 seek-to-@from optimization with a conservative tail
  break that preserves prefix buffer handling;
- add the block_commit_write() tail break suggested by Willy;
- add v6.12 and v6.18 benchmark results for the full series.

Jia Zhu (2):
  fs/buffer: avoid tail commit walk for uptodate folios
  ext4: avoid tail write_begin walk for uptodate folios

 fs/buffer.c     |  3 +++
 fs/ext4/inode.c | 11 ++++++-----
 2 files changed, 9 insertions(+), 5 deletions(-)

-- 
2.20.1

^ permalink raw reply

* Re: [PATCH v2 3/3] ext4: reject mount if inodes per group is not a multiple of inodes per block
From: Zhang Yi @ 2026-06-09  3:33 UTC (permalink / raw)
  To: Baokun Li, linux-ext4
  Cc: tytso, adilger.kernel, jack, ojaswin, ritesh.list, Sashiko
In-Reply-To: <20260608111150.827117-4-libaokun@linux.alibaba.com>

On 6/8/2026 7:11 PM, Baokun Li wrote:
> If s_inodes_per_group is not a multiple of s_inodes_per_block, the
> division that computes s_itb_per_group truncates, reserving fewer blocks
> for the inode table than needed.
> 
> On a crafted filesystem image, this allows __ext4_get_inode_loc() to
> compute a block offset beyond the inode table, reading unrelated data as
> an inode structure.
> 
> Add the missing divisibility check alongside the existing validation in
> ext4_block_group_meta_init().
> 
> Reported-by: Sashiko <sashiko-bot@kernel.org>
> Closes: https://sashiko.dev/#/patchset/20260608061112.392391-1-libaokun%40linux.alibaba.com
> Signed-off-by: Baokun Li <libaokun@linux.alibaba.com>

Looks good to me.

Reviewed-by: Zhang Yi <yi.zhang@huawei.com>

> ---
>  fs/ext4/super.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 3ddcb4a8d4db..5ec9e1ef00c0 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -5306,7 +5306,8 @@ static int ext4_block_group_meta_init(struct super_block *sb, int silent)
>  	}
>  	if (sbi->s_inodes_per_group < sbi->s_inodes_per_block ||
>  	    sbi->s_inodes_per_group > sb->s_blocksize * 8 ||
> -	    sbi->s_inodes_per_group & 7) {
> +	    sbi->s_inodes_per_group & 7 ||
> +	    sbi->s_inodes_per_group % sbi->s_inodes_per_block) {
>  		ext4_msg(sb, KERN_ERR, "invalid inodes per group: %lu",
>  			 sbi->s_inodes_per_group);
>  		return -EINVAL;


^ permalink raw reply

* Re: [PATCH] ext4: validate donor file superblock early in EXT4_IOC_MOVE_EXT
From: Zhou, Yun @ 2026-06-09  3:12 UTC (permalink / raw)
  To: Hillf Danton, Andreas Dilger
  Cc: tytso, jack, Amir Goldstein, linux-ext4, linux-kernel
In-Reply-To: <20260609011522.1708-1-hdanton@sina.com>



On 6/9/26 09:15, Hillf Danton wrote:
> CAUTION: This email comes from a non Wind River email account!
> Do not click links or open attachments unless you recognize the sender and know the content is safe.
>
> On Mon, 8 Jun 2026 12:20:07 -0600 Andreas Dilger wrote:
>> On Jun 8, 2026, at 09:25, Yun Zhou <yun.zhou@windriver.com> wrote:
>>> Reject the EXT4_IOC_MOVE_EXT ioctl early if the donor file does not
>>> belong to the same superblock as the original file.  Currently, this
>>> validation is performed inside ext4_move_extents() by
>>> mext_check_validity(), but only after lock_two_nondirectories() has
>>> already acquired the inode locks.  When the donor fd refers to a file
>>> on a different filesystem (e.g., overlayfs), this late validation
>>> creates a circular lock dependency:
>>>
>>>   CPU0 (overlayfs write)            CPU1 (ext4 ioctl)
>>>   ----                              ----
>>>   inode_lock(ovl_inode)
>>>                                     mnt_want_write_file(filp)
>>>                                       sb_start_write(ext4_sb)   [sb_writers]
>>>     backing_file_write_iter()
>>>       vfs_iter_write(real_file)
>>>         file_start_write(real_file)
>>>           sb_start_write(ext4_sb)   [blocked by freeze]
>>>                                     lock_two_nondirectories()
>>>                                       inode_lock(ovl_inode)     [blocked]
> Why does it exist if the locking order on CPU0 is incorrect?
The locking order on CPU0 (overlayfs write path: inode_lock → sb_writers)
is correct and inherent to stacked filesystems.The actual bug is that CPU1's
execution path should never exist — ext4 ioctl should not be locking an
overlayfs inode in the first place.This happens because the donor fd passed
from userspace is not validated before being used in 
lock_two_nondirectories().

^ permalink raw reply

* Re: [PATCH v2 2/3] ext4: reduce max cluster size to match documented 256MB limit
From: Zhang Yi @ 2026-06-09  2:48 UTC (permalink / raw)
  To: Baokun Li, linux-ext4
  Cc: tytso, adilger.kernel, jack, ojaswin, ritesh.list, Sashiko
In-Reply-To: <20260608111150.827117-3-libaokun@linux.alibaba.com>

Hi, Baokun!

On 6/8/2026 7:11 PM, Baokun Li wrote:
> The mke2fs man page documents:
> 
>   Valid cluster-size values are from 2048 to 256M bytes per cluster.

Hmm, I checked the mke2fs(8)[1] and didn't find this sentence, instead,
it said:

  Valid cluster-size values range from 2 to 32768 times the filesystem
  block size and must be a power of 2.

However, the implementation of mkfs does not seem to respect this
constraint and instead directly uses static macros to limit the
user-specified cluster size.

  #define EXT2_MIN_BLOCK_LOG_SIZE         10      /* 1024 */
  #define EXT2_MIN_CLUSTER_LOG_SIZE       EXT2_MIN_BLOCK_LOG_SIZE
  #define EXT2_MAX_CLUSTER_LOG_SIZE       29      /* 512MB  */

This is confusing, or perhaps I missed something. If I understand
correctly, users can now format an image with a maximum cluster size of
512 MB (I tried it, and it worked). If the kernel's
EXT4_MAX_CLUSTER_LOG_SIZE is limited to 28, this would cause such
existing images to become unmountable, even though I don't think images
with such a large cluster size actually exist in practice. Therefore,
I'm not sure this is safe.

[1] https://man7.org/linux/man-pages/man8/mke2fs.8.html

> 
> but EXT4_MAX_CLUSTER_LOG_SIZE was set to 30 (1GB), allowing crafted
> filesystem images to specify cluster sizes up to 1GB.
> 
> On 32-bit systems with bigalloc enabled, the consistency check in
> ext4_handle_clustersize():
> 
>   s_blocks_per_group == s_clusters_per_group * (clustersize / blocksize)
> 
> can overflow when the cluster ratio is large enough. Since
> s_blocks_per_group is not range-checked in the bigalloc path, the
> wrapped product can pass the consistency check, leading to inconsistent
> group geometry and potential out-of-bounds block allocation.
> 
> Reduce EXT4_MAX_CLUSTER_LOG_SIZE to 28 to match the documented 256MB
> limit. With this cap, the maximum product is:
> 
>   (blocksize * 8) * (256M / blocksize) = 2^31
> 
> which fits safely in a 32-bit unsigned long for all block sizes.
> 
> Reported-by: Sashiko <sashiko-bot@kernel.org>
> Closes: https://sashiko.dev/#/patchset/20260608061112.392391-1-libaokun%40linux.alibaba.com
> Signed-off-by: Baokun Li <libaokun@linux.alibaba.com>
> ---
>  fs/ext4/ext4.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 94283a991e5c..11e41a864db8 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -334,7 +334,7 @@ struct ext4_io_submit {
>  #define	EXT4_MAX_BLOCK_SIZE		65536
>  #define EXT4_MIN_BLOCK_LOG_SIZE		10
>  #define EXT4_MAX_BLOCK_LOG_SIZE		16
> -#define EXT4_MAX_CLUSTER_LOG_SIZE	30
> +#define EXT4_MAX_CLUSTER_LOG_SIZE	28
>  #ifdef __KERNEL__
>  # define EXT4_BLOCK_SIZE(s)		((s)->s_blocksize)
>  #else


^ permalink raw reply

* Re: [PATCH v2 3/3] ext4: reject mount if inodes per group is not a multiple of inodes per block
From: Baokun Li @ 2026-06-09  1:46 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: linux-ext4, tytso, jack, yi.zhang, ojaswin, ritesh.list, Sashiko
In-Reply-To: <097CFB85-3C4A-4BDF-829C-E6B2322386EC@dilger.ca>

On 2026/6/9 02:27, Andreas Dilger wrote:
> On Jun 8, 2026, at 05:11, Baokun Li <libaokun@linux.alibaba.com> wrote:
>> If s_inodes_per_group is not a multiple of s_inodes_per_block, the
>> division that computes s_itb_per_group truncates, reserving fewer blocks
>> for the inode table than needed.
>>
>> On a crafted filesystem image, this allows __ext4_get_inode_loc() to
>> compute a block offset beyond the inode table, reading unrelated data as
>> an inode structure.
>>
>> Add the missing divisibility check alongside the existing validation in
>> ext4_block_group_meta_init().
>>
>> Reported-by: Sashiko <sashiko-bot@kernel.org>
>> Closes: https://sashiko.dev/#/patchset/20260608061112.392391-1-libaokun%40linux.alibaba.com
>> Signed-off-by: Baokun Li <libaokun@linux.alibaba.com>

Hi Andreas,

Thank you for your review!

> Looks good.  Is this also fixed/checked in e2fsprogs?  
Yes, a similar patchset adding analogous checks to e2fsprogs is also in
the works.
> There is really no reason *not* to use all the space in the last itable block for inodes.
>
> Reviewed-by: Andreas Dilger <adilger@dilger.ca <mailto:adilger@dilger.ca>>

Indeed. Images created by mke2fs do not exhibit this issue; it only
occurs with certain crafted, malformed images. Thanks, Baokun


^ permalink raw reply

* Re: [PATCH] ext4: validate donor file superblock early in EXT4_IOC_MOVE_EXT
From: Hillf Danton @ 2026-06-09  1:15 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: tytso, jack, Yun Zhou, Amir Goldstein, linux-ext4, linux-kernel
In-Reply-To: <5D214F11-047A-4414-B9BC-8ED669817145@dilger.ca>

On Mon, 8 Jun 2026 12:20:07 -0600 Andreas Dilger wrote:
> On Jun 8, 2026, at 09:25, Yun Zhou <yun.zhou@windriver.com> wrote:
> > 
> > Reject the EXT4_IOC_MOVE_EXT ioctl early if the donor file does not
> > belong to the same superblock as the original file.  Currently, this
> > validation is performed inside ext4_move_extents() by
> > mext_check_validity(), but only after lock_two_nondirectories() has
> > already acquired the inode locks.  When the donor fd refers to a file
> > on a different filesystem (e.g., overlayfs), this late validation
> > creates a circular lock dependency:
> > 
> >  CPU0 (overlayfs write)            CPU1 (ext4 ioctl)
> >  ----                              ----
> >  inode_lock(ovl_inode)
> >                                    mnt_want_write_file(filp)
> >                                      sb_start_write(ext4_sb)   [sb_writers]
> >    backing_file_write_iter()
> >      vfs_iter_write(real_file)
> >        file_start_write(real_file)
> >          sb_start_write(ext4_sb)   [blocked by freeze]
> >                                    lock_two_nondirectories()
> >                                      inode_lock(ovl_inode)     [blocked]
>
Why does it exist if the locking order on CPU0 is incorrect?

^ permalink raw reply

* [PATCH net v2] ext4: fix out-of-bounds read in ext4_read_inline_dir()
From: Xiang Mei @ 2026-06-09  1:07 UTC (permalink / raw)
  To: linux-ext4
  Cc: Theodore Ts'o, Andreas Dilger, Baokun Li, Jan Kara,
	Ojaswin Mujoo, Ritesh Harjani, Zhang Yi, Weiming Shi, Xiang Mei

ext4_read_inline_dir() reads de->rec_len / de->name past the end of its
inline buffer for a crafted or corrupted inline directory, triggering a
slab-out-of-bounds read during getdents64():

  BUG: KASAN: slab-out-of-bounds in filldir64 (fs/readdir.c:371)
  Read of size 8 at addr ffff88800fd3da3c by task exploit/146
   ...
   kasan_report (mm/kasan/report.c:595)
   filldir64 (fs/readdir.c:371)
   iterate_dir (fs/readdir.c:110)
   ...

The payload is copied into a buffer of exactly inline_size bytes:

	dir_buf = kmalloc(inline_size, GFP_NOFS);

but iteration runs in a logical position space extra_offset bytes larger
than the buffer (extra_size = extra_offset + inline_size), so the synthetic
"." and ".." entries land at the offsets they would have in a block-based
directory. A real dirent is formed at "dir_buf + pos - extra_offset", yet
the loop bounds and the ext4_check_dir_entry() length argument are all
expressed in the larger extra_size. Two reachable sites dereference a
dirent before confirming its physical offset is inside the allocation:

In the main loop, ctx->pos is attacker-controlled via lseek() and the entry
is validated with extra_size, so ext4_check_dir_entry() accepts a dirent
running up to extra_offset bytes past the allocation before its length
check fires. ctx->pos is also a signed loff_t: an lseek() to a small value
below extra_offset makes "ctx->pos - extra_offset" negative, so a check
that only bounds the top of the buffer is bypassed by underflow and de is
formed before dir_buf.

In the cookie-rescan loop, entered when i_version changed since the last
readdir(2), the walk restarts from the beginning with i bounded by
extra_size, so as i approaches extra_size the unconditional read of
de->rec_len runs past the allocation before any validation.

Both are the same defect, logical extra_size space versus the physical
inline_size buffer. In each loop, reject a dirent whose header would not
fit within inline_size before forming de, and in the main loop also reject a
position that underflows below extra_offset. Validate the main-loop entry
against inline_size rather than extra_size. Entries that legitimately fill
the inline data still pass.

Fixes: c4d8b0235aa9 ("ext4: fix readdir error in case inline_data+^dir_index.")
Reported-by: Weiming Shi <bestswngs@gmail.com>
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Xiang Mei <xmei5@asu.edu>
---
v2: check the both bounds

 fs/ext4/inline.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c
index 8045e4ff270c..bbe93f1b56b2 100644
--- a/fs/ext4/inline.c
+++ b/fs/ext4/inline.c
@@ -1454,6 +1454,9 @@ int ext4_read_inline_dir(struct file *file,
 			/* for other entry, the real offset in
 			 * the buf has to be tuned accordingly.
 			 */
+			if (i - extra_offset + ext4_dir_rec_len(1, NULL) >
+			    inline_size)
+				break;
 			de = (struct ext4_dir_entry_2 *)
 				(dir_buf + i - extra_offset);
 			/* It's too expensive to do a full
@@ -1488,10 +1491,20 @@ int ext4_read_inline_dir(struct file *file,
 			continue;
 		}
 
+		/*
+		 * de lives at dir_buf + ctx->pos - extra_offset, so the dirent
+		 * header must fit within inline_size.  ctx->pos is a signed,
+		 * lseek()-controlled loff_t: check the lower bound first, or
+		 * ctx->pos < extra_offset underflows and points de before dir_buf.
+		 */
+		if (ctx->pos < extra_offset ||
+		    ctx->pos - extra_offset + ext4_dir_rec_len(1, NULL) >
+		    inline_size)
+			goto out;
 		de = (struct ext4_dir_entry_2 *)
 			(dir_buf + ctx->pos - extra_offset);
 		if (ext4_check_dir_entry(inode, file, de, iloc.bh, dir_buf,
-					 extra_size, ctx->pos))
+					 inline_size, ctx->pos))
 			goto out;
 		if (le32_to_cpu(de->inode)) {
 			if (!dir_emit(ctx, de->name, de->name_len,
-- 
2.43.0


^ permalink raw reply related

* Re: [PATCH ext4] ext4: fix out-of-bounds read in ext4_read_inline_dir()
From: Xiang Mei @ 2026-06-08 21:09 UTC (permalink / raw)
  To: linux-ext4
  Cc: Theodore Ts'o, Andreas Dilger, Baokun Li, Jan Kara,
	Ojaswin Mujoo, Ritesh Harjani, Zhang Yi, Weiming Shi
In-Reply-To: <20260608210713.1940288-1-xmei5@asu.edu>

On Mon, Jun 08, 2026 at 02:07:13PM -0700, Xiang Mei wrote:
> ext4_read_inline_dir() reads de->rec_len / de->name past the end of its
> inline buffer for a crafted or corrupted inline directory, triggering a
> slab-out-of-bounds read during getdents64():
> 
>   BUG: KASAN: slab-out-of-bounds in filldir64 (fs/readdir.c:371)
>   Read of size 8 at addr ffff88800fd3da3c by task exploit/146
>    ...
>    kasan_report (mm/kasan/report.c:595)
>    filldir64 (fs/readdir.c:371)
>    iterate_dir (fs/readdir.c:110)
>    ...
> 
> The payload is copied into a buffer of exactly inline_size bytes:
> 
> 	dir_buf = kmalloc(inline_size, GFP_NOFS);
> 
> but iteration runs in an inflated logical position space, extra_offset
> bytes larger than the buffer (extra_size = extra_offset + inline_size),
> so the synthetic "." and ".." entries land at the offsets they would have
> in a block-based directory.  A real dirent is therefore formed at the
> physical address "dir_buf + pos - extra_offset", yet the loop bounds and
> the ext4_check_dir_entry() length argument are all expressed in the larger
> extra_size.  This mismatch lets two reachable sites dereference a dirent
> before confirming its physical offset is inside the allocation.
> 
> In the main loop, ctx->pos is attacker-controlled via lseek().  The entry
> is validated with extra_size instead of inline_size, so
> ext4_check_dir_entry() accepts rec_len/name_len running up to extra_offset
> bytes past the allocation, and dereferences de before its (too-large)
> length check fires.
> 
> In the cookie-rescan loop, entered when i_version changed since the last
> readdir(2) (or after an lseek resets the cookie), the walk restarts from
> the beginning with i bounded by extra_size.  As i approaches extra_size,
> "i - extra_offset" approaches inline_size, so the unconditional read of
> de->rec_len runs past the allocation before any validation.
> 
> Both are the same defect, logical extra_size space versus the physical
> inline_size buffer, so fix them together.  In each loop, reject a dirent
> whose minimum-size header would not fit within inline_size before forming
> and dereferencing de, and validate the main-loop entry against inline_size
> rather than extra_size.  extra_offset only inflates the logical position of
> "." and ".."; the dirent itself always lives in the kmalloc(inline_size)
> buffer.  Entries that legitimately fill the inline data still pass; only
> accesses that would fall outside the allocation are rejected.
> 
> Fixes: c4d8b0235aa9 ("ext4: fix readdir error in case inline_data+^dir_index.")
> Reported-by: Weiming Shi <bestswngs@gmail.com>
> Assisted-by: Claude:claude-opus-4-8
> Signed-off-by: Xiang Mei <xmei5@asu.edu>
> ---
>  fs/ext4/inline.c | 16 +++++++++++++++-
>  1 file changed, 15 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c
> index 8045e4ff270c..0ec85cfcc859 100644
> --- a/fs/ext4/inline.c
> +++ b/fs/ext4/inline.c
> @@ -1454,6 +1454,9 @@ int ext4_read_inline_dir(struct file *file,
>  			/* for other entry, the real offset in
>  			 * the buf has to be tuned accordingly.
>  			 */
> +			if (i - extra_offset + ext4_dir_rec_len(1, NULL) >
> +			    inline_size)
> +				break;
>  			de = (struct ext4_dir_entry_2 *)
>  				(dir_buf + i - extra_offset);
>  			/* It's too expensive to do a full
> @@ -1488,10 +1491,21 @@ int ext4_read_inline_dir(struct file *file,
>  			continue;
>  		}
>  
> +		/*
> +		 * ctx->pos can be set to an arbitrary value via lseek(), and
> +		 * the rescan above may also advance it.  Make sure the dirent
> +		 * header lies within the inline_size payload before
> +		 * dereferencing it: extra_offset only inflates the logical
> +		 * position of "." and "..", the dirent itself always lives in
> +		 * the kmalloc(inline_size) buffer.
> +		 */
> +		if (ctx->pos - extra_offset + ext4_dir_rec_len(1, NULL) >
> +		    inline_size)
> +			goto out;
>  		de = (struct ext4_dir_entry_2 *)
>  			(dir_buf + ctx->pos - extra_offset);
>  		if (ext4_check_dir_entry(inode, file, de, iloc.bh, dir_buf,
> -					 extra_size, ctx->pos))
> +					 inline_size, ctx->pos))
>  			goto out;
>  		if (le32_to_cpu(de->inode)) {
>  			if (!dir_emit(ctx, de->name, de->name_len,
> -- 
> 2.43.0
>

Thanks for your attention to this bug. We have a PoC that can trigger 
the KASAN crash. Please DM if you need it to reproduce the issue.

Xiang

^ permalink raw reply

* [PATCH ext4] ext4: fix out-of-bounds read in ext4_read_inline_dir()
From: Xiang Mei @ 2026-06-08 21:07 UTC (permalink / raw)
  To: linux-ext4
  Cc: Theodore Ts'o, Andreas Dilger, Baokun Li, Jan Kara,
	Ojaswin Mujoo, Ritesh Harjani, Zhang Yi, Weiming Shi, Xiang Mei

ext4_read_inline_dir() reads de->rec_len / de->name past the end of its
inline buffer for a crafted or corrupted inline directory, triggering a
slab-out-of-bounds read during getdents64():

  BUG: KASAN: slab-out-of-bounds in filldir64 (fs/readdir.c:371)
  Read of size 8 at addr ffff88800fd3da3c by task exploit/146
   ...
   kasan_report (mm/kasan/report.c:595)
   filldir64 (fs/readdir.c:371)
   iterate_dir (fs/readdir.c:110)
   ...

The payload is copied into a buffer of exactly inline_size bytes:

	dir_buf = kmalloc(inline_size, GFP_NOFS);

but iteration runs in an inflated logical position space, extra_offset
bytes larger than the buffer (extra_size = extra_offset + inline_size),
so the synthetic "." and ".." entries land at the offsets they would have
in a block-based directory.  A real dirent is therefore formed at the
physical address "dir_buf + pos - extra_offset", yet the loop bounds and
the ext4_check_dir_entry() length argument are all expressed in the larger
extra_size.  This mismatch lets two reachable sites dereference a dirent
before confirming its physical offset is inside the allocation.

In the main loop, ctx->pos is attacker-controlled via lseek().  The entry
is validated with extra_size instead of inline_size, so
ext4_check_dir_entry() accepts rec_len/name_len running up to extra_offset
bytes past the allocation, and dereferences de before its (too-large)
length check fires.

In the cookie-rescan loop, entered when i_version changed since the last
readdir(2) (or after an lseek resets the cookie), the walk restarts from
the beginning with i bounded by extra_size.  As i approaches extra_size,
"i - extra_offset" approaches inline_size, so the unconditional read of
de->rec_len runs past the allocation before any validation.

Both are the same defect, logical extra_size space versus the physical
inline_size buffer, so fix them together.  In each loop, reject a dirent
whose minimum-size header would not fit within inline_size before forming
and dereferencing de, and validate the main-loop entry against inline_size
rather than extra_size.  extra_offset only inflates the logical position of
"." and ".."; the dirent itself always lives in the kmalloc(inline_size)
buffer.  Entries that legitimately fill the inline data still pass; only
accesses that would fall outside the allocation are rejected.

Fixes: c4d8b0235aa9 ("ext4: fix readdir error in case inline_data+^dir_index.")
Reported-by: Weiming Shi <bestswngs@gmail.com>
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Xiang Mei <xmei5@asu.edu>
---
 fs/ext4/inline.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c
index 8045e4ff270c..0ec85cfcc859 100644
--- a/fs/ext4/inline.c
+++ b/fs/ext4/inline.c
@@ -1454,6 +1454,9 @@ int ext4_read_inline_dir(struct file *file,
 			/* for other entry, the real offset in
 			 * the buf has to be tuned accordingly.
 			 */
+			if (i - extra_offset + ext4_dir_rec_len(1, NULL) >
+			    inline_size)
+				break;
 			de = (struct ext4_dir_entry_2 *)
 				(dir_buf + i - extra_offset);
 			/* It's too expensive to do a full
@@ -1488,10 +1491,21 @@ int ext4_read_inline_dir(struct file *file,
 			continue;
 		}
 
+		/*
+		 * ctx->pos can be set to an arbitrary value via lseek(), and
+		 * the rescan above may also advance it.  Make sure the dirent
+		 * header lies within the inline_size payload before
+		 * dereferencing it: extra_offset only inflates the logical
+		 * position of "." and "..", the dirent itself always lives in
+		 * the kmalloc(inline_size) buffer.
+		 */
+		if (ctx->pos - extra_offset + ext4_dir_rec_len(1, NULL) >
+		    inline_size)
+			goto out;
 		de = (struct ext4_dir_entry_2 *)
 			(dir_buf + ctx->pos - extra_offset);
 		if (ext4_check_dir_entry(inode, file, de, iloc.bh, dir_buf,
-					 extra_size, ctx->pos))
+					 inline_size, ctx->pos))
 			goto out;
 		if (le32_to_cpu(de->inode)) {
 			if (!dir_emit(ctx, de->name, de->name_len,
-- 
2.43.0


^ permalink raw reply related

* Re: [PATCH v2 3/3] ext4: reject mount if inodes per group is not a multiple of inodes per block
From: Andreas Dilger @ 2026-06-08 18:27 UTC (permalink / raw)
  To: Baokun Li
  Cc: linux-ext4, tytso, jack, yi.zhang, ojaswin, ritesh.list, Sashiko
In-Reply-To: <20260608111150.827117-4-libaokun@linux.alibaba.com>

On Jun 8, 2026, at 05:11, Baokun Li <libaokun@linux.alibaba.com> wrote:
> 
> If s_inodes_per_group is not a multiple of s_inodes_per_block, the
> division that computes s_itb_per_group truncates, reserving fewer blocks
> for the inode table than needed.
> 
> On a crafted filesystem image, this allows __ext4_get_inode_loc() to
> compute a block offset beyond the inode table, reading unrelated data as
> an inode structure.
> 
> Add the missing divisibility check alongside the existing validation in
> ext4_block_group_meta_init().
> 
> Reported-by: Sashiko <sashiko-bot@kernel.org>
> Closes: https://sashiko.dev/#/patchset/20260608061112.392391-1-libaokun%40linux.alibaba.com
> Signed-off-by: Baokun Li <libaokun@linux.alibaba.com>

Looks good.  Is this also fixed/checked in e2fsprogs?  There is really no reason *not* to use all the space in the last itable block for inodes.

Reviewed-by: Andreas Dilger <adilger@dilger.ca <mailto:adilger@dilger.ca>>

> ---
> fs/ext4/super.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 3ddcb4a8d4db..5ec9e1ef00c0 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -5306,7 +5306,8 @@ static int ext4_block_group_meta_init(struct super_block *sb,
>  	}
>  	if (sbi->s_inodes_per_group < sbi->s_inodes_per_block ||
>  	    sbi->s_inodes_per_group > sb->s_blocksize * 8 ||
> -	    sbi->s_inodes_per_group & 7) {
> +	    sbi->s_inodes_per_group & 7 ||
> +	    sbi->s_inodes_per_group % sbi->s_inodes_per_block) {
>  		ext4_msg(sb, KERN_ERR, "invalid inodes per group: %lu",
>  			 sbi->s_inodes_per_group);
>  		return -EINVAL;
> -- 
> 2.43.7
> 


Cheers, Andreas






^ permalink raw reply

* Re: [PATCH v2 2/3] ext4: reduce max cluster size to match documented 256MB limit
From: Andreas Dilger @ 2026-06-08 18:24 UTC (permalink / raw)
  To: Baokun Li
  Cc: linux-ext4, tytso, jack, yi.zhang, ojaswin, ritesh.list, Sashiko
In-Reply-To: <20260608111150.827117-3-libaokun@linux.alibaba.com>

On Jun 8, 2026, at 05:11, Baokun Li <libaokun@linux.alibaba.com> wrote:
> 
> The mke2fs man page documents:
> 
>  Valid cluster-size values are from 2048 to 256M bytes per cluster.
> 
> but EXT4_MAX_CLUSTER_LOG_SIZE was set to 30 (1GB), allowing crafted
> filesystem images to specify cluster sizes up to 1GB.
> 
> On 32-bit systems with bigalloc enabled, the consistency check in
> ext4_handle_clustersize():
> 
>  s_blocks_per_group == s_clusters_per_group * (clustersize / blocksize)
> 
> can overflow when the cluster ratio is large enough. Since
> s_blocks_per_group is not range-checked in the bigalloc path, the
> wrapped product can pass the consistency check, leading to inconsistent
> group geometry and potential out-of-bounds block allocation.
> 
> Reduce EXT4_MAX_CLUSTER_LOG_SIZE to 28 to match the documented 256MB
> limit. With this cap, the maximum product is:
> 
>  (blocksize * 8) * (256M / blocksize) = 2^31
> 
> which fits safely in a 32-bit unsigned long for all block sizes.
> 
> Reported-by: Sashiko <sashiko-bot@kernel.org>
> Closes: https://sashiko.dev/#/patchset/20260608061112.392391-1-libaokun%40linux.alibaba.com
> Signed-off-by: Baokun Li <libaokun@linux.alibaba.com>

It should be pretty safe.  I doubt anyone could be using > 1GiB clusters for real-world
usage at this point, given the nature of the bug and the amount of space used for metadata
(e.g. 1GiB per directory, etc).  At that granularity they can use LVM for space management.

Reviewed-by: Andreas Dilger <adilger@dilger.ca <mailto:adilger@dilger.ca>>

> ---
> fs/ext4/ext4.h | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 94283a991e5c..11e41a864db8 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -334,7 +334,7 @@ struct ext4_io_submit {
>  #define EXT4_MAX_BLOCK_SIZE 65536
>  #define EXT4_MIN_BLOCK_LOG_SIZE 10
>  #define EXT4_MAX_BLOCK_LOG_SIZE 16
> -#define EXT4_MAX_CLUSTER_LOG_SIZE 30
> +#define EXT4_MAX_CLUSTER_LOG_SIZE 28
>  #ifdef __KERNEL__
>  # define EXT4_BLOCK_SIZE(s) ((s)->s_blocksize)
>  #else
> -- 
> 2.43.7
> 


Cheers, Andreas






^ permalink raw reply

* Re: [PATCH] ext4: validate donor file superblock early in EXT4_IOC_MOVE_EXT
From: Andreas Dilger @ 2026-06-08 18:20 UTC (permalink / raw)
  To: Yun Zhou
  Cc: tytso, libaokun, jack, ojaswin, ritesh.list, yi.zhang, dmonakhov,
	linux-ext4, linux-kernel
In-Reply-To: <20260608152521.1292656-1-yun.zhou@windriver.com>

On Jun 8, 2026, at 09:25, Yun Zhou <yun.zhou@windriver.com> wrote:
> 
> Reject the EXT4_IOC_MOVE_EXT ioctl early if the donor file does not
> belong to the same superblock as the original file.  Currently, this
> validation is performed inside ext4_move_extents() by
> mext_check_validity(), but only after lock_two_nondirectories() has
> already acquired the inode locks.  When the donor fd refers to a file
> on a different filesystem (e.g., overlayfs), this late validation
> creates a circular lock dependency:
> 
>  CPU0 (overlayfs write)            CPU1 (ext4 ioctl)
>  ----                              ----
>  inode_lock(ovl_inode)
>                                    mnt_want_write_file(filp)
>                                      sb_start_write(ext4_sb)   [sb_writers]
>    backing_file_write_iter()
>      vfs_iter_write(real_file)
>        file_start_write(real_file)
>          sb_start_write(ext4_sb)   [blocked by freeze]
>                                    lock_two_nondirectories()
>                                      inode_lock(ovl_inode)     [blocked]
> 
> With a concurrent freeze operation holding sb_writers write side, this
> forms a deadlock cycle: CPU0 waits for freeze to complete, freeze waits
> for CPU1's sb_writers reader to exit, CPU1 waits for CPU0's inode lock.
> 
> Since EXT4_IOC_MOVE_EXT exchanges physical extents between two files,
> it fundamentally requires both files to reside on the same ext4
> filesystem.  Moving the superblock check before any lock acquisition
> is both semantically correct and eliminates the circular dependency
> by ensuring that cross-filesystem donor fds are rejected before
> sb_writers or inode locks are taken.
> 
> Fixes: fcf6b1b729bc ("ext4: refactor ext4_move_extents code base")
> Reported-by: syzbot+ad6118a7584b607c67f2@syzkaller.appspotmail.com
> Closes: https://syzkaller.appspot.com/bug?extid=ad6118a7584b607c67f2
> Signed-off-by: Yun Zhou <yun.zhou@windriver.com>

Thanks for the patch.  Looks good to me.

Reviewed-by: Andreas Dilger <adilger@dilger.ca <mailto:adilger@dilger.ca>>

> ---
> fs/ext4/ioctl.c | 3 +++
> 1 file changed, 3 insertions(+)
> 
> diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
> index 1d0c3d4bdf47..f7cd419a3218 100644
> --- a/fs/ext4/ioctl.c
> +++ b/fs/ext4/ioctl.c
> @@ -1650,6 +1650,9 @@ static long __ext4_ioctl(struct file *filp, unsigned int cmd,
>  	if (!(fd_file(donor)->f_mode & FMODE_WRITE))
>  		return -EBADF;
> 
> +	if (file_inode(filp)->i_sb != file_inode(fd_file(donor))->i_sb)
> +		return -EXDEV;
> +
>  	err = mnt_want_write_file(filp);
>  	if (err)
>  		return err;
> -- 
> 2.43.0
> 


Cheers, Andreas






^ permalink raw reply

* [PATCH] ext4: validate donor file superblock early in EXT4_IOC_MOVE_EXT
From: Yun Zhou @ 2026-06-08 15:25 UTC (permalink / raw)
  To: tytso, adilger.kernel, libaokun, jack, ojaswin, ritesh.list,
	yi.zhang, dmonakhov
  Cc: linux-ext4, linux-kernel, yun.zhou

Reject the EXT4_IOC_MOVE_EXT ioctl early if the donor file does not
belong to the same superblock as the original file.  Currently, this
validation is performed inside ext4_move_extents() by
mext_check_validity(), but only after lock_two_nondirectories() has
already acquired the inode locks.  When the donor fd refers to a file
on a different filesystem (e.g., overlayfs), this late validation
creates a circular lock dependency:

  CPU0 (overlayfs write)            CPU1 (ext4 ioctl)
  ----                              ----
  inode_lock(ovl_inode)
                                    mnt_want_write_file(filp)
                                      sb_start_write(ext4_sb)   [sb_writers]
    backing_file_write_iter()
      vfs_iter_write(real_file)
        file_start_write(real_file)
          sb_start_write(ext4_sb)   [blocked by freeze]
                                    lock_two_nondirectories()
                                      inode_lock(ovl_inode)     [blocked]

With a concurrent freeze operation holding sb_writers write side, this
forms a deadlock cycle: CPU0 waits for freeze to complete, freeze waits
for CPU1's sb_writers reader to exit, CPU1 waits for CPU0's inode lock.

Since EXT4_IOC_MOVE_EXT exchanges physical extents between two files,
it fundamentally requires both files to reside on the same ext4
filesystem.  Moving the superblock check before any lock acquisition
is both semantically correct and eliminates the circular dependency
by ensuring that cross-filesystem donor fds are rejected before
sb_writers or inode locks are taken.

Fixes: fcf6b1b729bc ("ext4: refactor ext4_move_extents code base")
Reported-by: syzbot+ad6118a7584b607c67f2@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=ad6118a7584b607c67f2
Signed-off-by: Yun Zhou <yun.zhou@windriver.com>
---
 fs/ext4/ioctl.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index 1d0c3d4bdf47..f7cd419a3218 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -1650,6 +1650,9 @@ static long __ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 		if (!(fd_file(donor)->f_mode & FMODE_WRITE))
 			return -EBADF;
 
+		if (file_inode(filp)->i_sb != file_inode(fd_file(donor))->i_sb)
+			return -EXDEV;
+
 		err = mnt_want_write_file(filp);
 		if (err)
 			return err;
-- 
2.43.0


^ permalink raw reply related

* Re: [PATCH v6 06/11] fstests: verify f_fsid for cloned filesystems
From: Anand Suveer Jain @ 2026-06-08 14:59 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: fstests, linux-btrfs, linux-ext4, linux-xfs, linux-f2fs-devel,
	zlang, hch
In-Reply-To: <20260529043955.GG6070@frogsfrogsfrogs>

On 29/5/26 12:39, Darrick J. Wong wrote:
> On Thu, May 28, 2026 at 12:05:37PM +0800, Anand Jain wrote:
>> Verify that the cloned filesystem provides an f_fsid that is persistent
>> across mount cycles, yet unique from the original filesystem's f_fsid.
> 
> Might want to add that last part to the test description itself, because
> otherwise I don't know what 'verify' means.
> 


Looks like there's going to be v7, I have updated the test description..

-------
+# Check that the cloned filesystem provides an f_fsid that is persistent
+# across mount cycles if the block device maj:min remains unchanged.
-------

>> Signed-off-by: Anand Jain <asj@kernel.org>
>> ---
>>  tests/generic/802     | 67 +++++++++++++++++++++++++++++++++++++++++++
>>  tests/generic/802.out |  7 +++++
>>  2 files changed, 74 insertions(+)
>>  create mode 100644 tests/generic/802
>>  create mode 100644 tests/generic/802.out
>>
>> diff --git a/tests/generic/802 b/tests/generic/802
>> new file mode 100644
>> index 000000000000..653e74e11b53
>> --- /dev/null
>> +++ b/tests/generic/802
>> @@ -0,0 +1,67 @@
>> +#! /bin/bash
>> +# SPDX-License-Identifier: GPL-2.0
>> +# Copyright (c) 2026 Anand Jain <asj@kernel.org>.  All Rights Reserved.
>> +#
>> +# FS QA Test 802
>> +# Verify f_fsid and s_uuid of cloned filesystems across mount cycle.
>> +
>> +. ./common/preamble
>> +
>> +_begin_fstest auto quick mount clone
>> +
>> +_require_test
>> +_require_block_device $TEST_DEV
>> +_require_loop
>> +
>> +[ "$FSTYP" = "btrfs" ] && _fixed_by_kernel_commit xxxxxxxxxxxx \
>> +	"btrfs: use on-disk uuid for s_uuid in temp_fsid mounts"
>> +[ "$FSTYP" = "btrfs" ] && _fixed_by_kernel_commit xxxxxxxxxxxx \
>> +	"btrfs: derive f_fsid from on-disk fsuuid and dev_t"
> 
> _fixed_by_fs_commit?
> 

Oh right! I completely forgot about the new helper _fixed_by_fs_commit.
Now fixed. Thanks.

-[ "$FSTYP" = "btrfs" ] && _fixed_by_kernel_commit xxxxxxxxxxxx \
+_fixed_by_fs_commit btrfs xxxxxxxxxxxx \
         "btrfs: use on-disk uuid for s_uuid in temp_fsid mounts"
-[ "$FSTYP" = "btrfs" ] && _fixed_by_kernel_commit xxxxxxxxxxxx \
+_fixed_by_fs_commit btrfs xxxxxxxxxxxx \
         "btrfs: derive f_fsid from on-disk fsuuid and dev_t"

>> +
>> +_cleanup()
>> +{
>> +	cd /
>> +	rm -r -f $tmp.*
>> +	umount $mnt1 $mnt2 2>/dev/null
>> +	_loop_image_destroy "${devs[@]}" 2> /dev/null
>> +}
>> +
>> +# Setup base loop device and its clone
>> +devs=()
>> +_loop_image_create_clone devs
>> +mkdir -p $TEST_DIR/$seq
>> +mnt1=$TEST_DIR/$seq/mnt1
>> +mnt2=$TEST_DIR/$seq/mnt2
>> +mkdir -p $mnt1
>> +mkdir -p $mnt2
>> +
>> +# Mount both filesystems simultaneously using mandatory clone mount options
>> +_mount $(_common_dev_mount_options) $(_clone_mount_option) ${devs[0]} $mnt1 || \
>> +						_fail "Failed to mount dev1"
>> +_mount $(_common_dev_mount_options) $(_clone_mount_option) ${devs[1]} $mnt2 || \
>> +						_fail "Failed to mount dev2"
>> +
>> +# Capture baseline filesystem IDs for comparison
>> +fsid_scratch=$(stat -f -c "%i" $mnt1)
>> +fsid_clone=$(stat -f -c "%i" $mnt2)
>> +
>> +echo "**** fsid initially ****"
>> +echo $fsid_scratch | sed -e "s/$fsid_scratch/FSID_SCRATCH/g"
>> +echo $fsid_clone | sed -e "s/$fsid_clone/FSID_CLONE/g"
> 
> Why echo only to sed?
> 

I tried to keep both the .out log and the script readable.
Since a bare echo FSID_SCRATCH and echo FSID_CLONE has no
clear reference to $fsid_scratch and $fsid_clone, I just
added a little extra code around them.

However, I'm completely fine with removing both viz,
expected output and its script (for the first mount part).
As I'm pretty sure there will be a v7 now, ;-)
I can make those changes.

Thanks, Anand

>> +
>> +# Verify that the fsids remain stable after a mount cycle, even when the
>> +# mount order is reversed.
>> +echo "**** fsid after mount cycle ****"
>> +_unmount $mnt1
>> +_unmount $mnt2
>> +_mount $(_common_dev_mount_options) $(_clone_mount_option) ${devs[1]} $mnt2 || \
>> +						_fail "Failed to mount dev2"
>> +_mount $(_common_dev_mount_options) $(_clone_mount_option) ${devs[0]} $mnt1 || \
>> +						_fail "Failed to mount dev1"
>> +
>> +# Compare post mount-cycle values against the baseline
>> +stat -f -c "%i" $mnt1 | sed -e "s/$fsid_scratch/FSID_SCRATCH/g"
>> +stat -f -c "%i" $mnt2 | sed -e "s/$fsid_clone/FSID_CLONE/g"
>> +
>> +status=0
>> +exit
>> diff --git a/tests/generic/802.out b/tests/generic/802.out
>> new file mode 100644
>> index 000000000000..d1e008f122bb
>> --- /dev/null
>> +++ b/tests/generic/802.out
>> @@ -0,0 +1,7 @@
>> +QA output created by 802
>> +**** fsid initially ****
>> +FSID_SCRATCH
>> +FSID_CLONE
>> +**** fsid after mount cycle ****
>> +FSID_SCRATCH
>> +FSID_CLONE
>> -- 
>> 2.43.0
>>
>>


^ permalink raw reply

* Re: [PATCH v6 05/11] fstests: verify fanotify isolation on cloned filesystems
From: Anand Suveer Jain @ 2026-06-08 14:45 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: fstests, linux-btrfs, linux-ext4, linux-xfs, linux-f2fs-devel,
	zlang, hch
In-Reply-To: <20260529043647.GF6070@frogsfrogsfrogs>

On 29/5/26 12:36, Darrick J. Wong wrote:
> On Thu, May 28, 2026 at 12:05:36PM +0800, Anand Jain wrote:
>> Verify that fanotify events are correctly routed to the appropriate
>> watcher when cloned filesystems are mounted.
>> Helps verify kernel's event notification distinguishes between devices
>> sharing the same FSID/UUID.
>>
>> Signed-off-by: Anand Jain <asj@kernel.org>
>> ---
>>  tests/generic/801     | 135 ++++++++++++++++++++++++++++++++++++++++++
>>  tests/generic/801.out |   7 +++
>>  2 files changed, 142 insertions(+)
>>  create mode 100644 tests/generic/801
>>  create mode 100644 tests/generic/801.out
>>
>> diff --git a/tests/generic/801 b/tests/generic/801
>> new file mode 100644
>> index 000000000000..3bfb87d41922
>> --- /dev/null
>> +++ b/tests/generic/801
>> @@ -0,0 +1,135 @@
>> +#! /bin/bash
>> +# SPDX-License-Identifier: GPL-2.0
>> +# Copyright (c) 2026 Anand Jain <asj@kernel.org>.  All Rights Reserved.
>> +#
>> +# FS QA Test 801
>> +# Verify fanotify FID functionality on cloned filesystems by setting up
>> +# watchers and making sure notifications are in the correct logs files.
>> +
>> +. ./common/preamble
>> +
>> +_begin_fstest auto quick mount clone
>> +
>> +_require_test
>> +_require_block_device $TEST_DEV
>> +_require_loop
>> +_require_command "$FSNOTIFYWAIT_PROG" fsnotifywait
>> +_require_unique_f_fsid
>> +
>> +_cleanup()
>> +{
>> +	cd /
>> +	[[ -n $pid1 ]] && { kill -TERM "$pid1" 2> /dev/null; wait $pid1; }
>> +	[[ -n $pid2 ]] && { kill -TERM "$pid2" 2> /dev/null; wait $pid2; }
>> +
>> +	if [ "$semanage_added" = "yes" ]; then
>> +		semanage permissive -d unconfined_t >/dev/null 2>&1 || true
>> +	fi
>> +
>> +	umount $mnt1 $mnt2 2>/dev/null
>> +	_loop_image_destroy "${devs[@]}" 2> /dev/null
>> +	rm -r -f $tmp.*
>> +}
>> +



>> +# Run fsnotifywait in unbuffered mode to watch filesystem-wide create events
>> +monitor_fanotify()
>> +{
>> +	local mmnt=$1
>> +	exec stdbuf -oL $FSNOTIFYWAIT_PROG -m -F -S -e create "$mmnt" 2>&1
> 
> I guess you need stdbuf to force fsnotifywait to run in linebuffered
> mode even if you pipe/redirect it somewhere?
> 

yeah, stdbuf helps get the output as and when created.

>> +}
>> +
>> +# Transform f_fsid into the hi.lo format used in fanotify FID logs
>> +fsid_to_fid_parts()
>> +{
>> +	local fsid=$1
>> +	# Pad to 16 hex chars (64-bit), then split into two 32-bit halves
>> +	local padded=$(printf '%016x' "0x${fsid}")
>> +	local hi=$(printf '%x' "0x${padded:0:8}")   # strips leading zeros
>> +	local lo=$(printf '%x' "0x${padded:8:8}")   # strips leading zeros
>> +	echo "${hi}.${lo}"
>> +}
>> +
>> +# Create base loop device and its clone
>> +devs=()
>> +_loop_image_create_clone devs
>> +mkdir -p $TEST_DIR/$seq
>> +mnt1=$TEST_DIR/$seq/mnt1
>> +mnt2=$TEST_DIR/$seq/mnt2
>> +mkdir -p $mnt1
>> +mkdir -p $mnt2
>> +
>> +# Mount both base and clone filesystems using required clone mount options
>> +_mount $(_common_dev_mount_options) $(_clone_mount_option) ${devs[0]} $mnt1 || \
>> +						_fail "Failed to mount dev1"
>> +_mount $(_common_dev_mount_options) $(_clone_mount_option) ${devs[1]} $mnt2 || \
>> +						_fail "Failed to mount dev2"
>> +
>> +# Fetch filesystem IDs to verify the kernel can differentiate between them
>> +fsid1=$(stat -f -c "%i" $mnt1)
>> +fsid2=$(stat -f -c "%i" $mnt2)
>> +
>> +log1=$tmp.fanotify1
>> +log2=$tmp.fanotify2
>> +
>> +pid1=""
>> +pid2=""
>> +echo "Setup FID fanotify watchers on both mnt1 and mnt2"
>> +
>> +# Permit unconfined_t domains when SELinux is enforcing to prevent fanotify
>> +# blockages
>> +semanage_added="no"
>> +if [ "$(getenforce 2>/dev/null)" = "Enforcing" ]; then
>> +    if ! semanage permissive -l | grep -q "unconfined_t"; then
>> +        semanage permissive -a unconfined_t >/dev/null 2>&1 && semanage_added="yes"
>> +    fi
>> +fi
> 
> Is there a cleaner way to manage setting up and automatically undoing
> this step?
> 
> There might not be, since iirc the suggestion to register cleanup
> functions in a cleanups=() array and call them all in reverse order
> didn't go anywhere.
> 

If there are multiple use cases, we could wrap it up in a helper,
similar to _scratch_dev_pool_{get|put}, if it helps.

Thanks, Anand


>> +
>> +# Start asynchronous fanotify monitors
>> +( monitor_fanotify "$mnt1" > "$log1" ) &
>> +pid1=$!
>> +( monitor_fanotify "$mnt2" > "$log2" ) &
>> +pid2=$!
>> +sleep 2
>> +
>> +echo "Trigger file creation on mnt1"
>> +touch $mnt1/file_on_mnt1
>> +sync
>> +sleep 1
>> +
>> +echo "Trigger file creation on mnt2"
>> +touch $mnt2/file_on_mnt2
>> +sync
>> +sleep 1
>> +
>> +echo "Verify fsid in the fanotify"
>> +kill $pid1 $pid2
>> +wait $pid1 $pid2 2>/dev/null
>> +pid1=""
>> +pid2=""
>> +
>> +e_fsid1=$(fsid_to_fid_parts "$fsid1")
>> +e_fsid2=$(fsid_to_fid_parts "$fsid2")
>> +
>> +# Dump debug details to the full log
>> +echo $fsid1 $e_fsid1 $fsid2 $e_fsid2 >> $seqres.full
>> +cat $log1 >> $seqres.full
>> +cat $log2 >> $seqres.full
>> +
>> +# Ensure monitor 1 only captured events belonging to mnt 1 and fsid 1
>> +if grep -qF "$e_fsid1" "$log1" && ! grep -qF "$e_fsid2" "$log1"; then
>> +	echo "SUCCESS: mnt1 events found"
>> +else
>> +	[ ! -s "$log1" ] && echo "  - mnt1 received no events."
>> +	grep -qF "$e_fsid2" "$log1" && echo "  - mnt1 received event from mnt2."
>> +fi
>> +
>> +# Ensure monitor 2 only captured events belonging to mnt 2 and fsid 2
>> +if grep -qF "$e_fsid2" "$log2" && ! grep -qF "$e_fsid1" "$log2"; then
>> +	echo "SUCCESS: mnt2 events found"
>> +else
>> +	[ ! -s "$log2" ] && echo "  - mnt2 received no events."
>> +	grep -qF "$e_fsid1" "$log2" && echo "  - mnt2 received event from mnt1."
>> +fi
>> +
>> +status=0
>> +exit
>> diff --git a/tests/generic/801.out b/tests/generic/801.out
>> new file mode 100644
>> index 000000000000..d7b318d9f27c
>> --- /dev/null
>> +++ b/tests/generic/801.out
>> @@ -0,0 +1,7 @@
>> +QA output created by 801
>> +Setup FID fanotify watchers on both mnt1 and mnt2
>> +Trigger file creation on mnt1
>> +Trigger file creation on mnt2
>> +Verify fsid in the fanotify
>> +SUCCESS: mnt1 events found
>> +SUCCESS: mnt2 events found
>> -- 
>> 2.43.0
>>
>>


^ permalink raw reply

* Re: [PATCH v6 04/11] fstests: add _require_unique_f_fsid() helper
From: Anand Suveer Jain @ 2026-06-08 14:43 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: fstests, linux-btrfs, linux-ext4, linux-xfs, linux-f2fs-devel,
	zlang, hch
In-Reply-To: <20260529043056.GE6070@frogsfrogsfrogs>

On 29/5/26 12:30, Darrick J. Wong wrote:
> On Thu, May 28, 2026 at 12:05:35PM +0800, Anand Jain wrote:
>> Add a helper to check if the target filesystem supports unique f_fsid
>> tracking across cloned or snapshot instances.
>>
>> Certain filesystems like XFS, Btrfs, and F2FS ensure unique f_fsid
>> identifiers per filesystem instance. However, Ext4 derives its f_fsid
>> directly from its superblock UUID, which leads to identical f_fsid
>> values on cloned images until the UUID is manually modified by userspace.
>>
>> Introduce _require_unique_f_fsid() to allow test cases requiring strict
>> f_fsid uniqueness to skip gracefully on unsupported filesystems.
>>
>> Signed-off-by: Anand Jain <asj@kernel.org>
>> ---
>>  common/rc | 21 +++++++++++++++++++++
>>  1 file changed, 21 insertions(+)
>>
>> diff --git a/common/rc b/common/rc
>> index 937f478963b4..5446552aed92 100644
>> --- a/common/rc
>> +++ b/common/rc
>> @@ -6314,6 +6314,27 @@ _require_fanotify_ioerrors()
>>  	_notrun "$FSTYP does not support fanotify ioerrors"
>>  }
>>  
>> +# Ext4 derives f_fsid from the superblock UUID, meaning clones share the
>> +# same f_fsid until their UUIDs diverge. Conversely, XFS, Btrfs,
>> +# and F2FS ensure f_fsid remains unique per filesystem instance (often by
>> +# deriving it from the UUID and underlying block device.)
>> +#
>> +# Across all filesystems, a UUID collision causes libblkid tools to return
>> +# non-deterministic device mappings. It is ultimately the responsibility
> 
> "device mappings", as in /dev/disk/by-id/$UUID ?
> 

Correct.. I'll make it specific.

>> +# of the userspace utility or use-case to enforce uniqueness when a clone
>> +# diverges. For details, see mailing list thread discussions titled:
>> +#      "ext4: derive f_fsid from block device to avoid collisions".
> 
> How about providing a direct lore link?
> 

Sure, that will be..

Link:
https://lore.kernel.org/linux-ext4/20260409131238.GC18443@macsyma-wired.lan/

instead of the title.

Thanks, Anand


> --D
> 




>> +_require_unique_f_fsid()
>> +{
>> +	# Skip the test if the filesystem does not enforce unique f_fsids
>> +	# natively. Checking this dynamically requires recreating a clone
>> +	# layout, so we use a static lookup based on FSTYP.
>> +	if [ "$FSTYP" == "ext4" ]; then
>> +		_notrun "Target filesystem ($FSTYP) does not guarantee unique f_fsid on clones."
>> +	fi
>> +}
>> +
>> +
>>  # Computes a percentage of the available space in a filesystem and
>>  # returns that quantity in MB. The percentage must not contain a percent
>>  # sign ("%").
>> -- 
>> 2.43.0
>>
>>


^ permalink raw reply

* Re: [PATCH v6 02/11] fstests: add _clone_mount_option() helper
From: Anand Suveer Jain @ 2026-06-08 14:41 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: fstests, linux-btrfs, linux-ext4, linux-xfs, linux-f2fs-devel,
	zlang, hch
In-Reply-To: <20260529042834.GC6070@frogsfrogsfrogs>

On 29/5/26 12:28, Darrick J. Wong wrote:
> On Thu, May 28, 2026 at 12:05:33PM +0800, Anand Jain wrote:
>> Adds _clone_mount_option() helper function to handle filesystem-specific
>> requirements for mounting cloned devices. Abstract the need for -o nouuid
>> on XFS.
>>
>> Signed-off-by: Anand Jain <asj@kernel.org>
>> ---
>>  common/rc | 17 +++++++++++++++++
>>  1 file changed, 17 insertions(+)
>>
>> diff --git a/common/rc b/common/rc
>> index d7e3e0bdfb1e..937f478963b4 100644
>> --- a/common/rc
>> +++ b/common/rc
>> @@ -414,6 +414,23 @@ _scratch_mount_options()
>>  					$SCRATCH_DEV $SCRATCH_MNT
>>  }
>>  
>> +# Return filesystem-specific mount options required for mounting clone/snapshot
>> +# devices.
>> +_clone_mount_option()
>> +{
>> +	local mount_opts=""
>> +
>> +	case "$FSTYP" in
>> +	xfs)
>> +		# Allow mounting a duplicate filesystem on the same host
>> +		mount_opts="-o nouuid"
>> +		;;
>> +	*)
>> +	esac
>> +
>> +	echo $mount_opts
> 
> I probably would've just echo'd straight from inside the case statement,

Nice. Let's see if there is v7, I will change this to as below.

> but this otherwise looks ok,
> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
> 
> --D
> 



diff --git a/common/rc b/common/rc
index 79be51e4da31..18d4f73cead9 100644
--- a/common/rc
+++ b/common/rc
@@ -418,17 +418,13 @@ _scratch_mount_options()
  # devices.
  _clone_mount_option()
  {
-       local mount_opts=""
-
         case "$FSTYP" in
         xfs)
                 # Allow mounting a duplicate filesystem on the same host
-               mount_opts="-o nouuid"
+               echo "-o nouuid"
                 ;;
         *)
         esac
-
-       echo $mount_opts
  }

  _supports_filetype()





>> +}
>> +
>>  _supports_filetype()
>>  {
>>  	local dir=$1
>> -- 
>> 2.43.0
>>
>>


^ permalink raw reply related

* Re: [PATCH v6 01/11] fstests: add _loop_image_create_clone() helper
From: Anand Suveer Jain @ 2026-06-08 14:39 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: fstests, linux-btrfs, linux-ext4, linux-xfs, linux-f2fs-devel,
	zlang, hch
In-Reply-To: <20260529042743.GB6070@frogsfrogsfrogs>

On 29/5/26 12:27, Darrick J. Wong wrote:
> On Thu, May 28, 2026 at 12:05:32PM +0800, Anand Jain wrote:
>> Introduce _loop_image_create_clone() and _loop_image_destroy() to mkfs an
>> image file and clone it to another image file, and attach a loop device to
>> them. And its destroy part.
>>
>> Signed-off-by: Anand Jain <asj@kernel.org>
>> ---
>>  common/rc | 63 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 63 insertions(+)
>>
>> diff --git a/common/rc b/common/rc
>> index 79189e7e6e94..d7e3e0bdfb1e 100644
>> --- a/common/rc
>> +++ b/common/rc
>> @@ -1520,6 +1520,69 @@ _scratch_resvblks()
>>  	esac
>>  }
>>  
>> +# Create a small loop image, run an optional tuning function ($2) on it,
>> +# clone it, and attach both to loop devices, returned in ($1).
>> +# Args:
>> +#   $1: Nameref to return the array of allocated loop devices [base, clone].
>> +#   $2: Optional callback function to tune the base filesystem before cloning.
>> +_loop_image_create_clone()
>> +{
>> +	local -n _ret=$1
> 
> That switch   ^^ is very clever.  I always wondered how one did indirect
> variables in bash.
> 
>> +	local pre_clone_tune_func="$2"
>> +	local img_file=$TEST_DIR/${seq}.img
>> +	local img_file_clone=$TEST_DIR/${seq}_clone.img
>> +	local size=$(_small_fs_size_mb 128) # Smallest possible
>> +	local loop_devs
>> +
>> +	# Since we copy the block device image, we keep its size small.
>> +	_require_fs_space $TEST_DIR $((size * 1024))
>> +
>> +	_create_file_sized $((size * 1024 * 1024)) $img_file ||
>> +				_fail "Failed: Create $img_file $size"
>> +
>> +	loop_devs=$(_create_loop_device $img_file)
>> +	_ret=($loop_devs)
> 
> Should this check that a loopdev actually got created?
> 

Hmm, in the function _create_loop_device(), we are
calling _fail if create fails, so no need to duplicate, right?

>> +	case $FSTYP in
>> +	xfs)
>> +		_mkfs_dev "-s size=4096" ${loop_devs[0]}
>> +		;;
>> +	btrfs)
>> +		_mkfs_dev ${loop_devs[0]}
>> +		;;
>> +	*)
>> +		_mkfs_dev ${loop_devs[0]}
>> +		;;
>> +	esac
>> +
>> +	# Only execute if the function argument is not empty
>> +	if [ -n "$pre_clone_tune_func" ]; then
>> +		$pre_clone_tune_func ${loop_devs[0]}
>> +	fi
>> +
>> +	sync ${loop_devs[0]}
>> +	cp $img_file $img_file_clone
>> +


>> +	loop_devs="$loop_devs $(_create_loop_device $img_file_clone)"
> 
> 	local lodev="$(_create_loop_device ...)"
> 
> 	test -z "$lodev" && _fail "second loopdev not created"
> 	_ret+=("$lodev")
> 
> ?

If the second `_create_loop_device()` happens to fail, it will
already have called `_fail`, so "second loopdev..." won't be
used at all.


Thanks, Anand



>> +
>> +	_ret=($loop_devs)
>> +}
>> +
>> +# Teardown loop devices and delete their underlying backing image files.
>> +# Accepts a list of loop device paths (e.g., /dev/loop0 /dev/loop1).
>> +_loop_image_destroy()
>> +{
>> +	for d in "$@"; do
>> +		# Retrieve the path of the backing file
>> +		local f=$(losetup --noheadings --output BACK-FILE $d)
>> +
>> +		# Detach the loop device from the backing file
>> +		_destroy_loop_device "$d"
>> +
>> +		# Clean up the backing disk image file
>> +		[ -n "$f" ] && rm -f "$f"
>> +	done
>> +}
>>  
>>  # Repair scratch filesystem.  Returns 0 if the FS is good to go (either no
>>  # errors found or errors were fixed) and nonzero otherwise; also spits out
>> -- 
>> 2.43.0
>>
>>


^ permalink raw reply

* Re: [PATCH v2 2/2] ext4: avoid tail write_begin walk for uptodate folios
From: Jan Kara @ 2026-06-08 14:29 UTC (permalink / raw)
  To: Jia Zhu
  Cc: Theodore Ts'o, Andreas Dilger, Matthew Wilcox, Alexander Viro,
	Christian Brauner, Jan Kara, Baokun Li, Ojaswin Mujoo,
	Ritesh Harjani, Zhang Yi, linux-ext4, linux-fsdevel, linux-kernel
In-Reply-To: <20260608120131.45146-3-zhujia.zj@bytedance.com>

On Mon 08-06-26 20:01:31, Jia Zhu wrote:
> Ext4 buffered writes into large folios also pay a full buffer_head
> walk in ext4_block_write_begin().  For a small overwrite of an existing
> cached folio, the folio is already uptodate and the write only needs to
> prepare the buffers through the written range.  Walking the suffix still
> makes the write_begin cost proportional to the folio size.
> 
> Before ext4 enabled large folios for regular files, the same loop was
> bounded by a single page of buffers.  That commit made the existing
> full-folio walk visible as a regression for cached small overwrites.
> 
> The suffix walk is needed for non-uptodate folios, where ext4 may have
> to submit reads for partial blocks, preserve new-buffer cleanup, and run
> error zeroing.  Keep those folios on the old full walk.
> 
> For already-uptodate folios, keep the walk starting at the first buffer
> rather than seeking directly to from.  This preserves the existing prefix
> buffer state handling.  Stop once block_start reaches the end of the
> write range, because the skipped suffix would only repeat the
> outside-range uptodate handling for buffers beyond @to.
> 
> On current master, the libMicro ext4 large-folio overwrite test shows
> the following full-series result.  Results are median usecs/call over 10
> runs, lower is better:
> 
> case        nofix     this series   improvement
> write_u1k   1.418     0.3405        76.0%
> write_u10k  1.887     0.4175        77.9%
> pwrite_u1k  1.6775    0.3390        79.8%
> pwrite_u10k 1.9035    0.4130        78.3%
> 
> Fixes: 7ac67301e82f0 ("ext4: enable large folio for regular file")
> Signed-off-by: Jia Zhu <zhujia.zj@bytedance.com>

Looks good, just one simplification suggestion:

> @@ -1193,13 +1194,14 @@ int ext4_block_write_begin(handle_t *handle, struct folio *folio,
>  		head = create_empty_buffers(folio, blocksize, 0);
>  	block = EXT4_PG_TO_LBLK(inode, folio->index);
>  
> -	for (bh = head, block_start = 0; bh != head || !block_start;
> +	for (bh = head, block_start = 0;
> +	     (bh != head || !block_start) &&
> +	     (!folio_uptodate || block_start < to);

You simplify this condition to:

  block_start < to || (!folio_uptodate && bh != head)

With this updated feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza


>  	    block++, block_start = block_end, bh = bh->b_this_page) {
>  		block_end = block_start + blocksize;
>  		if (block_end <= from || block_start >= to) {
> -			if (folio_test_uptodate(folio)) {
> +			if (folio_uptodate)
>  				set_buffer_uptodate(bh);
> -			}
>  			continue;
>  		}
>  		if (WARN_ON_ONCE(buffer_new(bh)))
> @@ -1220,7 +1222,7 @@ int ext4_block_write_begin(handle_t *handle, struct folio *folio,
>  				if (should_journal_data)
>  					do_journal_get_write_access(handle,
>  								    inode, bh);
> -				if (folio_test_uptodate(folio)) {
> +				if (folio_uptodate) {
>  					/*
>  					 * Unlike __block_write_begin() we leave
>  					 * dirtying of new uptodate buffers to
> @@ -1237,7 +1239,7 @@ int ext4_block_write_begin(handle_t *handle, struct folio *folio,
>  				continue;
>  			}
>  		}
> -		if (folio_test_uptodate(folio)) {
> +		if (folio_uptodate) {
>  			set_buffer_uptodate(bh);
>  			continue;
>  		}
> -- 
> 2.20.1
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* Re: [PATCH v2 1/2] fs/buffer: avoid tail commit walk for uptodate folios
From: Jan Kara @ 2026-06-08 13:06 UTC (permalink / raw)
  To: Jia Zhu
  Cc: Theodore Ts'o, Andreas Dilger, Matthew Wilcox, Alexander Viro,
	Christian Brauner, Jan Kara, Baokun Li, Ojaswin Mujoo,
	Ritesh Harjani, Zhang Yi, linux-ext4, linux-fsdevel, linux-kernel
In-Reply-To: <20260608120131.45146-2-zhujia.zj@bytedance.com>

On Mon 08-06-26 20:01:30, Jia Zhu wrote:
> block_commit_write() always walks every buffer_head attached to the
> folio.  That was cheap for order-0 folios, but large folios can contain
> hundreds of buffer_heads.  For a small buffered overwrite of an
> already-uptodate large folio, the commit work is therefore proportional
> to the folio size rather than the copied range.
> 
> This became visible with ext4 regular-file large folios, where cached
> small overwrites reach block_commit_write() through block_write_end().
> Before ext4 enabled large folios for regular files, this path was only
> hit with order-0 folios for normal ext4 buffered writes, so the full walk
> was bounded.  The ext4 large-folio commit is therefore the regression
> point for this generic helper cost.
> 
> The full walk is still needed when the folio is not uptodate, because
> block_commit_write() uses per-buffer uptodate state to decide whether
> the whole folio can be marked uptodate.  Keep those folios on the old
> full-buffer path.
> 
> For a folio that was already uptodate on entry, the commit no longer
> needs tail buffers for folio-uptodate discovery.  The copied range has
> already been processed once block_start reaches @to, so stop there and
> avoid the suffix walk.
> 
> Fixes: 7ac67301e82f0 ("ext4: enable large folio for regular file")
> Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> Signed-off-by: Jia Zhu <zhujia.zj@bytedance.com>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/buffer.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/fs/buffer.c b/fs/buffer.c
> index b0b3792b1496e..c8c41c799030d 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -2096,6 +2096,7 @@ void block_commit_write(struct folio *folio, size_t from, size_t to)
>  {
>  	size_t block_start, block_end;
>  	bool partial = false;
> +	bool uptodate = folio_test_uptodate(folio);
>  	unsigned blocksize;
>  	struct buffer_head *bh, *head;
>  
> @@ -2118,6 +2119,8 @@ void block_commit_write(struct folio *folio, size_t from, size_t to)
>  			clear_buffer_new(bh);
>  
>  		block_start = block_end;
> +		if (uptodate && block_start >= to)
> +			break;
>  		bh = bh->b_this_page;
>  	} while (bh != head);
>  
> -- 
> 2.20.1
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* Re: security bug reporting: e2fsprogs: Path Traversal and heap overflow
From: Theodore Tso @ 2026-06-08 12:54 UTC (permalink / raw)
  To: Feng Xue; +Cc: Andreas Dilger, linux-ext4@vger.kernel.org
In-Reply-To: <SY0P300MB0070B3B68E4E4FC0C78DE355901C2@SY0P300MB0070.AUSP300.PROD.OUTLOOK.COM>

On Mon, Jun 08, 2026 at 10:18:38AM +0000, Feng Xue wrote:
> Hi Andres,
> 
> Thanks for the quick turn around.
> Agree with you on the inode one.

Anreas, thanks for the the patch.  I had reviewed Feng's security
report and my assessment had aligned with yours.  I had it on my todo
list, along with a mental note to make sure that this particular
corrupted file system would be caught by the kernel's fs/ext4/super.c.

> For the debugfs, yes it is not what people would run on a daily
> basis. it turns interesting in scenarios like incident response and
> forensic, so yes it depends on the ROI and definitely your call to
> whether it should be fixed.

The debugfs concern (as well as another related one) has already been
reported on github[1][2].

[1] https://github.com/tytso/e2fsprogs/issues/272
[2] https://github.com/tytso/e2fsprogs/issues/273

It "should" be fixed, but I'm quite busy these days, and part of it is
many kernel developers needing to react to the large number of
LLM-reported security bugs[3][4][5].  And so we have to prioritize the
order in which we address security or other bug fix reports.

[3] https://lwn.net/Articles/1063303/
[4] https://www.tomshardware.com/software/linux/linus-torvalds-says-ai-bug-reports-have-made-the-linux-security-mailing-list-almost-entirely-unmanageable
[5] https://medium.com/@tridge60/rsync-and-outrage-d9849599e5a0

I do appreciate that your report included a suggested fix.  The thing
that would make it even *more* useful (and gain you an a win of a
patch landed in e2fsprogs or the Linux kernel :-) would be if the
report came with an formal patch submission.  If you send it using the
git send-email (and see the Linux kernel's instructions for how to
submit changes[6]), to any mailing list at linux-*@vger.kernel.org,
including linux-ext4@vger.kernel.org, the Linux kernel changes will
get automatically get reviewed by an LLM agent, Sashiko[7].

[6] https://www.kernel.org/doc/html/latest/process/submitting-patches.html
[7] https://lwn.net/Articles/1063303/

For bonus points, if your AI agent can also create a regression test
much like how Andreas's patch, that would be great.  And if we can
collaborate on a skills.md that can teach AI agent how to create
reports that are easier for Linux developers to consume, and perhaps
to even create patches that require much less work on Linux developers
(or which Linux developers can use themselves), that would be really
exciting.  Yeah, that could result in a certain amount of LLM fear and
yourage as with rsync[5], but we have that with the Linux kernel
already.  :-)

Cheers,

						- Ted

^ permalink raw reply

* Re: [PATCH v2 1/3] ext4: reject mount if clusters/inodes per group are not 8-aligned
From: Zhang Yi @ 2026-06-08 12:53 UTC (permalink / raw)
  To: Baokun Li, linux-ext4
  Cc: tytso, adilger.kernel, jack, ojaswin, ritesh.list, Sashiko
In-Reply-To: <20260608111150.827117-2-libaokun@linux.alibaba.com>

On 6/8/2026 7:11 PM, Baokun Li wrote:
> The block and inode bitmap checksums are computed over a whole number of
> bytes: ext4_inode_bitmap_csum_*() use EXT4_INODES_PER_GROUP(sb) >> 3 and
> ext4_block_bitmap_csum_*() use EXT4_CLUSTERS_PER_GROUP(sb) / 8 as the
> length passed to ext4_chksum().
> 
> If s_inodes_per_group or s_clusters_per_group is not a multiple of 8, the
> trailing fractional bits are excluded from the checksum.  Those bits are
> then unprotected, and any incremental csum update path that assumes a
> byte-aligned bitmap can compute a checksum inconsistent with the full
> recalculation, corrupting the on-disk bitmap checksum.
> 
> Reject such filesystems at mount time by adding the missing " & 7"
> alignment checks alongside the existing range validation.
> 
> Suggested-by: Theodore Ts'o <tytso@mit.edu>
> Link: https://patch.msgid.link/h3n7jlfhyna64dn5o76qxcspnhxdddcs6crpxftmy7gnl7b3sx@jenszfpcsnit
> Reported-by: Sashiko <sashiko-bot@kernel.org>
> Closes: https://sashiko.dev/#/patchset/20260508121539.4174601-1-libaokun%40linux.alibaba.com?part=10
> Signed-off-by: Baokun Li <libaokun@linux.alibaba.com>

Looks good to me. Thanks!

Reviewed-by: Zhang Yi <yi.zhang@huawei.com>

> ---
>  fs/ext4/super.c | 10 ++++++----
>  1 file changed, 6 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 6a77db4d3124..3ddcb4a8d4db 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -4472,8 +4472,9 @@ static int ext4_handle_clustersize(struct super_block *sb)
>  		sbi->s_cluster_bits = 0;
>  	}
>  	sbi->s_clusters_per_group = le32_to_cpu(es->s_clusters_per_group);
> -	if (sbi->s_clusters_per_group > sb->s_blocksize * 8) {
> -		ext4_msg(sb, KERN_ERR, "#clusters per group too big: %lu",
> +	if (sbi->s_clusters_per_group > sb->s_blocksize * 8 ||
> +	    sbi->s_clusters_per_group & 7) {
> +		ext4_msg(sb, KERN_ERR, "invalid #clusters per group: %lu",
>  			 sbi->s_clusters_per_group);
>  		return -EINVAL;
>  	}
> @@ -5304,8 +5305,9 @@ static int ext4_block_group_meta_init(struct super_block *sb, int silent)
>  		return -EINVAL;
>  	}
>  	if (sbi->s_inodes_per_group < sbi->s_inodes_per_block ||
> -	    sbi->s_inodes_per_group > sb->s_blocksize * 8) {
> -		ext4_msg(sb, KERN_ERR, "invalid inodes per group: %lu\n",
> +	    sbi->s_inodes_per_group > sb->s_blocksize * 8 ||
> +	    sbi->s_inodes_per_group & 7) {
> +		ext4_msg(sb, KERN_ERR, "invalid inodes per group: %lu",
>  			 sbi->s_inodes_per_group);
>  		return -EINVAL;
>  	}


^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox