Linux EXT4 FS development


* Re: [PATCH v6 04/11] fstests: add _require_unique_f_fsid() helper
From: Anand Suveer Jain @ 2026-06-08 14:43 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: fstests, linux-btrfs, linux-ext4, linux-xfs, linux-f2fs-devel,
	zlang, hch
In-Reply-To: <20260529043056.GE6070@frogsfrogsfrogs>

On 29/5/26 12:30, Darrick J. Wong wrote:
> On Thu, May 28, 2026 at 12:05:35PM +0800, Anand Jain wrote:
>> Add a helper to check if the target filesystem supports unique f_fsid
>> tracking across cloned or snapshot instances.
>>
>> Certain filesystems like XFS, Btrfs, and F2FS ensure unique f_fsid
>> identifiers per filesystem instance. However, Ext4 derives its f_fsid
>> directly from its superblock UUID, which leads to identical f_fsid
>> values on cloned images until the UUID is manually modified by userspace.
>>
>> Introduce _require_unique_f_fsid() to allow test cases requiring strict
>> f_fsid uniqueness to skip gracefully on unsupported filesystems.
>>
>> Signed-off-by: Anand Jain <asj@kernel.org>
>> ---
>>  common/rc | 21 +++++++++++++++++++++
>>  1 file changed, 21 insertions(+)
>>
>> diff --git a/common/rc b/common/rc
>> index 937f478963b4..5446552aed92 100644
>> --- a/common/rc
>> +++ b/common/rc
>> @@ -6314,6 +6314,27 @@ _require_fanotify_ioerrors()
>>  	_notrun "$FSTYP does not support fanotify ioerrors"
>>  }
>>  
>> +# Ext4 derives f_fsid from the superblock UUID, meaning clones share the
>> +# same f_fsid until their UUIDs diverge. Conversely, XFS, Btrfs,
>> +# and F2FS ensure f_fsid remains unique per filesystem instance (often by
>> +# deriving it from the UUID and underlying block device.)
>> +#
>> +# Across all filesystems, a UUID collision causes libblkid tools to return
>> +# non-deterministic device mappings. It is ultimately the responsibility
> 
> "device mappings", as in /dev/disk/by-id/$UUID ?
> 

Correct. I'll make it specific.

>> +# of the userspace utility or use-case to enforce uniqueness when a clone
>> +# diverges. For details, see mailing list thread discussions titled:
>> +#      "ext4: derive f_fsid from block device to avoid collisions".
> 
> How about providing a direct lore link?
> 

Sure, I'll replace the title with:

Link:
https://lore.kernel.org/linux-ext4/20260409131238.GC18443@macsyma-wired.lan/

Thanks, Anand


> --D
> 




>> +_require_unique_f_fsid()
>> +{
>> +	# Skip the test if the filesystem does not enforce unique f_fsids
>> +	# natively. Checking this dynamically requires recreating a clone
>> +	# layout, so we use a static lookup based on FSTYP.
>> +	if [ "$FSTYP" == "ext4" ]; then
>> +		_notrun "Target filesystem ($FSTYP) does not guarantee unique f_fsid on clones."
>> +	fi
>> +}
>> +
>> +
>>  # Computes a percentage of the available space in a filesystem and
>>  # returns that quantity in MB. The percentage must not contain a percent
>>  # sign ("%").
>> -- 
>> 2.43.0
>>
>>


^ permalink raw reply
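
To make the f_fsid comparison discussed above concrete, here is a minimal sketch, not part of the patch, of how two mounted filesystems could be compared from a shell. It assumes GNU stat, whose "%i" format in -f mode prints the filesystem ID in hex; the helper names get_f_fsid and f_fsid_differs are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helpers, not fstests code.

# Print the f_fsid of the filesystem containing path $1 (hex), using
# GNU stat's %i format in file-system (-f) mode.
get_f_fsid() {
	stat -f -c %i "$1"
}

# Succeed only if the two paths report different f_fsid values.
f_fsid_differs() {
	[ "$(get_f_fsid "$1")" != "$(get_f_fsid "$2")" ]
}

# Comparing a mount point with itself must always report "same".
if f_fsid_differs / /; then
	echo "unique f_fsid"
else
	echo "same f_fsid"
fi
```

On a filesystem that derives f_fsid from the superblock UUID alone, running f_fsid_differs against the base and clone mount points would be expected to fail until the clone's UUID is changed.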

* Re: [PATCH v6 02/11] fstests: add _clone_mount_option() helper
From: Anand Suveer Jain @ 2026-06-08 14:41 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: fstests, linux-btrfs, linux-ext4, linux-xfs, linux-f2fs-devel,
	zlang, hch
In-Reply-To: <20260529042834.GC6070@frogsfrogsfrogs>

On 29/5/26 12:28, Darrick J. Wong wrote:
> On Thu, May 28, 2026 at 12:05:33PM +0800, Anand Jain wrote:
>> Adds _clone_mount_option() helper function to handle filesystem-specific
>> requirements for mounting cloned devices. Abstract the need for -o nouuid
>> on XFS.
>>
>> Signed-off-by: Anand Jain <asj@kernel.org>
>> ---
>>  common/rc | 17 +++++++++++++++++
>>  1 file changed, 17 insertions(+)
>>
>> diff --git a/common/rc b/common/rc
>> index d7e3e0bdfb1e..937f478963b4 100644
>> --- a/common/rc
>> +++ b/common/rc
>> @@ -414,6 +414,23 @@ _scratch_mount_options()
>>  					$SCRATCH_DEV $SCRATCH_MNT
>>  }
>>  
>> +# Return filesystem-specific mount options required for mounting clone/snapshot
>> +# devices.
>> +_clone_mount_option()
>> +{
>> +	local mount_opts=""
>> +
>> +	case "$FSTYP" in
>> +	xfs)
>> +		# Allow mounting a duplicate filesystem on the same host
>> +		mount_opts="-o nouuid"
>> +		;;
>> +	*)
>> +	esac
>> +
>> +	echo $mount_opts
> 
> I probably would've just echo'd straight from inside the case statement,

Nice. If there is a v7, I will change this as shown in the diff below.

> but this otherwise looks ok,
> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
> 
> --D
> 



diff --git a/common/rc b/common/rc
index 79be51e4da31..18d4f73cead9 100644
--- a/common/rc
+++ b/common/rc
@@ -418,17 +418,13 @@ _scratch_mount_options()
  # devices.
  _clone_mount_option()
  {
-       local mount_opts=""
-
         case "$FSTYP" in
         xfs)
                 # Allow mounting a duplicate filesystem on the same host
-               mount_opts="-o nouuid"
+               echo "-o nouuid"
                 ;;
         *)
         esac
-
-       echo $mount_opts
  }

  _supports_filetype()





>> +}
>> +
>>  _supports_filetype()
>>  {
>>  	local dir=$1
>> -- 
>> 2.43.0
>>
>>


^ permalink raw reply related
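
The echo-from-the-case shape agreed on above can be exercised standalone with a small sketch. This is an illustration only: clone_mount_option_demo is a hypothetical variant that takes the filesystem type as an argument instead of reading $FSTYP, and /dev/loopX and /mnt/clone are placeholder names. Nothing is actually mounted; the command line is only printed.

```bash
#!/usr/bin/env bash
# Hypothetical variant of the helper: takes the filesystem type as $1
# instead of reading the global $FSTYP.
clone_mount_option_demo() {
	case "$1" in
	xfs)
		# Allow mounting a duplicate filesystem on the same host
		echo "-o nouuid"
		;;
	*)
		;;
	esac
}

for fs in xfs ext4 btrfs; do
	# The substitution is left unquoted on purpose so that
	# "-o nouuid" splits into two words, and an empty result
	# contributes no argument at all.
	set -- mount $(clone_mount_option_demo "$fs") /dev/loopX /mnt/clone
	echo "$fs: $*"
done
```

Because the helper prints nothing for other filesystems, callers can splice its output into a mount command line unconditionally.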

* Re: [PATCH v6 01/11] fstests: add _loop_image_create_clone() helper
From: Anand Suveer Jain @ 2026-06-08 14:39 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: fstests, linux-btrfs, linux-ext4, linux-xfs, linux-f2fs-devel,
	zlang, hch
In-Reply-To: <20260529042743.GB6070@frogsfrogsfrogs>

On 29/5/26 12:27, Darrick J. Wong wrote:
> On Thu, May 28, 2026 at 12:05:32PM +0800, Anand Jain wrote:
>> Introduce _loop_image_create_clone() and _loop_image_destroy() to mkfs an
>> image file and clone it to another image file, and attach a loop device to
>> them. And its destroy part.
>>
>> Signed-off-by: Anand Jain <asj@kernel.org>
>> ---
>>  common/rc | 63 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 63 insertions(+)
>>
>> diff --git a/common/rc b/common/rc
>> index 79189e7e6e94..d7e3e0bdfb1e 100644
>> --- a/common/rc
>> +++ b/common/rc
>> @@ -1520,6 +1520,69 @@ _scratch_resvblks()
>>  	esac
>>  }
>>  
>> +# Create a small loop image, run an optional tuning function ($2) on it,
>> +# clone it, and attach both to loop devices, returned in ($1).
>> +# Args:
>> +#   $1: Nameref to return the array of allocated loop devices [base, clone].
>> +#   $2: Optional callback function to tune the base filesystem before cloning.
>> +_loop_image_create_clone()
>> +{
>> +	local -n _ret=$1
> 
> That switch   ^^ is very clever.  I always wondered how one did indirect
> variables in bash.
> 
>> +	local pre_clone_tune_func="$2"
>> +	local img_file=$TEST_DIR/${seq}.img
>> +	local img_file_clone=$TEST_DIR/${seq}_clone.img
>> +	local size=$(_small_fs_size_mb 128) # Smallest possible
>> +	local loop_devs
>> +
>> +	# Since we copy the block device image, we keep its size small.
>> +	_require_fs_space $TEST_DIR $((size * 1024))
>> +
>> +	_create_file_sized $((size * 1024 * 1024)) $img_file ||
>> +				_fail "Failed: Create $img_file $size"
>> +
>> +	loop_devs=$(_create_loop_device $img_file)
>> +	_ret=($loop_devs)
> 
> Should this check that a loopdev actually got created?
> 

Hmm, _create_loop_device() already calls _fail if the creation
fails, so there is no need to duplicate the check, right?

>> +	case $FSTYP in
>> +	xfs)
>> +		_mkfs_dev "-s size=4096" ${loop_devs[0]}
>> +		;;
>> +	btrfs)
>> +		_mkfs_dev ${loop_devs[0]}
>> +		;;
>> +	*)
>> +		_mkfs_dev ${loop_devs[0]}
>> +		;;
>> +	esac
>> +
>> +	# Only execute if the function argument is not empty
>> +	if [ -n "$pre_clone_tune_func" ]; then
>> +		$pre_clone_tune_func ${loop_devs[0]}
>> +	fi
>> +
>> +	sync ${loop_devs[0]}
>> +	cp $img_file $img_file_clone
>> +


>> +	loop_devs="$loop_devs $(_create_loop_device $img_file_clone)"
> 
> 	local lodev="$(_create_loop_device ...)"
> 
> 	test -z "$lodev" && _fail "second loopdev not created"
> 	_ret+=("$lodev")
> 
> ?

If the second `_create_loop_device()` call fails, it will
already have called `_fail`, so the "second loopdev..." message
would never be used.


Thanks, Anand



>> +
>> +	_ret=($loop_devs)
>> +}
>> +
>> +# Teardown loop devices and delete their underlying backing image files.
>> +# Accepts a list of loop device paths (e.g., /dev/loop0 /dev/loop1).
>> +_loop_image_destroy()
>> +{
>> +	for d in "$@"; do
>> +		# Retrieve the path of the backing file
>> +		local f=$(losetup --noheadings --output BACK-FILE $d)
>> +
>> +		# Detach the loop device from the backing file
>> +		_destroy_loop_device "$d"
>> +
>> +		# Clean up the backing disk image file
>> +		[ -n "$f" ] && rm -f "$f"
>> +	done
>> +}
>>  
>>  # Repair scratch filesystem.  Returns 0 if the FS is good to go (either no
>>  # errors found or errors were fixed) and nonzero otherwise; also spits out
>> -- 
>> 2.43.0
>>
>>


^ permalink raw reply
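
The `local -n` nameref that Darrick points out can be tried in isolation. The following is a minimal sketch under stated assumptions: it needs bash 4.3 or later, and fake_create_dev and make_pair are hypothetical stand-ins that do not touch real loop devices or fstests helpers.

```bash
#!/usr/bin/env bash
# Hypothetical stand-in for a helper that prints a device path on
# success and returns nonzero on failure.
fake_create_dev() {
	[ -n "$1" ] || return 1
	echo "/dev/fake-$1"
}

# Return two values through a nameref: $1 is the NAME of the caller's
# array, $2 and $3 are labels for the two fake devices.
make_pair() {
	local -n _ret=$1	# _ret now aliases the caller's variable
	local first second

	# Declaring and assigning separately keeps the exit status of
	# the command substitution visible to "||".
	first=$(fake_create_dev "$2") || return 1
	second=$(fake_create_dev "$3") || return 1

	_ret=("$first" "$second")
}

devs=()
make_pair devs base clone
echo "${devs[0]} ${devs[1]}"
```

The `|| return 1` after each command substitution is a deliberate choice in this sketch. In bash, an `exit` executed inside `$(...)` ends only the subshell, so the caller learns about the failure through the exit status of the assignment.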

* Re: [PATCH v2 2/2] ext4: avoid tail write_begin walk for uptodate folios
From: Jan Kara @ 2026-06-08 14:29 UTC (permalink / raw)
  To: Jia Zhu
  Cc: Theodore Ts'o, Andreas Dilger, Matthew Wilcox, Alexander Viro,
	Christian Brauner, Jan Kara, Baokun Li, Ojaswin Mujoo,
	Ritesh Harjani, Zhang Yi, linux-ext4, linux-fsdevel, linux-kernel
In-Reply-To: <20260608120131.45146-3-zhujia.zj@bytedance.com>

On Mon 08-06-26 20:01:31, Jia Zhu wrote:
> Ext4 buffered writes into large folios also pay a full buffer_head
> walk in ext4_block_write_begin().  For a small overwrite of an existing
> cached folio, the folio is already uptodate and the write only needs to
> prepare the buffers through the written range.  Walking the suffix still
> makes the write_begin cost proportional to the folio size.
> 
> Before ext4 enabled large folios for regular files, the same loop was
> bounded by a single page of buffers.  That commit made the existing
> full-folio walk visible as a regression for cached small overwrites.
> 
> The suffix walk is needed for non-uptodate folios, where ext4 may have
> to submit reads for partial blocks, preserve new-buffer cleanup, and run
> error zeroing.  Keep those folios on the old full walk.
> 
> For already-uptodate folios, keep the walk starting at the first buffer
> rather than seeking directly to @from.  This preserves the existing prefix
> buffer state handling.  Stop once block_start reaches the end of the
> write range, because the skipped suffix would only repeat the
> outside-range uptodate handling for buffers beyond @to.
> 
> On current master, the libMicro ext4 large-folio overwrite test shows
> the following full-series result.  Results are median usecs/call over 10
> runs, lower is better:
> 
> case        nofix     this series   improvement
> write_u1k   1.418     0.3405        76.0%
> write_u10k  1.887     0.4175        77.9%
> pwrite_u1k  1.6775    0.3390        79.8%
> pwrite_u10k 1.9035    0.4130        78.3%
> 
> Fixes: 7ac67301e82f0 ("ext4: enable large folio for regular file")
> Signed-off-by: Jia Zhu <zhujia.zj@bytedance.com>

Looks good, just one simplification suggestion:

> @@ -1193,13 +1194,14 @@ int ext4_block_write_begin(handle_t *handle, struct folio *folio,
>  		head = create_empty_buffers(folio, blocksize, 0);
>  	block = EXT4_PG_TO_LBLK(inode, folio->index);
>  
> -	for (bh = head, block_start = 0; bh != head || !block_start;
> +	for (bh = head, block_start = 0;
> +	     (bh != head || !block_start) &&
> +	     (!folio_uptodate || block_start < to);

You can simplify this condition to:

  block_start < to || (!folio_uptodate && bh != head)

With this updated, feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza


>  	    block++, block_start = block_end, bh = bh->b_this_page) {
>  		block_end = block_start + blocksize;
>  		if (block_end <= from || block_start >= to) {
> -			if (folio_test_uptodate(folio)) {
> +			if (folio_uptodate)
>  				set_buffer_uptodate(bh);
> -			}
>  			continue;
>  		}
>  		if (WARN_ON_ONCE(buffer_new(bh)))
> @@ -1220,7 +1222,7 @@ int ext4_block_write_begin(handle_t *handle, struct folio *folio,
>  				if (should_journal_data)
>  					do_journal_get_write_access(handle,
>  								    inode, bh);
> -				if (folio_test_uptodate(folio)) {
> +				if (folio_uptodate) {
>  					/*
>  					 * Unlike __block_write_begin() we leave
>  					 * dirtying of new uptodate buffers to
> @@ -1237,7 +1239,7 @@ int ext4_block_write_begin(handle_t *handle, struct folio *folio,
>  				continue;
>  			}
>  		}
> -		if (folio_test_uptodate(folio)) {
> +		if (folio_uptodate) {
>  			set_buffer_uptodate(bh);
>  			continue;
>  		}
> -- 
> 2.20.1
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply
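
One way to sanity-check the suggested simplification is to count loop iterations for both forms in a user-space model. The sketch below is only such a model, not kernel code: count_iters is a hypothetical function, the ring of buffer_heads is represented by an index where "idx % nr == 0" stands for "bh == head", and the comparison assumes 0 < to <= folio size.

```bash
#!/usr/bin/env bash
# count_iters <orig|simple> <nr_bufs> <blocksize> <to> <uptodate>
# Prints how many loop bodies would run under the chosen condition.
count_iters() {
	local variant=$1 nr=$2 bs=$3 to=$4 up=$5
	local idx=0 start=0 n=0 at_head cond

	while :; do
		at_head=$(( idx % nr == 0 ))	# models bh == head
		if [ "$variant" = orig ]; then
			# (bh != head || !block_start) &&
			# (!folio_uptodate || block_start < to)
			cond=$(( (!at_head || start == 0) && (!up || start < to) ))
		else
			# block_start < to || (!folio_uptodate && bh != head)
			cond=$(( start < to || (!up && !at_head) ))
		fi
		[ "$cond" -eq 1 ] || break
		n=$(( n + 1 ))
		idx=$(( idx + 1 ))		# bh = bh->b_this_page
		start=$(( start + bs ))		# block_start = block_end
	done
	echo "$n"
}

# Four 4096-byte buffers model a 16 KiB folio.
for up in 0 1; do
	for to in 1 512 4096 5000 16384; do
		a=$(count_iters orig 4 4096 "$to" "$up")
		b=$(count_iters simple 4 4096 "$to" "$up")
		echo "uptodate=$up to=$to orig=$a simplified=$b"
	done
done
```

In this model the two conditions run the same number of iterations for every combination tried. The simplified form relies on "to" being nonzero to enter the loop for a non-uptodate folio, which is why the model restricts itself to to > 0.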

* Re: [PATCH v2 1/2] fs/buffer: avoid tail commit walk for uptodate folios
From: Jan Kara @ 2026-06-08 13:06 UTC (permalink / raw)
  To: Jia Zhu
  Cc: Theodore Ts'o, Andreas Dilger, Matthew Wilcox, Alexander Viro,
	Christian Brauner, Jan Kara, Baokun Li, Ojaswin Mujoo,
	Ritesh Harjani, Zhang Yi, linux-ext4, linux-fsdevel, linux-kernel
In-Reply-To: <20260608120131.45146-2-zhujia.zj@bytedance.com>

On Mon 08-06-26 20:01:30, Jia Zhu wrote:
> block_commit_write() always walks every buffer_head attached to the
> folio.  That was cheap for order-0 folios, but large folios can contain
> hundreds of buffer_heads.  For a small buffered overwrite of an
> already-uptodate large folio, the commit work is therefore proportional
> to the folio size rather than the copied range.
> 
> This became visible with ext4 regular-file large folios, where cached
> small overwrites reach block_commit_write() through block_write_end().
> Before ext4 enabled large folios for regular files, this path was only
> hit with order-0 folios for normal ext4 buffered writes, so the full walk
> was bounded.  The ext4 large-folio commit is therefore the regression
> point for this generic helper cost.
> 
> The full walk is still needed when the folio is not uptodate, because
> block_commit_write() uses per-buffer uptodate state to decide whether
> the whole folio can be marked uptodate.  Keep those folios on the old
> full-buffer path.
> 
> For a folio that was already uptodate on entry, the commit no longer
> needs tail buffers for folio-uptodate discovery.  The copied range has
> already been processed once block_start reaches @to, so stop there and
> avoid the suffix walk.
> 
> Fixes: 7ac67301e82f0 ("ext4: enable large folio for regular file")
> Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> Signed-off-by: Jia Zhu <zhujia.zj@bytedance.com>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/buffer.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/fs/buffer.c b/fs/buffer.c
> index b0b3792b1496e..c8c41c799030d 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -2096,6 +2096,7 @@ void block_commit_write(struct folio *folio, size_t from, size_t to)
>  {
>  	size_t block_start, block_end;
>  	bool partial = false;
> +	bool uptodate = folio_test_uptodate(folio);
>  	unsigned blocksize;
>  	struct buffer_head *bh, *head;
>  
> @@ -2118,6 +2119,8 @@ void block_commit_write(struct folio *folio, size_t from, size_t to)
>  			clear_buffer_new(bh);
>  
>  		block_start = block_end;
> +		if (uptodate && block_start >= to)
> +			break;
>  		bh = bh->b_this_page;
>  	} while (bh != head);
>  
> -- 
> 2.20.1
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* Re: security bug reporting: e2fsprogs: Path Traversal and heap overflow
From: Theodore Tso @ 2026-06-08 12:54 UTC (permalink / raw)
  To: Feng Xue; +Cc: Andreas Dilger, linux-ext4@vger.kernel.org
In-Reply-To: <SY0P300MB0070B3B68E4E4FC0C78DE355901C2@SY0P300MB0070.AUSP300.PROD.OUTLOOK.COM>

On Mon, Jun 08, 2026 at 10:18:38AM +0000, Feng Xue wrote:
> Hi Andreas,
> 
> Thanks for the quick turn around.
> Agree with you on the inode one.

Andreas, thanks for the patch.  I had reviewed Feng's security
report and my assessment had aligned with yours.  I had it on my todo
list, along with a mental note to make sure that this particular
corrupted file system would be caught by the kernel's fs/ext4/super.c.

> As for debugfs, yes, it is not something people run on a daily
> basis. It becomes interesting in scenarios like incident response and
> forensics, so it depends on the ROI, and it is definitely your call
> whether it should be fixed.

The debugfs concern (as well as another related one) has already been
reported on github[1][2].

[1] https://github.com/tytso/e2fsprogs/issues/272
[2] https://github.com/tytso/e2fsprogs/issues/273

It "should" be fixed, but I'm quite busy these days, and part of it is
many kernel developers needing to react to the large number of
LLM-reported security bugs[3][4][5].  And so we have to prioritize the
order in which we address security or other bug fix reports.

[3] https://lwn.net/Articles/1063303/
[4] https://www.tomshardware.com/software/linux/linus-torvalds-says-ai-bug-reports-have-made-the-linux-security-mailing-list-almost-entirely-unmanageable
[5] https://medium.com/@tridge60/rsync-and-outrage-d9849599e5a0

I do appreciate that your report included a suggested fix.  The thing
that would make it even *more* useful (and gain you the win of a
patch landed in e2fsprogs or the Linux kernel :-) would be if the
report came with a formal patch submission.  If you send it using
git send-email (see the Linux kernel's instructions for how to
submit changes[6]) to any mailing list at linux-*@vger.kernel.org,
including linux-ext4@vger.kernel.org, the Linux kernel changes will
automatically get reviewed by an LLM agent, Sashiko[7].

[6] https://www.kernel.org/doc/html/latest/process/submitting-patches.html
[7] https://lwn.net/Articles/1063303/

For bonus points, if your AI agent can also create a regression test
much like the one in Andreas's patch, that would be great.  And if we can
collaborate on a skills.md that can teach an AI agent how to create
reports that are easier for Linux developers to consume, and perhaps
even to create patches that require much less work from Linux developers
(or which Linux developers can use themselves), that would be really
exciting.  Yeah, that could result in a certain amount of LLM fear and
outrage as with rsync[5], but we have that with the Linux kernel
already.  :-)

Cheers,

						- Ted

^ permalink raw reply

* Re: [PATCH v2 1/3] ext4: reject mount if clusters/inodes per group are not 8-aligned
From: Zhang Yi @ 2026-06-08 12:53 UTC (permalink / raw)
  To: Baokun Li, linux-ext4
  Cc: tytso, adilger.kernel, jack, ojaswin, ritesh.list, Sashiko
In-Reply-To: <20260608111150.827117-2-libaokun@linux.alibaba.com>

On 6/8/2026 7:11 PM, Baokun Li wrote:
> The block and inode bitmap checksums are computed over a whole number of
> bytes: ext4_inode_bitmap_csum_*() use EXT4_INODES_PER_GROUP(sb) >> 3 and
> ext4_block_bitmap_csum_*() use EXT4_CLUSTERS_PER_GROUP(sb) / 8 as the
> length passed to ext4_chksum().
> 
> If s_inodes_per_group or s_clusters_per_group is not a multiple of 8, the
> trailing fractional bits are excluded from the checksum.  Those bits are
> then unprotected, and any incremental csum update path that assumes a
> byte-aligned bitmap can compute a checksum inconsistent with the full
> recalculation, corrupting the on-disk bitmap checksum.
> 
> Reject such filesystems at mount time by adding the missing " & 7"
> alignment checks alongside the existing range validation.
> 
> Suggested-by: Theodore Ts'o <tytso@mit.edu>
> Link: https://patch.msgid.link/h3n7jlfhyna64dn5o76qxcspnhxdddcs6crpxftmy7gnl7b3sx@jenszfpcsnit
> Reported-by: Sashiko <sashiko-bot@kernel.org>
> Closes: https://sashiko.dev/#/patchset/20260508121539.4174601-1-libaokun%40linux.alibaba.com?part=10
> Signed-off-by: Baokun Li <libaokun@linux.alibaba.com>

Looks good to me. Thanks!

Reviewed-by: Zhang Yi <yi.zhang@huawei.com>

> ---
>  fs/ext4/super.c | 10 ++++++----
>  1 file changed, 6 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 6a77db4d3124..3ddcb4a8d4db 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -4472,8 +4472,9 @@ static int ext4_handle_clustersize(struct super_block *sb)
>  		sbi->s_cluster_bits = 0;
>  	}
>  	sbi->s_clusters_per_group = le32_to_cpu(es->s_clusters_per_group);
> -	if (sbi->s_clusters_per_group > sb->s_blocksize * 8) {
> -		ext4_msg(sb, KERN_ERR, "#clusters per group too big: %lu",
> +	if (sbi->s_clusters_per_group > sb->s_blocksize * 8 ||
> +	    sbi->s_clusters_per_group & 7) {
> +		ext4_msg(sb, KERN_ERR, "invalid #clusters per group: %lu",
>  			 sbi->s_clusters_per_group);
>  		return -EINVAL;
>  	}
> @@ -5304,8 +5305,9 @@ static int ext4_block_group_meta_init(struct super_block *sb, int silent)
>  		return -EINVAL;
>  	}
>  	if (sbi->s_inodes_per_group < sbi->s_inodes_per_block ||
> -	    sbi->s_inodes_per_group > sb->s_blocksize * 8) {
> -		ext4_msg(sb, KERN_ERR, "invalid inodes per group: %lu\n",
> +	    sbi->s_inodes_per_group > sb->s_blocksize * 8 ||
> +	    sbi->s_inodes_per_group & 7) {
> +		ext4_msg(sb, KERN_ERR, "invalid inodes per group: %lu",
>  			 sbi->s_inodes_per_group);
>  		return -EINVAL;
>  	}


^ permalink raw reply
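
The arithmetic behind the new " & 7" check can be illustrated with a short sketch. It models only the upper bound and the alignment test from the patch (not the inodes-per-block lower bound), and per_group_ok is a hypothetical name.

```bash
#!/usr/bin/env bash
# per_group_ok <count> <blocksize>
# Succeeds if <count> bits fit in one bitmap block and the count is a
# multiple of 8, so the checksum length (count >> 3 bytes) covers
# every bit.
per_group_ok() {
	local count=$1 blocksize=$2

	[ "$count" -le $(( blocksize * 8 )) ] && [ $(( count & 7 )) -eq 0 ]
}

for n in 32768 32764 8192 8195; do
	if per_group_ok "$n" 4096; then
		verdict=accept
	else
		verdict=reject
	fi
	echo "$n: csum covers $(( n >> 3 )) bytes, $(( n & 7 )) bits left over -> $verdict"
done
```

A count such as 32764 passes the range check but leaves 4 bits outside the 4095 checksummed bytes, which is the case the patch now rejects at mount time.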

* Re: [PATCH v2 0/4] show orphan file inode detail info
From: yebin @ 2026-06-08 11:44 UTC (permalink / raw)
  To: Jan Kara; +Cc: tytso, adilger.kernel, linux-ext4
In-Reply-To: <lweohtnu6fdvhi2fnbo7obo7vtvehsr4m6u4zwgeqx2abhgx3w@dv5e6zapdwlq>



On 2026/4/16 1:59, Jan Kara wrote:
> Hello!
>
> On Wed 15-04-26 18:55:01, Ye Bin wrote:
>> From: Ye Bin <yebin10@huawei.com>
>>
>> Diffs v2 vs v1:
>> (1) Fix sashiko review issues:
>> https://sashiko.dev/#/patchset/20260403082507.1882703-1-yebin%40huaweicloud.com
>> (2) Change "orphan_list" file mode from 0444 to 0400;
>> (3) The display format of the "orphan_list" file is modified according
>>      to Andreas' suggestions.
>> Fault injection tests have been conducted to address the issues raised
>> in the sashiko review. There is no UAF issue in the ext4_seq_orphan_release()
>> function. The reason for this has already been explained in the code comments.
>> In addition to the fault injection tests, we also performed a stress test by
>> observing the /proc/fs/ext4/XX/orphan_list and the concurrent processes of
>> adding and removing orphan nodes, and no issues were found so far.
>>
>>
>> In actual production environments, the issue of inconsistency between
>> df and du is frequently encountered. In many cases, the cause of the
>> problem can be identified through the use of lsof. However, when
>> overlayfs is combined with project quota configuration, the issue becomes
>> more complex and troublesome to diagnose. First, to determine the project
>> ID, one needs to obtain orphaned nodes using `fsck.ext4 -fn /dev/xx`, and
>> then retrieve file information through `debugfs`. However, the file names
>> cannot always be obtained, and it is often unclear which files they are.
>> To identify which files these are, one would need to use crash for online
>> debugging or use kprobe to gather information incrementally. However, some
>> customers in production environments do not agree to upload any tools, and
>> online debugging might impact the business. There are also scenarios where
>> files are opened in kernel mode, which do not generate file descriptors(fds),
>> making it impossible to identify which files were deleted but still have
>> references through lsof. This patchset adds a procfs interface to query
>> information about orphaned nodes, which can assist in the analysis and
>> localization of such issues.
>
> Ye, did you read my comments to the v1 of the patchset [1]? I didn't see
> any reply from you. I don't think this is a good way how to expose orphan
> information for a filesystem for reasons I've outlined in that email.
>

Hi Jan

I thought about how to prevent resource exhaustion caused by a single
application opening too many FDs. My idea is that the ioctl should return
only one FD at a time, and each subsequent call should continue from the
orphan inode after the previous one. Each time a new FD is obtained, the
previous FD would be closed. I expect that once all orphan inodes have been
traversed from the beginning, every FD will already be closed, so user space
will not need to close them manually. Is this approach feasible? Or do you
have a better suggestion?

> 								Honza
>
> [1] https://lore.kernel.org/all/n4sccudy5avcgnkdhc27rzofzoprxqtwhfrlmsh3yyrj6vbc6d@mmu73gmtawkq/
>
>>
>> Ye Bin (4):
>>    ext4: register 'orphan_list' procfs
>>    ext4: skip cursor node in ext4_orphan_del()
>>    ext4: show inode orphan list detail information
>>    ext4: show orphan file inode detail info
>>
>>   fs/ext4/ext4.h   |   1 +
>>   fs/ext4/orphan.c | 326 ++++++++++++++++++++++++++++++++++++++++++++++-
>>   fs/ext4/sysfs.c  |   2 +
>>   3 files changed, 328 insertions(+), 1 deletion(-)
>>
>> --
>> 2.34.1
>>


^ permalink raw reply
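
For context on the user-space side of the df/du mismatch described in the cover letter, here is a minimal sketch of how deleted-but-still-open files can be found without lsof. It assumes a Linux /proc, where the kernel appends " (deleted)" to the symlink target of an unlinked open file; list_deleted_open is a hypothetical helper. As the cover letter notes, files held open only inside the kernel have no descriptor and would not show up this way.

```bash
#!/usr/bin/env bash
# List deleted-but-still-open files held by process $1, based on the
# " (deleted)" suffix of /proc/<pid>/fd symlink targets.
list_deleted_open() {
	local pid=$1 fd target

	for fd in /proc/"$pid"/fd/*; do
		target=$(readlink "$fd" 2>/dev/null) || continue
		case "$target" in
		*" (deleted)")
			echo "$target"
			;;
		esac
	done
}

# Demo: hold a descriptor on a temporary file, unlink it, look it up,
# then release the descriptor.
tmp=$(mktemp)
exec 9<"$tmp"
rm -f "$tmp"
list_deleted_open $$
exec 9<&-
```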

* [PATCH v2 2/2] ext4: avoid tail write_begin walk for uptodate folios
From: Jia Zhu @ 2026-06-08 12:01 UTC (permalink / raw)
  To: Theodore Ts'o, Andreas Dilger
  Cc: Matthew Wilcox, Alexander Viro, Christian Brauner, Jan Kara,
	Baokun Li, Ojaswin Mujoo, Ritesh Harjani, Zhang Yi, linux-ext4,
	linux-fsdevel, linux-kernel, Jia Zhu
In-Reply-To: <20260608120131.45146-1-zhujia.zj@bytedance.com>

Ext4 buffered writes into large folios also pay a full buffer_head
walk in ext4_block_write_begin().  For a small overwrite of an existing
cached folio, the folio is already uptodate and the write only needs to
prepare the buffers through the written range.  Walking the suffix still
makes the write_begin cost proportional to the folio size.

Before ext4 enabled large folios for regular files, the same loop was
bounded by a single page of buffers.  That commit made the existing
full-folio walk visible as a regression for cached small overwrites.

The suffix walk is needed for non-uptodate folios, where ext4 may have
to submit reads for partial blocks, preserve new-buffer cleanup, and run
error zeroing.  Keep those folios on the old full walk.

For already-uptodate folios, keep the walk starting at the first buffer
rather than seeking directly to @from.  This preserves the existing prefix
buffer state handling.  Stop once block_start reaches the end of the
write range, because the skipped suffix would only repeat the
outside-range uptodate handling for buffers beyond @to.

On current master, the libMicro ext4 large-folio overwrite test shows
the following full-series result.  Results are median usecs/call over 10
runs, lower is better:

case        nofix     this series   improvement
write_u1k   1.418     0.3405        76.0%
write_u10k  1.887     0.4175        77.9%
pwrite_u1k  1.6775    0.3390        79.8%
pwrite_u10k 1.9035    0.4130        78.3%

Fixes: 7ac67301e82f0 ("ext4: enable large folio for regular file")
Signed-off-by: Jia Zhu <zhujia.zj@bytedance.com>
---
 fs/ext4/inode.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index c2c2d6ac7f3d1..d63785fcd2acb 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1182,6 +1182,7 @@ int ext4_block_write_begin(handle_t *handle, struct folio *folio,
 	int nr_wait = 0;
 	int i;
 	bool should_journal_data = ext4_should_journal_data(inode);
+	bool folio_uptodate = folio_test_uptodate(folio);

 	BUG_ON(!folio_test_locked(folio));
 	BUG_ON(to > folio_size(folio));
@@ -1193,13 +1194,14 @@ int ext4_block_write_begin(handle_t *handle, struct folio *folio,
 		head = create_empty_buffers(folio, blocksize, 0);
 	block = EXT4_PG_TO_LBLK(inode, folio->index);

-	for (bh = head, block_start = 0; bh != head || !block_start;
+	for (bh = head, block_start = 0;
+	     (bh != head || !block_start) &&
+	     (!folio_uptodate || block_start < to);
 	    block++, block_start = block_end, bh = bh->b_this_page) {
 		block_end = block_start + blocksize;
 		if (block_end <= from || block_start >= to) {
-			if (folio_test_uptodate(folio)) {
+			if (folio_uptodate)
 				set_buffer_uptodate(bh);
-			}
 			continue;
 		}
 		if (WARN_ON_ONCE(buffer_new(bh)))
@@ -1220,7 +1222,7 @@ int ext4_block_write_begin(handle_t *handle, struct folio *folio,
 				if (should_journal_data)
 					do_journal_get_write_access(handle,
 								    inode, bh);
-				if (folio_test_uptodate(folio)) {
+				if (folio_uptodate) {
 					/*
 					 * Unlike __block_write_begin() we leave
 					 * dirtying of new uptodate buffers to
@@ -1237,7 +1239,7 @@ int ext4_block_write_begin(handle_t *handle, struct folio *folio,
 				continue;
 			}
 		}
-		if (folio_test_uptodate(folio)) {
+		if (folio_uptodate) {
 			set_buffer_uptodate(bh);
 			continue;
 		}
-- 
2.20.1

^ permalink raw reply related

* [PATCH v2 1/2] fs/buffer: avoid tail commit walk for uptodate folios
From: Jia Zhu @ 2026-06-08 12:01 UTC (permalink / raw)
  To: Theodore Ts'o, Andreas Dilger
  Cc: Matthew Wilcox, Alexander Viro, Christian Brauner, Jan Kara,
	Baokun Li, Ojaswin Mujoo, Ritesh Harjani, Zhang Yi, linux-ext4,
	linux-fsdevel, linux-kernel, Jia Zhu
In-Reply-To: <20260608120131.45146-1-zhujia.zj@bytedance.com>

block_commit_write() always walks every buffer_head attached to the
folio.  That was cheap for order-0 folios, but large folios can contain
hundreds of buffer_heads.  For a small buffered overwrite of an
already-uptodate large folio, the commit work is therefore proportional
to the folio size rather than the copied range.

This became visible with ext4 regular-file large folios, where cached
small overwrites reach block_commit_write() through block_write_end().
Before ext4 enabled large folios for regular files, this path was only
hit with order-0 folios for normal ext4 buffered writes, so the full walk
was bounded.  The ext4 large-folio commit is therefore the regression
point for this generic helper cost.

The full walk is still needed when the folio is not uptodate, because
block_commit_write() uses per-buffer uptodate state to decide whether
the whole folio can be marked uptodate.  Keep those folios on the old
full-buffer path.

For a folio that was already uptodate on entry, the commit no longer
needs tail buffers for folio-uptodate discovery.  The copied range has
already been processed once block_start reaches @to, so stop there and
avoid the suffix walk.

Fixes: 7ac67301e82f0 ("ext4: enable large folio for regular file")
Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Jia Zhu <zhujia.zj@bytedance.com>
---
 fs/buffer.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/buffer.c b/fs/buffer.c
index b0b3792b1496e..c8c41c799030d 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2096,6 +2096,7 @@ void block_commit_write(struct folio *folio, size_t from, size_t to)
 {
 	size_t block_start, block_end;
 	bool partial = false;
+	bool uptodate = folio_test_uptodate(folio);
 	unsigned blocksize;
 	struct buffer_head *bh, *head;

@@ -2118,6 +2119,8 @@ void block_commit_write(struct folio *folio, size_t from, size_t to)
 			clear_buffer_new(bh);

 		block_start = block_end;
+		if (uptodate && block_start >= to)
+			break;
 		bh = bh->b_this_page;
 	} while (bh != head);

-- 
2.20.1

^ permalink raw reply related
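
The effect of the early break can be estimated with a user-space model of the do/while walk. This is a sketch only: buffers_visited is a hypothetical function, the buffer ring is modelled by a counter, and the sizes (512 buffers of 4096 bytes, a 1024-byte write at offset 0) are illustrative values rather than measurements.

```bash
#!/usr/bin/env bash
# buffers_visited <nr_bufs> <blocksize> <to> <uptodate_on_entry>
# Prints how many buffers the modelled commit walk touches.
buffers_visited() {
	local nr=$1 bs=$2 to=$3 up=$4
	local idx=0 start=0 visited=0

	while :; do
		visited=$(( visited + 1 ))
		start=$(( start + bs ))		# block_start = block_end
		if [ "$up" -eq 1 ] && [ "$start" -ge "$to" ]; then
			break			# the new early exit
		fi
		idx=$(( idx + 1 ))		# bh = bh->b_this_page
		[ "$idx" -lt "$nr" ] || break	# while (bh != head)
	done
	echo "$visited"
}

echo "not uptodate on entry: $(buffers_visited 512 4096 1024 0)"
echo "uptodate on entry:     $(buffers_visited 512 4096 1024 1)"
```

In the model, a folio that is not uptodate on entry still gets the full walk, while an already-uptodate folio stops as soon as block_start reaches the end of the copied range.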

* [PATCH v2 0/2] ext4: avoid tail walks for cached large-folio writes
From: Jia Zhu @ 2026-06-08 12:01 UTC (permalink / raw)
  To: Theodore Ts'o, Andreas Dilger
  Cc: Matthew Wilcox, Alexander Viro, Christian Brauner, Jan Kara,
	Baokun Li, Ojaswin Mujoo, Ritesh Harjani, Zhang Yi, linux-ext4,
	linux-fsdevel, linux-kernel, Jia Zhu
In-Reply-To: <20260603134800.25155-1-zhujia.zj@bytedance.com>

Hi,

This series addresses a buffered-write regression we found during our
v6.12 -> v6.18 LTS upgrade testing on ext4.

The regression is in the remaining buffer_head path.  A small overwrite
of an already cached, uptodate large folio still walks every buffer_head
attached to the folio in both write_begin and write_end.  With order-0
folios this was bounded by the page size.  After ext4 enabled large
folios for regular files, the same loops became proportional to the
folio size.

I agree that converting ext4 buffered I/O to iomap is the right long-term
direction, and that would avoid this problem.  This series is meant as a
small fix for current and LTS kernels that still use the buffer_head path.

Patch 1 follows Willy's suggestion for block_commit_write(): if the folio
was already uptodate on entry, stop the commit walk once the copied range
has been processed.

Patch 2 applies the same conservative shape to ext4_block_write_begin().
It keeps walking from the first buffer, so prefix buffer state handling is
unchanged, and only skips the suffix for folios that were already
uptodate on entry.
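
As a rough illustration of why the tail break matters, here is a
standalone userspace model of the walk.  It is a sketch, not the kernel
code: blocks_visited() is a made-up helper, and "visited" only stands in
for the per-buffer_head work done by the real loops.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Toy model: count how many fixed-size blocks a commit-style walk visits
 * for a write covering [from, to) in a folio of folio_size bytes.  The
 * early exit has the same shape as the patch: it is taken only when the
 * folio was already uptodate on entry.
 */
static size_t blocks_visited(size_t folio_size, size_t blocksize,
			     size_t from, size_t to, bool uptodate)
{
	size_t block_start = 0, visited = 0;

	assert(blocksize != 0 && folio_size % blocksize == 0);
	assert(from <= to && to <= folio_size);

	while (block_start < folio_size) {
		size_t block_end = block_start + blocksize;

		visited++;	/* stands in for per-buffer_head work */
		block_start = block_end;
		/* Blocks at or past @to are not touched by this write. */
		if (uptodate && block_start >= to)
			break;
	}
	return visited;
}
```

With a 2 MiB folio and 4 KiB blocks, a 1 KiB overwrite at offset 0 visits
one block in this model when the folio is uptodate, and all 512 blocks
otherwise.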

The workload is from libMicro, which we use in kernel release testing:

  https://github.com/rzezeski/libMicro

The table below includes the v6.12 baseline from the same release
benchmark.  The v6.12 and v6.18 columns were run with THP=always.  The
last column is v6.18 with this series applied.  Results are usecs/call,
lower is better, and the improvement is relative to unpatched v6.18.

case           v6.12    v6.18   v6.18 + series   improvement
write_u1k      0.609    4.659       0.528           88.7%
write_u10k     1.408    4.869       0.809           83.4%
pwrite_u1k     0.609    4.659       0.538           88.5%
pwrite_u10k    1.399    4.889       0.819           83.2%
writev_u1k     2.238    5.277       1.179           77.7%
writev_u10k   11.057    8.029       4.219           47.5%

For the cases that regressed from v6.12 to v6.18 in this test, this
series brings the v6.18 numbers back below the v6.12 cost.
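
For reference, the improvement column can be reproduced with the
following sketch.  improvement_pct() is a hypothetical helper written
for this example; it assumes the column is the reduction in usecs/call
relative to unpatched v6.18, as stated above.

```c
#include <assert.h>

/* Percentage reduction in cost, relative to the "before" measurement. */
static double improvement_pct(double before, double after)
{
	assert(before > 0.0);
	return (before - after) / before * 100.0;
}
```

For write_u1k, improvement_pct(4.659, 0.528) comes out at about 88.7.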

Link: https://lore.kernel.org/all/20260603134800.25155-1-zhujia.zj@bytedance.com/

Changes since v1:
- replace the ext4 seek-to-@from optimization with a conservative tail
  break that preserves prefix buffer handling;
- add the block_commit_write() tail break suggested by Willy;
- add v6.12 and v6.18 benchmark results for the full series.

Jia Zhu (2):
  fs/buffer: avoid tail commit walk for uptodate folios
  ext4: avoid tail write_begin walk for uptodate folios

 fs/buffer.c     |  3 +++
 fs/ext4/inode.c | 12 +++++++-----
 2 files changed, 10 insertions(+), 5 deletions(-)

-- 
2.20.1

* Re: [PATCH] ext4: avoid full buffer walks for large folio partial writes
From: Jia Zhu @ 2026-06-08 11:56 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jia Zhu, Theodore Ts'o, Andreas Dilger, Alexander Viro,
	Christian Brauner, Jan Kara, Baokun Li, Ojaswin Mujoo,
	Ritesh Harjani, Zhang Yi, linux-ext4, linux-fsdevel, linux-kernel
In-Reply-To: <aiLcFP2drmHGjEL2@casper.infradead.org>

On Fri, Jun 05, 2026 at 03:24:20PM +0100, Matthew Wilcox wrote:
> > The reason I am still looking at this path is that the regression is
> > visible in our LTS upgrade testing from 6.12 to 6.18.  It was introduced
> > by the ext4 large-folio enablement in v6.16.  For example, in our
> > libMicro release benchmark with THP always enabled, usecs/call, lower is
> > better:
> > 
> > case        v6.12        v6.18        regression
> > write_u1k   0.609        4.659        +665.0%
> > write_u10k  1.408        4.869        +245.8%
> 
> Ouch ;-)  No wonder you want to address this.  Do you recover all the
> regression with this fix?

With the full v2 series applied to v6.18, the small overwrite cases look
like this.  Results are usecs/call, lower is better:

case           v6.12    v6.18   v6.18 + series
write_u1k      0.609    4.659       0.528
write_u10k     1.408    4.869       0.809
pwrite_u1k     0.609    4.659       0.538
pwrite_u10k    1.399    4.889       0.819
writev_u1k     2.238    5.277       1.179
writev_u10k   11.057    8.029       4.219

This matches the regression I was trying to address.

> > The iomap conversion is the long-term fix, but it does not help kernels
> > which still use the buffer-head buffered write path.  I would like to keep
> > this as a small regression fix for that path, and make it minimal enough
> > to be suitable for stable/LTS backport.
> 
> Is it that you're using some ext4 features that aren't supported by
> iomap yet?  Could you say which ones?  That might motivate someone to
> prioritise that support.

No, this benchmark is not using a specific ext4 feature that prevents
iomap.  It is just the default ext4 buffered write path on a regular
file.

I agree that iomap looks like the better long-term direction for ext4
buffered writes.  This small fix is mainly motivated by current/LTS
kernels that still have the buffer-head path (from v6.16 through current
mainline, until ext4 buffered writes are converted to iomap upstream),
where the large-folio enablement made this tail-walk cost visible.

* Re: [PATCH] ext4: reject mount if clusters/inodes per group are not 8-aligned
From: Baokun Li @ 2026-06-08 11:21 UTC (permalink / raw)
  To: sashiko-reviews; +Cc: tytso, Ext4 Developers List
In-Reply-To: <20260608063318.266E61F00893@smtp.kernel.org>


I've sent a V2 that includes two additional patches to address the issues
Sashiko reported:

https://patch.msgid.link/20260608111150.827117-1-libaokun@linux.alibaba.com


Thanks,
Baokun

On 2026/6/8 14:33, sashiko-bot@kernel.org wrote:
> Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
>
> Pre-existing issues:
> - [Critical] Missing alignment check allows integer truncation in s_itb_per_group, leading to out-of-bounds inode table access and potential local privilege escalation via forged inodes.
> - [High] Integer overflow in bigalloc cluster consistency check on 32-bit systems allows mounting filesystems with arbitrary cross-group overlaps.
> --
>
> commit 063bdba3eb2bd319656b0b2534e820244460e17a
> Author: Baokun Li <libaokun@linux.alibaba.com>
>
> ext4: reject mount if clusters/inodes per group are not 8-aligned
>
> The block and inode bitmap checksums are computed over a whole number of
> bytes. If s_inodes_per_group or s_clusters_per_group is not a multiple of
> 8, the trailing fractional bits are excluded from the checksum. Reject such
> filesystems at mount time by adding the missing bitwise alignment checks.
>
> Link: https://patch.msgid.link/h3n7jlfhyna64dn5o76qxcspnhxdddcs6crpxftmy7gnl7b3sx@jenszfpcsnit
>
>> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
>> index 245f67d10ded3..98f1cefbc6880 100644
>> --- a/fs/ext4/super.c
>> +++ b/fs/ext4/super.c
>> @@ -4475,8 +4475,9 @@ static int ext4_handle_clustersize(struct super_block *sb)
>>  		sbi->s_cluster_bits = 0;
>>  	}
>>  	sbi->s_clusters_per_group = le32_to_cpu(es->s_clusters_per_group);
>> -	if (sbi->s_clusters_per_group > sb->s_blocksize * 8) {
>> -		ext4_msg(sb, KERN_ERR, "#clusters per group too big: %lu",
>> +	if (sbi->s_clusters_per_group > sb->s_blocksize * 8 ||
>> +	    sbi->s_clusters_per_group & 7) {
>> +		ext4_msg(sb, KERN_ERR, "invalid #clusters per group: %lu",
>>  			 sbi->s_clusters_per_group);
>>  		return -EINVAL;
>>  	}
> [Severity: High]
> This isn't a bug introduced by this patch, but while we are validating
> s_clusters_per_group, could the bigalloc consistency check slightly below
> this in ext4_handle_clustersize() overflow on 32-bit systems?
>
> The check does:
>
>     if (sbi->s_blocks_per_group !=
>         (sbi->s_clusters_per_group * (clustersize / sb->s_blocksize))) {
>
> Since unsigned long is 32 bits on 32-bit systems, the multiplication can wrap
> around modulo 2^32. For example, if an image has 16392 clusters and a ratio
> of 262144, the product evaluates to 2^32 + 2097152, matching an
> s_blocks_per_group of 2097152.
>
> Because bigalloc bypasses the usual s_blocks_per_group size limits, the
> wrapped validation would succeed. The block allocator could then allocate
> blocks far outside the truncated group, overwriting metadata of subsequent
> groups. Should this multiplication be cast to a 64-bit type to prevent
> the overflow?
>
> [ ... ]
>
>> @@ -5308,7 +5309,8 @@ static int ext4_block_group_meta_init(struct super_block *sb, int silent)
>>  		return -EINVAL;
>>  	}
>>  	if (sbi->s_inodes_per_group < sbi->s_inodes_per_block ||
>> -	    sbi->s_inodes_per_group > sb->s_blocksize * 8) {
>> +	    sbi->s_inodes_per_group > sb->s_blocksize * 8 ||
>> +	    sbi->s_inodes_per_group & 7) {
>>  		ext4_msg(sb, KERN_ERR, "invalid inodes per group: %lu\n",
>>  			 sbi->s_inodes_per_group);
>>  		return -EINVAL;
> [Severity: Critical]
> This is a pre-existing issue, but does adding the & 7 check go far enough
> to prevent integer truncation here?
>
> A few lines below this check in ext4_block_group_meta_init(), the code does:
>
>     sbi->s_itb_per_group = sbi->s_inodes_per_group / sbi->s_inodes_per_block;
>
> Since s_inodes_per_group is now guaranteed to be a multiple of 8, but
> s_inodes_per_block can be 16, 32, etc., this division can still truncate.
>
> Could this truncation lead to out-of-bounds accesses? Looking at
> __ext4_get_inode_loc() in fs/ext4/inode.c, the block is calculated without
> checking against the truncated s_itb_per_group:
>
>     block += (inode_offset / inodes_per_block);
>
> This could allow reading or writing to the block following the truncated
> inode table, potentially allowing local privilege escalation via forged
> inodes if that block is allocated to a user file. Should we also ensure
> that s_inodes_per_group is a multiple of s_inodes_per_block here?
>


* [PATCH v2 3/3] ext4: reject mount if inodes per group is not a multiple of inodes per block
From: Baokun Li @ 2026-06-08 11:11 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, adilger.kernel, jack, yi.zhang, ojaswin, ritesh.list,
	Sashiko
In-Reply-To: <20260608111150.827117-1-libaokun@linux.alibaba.com>

If s_inodes_per_group is not a multiple of s_inodes_per_block, the
division that computes s_itb_per_group truncates, reserving fewer blocks
for the inode table than needed.

On a crafted filesystem image, this allows __ext4_get_inode_loc() to
compute a block offset beyond the inode table, reading unrelated data as
an inode structure.

Add the missing divisibility check alongside the existing validation in
ext4_block_group_meta_init().
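
A minimal userspace sketch of the truncation, assuming the inode table
size is computed by plain integer division as described above.  The
helper names are made up for this example and are not kernel functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Inode table blocks reserved per group, as integer division yields it. */
static uint32_t itb_per_group(uint32_t inodes_per_group,
			      uint32_t inodes_per_block)
{
	assert(inodes_per_block != 0);
	return inodes_per_group / inodes_per_block;
}

/* Block index, relative to the table start, holding the group's last inode. */
static uint32_t last_inode_block(uint32_t inodes_per_group,
				 uint32_t inodes_per_block)
{
	assert(inodes_per_group != 0 && inodes_per_block != 0);
	return (inodes_per_group - 1) / inodes_per_block;
}

/* True if the last inode of the group lands past the reserved table. */
static bool last_inode_outside_table(uint32_t inodes_per_group,
				     uint32_t inodes_per_block)
{
	return last_inode_block(inodes_per_group, inodes_per_block) >=
	       itb_per_group(inodes_per_group, inodes_per_block);
}
```

With 24 inodes per group and 16 inodes per block, the value passes the
"& 7" check but only one table block is reserved, while inode offset 23
maps to block index 1, one block past the table.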

Reported-by: Sashiko <sashiko-bot@kernel.org>
Closes: https://sashiko.dev/#/patchset/20260608061112.392391-1-libaokun%40linux.alibaba.com
Signed-off-by: Baokun Li <libaokun@linux.alibaba.com>
---
 fs/ext4/super.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 3ddcb4a8d4db..5ec9e1ef00c0 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -5306,7 +5306,8 @@ static int ext4_block_group_meta_init(struct super_block *sb, int silent)
 	}
 	if (sbi->s_inodes_per_group < sbi->s_inodes_per_block ||
 	    sbi->s_inodes_per_group > sb->s_blocksize * 8 ||
-	    sbi->s_inodes_per_group & 7) {
+	    sbi->s_inodes_per_group & 7 ||
+	    sbi->s_inodes_per_group % sbi->s_inodes_per_block) {
 		ext4_msg(sb, KERN_ERR, "invalid inodes per group: %lu",
 			 sbi->s_inodes_per_group);
 		return -EINVAL;
-- 
2.43.7


* [PATCH v2 2/3] ext4: reduce max cluster size to match documented 256MB limit
From: Baokun Li @ 2026-06-08 11:11 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, adilger.kernel, jack, yi.zhang, ojaswin, ritesh.list,
	Sashiko
In-Reply-To: <20260608111150.827117-1-libaokun@linux.alibaba.com>

The mke2fs man page documents:

  Valid cluster-size values are from 2048 to 256M bytes per cluster.

but EXT4_MAX_CLUSTER_LOG_SIZE was set to 30 (1GB), allowing crafted
filesystem images to specify cluster sizes up to 1GB.

On 32-bit systems with bigalloc enabled, the consistency check in
ext4_handle_clustersize():

  s_blocks_per_group == s_clusters_per_group * (clustersize / blocksize)

can overflow when the cluster ratio is large enough. Since
s_blocks_per_group is not range-checked in the bigalloc path, the
wrapped product can pass the consistency check, leading to inconsistent
group geometry and potential out-of-bounds block allocation.

Reduce EXT4_MAX_CLUSTER_LOG_SIZE to 28 to match the documented 256MB
limit. With this cap, the maximum product is:

  (blocksize * 8) * (256M / blocksize) = 2^31

which fits safely in a 32-bit unsigned long for all block sizes.
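
The wrap and the new bound can be checked with a small userspace
sketch.  The helpers below are hypothetical and only model the
arithmetic; the 16392 x 262144 example is the one from the Sashiko
report.

```c
#include <assert.h>
#include <stdint.h>

/* Product as a 32-bit unsigned long would hold it (reduced mod 2^32). */
static uint32_t product_32(uint32_t clusters_per_group, uint32_t ratio)
{
	return (uint32_t)((uint64_t)clusters_per_group * ratio);
}

/* Same product computed without truncation. */
static uint64_t product_64(uint32_t clusters_per_group, uint32_t ratio)
{
	return (uint64_t)clusters_per_group * ratio;
}

/*
 * Largest product the consistency check can see for a given block size
 * (a power of two) and a given cap on log2(cluster size).
 */
static uint64_t max_product(uint32_t blocksize, unsigned int max_cluster_log)
{
	assert(blocksize >= 1024 && blocksize <= 65536);
	assert(max_cluster_log < 32);
	return ((uint64_t)blocksize * 8) *
	       ((UINT64_C(1) << max_cluster_log) / blocksize);
}
```

Here product_32(16392, 262144) yields 2097152 although the untruncated
product is 2^32 + 2097152, and max_product() stays at 2^31 for a cap of
28 while a cap of 30 gives 2^33.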

Reported-by: Sashiko <sashiko-bot@kernel.org>
Closes: https://sashiko.dev/#/patchset/20260608061112.392391-1-libaokun%40linux.alibaba.com
Signed-off-by: Baokun Li <libaokun@linux.alibaba.com>
---
 fs/ext4/ext4.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 94283a991e5c..11e41a864db8 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -334,7 +334,7 @@ struct ext4_io_submit {
 #define	EXT4_MAX_BLOCK_SIZE		65536
 #define EXT4_MIN_BLOCK_LOG_SIZE		10
 #define EXT4_MAX_BLOCK_LOG_SIZE		16
-#define EXT4_MAX_CLUSTER_LOG_SIZE	30
+#define EXT4_MAX_CLUSTER_LOG_SIZE	28
 #ifdef __KERNEL__
 # define EXT4_BLOCK_SIZE(s)		((s)->s_blocksize)
 #else
-- 
2.43.7

* [PATCH v2 1/3] ext4: reject mount if clusters/inodes per group are not 8-aligned
From: Baokun Li @ 2026-06-08 11:11 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, adilger.kernel, jack, yi.zhang, ojaswin, ritesh.list,
	Sashiko
In-Reply-To: <20260608111150.827117-1-libaokun@linux.alibaba.com>

The block and inode bitmap checksums are computed over a whole number of
bytes: ext4_inode_bitmap_csum_*() use EXT4_INODES_PER_GROUP(sb) >> 3 and
ext4_block_bitmap_csum_*() use EXT4_CLUSTERS_PER_GROUP(sb) / 8 as the
length passed to ext4_chksum().

If s_inodes_per_group or s_clusters_per_group is not a multiple of 8, the
trailing fractional bits are excluded from the checksum.  Those bits are
then unprotected, and any incremental csum update path that assumes a
byte-aligned bitmap can compute a checksum inconsistent with the full
recalculation, corrupting the on-disk bitmap checksum.

Reject such filesystems at mount time by adding the missing " & 7"
alignment checks alongside the existing range validation.
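
To make the byte-granularity point concrete, here is a small userspace
sketch.  It assumes, as described above, that the checksum length is the
bit count shifted right by 3; the helper names are made up for this
example.

```c
#include <assert.h>
#include <stdint.h>

/* Bitmap bits covered by a checksum over (nbits >> 3) whole bytes. */
static uint32_t bits_covered(uint32_t nbits)
{
	return (nbits >> 3) * 8;
}

/* Trailing bits that fall outside the checksummed bytes. */
static uint32_t bits_unprotected(uint32_t nbits)
{
	uint32_t covered = bits_covered(nbits);

	assert(covered <= nbits);
	return nbits - covered;
}
```

bits_unprotected(n) equals n & 7, so it is zero exactly for the values
the new check accepts: 8192 bits leave nothing uncovered, while 8190
bits leave the last 6 outside the checksum.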

Suggested-by: Theodore Ts'o <tytso@mit.edu>
Link: https://patch.msgid.link/h3n7jlfhyna64dn5o76qxcspnhxdddcs6crpxftmy7gnl7b3sx@jenszfpcsnit
Reported-by: Sashiko <sashiko-bot@kernel.org>
Closes: https://sashiko.dev/#/patchset/20260508121539.4174601-1-libaokun%40linux.alibaba.com?part=10
Signed-off-by: Baokun Li <libaokun@linux.alibaba.com>
---
 fs/ext4/super.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 6a77db4d3124..3ddcb4a8d4db 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -4472,8 +4472,9 @@ static int ext4_handle_clustersize(struct super_block *sb)
 		sbi->s_cluster_bits = 0;
 	}
 	sbi->s_clusters_per_group = le32_to_cpu(es->s_clusters_per_group);
-	if (sbi->s_clusters_per_group > sb->s_blocksize * 8) {
-		ext4_msg(sb, KERN_ERR, "#clusters per group too big: %lu",
+	if (sbi->s_clusters_per_group > sb->s_blocksize * 8 ||
+	    sbi->s_clusters_per_group & 7) {
+		ext4_msg(sb, KERN_ERR, "invalid #clusters per group: %lu",
 			 sbi->s_clusters_per_group);
 		return -EINVAL;
 	}
@@ -5304,8 +5305,9 @@ static int ext4_block_group_meta_init(struct super_block *sb, int silent)
 		return -EINVAL;
 	}
 	if (sbi->s_inodes_per_group < sbi->s_inodes_per_block ||
-	    sbi->s_inodes_per_group > sb->s_blocksize * 8) {
-		ext4_msg(sb, KERN_ERR, "invalid inodes per group: %lu\n",
+	    sbi->s_inodes_per_group > sb->s_blocksize * 8 ||
+	    sbi->s_inodes_per_group & 7) {
+		ext4_msg(sb, KERN_ERR, "invalid inodes per group: %lu",
 			 sbi->s_inodes_per_group);
 		return -EINVAL;
 	}
-- 
2.43.7


* [PATCH v2 0/3] ext4: tighten mount-time superblock geometry validation
From: Baokun Li @ 2026-06-08 11:11 UTC (permalink / raw)
  To: linux-ext4; +Cc: tytso, adilger.kernel, jack, yi.zhang, ojaswin, ritesh.list

Changes since v1:
 * Patch 1: Removed a spurious newline in the error message format string.
 * Added Patches 2 and 3 to fix additional issues reported by Sashiko
   (independent of Patch 1).

v1: https://patch.msgid.link/20260608061112.392391-1-libaokun@linux.alibaba.com

This series adds missing mount-time sanity checks for superblock
geometry parameters, preventing crafted filesystem images from causing
bitmap checksum corruption, integer overflow, or out-of-bounds inode
table access.

Patch 1 rejects filesystems where s_clusters_per_group or
s_inodes_per_group is not 8-aligned, since the bitmap checksum
functions operate on whole bytes and would leave trailing bits
unprotected.

Patch 2 reduces EXT4_MAX_CLUSTER_LOG_SIZE from 30 to 28 to match
the documented 256MB limit in mke2fs, preventing a 32-bit overflow
in the blocks-per-group consistency check on bigalloc filesystems.

Patch 3 rejects filesystems where s_inodes_per_group is not a
multiple of s_inodes_per_block, preventing truncation in the
s_itb_per_group calculation that could lead __ext4_get_inode_loc()
to read beyond the inode table.

Baokun Li (3):
  ext4: reject mount if clusters/inodes per group are not 8-aligned
  ext4: reduce max cluster size to match documented 256MB limit
  ext4: reject mount if inodes per group is not a multiple of inodes per
    block

 fs/ext4/ext4.h  |  2 +-
 fs/ext4/super.c | 11 +++++++----
 2 files changed, 8 insertions(+), 5 deletions(-)

-- 
2.43.7

* [PATCH v3] common/defrag: skip defrag tests on DAX-enabled filesystems
From: Disha Goel @ 2026-06-08 10:23 UTC (permalink / raw)
  To: fstests
  Cc: linux-ext4, linux-fsdevel, linux-xfs, ritesh.list, ojaswin,
	djwong, Disha Goel

Online defragmentation is not supported on ext4 DAX-enabled filesystems.
The ext4 defrag ioctl (EXT4_IOC_MOVE_EXT) returns EOPNOTSUPP when used
on DAX files.

Add an ext4-specific check in _require_defrag() to skip tests when DAX
is enabled, avoiding false failures on ext4/301-304, ext4/308, and
generic/018.

XFS defrag works with DAX, so this check is ext4-specific.

Suggested-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Disha Goel <disgoel@linux.ibm.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
---
Changes in v3:
- Move the DAX check inside the ext4 case statement as
  suggested by Darrick

 common/defrag | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/common/defrag b/common/defrag
index 055d0d0e..baf05d94 100644
--- a/common/defrag
+++ b/common/defrag
@@ -13,6 +13,8 @@ _require_defrag()
         DEFRAG_PROG="$XFS_FSR_PROG"
 	;;
     ext4)
+        __scratch_uses_fsdax && _notrun "ext4 online defrag not supported with DAX"
+
 	testfile="$TEST_DIR/$$-test.defrag"
 	donorfile="$TEST_DIR/$$-donor.defrag"
 	bsize=`_get_block_size $TEST_DIR`
-- 
2.45.1


* Re: [PATCH RFC 8/8] super: make fs_holder_ops private
From: Jan Kara @ 2026-06-08 10:18 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Christoph Hellwig, Jan Kara, Jens Axboe, Alexander Viro,
	linux-block, linux-kernel, linux-fsdevel, Carlos Maiolino,
	linux-xfs, Chris Mason, David Sterba, linux-btrfs,
	Theodore Ts'o, linux-ext4, Gao Xiang, linux-erofs
In-Reply-To: <20260602-work-super-bdev_holder_global-v1-8-bb0fd82f3861@kernel.org>

On Tue 02-06-26 12:10:14, Christian Brauner wrote:
> There's no need to expose it anymore.
> 
> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/super.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/fs/super.c b/fs/super.c
> index cea743f699e4..983c2fbf5202 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -1643,13 +1643,12 @@ static int fs_bdev_thaw(struct block_device *bdev)
>  	return error;
>  }
>  
> -const struct blk_holder_ops fs_holder_ops = {
> +static const struct blk_holder_ops fs_holder_ops = {
>  	.mark_dead		= fs_bdev_mark_dead,
>  	.sync			= fs_bdev_sync,
>  	.freeze			= fs_bdev_freeze,
>  	.thaw			= fs_bdev_thaw,
>  };
> -EXPORT_SYMBOL_GPL(fs_holder_ops);
>  
>  static int fs_bdev_register(struct file *bdev_file, struct super_block *sb)
>  {
> 
> -- 
> 2.47.3
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* Re: [PATCH RFC 6/8] ext4: open via dedicated fs bdev helpers
From: Jan Kara @ 2026-06-08 10:18 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Christoph Hellwig, Jan Kara, Jens Axboe, Alexander Viro,
	linux-block, linux-kernel, linux-fsdevel, Carlos Maiolino,
	linux-xfs, Chris Mason, David Sterba, linux-btrfs,
	Theodore Ts'o, linux-ext4, Gao Xiang, linux-erofs
In-Reply-To: <20260602-work-super-bdev_holder_global-v1-6-bb0fd82f3861@kernel.org>

On Tue 02-06-26 12:10:12, Christian Brauner wrote:
> Route opens through fs_bdev_file_open_by_path() so each external device
> is registered against the correct superblock, and convert the matching
> releases.
> 
> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/ext4/super.c | 12 ++++++------
>  1 file changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 6a77db4d3124..8108d999008e 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -5793,7 +5793,7 @@ failed_mount8: __maybe_unused
>  	brelse(sbi->s_sbh);
>  	if (sbi->s_journal_bdev_file) {
>  		invalidate_bdev(file_bdev(sbi->s_journal_bdev_file));
> -		bdev_fput(sbi->s_journal_bdev_file);
> +		fs_bdev_file_release(sbi->s_journal_bdev_file, sb);
>  	}
>  out_fail:
>  	invalidate_bdev(sb->s_bdev);
> @@ -5972,9 +5972,9 @@ static struct file *ext4_get_journal_blkdev(struct super_block *sb,
>  	struct ext4_super_block *es;
>  	int errno;
>  
> -	bdev_file = bdev_file_open_by_dev(j_dev,
> +	bdev_file = fs_bdev_file_open_by_dev(j_dev,
>  		BLK_OPEN_READ | BLK_OPEN_WRITE | BLK_OPEN_RESTRICT_WRITES,
> -		sb, &fs_holder_ops);
> +		sb, sb);
>  	if (IS_ERR(bdev_file)) {
>  		ext4_msg(sb, KERN_ERR,
>  			 "failed to open journal device unknown-block(%u,%u) %ld",
> @@ -6034,7 +6034,7 @@ static struct file *ext4_get_journal_blkdev(struct super_block *sb,
>  out_bh:
>  	brelse(bh);
>  out_bdev:
> -	bdev_fput(bdev_file);
> +	fs_bdev_file_release(bdev_file, sb);
>  	return ERR_PTR(errno);
>  }
>  
> @@ -6073,7 +6073,7 @@ static journal_t *ext4_open_dev_journal(struct super_block *sb,
>  out_journal:
>  	ext4_journal_destroy(EXT4_SB(sb), journal);
>  out_bdev:
> -	bdev_fput(bdev_file);
> +	fs_bdev_file_release(bdev_file, sb);
>  	return ERR_PTR(errno);
>  }
>  
> @@ -7492,7 +7492,7 @@ static void ext4_kill_sb(struct super_block *sb)
>  	kill_block_super(sb);
>  
>  	if (bdev_file)
> -		bdev_fput(bdev_file);
> +		fs_bdev_file_release(bdev_file, sb);
>  }
>  
>  static struct file_system_type ext4_fs_type = {
> 
> -- 
> 2.47.3
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* Re: [PATCH RFC 4/8] xfs: port to fs_bdev_file_open_by_path()
From: Jan Kara @ 2026-06-08 10:15 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Christoph Hellwig, Jan Kara, Jens Axboe, Alexander Viro,
	linux-block, linux-kernel, linux-fsdevel, Carlos Maiolino,
	linux-xfs, Chris Mason, David Sterba, linux-btrfs,
	Theodore Ts'o, linux-ext4, Gao Xiang, linux-erofs
In-Reply-To: <20260602-work-super-bdev_holder_global-v1-4-bb0fd82f3861@kernel.org>

On Tue 02-06-26 12:10:10, Christian Brauner wrote:
> Route opens through fs_bdev_file_open_by_path() so each external device
> is registered against mp->m_super, and convert the matching releases.
> 
> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/xfs/xfs_buf.c   |  2 +-
>  fs/xfs/xfs_super.c | 10 +++++-----
>  2 files changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index 580d40a5ee57..3d3b29edb156 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -1601,7 +1601,7 @@ xfs_free_buftarg(
>  	fs_put_dax(btp->bt_daxdev, btp->bt_mount);
>  	/* the main block device is closed by kill_block_super */
>  	if (btp->bt_bdev != btp->bt_mount->m_super->s_bdev)
> -		bdev_fput(btp->bt_file);
> +		fs_bdev_file_release(btp->bt_file, btp->bt_mount->m_super);
>  	kfree(btp);
>  }
>  
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index f8de44443e81..304667210695 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -400,8 +400,8 @@ xfs_blkdev_get(
>  	blk_mode_t		mode;
>  
>  	mode = sb_open_mode(mp->m_super->s_flags);
> -	*bdev_filep = bdev_file_open_by_path(name, mode,
> -			mp->m_super, &fs_holder_ops);
> +	*bdev_filep = fs_bdev_file_open_by_path(name, mode,
> +			mp->m_super, mp->m_super);
>  	if (IS_ERR(*bdev_filep)) {
>  		error = PTR_ERR(*bdev_filep);
>  		*bdev_filep = NULL;
> @@ -526,7 +526,7 @@ xfs_open_devices(
>  		mp->m_logdev_targp = mp->m_ddev_targp;
>  		/* Handle won't be used, drop it */
>  		if (logdev_file)
> -			bdev_fput(logdev_file);
> +			fs_bdev_file_release(logdev_file, mp->m_super);
>  	}
>  
>  	return 0;
> @@ -538,10 +538,10 @@ xfs_open_devices(
>  	xfs_free_buftarg(mp->m_ddev_targp);
>   out_close_rtdev:
>  	 if (rtdev_file)
> -		bdev_fput(rtdev_file);
> +		fs_bdev_file_release(rtdev_file, mp->m_super);
>   out_close_logdev:
>  	if (logdev_file)
> -		bdev_fput(logdev_file);
> +		fs_bdev_file_release(logdev_file, mp->m_super);
>  	return error;
>  }
>  
> 
> -- 
> 2.47.3
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* Re: [PATCH RFC 2/8] fs: add a global device to super block hash table
From: Jan Kara @ 2026-06-08 10:14 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Christoph Hellwig, Jan Kara, Jens Axboe, Alexander Viro,
	linux-block, linux-kernel, linux-fsdevel, Carlos Maiolino,
	linux-xfs, Chris Mason, David Sterba, linux-btrfs,
	Theodore Ts'o, linux-ext4, Gao Xiang, linux-erofs
In-Reply-To: <20260602-work-super-bdev_holder_global-v1-2-bb0fd82f3861@kernel.org>

On Tue 02-06-26 12:10:08, Christian Brauner wrote:
> fs_holder_ops recovers the owning superblock from bdev->bd_holder, which
> forces the holder to be exactly one superblock and prevents several
> superblocks from sharing one block device. That's what erofs is doing.
> 
> Introduce a global dev_t-keyed rhltable mapping each block device to the
> superblock(s) using it. The holder argument becomes purely the block
> layer's exclusivity token (a superblock, or a file_system_type for
> shared devices) and is no longer needed by the fs specific callbacks.
> 
> Registration keeps one entry per (device, superblock). When a filesystem
> claims a device it already uses (xfs with its log on the data device), no
> second entry is added, so each superblock is acted on once.
> 
> Each table entry holds a passive reference (s_count) on its superblock,
> so the struct stays valid for as long as the entry is reachable. The
> callbacks look the device up in the table and act on every superblock
> using it:
> 
> Unlinking an entry is deferred to the last unpin, so a cursor never
> resumes from a removed node. After this it's possible to act on all
> superblocks that share a given device.
> 
> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>

Looks good! One comment below:

>  static void fs_bdev_mark_dead(struct block_device *bdev, bool surprise)
>  {
> -	struct super_block *sb;
> +	struct fs_bdev_holder *h;
> +	dev_t dev = bdev->bd_dev;
>  
> -	sb = bdev_super_lock(bdev, false);
> -	if (!sb)
> -		return;
> +	mutex_unlock(&bdev->bd_holder_lock);

The moment we drop bd_holder_lock, there's nothing which prevents the bdev
owner from changing. So this can lead to a situation where we miss calling
->mark_dead callback of the new holder. Similarly for all the other holder
ops. I didn't find a situation where it would actually matter so I think
we're fine but it's a potential catch. Anyway, feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

>  
> -	if (sb->s_op->remove_bdev) {
> -		int ret;
> +	for (h = fs_bdev_first(dev); h; h = fs_bdev_next(h)) {
> +		struct super_block *sb = h->sb;
>  
> -		ret = sb->s_op->remove_bdev(sb, bdev);
> -		if (!ret) {
> -			super_unlock_shared(sb);
> -			return;
> +		if (!super_lock_shared(sb))
> +			continue;
> +		if (sb->s_root && (sb->s_flags & SB_ACTIVE)) {
> +			if (!sb->s_op->remove_bdev ||
> +			    sb->s_op->remove_bdev(sb, bdev)) {
> +				if (!surprise)
> +					sync_filesystem(sb);
> +				shrink_dcache_sb(sb);
> +				evict_inodes(sb);
> +				if (sb->s_op->shutdown)
> +					sb->s_op->shutdown(sb);
> +			}
>  		}
> -		/* Fallback to shutdown. */
> +		super_unlock_shared(sb);
>  	}
> -
> -	if (!surprise)
> -		sync_filesystem(sb);
> -	shrink_dcache_sb(sb);
> -	evict_inodes(sb);
> -	if (sb->s_op->shutdown)
> -		sb->s_op->shutdown(sb);
> -
> -	super_unlock_shared(sb);
>  }
>  
>  static void fs_bdev_sync(struct block_device *bdev)
>  {
> -	struct super_block *sb;
> +	struct fs_bdev_holder *h;
> +	dev_t dev = bdev->bd_dev;
>  
> -	sb = bdev_super_lock(bdev, false);
> -	if (!sb)
> -		return;
> +	mutex_unlock(&bdev->bd_holder_lock);
>  
> -	sync_filesystem(sb);
> -	super_unlock_shared(sb);
> -}
> +	for (h = fs_bdev_first(dev); h; h = fs_bdev_next(h)) {
> +		struct super_block *sb = h->sb;
>  
> -static struct super_block *get_bdev_super(struct block_device *bdev)
> -{
> -	bool active = false;
> -	struct super_block *sb;
> -
> -	sb = bdev_super_lock(bdev, true);
> -	if (sb) {
> -		active = atomic_inc_not_zero(&sb->s_active);
> -		super_unlock_excl(sb);
> +		if (!super_lock_shared(sb))
> +			continue;
> +		if (sb->s_root && (sb->s_flags & SB_ACTIVE))
> +			sync_filesystem(sb);
> +		super_unlock_shared(sb);
>  	}
> -	if (!active)
> -		return NULL;
> -	return sb;
>  }
>  
>  /**
> - * fs_bdev_freeze - freeze owning filesystem of block device
> + * fs_bdev_freeze - freeze every superblock using a block device
>   * @bdev: block device
>   *
> - * Freeze the filesystem that owns this block device if it is still
> - * active.
> - *
> - * A filesystem that owns multiple block devices may be frozen from each
> - * block device and won't be unfrozen until all block devices are
> - * unfrozen. Each block device can only freeze the filesystem once as we
> - * nest freezes for block devices in the block layer.
> + * Freeze each live superblock using @bdev.  A superblock owning several block
> + * devices is frozen once per device and stays frozen until all are thawed; the
> + * block layer nests these freezes so the count stays balanced.
>   *
> - * Return: If the freeze was successful zero is returned. If the freeze
> - *         failed a negative error code is returned.
> + * Return: 0, or the error from the one superblock on a single-fs device.  When
> + *         several superblocks share @bdev a per-superblock failure is swallowed
> + *         (see below), but a sync_blockdev() failure is always reported.
>   */
>  static int fs_bdev_freeze(struct block_device *bdev)
>  {
> -	struct super_block *sb;
> -	int error = 0;
> +	dev_t dev = bdev->bd_dev;
> +	struct fs_bdev_holder *h;
> +	unsigned int count = 0;
> +	int error = 0, err;
>  
>  	lockdep_assert_held(&bdev->bd_fsfreeze_mutex);
>  
> -	sb = get_bdev_super(bdev);
> -	if (!sb)
> -		return -EINVAL;
> +	mutex_unlock(&bdev->bd_holder_lock);
>  
> -	if (sb->s_op->freeze_super)
> -		error = sb->s_op->freeze_super(sb,
> -				FREEZE_MAY_NEST | FREEZE_HOLDER_USERSPACE, NULL);
> -	else
> -		error = freeze_super(sb,
> -				FREEZE_MAY_NEST | FREEZE_HOLDER_USERSPACE, NULL);
> +	for (h = fs_bdev_first(dev); h; h = fs_bdev_next(h)) {
> +		if (!atomic_inc_not_zero(&h->sb->s_active))
> +			continue;
> +		err = fs_super_freeze(h->sb);
> +		if (err && !error)
> +			error = err;
> +		deactivate_super(h->sb);
> +		count++;
> +	}
> +
> +	/*
> +	 * When several superblocks share the device, keep it frozen even if some
> +	 * of them failed to freeze and swallow the error: rolling the rest back
> +	 * via thaw_super() can fail too, so neither is a clear win. A single
> +	 * filesystem (count == 1) still reports its error.
> +	 */
> +	if (error && count > 1)
> +		error = 0;
>  	if (!error)
>  		error = sync_blockdev(bdev);
> -	deactivate_super(sb);
>  	return error;
>  }
>  
>  /**
> - * fs_bdev_thaw - thaw owning filesystem of block device
> + * fs_bdev_thaw - thaw every superblock using a block device
>   * @bdev: block device
>   *
> - * Thaw the filesystem that owns this block device.
> + * The counterpart to fs_bdev_freeze(): thaw each live superblock using @bdev.
> + * A zero return does not imply a superblock is fully unfrozen; it may have been
> + * frozen more than once (by the kernel or via another device).
>   *
> - * A filesystem that owns multiple block devices may be frozen from each
> - * block device and won't be unfrozen until all block devices are
> - * unfrozen. Each block device can only freeze the filesystem once as we
> - * nest freezes for block devices in the block layer.
> - *
> - * Return: If the thaw was successful zero is returned. If the thaw
> - *         failed a negative error code is returned. If this function
> - *         returns zero it doesn't mean that the filesystem is unfrozen
> - *         as it may have been frozen multiple times (kernel may hold a
> - *         freeze or might be frozen from other block devices).
> + * Return: 0, or the first error on a single-fs device; a shared device swallows
> + *         per-superblock errors, as fs_bdev_freeze() does.
>   */
>  static int fs_bdev_thaw(struct block_device *bdev)
>  {
> -	struct super_block *sb;
> -	int error;
> +	dev_t dev = bdev->bd_dev;
> +	struct fs_bdev_holder *h;
> +	unsigned int count = 0;
> +	int error = 0, err;
>  
>  	lockdep_assert_held(&bdev->bd_fsfreeze_mutex);
>  
> -	/*
> -	 * The block device may have been frozen before it was claimed by a
> -	 * filesystem. Concurrently another process might try to mount that
> -	 * frozen block device and has temporarily claimed the block device for
> -	 * that purpose causing a concurrent fs_bdev_thaw() to end up here. The
> -	 * mounter is already about to abort mounting because they still saw an
> -	 * elevanted bdev->bd_fsfreeze_count so get_bdev_super() will return
> -	 * NULL in that case.
> -	 */
> -	sb = get_bdev_super(bdev);
> -	if (!sb)
> -		return -EINVAL;
> +	mutex_unlock(&bdev->bd_holder_lock);
>  
> -	if (sb->s_op->thaw_super)
> -		error = sb->s_op->thaw_super(sb,
> -				FREEZE_MAY_NEST | FREEZE_HOLDER_USERSPACE, NULL);
> -	else
> -		error = thaw_super(sb,
> -				FREEZE_MAY_NEST | FREEZE_HOLDER_USERSPACE, NULL);
> -	deactivate_super(sb);
> +	for (h = fs_bdev_first(dev); h; h = fs_bdev_next(h)) {
> +		if (!atomic_inc_not_zero(&h->sb->s_active))
> +			continue;
> +		err = fs_super_thaw(h->sb);
> +		if (err && !error)
> +			error = err;
> +		deactivate_super(h->sb);
> +		count++;
> +	}
> +
> +	/* Shared device: swallow per-superblock errors, like fs_bdev_freeze(). */
> +	if (error && count > 1)
> +		error = 0;
>  	return error;
>  }
>  
> @@ -1602,6 +1651,131 @@ const struct blk_holder_ops fs_holder_ops = {
>  };
>  EXPORT_SYMBOL_GPL(fs_holder_ops);
>  
> +static int fs_bdev_register(struct file *bdev_file, struct super_block *sb)
> +{
> +	dev_t dev = file_bdev(bdev_file)->bd_dev;
> +	struct rhlist_head *list, *pos;
> +	struct fs_bdev_holder *h;
> +	int err;
> +
> +	/*
> +	 * A superblock may claim one device more than once (xfs with its log on
> +	 * the data device).  Keep a single entry per (device, superblock) and
> +	 * count the claims in @fs_bdev_active; the entry lives until the last one
> +	 * is released.
> +	 */
> +	scoped_guard(rcu) {
> +		list = rhltable_lookup(&fs_bdev_supers, &dev, fs_bdev_params);
> +		rhl_for_each_entry_rcu(h, pos, list, node)
> +			if (h->sb == sb && refcount_inc_not_zero(&h->fs_bdev_active))
> +				return 0;
> +	}
> +
> +	h = kmalloc(sizeof(*h), GFP_KERNEL);
> +	if (!h)
> +		return -ENOMEM;
> +	h->dev = dev;
> +	h->sb = sb;
> +	refcount_set(&h->fs_bdev_passive, 1);
> +	refcount_set(&h->fs_bdev_active, 1);
> +
> +	err = rhltable_insert(&fs_bdev_supers, &h->node, fs_bdev_params);
> +	if (err) {
> +		kfree(h);
> +		return err;
> +	}
> +
> +	/* The sb->s_count ref keeps @h->sb valid for as long as the entry exists. */
> +	spin_lock(&sb_lock);
> +	sb->s_count++;
> +	spin_unlock(&sb_lock);
> +
> +	return 0;
> +}
> +
> +/**
> + * fs_bdev_file_open_by_dev - claim a block device on behalf of a superblock
> + * @dev: block device number
> + * @mode: open mode
> + * @holder: block-layer exclusivity token (a superblock, or the file_system_type
> + *          when the device may be shared by several superblocks of that type)
> + * @sb: superblock to drive fs_holder_ops events for
> + *
> + * Open @dev with &fs_holder_ops and register that @sb uses it, so device
> + * removal/sync/freeze/thaw are propagated to @sb (and any other superblock
> + * sharing @dev).  Must be paired with fs_bdev_file_release().
> + *
> + * Return: an opened block-device file or an ERR_PTR().
> + */
> +struct file *fs_bdev_file_open_by_dev(dev_t dev, blk_mode_t mode, void *holder,
> +				      struct super_block *sb)
> +{
> +	struct file *bdev_file;
> +	int err;
> +
> +	bdev_file = bdev_file_open_by_dev(dev, mode, holder, &fs_holder_ops);
> +	if (IS_ERR(bdev_file))
> +		return bdev_file;
> +
> +	err = fs_bdev_register(bdev_file, sb);
> +	if (err) {
> +		bdev_fput(bdev_file);
> +		return ERR_PTR(err);
> +	}
> +	return bdev_file;
> +}
> +EXPORT_SYMBOL_GPL(fs_bdev_file_open_by_dev);
> +
> +struct file *fs_bdev_file_open_by_path(const char *path, blk_mode_t mode,
> +				       void *holder, struct super_block *sb)
> +{
> +	struct file *bdev_file;
> +	int err;
> +
> +	bdev_file = bdev_file_open_by_path(path, mode, holder, &fs_holder_ops);
> +	if (IS_ERR(bdev_file))
> +		return bdev_file;
> +
> +	err = fs_bdev_register(bdev_file, sb);
> +	if (err) {
> +		bdev_fput(bdev_file);
> +		return ERR_PTR(err);
> +	}
> +	return bdev_file;
> +}
> +EXPORT_SYMBOL_GPL(fs_bdev_file_open_by_path);
> +
> +/**
> + * fs_bdev_file_release - release a block device claimed for a superblock
> + * @bdev_file: file returned by fs_bdev_file_open_by_{dev,path}()
> + * @sb: superblock the device was claimed for
> + *
> + * Drop one claim on the {dev, @sb} entry; the last claim unregisters it (a
> + * pinning cursor defers the actual unlink).  Then close the block device.
> + */
> +void fs_bdev_file_release(struct file *bdev_file, struct super_block *sb)
> +{
> +	dev_t dev = file_bdev(bdev_file)->bd_dev;
> +	struct fs_bdev_holder *h, *found = NULL;
> +	struct rhlist_head *list, *pos;
> +
> +	rcu_read_lock();
> +	list = rhltable_lookup(&fs_bdev_supers, &dev, fs_bdev_params);
> +	rhl_for_each_entry_rcu(h, pos, list, node) {
> +		if (h->sb != sb)
> +			continue;
> +		/* At most one entry per (dev, sb); the last claim drops the bias. */
> +		if (refcount_dec_and_test(&h->fs_bdev_active))
> +			found = h;
> +		break;
> +	}
> +	rcu_read_unlock();
> +	if (found)
> +		fs_bdev_holder_put(found);
> +	bdev_fput(bdev_file);
> +}
> +EXPORT_SYMBOL_GPL(fs_bdev_file_release);
> +
>  int setup_bdev_super(struct super_block *sb, int sb_flags,
>  		struct fs_context *fc)
>  {
> @@ -1609,7 +1783,7 @@ int setup_bdev_super(struct super_block *sb, int sb_flags,
>  	struct file *bdev_file;
>  	struct block_device *bdev;
>  
> -	bdev_file = bdev_file_open_by_dev(sb->s_dev, mode, sb, &fs_holder_ops);
> +	bdev_file = fs_bdev_file_open_by_dev(sb->s_dev, mode, sb, sb);
>  	if (IS_ERR(bdev_file)) {
>  		if (fc)
>  			errorf(fc, "%s: Can't open blockdev", fc->source);
> @@ -1623,7 +1797,7 @@ int setup_bdev_super(struct super_block *sb, int sb_flags,
>  	 * writable from userspace even for a read-only block device.
>  	 */
>  	if ((mode & BLK_OPEN_WRITE) && bdev_read_only(bdev)) {
> -		bdev_fput(bdev_file);
> +		fs_bdev_file_release(bdev_file, sb);
>  		return -EACCES;
>  	}
>  
> @@ -1634,7 +1808,7 @@ int setup_bdev_super(struct super_block *sb, int sb_flags,
>  	if (atomic_read(&bdev->bd_fsfreeze_count) > 0) {
>  		if (fc)
>  			warnf(fc, "%pg: Can't mount, blockdev is frozen", bdev);
> -		bdev_fput(bdev_file);
> +		fs_bdev_file_release(bdev_file, sb);
>  		return -EBUSY;
>  	}
>  	spin_lock(&sb_lock);
> @@ -1725,7 +1899,7 @@ void kill_block_super(struct super_block *sb)
>  	generic_shutdown_super(sb);
>  	if (bdev) {
>  		sync_blockdev(bdev);
> -		bdev_fput(sb->s_bdev_file);
> +		fs_bdev_file_release(sb->s_bdev_file, sb);
>  	}
>  }
>  
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index c8494d64a69d..43d37c02febf 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1760,13 +1760,6 @@ struct blk_holder_ops {
>  	int (*thaw)(struct block_device *bdev);
>  };
>  
> -/*
> - * For filesystems using @fs_holder_ops, the @holder argument passed to
> - * helpers used to open and claim block devices via
> - * bd_prepare_to_claim() must point to a superblock.
> - */
> -extern const struct blk_holder_ops fs_holder_ops;
> -
>  /*
>   * Return the correct open flags for blkdev_get_by_* for super block flags
>   * as stored in sb->s_flags.
> diff --git a/include/linux/fs/super.h b/include/linux/fs/super.h
> index f21ffbb6dea5..721d842e3b24 100644
> --- a/include/linux/fs/super.h
> +++ b/include/linux/fs/super.h
> @@ -235,4 +235,11 @@ int freeze_super(struct super_block *super, enum freeze_holder who,
>  int thaw_super(struct super_block *super, enum freeze_holder who,
>  	       const void *freeze_owner);
>  
> +struct file;
> +struct file *fs_bdev_file_open_by_dev(dev_t dev, blk_mode_t mode, void *holder,
> +				      struct super_block *sb);
> +struct file *fs_bdev_file_open_by_path(const char *path, blk_mode_t mode,
> +				       void *holder, struct super_block *sb);
> +void fs_bdev_file_release(struct file *bdev_file, struct super_block *sb);
> +
>  #endif /* _LINUX_FS_SUPER_H */
> 
> -- 
> 2.47.3
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH RFC 3/8] fs: refuse to claim any frozen block device
From: Jan Kara @ 2026-06-08 10:01 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Christoph Hellwig, Jan Kara, Jens Axboe, Alexander Viro,
	linux-block, linux-kernel, linux-fsdevel, Carlos Maiolino,
	linux-xfs, Chris Mason, David Sterba, linux-btrfs,
	Theodore Ts'o, linux-ext4, Gao Xiang, linux-erofs
In-Reply-To: <20260602-work-super-bdev_holder_global-v1-3-bb0fd82f3861@kernel.org>

On Tue 02-06-26 12:10:09, Christian Brauner wrote:
> setup_bdev_super() already refuses to bring a filesystem up on a frozen
> block device but only for the primary device. Now that filesystems claim
> every device through fs_bdev_file_open_by_{dev,path}(), do that check
> once in the registration helper so it covers all of them.
> 
> Drop the now-redundant check from setup_bdev_super().
> 
> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
> ---
>  fs/super.c | 21 +++++++++++----------
>  1 file changed, 11 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/super.c b/fs/super.c
> index e0174d5819a0..cea743f699e4 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -1690,6 +1690,17 @@ static int fs_bdev_register(struct file *bdev_file, struct super_block *sb)
>  	sb->s_count++;
>  	spin_unlock(&sb_lock);
>  
> +	/*
> +	 * Don't bring a filesystem up on a frozen device.  The entry is already
> +	 * published, so a freeze either is seen here or finds it and waits in
> +	 * super_lock() until this mount is born or (on -EBUSY) dies.  The mount
> +	 * aborts, so the entry is torn down without rebalancing @fs_bdev_active.
> +	 */
> +	if (atomic_read(&file_bdev(bdev_file)->bd_fsfreeze_count) > 0) {
> +		fs_bdev_holder_put(h);
> +		return -EBUSY;
> +	}
> +
>  	return 0;
>  }

Shouldn't this check also apply to the branch where we only increase the
refcount? Or is a filesystem whose superblock claims the bdev multiple
times, and can get frozen in between, too insane to care about?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH RFC 1/8] fs, block: move blk_mode_t and fop_flags_t into <linux/types.h>
From: Jan Kara @ 2026-06-08  9:57 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Christoph Hellwig, Jan Kara, Jens Axboe, Alexander Viro,
	linux-block, linux-kernel, linux-fsdevel, Carlos Maiolino,
	linux-xfs, Chris Mason, David Sterba, linux-btrfs,
	Theodore Ts'o, linux-ext4, Gao Xiang, linux-erofs
In-Reply-To: <20260602-work-super-bdev_holder_global-v1-1-bb0fd82f3861@kernel.org>

On Tue 02-06-26 12:10:07, Christian Brauner wrote:
> blk_mode_t and fop_flags_t are both plain 'unsigned int __bitwise' flag
> typedefs, exactly like the gfp_t, slab_flags_t and fmode_t that already
> live in <linux/types.h>. Move them there so they are available
> everywhere without having to drag in a subsystem header.
> 
> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>

Makes sense. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  include/linux/blkdev.h | 2 --
>  include/linux/fs.h     | 2 --
>  include/linux/types.h  | 2 ++
>  3 files changed, 2 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 890128cdea1c..c8494d64a69d 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -126,8 +126,6 @@ struct blk_integrity {
>  	unsigned char				pi_tuple_size;
>  };
>  
> -typedef unsigned int __bitwise blk_mode_t;
> -
>  /* open for reading */
>  #define BLK_OPEN_READ		((__force blk_mode_t)(1 << 0))
>  /* open for writing */
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 11559c513dfb..e9346be8470f 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1921,8 +1921,6 @@ struct dir_context {
>  struct io_uring_cmd;
>  struct offset_ctx;
>  
> -typedef unsigned int __bitwise fop_flags_t;
> -
>  struct file_operations {
>  	struct module *owner;
>  	fop_flags_t fop_flags;
> diff --git a/include/linux/types.h b/include/linux/types.h
> index 608050dbca6a..ef026585420b 100644
> --- a/include/linux/types.h
> +++ b/include/linux/types.h
> @@ -163,6 +163,8 @@ typedef u32 dma_addr_t;
>  typedef unsigned int __bitwise gfp_t;
>  typedef unsigned int __bitwise slab_flags_t;
>  typedef unsigned int __bitwise fmode_t;
> +typedef unsigned int __bitwise blk_mode_t;
> +typedef unsigned int __bitwise fop_flags_t;
>  
>  #ifdef CONFIG_PHYS_ADDR_T_64BIT
>  typedef u64 phys_addr_t;
> 
> -- 
> 2.47.3
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: security bug reporting: e2fsprogs: Path Traversal and heap overflow
From: Andreas Dilger @ 2026-06-08  9:49 UTC (permalink / raw)
  To: Feng Xue; +Cc: tytso@mit.edu, linux-ext4@vger.kernel.org
In-Reply-To: <SY0P300MB0070F750CCF6F2C3A2A91FDE901F2@SY0P300MB0070.AUSP300.PROD.OUTLOOK.COM>

[-- Attachment #1: Type: text/plain, Size: 949 bytes --]

Hello Feng Xue,
thank you for your report.  The inode blocks overflow looks legitimate, and trivial to fix.  The reproducer is a bit strange, since it is a Python script that generates a synthetic ext4 image directly.  The more usual approach is an e2fsck test case like "f_64kblock", which uses mke2fs to create the filesystem with mostly appropriate parameters and debugfs to overwrite the values.

Then e2fsck can be run on the filesystem to fix the superblock s_blocks_per_group value.

A patch with the trivial code fix and a test case is attached for review.

The debugfs issue seems less important, since it requires the administrator to run the specific debugfs command on the specific file.

> On Jun 7, 2026, at 07:34, Feng Xue <feng.xue@outlook.com> wrote:
> 
> Hi there,
> 
> I'd like to report two potential security bugs for your review.
> detailed report and pocs attached.
> 
> Best,
> Feng

Cheers, Andreas

[-- Attachment #2: 0001-libext2fs-fix-inode_blocks-overflow-in-ext2fs_open.patch --]
[-- Type: application/octet-stream, Size: 6294 bytes --]

