public inbox for linux-ext4@vger.kernel.org
 help / color / mirror / Atom feed
From: Helen Koike <koike@igalia.com>
To: tytso@mit.edu, adilger.kernel@dilger.ca,
	linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, kernel-dev@igalia.com,
	syzbot+b73703b873a33d8eb8f6@syzkaller.appspotmail.com
Subject: Re: [PATCH] ext4: fix bigalloc cluster arithmetic when s_first_data_block != 0
Date: Fri, 13 Mar 2026 12:28:40 -0300	[thread overview]
Message-ID: <2d3ae00d-17ec-46ca-933d-1008a049948f@igalia.com> (raw)
In-Reply-To: <20260313151835.1248953-1-koike@igalia.com>

Please ignore this patch, I just found about:

https://lore.kernel.org/all/1361433665-16880-1-git-send-email-lczerner@redhat.com/

I'll re-evaluate and submit v2 if it make sense.

Thanks
Helen

On 3/13/26 12:18 PM, Helen Koike wrote:
> On filesystems with bigalloc enabled and s_first_data_block != 0,
> mballoc defines clusters starting from s_first_data_block, so cluster
> N covers blocks [s_first_data_block + N*s_cluster_ratio, ...).
> 
> EXT4_B2C/EXT4_C2B perform a plain bit-shift on the absolute block
> number, ignoring s_first_data_block.
> 
> For instance, with cluster_ratio = 4 and s_first_data_block = 1:
> 
> This is how EXT4_B2C/EXT4_C2B view it:
> 
>    block:    0   1   2   3   4   5   6   7   8   9  10  11  12 ...
>              |___________|   |___________|   |___________|
>    cluster:       0               1               2
> 
> This is how mballoc views it:
> 
>    block:    0   1   2   3   4   5   6   7   8   9  10  11  12 ...
>                  |___________|   |___________|   |___________|
>    cluster:           0               1               2
> 
> This causes the following issues:
> 
> 1) In extents.c, partial->pclu is stored and compared using EXT4_B2C,
>     then passed to ext4_free_blocks() via EXT4_C2B. The resulting
>     block address is misaligned with mballoc's cluster boundaries,
>     causing the wrong cluster to be freed.
> 
> 2) In ext4_free_blocks(), EXT4_PBLK_COFF uses the same plain
>     bit-mask, so the intra-cluster offset of the start block is
>     computed incorrectly, misaligning the freed range.
> 
> Introduce four macros that subtract s_first_data_block before
> operating, matching the coordinate system used by mballoc:
> 
>    EXT4_MB_B2C(sbi, blk)        block -> cluster number
>    EXT4_MB_C2B(sbi, cluster)    cluster -> first block of cluster
>    EXT4_MB_PBLK_COFF(sbi, blk)  intra-cluster offset of a block
>    EXT4_MB_LBLK_COFF(sbi, n)    intra-cluster offset of a block count
> 
> Use EXT4_MB_B2C/EXT4_MB_C2B for all partial->pclu operations in
> extents.c, and EXT4_MB_PBLK_COFF/EXT4_MB_LBLK_COFF in the alignment
> prologue of ext4_free_blocks().
> 
> Regarding the issue reported by syzbot:
> 
>    Context: s_first_data_block=1, cluster size 16 and block 145 contains
>    a extents leaf for inode 15.
>    When an extent is removed (block 145 went from 7 to 6 valid extent
>    entries), ext4_ext_rm_leaf() sees the cluster partial->pclu (which is
>    10) to be freed.
> 
>    Note that EXT4_C2B(partial->pclu=10) is 160, and EXT4_B2C(145) is 9
>    (so the extents leaf is in a different cluster). Finally
>    ext4_free_blocks(..,block=160, count=16..) is called, which shouldn't
>    be a problem (it should not free block 145).
> 
>    Except that in  ext4_free_blocks() (and later in ext4_mb_clear_bb()
>    and mb_free_blocks()) , block 160 resolves to group 0 bit 9 count 1
>    (which frees block 145 containing extents leaf!!!!), causing block
>    145 to be reused by other inodes while inode 15 still thinks that
>    block 145 contains extents metadata, resulting later in the UAF we see
>    by syzbot.
> 
>    The issue doesn't reproduce with the current patch.
> 
> Test results:
> 
>> kvm-xfstests --kernel $BUILD_DIR/arch/x86/boot/bzImage smoke
> 
> ext4/4k: 7 tests, 1 skipped, 1113 seconds
>    generic/475  Pass     185s
>    generic/476  Pass     184s
>    generic/521  Pass     182s
>    generic/522  Pass     181s
>    generic/642  Pass     185s
>    generic/750  Pass     185s
>    generic/751  Skipped  1s
> Totals: 7 tests, 1 skipped, 0 failures, 0 errors, 1103s
> 
>> kvm-xfstests --kernel $BUILD_DIR/arch/x86/boot/bzImage -c 1k smoke
> 
> ext4/1k: 5 tests, 1 skipped, 755 seconds
>    generic/521  Pass     183s
>    generic/522  Pass     182s
>    generic/642  Pass     186s
>    generic/750  Pass     186s
>    generic/751  Skipped  2s
> Totals: 5 tests, 1 skipped, 0 failures, 0 errors, 739s
> 
> Signed-off-by: Helen Koike <koike@igalia.com>
> Reported-by: syzbot+b73703b873a33d8eb8f6@syzkaller.appspotmail.com
> Closes: https://syzkaller.appspot.com/bug?extid=b73703b873a33d8eb8f6
> 
> ---
> 
> Hi all,
> 
> I'm sorry if the commit message is too verbose. I'm not a file system
> developer, so I tried to be didactic for those like me.
> 
> I was also wondering if I should update the existing macros instead of
> creating new ones, but since I'm not very familiar with this subsystem
> and current discussions I decided to be safe and change only for the use
> case I could test.
> 
> If you think this line of solution make sense I could update the
> existing macros, please just let me know.
> 
> Thanks
> Helen
> 
> ---
>   fs/ext4/ext4.h    | 21 +++++++++++++++++++++
>   fs/ext4/extents.c | 20 ++++++++++----------
>   fs/ext4/mballoc.c |  4 ++--
>   3 files changed, 33 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 7e8f66ba17f4..83fd6e53c003 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -315,6 +315,17 @@ struct ext4_io_submit {
>   #define EXT4_B2C(sbi, blk)	((blk) >> (sbi)->s_cluster_bits)
>   /* Translate a cluster number to a block number */
>   #define EXT4_C2B(sbi, cluster)	((cluster) << (sbi)->s_cluster_bits)
> +/*
> + * Translate a physical block number to a cluster number using mballoc's
> + * cluster definition, which accounts for s_first_data_block.  Use these
> + * when tracking partial->pclu so that the cluster numbers are consistent
> + * with what ext4_get_group_no_and_offset() (and thus ext4_free_blocks())
> + * expects.
> + */
> +#define EXT4_MB_B2C(sbi, nr) \
> +	(((nr) - le32_to_cpu((sbi)->s_es->s_first_data_block)) >> (sbi)->s_cluster_bits)
> +#define EXT4_MB_C2B(sbi, cluster) \
> +	(((cluster) << (sbi)->s_cluster_bits) + le32_to_cpu((sbi)->s_es->s_first_data_block))
>   /* Translate # of blks to # of clusters */
>   #define EXT4_NUM_B2C(sbi, blks)	(((blks) + (sbi)->s_cluster_ratio - 1) >> \
>   				 (sbi)->s_cluster_bits)
> @@ -331,6 +342,16 @@ struct ext4_io_submit {
>   				 ((ext4_fsblk_t) (s)->s_cluster_ratio - 1))
>   #define EXT4_LBLK_COFF(s, lblk) ((lblk) &				\
>   				 ((ext4_lblk_t) (s)->s_cluster_ratio - 1))
> +/*
> + * Cluster-offset macros that account for s_first_data_block, consistent
> + * with mballoc's cluster numbering.  Use EXT4_MB_PBLK_COFF when aligning a
> + * physical block number and EXT4_MB_LBLK_COFF when aligning a block count.
> + */
> +#define EXT4_MB_PBLK_COFF(sbi, pblk) \
> +	(((pblk) - le32_to_cpu((sbi)->s_es->s_first_data_block)) & \
> +	 ((ext4_fsblk_t)(sbi)->s_cluster_ratio - 1))
> +#define EXT4_MB_LBLK_COFF(sbi, lblk) \
> +	((lblk) & ((ext4_lblk_t)(sbi)->s_cluster_ratio - 1))
>   
>   /*
>    * Structure of a blocks group descriptor
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index 35703dce23a3..38d49f784c22 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -2493,13 +2493,13 @@ static int ext4_remove_blocks(handle_t *handle, struct inode *inode,
>   	last_pblk = ext4_ext_pblock(ex) + ee_len - 1;
>   
>   	if (partial->state != initial &&
> -	    partial->pclu != EXT4_B2C(sbi, last_pblk)) {
> +	    partial->pclu != EXT4_MB_B2C(sbi, last_pblk)) {
>   		if (partial->state == tofree) {
>   			flags = get_default_free_blocks_flags(inode);
>   			if (ext4_is_pending(inode, partial->lblk))
>   				flags |= EXT4_FREE_BLOCKS_RERESERVE_CLUSTER;
>   			ext4_free_blocks(handle, inode, NULL,
> -					 EXT4_C2B(sbi, partial->pclu),
> +					 EXT4_MB_C2B(sbi, partial->pclu),
>   					 sbi->s_cluster_ratio, flags);
>   			if (flags & EXT4_FREE_BLOCKS_RERESERVE_CLUSTER)
>   				ext4_rereserve_cluster(inode, partial->lblk);
> @@ -2545,7 +2545,7 @@ static int ext4_remove_blocks(handle_t *handle, struct inode *inode,
>   	ext4_free_blocks(handle, inode, NULL, pblk, num, flags);
>   
>   	/* reset the partial cluster if we've freed past it */
> -	if (partial->state != initial && partial->pclu != EXT4_B2C(sbi, pblk))
> +	if (partial->state != initial && partial->pclu != EXT4_MB_B2C(sbi, pblk))
>   		partial->state = initial;
>   
>   	/*
> @@ -2560,7 +2560,7 @@ static int ext4_remove_blocks(handle_t *handle, struct inode *inode,
>   	 */
>   	if (EXT4_LBLK_COFF(sbi, from) && num == ee_len) {
>   		if (partial->state == initial) {
> -			partial->pclu = EXT4_B2C(sbi, pblk);
> +			partial->pclu = EXT4_MB_B2C(sbi, pblk);
>   			partial->lblk = from;
>   			partial->state = tofree;
>   		}
> @@ -2651,7 +2651,7 @@ ext4_ext_rm_leaf(handle_t *handle, struct inode *inode,
>   			 */
>   			if (sbi->s_cluster_ratio > 1) {
>   				pblk = ext4_ext_pblock(ex);
> -				partial->pclu = EXT4_B2C(sbi, pblk);
> +				partial->pclu = EXT4_MB_B2C(sbi, pblk);
>   				partial->state = nofree;
>   			}
>   			ex--;
> @@ -2766,13 +2766,13 @@ ext4_ext_rm_leaf(handle_t *handle, struct inode *inode,
>   	 */
>   	if (partial->state == tofree && ex >= EXT_FIRST_EXTENT(eh)) {
>   		pblk = ext4_ext_pblock(ex) + ex_ee_len - 1;
> -		if (partial->pclu != EXT4_B2C(sbi, pblk)) {
> +		if (partial->pclu != EXT4_MB_B2C(sbi, pblk)) {
>   			int flags = get_default_free_blocks_flags(inode);
>   
>   			if (ext4_is_pending(inode, partial->lblk))
>   				flags |= EXT4_FREE_BLOCKS_RERESERVE_CLUSTER;
>   			ext4_free_blocks(handle, inode, NULL,
> -					 EXT4_C2B(sbi, partial->pclu),
> +					 EXT4_MB_C2B(sbi, partial->pclu),
>   					 sbi->s_cluster_ratio, flags);
>   			if (flags & EXT4_FREE_BLOCKS_RERESERVE_CLUSTER)
>   				ext4_rereserve_cluster(inode, partial->lblk);
> @@ -2886,7 +2886,7 @@ int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start,
>   			 */
>   			if (sbi->s_cluster_ratio > 1) {
>   				pblk = ext4_ext_pblock(ex) + end - ee_block + 1;
> -				partial.pclu = EXT4_B2C(sbi, pblk);
> +				partial.pclu = EXT4_MB_B2C(sbi, pblk);
>   				partial.state = nofree;
>   			}
>   
> @@ -2919,7 +2919,7 @@ int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start,
>   			if (err < 0)
>   				goto out;
>   			if (pblk) {
> -				partial.pclu = EXT4_B2C(sbi, pblk);
> +				partial.pclu = EXT4_MB_B2C(sbi, pblk);
>   				partial.state = nofree;
>   			}
>   		}
> @@ -3041,7 +3041,7 @@ int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start,
>   		if (ext4_is_pending(inode, partial.lblk))
>   			flags |= EXT4_FREE_BLOCKS_RERESERVE_CLUSTER;
>   		ext4_free_blocks(handle, inode, NULL,
> -				 EXT4_C2B(sbi, partial.pclu),
> +				 EXT4_MB_C2B(sbi, partial.pclu),
>   				 sbi->s_cluster_ratio, flags);
>   		if (flags & EXT4_FREE_BLOCKS_RERESERVE_CLUSTER)
>   			ext4_rereserve_cluster(inode, partial.lblk);
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index 9c7881a4ea75..5c7ad20e4e7a 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -6360,7 +6360,7 @@ void ext4_free_blocks(handle_t *handle, struct inode *inode,
>   	 * blocks at the beginning or the end unless we are explicitly
>   	 * requested to avoid doing so.
>   	 */
> -	overflow = EXT4_PBLK_COFF(sbi, block);
> +	overflow = EXT4_MB_PBLK_COFF(sbi, block);
>   	if (overflow) {
>   		if (flags & EXT4_FREE_BLOCKS_NOFREE_FIRST_CLUSTER) {
>   			overflow = sbi->s_cluster_ratio - overflow;
> @@ -6376,7 +6376,7 @@ void ext4_free_blocks(handle_t *handle, struct inode *inode,
>   		/* The range changed so it's no longer validated */
>   		flags &= ~EXT4_FREE_BLOCKS_VALIDATED;
>   	}
> -	overflow = EXT4_LBLK_COFF(sbi, count);
> +	overflow = EXT4_MB_LBLK_COFF(sbi, count);
>   	if (overflow) {
>   		if (flags & EXT4_FREE_BLOCKS_NOFREE_LAST_CLUSTER) {
>   			if (count > overflow)


  reply	other threads:[~2026-03-13 15:28 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-03-13 15:18 [PATCH] ext4: fix bigalloc cluster arithmetic when s_first_data_block != 0 Helen Koike
2026-03-13 15:28 ` Helen Koike [this message]
2026-03-14  1:29 ` Theodore Tso
2026-03-17 13:53   ` Helen Koike

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=2d3ae00d-17ec-46ca-933d-1008a049948f@igalia.com \
    --to=koike@igalia.com \
    --cc=adilger.kernel@dilger.ca \
    --cc=kernel-dev@igalia.com \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=syzbot+b73703b873a33d8eb8f6@syzkaller.appspotmail.com \
    --cc=tytso@mit.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox