public inbox for linux-fsdevel@vger.kernel.org
 help / color / mirror / Atom feed
From: Helen Koike <koike@igalia.com>
To: tytso@mit.edu, adilger.kernel@dilger.ca,
	linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, kernel-dev@igalia.com,
	koike@igalia.com,
	syzbot+b73703b873a33d8eb8f6@syzkaller.appspotmail.com
Subject: [PATCH] ext4: fix bigalloc cluster arithmetic when s_first_data_block != 0
Date: Fri, 13 Mar 2026 12:18:30 -0300	[thread overview]
Message-ID: <20260313151835.1248953-1-koike@igalia.com> (raw)

On filesystems with bigalloc enabled and s_first_data_block != 0,
mballoc defines clusters starting from s_first_data_block, so cluster
N covers blocks [s_first_data_block + N*s_cluster_ratio, ...).

EXT4_B2C/EXT4_C2B perform a plain bit-shift on the absolute block
number, ignoring s_first_data_block.

For instance, with cluster_ratio = 4 and s_first_data_block = 1:

This is how EXT4_B2C/EXT4_C2B view it:

  block:    0   1   2   3   4   5   6   7   8   9  10  11  12 ...
            |___________|   |___________|   |___________|
  cluster:       0               1               2

This is how mballoc views it:

  block:    0   1   2   3   4   5   6   7   8   9  10  11  12 ...
                |___________|   |___________|   |___________|
  cluster:           0               1               2

This causes the following issues:

1) In extents.c, partial->pclu is stored and compared using EXT4_B2C,
   then passed to ext4_free_blocks() via EXT4_C2B. The resulting
   block address is misaligned with mballoc's cluster boundaries,
   causing the wrong cluster to be freed.

2) In ext4_free_blocks(), EXT4_PBLK_COFF uses the same plain
   bit-mask, so the intra-cluster offset of the start block is
   computed incorrectly, misaligning the freed range.

Introduce four macros that subtract s_first_data_block before
operating, matching the coordinate system used by mballoc:

  EXT4_MB_B2C(sbi, blk)        block -> cluster number
  EXT4_MB_C2B(sbi, cluster)    cluster -> first block of cluster
  EXT4_MB_PBLK_COFF(sbi, blk)  intra-cluster offset of a block
  EXT4_MB_LBLK_COFF(sbi, n)    intra-cluster offset of a block count

Use EXT4_MB_B2C/EXT4_MB_C2B for all partial->pclu operations in
extents.c, and EXT4_MB_PBLK_COFF/EXT4_MB_LBLK_COFF in the alignment
prologue of ext4_free_blocks().

Regarding the issue reported by syzbot:

  Context: s_first_data_block=1, cluster size 16 and block 145 contains
  a extents leaf for inode 15.
  When an extent is removed (block 145 went from 7 to 6 valid extent
  entries), ext4_ext_rm_leaf() sees the cluster partial->pclu (which is
  10) to be freed.

  Note that EXT4_C2B(partial->pclu=10) is 160, and EXT4_B2C(145) is 9
  (so the extents leaf is in a different cluster). Finally
  ext4_free_blocks(..,block=160, count=16..) is called, which shouldn't
  be a problem (it should not free block 145).

  Except that in  ext4_free_blocks() (and later in ext4_mb_clear_bb()
  and mb_free_blocks()) , block 160 resolves to group 0 bit 9 count 1
  (which frees block 145 containing extents leaf!!!!), causing block
  145 to be reused by other inodes while inode 15 still thinks that
  block 145 contains extents metadata, resulting later in the UAF we see
  by syzbot.

  The issue doesn't reproduce with the current patch.

Test results:

> kvm-xfstests --kernel $BUILD_DIR/arch/x86/boot/bzImage smoke

ext4/4k: 7 tests, 1 skipped, 1113 seconds
  generic/475  Pass     185s
  generic/476  Pass     184s
  generic/521  Pass     182s
  generic/522  Pass     181s
  generic/642  Pass     185s
  generic/750  Pass     185s
  generic/751  Skipped  1s
Totals: 7 tests, 1 skipped, 0 failures, 0 errors, 1103s

> kvm-xfstests --kernel $BUILD_DIR/arch/x86/boot/bzImage -c 1k smoke

ext4/1k: 5 tests, 1 skipped, 755 seconds
  generic/521  Pass     183s
  generic/522  Pass     182s
  generic/642  Pass     186s
  generic/750  Pass     186s
  generic/751  Skipped  2s
Totals: 5 tests, 1 skipped, 0 failures, 0 errors, 739s

Signed-off-by: Helen Koike <koike@igalia.com>
Reported-by: syzbot+b73703b873a33d8eb8f6@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=b73703b873a33d8eb8f6

---

Hi all,

I'm sorry if the commit message is too verbose. I'm not a file system
developer, so I tried to be didactic for those like me.

I was also wondering if I should update the existing macros instead of
creating new ones, but since I'm not very familiar with this subsystem
and current discussions I decided to be safe and change only for the use
case I could test.

If you think this line of solution make sense I could update the
existing macros, please just let me know.

Thanks
Helen

---
 fs/ext4/ext4.h    | 21 +++++++++++++++++++++
 fs/ext4/extents.c | 20 ++++++++++----------
 fs/ext4/mballoc.c |  4 ++--
 3 files changed, 33 insertions(+), 12 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 7e8f66ba17f4..83fd6e53c003 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -315,6 +315,17 @@ struct ext4_io_submit {
 #define EXT4_B2C(sbi, blk)	((blk) >> (sbi)->s_cluster_bits)
 /* Translate a cluster number to a block number */
 #define EXT4_C2B(sbi, cluster)	((cluster) << (sbi)->s_cluster_bits)
+/*
+ * Translate a physical block number to a cluster number using mballoc's
+ * cluster definition, which accounts for s_first_data_block.  Use these
+ * when tracking partial->pclu so that the cluster numbers are consistent
+ * with what ext4_get_group_no_and_offset() (and thus ext4_free_blocks())
+ * expects.
+ */
+#define EXT4_MB_B2C(sbi, nr) \
+	(((nr) - le32_to_cpu((sbi)->s_es->s_first_data_block)) >> (sbi)->s_cluster_bits)
+#define EXT4_MB_C2B(sbi, cluster) \
+	(((cluster) << (sbi)->s_cluster_bits) + le32_to_cpu((sbi)->s_es->s_first_data_block))
 /* Translate # of blks to # of clusters */
 #define EXT4_NUM_B2C(sbi, blks)	(((blks) + (sbi)->s_cluster_ratio - 1) >> \
 				 (sbi)->s_cluster_bits)
@@ -331,6 +342,16 @@ struct ext4_io_submit {
 				 ((ext4_fsblk_t) (s)->s_cluster_ratio - 1))
 #define EXT4_LBLK_COFF(s, lblk) ((lblk) &				\
 				 ((ext4_lblk_t) (s)->s_cluster_ratio - 1))
+/*
+ * Cluster-offset macros that account for s_first_data_block, consistent
+ * with mballoc's cluster numbering.  Use EXT4_MB_PBLK_COFF when aligning a
+ * physical block number and EXT4_MB_LBLK_COFF when aligning a block count.
+ */
+#define EXT4_MB_PBLK_COFF(sbi, pblk) \
+	(((pblk) - le32_to_cpu((sbi)->s_es->s_first_data_block)) & \
+	 ((ext4_fsblk_t)(sbi)->s_cluster_ratio - 1))
+#define EXT4_MB_LBLK_COFF(sbi, lblk) \
+	((lblk) & ((ext4_lblk_t)(sbi)->s_cluster_ratio - 1))
 
 /*
  * Structure of a blocks group descriptor
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 35703dce23a3..38d49f784c22 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -2493,13 +2493,13 @@ static int ext4_remove_blocks(handle_t *handle, struct inode *inode,
 	last_pblk = ext4_ext_pblock(ex) + ee_len - 1;
 
 	if (partial->state != initial &&
-	    partial->pclu != EXT4_B2C(sbi, last_pblk)) {
+	    partial->pclu != EXT4_MB_B2C(sbi, last_pblk)) {
 		if (partial->state == tofree) {
 			flags = get_default_free_blocks_flags(inode);
 			if (ext4_is_pending(inode, partial->lblk))
 				flags |= EXT4_FREE_BLOCKS_RERESERVE_CLUSTER;
 			ext4_free_blocks(handle, inode, NULL,
-					 EXT4_C2B(sbi, partial->pclu),
+					 EXT4_MB_C2B(sbi, partial->pclu),
 					 sbi->s_cluster_ratio, flags);
 			if (flags & EXT4_FREE_BLOCKS_RERESERVE_CLUSTER)
 				ext4_rereserve_cluster(inode, partial->lblk);
@@ -2545,7 +2545,7 @@ static int ext4_remove_blocks(handle_t *handle, struct inode *inode,
 	ext4_free_blocks(handle, inode, NULL, pblk, num, flags);
 
 	/* reset the partial cluster if we've freed past it */
-	if (partial->state != initial && partial->pclu != EXT4_B2C(sbi, pblk))
+	if (partial->state != initial && partial->pclu != EXT4_MB_B2C(sbi, pblk))
 		partial->state = initial;
 
 	/*
@@ -2560,7 +2560,7 @@ static int ext4_remove_blocks(handle_t *handle, struct inode *inode,
 	 */
 	if (EXT4_LBLK_COFF(sbi, from) && num == ee_len) {
 		if (partial->state == initial) {
-			partial->pclu = EXT4_B2C(sbi, pblk);
+			partial->pclu = EXT4_MB_B2C(sbi, pblk);
 			partial->lblk = from;
 			partial->state = tofree;
 		}
@@ -2651,7 +2651,7 @@ ext4_ext_rm_leaf(handle_t *handle, struct inode *inode,
 			 */
 			if (sbi->s_cluster_ratio > 1) {
 				pblk = ext4_ext_pblock(ex);
-				partial->pclu = EXT4_B2C(sbi, pblk);
+				partial->pclu = EXT4_MB_B2C(sbi, pblk);
 				partial->state = nofree;
 			}
 			ex--;
@@ -2766,13 +2766,13 @@ ext4_ext_rm_leaf(handle_t *handle, struct inode *inode,
 	 */
 	if (partial->state == tofree && ex >= EXT_FIRST_EXTENT(eh)) {
 		pblk = ext4_ext_pblock(ex) + ex_ee_len - 1;
-		if (partial->pclu != EXT4_B2C(sbi, pblk)) {
+		if (partial->pclu != EXT4_MB_B2C(sbi, pblk)) {
 			int flags = get_default_free_blocks_flags(inode);
 
 			if (ext4_is_pending(inode, partial->lblk))
 				flags |= EXT4_FREE_BLOCKS_RERESERVE_CLUSTER;
 			ext4_free_blocks(handle, inode, NULL,
-					 EXT4_C2B(sbi, partial->pclu),
+					 EXT4_MB_C2B(sbi, partial->pclu),
 					 sbi->s_cluster_ratio, flags);
 			if (flags & EXT4_FREE_BLOCKS_RERESERVE_CLUSTER)
 				ext4_rereserve_cluster(inode, partial->lblk);
@@ -2886,7 +2886,7 @@ int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start,
 			 */
 			if (sbi->s_cluster_ratio > 1) {
 				pblk = ext4_ext_pblock(ex) + end - ee_block + 1;
-				partial.pclu = EXT4_B2C(sbi, pblk);
+				partial.pclu = EXT4_MB_B2C(sbi, pblk);
 				partial.state = nofree;
 			}
 
@@ -2919,7 +2919,7 @@ int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start,
 			if (err < 0)
 				goto out;
 			if (pblk) {
-				partial.pclu = EXT4_B2C(sbi, pblk);
+				partial.pclu = EXT4_MB_B2C(sbi, pblk);
 				partial.state = nofree;
 			}
 		}
@@ -3041,7 +3041,7 @@ int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start,
 		if (ext4_is_pending(inode, partial.lblk))
 			flags |= EXT4_FREE_BLOCKS_RERESERVE_CLUSTER;
 		ext4_free_blocks(handle, inode, NULL,
-				 EXT4_C2B(sbi, partial.pclu),
+				 EXT4_MB_C2B(sbi, partial.pclu),
 				 sbi->s_cluster_ratio, flags);
 		if (flags & EXT4_FREE_BLOCKS_RERESERVE_CLUSTER)
 			ext4_rereserve_cluster(inode, partial.lblk);
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 9c7881a4ea75..5c7ad20e4e7a 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -6360,7 +6360,7 @@ void ext4_free_blocks(handle_t *handle, struct inode *inode,
 	 * blocks at the beginning or the end unless we are explicitly
 	 * requested to avoid doing so.
 	 */
-	overflow = EXT4_PBLK_COFF(sbi, block);
+	overflow = EXT4_MB_PBLK_COFF(sbi, block);
 	if (overflow) {
 		if (flags & EXT4_FREE_BLOCKS_NOFREE_FIRST_CLUSTER) {
 			overflow = sbi->s_cluster_ratio - overflow;
@@ -6376,7 +6376,7 @@ void ext4_free_blocks(handle_t *handle, struct inode *inode,
 		/* The range changed so it's no longer validated */
 		flags &= ~EXT4_FREE_BLOCKS_VALIDATED;
 	}
-	overflow = EXT4_LBLK_COFF(sbi, count);
+	overflow = EXT4_MB_LBLK_COFF(sbi, count);
 	if (overflow) {
 		if (flags & EXT4_FREE_BLOCKS_NOFREE_LAST_CLUSTER) {
 			if (count > overflow)
-- 
2.53.0


             reply	other threads:[~2026-03-13 15:18 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-03-13 15:18 Helen Koike [this message]
2026-03-13 15:28 ` [PATCH] ext4: fix bigalloc cluster arithmetic when s_first_data_block != 0 Helen Koike
2026-03-14  1:29 ` Theodore Tso
2026-03-17 13:53   ` Helen Koike

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260313151835.1248953-1-koike@igalia.com \
    --to=koike@igalia.com \
    --cc=adilger.kernel@dilger.ca \
    --cc=kernel-dev@igalia.com \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=syzbot+b73703b873a33d8eb8f6@syzkaller.appspotmail.com \
    --cc=tytso@mit.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox