linux-ext4.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/2] Some more misc fsmap fixes
@ 2025-09-05  8:14 Ojaswin Mujoo
  2025-09-05  8:14 ` [PATCH 1/2] ext4: Correctly handle queries for metadata mappings Ojaswin Mujoo
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Ojaswin Mujoo @ 2025-09-05  8:14 UTC (permalink / raw)
  To: linux-ext4, Theodore Ts'o
  Cc: Ritesh Harjani, Zhang Yi, linux-kernel, Darrick J . Wong

These 2 patches mostly fixup a couple edge case inconsistent behavior
we were seeing in fsmap. Further, convert fsmap path to use block size
units wherever possible to avoid issues due to block to cluster to block
conversions like [1].

[1] https://lore.kernel.org/linux-ext4/e7472c8535c9c5ec10f425f495366864ea12c9da.1754377641.git.ojaswin@linux.ibm.com/

Tested with some custom edge cases (which I'll see how to consistently
convert to xfstests) as well as generic/365 on 4k, 64k and bigalloc

Ojaswin Mujoo (2):
  ext4: Correctly handle queries for metadata mappings
  fsmap: use blocksize units instead of cluster units

 fs/ext4/fsmap.c   | 56 +++++++++++++++++++++++++++--------------------
 fs/ext4/mballoc.c | 25 ++++++++++++++-------
 2 files changed, 49 insertions(+), 32 deletions(-)

-- 
2.49.0


^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH 1/2] ext4: Correctly handle queries for metadata mappings
  2025-09-05  8:14 [PATCH 0/2] Some more misc fsmap fixes Ojaswin Mujoo
@ 2025-09-05  8:14 ` Ojaswin Mujoo
  2025-09-05  8:14 ` [PATCH 2/2] fsmap: use blocksize units instead of cluster units Ojaswin Mujoo
  2025-09-26 21:47 ` [PATCH 0/2] Some more misc fsmap fixes Theodore Ts'o
  2 siblings, 0 replies; 6+ messages in thread
From: Ojaswin Mujoo @ 2025-09-05  8:14 UTC (permalink / raw)
  To: linux-ext4, Theodore Ts'o
  Cc: Ritesh Harjani, Zhang Yi, linux-kernel, Darrick J . Wong

Currently, our handling of metadata is _ambiguous_ in some scenarios,
that is, we end up returning unknown if the range only covers the
mapping partially.

For example, in the following case:

$ xfs_io -c fsmap -d

  0: 254:16 [0..7]: static fs metadata 8
  1: 254:16 [8..15]: special 102:1 8
  2: 254:16 [16..5127]: special 102:2 5112
  3: 254:16 [5128..5255]: special 102:3 128
  4: 254:16 [5256..5383]: special 102:4 128
  5: 254:16 [5384..70919]: inodes 65536
  6: 254:16 [70920..70967]: unknown 48
  ...

$ xfs_io -c fsmap -d 24 33

  0: 254:16 [24..39]: unknown 16  <--- incomplete reporting

$ xfs_io -c fsmap -d 24 33  (With patch)

    0: 254:16 [16..5127]: special 102:2 5112

This is because earlier in ext4_getfsmap_meta_helper, we end up ignoring
any extent that starts before our queried range, but overlaps it. While
the man page [1] is a bit ambiguous on this, this fix makes the output
make more sense since we are anyways returning an "unknown" extent. This
is also consistent to how XFS does it:

$ xfs_io -c fsmap -d

  ...
  6: 254:16 [104..127]: free space 24
  7: 254:16 [128..191]: inodes 64
  ...

$ xfs_io -c fsmap -d 137 150

  0: 254:16 [128..191]: inodes 64   <-- full extent returned

 [1] https://man7.org/linux/man-pages/man2/ioctl_getfsmap.2.html

Reported-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
---
 fs/ext4/fsmap.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/fs/ext4/fsmap.c b/fs/ext4/fsmap.c
index 91185c40f755..22fc333244ef 100644
--- a/fs/ext4/fsmap.c
+++ b/fs/ext4/fsmap.c
@@ -74,7 +74,8 @@ static int ext4_getfsmap_dev_compare(const void *p1, const void *p2)
 static bool ext4_getfsmap_rec_before_low_key(struct ext4_getfsmap_info *info,
 					     struct ext4_fsmap *rec)
 {
-	return rec->fmr_physical < info->gfi_low.fmr_physical;
+	return rec->fmr_physical + rec->fmr_length <=
+	       info->gfi_low.fmr_physical;
 }
 
 /*
@@ -200,15 +201,18 @@ static int ext4_getfsmap_meta_helper(struct super_block *sb,
 			  ext4_group_first_block_no(sb, agno));
 	fs_end = fs_start + EXT4_C2B(sbi, len);
 
-	/* Return relevant extents from the meta_list */
+	/*
+	 * Return relevant extents from the meta_list. We emit all extents that
+	 * partially/fully overlap with the query range
+	 */
 	list_for_each_entry_safe(p, tmp, &info->gfi_meta_list, fmr_list) {
-		if (p->fmr_physical < info->gfi_next_fsblk) {
+		if (p->fmr_physical + p->fmr_length <= info->gfi_next_fsblk) {
 			list_del(&p->fmr_list);
 			kfree(p);
 			continue;
 		}
-		if (p->fmr_physical <= fs_start ||
-		    p->fmr_physical + p->fmr_length <= fs_end) {
+		if (p->fmr_physical <= fs_end &&
+		    p->fmr_physical + p->fmr_length > fs_start) {
 			/* Emit the retained free extent record if present */
 			if (info->gfi_lastfree.fmr_owner) {
 				error = ext4_getfsmap_helper(sb, info,
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH 2/2] fsmap: use blocksize units instead of cluster units
  2025-09-05  8:14 [PATCH 0/2] Some more misc fsmap fixes Ojaswin Mujoo
  2025-09-05  8:14 ` [PATCH 1/2] ext4: Correctly handle queries for metadata mappings Ojaswin Mujoo
@ 2025-09-05  8:14 ` Ojaswin Mujoo
  2025-09-26 12:35   ` Theodore Ts'o
  2025-09-26 21:47 ` [PATCH 0/2] Some more misc fsmap fixes Theodore Ts'o
  2 siblings, 1 reply; 6+ messages in thread
From: Ojaswin Mujoo @ 2025-09-05  8:14 UTC (permalink / raw)
  To: linux-ext4, Theodore Ts'o
  Cc: Ritesh Harjani, Zhang Yi, linux-kernel, Darrick J . Wong

Currently, ext4's implementation of fsmap converts the ranges to cluster
units if bigalloc is enabled and then converts to block units whenever
needed. However, this back and forth conversion has known to cause
several edge case issues since even with bigalloc the metadata is still
in block size unit.

Hence, convert the code to use block size instead of cluster size.
Cluster size is only used when dealing with mballoc bitmap. This takes
care of the existing issues with the code, example, for a mapping as
follows:

xfs_io -c fsmap -d (bs=4096, clustersize=65536)

   0: 254:16 [0..7]: static fs metadata 8
   1: 254:16 [8..15]: special 102:1 8
   2: 254:16 [16..327]: special 102:2 312
   3: 254:16 [328..351]: special 102:3 24
   4: 254:16 [352..375]: special 102:4 24
   ...

xfs_io -c fsmap -d 6 16 (before this patch):

   0: 254:16 [0..7]: static fs metadata 8
   1: 254:16 [8..23]: unknown 16       <--- incorrect/ambiguous entry

xfs_io -c fsmap -d 6 16 (after this patch):

   0: 254:16 [0..7]: static fs metadata 8
   1: 254:16 [8..15]: special 102:1 8
   2: 254:16 [16..327]: special 102:2 312

Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
---
 fs/ext4/fsmap.c   | 42 +++++++++++++++++++++++-------------------
 fs/ext4/mballoc.c | 25 +++++++++++++++++--------
 2 files changed, 40 insertions(+), 27 deletions(-)

diff --git a/fs/ext4/fsmap.c b/fs/ext4/fsmap.c
index 22fc333244ef..02e5d501dda8 100644
--- a/fs/ext4/fsmap.c
+++ b/fs/ext4/fsmap.c
@@ -193,13 +193,11 @@ static int ext4_getfsmap_meta_helper(struct super_block *sb,
 	struct ext4_getfsmap_info *info = priv;
 	struct ext4_fsmap *p;
 	struct ext4_fsmap *tmp;
-	struct ext4_sb_info *sbi = EXT4_SB(sb);
 	ext4_fsblk_t fsb, fs_start, fs_end;
 	int error;
 
-	fs_start = fsb = (EXT4_C2B(sbi, start) +
-			  ext4_group_first_block_no(sb, agno));
-	fs_end = fs_start + EXT4_C2B(sbi, len);
+	fs_start = fsb =  start + ext4_group_first_block_no(sb, agno);
+	fs_end = fs_start + len;
 
 	/*
 	 * Return relevant extents from the meta_list. We emit all extents that
@@ -248,13 +246,12 @@ static int ext4_getfsmap_datadev_helper(struct super_block *sb,
 	struct ext4_getfsmap_info *info = priv;
 	struct ext4_fsmap *p;
 	struct ext4_fsmap *tmp;
-	struct ext4_sb_info *sbi = EXT4_SB(sb);
 	ext4_fsblk_t fsb;
 	ext4_fsblk_t fslen;
 	int error;
 
-	fsb = (EXT4_C2B(sbi, start) + ext4_group_first_block_no(sb, agno));
-	fslen = EXT4_C2B(sbi, len);
+	fsb = start + ext4_group_first_block_no(sb, agno);
+	fslen = len;
 
 	/* If the retained free extent record is set... */
 	if (info->gfi_lastfree.fmr_owner) {
@@ -531,13 +528,13 @@ static int ext4_getfsmap_datadev(struct super_block *sb,
 {
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
 	ext4_fsblk_t start_fsb;
-	ext4_fsblk_t end_fsb;
+	ext4_fsblk_t end_fsb, orig_end_fsb;
 	ext4_fsblk_t bofs;
 	ext4_fsblk_t eofs;
 	ext4_group_t start_ag;
 	ext4_group_t end_ag;
-	ext4_grpblk_t first_cluster;
-	ext4_grpblk_t last_cluster;
+	ext4_grpblk_t first_grpblk;
+	ext4_grpblk_t last_grpblk;
 	struct ext4_fsmap irec;
 	int error = 0;
 
@@ -553,10 +550,18 @@ static int ext4_getfsmap_datadev(struct super_block *sb,
 		return 0;
 	start_fsb = keys[0].fmr_physical;
 	end_fsb = keys[1].fmr_physical;
+	orig_end_fsb = end_fsb;
 
-	/* Determine first and last group to examine based on start and end */
-	ext4_get_group_no_and_offset(sb, start_fsb, &start_ag, &first_cluster);
-	ext4_get_group_no_and_offset(sb, end_fsb, &end_ag, &last_cluster);
+	/*
+	 * Determine first and last group to examine based on start and end.
+	 * NOTE: do_div() should take ext4_fsblk_t instead of ext4_group_t as
+	 * first argument else it will fail on 32bit archs.
+	 */
+	first_grpblk = do_div(start_fsb, EXT4_BLOCKS_PER_GROUP(sb));
+	last_grpblk = do_div(end_fsb, EXT4_BLOCKS_PER_GROUP(sb));
+
+	start_ag = start_fsb;
+	end_ag = end_fsb;
 
 	/*
 	 * Convert the fsmap low/high keys to bg based keys.  Initialize
@@ -564,7 +569,7 @@ static int ext4_getfsmap_datadev(struct super_block *sb,
 	 * of the bg.
 	 */
 	info->gfi_low = keys[0];
-	info->gfi_low.fmr_physical = EXT4_C2B(sbi, first_cluster);
+	info->gfi_low.fmr_physical = first_grpblk;
 	info->gfi_low.fmr_length = 0;
 
 	memset(&info->gfi_high, 0xFF, sizeof(info->gfi_high));
@@ -584,8 +589,7 @@ static int ext4_getfsmap_datadev(struct super_block *sb,
 		 */
 		if (info->gfi_agno == end_ag) {
 			info->gfi_high = keys[1];
-			info->gfi_high.fmr_physical = EXT4_C2B(sbi,
-					last_cluster);
+			info->gfi_high.fmr_physical = last_grpblk;
 			info->gfi_high.fmr_length = 0;
 		}
 
@@ -600,8 +604,8 @@ static int ext4_getfsmap_datadev(struct super_block *sb,
 				info->gfi_high.fmr_owner);
 
 		error = ext4_mballoc_query_range(sb, info->gfi_agno,
-				EXT4_B2C(sbi, info->gfi_low.fmr_physical),
-				EXT4_B2C(sbi, info->gfi_high.fmr_physical),
+				info->gfi_low.fmr_physical,
+				info->gfi_high.fmr_physical,
 				ext4_getfsmap_meta_helper,
 				ext4_getfsmap_datadev_helper, info);
 		if (error)
@@ -627,7 +631,7 @@ static int ext4_getfsmap_datadev(struct super_block *sb,
 	 * any allocated blocks at the end of the range.
 	 */
 	irec.fmr_device = 0;
-	irec.fmr_physical = end_fsb + 1;
+	irec.fmr_physical = orig_end_fsb + 1;
 	irec.fmr_length = 0;
 	irec.fmr_owner = EXT4_FMR_OWN_FREE;
 	irec.fmr_flags = 0;
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 5898d92ba19f..cf78776940dd 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -7117,6 +7117,8 @@ ext4_mballoc_query_range(
 	ext4_grpblk_t			start, next;
 	struct ext4_buddy		e4b;
 	int				error;
+	struct ext4_sb_info 		*sbi = EXT4_SB(sb);
+	ext4_grpblk_t 			start_c, end_c, next_c;
 
 	error = ext4_mb_load_buddy(sb, group, &e4b);
 	if (error)
@@ -7125,9 +7127,9 @@ ext4_mballoc_query_range(
 
 	ext4_lock_group(sb, group);
 
-	start = max(e4b.bd_info->bb_first_free, first);
-	if (end >= EXT4_CLUSTERS_PER_GROUP(sb))
-		end = EXT4_CLUSTERS_PER_GROUP(sb) - 1;
+	start = max(EXT4_C2B(sbi, e4b.bd_info->bb_first_free), first);
+	if (end >= EXT4_BLOCKS_PER_GROUP(sb))
+		end = EXT4_BLOCKS_PER_GROUP(sb) - 1;
 	if (meta_formatter && start != first) {
 		if (start > end)
 			start = end;
@@ -7138,19 +7140,26 @@ ext4_mballoc_query_range(
 			goto out_unload;
 		ext4_lock_group(sb, group);
 	}
-	while (start <= end) {
-		start = mb_find_next_zero_bit(bitmap, end + 1, start);
-		if (start > end)
+
+	start_c = EXT4_B2C(sbi, start);
+	end_c = EXT4_B2C(sbi, end);
+
+	while (start_c <= end_c) {
+		start_c = mb_find_next_zero_bit(bitmap, end_c + 1, start_c);
+		if (start_c > end_c)
 			break;
-		next = mb_find_next_bit(bitmap, end + 1, start);
+		next_c = mb_find_next_bit(bitmap, end_c + 1, start_c);
 
 		ext4_unlock_group(sb, group);
+
+		start = EXT4_C2B(sbi, start_c);
+		next = EXT4_C2B(sbi, next_c);
 		error = formatter(sb, group, start, next - start, priv);
 		if (error)
 			goto out_unload;
 		ext4_lock_group(sb, group);
 
-		start = next + 1;
+		start_c = next_c + 1;
 	}
 
 	ext4_unlock_group(sb, group);
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH 2/2] fsmap: use blocksize units instead of cluster units
  2025-09-05  8:14 ` [PATCH 2/2] fsmap: use blocksize units instead of cluster units Ojaswin Mujoo
@ 2025-09-26 12:35   ` Theodore Ts'o
  2025-10-02 17:42     ` Ojaswin Mujoo
  0 siblings, 1 reply; 6+ messages in thread
From: Theodore Ts'o @ 2025-09-26 12:35 UTC (permalink / raw)
  To: Ojaswin Mujoo
  Cc: linux-ext4, Ritesh Harjani, Zhang Yi, linux-kernel,
	Darrick J . Wong

On Fri, Sep 05, 2025 at 01:44:47PM +0530, Ojaswin Mujoo wrote:
> Currently, ext4's implementation of fsmap converts the ranges to cluster
> units if bigalloc is enabled and then converts to block units whenever
> needed. However, this back and forth conversion has known to cause
> several edge case issues since even with bigalloc the metadata is still
> in block size unit....

This commit causes ext4/028 to fail with a 1k blocksize.  The failure
happens after just under 45 minutes; before this commit, ext4/028
would complete after a second.

Do you have a preference regarding whether I just drop this commit, or
drop the whole series?  The previous patch looks fine to me and fixes
a real problem, so my plan is to keep the 1/2 commit and drop this
one.

Cheers,

						- Ted

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH 0/2] Some more misc fsmap fixes
  2025-09-05  8:14 [PATCH 0/2] Some more misc fsmap fixes Ojaswin Mujoo
  2025-09-05  8:14 ` [PATCH 1/2] ext4: Correctly handle queries for metadata mappings Ojaswin Mujoo
  2025-09-05  8:14 ` [PATCH 2/2] fsmap: use blocksize units instead of cluster units Ojaswin Mujoo
@ 2025-09-26 21:47 ` Theodore Ts'o
  2 siblings, 0 replies; 6+ messages in thread
From: Theodore Ts'o @ 2025-09-26 21:47 UTC (permalink / raw)
  To: linux-ext4, Ojaswin Mujoo
  Cc: Theodore Ts'o, Ritesh Harjani, Zhang Yi, linux-kernel,
	Darrick J . Wong


On Fri, 05 Sep 2025 13:44:45 +0530, Ojaswin Mujoo wrote:
> These 2 patches mostly fixup a couple edge case inconsistent behavior
> we were seeing in fsmap. Further, convert fsmap path to use block size
> units wherever possible to avoid issues due to block to cluster to block
> conversions like [1].
> 
> [1] https://lore.kernel.org/linux-ext4/e7472c8535c9c5ec10f425f495366864ea12c9da.1754377641.git.ojaswin@linux.ibm.com/
> 
> [...]

Applied, thanks!

[1/2] ext4: Correctly handle queries for metadata mappings
      commit: 46c22a8bb4cb03211da1100d7ee4a2005bf77c70

([2/2] fsmap: use blocksize units instead of cluster units was dropped
 because it introduced a regression.)

Best regards,
-- 
Theodore Ts'o <tytso@mit.edu>

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH 2/2] fsmap: use blocksize units instead of cluster units
  2025-09-26 12:35   ` Theodore Ts'o
@ 2025-10-02 17:42     ` Ojaswin Mujoo
  0 siblings, 0 replies; 6+ messages in thread
From: Ojaswin Mujoo @ 2025-10-02 17:42 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: linux-ext4, Ritesh Harjani, Zhang Yi, linux-kernel,
	Darrick J . Wong

On Fri, Sep 26, 2025 at 08:35:36AM -0400, Theodore Ts'o wrote:
> On Fri, Sep 05, 2025 at 01:44:47PM +0530, Ojaswin Mujoo wrote:
> > Currently, ext4's implementation of fsmap converts the ranges to cluster
> > units if bigalloc is enabled and then converts to block units whenever
> > needed. However, this back and forth conversion has known to cause
> > several edge case issues since even with bigalloc the metadata is still
> > in block size unit....
> 
> This commit causes ext4/028 to fail with a 1k blocksize.  The failure
> happens after just under 45 minutes; before this commit, ext4/028
> would complete after a second.
> 
> Do you have a preference regarding whether I just drop this commit, or
> drop the whole series?  The previous patch looks fine to me and fixes
> a real problem, so my plan is to keep the 1/2 commit and drop this
> one.

Hey Ted,

Thanks for pointing this out, I'll look into the failure. Sorry for the
late reply I've been on vacation this week. 

I'll get to it as soon as possible.

Regards,
ojaswin

> 
> Cheers,
> 
> 						- Ted

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2025-10-02 17:43 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-09-05  8:14 [PATCH 0/2] Some more misc fsmap fixes Ojaswin Mujoo
2025-09-05  8:14 ` [PATCH 1/2] ext4: Correctly handle queries for metadata mappings Ojaswin Mujoo
2025-09-05  8:14 ` [PATCH 2/2] fsmap: use blocksize units instead of cluster units Ojaswin Mujoo
2025-09-26 12:35   ` Theodore Ts'o
2025-10-02 17:42     ` Ojaswin Mujoo
2025-09-26 21:47 ` [PATCH 0/2] Some more misc fsmap fixes Theodore Ts'o

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).