[RFC 0/5] ext4: Implement support for extsize hints

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [RFC 0/5] ext4: Implement support for extsize hints
@ 2024-09-11  9:01 Ojaswin Mujoo
  2024-09-11  9:01 ` [RFC 1/5] ext4: add aligned allocation hint in mballoc Ojaswin Mujoo
                   ` (6 more replies)
  0 siblings, 7 replies; 13+ messages in thread
From: Ojaswin Mujoo @ 2024-09-11  9:01 UTC (permalink / raw)
  To: linux-ext4, Theodore Ts'o
  Cc: Ritesh Harjani, linux-kernel, Darrick J . Wong, linux-fsdevel,
	John Garry, dchinner

This patchset implements extsize hint feature for ext4. Posting this RFC to get
some early review comments on the design and implementation bits. This feature
is similar to what we have in XFS too with some differences.

extsize on ext4 is a hint to mballoc (multi-block allocator) and extent
handling layer to do aligned allocations. We use allocation criteria 0
(CR_POWER2_ALIGNED) for doing aligned power-of-2 allocations. With extsize hint
we try to align the logical start (m_lblk) and length(m_len) of the allocation
to be extsize aligned. CR_POWER2_ALIGNED criteria in mballoc automatically make
sure that we get the aligned physical start (m_pblk) as well. So in this way
extsize can make sure that lblk, len and pblk all are aligned for the allocated
extent w.r.t extsize.

Note that extsize feature is just a hinting mechanism to ext4 multi-block
allocator. That means that if we are unable to get an aligned allocation for
some reason, than we drop this flag and continue with unaligned allocation to
serve the request. However when we will add atomic/untorn writes support, then
we will enforce the aligned allocation and can return -ENOSPC if aligned
allocation was not successful.

Comparison with XFS extsize feature -
=====================================
1. extsize in XFS is a hint for aligning only the logical start and the lengh
   of the allocation v/s extsize on ext4 make sure the physical start of the
   extent gets aligned as well.

2. eof allocation on XFS trims the blocks allocated beyond eof with extsize
   hint. That means on XFS for eof allocations (with extsize hint) only logical
   start gets aligned. However extsize hint in ext4 for eof allocation is not
   supported in this version of the series.

3. XFS allows extsize to be set on file with no extents but delayed data.
   However, ext4 don't allow that for simplicity. The user is expected to set
   it on a file before changing it's i_size.

4. XFS allows non-power-of-2 values for extsize but ext4 does not, since we
   primarily would like to support atomic writes with extsize.

5. In ext4 we chose to store the extsize value in SYSTEM_XATTR rather than an
   inode field as it was simple and most flexible, since there might be more
   features like atomic/untorn writes coming in future.

6. In buffered-io path XFS switches to non-delalloc allocations for extsize hint.
   The same has been kept for EXT4 as well.

Some TODOs:
===========
1. EOF allocations support can be added and can be kept similar to XFS.

Rest of the design details can be found in the individual commit messages.

Thoughts and suggestions are welcome!

Ojaswin Mujoo (5):
  ext4: add aligned allocation hint in mballoc
  ext4: allow inode preallocation for aligned alloc
  ext4: Support for extsize hint using FS_IOC_FS(GET/SET)XATTR
  ext4: pass lblk and len explicitly to ext4_split_extent*()
  ext4: Add extsize hint support

 fs/ext4/ext4.h              |  12 +-
 fs/ext4/ext4_jbd2.h         |  15 ++
 fs/ext4/extents.c           | 224 ++++++++++++++----
 fs/ext4/inode.c             | 442 +++++++++++++++++++++++++++++++++---
 fs/ext4/ioctl.c             | 119 ++++++++++
 fs/ext4/mballoc.c           | 126 ++++++++--
 fs/ext4/super.c             |   1 +
 include/trace/events/ext4.h |   2 +
 8 files changed, 841 insertions(+), 100 deletions(-)

-- 
2.43.5

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [RFC 1/5] ext4: add aligned allocation hint in mballoc
  2024-09-11  9:01 [RFC 0/5] ext4: Implement support for extsize hints Ojaswin Mujoo
@ 2024-09-11  9:01 ` Ojaswin Mujoo
  2024-09-11  9:01 ` [RFC 2/5] ext4: allow inode preallocation for aligned alloc Ojaswin Mujoo
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 13+ messages in thread
From: Ojaswin Mujoo @ 2024-09-11  9:01 UTC (permalink / raw)
  To: linux-ext4, Theodore Ts'o
  Cc: Ritesh Harjani, linux-kernel, Darrick J . Wong, linux-fsdevel,
	John Garry, dchinner

Add support in mballoc for allocating blocks that are aligned
to a certain power-of-2 offset.

1. We define a new flag EXT4_MB_ALIGNED_HINT to indicate that we want
an aligned allocation. This is just a hint, mballoc tries its best to
provide aligned blocks but if it can't then it'll fallback to normal
allocation

2. The alignment is determined by the length of the allocation, for
example if we ask for 8192 bytes, then the alignment of physical blocks
will also be 8192 bytes aligned (ie 2 blocks aligned on 4k blocksize).

3. We dont yet support arbitrary alignment. For aligned writes, the
length/alignment must be power of 2 in blocks, ie for 4k blocksize we
can get 4k byte aligned, 8k byte aligned, 16k byte aligned ...
allocation but not 12k byte aligned.

4. We use CR_POWER2_ALIGNED criteria for aligned allocation which by
design allocates in an aligned manner. Since CR_POWER2_ALIGNED needs the
ac->ac_g_ex.fe_len to be power of 2, thats where the restriction in
point 3 above comes from. Since right now aligned allocation support is
added mainly for atomic writes use case, this restriction should be fine
since atomic write capable devices usually support only power of 2
alignments

5. For ease of review enabling inode preallocation support is done in
upcoming patches and is disabled in this patch.

Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
---
 fs/ext4/ext4.h              |  2 ++
 fs/ext4/mballoc.c           | 60 +++++++++++++++++++++++++++++++++----
 include/trace/events/ext4.h |  1 +
 3 files changed, 58 insertions(+), 5 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 8cc15d00e5c8..17964994a049 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -222,6 +222,8 @@ enum criteria {
 /* Avg fragment size rb tree lookup succeeded at least once for
  * CR_BEST_AVAIL_LEN */
 #define EXT4_MB_CR_BEST_AVAIL_LEN_OPTIMIZED		0x00020000
+/* The allocation must respect alignment requirements for physical blocks */
+#define EXT4_MB_HINT_ALIGNED	      	0x40000
 
 struct ext4_allocation_request {
 	/* target inode for block we're allocating */
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index d73e38323879..724905552f3b 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2177,8 +2177,11 @@ static void ext4_mb_use_best_found(struct ext4_allocation_context *ac,
 	 * user requested originally, we store allocated
 	 * space in a special descriptor.
 	 */
-	if (ac->ac_o_ex.fe_len < ac->ac_b_ex.fe_len)
+	if (ac->ac_o_ex.fe_len < ac->ac_b_ex.fe_len) {
+		/* Aligned allocation doesn't have preallocation support */
+		WARN_ON(ac->ac_flags & EXT4_MB_HINT_ALIGNED);
 		ext4_mb_new_preallocation(ac);
+	}
 
 }
 
@@ -2814,10 +2817,15 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 
 	BUG_ON(ac->ac_status == AC_STATUS_FOUND);
 
-	/* first, try the goal */
-	err = ext4_mb_find_by_goal(ac, &e4b);
-	if (err || ac->ac_status == AC_STATUS_FOUND)
-		goto out;
+	/*
+	 * first, try the goal. Skip trying goal for aligned allocations since
+	 * goal determination logic is not alignment aware (yet)
+	 */
+	if (!(ac->ac_flags & EXT4_MB_HINT_ALIGNED)) {
+		err = ext4_mb_find_by_goal(ac, &e4b);
+		if (err || ac->ac_status == AC_STATUS_FOUND)
+			goto out;
+	}
 
 	if (unlikely(ac->ac_flags & EXT4_MB_HINT_GOAL_ONLY))
 		goto out;
@@ -2858,9 +2866,22 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 	 */
 	if (ac->ac_2order)
 		cr = CR_POWER2_ALIGNED;
+	else
+		WARN_ON_ONCE(ac->ac_g_ex.fe_len > 1 &&
+			     ac->ac_flags & EXT4_MB_HINT_ALIGNED);
 repeat:
 	for (; cr < EXT4_MB_NUM_CRS && ac->ac_status == AC_STATUS_CONTINUE; cr++) {
 		ac->ac_criteria = cr;
+
+		if (ac->ac_criteria > CR_POWER2_ALIGNED &&
+		    ac->ac_flags & EXT4_MB_HINT_ALIGNED &&
+		    ac->ac_g_ex.fe_len > 1) {
+			ext4_warning_inode(
+				ac->ac_inode,
+				"Aligned allocation not possible, using unaligned allocation");
+			ac->ac_flags &= ~EXT4_MB_HINT_ALIGNED;
+		}
+
 		/*
 		 * searching for the right group start
 		 * from the goal value specified
@@ -2993,6 +3014,24 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 	if (!err && ac->ac_status != AC_STATUS_FOUND && first_err)
 		err = first_err;
 
+	if (ac->ac_flags & EXT4_MB_HINT_ALIGNED && ac->ac_status == AC_STATUS_FOUND) {
+		ext4_fsblk_t start = ext4_grp_offs_to_block(sb, &ac->ac_b_ex);
+		ext4_grpblk_t len = EXT4_C2B(sbi, ac->ac_b_ex.fe_len);
+
+		if (!len) {
+			ext4_warning_inode(ac->ac_inode,
+					   "Expected a non zero len extent");
+			ac->ac_status = AC_STATUS_BREAK;
+			goto exit;
+		}
+
+		WARN_ON_ONCE(!is_power_of_2(len));
+		WARN_ON_ONCE(start % len);
+		/* We don't support preallocation yet */
+		WARN_ON_ONCE(ac->ac_b_ex.fe_len != ac->ac_o_ex.fe_len);
+	}
+
+ exit:
 	mb_debug(sb, "Best len %d, origin len %d, ac_status %u, ac_flags 0x%x, cr %d ret %d\n",
 		 ac->ac_b_ex.fe_len, ac->ac_o_ex.fe_len, ac->ac_status,
 		 ac->ac_flags, cr, err);
@@ -4440,6 +4479,13 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac,
 	if (ac->ac_flags & EXT4_MB_HINT_NOPREALLOC)
 		return;
 
+	/*
+	 * caller may have strict alignment requirements. In this case, avoid
+	 * normalization since it is not alignment aware.
+	 */
+	if (ac->ac_flags & EXT4_MB_HINT_ALIGNED)
+		return;
+
 	if (ac->ac_flags & EXT4_MB_HINT_GROUP_ALLOC) {
 		ext4_mb_normalize_group_request(ac);
 		return ;
@@ -4794,6 +4840,10 @@ ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
 	if (!(ac->ac_flags & EXT4_MB_HINT_DATA))
 		return false;
 
+	/* using preallocated blocks is not alignment aware. */
+	if (ac->ac_flags & EXT4_MB_HINT_ALIGNED)
+		return false;
+
 	/*
 	 * first, try per-file preallocation by searching the inode pa rbtree.
 	 *
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index cc5e9b7b2b44..05441f87c5d2 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -36,6 +36,7 @@ struct partial_cluster;
 	{ EXT4_MB_STREAM_ALLOC,		"STREAM_ALLOC" },	\
 	{ EXT4_MB_USE_ROOT_BLOCKS,	"USE_ROOT_BLKS" },	\
 	{ EXT4_MB_USE_RESERVED,		"USE_RESV" },		\
+	{ EXT4_MB_HINT_ALIGNED,		"HINT_ALIGNED" }, \
 	{ EXT4_MB_STRICT_CHECK,		"STRICT_CHECK" })
 
 #define show_map_flags(flags) __print_flags(flags, "|",			\
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [RFC 2/5] ext4: allow inode preallocation for aligned alloc
  2024-09-11  9:01 [RFC 0/5] ext4: Implement support for extsize hints Ojaswin Mujoo
  2024-09-11  9:01 ` [RFC 1/5] ext4: add aligned allocation hint in mballoc Ojaswin Mujoo
@ 2024-09-11  9:01 ` Ojaswin Mujoo
  2024-09-11  9:01 ` [RFC 3/5] ext4: Support for extsize hint using FS_IOC_FS(GET/SET)XATTR Ojaswin Mujoo
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 13+ messages in thread
From: Ojaswin Mujoo @ 2024-09-11  9:01 UTC (permalink / raw)
  To: linux-ext4, Theodore Ts'o
  Cc: Ritesh Harjani, linux-kernel, Darrick J . Wong, linux-fsdevel,
	John Garry, dchinner

Enable inode preallocation support for aligned allocations. Inode
preallocation will only be used if the preallocated blocks are able to
satisfy the length and alignment requirements of the allocations, else
we disable preallocation for this particular allocation and proceed as
usual. Disabling inode preallocation is required otherwise we might end
up with overlapping preallocated ranges which can trigger a BUG() later.

Further, during normalizing, we usually try to round it up to a power of
2 which can still give us aligned allocation. We also make sure not
change the goal start so aligned allocation is more straightforward. If for
whatever reason the goal is not power of 2 or doesn't contain the original
request, then we throw a warning and proceed as normal.

For now, group preallocation is disabled for aligned allocations.

Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
---
 fs/ext4/mballoc.c | 96 +++++++++++++++++++++++++++++++----------------
 1 file changed, 63 insertions(+), 33 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 724905552f3b..23a553ad02fa 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2178,8 +2178,6 @@ static void ext4_mb_use_best_found(struct ext4_allocation_context *ac,
 	 * space in a special descriptor.
 	 */
 	if (ac->ac_o_ex.fe_len < ac->ac_b_ex.fe_len) {
-		/* Aligned allocation doesn't have preallocation support */
-		WARN_ON(ac->ac_flags & EXT4_MB_HINT_ALIGNED);
 		ext4_mb_new_preallocation(ac);
 	}
 
@@ -3027,8 +3025,7 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 
 		WARN_ON_ONCE(!is_power_of_2(len));
 		WARN_ON_ONCE(start % len);
-		/* We don't support preallocation yet */
-		WARN_ON_ONCE(ac->ac_b_ex.fe_len != ac->ac_o_ex.fe_len);
+		WARN_ON_ONCE(ac->ac_b_ex.fe_len < ac->ac_o_ex.fe_len);
 	}
 
  exit:
@@ -4479,13 +4476,6 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac,
 	if (ac->ac_flags & EXT4_MB_HINT_NOPREALLOC)
 		return;
 
-	/*
-	 * caller may have strict alignment requirements. In this case, avoid
-	 * normalization since it is not alignment aware.
-	 */
-	if (ac->ac_flags & EXT4_MB_HINT_ALIGNED)
-		return;
-
 	if (ac->ac_flags & EXT4_MB_HINT_GROUP_ALLOC) {
 		ext4_mb_normalize_group_request(ac);
 		return ;
@@ -4542,6 +4532,21 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac,
 		size	  = (loff_t) EXT4_C2B(sbi,
 					      ac->ac_o_ex.fe_len) << bsbits;
 	}
+
+	/*
+	 * For aligned allocations, we need to ensure 2 things:
+	 *
+	 * 1. The start should remain same as original start so that finding
+	 * aligned physical blocks for it is straight forward.
+	 *
+	 * 2. The new_size should not be less than the original len. This
+	 * can sometimes happen due to the way we predict size above.
+	 */
+	if (ac->ac_flags & EXT4_MB_HINT_ALIGNED) {
+		start_off = ac->ac_o_ex.fe_logical << bsbits;
+		size = max_t(loff_t, size,
+				 EXT4_C2B(sbi, ac->ac_o_ex.fe_len) << bsbits);
+	}
 	size = size >> bsbits;
 	start = start_off >> bsbits;
 
@@ -4792,32 +4797,46 @@ ext4_mb_check_group_pa(ext4_fsblk_t goal_block,
 }
 
 /*
- * check if found pa meets EXT4_MB_HINT_GOAL_ONLY
+ * check if found pa meets EXT4_MB_HINT_GOAL_ONLY or EXT4_MB_HINT_ALIGNED
  */
 static bool
-ext4_mb_pa_goal_check(struct ext4_allocation_context *ac,
+ext4_mb_pa_check(struct ext4_allocation_context *ac,
 		      struct ext4_prealloc_space *pa)
 {
 	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
 	ext4_fsblk_t start;
 
-	if (likely(!(ac->ac_flags & EXT4_MB_HINT_GOAL_ONLY)))
+	if (likely(!(ac->ac_flags & EXT4_MB_HINT_GOAL_ONLY ||
+		     ac->ac_flags & EXT4_MB_HINT_ALIGNED)))
 		return true;
 
-	/*
-	 * If EXT4_MB_HINT_GOAL_ONLY is set, ac_g_ex will not be adjusted
-	 * in ext4_mb_normalize_request and will keep same with ac_o_ex
-	 * from ext4_mb_initialize_context. Choose ac_g_ex here to keep
-	 * consistent with ext4_mb_find_by_goal.
-	 */
-	start = pa->pa_pstart +
-		(ac->ac_g_ex.fe_logical - pa->pa_lstart);
-	if (ext4_grp_offs_to_block(ac->ac_sb, &ac->ac_g_ex) != start)
-		return false;
+	if (ac->ac_flags & EXT4_MB_HINT_GOAL_ONLY) {
+		/*
+		 * If EXT4_MB_HINT_GOAL_ONLY is set, ac_g_ex will not be adjusted
+		 * in ext4_mb_normalize_request and will keep same with ac_o_ex
+		 * from ext4_mb_initialize_context. Choose ac_g_ex here to keep
+		 * consistent with ext4_mb_find_by_goal.
+		 */
+		start = pa->pa_pstart +
+			(ac->ac_g_ex.fe_logical - pa->pa_lstart);
+		if (ext4_grp_offs_to_block(ac->ac_sb, &ac->ac_g_ex) != start)
+			return false;
 
-	if (ac->ac_g_ex.fe_len > pa->pa_len -
-	    EXT4_B2C(sbi, ac->ac_g_ex.fe_logical - pa->pa_lstart))
-		return false;
+		if (ac->ac_g_ex.fe_len >
+		    pa->pa_len - EXT4_B2C(sbi, ac->ac_g_ex.fe_logical -
+						       pa->pa_lstart))
+			return false;
+	} else if (ac->ac_flags & EXT4_MB_HINT_ALIGNED) {
+		start = pa->pa_pstart +
+			(ac->ac_g_ex.fe_logical - pa->pa_lstart);
+		if (start % EXT4_C2B(sbi, ac->ac_g_ex.fe_len))
+			return false;
+
+		if (EXT4_C2B(sbi, ac->ac_g_ex.fe_len) >
+		    (EXT4_C2B(sbi, pa->pa_len) -
+		     (ac->ac_g_ex.fe_logical - pa->pa_lstart)))
+			return false;
+	}
 
 	return true;
 }
@@ -4840,10 +4859,6 @@ ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
 	if (!(ac->ac_flags & EXT4_MB_HINT_DATA))
 		return false;
 
-	/* using preallocated blocks is not alignment aware. */
-	if (ac->ac_flags & EXT4_MB_HINT_ALIGNED)
-		return false;
-
 	/*
 	 * first, try per-file preallocation by searching the inode pa rbtree.
 	 *
@@ -4949,7 +4964,7 @@ ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
 		goto try_group_pa;
 	}
 
-	if (tmp_pa->pa_free && likely(ext4_mb_pa_goal_check(ac, tmp_pa))) {
+	if (tmp_pa->pa_free && likely(ext4_mb_pa_check(ac, tmp_pa))) {
 		atomic_inc(&tmp_pa->pa_count);
 		ext4_mb_use_inode_pa(ac, tmp_pa);
 		spin_unlock(&tmp_pa->pa_lock);
@@ -4984,6 +4999,19 @@ ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
 		 * pa_free == 0.
 		 */
 		WARN_ON_ONCE(tmp_pa->pa_free == 0);
+
+		/*
+		 * If, for any reason, we reach here then we need to disable PA
+		 * because otherwise ext4_mb_normalize_request() will try to
+		 * allocate a new PA for this logical range where another PA
+		 * already exists. This is not allowed and will trigger BUG_ONs.
+		 * Hence, as a workaround we disable PA.
+		 *
+		 * NOTE: ideally we would want to have some logic to take care
+		 * of the unusable PA. Maybe a more fine grained discard logic
+		 * that could allow us to discard only specific PAs.
+		 */
+		ac->ac_flags |= EXT4_MB_HINT_NOPREALLOC;
 	}
 	spin_unlock(&tmp_pa->pa_lock);
 try_group_pa:
@@ -5790,6 +5818,7 @@ static void ext4_mb_group_or_file(struct ext4_allocation_context *ac)
 	int bsbits = ac->ac_sb->s_blocksize_bits;
 	loff_t size, isize;
 	bool inode_pa_eligible, group_pa_eligible;
+	bool is_aligned = (ac->ac_flags & EXT4_MB_HINT_ALIGNED);
 
 	if (!(ac->ac_flags & EXT4_MB_HINT_DATA))
 		return;
@@ -5797,7 +5826,8 @@ static void ext4_mb_group_or_file(struct ext4_allocation_context *ac)
 	if (unlikely(ac->ac_flags & EXT4_MB_HINT_GOAL_ONLY))
 		return;
 
-	group_pa_eligible = sbi->s_mb_group_prealloc > 0;
+	/* Aligned allocation does not support group pa */
+	group_pa_eligible = (!is_aligned && sbi->s_mb_group_prealloc > 0);
 	inode_pa_eligible = true;
 	size = extent_logical_end(sbi, &ac->ac_o_ex);
 	isize = (i_size_read(ac->ac_inode) + ac->ac_sb->s_blocksize - 1)
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [RFC 3/5] ext4: Support for extsize hint using FS_IOC_FS(GET/SET)XATTR
  2024-09-11  9:01 [RFC 0/5] ext4: Implement support for extsize hints Ojaswin Mujoo
  2024-09-11  9:01 ` [RFC 1/5] ext4: add aligned allocation hint in mballoc Ojaswin Mujoo
  2024-09-11  9:01 ` [RFC 2/5] ext4: allow inode preallocation for aligned alloc Ojaswin Mujoo
@ 2024-09-11  9:01 ` Ojaswin Mujoo
  2024-09-11  9:01 ` [RFC 4/5] ext4: pass lblk and len explicitly to ext4_split_extent*() Ojaswin Mujoo
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 13+ messages in thread
From: Ojaswin Mujoo @ 2024-09-11  9:01 UTC (permalink / raw)
  To: linux-ext4, Theodore Ts'o
  Cc: Ritesh Harjani, linux-kernel, Darrick J . Wong, linux-fsdevel,
	John Garry, dchinner

This patch adds support for getting and setting extsize hint using
FS_IOC_GETXATTR and FS_IOC_SETXATTR interface. The extsize is stored
in xattr of type EXT4_XATTR_INDEX_SYSTEM.

Restrictions on setting extsize:

1. extsize can't be set on files with data
2. extsize can't be set on non regular files
3. extsize hint can't be used with bigalloc (yet)
4. extsize (in blocks) should be power-of-2 for simplicity.
5. extsize must be a multiple of block size

The ioctl behavior has been kept as close to the XFS equivalent
as possible.

Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>

ext4: Some modifications to extsize ioctl (To be Squashed)

Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
---
 fs/ext4/ext4.h  |   6 +++
 fs/ext4/inode.c |  89 ++++++++++++++++++++++++++++++++++++
 fs/ext4/ioctl.c | 119 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/ext4/super.c |   1 +
 4 files changed, 215 insertions(+)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 17964994a049..d34e60cf6458 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1170,6 +1170,8 @@ struct ext4_inode_info {
 	__u32 i_csum_seed;
 
 	kprojid_t i_projid;
+	/* The extentsize hint for the inode in blocks */
+	ext4_grpblk_t i_extsize;
 };
 
 /*
@@ -3037,6 +3039,10 @@ extern void ext4_da_update_reserve_space(struct inode *inode,
 					int used, int quota_claim);
 extern int ext4_issue_zeroout(struct inode *inode, ext4_lblk_t lblk,
 			      ext4_fsblk_t pblk, ext4_lblk_t len);
+int ext4_inode_xattr_get_extsize(struct inode *inode);
+int ext4_inode_xattr_set_extsize(struct inode *inode, ext4_grpblk_t extsize);
+ext4_grpblk_t ext4_inode_get_extsize(struct ext4_inode_info *ei);
+void ext4_inode_set_extsize(struct ext4_inode_info *ei, ext4_grpblk_t extsize);
 
 /* indirect.c */
 extern int ext4_ind_map_blocks(handle_t *handle, struct inode *inode,
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 7475deef9793..898b41751cf4 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4881,6 +4881,20 @@ struct inode *__ext4_iget(struct super_block *sb, unsigned long ino,
 	INIT_LIST_HEAD(&ei->i_orphan);
 	ext4_fc_init_inode(&ei->vfs_inode);
 
+	ret = ext4_inode_xattr_get_extsize(&ei->vfs_inode);
+	if (ret >= 0) {
+		ei->i_extsize = ret;
+	} else if (ret == -ENODATA) {
+		/* extsize is not set */
+		ei->i_extsize = 0;
+	} else {
+		ext4_error_inode(
+			inode, function, line, 0,
+			"iget: error while retrieving extsize from xattr: %ld", ret);
+		ret = -EFSCORRUPTED;
+		goto bad_inode;
+	}
+
 	/*
 	 * Set transaction id's of transactions that have to be committed
 	 * to finish f[data]sync. We set them to currently running transaction
@@ -6216,3 +6230,78 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
 	ext4_journal_stop(handle);
 	goto out;
 }
+
+/*
+ * Returns positive extsize if set, 0 if not set else error
+ */
+ext4_grpblk_t ext4_inode_xattr_get_extsize(struct inode *inode)
+{
+	char *buf;
+	int size, ret = 0;
+	ext4_grpblk_t extsize = 0;
+
+	size = ext4_xattr_get(inode, EXT4_XATTR_INDEX_SYSTEM, "extsize", NULL, 0);
+
+	if (size == -ENODATA || size == 0) {
+		return 0;
+	} else if (size < 0) {
+		ret = size;
+		goto exit;
+	}
+
+	buf = (char *)kmalloc(size + 1, GFP_KERNEL);
+	if (!buf) {
+		ret = -ENOMEM;
+		goto exit;
+	}
+
+	size = ext4_xattr_get(inode, EXT4_XATTR_INDEX_SYSTEM, "extsize", buf,
+			      size);
+	if (size == -ENODATA)
+		/* No extsize is set */
+		extsize = 0;
+	else if (size < 0)
+		ret = size;
+	else {
+		buf[size] = '\0';
+		ret = kstrtoint(buf, 10, &extsize);
+	}
+
+	kfree(buf);
+exit:
+	if (ret)
+		return ret;
+	return extsize;
+}
+
+int ext4_inode_xattr_set_extsize(struct inode *inode, ext4_grpblk_t extsize)
+{
+	int err = 0;
+	/* max value of extsize should fit within 11 chars */
+	char extsize_str[11];
+
+	if ((err = snprintf(extsize_str, 10, "%u", extsize)) < 0) {
+		return err;
+	}
+
+	/* Try to replace the xattr if it exists, else try to create it */
+	err = ext4_xattr_set(inode, EXT4_XATTR_INDEX_SYSTEM, "extsize",
+			     extsize_str, strlen(extsize_str), XATTR_REPLACE);
+
+	if (err == -ENODATA)
+		err = ext4_xattr_set(inode, EXT4_XATTR_INDEX_SYSTEM, "extsize",
+				     extsize_str, strlen(extsize_str),
+				     XATTR_CREATE);
+
+	return err;
+}
+
+ext4_grpblk_t ext4_inode_get_extsize(struct ext4_inode_info *ei)
+{
+	return ei->i_extsize;
+}
+
+void ext4_inode_set_extsize(struct ext4_inode_info *ei, ext4_grpblk_t extsize)
+{
+	ei->i_extsize = extsize;
+}
diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index e8bf5972dd47..e456e71e6187 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -708,6 +708,90 @@ static int ext4_ioctl_setflags(struct inode *inode,
 	return err;
 }
 
+static u32 ext4_ioctl_getextsize(struct inode *inode)
+{
+	ext4_grpblk_t extsize;
+
+	extsize = ext4_inode_get_extsize(EXT4_I(inode));
+
+	return (u32) extsize << inode->i_blkbits;
+}
+
+
+static int ext4_ioctl_setextsize(struct inode *inode, u32 extsize, u32 xflags)
+{
+	int err;
+	ext4_grpblk_t extsize_blks = extsize >> inode->i_blkbits;
+	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
+	int blksize = 1 << inode->i_blkbits;
+	char *msg = NULL;
+
+	if (!S_ISREG(inode->i_mode)) {
+		msg = "Cannot set extsize on non regular file";
+		err = -EOPNOTSUPP;
+		goto error;
+	}
+
+	/* TODO: Can we just use i_size here? */
+	if (ext4_has_inline_data(inode) ||
+	    READ_ONCE(EXT4_I(inode)->i_disksize) ||
+	    EXT4_I(inode)->i_reserved_data_blocks) {
+		msg = "Cannot set extsize on file with data";
+		err = -EOPNOTSUPP;
+		goto error;
+	}
+
+	if (extsize % blksize) {
+		msg = "extsize must be multiple of blocksize";
+		err = -EINVAL;
+		goto error;
+	}
+
+	if (sbi->s_cluster_ratio > 1) {
+		msg = "Can't use extsize hint with bigalloc";
+		err = -EINVAL;
+		goto error;
+	}
+
+	if ((xflags & FS_XFLAG_EXTSIZE) && extsize == 0) {
+		msg = "fsx_extsize can't be 0 if FS_XFLAG_EXTSIZE is passed";
+		err = -EINVAL;
+		goto error;
+	}
+
+	if (extsize_blks > sbi->s_blocks_per_group) {
+		msg = "extsize cannot exceed number of bytes in block group";
+		err = -EINVAL;
+		goto error;
+	}
+
+	if (extsize && !is_power_of_2(extsize_blks)) {
+		msg = "extsize must be either power-of-2 in fs blocks or 0";
+		err = -EINVAL;
+		goto error;
+	}
+
+	if (!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) {
+		msg = "extsize can't be set on non-extent based files";
+		err = -EINVAL;
+		goto error;
+	}
+
+	/* update the extsize in inode xattr */
+	if ((err = ext4_inode_xattr_set_extsize(inode, extsize_blks)) < 0)
+		return err;
+
+	/* Update the new extsize in the in-core inode */
+	ext4_inode_set_extsize(EXT4_I(inode), extsize_blks);
+	return 0;
+
+error:
+	if (msg)
+		ext4_warning_inode(inode, "%s\n", msg);
+
+	return err;
+}
+
 #ifdef CONFIG_QUOTA
 static int ext4_ioctl_setproject(struct inode *inode, __u32 projid)
 {
@@ -985,6 +1069,7 @@ int ext4_fileattr_get(struct dentry *dentry, struct fileattr *fa)
 	struct inode *inode = d_inode(dentry);
 	struct ext4_inode_info *ei = EXT4_I(inode);
 	u32 flags = ei->i_flags & EXT4_FL_USER_VISIBLE;
+	u32 extsize=0;
 
 	if (S_ISREG(inode->i_mode))
 		flags &= ~FS_PROJINHERIT_FL;
@@ -993,6 +1078,13 @@ int ext4_fileattr_get(struct dentry *dentry, struct fileattr *fa)
 	if (ext4_has_feature_project(inode->i_sb))
 		fa->fsx_projid = from_kprojid(&init_user_ns, ei->i_projid);
 
+	extsize = ext4_ioctl_getextsize(inode);
+	/* Flag is only set if extsize is non zero */
+	if (extsize > 0) {
+		fa->fsx_extsize = extsize;
+		fa->fsx_xflags |= FS_XFLAG_EXTSIZE;
+	}
+
 	return 0;
 }
 
@@ -1022,6 +1114,33 @@ int ext4_fileattr_set(struct mnt_idmap *idmap,
 	if (err)
 		goto out;
 	err = ext4_ioctl_setproject(inode, fa->fsx_projid);
+	if (err)
+		goto out;
+
+	if (fa->fsx_xflags & FS_XFLAG_EXTSIZE) {
+		err = ext4_ioctl_setextsize(inode, fa->fsx_extsize,
+					    fa->fsx_xflags);
+		if (err)
+			goto out;
+	} else if (fa->fsx_extsize == 0) {
+		/*
+		 * Even when user explicitly passes extsize=0 the flag is cleared in
+		 * fileattr_set_prepare().
+		 */
+		if (ext4_inode_get_extsize(EXT4_I(inode)) != 0) {
+			err = ext4_ioctl_setextsize(inode, fa->fsx_extsize,
+						    fa->fsx_xflags);
+			if (err)
+				goto out;
+		}
+
+	} else {
+		/* Unexpected usage, reset extsize to 0 */
+		err = ext4_ioctl_setextsize(inode, 0, fa->fsx_xflags);
+		if (err)
+			goto out;
+		fa->fsx_xflags = 0;
+	}
 out:
 	return err;
 }
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 16a4ce704460..4e293a2bccd3 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1421,6 +1421,7 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
 	spin_lock_init(&ei->i_completed_io_lock);
 	ei->i_sync_tid = 0;
 	ei->i_datasync_tid = 0;
+	ei->i_extsize = 0;
 	atomic_set(&ei->i_unwritten, 0);
 	INIT_WORK(&ei->i_rsv_conversion_work, ext4_end_io_rsv_work);
 	ext4_fc_init_inode(&ei->vfs_inode);
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [RFC 4/5] ext4: pass lblk and len explicitly to ext4_split_extent*()
  2024-09-11  9:01 [RFC 0/5] ext4: Implement support for extsize hints Ojaswin Mujoo
                   ` (2 preceding siblings ...)
  2024-09-11  9:01 ` [RFC 3/5] ext4: Support for extsize hint using FS_IOC_FS(GET/SET)XATTR Ojaswin Mujoo
@ 2024-09-11  9:01 ` Ojaswin Mujoo
  2024-09-11  9:01 ` [RFC 5/5] ext4: Add extsize hint support Ojaswin Mujoo
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 13+ messages in thread
From: Ojaswin Mujoo @ 2024-09-11  9:01 UTC (permalink / raw)
  To: linux-ext4, Theodore Ts'o
  Cc: Ritesh Harjani, linux-kernel, Darrick J . Wong, linux-fsdevel,
	John Garry, dchinner

Since these functions only use the map to determine lblk and len of
the split, pass them explicitly. This is in preparation for making
them work with extent size hints cleanly.

No functional change in this patch.

Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
---
 fs/ext4/extents.c | 57 +++++++++++++++++++++++++----------------------
 1 file changed, 30 insertions(+), 27 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 34e25eee6521..94aeb5b47971 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -3347,7 +3347,8 @@ static struct ext4_ext_path *ext4_split_extent_at(handle_t *handle,
 static struct ext4_ext_path *ext4_split_extent(handle_t *handle,
 					       struct inode *inode,
 					       struct ext4_ext_path *path,
-					       struct ext4_map_blocks *map,
+					       ext4_lblk_t lblk,
+					       unsigned int len,
 					       int split_flag, int flags,
 					       unsigned int *allocated)
 {
@@ -3363,7 +3364,7 @@ static struct ext4_ext_path *ext4_split_extent(handle_t *handle,
 	ee_len = ext4_ext_get_actual_len(ex);
 	unwritten = ext4_ext_is_unwritten(ex);
 
-	if (map->m_lblk + map->m_len < ee_block + ee_len) {
+	if (lblk + len < ee_block + ee_len) {
 		split_flag1 = split_flag & EXT4_EXT_MAY_ZEROOUT;
 		flags1 = flags | EXT4_GET_BLOCKS_PRE_IO;
 		if (unwritten)
@@ -3372,28 +3373,28 @@ static struct ext4_ext_path *ext4_split_extent(handle_t *handle,
 		if (split_flag & EXT4_EXT_DATA_VALID2)
 			split_flag1 |= EXT4_EXT_DATA_VALID1;
 		path = ext4_split_extent_at(handle, inode, path,
-				map->m_lblk + map->m_len, split_flag1, flags1);
+				lblk + len, split_flag1, flags1);
 		if (IS_ERR(path))
 			return path;
 		/*
 		 * Update path is required because previous ext4_split_extent_at
 		 * may result in split of original leaf or extent zeroout.
 		 */
-		path = ext4_find_extent(inode, map->m_lblk, path, flags);
+		path = ext4_find_extent(inode, lblk, path, flags);
 		if (IS_ERR(path))
 			return path;
 		depth = ext_depth(inode);
 		ex = path[depth].p_ext;
 		if (!ex) {
 			EXT4_ERROR_INODE(inode, "unexpected hole at %lu",
-					(unsigned long) map->m_lblk);
+					(unsigned long) lblk);
 			ext4_free_ext_path(path);
 			return ERR_PTR(-EFSCORRUPTED);
 		}
 		unwritten = ext4_ext_is_unwritten(ex);
 	}
 
-	if (map->m_lblk >= ee_block) {
+	if (lblk >= ee_block) {
 		split_flag1 = split_flag & EXT4_EXT_DATA_VALID2;
 		if (unwritten) {
 			split_flag1 |= EXT4_EXT_MARK_UNWRIT1;
@@ -3401,16 +3402,16 @@ static struct ext4_ext_path *ext4_split_extent(handle_t *handle,
 						     EXT4_EXT_MARK_UNWRIT2);
 		}
 		path = ext4_split_extent_at(handle, inode, path,
-				map->m_lblk, split_flag1, flags);
+				lblk, split_flag1, flags);
 		if (IS_ERR(path))
 			return path;
 	}
 
 	if (allocated) {
-		if (map->m_lblk + map->m_len > ee_block + ee_len)
-			*allocated = ee_len - (map->m_lblk - ee_block);
+		if (lblk + len > ee_block + ee_len)
+			*allocated = ee_len - (lblk - ee_block);
 		else
-			*allocated = map->m_len;
+			*allocated = len;
 	}
 	ext4_ext_show_leaf(inode, path);
 	return path;
@@ -3658,8 +3659,8 @@ ext4_ext_convert_to_initialized(handle_t *handle, struct inode *inode,
 	}
 
 fallback:
-	path = ext4_split_extent(handle, inode, path, &split_map, split_flag,
-				 flags, NULL);
+	path = ext4_split_extent(handle, inode, path, split_map.m_lblk,
+				 split_map.m_len, split_flag, flags, NULL);
 	if (IS_ERR(path))
 		return path;
 out:
@@ -3699,11 +3700,11 @@ ext4_ext_convert_to_initialized(handle_t *handle, struct inode *inode,
  * allocated pointer. Return an extent path pointer on success, or an error
  * pointer on failure.
  */
-static struct ext4_ext_path *ext4_split_convert_extents(handle_t *handle,
-					struct inode *inode,
-					struct ext4_map_blocks *map,
-					struct ext4_ext_path *path,
-					int flags, unsigned int *allocated)
+static struct ext4_ext_path *
+ext4_split_convert_extents(handle_t *handle, struct inode *inode,
+			   ext4_lblk_t lblk, unsigned int len,
+			   struct ext4_ext_path *path, int flags,
+			   unsigned int *allocated)
 {
 	ext4_lblk_t eof_block;
 	ext4_lblk_t ee_block;
@@ -3712,12 +3713,12 @@ static struct ext4_ext_path *ext4_split_convert_extents(handle_t *handle,
 	int split_flag = 0, depth;
 
 	ext_debug(inode, "logical block %llu, max_blocks %u\n",
-		  (unsigned long long)map->m_lblk, map->m_len);
+		  (unsigned long long)lblk, len);
 
 	eof_block = (EXT4_I(inode)->i_disksize + inode->i_sb->s_blocksize - 1)
 			>> inode->i_sb->s_blocksize_bits;
-	if (eof_block < map->m_lblk + map->m_len)
-		eof_block = map->m_lblk + map->m_len;
+	if (eof_block < lblk + len)
+		eof_block = lblk + len;
 	/*
 	 * It is safe to convert extent to initialized via explicit
 	 * zeroout only if extent is fully inside i_size or new_size.
@@ -3737,8 +3738,8 @@ static struct ext4_ext_path *ext4_split_convert_extents(handle_t *handle,
 		split_flag |= (EXT4_EXT_MARK_UNWRIT2 | EXT4_EXT_DATA_VALID2);
 	}
 	flags |= EXT4_GET_BLOCKS_PRE_IO;
-	return ext4_split_extent(handle, inode, path, map, split_flag, flags,
-				 allocated);
+	return ext4_split_extent(handle, inode, path, lblk, len, split_flag,
+				 flags, allocated);
 }
 
 static struct ext4_ext_path *
@@ -3773,7 +3774,7 @@ ext4_convert_unwritten_extents_endio(handle_t *handle, struct inode *inode,
 			     inode->i_ino, (unsigned long long)ee_block, ee_len,
 			     (unsigned long long)map->m_lblk, map->m_len);
 #endif
-		path = ext4_split_convert_extents(handle, inode, map, path,
+		path = ext4_split_convert_extents(handle, inode, map->m_lblk, map->m_len, path,
 						EXT4_GET_BLOCKS_CONVERT, NULL);
 		if (IS_ERR(path))
 			return path;
@@ -3837,8 +3838,9 @@ convert_initialized_extent(handle_t *handle, struct inode *inode,
 		  (unsigned long long)ee_block, ee_len);
 
 	if (ee_block != map->m_lblk || ee_len > map->m_len) {
-		path = ext4_split_convert_extents(handle, inode, map, path,
-				EXT4_GET_BLOCKS_CONVERT_UNWRITTEN, NULL);
+		path = ext4_split_convert_extents(
+			handle, inode, map->m_lblk, map->m_len, path,
+			EXT4_GET_BLOCKS_CONVERT_UNWRITTEN, NULL);
 		if (IS_ERR(path))
 			return path;
 
@@ -3909,8 +3911,9 @@ ext4_ext_handle_unwritten_extents(handle_t *handle, struct inode *inode,
 
 	/* get_block() before submitting IO, split the extent */
 	if (flags & EXT4_GET_BLOCKS_PRE_IO) {
-		path = ext4_split_convert_extents(handle, inode, map, path,
-				flags | EXT4_GET_BLOCKS_CONVERT, allocated);
+		path = ext4_split_convert_extents(
+			handle, inode, map->m_lblk, map->m_len, path,
+			flags | EXT4_GET_BLOCKS_CONVERT, allocated);
 		if (IS_ERR(path))
 			return path;
 		/*
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [RFC 5/5] ext4: Add extsize hint support
  2024-09-11  9:01 [RFC 0/5] ext4: Implement support for extsize hints Ojaswin Mujoo
                   ` (3 preceding siblings ...)
  2024-09-11  9:01 ` [RFC 4/5] ext4: pass lblk and len explicitly to ext4_split_extent*() Ojaswin Mujoo
@ 2024-09-11  9:01 ` Ojaswin Mujoo
  2024-09-13 10:06 ` [RFC 0/5] ext4: Implement support for extsize hints John Garry
  2024-09-18  9:54 ` Dave Chinner
  6 siblings, 0 replies; 13+ messages in thread
From: Ojaswin Mujoo @ 2024-09-11  9:01 UTC (permalink / raw)
  To: linux-ext4, Theodore Ts'o
  Cc: Ritesh Harjani, linux-kernel, Darrick J . Wong, linux-fsdevel,
	John Garry, dchinner

Now that the ioctl is in place, add the underlying infrastructure
to support extent size hints.

** MOTIVATION **

1. This feature allows us to ask the allocator for blocks that are
logically AS WELL AS physically aligned to an extent size hint (aka
extsize), that is generally a power of 2.

2. This means both start and the length of the physical and logical range
should be aligned to the extsize.

3. This sets up the infra we'll eventually need for supporting
  non-torn/atomic writes that need to follow a certain alignment as
  required by hardware.

4. This can also be extent to other use cases like stripe alignment

** DESIGN NOTES **

* Physical Alignment *

1. Since the extsize is always a power-of-2 (for now) in fs blocks, we
leverage CR_POWER2_ALIGNED allocation to get the blocks. This ensures the blocks
are physically aligned

2. Since this is just a hint, incase we are not able to get any aligned
blocks we simply drop back to non aligned allocation.

* Logical Alignment *

The flow of extsize aligned allocation with bufferred and
direct IO:

            +--------------------------------------------------------+
            |                     Buffered IO                        |
            +--------------------------------------------------------+
            |    ext4_map_blocks() call with extsize allocation      |
            +--------------------------------------------------------+
                                       |
                 +--------------------------------------------+
                 |  Adjust lblk and len to align to extsize   |
                 +--------------------------------------------+
                                       |
            +--------------------------------------------------------+
            |Pre-existing written/unwritten blocks in extsize range? |
            +--------------------------+-----------------------------+
                   YES                                      NO
                    |                                        |
    +---------------v---------------+      +-----------------v-------------------+
    |     Covers orig range?        |      |      Allocate extsize range        |
    +---------------+---------------+      +---------------------+--------------+
         |                    |                              |
        YES                  NO                              |
         |                    |              +---------------v------------------+
+--------v-------+    +-------v---------+    |   Mark allocated extent as       |
| Return blocks  |    | Fallback to     |    |           unwritten              |
+----------------+    | non-extsize     |    +----------------+-----------------+
                      | allocation      |                     |
                      +-----------------+    +----------------v-----------------+
                                             |   Insert extsize extent          |
                                             |         into tree                |
                                             +----------------+-----------------+
                                                              |
                                             +----------------v-----------------+
                                             |     Return allocated blocks      |
                                             +----------------------------------+

                   +--------------------------------------------+
                   |           During writeback:                |
                   +--------------------------------------------+
                   |  Use PRE_IO to split only the dirty extent |
                   +--------------------------------------------+

                   +--------------------------------------------+
                   |              After IO:                     |
                   +--------------------------------------------+
                   |    Convert the extent under IO to written  |
                   +--------------------------------------------+

Same flow for direct IO:

+----------------------------------------------------------------------+
|                              Direct IO                               |
+----------------------------------------------------------------------+
|   ext4_map_blocks() called with extsize allocation and PRE-IO        |
+----------------------------------------------------------------------+
                                   |
+----------------------------------------------------------------------+
|         Adjust lblk and len to align to extsize                      |
+----------------------------------------------------------------------+
                                   |
+----------------------------------------------------------------------+
|        Pre-existing written blocks in extsize range?                 |
+----------------------------------+-----------------------------------+
             YES                                       NO
              |                                         |
    +---------v----------+           +------------------v-----------------+
    | Covers orig range? |           | Unwritten blocks in extsize range? |
    +---------+----------+           +------------------+-----------------+
        |            |                        |                    |
       YES          NO                       YES                  NO
        |            |                        |                    |
+-------v----+ +-----v--------+    +----------v----------+ +-------v----------+
|   Return   | | Fallback to  |    | Call ext4_ext_map_  | | Allocate extsize |
|   blocks   | | non-extsize  |    | blocks() ->ext4_ext | |      range       |
+------------+ | allocation   |    | _handle_unwritten_  | +-------+----------+
               +--------------+    |     extents()       |         |
                                   +----------+----------+ +-------v----------+
                                              |            |  Mark complete   |
                                   +----------v----------+ | range unwritten  |
                                   | Split orig range    | |  & insert in     |
                                   | from bigger         | |      tree        |
                                   | unwritten extent    | +-------+----------+
                                   +----------+----------+         |
                                              |            +-------v----------+
                                   +----------v----------+ | Split orig range |
                                   | Mark split extent   | | from bigger      |
                                   |    as unwritten     | | allocated extent |
                                   +----------+----------+ +-------+----------+
                                              |                    |
                                   +----------v----------+ +-------v----------+
                                   | Return split extent | | Mark split extent|
                                   |      to user        | |   as unwritten   |
                                   +---------------------+ +-------+----------+
                                                                   |
                                                           +-------v----------+
                                                           | Return split     |
                                                           | extent to user   |
                                                           +------------------+

              +--------------------------------------------+
              |              After IO:                     |
              +--------------------------------------------+
              |    Convert the extent under IO to written  |
              +--------------------------------------------+

** IMPLEMENTATION NOTES **

* Callers of ext4_map_blocks work under the assumption that
  ext4_map_blocks will always only return as much as requested or less
  but now we might end up allocating more so make changes to
  ext4_map_blocks to make sure we adjust the allocated map to only
  return as much as user requested.

* Further, we now maintain 2 maps in ext4_map_blocks - the original map
  and the extsize map that is used when extsize hint allocation is taking
  place. We also pass these 2 maps down because some functions might now
  need information of the original map as well as the extsize map.
  * For example, when we go for direct IO and there's a hole in the orig
    range requested, we allocate based on extsize range and then split the
    bigger unwritten extent onto smaller unwritten extents based on orig
    range. (Needed so we dont have to split after IO). For this, we need
    the information of extsize range as well as orig range hence 2 maps.

* Since now we allocate more than the user requested, to avoid stale
  data exposure, we mark the bigger extsize extent as unwritten and then
  use the similar flow of dioread_nolock to only mark the extent under
  write as written.

* We disable extsize hints when writes are beyond EOF.

* When extsize is set on an inode, we drop to no delalloc allocations
  similar to XFS.

Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
---
 fs/ext4/ext4.h              |   4 +-
 fs/ext4/ext4_jbd2.h         |  15 ++
 fs/ext4/extents.c           | 169 +++++++++++++++--
 fs/ext4/inode.c             | 353 ++++++++++++++++++++++++++++++++----
 include/trace/events/ext4.h |   1 +
 5 files changed, 491 insertions(+), 51 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index d34e60cf6458..0782f4268c86 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -722,6 +722,7 @@ enum {
 #define EXT4_GET_BLOCKS_IO_SUBMIT		0x0400
 	/* Caller is in the atomic contex, find extent if it has been cached */
 #define EXT4_GET_BLOCKS_CACHED_NOWAIT		0x0800
+#define EXT4_GET_BLOCKS_EXTSIZE			0x1000
 
 /*
  * The bit position of these flags must not overlap with any of the
@@ -3702,7 +3703,8 @@ struct ext4_extent;
 extern void ext4_ext_tree_init(handle_t *handle, struct inode *inode);
 extern int ext4_ext_index_trans_blocks(struct inode *inode, int extents);
 extern int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
-			       struct ext4_map_blocks *map, int flags);
+			       struct ext4_map_blocks *orig_map,
+			       struct ext4_map_blocks *extsize_map, int flags);
 extern int ext4_ext_truncate(handle_t *, struct inode *);
 extern int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start,
 				 ext4_lblk_t end);
diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index 0c77697d5e90..20bbf76c0556 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -513,4 +513,19 @@ static inline int ext4_should_dioread_nolock(struct inode *inode)
 	return 1;
 }
 
+static inline int ext4_should_use_extsize(struct inode *inode)
+{
+	if (!S_ISREG(inode->i_mode))
+		return 0;
+	if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)))
+		return 0;
+	return (ext4_inode_get_extsize(EXT4_I(inode)) > 0);
+}
+
+static inline int ext4_should_use_unwrit_extents(struct inode *inode)
+{
+	return (ext4_should_dioread_nolock(inode) ||
+		ext4_should_use_extsize(inode));
+}
+
 #endif	/* _EXT4_JBD2_H */
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 94aeb5b47971..8a37a1ddfc23 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -3889,15 +3889,24 @@ convert_initialized_extent(handle_t *handle, struct inode *inode,
 
 static struct ext4_ext_path *
 ext4_ext_handle_unwritten_extents(handle_t *handle, struct inode *inode,
-			struct ext4_map_blocks *map,
+			struct ext4_map_blocks *orig_map,
+			struct ext4_map_blocks *extsize_map,
 			struct ext4_ext_path *path, int flags,
 			unsigned int *allocated, ext4_fsblk_t newblock)
 {
+	struct ext4_map_blocks *map;
 	int err = 0;
 
-	ext_debug(inode, "logical block %llu, max_blocks %u, flags 0x%x, allocated %u\n",
-		  (unsigned long long)map->m_lblk, map->m_len, flags,
-		  *allocated);
+	if (flags & EXT4_GET_BLOCKS_EXTSIZE) {
+		BUG_ON(extsize_map == NULL);
+		map = extsize_map;
+	} else
+		map = orig_map;
+
+	ext_debug(
+		inode,
+		"logical block %llu, max_blocks %u, flags 0x%x, allocated %u\n",
+		(unsigned long long)map->m_lblk, map->m_len, flags, *allocated);
 	ext4_ext_show_leaf(inode, path);
 
 	/*
@@ -3906,13 +3915,14 @@ ext4_ext_handle_unwritten_extents(handle_t *handle, struct inode *inode,
 	 */
 	flags |= EXT4_GET_BLOCKS_METADATA_NOFAIL;
 
-	trace_ext4_ext_handle_unwritten_extents(inode, map, flags,
-						*allocated, newblock);
+	trace_ext4_ext_handle_unwritten_extents(inode, map, flags, *allocated,
+						newblock);
 
 	/* get_block() before submitting IO, split the extent */
 	if (flags & EXT4_GET_BLOCKS_PRE_IO) {
+		/* Split should always happen based on original mapping */
 		path = ext4_split_convert_extents(
-			handle, inode, map->m_lblk, map->m_len, path,
+			handle, inode, orig_map->m_lblk, orig_map->m_len, path,
 			flags | EXT4_GET_BLOCKS_CONVERT, allocated);
 		if (IS_ERR(path))
 			return path;
@@ -3927,11 +3937,19 @@ ext4_ext_handle_unwritten_extents(handle_t *handle, struct inode *inode,
 			err = -EFSCORRUPTED;
 			goto errout;
 		}
+
+		/*
+		 * For extsize case we need to adjust lblk to start of split
+		 * extent because the m_len will be set to len of split extent.
+		 * No change for non extsize case
+		 */
+		map->m_lblk = orig_map->m_lblk;
 		map->m_flags |= EXT4_MAP_UNWRITTEN;
 		goto out;
 	}
 	/* IO end_io complete, convert the filled extent to written */
 	if (flags & EXT4_GET_BLOCKS_CONVERT) {
+		BUG_ON(map == extsize_map);
 		path = ext4_convert_unwritten_extents_endio(handle, inode,
 							    map, path);
 		if (IS_ERR(path))
@@ -4189,7 +4207,8 @@ static ext4_lblk_t ext4_ext_determine_insert_hole(struct inode *inode,
  * return < 0, error case.
  */
 int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
-			struct ext4_map_blocks *map, int flags)
+			struct ext4_map_blocks *orig_map,
+			struct ext4_map_blocks *extsize_map, int flags)
 {
 	struct ext4_ext_path *path = NULL;
 	struct ext4_extent newex, *ex, ex2;
@@ -4200,6 +4219,17 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
 	unsigned int allocated_clusters = 0;
 	struct ext4_allocation_request ar;
 	ext4_lblk_t cluster_offset;
+	struct ext4_map_blocks *map;
+#ifdef CONFIG_EXT4_DEBUG
+	struct ext4_ext_path *test_path = NULL;
+#endif
+
+	if (flags & EXT4_GET_BLOCKS_EXTSIZE) {
+		BUG_ON(extsize_map == NULL);
+		map = extsize_map;
+	} else
+		map = orig_map;
+
 
 	ext_debug(inode, "blocks %u/%u requested\n", map->m_lblk, map->m_len);
 	trace_ext4_ext_map_blocks_enter(inode, map->m_lblk, map->m_len, flags);
@@ -4256,6 +4286,7 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
 			 */
 			if ((!ext4_ext_is_unwritten(ex)) &&
 			    (flags & EXT4_GET_BLOCKS_CONVERT_UNWRITTEN)) {
+				BUG_ON(map == extsize_map);
 				path = convert_initialized_extent(handle,
 					inode, map, path, &allocated);
 				if (IS_ERR(path))
@@ -4272,8 +4303,8 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
 			}
 
 			path = ext4_ext_handle_unwritten_extents(
-				handle, inode, map, path, flags,
-				&allocated, newblock);
+				handle, inode, orig_map, extsize_map, path,
+				flags, &allocated, newblock);
 			if (IS_ERR(path))
 				err = PTR_ERR(path);
 			goto out;
@@ -4306,6 +4337,7 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
 	 */
 	if (cluster_offset && ex &&
 	    get_implied_cluster_alloc(inode->i_sb, map, ex, path)) {
+		BUG_ON(map == extsize_map);
 		ar.len = allocated = map->m_len;
 		newblock = map->m_pblk;
 		goto got_allocated_blocks;
@@ -4325,6 +4357,7 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
 	 * cluster we can use. */
 	if ((sbi->s_cluster_ratio > 1) && err &&
 	    get_implied_cluster_alloc(inode->i_sb, map, &ex2, path)) {
+		BUG_ON(map == extsize_map);
 		ar.len = allocated = map->m_len;
 		newblock = map->m_pblk;
 		err = 0;
@@ -4379,6 +4412,8 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
 		ar.flags |= EXT4_MB_DELALLOC_RESERVED;
 	if (flags & EXT4_GET_BLOCKS_METADATA_NOFAIL)
 		ar.flags |= EXT4_MB_USE_RESERVED;
+	if (flags & EXT4_GET_BLOCKS_EXTSIZE)
+		ar.flags |= EXT4_MB_HINT_ALIGNED;
 	newblock = ext4_mb_new_blocks(handle, &ar, &err);
 	if (!newblock)
 		goto out;
@@ -4400,9 +4435,114 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
 		map->m_flags |= EXT4_MAP_UNWRITTEN;
 	}
 
-	path = ext4_ext_insert_extent(handle, inode, path, &newex, flags);
-	if (IS_ERR(path)) {
-		err = PTR_ERR(path);
+	if ((flags & EXT4_GET_BLOCKS_EXTSIZE) &&
+	    (flags & EXT4_GET_BLOCKS_PRE_IO)) {
+		/*
+		 * With EXTSIZE and PRE-IO (direct io case) we have to be careful
+		 * because we want to insert the complete extent but split only the
+		 * originally requested range.
+		 *
+		 * Below are the different (S)cenarios and the (A)ction we take:
+		 *
+		 * S1: New extent covers the original range completely/partially.
+		 * A1: Insert new extent, allow merges. Then split the original
+		 * range from this. Adjust the length of split if new extent only
+		 * partially covers original.
+		 *
+		 * S2: New extent doesn't cover original range at all
+		 * A2: Just insert this range and return. Rest is handled in
+		 * ext4_map_blocks()
+		 * NOTE: We can handle this as an error with EAGAIN in future.
+		 */
+		ext4_lblk_t newex_lblk = le32_to_cpu(newex.ee_block);
+		loff_t newex_len = ext4_ext_get_actual_len(&newex);
+
+		if (in_range(orig_map->m_lblk, newex_lblk, newex_len)) {
+			/* S1 */
+			loff_t split_len = 0;
+
+			BUG_ON(!ext4_ext_is_unwritten(&newex));
+
+			if (newex_lblk + newex_len >=
+			    orig_map->m_lblk + (loff_t)orig_map->m_len)
+				split_len = orig_map->m_len;
+			else
+				split_len = newex_len -
+					    (orig_map->m_lblk - newex_lblk);
+
+			path = ext4_ext_insert_extent(
+				handle, inode, path, &newex,
+				(flags & ~EXT4_GET_BLOCKS_PRE_IO));
+			if (IS_ERR(path)) {
+				err = PTR_ERR(path);
+				goto insert_error;
+			}
+
+			/*
+			 * Update path before split
+			 * NOTE: This might no longer be needed with recent
+			 * changes in ext4_ext_insert_extent()
+			 */
+			path = ext4_find_extent(inode, orig_map->m_lblk, path, 0);
+			if (IS_ERR(path)) {
+				err = PTR_ERR(path);
+				goto insert_error;
+			}
+
+			/*
+			 * GET_BLOCKS_CONVERT is needed to make sure split
+			 * extent is marked unwritten although the flags itself
+			 * means that the extent should be converted to written.
+			 *
+			 * TODO: This is because ext4_split_convert_extents()
+			 * doesn't respect the flags at all but fixing this
+			 * needs more involved design changes.
+			 */
+			path = ext4_split_convert_extents(
+				handle, inode, orig_map->m_lblk, split_len,
+				path, flags | EXT4_GET_BLOCKS_CONVERT, NULL);
+			if (IS_ERR(path)) {
+				err = PTR_ERR(path);
+				goto insert_error;
+			}
+
+#ifdef CONFIG_EXT4_DEBUG
+			test_path = ext4_find_extent(inode, orig_map->m_lblk,
+						     NULL, 0);
+			if (!IS_ERR(test_path)) {
+				/* Confirm we've correctly split and marked the extent unwritten */
+				struct ext4_extent *test_ex =
+					test_path[ext_depth(inode)].p_ext;
+				WARN_ON(!ext4_ext_is_unwritten(test_ex));
+				WARN_ON(test_ex->ee_block != orig_map->m_lblk);
+				WARN_ON(ext4_ext_get_actual_len(test_ex) !=
+					orig_map->m_len);
+				kfree(test_path);
+			}
+#endif
+		} else {
+			/* S2 */
+			BUG_ON(orig_map->m_lblk < newex_lblk + newex_len);
+
+			path = ext4_ext_insert_extent(
+				handle, inode, path, &newex,
+				(flags & ~EXT4_GET_BLOCKS_PRE_IO));
+			if (IS_ERR(path)) {
+				err = PTR_ERR(path);
+				goto insert_error;
+			}
+		}
+	} else {
+		path = ext4_ext_insert_extent(handle, inode, path, &newex,
+					      flags);
+		if (IS_ERR(path)) {
+			err = PTR_ERR(path);
+			goto insert_error;
+		}
+	}
+
+insert_error:
+	if (err) {
 		if (allocated_clusters) {
 			int fb_flags = 0;
 
@@ -4793,6 +4933,9 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
 
 	inode_lock(inode);
 
+	if (ext4_should_use_extsize(inode))
+		flags |= EXT4_GET_BLOCKS_EXTSIZE;
+
 	/*
 	 * We only support preallocation for extent-based files only
 	 */
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 898b41751cf4..675f24ba009a 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -434,7 +434,7 @@ static void ext4_map_blocks_es_recheck(handle_t *handle,
 	 */
 	down_read(&EXT4_I(inode)->i_data_sem);
 	if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) {
-		retval = ext4_ext_map_blocks(handle, inode, map, 0);
+		retval = ext4_ext_map_blocks(handle, inode, map, NULL, 0);
 	} else {
 		retval = ext4_ind_map_blocks(handle, inode, map, 0);
 	}
@@ -459,15 +459,33 @@ static void ext4_map_blocks_es_recheck(handle_t *handle,
 #endif /* ES_AGGRESSIVE_TEST */
 
 static int ext4_map_query_blocks(handle_t *handle, struct inode *inode,
-				 struct ext4_map_blocks *map)
+				 struct ext4_map_blocks *orig_map,
+				 struct ext4_map_blocks *extsize_map,
+				 bool should_extsize)
 {
 	unsigned int status;
 	int retval;
+	struct ext4_map_blocks *map;
+
+	if (should_extsize) {
+		BUG_ON(extsize_map == NULL);
+		map = extsize_map;
+	} else
+		map = orig_map;
 
 	if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
-		retval = ext4_ext_map_blocks(handle, inode, map, 0);
-	else
+		if (should_extsize) {
+			retval = ext4_ext_map_blocks(handle, inode, orig_map,
+						     map,
+						     EXT4_GET_BLOCKS_EXTSIZE);
+		} else {
+			retval = ext4_ext_map_blocks(handle, inode, map, NULL,
+						     0);
+		}
+	else {
+		BUG_ON(should_extsize);
 		retval = ext4_ind_map_blocks(handle, inode, map, 0);
+	}
 
 	if (retval <= 0)
 		return retval;
@@ -488,11 +506,20 @@ static int ext4_map_query_blocks(handle_t *handle, struct inode *inode,
 }
 
 static int ext4_map_create_blocks(handle_t *handle, struct inode *inode,
-				  struct ext4_map_blocks *map, int flags)
+				  struct ext4_map_blocks *orig_map,
+				  struct ext4_map_blocks *extsize_map, int flags,
+				  bool should_extsize)
 {
 	struct extent_status es;
 	unsigned int status;
 	int err, retval = 0;
+	struct ext4_map_blocks *map;
+
+	if (should_extsize) {
+		BUG_ON(extsize_map == NULL);
+		map = extsize_map;
+	} else
+		map = orig_map;
 
 	/*
 	 * We pass in the magic EXT4_GET_BLOCKS_DELALLOC_RESERVE
@@ -513,8 +540,15 @@ static int ext4_map_create_blocks(handle_t *handle, struct inode *inode,
 	 * changed the inode type in between.
 	 */
 	if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) {
-		retval = ext4_ext_map_blocks(handle, inode, map, flags);
+		if (should_extsize) {
+			retval = ext4_ext_map_blocks(handle, inode, orig_map,
+						     map, flags);
+		} else {
+			retval = ext4_ext_map_blocks(handle, inode, map, NULL,
+						     flags);
+		}
 	} else {
+		BUG_ON(should_extsize);
 		retval = ext4_ind_map_blocks(handle, inode, map, flags);
 
 		/*
@@ -569,6 +603,80 @@ static int ext4_map_create_blocks(handle_t *handle, struct inode *inode,
 	return retval;
 }
 
+/**
+ * Extsize hint will change the mapped range and hence we'll end up mapping more.
+ * To not confuse the caller, adjust the struct ext4_map_blocks to reflect the
+ * original mapping requested by them.
+ *
+ * @cur_map:		The block mapping we are working with (for sanity check)
+ * @orig_map:		The originally requested mapping
+ * @extsize_map:	The mapping after adjusting for extsize hint
+ * @flags		Get block flags (for sanity check)
+ *
+ * This function assumes that the orig_mlblk is contained within the mapping
+ * held in extsize_map. Caller must make sure this is true.
+ */
+static inline unsigned int ext4_extsize_adjust_map(struct ext4_map_blocks *cur_map,
+					   struct ext4_map_blocks *orig_map,
+					   struct ext4_map_blocks *extsize_map,
+					   int flags)
+{
+	__u64 map_end = (__u64)extsize_map->m_lblk + extsize_map->m_len;
+
+	BUG_ON(cur_map != extsize_map || !(flags & EXT4_GET_BLOCKS_EXTSIZE));
+
+	orig_map->m_len = min(orig_map->m_len, map_end - orig_map->m_lblk);
+	orig_map->m_pblk =
+		extsize_map->m_pblk + (orig_map->m_lblk - extsize_map->m_lblk);
+	orig_map->m_flags = extsize_map->m_flags;
+
+	return orig_map->m_len;
+}
+
+/**
+ * ext4_error_adjust_map - Adjust map returned upon error in ext4_map_blocks()
+ *
+ * @cur_map: current map we are working with
+ * @orig_map: original map that would be returned to the user.
+ *
+ * Most of the callers of ext4_map_blocks() ignore the map on error, however
+ * some use it for debug logging. In this case, they log state of the map just
+ * before the error, hence this function ensures that map returned to caller is
+ * the one we were working with when error happened. Mostly useful when extsize
+ * hints are enabled.
+ */
+static inline void ext4_error_adjust_map(struct ext4_map_blocks *cur_map,
+					 struct ext4_map_blocks *orig_map)
+{
+	if (cur_map != orig_map)
+		memcpy(orig_map, cur_map, sizeof(*cur_map));
+}
+
+/*
+ * This functions resets the mapping to it's original state after it has been
+ * modified due to extent size hint and drops the extsize hint. To be used
+ * incase we want to fallback from extsize based aligned allocation to normal
+ * allocation
+ *
+ * @map:		The block mapping where lblk and len have been modified
+ *			because	of extsize hint
+ * @flags:		The get_block flags
+ * @orig_mlblk:		The originally requested logical block to map
+ * @orig_mlen:		The originally requested len to map
+ * @orig_flags:		The originally requested get_block flags
+ */
+static inline void ext4_extsize_reset_map(struct ext4_map_blocks *map,
+					  int *flags, ext4_lblk_t orig_mlblk,
+					  unsigned int orig_mlen,
+					  int orig_flags)
+{
+	/* Drop the extsize hint from original flags */
+	*flags = orig_flags & ~EXT4_GET_BLOCKS_EXTSIZE;
+	map->m_lblk = orig_mlblk;
+	map->m_len = orig_mlen;
+	map->m_flags = 0;
+}
+
 /*
  * The ext4_map_blocks() function tries to look up the requested blocks,
  * and returns if the blocks are already mapped.
@@ -593,31 +701,110 @@ static int ext4_map_create_blocks(handle_t *handle, struct inode *inode,
  * It returns the error in case of allocation failure.
  */
 int ext4_map_blocks(handle_t *handle, struct inode *inode,
-		    struct ext4_map_blocks *map, int flags)
+		    struct ext4_map_blocks *orig_map, int flags)
 {
 	struct extent_status es;
 	int retval;
 	int ret = 0;
+
+	ext4_lblk_t orig_mlblk, extsize_mlblk;
+	unsigned int orig_mlen, extsize_mlen;
+	int orig_flags;
+
+	struct ext4_map_blocks *map = NULL;
+	struct ext4_map_blocks extsize_map = {0};
+
+	__u32 extsize = ext4_inode_get_extsize(EXT4_I(inode));
+	bool should_extsize = false;
+
 #ifdef ES_AGGRESSIVE_TEST
-	struct ext4_map_blocks orig_map;
+	struct ext4_map_blocks test_map;
 
-	memcpy(&orig_map, map, sizeof(*map));
+	memcpy(&test_map, map, sizeof(*map));
 #endif
 
-	map->m_flags = 0;
-	ext_debug(inode, "flag 0x%x, max_blocks %u, logical block %lu\n",
-		  flags, map->m_len, (unsigned long) map->m_lblk);
+	orig_map->m_flags = 0;
+	ext_debug(inode, "flag 0x%x, max_blocks %u, logical block %lu\n", flags,
+		  orig_map->m_len, (unsigned long)orig_map->m_lblk);
 
 	/*
 	 * ext4_map_blocks returns an int, and m_len is an unsigned int
 	 */
-	if (unlikely(map->m_len > INT_MAX))
-		map->m_len = INT_MAX;
+	if (unlikely(orig_map->m_len > INT_MAX))
+		orig_map->m_len = INT_MAX;
 
 	/* We can handle the block number less than EXT_MAX_BLOCKS */
-	if (unlikely(map->m_lblk >= EXT_MAX_BLOCKS))
+	if (unlikely(orig_map->m_lblk >= EXT_MAX_BLOCKS))
 		return -EFSCORRUPTED;
 
+	orig_mlblk = orig_map->m_lblk;
+	orig_mlen = orig_map->m_len;
+	orig_flags = flags;
+
+set_map:
+	should_extsize = (extsize && (flags & EXT4_GET_BLOCKS_CREATE) &&
+			       (flags & EXT4_GET_BLOCKS_EXTSIZE));
+	if (should_extsize) {
+		/*
+		 * We adjust the extent size here but we still return the original lblk and len while
+		 * returning to keep the behavior compatible.
+		 */
+		int len, align;
+		/*
+		 * NOTE: Should we import EXT_UNWRITTEN_MAX_LEN from
+		 * ext4_extents.h here?
+		 */
+		int max_unwrit_len = ((1UL << 15) - 1);
+		loff_t end;
+
+		align = orig_map->m_lblk % extsize;
+		len = orig_map->m_len + align;
+
+		extsize_map.m_lblk = orig_map->m_lblk - align;
+		extsize_map.m_len =
+			max_t(unsigned int, roundup_pow_of_two(len), extsize);
+
+		/*
+		 * For now allocations beyond EOF don't use extsize hints so
+		 * that we can avoid dealing with extra blocks allocated past
+		 * EOF. We have inode lock since extsize allocations are
+		 * non-delalloc so i_size can be accessed safely
+		 */
+		end = (extsize_map.m_lblk + (loff_t)extsize_map.m_len) << inode->i_blkbits;
+		if (end > inode->i_size) {
+			flags = orig_flags & ~EXT4_GET_BLOCKS_EXTSIZE;
+			goto set_map;
+		}
+
+		/* Fallback to normal allocation if we go beyond max len */
+		if (extsize_map.m_len >= max_unwrit_len) {
+			flags = orig_flags & ~EXT4_GET_BLOCKS_EXTSIZE;
+			goto set_map;
+		}
+
+		/*
+		 * We are allocating more than requested. We'll have to convert
+		 * the extent to unwritten and then convert only the part
+		 * requested to written. For now we are using the same flow as
+		 * dioread nolock to achieve this. Hence the caller has to pass
+		 * CREATE_UNWRIT with EXTSIZE
+		 */
+		if (!(flags | EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT)) {
+			WARN_ON(true);
+
+			/* Fallback to non extsize allocation */
+			flags = orig_flags & ~EXT4_GET_BLOCKS_EXTSIZE;
+			goto set_map;
+		}
+
+		extsize_mlblk = extsize_map.m_lblk;
+		extsize_mlen = extsize_map.m_len;
+
+		extsize_map.m_flags = orig_map->m_flags;
+		map = &extsize_map;
+	} else
+		map = orig_map;
+
 	/* Lookup extent status tree firstly */
 	if (!(EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY) &&
 	    ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es)) {
@@ -647,7 +834,7 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
 			return retval;
 #ifdef ES_AGGRESSIVE_TEST
 		ext4_map_blocks_es_recheck(handle, inode, map,
-					   &orig_map, flags);
+					   &test_map, flags);
 #endif
 		goto found;
 	}
@@ -663,19 +850,60 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
 	 * file system block.
 	 */
 	down_read(&EXT4_I(inode)->i_data_sem);
-	retval = ext4_map_query_blocks(handle, inode, map);
+	if (should_extsize) {
+		BUG_ON(map != &extsize_map);
+		retval = ext4_map_query_blocks(handle, inode, orig_map, map,
+					       should_extsize);
+	} else {
+		BUG_ON(map != orig_map);
+		retval = ext4_map_query_blocks(handle, inode, map, NULL,
+					       should_extsize);
+	}
 	up_read((&EXT4_I(inode)->i_data_sem));
 
 found:
 	if (retval > 0 && map->m_flags & EXT4_MAP_MAPPED) {
 		ret = check_block_validity(inode, map);
-		if (ret != 0)
+		if (ret != 0) {
+			ext4_error_adjust_map(map, orig_map);
 			return ret;
+		}
 	}
 
 	/* If it is only a block(s) look up */
-	if ((flags & EXT4_GET_BLOCKS_CREATE) == 0)
+	if ((flags & EXT4_GET_BLOCKS_CREATE) == 0) {
+		BUG_ON(flags & EXT4_GET_BLOCKS_EXTSIZE);
 		return retval;
+	}
+
+	/* Handle some special cases when extsize based allocation is needed */
+	if (retval >= 0 && flags & EXT4_GET_BLOCKS_EXTSIZE) {
+		bool orig_in_range =
+			in_range(orig_mlblk, (__u64)map->m_lblk, map->m_len);
+		/*
+		 * Special case: if the extsize range is mapped already and
+		 * covers the original start, we return it.
+		 */
+		if (map->m_flags & EXT4_MAP_MAPPED && orig_in_range) {
+			/*
+			 * We don't use EXTSIZE with CONVERT_UNWRITTEN so
+			 * we can directly return the written extent
+			 */
+			return ext4_extsize_adjust_map(map, orig_map, &extsize_map, flags);
+		}
+
+		/*
+		 * Fallback case: if the found mapping (or hole) doesn't cover
+		 * the extsize required, then just fall back to normal
+		 * allocation to keep things simple.
+		 */
+
+		if (map->m_lblk != extsize_mlblk ||
+		    map->m_len != extsize_mlen) {
+			flags = orig_flags & ~EXT4_GET_BLOCKS_EXTSIZE;
+			goto set_map;
+		}
+	}
 
 	/*
 	 * Returns if the blocks have already allocated
@@ -699,12 +927,22 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
 	 * with create == 1 flag.
 	 */
 	down_write(&EXT4_I(inode)->i_data_sem);
-	retval = ext4_map_create_blocks(handle, inode, map, flags);
+	if (should_extsize) {
+		BUG_ON(map != &extsize_map);
+		retval = ext4_map_create_blocks(handle, inode, orig_map, map, flags,
+						should_extsize);
+	} else {
+		BUG_ON(map != orig_map);
+		retval = ext4_map_create_blocks(handle, inode, map, NULL, flags,
+					       should_extsize);
+	}
 	up_write((&EXT4_I(inode)->i_data_sem));
 	if (retval > 0 && map->m_flags & EXT4_MAP_MAPPED) {
 		ret = check_block_validity(inode, map);
-		if (ret != 0)
+		if (ret != 0) {
+			ext4_error_adjust_map(map, orig_map);
 			return ret;
+		}
 
 		/*
 		 * Inodes with freshly allocated blocks where contents will be
@@ -726,16 +964,38 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
 			else
 				ret = ext4_jbd2_inode_add_write(handle, inode,
 						start_byte, length);
-			if (ret)
+			if (ret) {
+				ext4_error_adjust_map(map, orig_map);
 				return ret;
+			}
 		}
 	}
 	if (retval > 0 && (map->m_flags & EXT4_MAP_UNWRITTEN ||
 				map->m_flags & EXT4_MAP_MAPPED))
 		ext4_fc_track_range(handle, inode, map->m_lblk,
 					map->m_lblk + map->m_len - 1);
-	if (retval < 0)
+
+	if (retval > 0 && flags & EXT4_GET_BLOCKS_EXTSIZE) {
+		/*
+		 * In the rare case that we have a short allocation and orig
+		 * lblk doesn't lie in mapped range just try to retry with
+		 * actual allocation. This is not ideal but this should be an
+		 * edge case near ENOSPC.
+		 *
+		 * NOTE: This has a side effect that blocks are allocated but
+		 * not used. Can we avoid that?
+		 */
+		if (!in_range(orig_mlblk, (__u64)map->m_lblk, map->m_len)) {
+			flags = orig_flags & ~EXT4_GET_BLOCKS_EXTSIZE;
+			goto set_map;
+		}
+		return ext4_extsize_adjust_map(map, orig_map, &extsize_map, flags);
+	}
+
+	if (retval < 0) {
+		ext4_error_adjust_map(map, orig_map);
 		ext_debug(inode, "failed with err %d\n", retval);
+	}
 	return retval;
 }
 
@@ -771,18 +1031,20 @@ static int _ext4_get_block(struct inode *inode, sector_t iblock,
 {
 	struct ext4_map_blocks map;
 	int ret = 0;
+	unsigned int orig_mlen = bh->b_size >> inode->i_blkbits;
 
 	if (ext4_has_inline_data(inode))
 		return -ERANGE;
 
 	map.m_lblk = iblock;
-	map.m_len = bh->b_size >> inode->i_blkbits;
+	map.m_len = orig_mlen;
 
 	ret = ext4_map_blocks(ext4_journal_current_handle(), inode, &map,
 			      flags);
 	if (ret > 0) {
 		map_bh(bh, inode->i_sb, map.m_pblk);
 		ext4_update_bh_state(bh, map.m_flags);
+		WARN_ON(map.m_len != orig_mlen);
 		bh->b_size = inode->i_sb->s_blocksize * map.m_len;
 		ret = 0;
 	} else if (ret == 0) {
@@ -808,11 +1070,14 @@ int ext4_get_block_unwritten(struct inode *inode, sector_t iblock,
 			     struct buffer_head *bh_result, int create)
 {
 	int ret = 0;
+	int flags = EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT;
+
+	if (ext4_should_use_extsize(inode))
+		flags |= EXT4_GET_BLOCKS_EXTSIZE;
 
 	ext4_debug("ext4_get_block_unwritten: inode %lu, create flag %d\n",
 		   inode->i_ino, create);
-	ret = _ext4_get_block(inode, iblock, bh_result,
-			       EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT);
+	ret = _ext4_get_block(inode, iblock, bh_result, flags);
 
 	/*
 	 * If the buffer is marked unwritten, mark it as new to make sure it is
@@ -1155,7 +1420,8 @@ static int ext4_write_begin(struct file *file, struct address_space *mapping,
 	from = pos & (PAGE_SIZE - 1);
 	to = from + len;
 
-	if (ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA)) {
+	if (!ext4_should_use_extsize(inode) &&
+	    ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA)) {
 		ret = ext4_try_to_write_inline_data(mapping, inode, pos, len,
 						    pagep);
 		if (ret < 0)
@@ -1203,7 +1469,7 @@ static int ext4_write_begin(struct file *file, struct address_space *mapping,
 	/* In case writeback began while the folio was unlocked */
 	folio_wait_stable(folio);
 
-	if (ext4_should_dioread_nolock(inode))
+	if (ext4_should_use_unwrit_extents(inode))
 		ret = ext4_block_write_begin(handle, folio, pos, len,
 					     ext4_get_block_unwritten);
 	else
@@ -1791,7 +2057,7 @@ static int ext4_da_map_blocks(struct inode *inode, struct ext4_map_blocks *map)
 	if (ext4_has_inline_data(inode))
 		retval = 0;
 	else
-		retval = ext4_map_query_blocks(NULL, inode, map);
+		retval = ext4_map_query_blocks(NULL, inode, map, NULL, false);
 	up_read(&EXT4_I(inode)->i_data_sem);
 	if (retval)
 		return retval < 0 ? retval : 0;
@@ -1814,7 +2080,7 @@ static int ext4_da_map_blocks(struct inode *inode, struct ext4_map_blocks *map)
 			goto found;
 		}
 	} else if (!ext4_has_inline_data(inode)) {
-		retval = ext4_map_query_blocks(NULL, inode, map);
+		retval = ext4_map_query_blocks(NULL, inode, map, NULL, false);
 		if (retval) {
 			up_write(&EXT4_I(inode)->i_data_sem);
 			return retval < 0 ? retval : 0;
@@ -2188,6 +2454,7 @@ static int mpage_map_one_extent(handle_t *handle, struct mpage_da_data *mpd)
 	struct ext4_map_blocks *map = &mpd->map;
 	int get_blocks_flags;
 	int err, dioread_nolock;
+	int extsize = ext4_should_use_extsize(inode);
 
 	trace_ext4_da_write_pages_extent(inode, map);
 	/*
@@ -2206,11 +2473,14 @@ static int mpage_map_one_extent(handle_t *handle, struct mpage_da_data *mpd)
 	dioread_nolock = ext4_should_dioread_nolock(inode);
 	if (dioread_nolock)
 		get_blocks_flags |= EXT4_GET_BLOCKS_IO_CREATE_EXT;
+	if (extsize)
+		get_blocks_flags |= EXT4_GET_BLOCKS_PRE_IO;
 
 	err = ext4_map_blocks(handle, inode, map, get_blocks_flags);
 	if (err < 0)
 		return err;
-	if (dioread_nolock && (map->m_flags & EXT4_MAP_UNWRITTEN)) {
+	if ((extsize || dioread_nolock) &&
+	    (map->m_flags & EXT4_MAP_UNWRITTEN)) {
 		if (!mpd->io_submit.io_end->handle &&
 		    ext4_handle_valid(handle)) {
 			mpd->io_submit.io_end->handle = handle->h_rsv_handle;
@@ -2633,10 +2903,11 @@ static int ext4_do_writepages(struct mpage_da_data *mpd)
 	}
 	mpd->journalled_more_data = 0;
 
-	if (ext4_should_dioread_nolock(inode)) {
+	if (ext4_should_use_unwrit_extents(inode)) {
 		/*
-		 * We may need to convert up to one extent per block in
-		 * the page and we may dirty the inode.
+		 * For extsize allocation or dioread_nolock, we may need to
+		 * convert up to one extent per block in the page and we may
+		 * dirty the inode.
 		 */
 		rsv_blocks = 1 + ext4_chunk_trans_blocks(inode,
 						PAGE_SIZE >> inode->i_blkbits);
@@ -2911,7 +3182,8 @@ static int ext4_da_write_begin(struct file *file, struct address_space *mapping,
 
 	index = pos >> PAGE_SHIFT;
 
-	if (ext4_nonda_switch(inode->i_sb) || ext4_verity_in_progress(inode)) {
+	if (ext4_nonda_switch(inode->i_sb) || ext4_verity_in_progress(inode) ||
+	    ext4_should_use_extsize(inode)) {
 		*fsdata = (void *)FALL_BACK_TO_NONDELALLOC;
 		return ext4_write_begin(file, mapping, pos,
 					len, pagep, fsdata);
@@ -3355,12 +3627,19 @@ static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
 	 * can complete at any point during the I/O and subsequently push the
 	 * i_disksize out to i_size. This could be beyond where direct I/O is
 	 * happening and thus expose allocated blocks to direct I/O reads.
+	 *
+	 * NOTE for extsize hints: We only support it for writes inside
+	 * EOF (for now) to not have to deal with blocks past EOF
 	 */
 	else if (((loff_t)map->m_lblk << blkbits) >= i_size_read(inode))
 		m_flags = EXT4_GET_BLOCKS_CREATE;
-	else if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
+	else if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) {
 		m_flags = EXT4_GET_BLOCKS_IO_CREATE_EXT;
 
+		if (ext4_should_use_extsize(inode) && retries == 0)
+			m_flags |= EXT4_GET_BLOCKS_EXTSIZE;
+	}
+
 	ret = ext4_map_blocks(handle, inode, map, m_flags);
 
 	/*
@@ -6175,7 +6454,7 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
 	}
 	folio_unlock(folio);
 	/* OK, we need to fill the hole... */
-	if (ext4_should_dioread_nolock(inode))
+	if (ext4_should_use_unwrit_extents(inode))
 		get_block = ext4_get_block_unwritten;
 	else
 		get_block = ext4_get_block;
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index 05441f87c5d2..0942578a97bc 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -50,6 +50,7 @@ struct partial_cluster;
 	{ EXT4_GET_BLOCKS_CONVERT_UNWRITTEN,	"CONVERT_UNWRITTEN" },  \
 	{ EXT4_GET_BLOCKS_ZERO,			"ZERO" },		\
 	{ EXT4_GET_BLOCKS_IO_SUBMIT,		"IO_SUBMIT" },		\
+	{ EXT4_GET_BLOCKS_EXTSIZE,		"EXTSIZE" },		\
 	{ EXT4_EX_NOCACHE,			"EX_NOCACHE" })
 
 /*
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [RFC 0/5] ext4: Implement support for extsize hints
  2024-09-11  9:01 [RFC 0/5] ext4: Implement support for extsize hints Ojaswin Mujoo
                   ` (4 preceding siblings ...)
  2024-09-11  9:01 ` [RFC 5/5] ext4: Add extsize hint support Ojaswin Mujoo
@ 2024-09-13 10:06 ` John Garry
  2024-09-13 10:54   ` Ritesh Harjani
  2024-09-18  9:54 ` Dave Chinner
  6 siblings, 1 reply; 13+ messages in thread
From: John Garry @ 2024-09-13 10:06 UTC (permalink / raw)
  To: Ojaswin Mujoo, linux-ext4, Theodore Ts'o
  Cc: Ritesh Harjani, linux-kernel, Darrick J . Wong, linux-fsdevel,
	dchinner

On 11/09/2024 10:01, Ojaswin Mujoo wrote:
> This patchset implements extsize hint feature for ext4. Posting this RFC to get
> some early review comments on the design and implementation bits. This feature
> is similar to what we have in XFS too with some differences.
> 
> extsize on ext4 is a hint to mballoc (multi-block allocator) and extent
> handling layer to do aligned allocations. We use allocation criteria 0
> (CR_POWER2_ALIGNED) for doing aligned power-of-2 allocations. With extsize hint
> we try to align the logical start (m_lblk) and length(m_len) of the allocation
> to be extsize aligned. CR_POWER2_ALIGNED criteria in mballoc automatically make
> sure that we get the aligned physical start (m_pblk) as well. So in this way
> extsize can make sure that lblk, len and pblk all are aligned for the allocated
> extent w.r.t extsize.
> 
> Note that extsize feature is just a hinting mechanism to ext4 multi-block
> allocator. That means that if we are unable to get an aligned allocation for
> some reason, than we drop this flag and continue with unaligned allocation to
> serve the request. However when we will add atomic/untorn writes support, then
> we will enforce the aligned allocation and can return -ENOSPC if aligned
> allocation was not successful.

A few questions/confirmations:
- You have no intention of adding an equivalent of forcealign, right?

- Would you also plan on using FS_IOC_FS(GET/SET)XATTR interface for 
enabling atomic writes on a per-inode basis?

- Can extsize be set at mkfs time?

- Is there any userspace support for this series available?

- how would/could extsize interact with bigalloc?

> 
> Comparison with XFS extsize feature -
> =====================================
> 1. extsize in XFS is a hint for aligning only the logical start and the lengh
>     of the allocation v/s extsize on ext4 make sure the physical start of the
>     extent gets aligned as well.

note that forcealign with extsize aligns AG block also

only for atomic writes do we enforce the AG block is aligned to physical 
block

> 
> 2. eof allocation on XFS trims the blocks allocated beyond eof with extsize
>     hint. That means on XFS for eof allocations (with extsize hint) only logical
>     start gets aligned. However extsize hint in ext4 for eof allocation is not
>     supported in this version of the series.
> 
> 3. XFS allows extsize to be set on file with no extents but delayed data.
>     However, ext4 don't allow that for simplicity. The user is expected to set
>     it on a file before changing it's i_size.
> 
> 4. XFS allows non-power-of-2 values for extsize but ext4 does not, since we
>     primarily would like to support atomic writes with extsize.
> 
> 5. In ext4 we chose to store the extsize value in SYSTEM_XATTR rather than an
>     inode field as it was simple and most flexible, since there might be more
>     features like atomic/untorn writes coming in future.
> 
> 6. In buffered-io path XFS switches to non-delalloc allocations for extsize hint.
>     The same has been kept for EXT4 as well.
> 
> Some TODOs:
> ===========
> 1. EOF allocations support can be added and can be kept similar to XFS

Note that EOF alignment for forcealign may change - it needs to be 
discussed further.

Thanks,
John

.
> 
> Rest of the design details can be found in the individual commit messages.
> 
> Thoughts and suggestions are welcome!
> 
> Ojaswin Mujoo (5):
>    ext4: add aligned allocation hint in mballoc
>    ext4: allow inode preallocation for aligned alloc
>    ext4: Support for extsize hint using FS_IOC_FS(GET/SET)XATTR
>    ext4: pass lblk and len explicitly to ext4_split_extent*()
>    ext4: Add extsize hint support
> 
>   fs/ext4/ext4.h              |  12 +-
>   fs/ext4/ext4_jbd2.h         |  15 ++
>   fs/ext4/extents.c           | 224 ++++++++++++++----
>   fs/ext4/inode.c             | 442 +++++++++++++++++++++++++++++++++---
>   fs/ext4/ioctl.c             | 119 ++++++++++
>   fs/ext4/mballoc.c           | 126 ++++++++--
>   fs/ext4/super.c             |   1 +
>   include/trace/events/ext4.h |   2 +
>   8 files changed, 841 insertions(+), 100 deletions(-)
> 


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC 0/5] ext4: Implement support for extsize hints
  2024-09-13 10:06 ` [RFC 0/5] ext4: Implement support for extsize hints John Garry
@ 2024-09-13 10:54   ` Ritesh Harjani
  2024-09-13 13:34     ` John Garry
  0 siblings, 1 reply; 13+ messages in thread
From: Ritesh Harjani @ 2024-09-13 10:54 UTC (permalink / raw)
  To: John Garry, Ojaswin Mujoo
  Cc: linux-kernel, Darrick J . Wong, linux-fsdevel, dchinner,
	linux-ext4, Theodore Ts'o

John Garry <john.g.garry@oracle.com> writes:

> On 11/09/2024 10:01, Ojaswin Mujoo wrote:
>> This patchset implements extsize hint feature for ext4. Posting this RFC to get
>> some early review comments on the design and implementation bits. This feature
>> is similar to what we have in XFS too with some differences.
>> 
>> extsize on ext4 is a hint to mballoc (multi-block allocator) and extent
>> handling layer to do aligned allocations. We use allocation criteria 0
>> (CR_POWER2_ALIGNED) for doing aligned power-of-2 allocations. With extsize hint
>> we try to align the logical start (m_lblk) and length(m_len) of the allocation
>> to be extsize aligned. CR_POWER2_ALIGNED criteria in mballoc automatically make
>> sure that we get the aligned physical start (m_pblk) as well. So in this way
>> extsize can make sure that lblk, len and pblk all are aligned for the allocated
>> extent w.r.t extsize.
>> 
>> Note that extsize feature is just a hinting mechanism to ext4 multi-block
>> allocator. That means that if we are unable to get an aligned allocation for
>> some reason, than we drop this flag and continue with unaligned allocation to
>> serve the request. However when we will add atomic/untorn writes support, then
>> we will enforce the aligned allocation and can return -ENOSPC if aligned
>> allocation was not successful.
>
> A few questions/confirmations:
> - You have no intention of adding an equivalent of forcealign, right?

extsize is just a hinting mechanism that too only for __allocation__
path. But for atomic writes we do require some form of forcealign (like
how we have in XFS). So we could either call this directly as atomic
write feature or can may as well call this forcealign feature and make
atomic writes depend upon it, like how XFS is doing it.

I still haven't understood if there is/will be a user specifically for
forcealign other than atomic writes.

Since you asked, I am more curious to know if there is some more context
to your question?

>
> - Would you also plan on using FS_IOC_FS(GET/SET)XATTR interface for 
> enabling atomic writes on a per-inode basis?

Yes, that interface should indeed be kept same for EXT4 too.

>
> - Can extsize be set at mkfs time?

Good point. For now in this series, extsize can only be set using the
same ioctl on a per inode basis.

IIUC, XFS supports doing both right. We can do this on a per-inode basis
during ioctl or it also supports setting this during mkfs.xfs time.
(maybe xfsprogs only allows setting this at mkfs time for rtvolumes for now)

So if this is set during mkfs.xfs time and then by default all inodes will
have this extsize attribute value set right?

BTW, this brings me to another question that I had asked here too [1].
1. For XFS, atomic writes can only be enabled with a fresh mkfs.xfs -d
atomic-writes=1 right? 
2. For atomic writes to be enabled, we need all 3 features to be
enabled during mkfs.xfs time itself right?
i.e. 
"mkfs.xfs -i forcealign=1 -d extsize=16384 -d atomic-writes=1"

[1]: https://lore.kernel.org/linux-xfs/20240817094800.776408-1-john.g.garry@oracle.com/

>
> - Is there any userspace support for this series available?

Make sense to maybe provide a userspace support link too.
For now, a quick hack would be to just allow setting extsize hint for
other fileystems as well in xfs_io.

diff --git a/io/open.c b/io/open.c
index 15850b55..6407b7e8 100644
--- a/io/open.c
+++ b/io/open.c
@@ -980,7 +980,7 @@ open_init(void)
        extsize_cmd.args = _("[-D | -R] [extsize]");
        extsize_cmd.argmin = 0;
        extsize_cmd.argmax = -1;
-       extsize_cmd.flags = CMD_NOMAP_OK;
+       extsize_cmd.flags = CMD_NOMAP_OK | CMD_FOREIGN_OK;
        extsize_cmd.oneline =
                _("get/set preferred extent size (in bytes) for the open file");
        extsize_cmd.help = extsize_help;

<e.g>
/dev/loop6 on /mnt1/test type ext4 (rw,relatime)

root@qemu:~/xt/xfsprogs-dev# ./io/xfs_io -fc "extsize" /mnt1/test/f1
[0] /mnt1/test/f1
root@qemu:~/xt/xfsprogs-dev# ./io/xfs_io -c "extsize 16384" /mnt1/test/f1
root@qemu:~/xt/xfsprogs-dev# ./io/xfs_io -c "extsize" /mnt1/test/f1
[16384] /mnt1/test/f1


>
> - how would/could extsize interact with bigalloc?
>

As of now it is kept disabled with bigalloc.

+	if (sbi->s_cluster_ratio > 1) {
+		msg = "Can't use extsize hint with bigalloc";
+		err = -EINVAL;
+		goto error;
+	}


>> 
>> Comparison with XFS extsize feature -
>> =====================================
>> 1. extsize in XFS is a hint for aligning only the logical start and the lengh
>>     of the allocation v/s extsize on ext4 make sure the physical start of the
>>     extent gets aligned as well.
>
> note that forcealign with extsize aligns AG block also

Can you expand that on a bit. You mean during mkfs.xfs time we ensure
agblock boundaries are extsize aligned?

>
> only for atomic writes do we enforce the AG block is aligned to physical 
> block
>

If you could expand that a bit please? You meant during mkfs.xfs
time for atomic writes we ensure ag block start bounaries are extsize aligned?


>> 
>> 2. eof allocation on XFS trims the blocks allocated beyond eof with extsize
>>     hint. That means on XFS for eof allocations (with extsize hint) only logical
>>     start gets aligned. However extsize hint in ext4 for eof allocation is not
>>     supported in this version of the series.
>> 
>> 3. XFS allows extsize to be set on file with no extents but delayed data.
>>     However, ext4 don't allow that for simplicity. The user is expected to set
>>     it on a file before changing it's i_size.
>> 
>> 4. XFS allows non-power-of-2 values for extsize but ext4 does not, since we
>>     primarily would like to support atomic writes with extsize.
>> 
>> 5. In ext4 we chose to store the extsize value in SYSTEM_XATTR rather than an
>>     inode field as it was simple and most flexible, since there might be more
>>     features like atomic/untorn writes coming in future.
>> 
>> 6. In buffered-io path XFS switches to non-delalloc allocations for extsize hint.
>>     The same has been kept for EXT4 as well.
>> 
>> Some TODOs:
>> ===========
>> 1. EOF allocations support can be added and can be kept similar to XFS
>
> Note that EOF alignment for forcealign may change - it needs to be 
> discussed further.

Sure, thanks for pointing that out.
I guess you are referring to mainly the truncate related EOF alignment change
required with forcealign for XFS.

-ritesh

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [RFC 0/5] ext4: Implement support for extsize hints
  2024-09-13 10:54   ` Ritesh Harjani
@ 2024-09-13 13:34     ` John Garry
  0 siblings, 0 replies; 13+ messages in thread
From: John Garry @ 2024-09-13 13:34 UTC (permalink / raw)
  To: Ritesh Harjani (IBM), Ojaswin Mujoo
  Cc: linux-kernel, Darrick J . Wong, linux-fsdevel, dchinner,
	linux-ext4, Theodore Ts'o

On 13/09/2024 11:54, Ritesh Harjani (IBM) wrote:
> John Garry <john.g.garry@oracle.com> writes:
> 
>> On 11/09/2024 10:01, Ojaswin Mujoo wrote:
>>> This patchset implements extsize hint feature for ext4. Posting this RFC to get
>>> some early review comments on the design and implementation bits. This feature
>>> is similar to what we have in XFS too with some differences.
>>>
>>> extsize on ext4 is a hint to mballoc (multi-block allocator) and extent
>>> handling layer to do aligned allocations. We use allocation criteria 0
>>> (CR_POWER2_ALIGNED) for doing aligned power-of-2 allocations. With extsize hint
>>> we try to align the logical start (m_lblk) and length(m_len) of the allocation
>>> to be extsize aligned. CR_POWER2_ALIGNED criteria in mballoc automatically make
>>> sure that we get the aligned physical start (m_pblk) as well. So in this way
>>> extsize can make sure that lblk, len and pblk all are aligned for the allocated
>>> extent w.r.t extsize.
>>>
>>> Note that extsize feature is just a hinting mechanism to ext4 multi-block
>>> allocator. That means that if we are unable to get an aligned allocation for
>>> some reason, than we drop this flag and continue with unaligned allocation to
>>> serve the request. However when we will add atomic/untorn writes support, then
>>> we will enforce the aligned allocation and can return -ENOSPC if aligned
>>> allocation was not successful.
>>
>> A few questions/confirmations:
>> - You have no intention of adding an equivalent of forcealign, right?
> 
> extsize is just a hinting mechanism that too only for __allocation__
> path. But for atomic writes we do require some form of forcealign (like
> how we have in XFS). So we could either call this directly as atomic
> write feature or can may as well call this forcealign feature and make
> atomic writes depend upon it, like how XFS is doing it.
> 
> I still haven't understood if there is/will be a user specifically for
> forcealign other than atomic writes.
 > > Since you asked, I am more curious to know if there is some more 
context
> to your question?

As Darrick mentioned at the following, forcealign could be used for DAX:
https://lore.kernel.org/linux-xfs/170404855884.1770028.10371509002317647981.stgit@frogsfrogsfrogs/

> 
>>
>> - Would you also plan on using FS_IOC_FS(GET/SET)XATTR interface for
>> enabling atomic writes on a per-inode basis?
> 
> Yes, that interface should indeed be kept same for EXT4 too.
> 
>>
>> - Can extsize be set at mkfs time?
> 
> Good point. For now in this series, extsize can only be set using the
> same ioctl on a per inode basis.
> 
> IIUC, XFS supports doing both right. We can do this on a per-inode basis
> during ioctl or it also supports setting this during mkfs.xfs time.

Right

> (maybe xfsprogs only allows setting this at mkfs time for rtvolumes for now)

extsize hint can already be set at mkfs time for both rtvol and !rtvol 
today.

> 
> So if this is set during mkfs.xfs time and then by default all inodes will
> have this extsize attribute value set right?

Right

But there is still the option to set this later with xfs_io -c "extsize" 
per-inode.

> 
> BTW, this brings me to another question that I had asked here too [1].
> 1. For XFS, atomic writes can only be enabled with a fresh mkfs.xfs -d
> atomic-writes=1 right?

Correct

Setting atomic-writes=1 enables the feature in the SB

> 2. For atomic writes to be enabled, we need all 3 features to be
> enabled during mkfs.xfs time itself right?

Right, that is how it is currently done.  But you could easily set 
extsize=4K at mkfs time so that not all inodes have a 16KB extsize, as 
in the example below. In this case, certain atomic write inodes could 
have their extsize increased to 16KB.

> i.e.
> "mkfs.xfs -i forcealign=1 -d extsize=16384 -d atomic-writes=1"
> 
> [1]: https://urldefense.com/v3/__https://lore.kernel.org/linux-xfs/20240817094800.776408-1-john.g.garry@oracle.com/__;!!ACWV5N9M2RV99hQ!J0dwKULbs9neFPRiUN1VR63Ea-Qgjk77y6SFN4GPBN2zqIGP46CDH0vG6fpvEMDFCq-O05CMePOn70hy9FA3zlw$
> 
>>
>> - Is there any userspace support for this series available?
> 
> Make sense to maybe provide a userspace support link too.
> For now, a quick hack would be to just allow setting extsize hint for
> other fileystems as well in xfs_io.
> 
> diff --git a/io/open.c b/io/open.c
> index 15850b55..6407b7e8 100644
> --- a/io/open.c
> +++ b/io/open.c
> @@ -980,7 +980,7 @@ open_init(void)
>          extsize_cmd.args = _("[-D | -R] [extsize]");
>          extsize_cmd.argmin = 0;
>          extsize_cmd.argmax = -1;
> -       extsize_cmd.flags = CMD_NOMAP_OK;
> +       extsize_cmd.flags = CMD_NOMAP_OK | CMD_FOREIGN_OK;
>          extsize_cmd.oneline =
>                  _("get/set preferred extent size (in bytes) for the open file");
>          extsize_cmd.help = extsize_help;
> 
> <e.g>
> /dev/loop6 on /mnt1/test type ext4 (rw,relatime)
> 
> root@qemu:~/xt/xfsprogs-dev# ./io/xfs_io -fc "extsize" /mnt1/test/f1
> [0] /mnt1/test/f1
> root@qemu:~/xt/xfsprogs-dev# ./io/xfs_io -c "extsize 16384" /mnt1/test/f1
> root@qemu:~/xt/xfsprogs-dev# ./io/xfs_io -c "extsize" /mnt1/test/f1
> [16384] /mnt1/test/f1

ok

> 
> 
>>
>> - how would/could extsize interact with bigalloc?
>>
> 
> As of now it is kept disabled with bigalloc.
> 
> +	if (sbi->s_cluster_ratio > 1) {
> +		msg = "Can't use extsize hint with bigalloc";
> +		err = -EINVAL;
> +		goto error;
> +	}
> 
> 
>>>
>>> Comparison with XFS extsize feature -
>>> =====================================
>>> 1. extsize in XFS is a hint for aligning only the logical start and the lengh
>>>      of the allocation v/s extsize on ext4 make sure the physical start of the
>>>      extent gets aligned as well.
>>
>> note that forcealign with extsize aligns AG block also
> 
> Can you expand that on a bit. You mean during mkfs.xfs time we ensure
> agblock boundaries are extsize aligned?

Yes, see align_ag_geometry() at 
https://github.com/johnpgarry/xfsprogs-dev/commits/atomic-writes/

> 
>>
>> only for atomic writes do we enforce the AG block is aligned to physical
>> block
>>
> 
> If you could expand that a bit please? You meant during mkfs.xfs
> time for atomic writes we ensure ag block start bounaries are extsize aligned?

We do this for forcealign with the extsize value supplied at mkfs time.

There are 2x things to consider about this:
- mkfs-specified extsize need not necessarily be a power-of-2
- even if this mkfs-specified extsize is a power-of-2, attempting to 
increase extsize for an inode enabled for atomic writes may be 
restricted, as the new extsize may not align with the AG count.

For example, extsize was 64KB and AG count = 16400 FSB (1025 * 64KB), 
then we cannot enable an inode for atomic writes with extsize = 128KB, 
as the disk block would not be aligned with the AG block.

> 
> 
>>>
>>> 2. eof allocation on XFS trims the blocks allocated beyond eof with extsize
>>>      hint. That means on XFS for eof allocations (with extsize hint) only logical
>>>      start gets aligned. However extsize hint in ext4 for eof allocation is not
>>>      supported in this version of the series.
>>>
>>> 3. XFS allows extsize to be set on file with no extents but delayed data.
>>>      However, ext4 don't allow that for simplicity. The user is expected to set
>>>      it on a file before changing it's i_size.
>>>
>>> 4. XFS allows non-power-of-2 values for extsize but ext4 does not, since we
>>>      primarily would like to support atomic writes with extsize.
>>>
>>> 5. In ext4 we chose to store the extsize value in SYSTEM_XATTR rather than an
>>>      inode field as it was simple and most flexible, since there might be more
>>>      features like atomic/untorn writes coming in future.
>>>
>>> 6. In buffered-io path XFS switches to non-delalloc allocations for extsize hint.
>>>      The same has been kept for EXT4 as well.
>>>
>>> Some TODOs:
>>> ===========
>>> 1. EOF allocations support can be added and can be kept similar to XFS
>>
>> Note that EOF alignment for forcealign may change - it needs to be
>> discussed further.
> 
> Sure, thanks for pointing that out.
> I guess you are referring to mainly the truncate related EOF alignment change
> required with forcealign for XFS.
> 

Thanks,
John

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC 0/5] ext4: Implement support for extsize hints
  2024-09-11  9:01 [RFC 0/5] ext4: Implement support for extsize hints Ojaswin Mujoo
                   ` (5 preceding siblings ...)
  2024-09-13 10:06 ` [RFC 0/5] ext4: Implement support for extsize hints John Garry
@ 2024-09-18  9:54 ` Dave Chinner
  2024-09-19  7:13   ` Ojaswin Mujoo
  6 siblings, 1 reply; 13+ messages in thread
From: Dave Chinner @ 2024-09-18  9:54 UTC (permalink / raw)
  To: Ojaswin Mujoo
  Cc: linux-ext4, Theodore Ts'o, Ritesh Harjani, linux-kernel,
	Darrick J . Wong, linux-fsdevel, John Garry, dchinner

On Wed, Sep 11, 2024 at 02:31:04PM +0530, Ojaswin Mujoo wrote:
> This patchset implements extsize hint feature for ext4. Posting this RFC to get
> some early review comments on the design and implementation bits. This feature
> is similar to what we have in XFS too with some differences.
> 
> extsize on ext4 is a hint to mballoc (multi-block allocator) and extent
> handling layer to do aligned allocations. We use allocation criteria 0
> (CR_POWER2_ALIGNED) for doing aligned power-of-2 allocations. With extsize hint
> we try to align the logical start (m_lblk) and length(m_len) of the allocation
> to be extsize aligned. CR_POWER2_ALIGNED criteria in mballoc automatically make
> sure that we get the aligned physical start (m_pblk) as well. So in this way
> extsize can make sure that lblk, len and pblk all are aligned for the allocated
> extent w.r.t extsize.
> 
> Note that extsize feature is just a hinting mechanism to ext4 multi-block
> allocator. That means that if we are unable to get an aligned allocation for
> some reason, than we drop this flag and continue with unaligned allocation to
> serve the request. However when we will add atomic/untorn writes support, then
> we will enforce the aligned allocation and can return -ENOSPC if aligned
> allocation was not successful.
> 
> Comparison with XFS extsize feature -
> =====================================
> 1. extsize in XFS is a hint for aligning only the logical start and the lengh
>    of the allocation v/s extsize on ext4 make sure the physical start of the
>    extent gets aligned as well.

What happens when you can't align the physical start of the extent?
It fails the allocation with ENOSPC?

For XFS, the existing extent size behaviour is a hint, and so we
ignore the hint if we cannot perform the allocation with the
suggested alignment. i.e. We should not fail an allocation with an
extent size hint until we are actually very near ENOSPC.

With the new force-align feature, the physical alignment within an
AG gets aligned to the extent size. In this case, if we can't find
an aligned free extent to allocate, we fail the allocation (ENOSPC).
Hence with forced alignment, we can have ENOSPC occur when there are
large amounts of free space available in the filesystem.

This is almost certainly what most people -don't want-, but it is a
requirement for atomic writes. To make matters worse, this behaviour
will almost certainly get worst as filesystem ages and free space
slowly fragments over time.

IOWs, by making the ext4 extsize have forced alignment semantics by
default, it means users will see ENOSPC at lot more frequently and
in situations where it is most definitely not expected.

We also have to keep in mind that there are applications out there
that set and use extent size hints, and so enabling extsize in ext4
will result in those applications silently starting to use them. If
ext4 supporting extsize hints drastically changes the behaviour of
the filesystem then that is going to cause significant unexpected
regressions for users as they upgrade kernels and filesystems.

Hence I strongly suggest that ext4 implements extent size hints in
the same way that XFS does. i.e. unless forced alignment has been
enabled for the inode, extsize is just a hint that gets discarded if
aligned allocation does not succeed.

Behaviour such as extent size hinting *should* be the same across
all filesystems that provide this functionality.  This makes using
extent size hints much easier for users, admins and application
developers. The last thing I want to hear is application devs tell
me at conferences that "we don't use extent size hints anymore
because ext4..."

> 2. eof allocation on XFS trims the blocks allocated beyond eof with extsize
>    hint. That means on XFS for eof allocations (with extsize hint) only logical
>    start gets aligned.

I'm not sure I understand what you are saying here. XFS does extsize
alignment of both the start and end of post-eof extents the same as
it does for extents within EOF. For example:

# xfs_io -fdc "truncate 0" -c "extsize 16k" -c "pwrite 0 4k" -c "bmap -vvp" foo
wrote 4096/4096 bytes at offset 0
4 KiB, 1 ops; 0.0308 sec (129.815 KiB/sec and 32.4538 ops/sec)
foo:
EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
   0: [0..7]:          256504..256511    0 (256504..256511)     8 000000
   1: [8..31]:         256512..256535    0 (256512..256535)    24 010000
 FLAG Values:
    0100000 Shared extent
    0010000 Unwritten preallocated extent

There's a 4k written extent at 0, and a 12k unwritten extent
beyond EOF at 4k. I.e. we have an extent of 16kB as the hint
required that is correctly aligned beyond EOF.

If I then write another 4k at 20k (beyond both EOF and the unwritten
extent beyond EOF:

# xfs_io -fdc "truncate 0" -c "extsize 16k" -c "pwrite 0 4k" -c "pwrite 20k 4k" -c "bmap -vvp" foo
wrote 4096/4096 bytes at offset 0
4 KiB, 1 ops; 0.0210 sec (190.195 KiB/sec and 47.5489 ops/sec)
wrote 4096/4096 bytes at offset 20480
4 KiB, 1 ops; 0.0001 sec (21.701 MiB/sec and 5555.5556 ops/sec)
foo:
 EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
   0: [0..7]:          180000..180007    0 (180000..180007)     8 000000
   1: [8..39]:         180008..180039    0 (180008..180039)    32 010000
   2: [40..47]:        180040..180047    0 (180040..180047)     8 000000
   3: [48..63]:        180048..180063    0 (180048..180063)    16 010000
 FLAG Values:
    0100000 Shared extent
    0010000 Unwritten preallocated extent

You can see we did contiguous allocation of another 16kB at offset
16kB, and then wrote to 20k for 4kB.. i.e. the new extent was
correctly aligned at both sides as the extsize hint says it should
be....

>    However extsize hint in ext4 for eof allocation is not
>    supported in this version of the series.

If you can't do extsize aligned allocations for EOF extension, then
how to applications use atomic writes to atomically extend the file?

> 3. XFS allows extsize to be set on file with no extents but delayed data.

It does?

<looks>

Yep, it doesn't check ip->i_delayed_blks is zero when changing
extsize.

I think that's simply a bug, not intended behaviour, because
delalloc will not have reserved space for the extsize hint rounding
needed when writeback occurs. Can you send a patch to add this
check?

>    However, ext4 don't allow that for simplicity. The user is expected to set
>    it on a file before changing it's i_size.

We don't actually care about i_size in XFS - the determining factor
is whether there are extents allocated on disk. i.e. we can truncate
up and then set the extent size hint because there are no extents
allocated even though the size is non-zero. 

There are almost certainly applications out there that change extent
size after truncating to a non-zero size, so this needs to work on
ext4 the same way it does on XFS. Otherwise people are going to
complain that their applications suddenly stop working properly on
ext4....

> 4. XFS allows non-power-of-2 values for extsize but ext4 does not, since we
>    primarily would like to support atomic writes with extsize.

Yes, ext4 can make that restriction if desired.

Keep in mind that the XFS atomic write support is still evolving,
and I think the way we are using extent size hints isn't fully
solidified yet.

Indeed, I think that we can allow non-power-of-2 extent sizes for
atomic writes, because integer multiples of the atomic write unit
will still ensure that physical extents are properly aligned for
atomic writes to succeed.  e.g. 24kB extent size is compatible with
8kB atomic write sizes.

To make that work efficiently unwritten extent boundaries need to be
maintained at atomic write alignments (8kB), not extent size
alignment (24kB), but other than that I don't think anything else is
needed....

This is desirable because it will allow extent size hints to remain
usable for their original purposes even with atomic writes on XFS.
i.e. fragmentation minimisation for small random DIO write worklaods
(exactly the sort of IO you'd consider using atomic writes for!),
alignment of extents to [non-power-of-2] RAID stripe geometry, etc.

> 5. In ext4 we chose to store the extsize value in SYSTEM_XATTR rather than an
>    inode field as it was simple and most flexible, since there might be more
>    features like atomic/untorn writes coming in future.

Does that mean you can query and set it through the user xattr
interfaces? If so, how do you enforce the values users set are
correct?

> 6. In buffered-io path XFS switches to non-delalloc allocations for extsize hint.
>    The same has been kept for EXT4 as well.

That's an internal XFS implementation detail that you don't need to
replicate. Historically speaking, we didn't use unwritten extents
for delayed allocation and so we couldn't do within-EOF extsize
unaligned writes without adding special additional zero-around code to
ensure that we never exposed stale data to userspace from the extra
allocation that the data write did not cover.

We now use unwritten extents for delalloc conversion, so this istale
data exposure issue no longer exists. We should really switch this
code back to using delalloc because it is much faster and less
fragmentation prone than direct extsize allocation....

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC 0/5] ext4: Implement support for extsize hints
  2024-09-18  9:54 ` Dave Chinner
@ 2024-09-19  7:13   ` Ojaswin Mujoo
  2024-09-19 22:34     ` Dave Chinner
  0 siblings, 1 reply; 13+ messages in thread
From: Ojaswin Mujoo @ 2024-09-19  7:13 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-ext4, Theodore Ts'o, Ritesh Harjani, linux-kernel,
	Darrick J . Wong, linux-fsdevel, John Garry, dchinner

On Wed, Sep 18, 2024 at 07:54:27PM +1000, Dave Chinner wrote:
> On Wed, Sep 11, 2024 at 02:31:04PM +0530, Ojaswin Mujoo wrote:
> > This patchset implements extsize hint feature for ext4. Posting this RFC to get
> > some early review comments on the design and implementation bits. This feature
> > is similar to what we have in XFS too with some differences.
> > 
> > extsize on ext4 is a hint to mballoc (multi-block allocator) and extent
> > handling layer to do aligned allocations. We use allocation criteria 0
> > (CR_POWER2_ALIGNED) for doing aligned power-of-2 allocations. With extsize hint
> > we try to align the logical start (m_lblk) and length(m_len) of the allocation
> > to be extsize aligned. CR_POWER2_ALIGNED criteria in mballoc automatically make
> > sure that we get the aligned physical start (m_pblk) as well. So in this way
> > extsize can make sure that lblk, len and pblk all are aligned for the allocated
> > extent w.r.t extsize.
> > 
> > Note that extsize feature is just a hinting mechanism to ext4 multi-block
> > allocator. That means that if we are unable to get an aligned allocation for
> > some reason, than we drop this flag and continue with unaligned allocation to
> > serve the request. However when we will add atomic/untorn writes support, then
> > we will enforce the aligned allocation and can return -ENOSPC if aligned
> > allocation was not successful.
> > 
> > Comparison with XFS extsize feature -
> > =====================================
> > 1. extsize in XFS is a hint for aligning only the logical start and the lengh
> >    of the allocation v/s extsize on ext4 make sure the physical start of the
> >    extent gets aligned as well.
> 
> What happens when you can't align the physical start of the extent?
> It fails the allocation with ENOSPC?

Hi Dave, thanks for the review.

No, ext4 treats it as a hint as well and we fallback to nonaligned
allocation
> 
> For XFS, the existing extent size behaviour is a hint, and so we
> ignore the hint if we cannot perform the allocation with the
> suggested alignment. i.e. We should not fail an allocation with an
> extent size hint until we are actually very near ENOSPC.
> 
> With the new force-align feature, the physical alignment within an
> AG gets aligned to the extent size. In this case, if we can't find
> an aligned free extent to allocate, we fail the allocation (ENOSPC).
> Hence with forced alignment, we can have ENOSPC occur when there are
> large amounts of free space available in the filesystem.
> 
> This is almost certainly what most people -don't want-, but it is a
> requirement for atomic writes. To make matters worse, this behaviour
> will almost certainly get worst as filesystem ages and free space
> slowly fragments over time.
> 
> IOWs, by making the ext4 extsize have forced alignment semantics by
> default, it means users will see ENOSPC at lot more frequently and
> in situations where it is most definitely not expected.
> 
> We also have to keep in mind that there are applications out there
> that set and use extent size hints, and so enabling extsize in ext4
> will result in those applications silently starting to use them. If
> ext4 supporting extsize hints drastically changes the behaviour of
> the filesystem then that is going to cause significant unexpected
> regressions for users as they upgrade kernels and filesystems.
> 
> Hence I strongly suggest that ext4 implements extent size hints in
> the same way that XFS does. i.e. unless forced alignment has been
> enabled for the inode, extsize is just a hint that gets discarded if
> aligned allocation does not succeed.
> 
> Behaviour such as extent size hinting *should* be the same across
> all filesystems that provide this functionality.  This makes using
> extent size hints much easier for users, admins and application
> developers. The last thing I want to hear is application devs tell
> me at conferences that "we don't use extent size hints anymore
> because ext4..."

Yes, makes sense :)  

Nothing to worry here tho as ext4 also treats the extsize value as a
hint exactly like XFS. We have tried to keep the behavior as similar
to XFS as possible for the exact reasons you mentioned. 

And yes, we do plan to add a forcealign (or similar) feature for ext4 as
well for atomic writes which would change the hint to a mandate

> 
> > 2. eof allocation on XFS trims the blocks allocated beyond eof with extsize
> >    hint. That means on XFS for eof allocations (with extsize hint) only logical
> >    start gets aligned.
> 
> I'm not sure I understand what you are saying here. XFS does extsize
> alignment of both the start and end of post-eof extents the same as
> it does for extents within EOF. For example:
> 
> # xfs_io -fdc "truncate 0" -c "extsize 16k" -c "pwrite 0 4k" -c "bmap -vvp" foo
> wrote 4096/4096 bytes at offset 0
> 4 KiB, 1 ops; 0.0308 sec (129.815 KiB/sec and 32.4538 ops/sec)
> foo:
> EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
>    0: [0..7]:          256504..256511    0 (256504..256511)     8 000000
>    1: [8..31]:         256512..256535    0 (256512..256535)    24 010000
>  FLAG Values:
>     0100000 Shared extent
>     0010000 Unwritten preallocated extent
> 
> There's a 4k written extent at 0, and a 12k unwritten extent
> beyond EOF at 4k. I.e. we have an extent of 16kB as the hint
> required that is correctly aligned beyond EOF.
> 
> If I then write another 4k at 20k (beyond both EOF and the unwritten
> extent beyond EOF:
> 
> # xfs_io -fdc "truncate 0" -c "extsize 16k" -c "pwrite 0 4k" -c "pwrite 20k 4k" -c "bmap -vvp" foo
> wrote 4096/4096 bytes at offset 0
> 4 KiB, 1 ops; 0.0210 sec (190.195 KiB/sec and 47.5489 ops/sec)
> wrote 4096/4096 bytes at offset 20480
> 4 KiB, 1 ops; 0.0001 sec (21.701 MiB/sec and 5555.5556 ops/sec)
> foo:
>  EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
>    0: [0..7]:          180000..180007    0 (180000..180007)     8 000000
>    1: [8..39]:         180008..180039    0 (180008..180039)    32 010000
>    2: [40..47]:        180040..180047    0 (180040..180047)     8 000000
>    3: [48..63]:        180048..180063    0 (180048..180063)    16 010000
>  FLAG Values:
>     0100000 Shared extent
>     0010000 Unwritten preallocated extent
> 
> You can see we did contiguous allocation of another 16kB at offset
> 16kB, and then wrote to 20k for 4kB.. i.e. the new extent was
> correctly aligned at both sides as the extsize hint says it should
> be....

Sorry for the confusion Dave. What was meant is that XFS would indeed
respect extsize hint for EOF allocations but if we close the file, since
we trim the blocks past EOF upon close, we would only see that the
lstart is aligned but the end would not.

For example:

$ xfs_io -c "open -dft foo" -c "truncate 0" -c "extsize 16k" -c "pwrite 0 4k" -c "bmap -vvp" -c "close" -c "open foo" -c "bmap -vvp"

wrote 4096/4096 bytes at offset 0
4 KiB, 1 ops; 0.0003 sec (9.864 MiB/sec and 2525.2525 ops/sec)

foo:
 EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
   0: [0..7]:          384..391          0 (384..391)           8 000000
   1: [8..31]:         392..415          0 (392..415)          24 010000
 FLAG Values:
    0010000 Unwritten preallocated extent

foo:
 EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
   0: [0..7]:          384..391          0 (384..391)           8 000000

> 
> >    However extsize hint in ext4 for eof allocation is not
> >    supported in this version of the series.
> 
> If you can't do extsize aligned allocations for EOF extension, then
> how to applications use atomic writes to atomically extend the file?
> 

In this particular RFC we can't and we'll have to go via the 'set extsize hint
and then truncate' route.  But we do plan to add this in next revision.

> > 3. XFS allows extsize to be set on file with no extents but delayed data.
> 
> It does?
> 
> <looks>
> 
> Yep, it doesn't check ip->i_delayed_blks is zero when changing
> extsize.
> 
> I think that's simply a bug, not intended behaviour, because
> delalloc will not have reserved space for the extsize hint rounding
> needed when writeback occurs. Can you send a patch to add this
> check?

Got it, sure I can send a patch for this.

> 
> >    However, ext4 don't allow that for simplicity. The user is expected to set
> >    it on a file before changing it's i_size.
> 
> We don't actually care about i_size in XFS - the determining factor
> is whether there are extents allocated on disk. i.e. we can truncate
> up and then set the extent size hint because there are no extents
> allocated even though the size is non-zero. 

That's right Dave, my intention was also to just make sure that before
setting extsize:

  1. We dont have dellayed allocs in flight
  2. We dont have blocks allocated on disk

So ideally truncate followed by extsize set should have worked.

And in that sense, you are right that using i_size (or i_disksize in ext4)
is not correct. I will fix this behavior in next revision, thanks.

> 
> There are almost certainly applications out there that change extent
> size after truncating to a non-zero size, so this needs to work on
> ext4 the same way it does on XFS. Otherwise people are going to
> complain that their applications suddenly stop working properly on
> ext4....
> 
> > 4. XFS allows non-power-of-2 values for extsize but ext4 does not, since we
> >    primarily would like to support atomic writes with extsize.
> 
> Yes, ext4 can make that restriction if desired.
> 
> Keep in mind that the XFS atomic write support is still evolving,
> and I think the way we are using extent size hints isn't fully
> solidified yet.
> 
> Indeed, I think that we can allow non-power-of-2 extent sizes for
> atomic writes, because integer multiples of the atomic write unit
> will still ensure that physical extents are properly aligned for
> atomic writes to succeed.  e.g. 24kB extent size is compatible with
> 8kB atomic write sizes.
> 
> To make that work efficiently unwritten extent boundaries need to be
> maintained at atomic write alignments (8kB), not extent size
> alignment (24kB), but other than that I don't think anything else is
> needed....
> 
> This is desirable because it will allow extent size hints to remain
> usable for their original purposes even with atomic writes on XFS.
> i.e. fragmentation minimisation for small random DIO write worklaods
> (exactly the sort of IO you'd consider using atomic writes for!),
> alignment of extents to [non-power-of-2] RAID stripe geometry, etc.

Got it, I agree that extsize doesn't **have** to be power-of-2 but
for this revision we have kept it that way cause getting power of 2
aligned blocks comes almost without much changes in ext4 allocator.

However, it shouldn't be a problem to support non power-of-2 blocks. We
already have some aligned allocation logic for RAID use case which can
be leveraged.

> 
> > 5. In ext4 we chose to store the extsize value in SYSTEM_XATTR rather than an
> >    inode field as it was simple and most flexible, since there might be more
> >    features like atomic/untorn writes coming in future.
> 
> Does that mean you can query and set it through the user xattr
> interfaces? If so, how do you enforce the values users set are
> correct?

AFAICU, ext4 (and XFS) doesn't provide a handler for system xattrs, so
its not possible for a user to get/set it from the xattr interface.
They'd have to go through the ioctl. 

$ getfattr -n system.extsize test
test: system.extsize: Operation not supported

That being said, in case in future we would for some reason want to add
a handler for system xattrs to ext4, we'll have to be mindful of making
sure users can't get or set extsize through the xattr interfaces.

> 
> > 6. In buffered-io path XFS switches to non-delalloc allocations for extsize hint.
> >    The same has been kept for EXT4 as well.
> 
> That's an internal XFS implementation detail that you don't need to
> replicate. Historically speaking, we didn't use unwritten extents
> for delayed allocation and so we couldn't do within-EOF extsize
> unaligned writes without adding special additional zero-around code to
> ensure that we never exposed stale data to userspace from the extra
> allocation that the data write did not cover.
> 
> We now use unwritten extents for delalloc conversion, so this istale
> data exposure issue no longer exists. We should really switch this
> code back to using delalloc because it is much faster and less
> fragmentation prone than direct extsize allocation....

Thanks for the context Dave, I didn't implement it this time around as I
wanted to be sure what challenges XFS faced and ext4 will face while
trying extsize with delalloc. I think this clears things up and this can
be added in the next revisions.

> 
> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC 0/5] ext4: Implement support for extsize hints
  2024-09-19  7:13   ` Ojaswin Mujoo
@ 2024-09-19 22:34     ` Dave Chinner
  2024-09-20 15:04       ` Ojaswin Mujoo
  0 siblings, 1 reply; 13+ messages in thread
From: Dave Chinner @ 2024-09-19 22:34 UTC (permalink / raw)
  To: Ojaswin Mujoo
  Cc: linux-ext4, Theodore Ts'o, Ritesh Harjani, linux-kernel,
	Darrick J . Wong, linux-fsdevel, John Garry, dchinner

On Thu, Sep 19, 2024 at 12:43:17PM +0530, Ojaswin Mujoo wrote:
> On Wed, Sep 18, 2024 at 07:54:27PM +1000, Dave Chinner wrote:
> > On Wed, Sep 11, 2024 at 02:31:04PM +0530, Ojaswin Mujoo wrote:
> > Behaviour such as extent size hinting *should* be the same across
> > all filesystems that provide this functionality.  This makes using
> > extent size hints much easier for users, admins and application
> > developers. The last thing I want to hear is application devs tell
> > me at conferences that "we don't use extent size hints anymore
> > because ext4..."
> 
> Yes, makes sense :)  
> 
> Nothing to worry here tho as ext4 also treats the extsize value as a
> hint exactly like XFS. We have tried to keep the behavior as similar
> to XFS as possible for the exact reasons you mentioned. 

It is worth explicitly stating this (i.e. all the behaviours that
are the same) in the design documentation rather than just the
corner cases where it is different. It was certainly not clear how
failures were treated.

> And yes, we do plan to add a forcealign (or similar) feature for ext4 as
> well for atomic writes which would change the hint to a mandate

Ok. That should be stated, too.

FWIW, it would be a good idea to document this all in the kernel
documentation itself, so there is a guideline for other filesystems
to implement the same behaviour. e.g. in
Documentation/filesystems/extent-size-hints.rst

> > > 2. eof allocation on XFS trims the blocks allocated beyond eof with extsize
> > >    hint. That means on XFS for eof allocations (with extsize hint) only logical
> > >    start gets aligned.
> > 
> > I'm not sure I understand what you are saying here. XFS does extsize
> > alignment of both the start and end of post-eof extents the same as
> > it does for extents within EOF. For example:
> > 
> > # xfs_io -fdc "truncate 0" -c "extsize 16k" -c "pwrite 0 4k" -c "bmap -vvp" foo
> > wrote 4096/4096 bytes at offset 0
> > 4 KiB, 1 ops; 0.0308 sec (129.815 KiB/sec and 32.4538 ops/sec)
> > foo:
> > EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
> >    0: [0..7]:          256504..256511    0 (256504..256511)     8 000000
> >    1: [8..31]:         256512..256535    0 (256512..256535)    24 010000
> >  FLAG Values:
> >     0100000 Shared extent
> >     0010000 Unwritten preallocated extent
> > 
> > There's a 4k written extent at 0, and a 12k unwritten extent
> > beyond EOF at 4k. I.e. we have an extent of 16kB as the hint
> > required that is correctly aligned beyond EOF.
> > 
> > If I then write another 4k at 20k (beyond both EOF and the unwritten
> > extent beyond EOF:
> > 
> > # xfs_io -fdc "truncate 0" -c "extsize 16k" -c "pwrite 0 4k" -c "pwrite 20k 4k" -c "bmap -vvp" foo
> > wrote 4096/4096 bytes at offset 0
> > 4 KiB, 1 ops; 0.0210 sec (190.195 KiB/sec and 47.5489 ops/sec)
> > wrote 4096/4096 bytes at offset 20480
> > 4 KiB, 1 ops; 0.0001 sec (21.701 MiB/sec and 5555.5556 ops/sec)
> > foo:
> >  EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
> >    0: [0..7]:          180000..180007    0 (180000..180007)     8 000000
> >    1: [8..39]:         180008..180039    0 (180008..180039)    32 010000
> >    2: [40..47]:        180040..180047    0 (180040..180047)     8 000000
> >    3: [48..63]:        180048..180063    0 (180048..180063)    16 010000
> >  FLAG Values:
> >     0100000 Shared extent
> >     0010000 Unwritten preallocated extent
> > 
> > You can see we did contiguous allocation of another 16kB at offset
> > 16kB, and then wrote to 20k for 4kB.. i.e. the new extent was
> > correctly aligned at both sides as the extsize hint says it should
> > be....
> 
> Sorry for the confusion Dave. What was meant is that XFS would indeed
> respect extsize hint for EOF allocations but if we close the file, since
> we trim the blocks past EOF upon close, we would only see that the
> lstart is aligned but the end would not.

Right, but that is desired behaviour, especially when extsize is
large.  i.e. when the file is closed it is an indication that the
file will not be written again, so we don't need to keep post-eof
blocks around for fragmentation prevention reasons.

Removing post-EOF extents on close prevents large extsize hints from
consuming lots of unused space on files that are never going to be
written to again(*).  That's user visible, and because it can cause
premature ENOSPC, users will report this excessive space usage
behaviour as a bug (and they are right).  Hence removing post-eof
extents on file close when extent size hints are in use comes under
the guise of Good Behaviour To Have.

(*) think about how much space is wasted if you clone a kernel git
tree under a 1MB extent size hint directory. All those tiny header
files now take up 1MB of space on disk....

Keep in mind that when the file is opened for write again, the
extent size hint still gets applied to the new extents.  If the
extending write starts beyond the EOF extsize range, then the new
extent after the hole at EOF will be fully extsize aligned, as
expected.

If the new write is exactly extending the file, then the new extents
will not be extsize aligned - the start will be at the EOF block,
and they will be extsize -length-.  IOWs, the extent size is
maintained, just the logical alignment is not exactly extsize
aligned. This could be considered a bug, but it's never been an
issue for anyone because, in XFS, physical extent alignment is
separate (and maintained regardless of logical alignment) for extent
size hint based allocations.

Adding force-align will prevent this behaviour from occurring, as
post-eof trimming will be done to extsize alignment, not to the EOF
block.  Hence open/close/open will not affect logical or physical
alignment of force-align extents (and hence won't affect atomic
writes).

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC 0/5] ext4: Implement support for extsize hints
  2024-09-19 22:34     ` Dave Chinner
@ 2024-09-20 15:04       ` Ojaswin Mujoo
  0 siblings, 0 replies; 13+ messages in thread
From: Ojaswin Mujoo @ 2024-09-20 15:04 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-ext4, Theodore Ts'o, Ritesh Harjani, linux-kernel,
	Darrick J . Wong, linux-fsdevel, John Garry, dchinner

On Fri, Sep 20, 2024 at 08:34:14AM +1000, Dave Chinner wrote:
> On Thu, Sep 19, 2024 at 12:43:17PM +0530, Ojaswin Mujoo wrote:
> > On Wed, Sep 18, 2024 at 07:54:27PM +1000, Dave Chinner wrote:
> > > On Wed, Sep 11, 2024 at 02:31:04PM +0530, Ojaswin Mujoo wrote:
> > > Behaviour such as extent size hinting *should* be the same across
> > > all filesystems that provide this functionality.  This makes using
> > > extent size hints much easier for users, admins and application
> > > developers. The last thing I want to hear is application devs tell
> > > me at conferences that "we don't use extent size hints anymore
> > > because ext4..."
> > 
> > Yes, makes sense :)  
> > 
> > Nothing to worry here tho as ext4 also treats the extsize value as a
> > hint exactly like XFS. We have tried to keep the behavior as similar
> > to XFS as possible for the exact reasons you mentioned. 
> 
> It is worth explicitly stating this (i.e. all the behaviours that
> are the same) in the design documentation rather than just the
> corner cases where it is different. It was certainly not clear how
> failures were treated.

Got it Dave, I did mention it in the actual commit 5/5 but I agree. I
will update the cover letter to be more clear about the design in future
revisions.

> 
> > And yes, we do plan to add a forcealign (or similar) feature for ext4 as
> > well for atomic writes which would change the hint to a mandate
> 
> Ok. That should be stated, too.
> 
> FWIW, it would be a good idea to document this all in the kernel
> documentation itself, so there is a guideline for other filesystems
> to implement the same behaviour. e.g. in
> Documentation/filesystems/extent-size-hints.rst

Okay makes sense, I can look into this as a next step.

> 
> > > > 2. eof allocation on XFS trims the blocks allocated beyond eof with extsize
> > > >    hint. That means on XFS for eof allocations (with extsize hint) only logical
> > > >    start gets aligned.
> > > 
> > > I'm not sure I understand what you are saying here. XFS does extsize
> > > alignment of both the start and end of post-eof extents the same as
> > > it does for extents within EOF. For example:
> > > 
> > > # xfs_io -fdc "truncate 0" -c "extsize 16k" -c "pwrite 0 4k" -c "bmap -vvp" foo
> > > wrote 4096/4096 bytes at offset 0
> > > 4 KiB, 1 ops; 0.0308 sec (129.815 KiB/sec and 32.4538 ops/sec)
> > > foo:
> > > EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
> > >    0: [0..7]:          256504..256511    0 (256504..256511)     8 000000
> > >    1: [8..31]:         256512..256535    0 (256512..256535)    24 010000
> > >  FLAG Values:
> > >     0100000 Shared extent
> > >     0010000 Unwritten preallocated extent
> > > 
> > > There's a 4k written extent at 0, and a 12k unwritten extent
> > > beyond EOF at 4k. I.e. we have an extent of 16kB as the hint
> > > required that is correctly aligned beyond EOF.
> > > 
> > > If I then write another 4k at 20k (beyond both EOF and the unwritten
> > > extent beyond EOF:
> > > 
> > > # xfs_io -fdc "truncate 0" -c "extsize 16k" -c "pwrite 0 4k" -c "pwrite 20k 4k" -c "bmap -vvp" foo
> > > wrote 4096/4096 bytes at offset 0
> > > 4 KiB, 1 ops; 0.0210 sec (190.195 KiB/sec and 47.5489 ops/sec)
> > > wrote 4096/4096 bytes at offset 20480
> > > 4 KiB, 1 ops; 0.0001 sec (21.701 MiB/sec and 5555.5556 ops/sec)
> > > foo:
> > >  EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
> > >    0: [0..7]:          180000..180007    0 (180000..180007)     8 000000
> > >    1: [8..39]:         180008..180039    0 (180008..180039)    32 010000
> > >    2: [40..47]:        180040..180047    0 (180040..180047)     8 000000
> > >    3: [48..63]:        180048..180063    0 (180048..180063)    16 010000
> > >  FLAG Values:
> > >     0100000 Shared extent
> > >     0010000 Unwritten preallocated extent
> > > 
> > > You can see we did contiguous allocation of another 16kB at offset
> > > 16kB, and then wrote to 20k for 4kB.. i.e. the new extent was
> > > correctly aligned at both sides as the extsize hint says it should
> > > be....
> > 
> > Sorry for the confusion Dave. What was meant is that XFS would indeed
> > respect extsize hint for EOF allocations but if we close the file, since
> > we trim the blocks past EOF upon close, we would only see that the
> > lstart is aligned but the end would not.
> 
> Right, but that is desired behaviour, especially when extsize is
> large.  i.e. when the file is closed it is an indication that the
> file will not be written again, so we don't need to keep post-eof
> blocks around for fragmentation prevention reasons.
> 
> Removing post-EOF extents on close prevents large extsize hints from
> consuming lots of unused space on files that are never going to be
> written to again(*).  That's user visible, and because it can cause
> premature ENOSPC, users will report this excessive space usage
> behaviour as a bug (and they are right).  Hence removing post-eof
> extents on file close when extent size hints are in use comes under
> the guise of Good Behaviour To Have.
> 
> (*) think about how much space is wasted if you clone a kernel git
> tree under a 1MB extent size hint directory. All those tiny header
> files now take up 1MB of space on disk....
> 
> Keep in mind that when the file is opened for write again, the
> extent size hint still gets applied to the new extents.  If the
> extending write starts beyond the EOF extsize range, then the new
> extent after the hole at EOF will be fully extsize aligned, as
> expected.
> 
> If the new write is exactly extending the file, then the new extents
> will not be extsize aligned - the start will be at the EOF block,
> and they will be extsize -length-.  IOWs, the extent size is
> maintained, just the logical alignment is not exactly extsize
> aligned. This could be considered a bug, but it's never been an
> issue for anyone because, in XFS, physical extent alignment is
> separate (and maintained regardless of logical alignment) for extent
> size hint based allocations.
> 
> Adding force-align will prevent this behaviour from occurring, as
> post-eof trimming will be done to extsize alignment, not to the EOF
> block.  Hence open/close/open will not affect logical or physical
> alignment of force-align extents (and hence won't affect atomic
> writes).

Thanks for the context, I will try to keep this behavior similar to XFS
once we implement the EOF support for extsize hints in next revision.

Regards,
Ojaswin
> 
> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2024-09-20 15:05 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-09-11  9:01 [RFC 0/5] ext4: Implement support for extsize hints Ojaswin Mujoo
2024-09-11  9:01 ` [RFC 1/5] ext4: add aligned allocation hint in mballoc Ojaswin Mujoo
2024-09-11  9:01 ` [RFC 2/5] ext4: allow inode preallocation for aligned alloc Ojaswin Mujoo
2024-09-11  9:01 ` [RFC 3/5] ext4: Support for extsize hint using FS_IOC_FS(GET/SET)XATTR Ojaswin Mujoo
2024-09-11  9:01 ` [RFC 4/5] ext4: pass lblk and len explicitly to ext4_split_extent*() Ojaswin Mujoo
2024-09-11  9:01 ` [RFC 5/5] ext4: Add extsize hint support Ojaswin Mujoo
2024-09-13 10:06 ` [RFC 0/5] ext4: Implement support for extsize hints John Garry
2024-09-13 10:54   ` Ritesh Harjani
2024-09-13 13:34     ` John Garry
2024-09-18  9:54 ` Dave Chinner
2024-09-19  7:13   ` Ojaswin Mujoo
2024-09-19 22:34     ` Dave Chinner
2024-09-20 15:04       ` Ojaswin Mujoo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).