public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/1] ext4: Fix regression in mballoc due to deleted inode PAs in rbtree
@ 2023-07-22 17:15 Ojaswin Mujoo
  2023-07-22 17:15 ` [PATCH 1/1] ext4: Fix rbtree traversal bug in ext4_mb_use_preallocated Ojaswin Mujoo
  2023-07-23 12:32 ` [PATCH 0/1] ext4: Fix regression in mballoc due to deleted inode PAs in rbtree Theodore Ts'o
  0 siblings, 2 replies; 3+ messages in thread
From: Ojaswin Mujoo @ 2023-07-22 17:15 UTC (permalink / raw)
  To: linux-ext4, Theodore Ts'o
  Cc: Ritesh Harjani, linux-kernel, Jan Kara, Naresh Kamboju

Hello,

Recently there was a regression found in the mballoc [1] due the existence
of deleted preallocations(PAs) in the per inode preallocation rbtree.
Such deleted PAs can occur because ext4_mb_discard_group_preallocations 
traverses the grp->bb_prealloc_list and marks the PAs as deleted without
taking any inode specific locks like i_data_sem. 

Due to presence of such PAs, we were sometimes missing some of the pa
entries when traversing the per node rbtree in ext4_mb_use_preallocated.
Due to this, in some rare conditions we ended up missing a PA that did
overlap with our original request start. when this happens, we exit 
ext4_mb_use_preallocated and proceed with the allocation. However,
during ext4_mb_normalize_request() we were hitting a bug on as
a PA that could satisfy our request already existed. Since normalize
request was already fixed earlier to account for the deleted PAs we
actually able to catch it there.

This patch aims to fix this regression by using a different appraoch to
traverse the rbtree in ext4_mb_use_preallocated(). More comments can be
found in the commit message.

We've run several overnight runs of xfstests generic/269 as well as
several hours of ltp test on both x86 machines with 4k blocks size and
Power machines with 64k and 4k block size and have yet to hit the
regression. Further we added some debug prints in our testing to make
sure we were correctly handling the conditions that were triggering the
bug ons previously.

[1]
https://lore.kernel.org/linux-ext4/CA+G9fYv2FRpLqBZf34ZinR8bU2_ZRAUOjKAD3+tKRFaEQHtt8Q@mail.gmail.com/

Regards,
ojaswin

Ojaswin Mujoo (1):
  ext4: Fix rbtree traversal bug in ext4_mb_use_preallocated

 fs/ext4/mballoc.c | 158 ++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 131 insertions(+), 27 deletions(-)

-- 
2.31.1


^ permalink raw reply	[flat|nested] 3+ messages in thread

* [PATCH 1/1] ext4: Fix rbtree traversal bug in ext4_mb_use_preallocated
  2023-07-22 17:15 [PATCH 0/1] ext4: Fix regression in mballoc due to deleted inode PAs in rbtree Ojaswin Mujoo
@ 2023-07-22 17:15 ` Ojaswin Mujoo
  2023-07-23 12:32 ` [PATCH 0/1] ext4: Fix regression in mballoc due to deleted inode PAs in rbtree Theodore Ts'o
  1 sibling, 0 replies; 3+ messages in thread
From: Ojaswin Mujoo @ 2023-07-22 17:15 UTC (permalink / raw)
  To: linux-ext4, Theodore Ts'o
  Cc: Ritesh Harjani, linux-kernel, Jan Kara, Naresh Kamboju

During allocations, while looking for preallocations(PA) in the per
inode rbtree, we can't do a direct traversal of the tree because
ext4_mb_discard_group_preallocation() can paralelly mark the pa deleted
and that can cause direct traversal to skip some entries. This was
leading to a BUG_ON() being hit [1] when we missed a PA that could satisfy
our request and ultimately tried to create a new PA that would overlap
with the missed one.

To makes sure we handle that case while still keeping the performance of
the rbtree, we make use of the fact that the only pa that could possibly
overlap the original goal start is the one that satisfies the below
conditions:

  1. It must have it's logical start immediately to the left of
  (ie less than) original logical start.

  2. It must not be deleted

To find this pa we use the following traversal method:

1. Descend into the rbtree normally to find the immediate neighboring
PA. Here we keep descending irrespective of if the PA is deleted or if
it overlaps with our request etc. The goal is to find an immediately
adjacent PA.

2. If the found PA is on right of original goal, use rb_prev() to find
the left adjacent PA.

3. Check if this PA is deleted and keep moving left with rb_prev() until
a non deleted PA is found.

4. This is the PA we are looking for. Now we can check if it can satisfy
the original request and proceed accordingly.

This approach also takes care of having deleted PAs in the tree.

(While we are at it, also fix a possible overflow bug in calculating the
end of a PA)

[1] https://lore.kernel.org/linux-ext4/CA+G9fYv2FRpLqBZf34ZinR8bU2_ZRAUOjKAD3+tKRFaEQHtt8Q@mail.gmail.com/

Fixes: 3872778664e3 ("ext4: Use rbtrees to manage PAs instead of inode i_prealloc_list")
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Reviewed-by: Ritesh Harjani (IBM) ritesh.list@gmail.com
Tested-by: Ritesh Harjani (IBM) ritesh.list@gmail.com
---
 fs/ext4/mballoc.c | 158 ++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 131 insertions(+), 27 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 456150ef6111..21b903fe546e 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -4765,8 +4765,8 @@ ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
 	int order, i;
 	struct ext4_inode_info *ei = EXT4_I(ac->ac_inode);
 	struct ext4_locality_group *lg;
-	struct ext4_prealloc_space *tmp_pa, *cpa = NULL;
-	ext4_lblk_t tmp_pa_start, tmp_pa_end;
+	struct ext4_prealloc_space *tmp_pa = NULL, *cpa = NULL;
+	loff_t tmp_pa_end;
 	struct rb_node *iter;
 	ext4_fsblk_t goal_block;
 
@@ -4774,47 +4774,151 @@ ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
 	if (!(ac->ac_flags & EXT4_MB_HINT_DATA))
 		return false;
 
-	/* first, try per-file preallocation */
+	/*
+	 * first, try per-file preallocation by searching the inode pa rbtree.
+	 *
+	 * Here, we can't do a direct traversal of the tree because
+	 * ext4_mb_discard_group_preallocation() can paralelly mark the pa
+	 * deleted and that can cause direct traversal to skip some entries.
+	 */
 	read_lock(&ei->i_prealloc_lock);
+
+	if (RB_EMPTY_ROOT(&ei->i_prealloc_node)) {
+		goto try_group_pa;
+	}
+
+	/*
+	 * Step 1: Find a pa with logical start immediately adjacent to the
+	 * original logical start. This could be on the left or right.
+	 *
+	 * (tmp_pa->pa_lstart never changes so we can skip locking for it).
+	 */
 	for (iter = ei->i_prealloc_node.rb_node; iter;
 	     iter = ext4_mb_pa_rb_next_iter(ac->ac_o_ex.fe_logical,
-					    tmp_pa_start, iter)) {
+					    tmp_pa->pa_lstart, iter)) {
 		tmp_pa = rb_entry(iter, struct ext4_prealloc_space,
 				  pa_node.inode_node);
+	}
 
-		/* all fields in this condition don't change,
-		 * so we can skip locking for them */
-		tmp_pa_start = tmp_pa->pa_lstart;
-		tmp_pa_end = tmp_pa->pa_lstart + EXT4_C2B(sbi, tmp_pa->pa_len);
-
-		/* original request start doesn't lie in this PA */
-		if (ac->ac_o_ex.fe_logical < tmp_pa_start ||
-		    ac->ac_o_ex.fe_logical >= tmp_pa_end)
-			continue;
+	/*
+	 * Step 2: The adjacent pa might be to the right of logical start, find
+	 * the left adjacent pa. After this step we'd have a valid tmp_pa whose
+	 * logical start is towards the left of original request's logical start
+	 */
+	if (tmp_pa->pa_lstart > ac->ac_o_ex.fe_logical) {
+		struct rb_node *tmp;
+		tmp = rb_prev(&tmp_pa->pa_node.inode_node);
 
-		/* non-extent files can't have physical blocks past 2^32 */
-		if (!(ext4_test_inode_flag(ac->ac_inode, EXT4_INODE_EXTENTS)) &&
-		    (tmp_pa->pa_pstart + EXT4_C2B(sbi, tmp_pa->pa_len) >
-		     EXT4_MAX_BLOCK_FILE_PHYS)) {
+		if (tmp) {
+			tmp_pa = rb_entry(tmp, struct ext4_prealloc_space,
+					    pa_node.inode_node);
+		} else {
 			/*
-			 * Since PAs don't overlap, we won't find any
-			 * other PA to satisfy this.
+			 * If there is no adjacent pa to the left then finding
+			 * an overlapping pa is not possible hence stop searching
+			 * inode pa tree
 			 */
-			break;
+			goto try_group_pa;
 		}
+	}
 
-		/* found preallocated blocks, use them */
+	BUG_ON(!(tmp_pa && tmp_pa->pa_lstart <= ac->ac_o_ex.fe_logical));
+
+	/*
+	 * Step 3: If the left adjacent pa is deleted, keep moving left to find
+	 * the first non deleted adjacent pa. After this step we should have a
+	 * valid tmp_pa which is guaranteed to be non deleted.
+	 */
+	for (iter = &tmp_pa->pa_node.inode_node;; iter = rb_prev(iter)) {
+		if (!iter) {
+			/*
+			 * no non deleted left adjacent pa, so stop searching
+			 * inode pa tree
+			 */
+			goto try_group_pa;
+		}
+		tmp_pa = rb_entry(iter, struct ext4_prealloc_space,
+				  pa_node.inode_node);
 		spin_lock(&tmp_pa->pa_lock);
-		if (tmp_pa->pa_deleted == 0 && tmp_pa->pa_free &&
-		    likely(ext4_mb_pa_goal_check(ac, tmp_pa))) {
-			atomic_inc(&tmp_pa->pa_count);
-			ext4_mb_use_inode_pa(ac, tmp_pa);
+		if (tmp_pa->pa_deleted == 0) {
+			/*
+			 * We will keep holding the pa_lock from
+			 * this point on because we don't want group discard
+			 * to delete this pa underneath us. Since group
+			 * discard is anyways an ENOSPC operation it
+			 * should be okay for it to wait a few more cycles.
+			 */
+			break;
+		} else {
 			spin_unlock(&tmp_pa->pa_lock);
-			read_unlock(&ei->i_prealloc_lock);
-			return true;
 		}
+	}
+
+	BUG_ON(!(tmp_pa && tmp_pa->pa_lstart <= ac->ac_o_ex.fe_logical));
+	BUG_ON(tmp_pa->pa_deleted == 1);
+
+	/*
+	 * Step 4: We now have the non deleted left adjacent pa. Only this
+	 * pa can possibly satisfy the request hence check if it overlaps
+	 * original logical start and stop searching if it doesn't.
+	 */
+	tmp_pa_end = (loff_t)tmp_pa->pa_lstart + EXT4_C2B(sbi, tmp_pa->pa_len);
+
+	if (ac->ac_o_ex.fe_logical >= tmp_pa_end) {
+		spin_unlock(&tmp_pa->pa_lock);
+		goto try_group_pa;
+	}
+
+	/* non-extent files can't have physical blocks past 2^32 */
+	if (!(ext4_test_inode_flag(ac->ac_inode, EXT4_INODE_EXTENTS)) &&
+	    (tmp_pa->pa_pstart + EXT4_C2B(sbi, tmp_pa->pa_len) >
+	     EXT4_MAX_BLOCK_FILE_PHYS)) {
+		/*
+		 * Since PAs don't overlap, we won't find any other PA to
+		 * satisfy this.
+		 */
 		spin_unlock(&tmp_pa->pa_lock);
+		goto try_group_pa;
+	}
+
+	if (tmp_pa->pa_free && likely(ext4_mb_pa_goal_check(ac, tmp_pa))) {
+		atomic_inc(&tmp_pa->pa_count);
+		ext4_mb_use_inode_pa(ac, tmp_pa);
+		spin_unlock(&tmp_pa->pa_lock);
+		read_unlock(&ei->i_prealloc_lock);
+		return true;
+	} else {
+		/*
+		 * We found a valid overlapping pa but couldn't use it because
+		 * it had no free blocks. This should ideally never happen
+		 * because:
+		 *
+		 * 1. When a new inode pa is added to rbtree it must have
+		 *    pa_free > 0 since otherwise we won't actually need
+		 *    preallocation.
+		 *
+		 * 2. An inode pa that is in the rbtree can only have it's
+		 *    pa_free become zero when another thread calls:
+		 *      ext4_mb_new_blocks
+		 *       ext4_mb_use_preallocated
+		 *        ext4_mb_use_inode_pa
+		 *
+		 * 3. Further, after the above calls make pa_free == 0, we will
+		 *    immediately remove it from the rbtree in:
+		 *      ext4_mb_new_blocks
+		 *       ext4_mb_release_context
+		 *        ext4_mb_put_pa
+		 *
+		 * 4. Since the pa_free becoming 0 and pa_free getting removed
+		 * from tree both happen in ext4_mb_new_blocks, which is always
+		 * called with i_data_sem held for data allocations, we can be
+		 * sure that another process will never see a pa in rbtree with
+		 * pa_free == 0.
+		 */
+		WARN_ON_ONCE(tmp_pa->pa_free == 0);
 	}
+	spin_unlock(&tmp_pa->pa_lock);
+try_group_pa:
 	read_unlock(&ei->i_prealloc_lock);
 
 	/* can we use group allocation? */
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 3+ messages in thread

* Re: [PATCH 0/1] ext4: Fix regression in mballoc due to deleted inode PAs in rbtree
  2023-07-22 17:15 [PATCH 0/1] ext4: Fix regression in mballoc due to deleted inode PAs in rbtree Ojaswin Mujoo
  2023-07-22 17:15 ` [PATCH 1/1] ext4: Fix rbtree traversal bug in ext4_mb_use_preallocated Ojaswin Mujoo
@ 2023-07-23 12:32 ` Theodore Ts'o
  1 sibling, 0 replies; 3+ messages in thread
From: Theodore Ts'o @ 2023-07-23 12:32 UTC (permalink / raw)
  To: linux-ext4, Ojaswin Mujoo
  Cc: Theodore Ts'o, Ritesh Harjani, linux-kernel, Jan Kara,
	Naresh Kamboju


On Sat, 22 Jul 2023 22:45:23 +0530, Ojaswin Mujoo wrote:
> Recently there was a regression found in the mballoc [1] due the existence
> of deleted preallocations(PAs) in the per inode preallocation rbtree.
> Such deleted PAs can occur because ext4_mb_discard_group_preallocations
> traverses the grp->bb_prealloc_list and marks the PAs as deleted without
> taking any inode specific locks like i_data_sem.
> 
> Due to presence of such PAs, we were sometimes missing some of the pa
> entries when traversing the per node rbtree in ext4_mb_use_preallocated.
> Due to this, in some rare conditions we ended up missing a PA that did
> overlap with our original request start. when this happens, we exit
> ext4_mb_use_preallocated and proceed with the allocation. However,
> during ext4_mb_normalize_request() we were hitting a bug on as
> a PA that could satisfy our request already existed. Since normalize
> request was already fixed earlier to account for the deleted PAs we
> actually able to catch it there.
> 
> [...]

Applied, thanks!

[1/1] ext4: Fix rbtree traversal bug in ext4_mb_use_preallocated
      commit: 9d3de7ee192a6a253f475197fe4d2e2af10a731f

Best regards,
-- 
Theodore Ts'o <tytso@mit.edu>

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2023-07-23 12:38 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-07-22 17:15 [PATCH 0/1] ext4: Fix regression in mballoc due to deleted inode PAs in rbtree Ojaswin Mujoo
2023-07-22 17:15 ` [PATCH 1/1] ext4: Fix rbtree traversal bug in ext4_mb_use_preallocated Ojaswin Mujoo
2023-07-23 12:32 ` [PATCH 0/1] ext4: Fix regression in mballoc due to deleted inode PAs in rbtree Theodore Ts'o

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox