Linux Btrfs filesystem development
 help / color / mirror / Atom feed
* [PATCH v2 0/2] btrfs: fix logical_to_ino panic in btrfs_map_bio
@ 2023-05-31 16:22 Boris Burkov
  2023-05-31 16:22 ` [PATCH v2 1/2] btrfs: warn on invalid slot in tree mod log rewind Boris Burkov
  2023-05-31 16:22 ` [PATCH v2 2/2] btrfs: insert tree mod log move in push_node_left Boris Burkov
  0 siblings, 2 replies; 7+ messages in thread
From: Boris Burkov @ 2023-05-31 16:22 UTC (permalink / raw)
  To: linux-btrfs, kernel-team, fdmanana

The gory details are in the second patch, but it is possible to panic
the kernel by running the ioctl BTRFS_IOC_LOGICAL_INO (and V2 of that
ioctl).

The TL;DR of the problem is that we do not properly handle logging a
move from a push_node_left btree balancing operation in the tree mod
log, so it is possible for backref walking using the tree mod log to
construct an invalid extent_buffer and ultimately try to map invalid
bios for block 0 which ultimately hits a null pointer error and panics.

The patch set introduces additional bookkeeping in tree mod log to warn
on this issue and also fixes the issue itself.

---
Changelog:
v2:
- move WARN to before the bad memmove
- change WARN to WARN_ON + btrfs_warn
- fix tm freeing bug in tree_mod_log_insert_move
- unify error handling for tm alloc failures on setting tm=NULL after
  setting ret=PTR_ERR(tm) and then calling kfree unconditionally
- tidying/nits

Boris Burkov (2):
  btrfs: warn on invalid slot in tree mod log rewind
  btrfs: insert tree mod log move in push_node_left

 fs/btrfs/ctree.c        | 17 +++++---
 fs/btrfs/tree-mod-log.c | 94 +++++++++++++++++++++++++++++++++++++----
 2 files changed, 95 insertions(+), 16 deletions(-)

-- 
2.40.1


^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH v2 1/2] btrfs: warn on invalid slot in tree mod log rewind
  2023-05-31 16:22 [PATCH v2 0/2] btrfs: fix logical_to_ino panic in btrfs_map_bio Boris Burkov
@ 2023-05-31 16:22 ` Boris Burkov
  2023-06-01  9:28   ` Filipe Manana
  2023-05-31 16:22 ` [PATCH v2 2/2] btrfs: insert tree mod log move in push_node_left Boris Burkov
  1 sibling, 1 reply; 7+ messages in thread
From: Boris Burkov @ 2023-05-31 16:22 UTC (permalink / raw)
  To: linux-btrfs, kernel-team, fdmanana

The way that tree mod log tracks the ultimate length of the eb, the
variable 'n', eventually turns up the correct value, but at intermediate
steps during the rewind, n can be inaccurate as a representation of the
end of the eb. For example, it doesn't get updated on move rewinds, and
it does get updated for add/remove in the middle of the eb.

To detect cases with invalid moves, introduce a separate variable called
max_slot which tries to track the maximum valid slot in the rewind eb.
We can then warn if we do a move whose src range goes beyond the max
valid slot.

There is a commented caveat that it is possible to have this value be an
overestimate due to the challenge of properly handling 'add' operations
in the middle of the eb, but in practice it doesn't cause enough of a
problem to throw out the max idea in favor of tracking every valid slot.

Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/tree-mod-log.c | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/fs/btrfs/tree-mod-log.c b/fs/btrfs/tree-mod-log.c
index a555baa0143a..43f2ffa6f44d 100644
--- a/fs/btrfs/tree-mod-log.c
+++ b/fs/btrfs/tree-mod-log.c
@@ -664,8 +664,10 @@ static void tree_mod_log_rewind(struct btrfs_fs_info *fs_info,
 	unsigned long o_dst;
 	unsigned long o_src;
 	unsigned long p_size = sizeof(struct btrfs_key_ptr);
+	u32 max_slot;
 
 	n = btrfs_header_nritems(eb);
+	max_slot = n - 1;
 	read_lock(&fs_info->tree_mod_log_lock);
 	while (tm && tm->seq >= time_seq) {
 		/*
@@ -684,6 +686,8 @@ static void tree_mod_log_rewind(struct btrfs_fs_info *fs_info,
 			btrfs_set_node_ptr_generation(eb, tm->slot,
 						      tm->generation);
 			n++;
+			if (max_slot == (u32)-1 || tm->slot > max_slot)
+				max_slot = tm->slot;
 			break;
 		case BTRFS_MOD_LOG_KEY_REPLACE:
 			BUG_ON(tm->slot >= n);
@@ -693,12 +697,30 @@ static void tree_mod_log_rewind(struct btrfs_fs_info *fs_info,
 						      tm->generation);
 			break;
 		case BTRFS_MOD_LOG_KEY_ADD:
+			/*
+			 * It is possible we could have already removed keys behind the known
+			 * max slot, so this will be an overestimate. In practice, the copy
+			 * operation inserts them in increasing order, and overestimating just
+			 * means we miss some warnings, so it's OK. It isn't worth carefully
+			 * tracking the full array of valid slots to check against when moving.
+			 */
+			if (tm->slot == max_slot)
+				max_slot--;
 			/* if a move operation is needed it's in the log */
 			n--;
 			break;
 		case BTRFS_MOD_LOG_MOVE_KEYS:
 			o_dst = btrfs_node_key_ptr_offset(eb, tm->slot);
 			o_src = btrfs_node_key_ptr_offset(eb, tm->move.dst_slot);
+			if (WARN_ON((tm->move.dst_slot + tm->move.nr_items - 1 > max_slot) ||
+				    (max_slot == (u32)-1 && tm->move.nr_items > 0))) {
+				btrfs_warn(fs_info,
+					   "Move from invalid tree mod log slot eb %llu slot %d dst_slot %d nr_items %d seq %llu n %u max_slot %u\n",
+					   eb->start, tm->slot,
+					   tm->move.dst_slot, tm->move.nr_items,
+					   tm->seq, n, max_slot);
+
+			}
 			memmove_extent_buffer(eb, o_dst, o_src,
 					      tm->move.nr_items * p_size);
 			break;
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH v2 2/2] btrfs: insert tree mod log move in push_node_left
  2023-05-31 16:22 [PATCH v2 0/2] btrfs: fix logical_to_ino panic in btrfs_map_bio Boris Burkov
  2023-05-31 16:22 ` [PATCH v2 1/2] btrfs: warn on invalid slot in tree mod log rewind Boris Burkov
@ 2023-05-31 16:22 ` Boris Burkov
  2023-06-01  9:37   ` Filipe Manana
  2023-06-01 10:26   ` Filipe Manana
  1 sibling, 2 replies; 7+ messages in thread
From: Boris Burkov @ 2023-05-31 16:22 UTC (permalink / raw)
  To: linux-btrfs, kernel-team, fdmanana

There is a fairly unlikely race condition in tree mod log rewind that
can result in a kernel panic which has the following trace:

[ 1222530.569417] BTRFS critical (device sda3): unable to find logical 0
length 4096
[ 1222530.585809] BTRFS critical (device sda3): unable to find logical 0
length 4096
[ 1222530.602227] BUG: kernel NULL pointer dereference, address:
0000000000000002
[ 1222530.618003] #PF: supervisor read access in kernel mode
[ 1222530.629746] #PF: error_code(0x0000) - not-present page
[ 1222530.641491] PGD 0 P4D 0
[ 1222530.647480] Oops: 0000 [#1] SMP
[ 1222530.654812] CPU: 30 PID: 398973 Comm: below Kdump: loaded Tainted:
G S         O  K   5.12.0-0_fbk13_clang_7455_gb24de3bdb045 #1
[ 1222530.680772] Hardware name: Quanta Mono Lake-M.2 SATA
1HY9U9Z001G/Mono Lake-M.2 SATA, BIOS F20_3A15 08/16/2017
[ 1222530.703081] RIP: 0010:__btrfs_map_block+0xaa/0xd00
[ 1222530.714070] Code: 00 4c 8b 40 18 48 89 44 24 38 4c 8b 48 20 4d 01
c1 4d 39 e0 0f 87 85 03 00 00 4c 3b 4c 24 30 0f 82 7a 03 00 00 48 8b 44
24 38 <4c> 8b 40 18 4c 8b 60 70 48 8b 4c 24 30 4c 29 c1 4d 8b 6c 24 10
   48
   [ 1222530.755971] RSP: 0018:ffffc9002c2f7600 EFLAGS: 00010246
   [ 1222530.767894] RAX: ffffffffffffffea RBX: ffff888292e41000 RCX:
   f2702d8b8be15100
   [ 1222530.784058] RDX: ffff88885fda6fb8 RSI: ffff88885fd973c8 RDI:
   ffff88885fd973c8
   [ 1222530.800219] RBP: ffff888292e410d0 R08: ffffffff82fd7fd0 R09:
   00000000fffeffff
   [ 1222530.816380] R10: ffffffff82e57fd0 R11: ffffffff82e57d70 R12:
   0000000000000000
   [ 1222530.832541] R13: 0000000000001000 R14: 0000000000001000 R15:
   ffffc9002c2f76f0
   [ 1222530.848702] FS:  00007f38d64af000(0000)
   GS:ffff88885fd80000(0000) knlGS:0000000000000000
   [ 1222530.866978] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
   [ 1222530.880080] CR2: 0000000000000002 CR3: 00000002b6770004 CR4:
   00000000003706e0
   [ 1222530.896245] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
   0000000000000000
   [ 1222530.912407] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
   0000000000000400
   [ 1222530.928570] Call Trace:
   [ 1222530.934368]  ? btrfs_printk+0x13b/0x18c
   [ 1222530.943241]  ? btrfs_bio_counter_inc_blocked+0x3d/0x130
   [ 1222530.955191]  btrfs_map_bio+0x75/0x330
   [ 1222530.963683]  ? kmem_cache_alloc+0x12a/0x2d0
   [ 1222530.973322]  ? btrfs_submit_metadata_bio+0x63/0x100
   [ 1222530.984501]  btrfs_submit_metadata_bio+0xa4/0x100
   [ 1222530.995297]  submit_extent_page+0x30f/0x360
   [ 1222531.004940]  read_extent_buffer_pages+0x49e/0x6d0
   [ 1222531.015733]  ? submit_extent_page+0x360/0x360
   [ 1222531.025770]  btree_read_extent_buffer_pages+0x5f/0x150
   [ 1222531.037522]  read_tree_block+0x37/0x60
   [ 1222531.046202]  read_block_for_search+0x18b/0x410
   [ 1222531.056422]  btrfs_search_old_slot+0x198/0x2f0
   [ 1222531.066641]  resolve_indirect_ref+0xfe/0x6f0
   [ 1222531.076472]  ? ulist_alloc+0x31/0x60
   [ 1222531.084773]  ? kmem_cache_alloc_trace+0x12e/0x2b0
   [ 1222531.095569]  find_parent_nodes+0x720/0x1830
   [ 1222531.105222]  ? ulist_alloc+0x10/0x60
   [ 1222531.113530]  iterate_extent_inodes+0xea/0x370
   [ 1222531.123543]  ? btrfs_previous_extent_item+0x8f/0x110
   [ 1222531.134914]  ? btrfs_search_path_in_tree+0x240/0x240
   [ 1222531.146283]  iterate_inodes_from_logical+0x98/0xd0
   [ 1222531.157268]  ? btrfs_search_path_in_tree+0x240/0x240
   [ 1222531.168638]  btrfs_ioctl_logical_to_ino+0xd9/0x180
   [ 1222531.179622]  btrfs_ioctl+0xe2/0x2eb0

This occurs when logical inode resolution takes a tree mod log sequence
number, and then while backref walking hits a rewind on a busy node
which has the following sequence of tree mod log operations (numbers
filled in from a specific example, but they are somewhat arbitrary)

REMOVE_WHILE_FREEING slot 532
REMOVE_WHILE_FREEING slot 531
REMOVE_WHILE_FREEING slot 530
...
REMOVE_WHILE_FREEING slot 0
REMOVE slot 455
REMOVE slot 454
REMOVE slot 453
...
REMOVE slot 0
ADD slot 455
ADD slot 454
ADD slot 453
...
ADD slot 0
MOVE src slot 0 -> dst slot 456 nritems 533
REMOVE slot 455
REMOVE slot 454
REMOVE slot 453
...
REMOVE slot 0

When this sequence gets applied via btrfs_tree_mod_log_rewind, it
allocates a fresh rewind eb, and first inserts the correct key info for
the 533 elements, then overwrites the first 456 of them, then decrements
the count by 456 via the add ops, then rewinds the move by doing a
memmove from 456:988->0:532. We have never written anything past 532, so
that memmove writes garbage into the 0:532 range. In practice, this
results in a lot of fully 0 keys. The rewind then puts valid keys into
slots 0:455 with the last removes, but 456:532 are still invalid.

When search_old_slot uses this eb, if it uses one of those invalid
slots, it can then read the extent buffer and issue a bio for offset 0
which ultimately panics looking up extent mappings.

This bad tree mod log sequence gets generated when the node balancing
code happens to do a balance_node_right followed by a push_node_left
while logging in the tree mod log. Illustrated for ebs L and R (left and
right):

      L                 R
start:
[XXX|YYY|...]      [ZZZ|...|...]
balance_node_right:
[XXX|YYY|...]      [...|ZZZ|...] move Z to make room for Y
[XXX|...|...]      [YYY|ZZZ|...] copy Y from L to R
push_node_left:
[XXX|YYY|...]      [...|ZZZ|...] copy Y from R to L
[XXX|YYY|...]      [ZZZ|...|...] move Z into emptied space (NOT LOGGED!)

This is because balance_node_right logs a move, but push_node_left
explicitly doesn't. That is because logging the move would remove the
overwritten src < dst range in the right eb, which was already logged
when we called btrfs_tree_mod_log_eb_copy. The correct sequence would
include a move from 456:988 to 0:532 after remove 0:455 and before
removing 0:532. Reversing that sequence would entail creating keys for
0:532, then moving those keys out to 456:988, then creating more keys
for 0:455.

i.e.,
REMOVE_WHILE_FREEING slot 532
REMOVE_WHILE_FREEING slot 531
REMOVE_WHILE_FREEING slot 530
...
REMOVE_WHILE_FREEING slot 0
MOVE src slot 456 -> dst slot 0 nritems 533
REMOVE slot 455
REMOVE slot 454
REMOVE slot 453
...
REMOVE slot 0
ADD slot 455
ADD slot 454
ADD slot 453
...
ADD slot 0
MOVE src slot 0 -> dst slot 456 nritems 533
REMOVE slot 455
REMOVE slot 454
REMOVE slot 453
...
REMOVE slot 0

Fix this to log the move but avoid the double remove by putting all the
logging logic in btrfs_tree_mod_log_eb_copy which has enough information
to detect these cases and properly log moves, removes, and adds. Leave
btrfs_tree_mod_log_insert_move to handle insert_ptr and delete_ptr's
tree mod logging.

Unfortunately, this is quite difficult to reproduce, and I was only
able to reproduce it by adding sleeps in btrfs_search_old_slot that
would encourage more log rewinding during ino_to_logical ioctls. I was
able to hit the warning in the previous patch in the series without the
fix quite quickly, but not after this patch.

Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/ctree.c        | 17 ++++++----
 fs/btrfs/tree-mod-log.c | 72 +++++++++++++++++++++++++++++++++++------
 2 files changed, 73 insertions(+), 16 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 2f2071d64c52..5226a66ad3dd 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -2785,8 +2785,8 @@ static int push_node_left(struct btrfs_trans_handle *trans,
 
 	if (push_items < src_nritems) {
 		/*
-		 * Don't call btrfs_tree_mod_log_insert_move() here, key removal
-		 * was already fully logged by btrfs_tree_mod_log_eb_copy() above.
+		 * btrfs_tree_mod_log_eb_copy handles logging the move, so we
+		 * don't need to do an explicit tree mod log operation for it.
 		 */
 		memmove_extent_buffer(src, btrfs_node_key_ptr_offset(src, 0),
 				      btrfs_node_key_ptr_offset(src, push_items),
@@ -2847,12 +2847,14 @@ static int balance_node_right(struct btrfs_trans_handle *trans,
 		btrfs_abort_transaction(trans, ret);
 		return ret;
 	}
-	ret = btrfs_tree_mod_log_insert_move(dst, push_items, 0, dst_nritems);
-	BUG_ON(ret < 0);
+
+	/*
+	 * btrfs_tree_mod_log_eb_copy handles logging the move, so we
+	 * don't need to do an explicit tree mod log operation for it.
+	 */
 	memmove_extent_buffer(dst, btrfs_node_key_ptr_offset(dst, push_items),
-				      btrfs_node_key_ptr_offset(dst, 0),
-				      (dst_nritems) *
-				      sizeof(struct btrfs_key_ptr));
+			      btrfs_node_key_ptr_offset(dst, 0),
+			      (dst_nritems) * sizeof(struct btrfs_key_ptr));
 
 	ret = btrfs_tree_mod_log_eb_copy(dst, src, 0, src_nritems - push_items,
 					 push_items);
@@ -2860,6 +2862,7 @@ static int balance_node_right(struct btrfs_trans_handle *trans,
 		btrfs_abort_transaction(trans, ret);
 		return ret;
 	}
+
 	copy_extent_buffer(dst, src,
 			   btrfs_node_key_ptr_offset(dst, 0),
 			   btrfs_node_key_ptr_offset(src, src_nritems - push_items),
diff --git a/fs/btrfs/tree-mod-log.c b/fs/btrfs/tree-mod-log.c
index 43f2ffa6f44d..8d8e59557647 100644
--- a/fs/btrfs/tree-mod-log.c
+++ b/fs/btrfs/tree-mod-log.c
@@ -248,6 +248,25 @@ int btrfs_tree_mod_log_insert_key(struct extent_buffer *eb, int slot,
 	return ret;
 }
 
+static struct tree_mod_elem *tree_mod_log_alloc_move(struct extent_buffer *eb,
+						     int dst_slot, int src_slot,
+						     int nr_items)
+{
+	struct tree_mod_elem *tm;
+
+	tm = kzalloc(sizeof(*tm), GFP_NOFS);
+	if (!tm)
+		return ERR_PTR(-ENOMEM);
+
+	tm->logical = eb->start;
+	tm->slot = src_slot;
+	tm->move.dst_slot = dst_slot;
+	tm->move.nr_items = nr_items;
+	tm->op = BTRFS_MOD_LOG_MOVE_KEYS;
+
+	return tm;
+}
+
 int btrfs_tree_mod_log_insert_move(struct extent_buffer *eb,
 				   int dst_slot, int src_slot,
 				   int nr_items)
@@ -265,18 +284,13 @@ int btrfs_tree_mod_log_insert_move(struct extent_buffer *eb,
 	if (!tm_list)
 		return -ENOMEM;
 
-	tm = kzalloc(sizeof(*tm), GFP_NOFS);
-	if (!tm) {
-		ret = -ENOMEM;
+	tm = tree_mod_log_alloc_move(eb, dst_slot, src_slot, nr_items);
+	if (IS_ERR(tm)) {
+		ret = PTR_ERR(tm);
+		tm = NULL;
 		goto free_tms;
 	}
 
-	tm->logical = eb->start;
-	tm->slot = src_slot;
-	tm->move.dst_slot = dst_slot;
-	tm->move.nr_items = nr_items;
-	tm->op = BTRFS_MOD_LOG_MOVE_KEYS;
-
 	for (i = 0; i + dst_slot < src_slot && i < nr_items; i++) {
 		tm_list[i] = alloc_tree_mod_elem(eb, i + dst_slot,
 				BTRFS_MOD_LOG_KEY_REMOVE_WHILE_MOVING);
@@ -489,6 +503,10 @@ int btrfs_tree_mod_log_eb_copy(struct extent_buffer *dst,
 	struct tree_mod_elem **tm_list_add, **tm_list_rem;
 	int i;
 	bool locked = false;
+	struct tree_mod_elem *dst_move_tm = NULL;
+	struct tree_mod_elem *src_move_tm = NULL;
+	u32 dst_move_nr_items = btrfs_header_nritems(dst) - dst_offset;
+	u32 src_move_nr_items = btrfs_header_nritems(src) - (src_offset + nr_items);
 
 	if (!tree_mod_need_log(fs_info, NULL))
 		return 0;
@@ -501,6 +519,26 @@ int btrfs_tree_mod_log_eb_copy(struct extent_buffer *dst,
 	if (!tm_list)
 		return -ENOMEM;
 
+	if (dst_move_nr_items) {
+		dst_move_tm = tree_mod_log_alloc_move(dst, dst_offset + nr_items,
+						      dst_offset, dst_move_nr_items);
+		if (IS_ERR(dst_move_tm)) {
+			ret = PTR_ERR(dst_move_tm);
+			dst_move_tm = NULL;
+			goto free_tms;
+		}
+	}
+	if (src_move_nr_items) {
+		src_move_tm = tree_mod_log_alloc_move(src, src_offset,
+						      src_offset + nr_items,
+						      src_move_nr_items);
+		if (IS_ERR(src_move_tm)) {
+			ret = PTR_ERR(src_move_tm);
+			src_move_tm = NULL;
+			goto free_tms;
+		}
+	}
+
 	tm_list_add = tm_list;
 	tm_list_rem = tm_list + nr_items;
 	for (i = 0; i < nr_items; i++) {
@@ -523,6 +561,11 @@ int btrfs_tree_mod_log_eb_copy(struct extent_buffer *dst,
 		goto free_tms;
 	locked = true;
 
+	if (dst_move_tm) {
+		ret = tree_mod_log_insert(fs_info, dst_move_tm);
+		if (ret)
+			goto free_tms;
+	}
 	for (i = 0; i < nr_items; i++) {
 		ret = tree_mod_log_insert(fs_info, tm_list_rem[i]);
 		if (ret)
@@ -531,6 +574,11 @@ int btrfs_tree_mod_log_eb_copy(struct extent_buffer *dst,
 		if (ret)
 			goto free_tms;
 	}
+	if (src_move_tm) {
+		ret = tree_mod_log_insert(fs_info, src_move_tm);
+		if (ret)
+			goto free_tms;
+	}
 
 	write_unlock(&fs_info->tree_mod_log_lock);
 	kfree(tm_list);
@@ -538,6 +586,12 @@ int btrfs_tree_mod_log_eb_copy(struct extent_buffer *dst,
 	return 0;
 
 free_tms:
+	if (dst_move_tm && !RB_EMPTY_NODE(&dst_move_tm->node))
+		rb_erase(&dst_move_tm->node, &fs_info->tree_mod_log);
+	kfree(dst_move_tm);
+	if (src_move_tm && !RB_EMPTY_NODE(&src_move_tm->node))
+		rb_erase(&src_move_tm->node, &fs_info->tree_mod_log);
+	kfree(src_move_tm);
 	for (i = 0; i < nr_items * 2; i++) {
 		if (tm_list[i] && !RB_EMPTY_NODE(&tm_list[i]->node))
 			rb_erase(&tm_list[i]->node, &fs_info->tree_mod_log);
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH v2 1/2] btrfs: warn on invalid slot in tree mod log rewind
  2023-05-31 16:22 ` [PATCH v2 1/2] btrfs: warn on invalid slot in tree mod log rewind Boris Burkov
@ 2023-06-01  9:28   ` Filipe Manana
  2023-06-01 15:04     ` Boris Burkov
  0 siblings, 1 reply; 7+ messages in thread
From: Filipe Manana @ 2023-06-01  9:28 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs, kernel-team

On Wed, May 31, 2023 at 5:22 PM Boris Burkov <boris@bur.io> wrote:
>
> The way that tree mod log tracks the ultimate length of the eb, the
> variable 'n', eventually turns up the correct value, but at intermediate
> steps during the rewind, n can be inaccurate as a representation of the
> end of the eb. For example, it doesn't get updated on move rewinds, and
> it does get updated for add/remove in the middle of the eb.
>
> To detect cases with invalid moves, introduce a separate variable called
> max_slot which tries to track the maximum valid slot in the rewind eb.
> We can then warn if we do a move whose src range goes beyond the max
> valid slot.
>
> There is a commented caveat that it is possible to have this value be an
> overestimate due to the challenge of properly handling 'add' operations
> in the middle of the eb, but in practice it doesn't cause enough of a
> problem to throw out the max idea in favor of tracking every valid slot.
>
> Signed-off-by: Boris Burkov <boris@bur.io>
> ---
>  fs/btrfs/tree-mod-log.c | 22 ++++++++++++++++++++++
>  1 file changed, 22 insertions(+)
>
> diff --git a/fs/btrfs/tree-mod-log.c b/fs/btrfs/tree-mod-log.c
> index a555baa0143a..43f2ffa6f44d 100644
> --- a/fs/btrfs/tree-mod-log.c
> +++ b/fs/btrfs/tree-mod-log.c
> @@ -664,8 +664,10 @@ static void tree_mod_log_rewind(struct btrfs_fs_info *fs_info,
>         unsigned long o_dst;
>         unsigned long o_src;
>         unsigned long p_size = sizeof(struct btrfs_key_ptr);
> +       u32 max_slot;
>
>         n = btrfs_header_nritems(eb);
> +       max_slot = n - 1;
>         read_lock(&fs_info->tree_mod_log_lock);
>         while (tm && tm->seq >= time_seq) {
>                 /*
> @@ -684,6 +686,8 @@ static void tree_mod_log_rewind(struct btrfs_fs_info *fs_info,
>                         btrfs_set_node_ptr_generation(eb, tm->slot,
>                                                       tm->generation);
>                         n++;
> +                       if (max_slot == (u32)-1 || tm->slot > max_slot)

It would be better to document the intention of the special value
(u32)-1, and perhaps also let the type be
a signed integer and just check for -1 as meaning "no valid slot".

> +                               max_slot = tm->slot;
>                         break;
>                 case BTRFS_MOD_LOG_KEY_REPLACE:
>                         BUG_ON(tm->slot >= n);
> @@ -693,12 +697,30 @@ static void tree_mod_log_rewind(struct btrfs_fs_info *fs_info,
>                                                       tm->generation);
>                         break;
>                 case BTRFS_MOD_LOG_KEY_ADD:
> +                       /*
> +                        * It is possible we could have already removed keys behind the known
> +                        * max slot, so this will be an overestimate. In practice, the copy
> +                        * operation inserts them in increasing order, and overestimating just
> +                        * means we miss some warnings, so it's OK. It isn't worth carefully
> +                        * tracking the full array of valid slots to check against when moving.
> +                        */
> +                       if (tm->slot == max_slot)
> +                               max_slot--;
>                         /* if a move operation is needed it's in the log */
>                         n--;
>                         break;
>                 case BTRFS_MOD_LOG_MOVE_KEYS:
>                         o_dst = btrfs_node_key_ptr_offset(eb, tm->slot);
>                         o_src = btrfs_node_key_ptr_offset(eb, tm->move.dst_slot);
> +                       if (WARN_ON((tm->move.dst_slot + tm->move.nr_items - 1 > max_slot) ||
> +                                   (max_slot == (u32)-1 && tm->move.nr_items > 0))) {

Like v1, I'm still a bit at odds with the check for  tm->move.nr_items > 0

Such a case should never be possible, we should assert that
separately, plus when I read this it makes me think:
why do we check it only when max_slot is (u32)-1 but not for the
previous condition?
It makes me think what would happen to the first condition if it's 0
and dst_slot happens to be 0, we have an underflow, etc

Maybe add an ASSERT(tm->move.nr_items > 0) before.

Thanks.


> +                               btrfs_warn(fs_info,
> +                                          "Move from invalid tree mod log slot eb %llu slot %d dst_slot %d nr_items %d seq %llu n %u max_slot %u\n",
> +                                          eb->start, tm->slot,
> +                                          tm->move.dst_slot, tm->move.nr_items,
> +                                          tm->seq, n, max_slot);
> +
> +                       }
>                         memmove_extent_buffer(eb, o_dst, o_src,
>                                               tm->move.nr_items * p_size);
>                         break;
> --
> 2.40.1
>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v2 2/2] btrfs: insert tree mod log move in push_node_left
  2023-05-31 16:22 ` [PATCH v2 2/2] btrfs: insert tree mod log move in push_node_left Boris Burkov
@ 2023-06-01  9:37   ` Filipe Manana
  2023-06-01 10:26   ` Filipe Manana
  1 sibling, 0 replies; 7+ messages in thread
From: Filipe Manana @ 2023-06-01  9:37 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs, kernel-team

On Wed, May 31, 2023 at 5:22 PM Boris Burkov <boris@bur.io> wrote:
>
> There is a fairly unlikely race condition in tree mod log rewind that
> can result in a kernel panic which has the following trace:
>
> [ 1222530.569417] BTRFS critical (device sda3): unable to find logical 0
> length 4096
> [ 1222530.585809] BTRFS critical (device sda3): unable to find logical 0
> length 4096
> [ 1222530.602227] BUG: kernel NULL pointer dereference, address:
> 0000000000000002
> [ 1222530.618003] #PF: supervisor read access in kernel mode
> [ 1222530.629746] #PF: error_code(0x0000) - not-present page
> [ 1222530.641491] PGD 0 P4D 0
> [ 1222530.647480] Oops: 0000 [#1] SMP
> [ 1222530.654812] CPU: 30 PID: 398973 Comm: below Kdump: loaded Tainted:
> G S         O  K   5.12.0-0_fbk13_clang_7455_gb24de3bdb045 #1
> [ 1222530.680772] Hardware name: Quanta Mono Lake-M.2 SATA
> 1HY9U9Z001G/Mono Lake-M.2 SATA, BIOS F20_3A15 08/16/2017
> [ 1222530.703081] RIP: 0010:__btrfs_map_block+0xaa/0xd00
> [ 1222530.714070] Code: 00 4c 8b 40 18 48 89 44 24 38 4c 8b 48 20 4d 01
> c1 4d 39 e0 0f 87 85 03 00 00 4c 3b 4c 24 30 0f 82 7a 03 00 00 48 8b 44
> 24 38 <4c> 8b 40 18 4c 8b 60 70 48 8b 4c 24 30 4c 29 c1 4d 8b 6c 24 10
>    48
>    [ 1222530.755971] RSP: 0018:ffffc9002c2f7600 EFLAGS: 00010246
>    [ 1222530.767894] RAX: ffffffffffffffea RBX: ffff888292e41000 RCX:
>    f2702d8b8be15100
>    [ 1222530.784058] RDX: ffff88885fda6fb8 RSI: ffff88885fd973c8 RDI:
>    ffff88885fd973c8
>    [ 1222530.800219] RBP: ffff888292e410d0 R08: ffffffff82fd7fd0 R09:
>    00000000fffeffff
>    [ 1222530.816380] R10: ffffffff82e57fd0 R11: ffffffff82e57d70 R12:
>    0000000000000000
>    [ 1222530.832541] R13: 0000000000001000 R14: 0000000000001000 R15:
>    ffffc9002c2f76f0
>    [ 1222530.848702] FS:  00007f38d64af000(0000)
>    GS:ffff88885fd80000(0000) knlGS:0000000000000000
>    [ 1222530.866978] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>    [ 1222530.880080] CR2: 0000000000000002 CR3: 00000002b6770004 CR4:
>    00000000003706e0
>    [ 1222530.896245] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
>    0000000000000000
>    [ 1222530.912407] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
>    0000000000000400
>    [ 1222530.928570] Call Trace:
>    [ 1222530.934368]  ? btrfs_printk+0x13b/0x18c
>    [ 1222530.943241]  ? btrfs_bio_counter_inc_blocked+0x3d/0x130
>    [ 1222530.955191]  btrfs_map_bio+0x75/0x330
>    [ 1222530.963683]  ? kmem_cache_alloc+0x12a/0x2d0
>    [ 1222530.973322]  ? btrfs_submit_metadata_bio+0x63/0x100
>    [ 1222530.984501]  btrfs_submit_metadata_bio+0xa4/0x100
>    [ 1222530.995297]  submit_extent_page+0x30f/0x360
>    [ 1222531.004940]  read_extent_buffer_pages+0x49e/0x6d0
>    [ 1222531.015733]  ? submit_extent_page+0x360/0x360
>    [ 1222531.025770]  btree_read_extent_buffer_pages+0x5f/0x150
>    [ 1222531.037522]  read_tree_block+0x37/0x60
>    [ 1222531.046202]  read_block_for_search+0x18b/0x410
>    [ 1222531.056422]  btrfs_search_old_slot+0x198/0x2f0
>    [ 1222531.066641]  resolve_indirect_ref+0xfe/0x6f0
>    [ 1222531.076472]  ? ulist_alloc+0x31/0x60
>    [ 1222531.084773]  ? kmem_cache_alloc_trace+0x12e/0x2b0
>    [ 1222531.095569]  find_parent_nodes+0x720/0x1830
>    [ 1222531.105222]  ? ulist_alloc+0x10/0x60
>    [ 1222531.113530]  iterate_extent_inodes+0xea/0x370
>    [ 1222531.123543]  ? btrfs_previous_extent_item+0x8f/0x110
>    [ 1222531.134914]  ? btrfs_search_path_in_tree+0x240/0x240
>    [ 1222531.146283]  iterate_inodes_from_logical+0x98/0xd0
>    [ 1222531.157268]  ? btrfs_search_path_in_tree+0x240/0x240
>    [ 1222531.168638]  btrfs_ioctl_logical_to_ino+0xd9/0x180
>    [ 1222531.179622]  btrfs_ioctl+0xe2/0x2eb0
>
> This occurs when logical inode resolution takes a tree mod log sequence
> number, and then while backref walking hits a rewind on a busy node
> which has the following sequence of tree mod log operations (numbers
> filled in from a specific example, but they are somewhat arbitrary)
>
> REMOVE_WHILE_FREEING slot 532
> REMOVE_WHILE_FREEING slot 531
> REMOVE_WHILE_FREEING slot 530
> ...
> REMOVE_WHILE_FREEING slot 0
> REMOVE slot 455
> REMOVE slot 454
> REMOVE slot 453
> ...
> REMOVE slot 0
> ADD slot 455
> ADD slot 454
> ADD slot 453
> ...
> ADD slot 0
> MOVE src slot 0 -> dst slot 456 nritems 533
> REMOVE slot 455
> REMOVE slot 454
> REMOVE slot 453
> ...
> REMOVE slot 0
>
> When this sequence gets applied via btrfs_tree_mod_log_rewind, it
> allocates a fresh rewind eb, and first inserts the correct key info for
> the 533 elements, then overwrites the first 456 of them, then decrements
> the count by 456 via the add ops, then rewinds the move by doing a
> memmove from 456:988->0:532. We have never written anything past 532, so
> that memmove writes garbage into the 0:532 range. In practice, this
> results in a lot of fully 0 keys. The rewind then puts valid keys into
> slots 0:455 with the last removes, but 456:532 are still invalid.
>
> When search_old_slot uses this eb, if it uses one of those invalid
> slots, it can then read the extent buffer and issue a bio for offset 0
> which ultimately panics looking up extent mappings.
>
> This bad tree mod log sequence gets generated when the node balancing
> code happens to do a balance_node_right followed by a push_node_left
> while logging in the tree mod log. Illustrated for ebs L and R (left and
> right):
>
>       L                 R
> start:
> [XXX|YYY|...]      [ZZZ|...|...]
> balance_node_right:
> [XXX|YYY|...]      [...|ZZZ|...] move Z to make room for Y
> [XXX|...|...]      [YYY|ZZZ|...] copy Y from L to R
> push_node_left:
> [XXX|YYY|...]      [...|ZZZ|...] copy Y from R to L
> [XXX|YYY|...]      [ZZZ|...|...] move Z into emptied space (NOT LOGGED!)
>
> This is because balance_node_right logs a move, but push_node_left
> explicitly doesn't. That is because logging the move would remove the
> overwritten src < dst range in the right eb, which was already logged
> when we called btrfs_tree_mod_log_eb_copy. The correct sequence would
> include a move from 456:988 to 0:532 after remove 0:455 and before
> removing 0:532. Reversing that sequence would entail creating keys for
> 0:532, then moving those keys out to 456:988, then creating more keys
> for 0:455.
>
> i.e.,
> REMOVE_WHILE_FREEING slot 532
> REMOVE_WHILE_FREEING slot 531
> REMOVE_WHILE_FREEING slot 530
> ...
> REMOVE_WHILE_FREEING slot 0
> MOVE src slot 456 -> dst slot 0 nritems 533
> REMOVE slot 455
> REMOVE slot 454
> REMOVE slot 453
> ...
> REMOVE slot 0
> ADD slot 455
> ADD slot 454
> ADD slot 453
> ...
> ADD slot 0
> MOVE src slot 0 -> dst slot 456 nritems 533
> REMOVE slot 455
> REMOVE slot 454
> REMOVE slot 453
> ...
> REMOVE slot 0
>
> Fix this to log the move but avoid the double remove by putting all the
> logging logic in btrfs_tree_mod_log_eb_copy which has enough information
> to detect these cases and properly log moves, removes, and adds. Leave
> btrfs_tree_mod_log_insert_move to handle insert_ptr and delete_ptr's
> tree mod logging.
>
> Unfortunately, this is quite difficult to reproduce, and I was only
> able to reproduce it by adding sleeps in btrfs_search_old_slot that
> would encourage more log rewinding during ino_to_logical ioctls. I was
> able to hit the warning in the previous patch in the series without the
> fix quite quickly, but not after this patch.
>
> Signed-off-by: Boris Burkov <boris@bur.io>
> ---
>  fs/btrfs/ctree.c        | 17 ++++++----
>  fs/btrfs/tree-mod-log.c | 72 +++++++++++++++++++++++++++++++++++------
>  2 files changed, 73 insertions(+), 16 deletions(-)
>
> diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
> index 2f2071d64c52..5226a66ad3dd 100644
> --- a/fs/btrfs/ctree.c
> +++ b/fs/btrfs/ctree.c
> @@ -2785,8 +2785,8 @@ static int push_node_left(struct btrfs_trans_handle *trans,
>
>         if (push_items < src_nritems) {
>                 /*
> -                * Don't call btrfs_tree_mod_log_insert_move() here, key removal
> -                * was already fully logged by btrfs_tree_mod_log_eb_copy() above.
> +                * btrfs_tree_mod_log_eb_copy handles logging the move, so we
> +                * don't need to do an explicit tree mod log operation for it.
>                  */
>                 memmove_extent_buffer(src, btrfs_node_key_ptr_offset(src, 0),
>                                       btrfs_node_key_ptr_offset(src, push_items),
> @@ -2847,12 +2847,14 @@ static int balance_node_right(struct btrfs_trans_handle *trans,
>                 btrfs_abort_transaction(trans, ret);
>                 return ret;
>         }
> -       ret = btrfs_tree_mod_log_insert_move(dst, push_items, 0, dst_nritems);
> -       BUG_ON(ret < 0);
> +
> +       /*
> +        * btrfs_tree_mod_log_eb_copy handles logging the move, so we
> +        * don't need to do an explicit tree mod log operation for it.
> +        */
>         memmove_extent_buffer(dst, btrfs_node_key_ptr_offset(dst, push_items),
> -                                     btrfs_node_key_ptr_offset(dst, 0),
> -                                     (dst_nritems) *
> -                                     sizeof(struct btrfs_key_ptr));
> +                             btrfs_node_key_ptr_offset(dst, 0),
> +                             (dst_nritems) * sizeof(struct btrfs_key_ptr));
>
>         ret = btrfs_tree_mod_log_eb_copy(dst, src, 0, src_nritems - push_items,
>                                          push_items);
> @@ -2860,6 +2862,7 @@ static int balance_node_right(struct btrfs_trans_handle *trans,
>                 btrfs_abort_transaction(trans, ret);
>                 return ret;
>         }
> +
>         copy_extent_buffer(dst, src,
>                            btrfs_node_key_ptr_offset(dst, 0),
>                            btrfs_node_key_ptr_offset(src, src_nritems - push_items),
> diff --git a/fs/btrfs/tree-mod-log.c b/fs/btrfs/tree-mod-log.c
> index 43f2ffa6f44d..8d8e59557647 100644
> --- a/fs/btrfs/tree-mod-log.c
> +++ b/fs/btrfs/tree-mod-log.c
> @@ -248,6 +248,25 @@ int btrfs_tree_mod_log_insert_key(struct extent_buffer *eb, int slot,
>         return ret;
>  }
>
> +static struct tree_mod_elem *tree_mod_log_alloc_move(struct extent_buffer *eb,
> +                                                    int dst_slot, int src_slot,
> +                                                    int nr_items)
> +{
> +       struct tree_mod_elem *tm;
> +
> +       tm = kzalloc(sizeof(*tm), GFP_NOFS);
> +       if (!tm)
> +               return ERR_PTR(-ENOMEM);
> +
> +       tm->logical = eb->start;
> +       tm->slot = src_slot;
> +       tm->move.dst_slot = dst_slot;
> +       tm->move.nr_items = nr_items;
> +       tm->op = BTRFS_MOD_LOG_MOVE_KEYS;

Here we also need a RB_CLEAR_NODE(&tm->node), because otherwise... see
further below.

> +
> +       return tm;
> +}
> +
>  int btrfs_tree_mod_log_insert_move(struct extent_buffer *eb,
>                                    int dst_slot, int src_slot,
>                                    int nr_items)
> @@ -265,18 +284,13 @@ int btrfs_tree_mod_log_insert_move(struct extent_buffer *eb,
>         if (!tm_list)
>                 return -ENOMEM;
>
> -       tm = kzalloc(sizeof(*tm), GFP_NOFS);
> -       if (!tm) {
> -               ret = -ENOMEM;
> +       tm = tree_mod_log_alloc_move(eb, dst_slot, src_slot, nr_items);
> +       if (IS_ERR(tm)) {
> +               ret = PTR_ERR(tm);
> +               tm = NULL;
>                 goto free_tms;
>         }
>
> -       tm->logical = eb->start;
> -       tm->slot = src_slot;
> -       tm->move.dst_slot = dst_slot;
> -       tm->move.nr_items = nr_items;
> -       tm->op = BTRFS_MOD_LOG_MOVE_KEYS;
> -
>         for (i = 0; i + dst_slot < src_slot && i < nr_items; i++) {
>                 tm_list[i] = alloc_tree_mod_elem(eb, i + dst_slot,
>                                 BTRFS_MOD_LOG_KEY_REMOVE_WHILE_MOVING);
> @@ -489,6 +503,10 @@ int btrfs_tree_mod_log_eb_copy(struct extent_buffer *dst,
>         struct tree_mod_elem **tm_list_add, **tm_list_rem;
>         int i;
>         bool locked = false;
> +       struct tree_mod_elem *dst_move_tm = NULL;
> +       struct tree_mod_elem *src_move_tm = NULL;
> +       u32 dst_move_nr_items = btrfs_header_nritems(dst) - dst_offset;
> +       u32 src_move_nr_items = btrfs_header_nritems(src) - (src_offset + nr_items);
>
>         if (!tree_mod_need_log(fs_info, NULL))
>                 return 0;
> @@ -501,6 +519,26 @@ int btrfs_tree_mod_log_eb_copy(struct extent_buffer *dst,
>         if (!tm_list)
>                 return -ENOMEM;
>
> +       if (dst_move_nr_items) {
> +               dst_move_tm = tree_mod_log_alloc_move(dst, dst_offset + nr_items,
> +                                                     dst_offset, dst_move_nr_items);
> +               if (IS_ERR(dst_move_tm)) {
> +                       ret = PTR_ERR(dst_move_tm);
> +                       dst_move_tm = NULL;
> +                       goto free_tms;
> +               }
> +       }
> +       if (src_move_nr_items) {
> +               src_move_tm = tree_mod_log_alloc_move(src, src_offset,
> +                                                     src_offset + nr_items,
> +                                                     src_move_nr_items);
> +               if (IS_ERR(src_move_tm)) {
> +                       ret = PTR_ERR(src_move_tm);
> +                       src_move_tm = NULL;
> +                       goto free_tms;
> +               }
> +       }
> +
>         tm_list_add = tm_list;
>         tm_list_rem = tm_list + nr_items;
>         for (i = 0; i < nr_items; i++) {
> @@ -523,6 +561,11 @@ int btrfs_tree_mod_log_eb_copy(struct extent_buffer *dst,
>                 goto free_tms;
>         locked = true;
>
> +       if (dst_move_tm) {
> +               ret = tree_mod_log_insert(fs_info, dst_move_tm);
> +               if (ret)
> +                       goto free_tms;
> +       }
>         for (i = 0; i < nr_items; i++) {
>                 ret = tree_mod_log_insert(fs_info, tm_list_rem[i]);
>                 if (ret)
> @@ -531,6 +574,11 @@ int btrfs_tree_mod_log_eb_copy(struct extent_buffer *dst,
>                 if (ret)
>                         goto free_tms;
>         }
> +       if (src_move_tm) {
> +               ret = tree_mod_log_insert(fs_info, src_move_tm);
> +               if (ret)
> +                       goto free_tms;
> +       }
>
>         write_unlock(&fs_info->tree_mod_log_lock);
>         kfree(tm_list);
> @@ -538,6 +586,12 @@ int btrfs_tree_mod_log_eb_copy(struct extent_buffer *dst,
>         return 0;
>
>  free_tms:
> +       if (dst_move_tm && !RB_EMPTY_NODE(&dst_move_tm->node))
> +               rb_erase(&dst_move_tm->node, &fs_info->tree_mod_log);
> +       kfree(dst_move_tm);
> +       if (src_move_tm && !RB_EMPTY_NODE(&src_move_tm->node))
> +               rb_erase(&src_move_tm->node, &fs_info->tree_mod_log);

So if we end up in this error path with dst_move_tm or src_move_tm
allocated but not yet
inserted in the rbtree, the RB_EMPTY_NODE() call will return true, due
to the missing initialization of the ->node field,
and therefore we will call rb_erase() for a node that's not in the
rbtree, which will cause some chaos.

Thanks.

> +       kfree(src_move_tm);
>         for (i = 0; i < nr_items * 2; i++) {
>                 if (tm_list[i] && !RB_EMPTY_NODE(&tm_list[i]->node))
>                         rb_erase(&tm_list[i]->node, &fs_info->tree_mod_log);
> --
> 2.40.1
>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v2 2/2] btrfs: insert tree mod log move in push_node_left
  2023-05-31 16:22 ` [PATCH v2 2/2] btrfs: insert tree mod log move in push_node_left Boris Burkov
  2023-06-01  9:37   ` Filipe Manana
@ 2023-06-01 10:26   ` Filipe Manana
  1 sibling, 0 replies; 7+ messages in thread
From: Filipe Manana @ 2023-06-01 10:26 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs, kernel-team

On Wed, May 31, 2023 at 5:22 PM Boris Burkov <boris@bur.io> wrote:
>
> There is a fairly unlikely race condition in tree mod log rewind that
> can result in a kernel panic which has the following trace:
>
> [ 1222530.569417] BTRFS critical (device sda3): unable to find logical 0
> length 4096
> [ 1222530.585809] BTRFS critical (device sda3): unable to find logical 0
> length 4096
> [ 1222530.602227] BUG: kernel NULL pointer dereference, address:
> 0000000000000002
> [ 1222530.618003] #PF: supervisor read access in kernel mode
> [ 1222530.629746] #PF: error_code(0x0000) - not-present page
> [ 1222530.641491] PGD 0 P4D 0
> [ 1222530.647480] Oops: 0000 [#1] SMP
> [ 1222530.654812] CPU: 30 PID: 398973 Comm: below Kdump: loaded Tainted:
> G S         O  K   5.12.0-0_fbk13_clang_7455_gb24de3bdb045 #1
> [ 1222530.680772] Hardware name: Quanta Mono Lake-M.2 SATA
> 1HY9U9Z001G/Mono Lake-M.2 SATA, BIOS F20_3A15 08/16/2017
> [ 1222530.703081] RIP: 0010:__btrfs_map_block+0xaa/0xd00
> [ 1222530.714070] Code: 00 4c 8b 40 18 48 89 44 24 38 4c 8b 48 20 4d 01
> c1 4d 39 e0 0f 87 85 03 00 00 4c 3b 4c 24 30 0f 82 7a 03 00 00 48 8b 44
> 24 38 <4c> 8b 40 18 4c 8b 60 70 48 8b 4c 24 30 4c 29 c1 4d 8b 6c 24 10
>    48
>    [ 1222530.755971] RSP: 0018:ffffc9002c2f7600 EFLAGS: 00010246
>    [ 1222530.767894] RAX: ffffffffffffffea RBX: ffff888292e41000 RCX:
>    f2702d8b8be15100
>    [ 1222530.784058] RDX: ffff88885fda6fb8 RSI: ffff88885fd973c8 RDI:
>    ffff88885fd973c8
>    [ 1222530.800219] RBP: ffff888292e410d0 R08: ffffffff82fd7fd0 R09:
>    00000000fffeffff
>    [ 1222530.816380] R10: ffffffff82e57fd0 R11: ffffffff82e57d70 R12:
>    0000000000000000
>    [ 1222530.832541] R13: 0000000000001000 R14: 0000000000001000 R15:
>    ffffc9002c2f76f0
>    [ 1222530.848702] FS:  00007f38d64af000(0000)
>    GS:ffff88885fd80000(0000) knlGS:0000000000000000
>    [ 1222530.866978] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>    [ 1222530.880080] CR2: 0000000000000002 CR3: 00000002b6770004 CR4:
>    00000000003706e0
>    [ 1222530.896245] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
>    0000000000000000
>    [ 1222530.912407] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
>    0000000000000400
>    [ 1222530.928570] Call Trace:
>    [ 1222530.934368]  ? btrfs_printk+0x13b/0x18c
>    [ 1222530.943241]  ? btrfs_bio_counter_inc_blocked+0x3d/0x130
>    [ 1222530.955191]  btrfs_map_bio+0x75/0x330
>    [ 1222530.963683]  ? kmem_cache_alloc+0x12a/0x2d0
>    [ 1222530.973322]  ? btrfs_submit_metadata_bio+0x63/0x100
>    [ 1222530.984501]  btrfs_submit_metadata_bio+0xa4/0x100
>    [ 1222530.995297]  submit_extent_page+0x30f/0x360
>    [ 1222531.004940]  read_extent_buffer_pages+0x49e/0x6d0
>    [ 1222531.015733]  ? submit_extent_page+0x360/0x360
>    [ 1222531.025770]  btree_read_extent_buffer_pages+0x5f/0x150
>    [ 1222531.037522]  read_tree_block+0x37/0x60
>    [ 1222531.046202]  read_block_for_search+0x18b/0x410
>    [ 1222531.056422]  btrfs_search_old_slot+0x198/0x2f0
>    [ 1222531.066641]  resolve_indirect_ref+0xfe/0x6f0
>    [ 1222531.076472]  ? ulist_alloc+0x31/0x60
>    [ 1222531.084773]  ? kmem_cache_alloc_trace+0x12e/0x2b0
>    [ 1222531.095569]  find_parent_nodes+0x720/0x1830
>    [ 1222531.105222]  ? ulist_alloc+0x10/0x60
>    [ 1222531.113530]  iterate_extent_inodes+0xea/0x370
>    [ 1222531.123543]  ? btrfs_previous_extent_item+0x8f/0x110
>    [ 1222531.134914]  ? btrfs_search_path_in_tree+0x240/0x240
>    [ 1222531.146283]  iterate_inodes_from_logical+0x98/0xd0
>    [ 1222531.157268]  ? btrfs_search_path_in_tree+0x240/0x240
>    [ 1222531.168638]  btrfs_ioctl_logical_to_ino+0xd9/0x180
>    [ 1222531.179622]  btrfs_ioctl+0xe2/0x2eb0
>
> This occurs when logical inode resolution takes a tree mod log sequence
> number, and then while backref walking hits a rewind on a busy node
> which has the following sequence of tree mod log operations (numbers
> filled in from a specific example, but they are somewhat arbitrary)
>
> REMOVE_WHILE_FREEING slot 532
> REMOVE_WHILE_FREEING slot 531
> REMOVE_WHILE_FREEING slot 530
> ...
> REMOVE_WHILE_FREEING slot 0
> REMOVE slot 455
> REMOVE slot 454
> REMOVE slot 453
> ...
> REMOVE slot 0
> ADD slot 455
> ADD slot 454
> ADD slot 453
> ...
> ADD slot 0
> MOVE src slot 0 -> dst slot 456 nritems 533
> REMOVE slot 455
> REMOVE slot 454
> REMOVE slot 453
> ...
> REMOVE slot 0
>
> When this sequence gets applied via btrfs_tree_mod_log_rewind, it
> allocates a fresh rewind eb, and first inserts the correct key info for
> the 533 elements, then overwrites the first 456 of them, then decrements
> the count by 456 via the add ops, then rewinds the move by doing a
> memmove from 456:988->0:532. We have never written anything past 532, so
> that memmove writes garbage into the 0:532 range. In practice, this
> results in a lot of fully 0 keys. The rewind then puts valid keys into
> slots 0:455 with the last removes, but 456:532 are still invalid.
>
> When search_old_slot uses this eb, if it uses one of those invalid
> slots, it can then read the extent buffer and issue a bio for offset 0
> which ultimately panics looking up extent mappings.
>
> This bad tree mod log sequence gets generated when the node balancing
> code happens to do a balance_node_right followed by a push_node_left
> while logging in the tree mod log. Illustrated for ebs L and R (left and
> right):
>
>       L                 R
> start:
> [XXX|YYY|...]      [ZZZ|...|...]
> balance_node_right:
> [XXX|YYY|...]      [...|ZZZ|...] move Z to make room for Y
> [XXX|...|...]      [YYY|ZZZ|...] copy Y from L to R
> push_node_left:
> [XXX|YYY|...]      [...|ZZZ|...] copy Y from R to L
> [XXX|YYY|...]      [ZZZ|...|...] move Z into emptied space (NOT LOGGED!)
>
> This is because balance_node_right logs a move, but push_node_left
> explicitly doesn't. That is because logging the move would remove the
> overwritten src < dst range in the right eb, which was already logged
> when we called btrfs_tree_mod_log_eb_copy. The correct sequence would
> include a move from 456:988 to 0:532 after remove 0:455 and before
> removing 0:532. Reversing that sequence would entail creating keys for
> 0:532, then moving those keys out to 456:988, then creating more keys
> for 0:455.
>
> i.e.,
> REMOVE_WHILE_FREEING slot 532
> REMOVE_WHILE_FREEING slot 531
> REMOVE_WHILE_FREEING slot 530
> ...
> REMOVE_WHILE_FREEING slot 0
> MOVE src slot 456 -> dst slot 0 nritems 533
> REMOVE slot 455
> REMOVE slot 454
> REMOVE slot 453
> ...
> REMOVE slot 0
> ADD slot 455
> ADD slot 454
> ADD slot 453
> ...
> ADD slot 0
> MOVE src slot 0 -> dst slot 456 nritems 533
> REMOVE slot 455
> REMOVE slot 454
> REMOVE slot 453
> ...
> REMOVE slot 0
>
> Fix this to log the move but avoid the double remove by putting all the
> logging logic in btrfs_tree_mod_log_eb_copy which has enough information
> to detect these cases and properly log moves, removes, and adds. Leave
> btrfs_tree_mod_log_insert_move to handle insert_ptr and delete_ptr's
> tree mod logging.
>
> Unfortunately, this is quite difficult to reproduce, and I was only
> able to reproduce it by adding sleeps in btrfs_search_old_slot that
> would encourage more log rewinding during ino_to_logical ioctls. I was
> able to hit the warning in the previous patch in the series without the
> fix quite quickly, but not after this patch.
>
> Signed-off-by: Boris Burkov <boris@bur.io>
> ---
>  fs/btrfs/ctree.c        | 17 ++++++----
>  fs/btrfs/tree-mod-log.c | 72 +++++++++++++++++++++++++++++++++++------
>  2 files changed, 73 insertions(+), 16 deletions(-)
>
> diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
> index 2f2071d64c52..5226a66ad3dd 100644
> --- a/fs/btrfs/ctree.c
> +++ b/fs/btrfs/ctree.c
> @@ -2785,8 +2785,8 @@ static int push_node_left(struct btrfs_trans_handle *trans,
>
>         if (push_items < src_nritems) {
>                 /*
> -                * Don't call btrfs_tree_mod_log_insert_move() here, key removal
> -                * was already fully logged by btrfs_tree_mod_log_eb_copy() above.
> +                * btrfs_tree_mod_log_eb_copy handles logging the move, so we
> +                * don't need to do an explicit tree mod log operation for it.
>                  */
>                 memmove_extent_buffer(src, btrfs_node_key_ptr_offset(src, 0),
>                                       btrfs_node_key_ptr_offset(src, push_items),
> @@ -2847,12 +2847,14 @@ static int balance_node_right(struct btrfs_trans_handle *trans,
>                 btrfs_abort_transaction(trans, ret);
>                 return ret;
>         }
> -       ret = btrfs_tree_mod_log_insert_move(dst, push_items, 0, dst_nritems);
> -       BUG_ON(ret < 0);
> +
> +       /*
> +        * btrfs_tree_mod_log_eb_copy handles logging the move, so we
> +        * don't need to do an explicit tree mod log operation for it.
> +        */
>         memmove_extent_buffer(dst, btrfs_node_key_ptr_offset(dst, push_items),
> -                                     btrfs_node_key_ptr_offset(dst, 0),
> -                                     (dst_nritems) *
> -                                     sizeof(struct btrfs_key_ptr));
> +                             btrfs_node_key_ptr_offset(dst, 0),
> +                             (dst_nritems) * sizeof(struct btrfs_key_ptr));

Besides the error path bug with rb_erase() mentioned before, there's
also this unnecessary
and unrelated change, just changing indentation without changing anything else.
We usually only do this if we change the actual code.

>
>         ret = btrfs_tree_mod_log_eb_copy(dst, src, 0, src_nritems - push_items,
>                                          push_items);
> @@ -2860,6 +2862,7 @@ static int balance_node_right(struct btrfs_trans_handle *trans,
>                 btrfs_abort_transaction(trans, ret);
>                 return ret;
>         }
> +

Same here, a stray newline out of nowhere.

Thanks.

>         copy_extent_buffer(dst, src,
>                            btrfs_node_key_ptr_offset(dst, 0),
>                            btrfs_node_key_ptr_offset(src, src_nritems - push_items),
> diff --git a/fs/btrfs/tree-mod-log.c b/fs/btrfs/tree-mod-log.c
> index 43f2ffa6f44d..8d8e59557647 100644
> --- a/fs/btrfs/tree-mod-log.c
> +++ b/fs/btrfs/tree-mod-log.c
> @@ -248,6 +248,25 @@ int btrfs_tree_mod_log_insert_key(struct extent_buffer *eb, int slot,
>         return ret;
>  }
>
> +static struct tree_mod_elem *tree_mod_log_alloc_move(struct extent_buffer *eb,
> +                                                    int dst_slot, int src_slot,
> +                                                    int nr_items)
> +{
> +       struct tree_mod_elem *tm;
> +
> +       tm = kzalloc(sizeof(*tm), GFP_NOFS);
> +       if (!tm)
> +               return ERR_PTR(-ENOMEM);
> +
> +       tm->logical = eb->start;
> +       tm->slot = src_slot;
> +       tm->move.dst_slot = dst_slot;
> +       tm->move.nr_items = nr_items;
> +       tm->op = BTRFS_MOD_LOG_MOVE_KEYS;
> +
> +       return tm;
> +}
> +
>  int btrfs_tree_mod_log_insert_move(struct extent_buffer *eb,
>                                    int dst_slot, int src_slot,
>                                    int nr_items)
> @@ -265,18 +284,13 @@ int btrfs_tree_mod_log_insert_move(struct extent_buffer *eb,
>         if (!tm_list)
>                 return -ENOMEM;
>
> -       tm = kzalloc(sizeof(*tm), GFP_NOFS);
> -       if (!tm) {
> -               ret = -ENOMEM;
> +       tm = tree_mod_log_alloc_move(eb, dst_slot, src_slot, nr_items);
> +       if (IS_ERR(tm)) {
> +               ret = PTR_ERR(tm);
> +               tm = NULL;
>                 goto free_tms;
>         }
>
> -       tm->logical = eb->start;
> -       tm->slot = src_slot;
> -       tm->move.dst_slot = dst_slot;
> -       tm->move.nr_items = nr_items;
> -       tm->op = BTRFS_MOD_LOG_MOVE_KEYS;
> -
>         for (i = 0; i + dst_slot < src_slot && i < nr_items; i++) {
>                 tm_list[i] = alloc_tree_mod_elem(eb, i + dst_slot,
>                                 BTRFS_MOD_LOG_KEY_REMOVE_WHILE_MOVING);
> @@ -489,6 +503,10 @@ int btrfs_tree_mod_log_eb_copy(struct extent_buffer *dst,
>         struct tree_mod_elem **tm_list_add, **tm_list_rem;
>         int i;
>         bool locked = false;
> +       struct tree_mod_elem *dst_move_tm = NULL;
> +       struct tree_mod_elem *src_move_tm = NULL;
> +       u32 dst_move_nr_items = btrfs_header_nritems(dst) - dst_offset;
> +       u32 src_move_nr_items = btrfs_header_nritems(src) - (src_offset + nr_items);
>
>         if (!tree_mod_need_log(fs_info, NULL))
>                 return 0;
> @@ -501,6 +519,26 @@ int btrfs_tree_mod_log_eb_copy(struct extent_buffer *dst,
>         if (!tm_list)
>                 return -ENOMEM;
>
> +       if (dst_move_nr_items) {
> +               dst_move_tm = tree_mod_log_alloc_move(dst, dst_offset + nr_items,
> +                                                     dst_offset, dst_move_nr_items);
> +               if (IS_ERR(dst_move_tm)) {
> +                       ret = PTR_ERR(dst_move_tm);
> +                       dst_move_tm = NULL;
> +                       goto free_tms;
> +               }
> +       }
> +       if (src_move_nr_items) {
> +               src_move_tm = tree_mod_log_alloc_move(src, src_offset,
> +                                                     src_offset + nr_items,
> +                                                     src_move_nr_items);
> +               if (IS_ERR(src_move_tm)) {
> +                       ret = PTR_ERR(src_move_tm);
> +                       src_move_tm = NULL;
> +                       goto free_tms;
> +               }
> +       }
> +
>         tm_list_add = tm_list;
>         tm_list_rem = tm_list + nr_items;
>         for (i = 0; i < nr_items; i++) {
> @@ -523,6 +561,11 @@ int btrfs_tree_mod_log_eb_copy(struct extent_buffer *dst,
>                 goto free_tms;
>         locked = true;
>
> +       if (dst_move_tm) {
> +               ret = tree_mod_log_insert(fs_info, dst_move_tm);
> +               if (ret)
> +                       goto free_tms;
> +       }
>         for (i = 0; i < nr_items; i++) {
>                 ret = tree_mod_log_insert(fs_info, tm_list_rem[i]);
>                 if (ret)
> @@ -531,6 +574,11 @@ int btrfs_tree_mod_log_eb_copy(struct extent_buffer *dst,
>                 if (ret)
>                         goto free_tms;
>         }
> +       if (src_move_tm) {
> +               ret = tree_mod_log_insert(fs_info, src_move_tm);
> +               if (ret)
> +                       goto free_tms;
> +       }
>
>         write_unlock(&fs_info->tree_mod_log_lock);
>         kfree(tm_list);
> @@ -538,6 +586,12 @@ int btrfs_tree_mod_log_eb_copy(struct extent_buffer *dst,
>         return 0;
>
>  free_tms:
> +       if (dst_move_tm && !RB_EMPTY_NODE(&dst_move_tm->node))
> +               rb_erase(&dst_move_tm->node, &fs_info->tree_mod_log);
> +       kfree(dst_move_tm);
> +       if (src_move_tm && !RB_EMPTY_NODE(&src_move_tm->node))
> +               rb_erase(&src_move_tm->node, &fs_info->tree_mod_log);
> +       kfree(src_move_tm);
>         for (i = 0; i < nr_items * 2; i++) {
>                 if (tm_list[i] && !RB_EMPTY_NODE(&tm_list[i]->node))
>                         rb_erase(&tm_list[i]->node, &fs_info->tree_mod_log);
> --
> 2.40.1
>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v2 1/2] btrfs: warn on invalid slot in tree mod log rewind
  2023-06-01  9:28   ` Filipe Manana
@ 2023-06-01 15:04     ` Boris Burkov
  0 siblings, 0 replies; 7+ messages in thread
From: Boris Burkov @ 2023-06-01 15:04 UTC (permalink / raw)
  To: Filipe Manana; +Cc: linux-btrfs, kernel-team

On Thu, Jun 01, 2023 at 10:28:58AM +0100, Filipe Manana wrote:
> On Wed, May 31, 2023 at 5:22 PM Boris Burkov <boris@bur.io> wrote:
> >
> > The way that tree mod log tracks the ultimate length of the eb, the
> > variable 'n', eventually turns up the correct value, but at intermediate
> > steps during the rewind, n can be inaccurate as a representation of the
> > end of the eb. For example, it doesn't get updated on move rewinds, and
> > it does get updated for add/remove in the middle of the eb.
> >
> > To detect cases with invalid moves, introduce a separate variable called
> > max_slot which tries to track the maximum valid slot in the rewind eb.
> > We can then warn if we do a move whose src range goes beyond the max
> > valid slot.
> >
> > There is a commented caveat that it is possible to have this value be an
> > overestimate due to the challenge of properly handling 'add' operations
> > in the middle of the eb, but in practice it doesn't cause enough of a
> > problem to throw out the max idea in favor of tracking every valid slot.
> >
> > Signed-off-by: Boris Burkov <boris@bur.io>
> > ---
> >  fs/btrfs/tree-mod-log.c | 22 ++++++++++++++++++++++
> >  1 file changed, 22 insertions(+)
> >
> > diff --git a/fs/btrfs/tree-mod-log.c b/fs/btrfs/tree-mod-log.c
> > index a555baa0143a..43f2ffa6f44d 100644
> > --- a/fs/btrfs/tree-mod-log.c
> > +++ b/fs/btrfs/tree-mod-log.c
> > @@ -664,8 +664,10 @@ static void tree_mod_log_rewind(struct btrfs_fs_info *fs_info,
> >         unsigned long o_dst;
> >         unsigned long o_src;
> >         unsigned long p_size = sizeof(struct btrfs_key_ptr);
> > +       u32 max_slot;
> >
> >         n = btrfs_header_nritems(eb);
> > +       max_slot = n - 1;
> >         read_lock(&fs_info->tree_mod_log_lock);
> >         while (tm && tm->seq >= time_seq) {
> >                 /*
> > @@ -684,6 +686,8 @@ static void tree_mod_log_rewind(struct btrfs_fs_info *fs_info,
> >                         btrfs_set_node_ptr_generation(eb, tm->slot,
> >                                                       tm->generation);
> >                         n++;
> > +                       if (max_slot == (u32)-1 || tm->slot > max_slot)
> 
> It would be better to document the intention of the special value
> (u32)-1, and perhaps also let the type be
> a signed integer and just check for -1 as meaning "no valid slot".

I really like the signed integer idea, I'll do that for v3.

> 
> > +                               max_slot = tm->slot;
> >                         break;
> >                 case BTRFS_MOD_LOG_KEY_REPLACE:
> >                         BUG_ON(tm->slot >= n);
> > @@ -693,12 +697,30 @@ static void tree_mod_log_rewind(struct btrfs_fs_info *fs_info,
> >                                                       tm->generation);
> >                         break;
> >                 case BTRFS_MOD_LOG_KEY_ADD:
> > +                       /*
> > +                        * It is possible we could have already removed keys behind the known
> > +                        * max slot, so this will be an overestimate. In practice, the copy
> > +                        * operation inserts them in increasing order, and overestimating just
> > +                        * means we miss some warnings, so it's OK. It isn't worth carefully
> > +                        * tracking the full array of valid slots to check against when moving.
> > +                        */
> > +                       if (tm->slot == max_slot)
> > +                               max_slot--;
> >                         /* if a move operation is needed it's in the log */
> >                         n--;
> >                         break;
> >                 case BTRFS_MOD_LOG_MOVE_KEYS:
> >                         o_dst = btrfs_node_key_ptr_offset(eb, tm->slot);
> >                         o_src = btrfs_node_key_ptr_offset(eb, tm->move.dst_slot);
> > +                       if (WARN_ON((tm->move.dst_slot + tm->move.nr_items - 1 > max_slot) ||
> > +                                   (max_slot == (u32)-1 && tm->move.nr_items > 0))) {
> 
> Like v1, I'm still a bit at odds with the check for  tm->move.nr_items > 0
> 
> Such a case should never be possible, we should assert that
> separately, plus when I read this it makes me think:
> why do we check it only when max_slot is (u32)-1 but not for the
> previous condition?
> It makes me think what would happen to the first condition if it's 0
> and dst_slot happens to be 0, we have an underflow, etc

Good point, I am actually assuming it is >0 for that other calculation,
so 0 is not as "valid" (if obviously pointless) of a value as I argued.

> 
> Maybe add an ASSERT(tm->move.nr_items > 0) before.

I think this is a good idea, and I'll mix it into the warning too. I
think the only thing I would "insist" on is that I don't want to panic
non debug builds on this condition, even if I can't think of a way it
would happen.

> 
> Thanks.
> 
> 
> > +                               btrfs_warn(fs_info,
> > +                                          "Move from invalid tree mod log slot eb %llu slot %d dst_slot %d nr_items %d seq %llu n %u max_slot %u\n",
> > +                                          eb->start, tm->slot,
> > +                                          tm->move.dst_slot, tm->move.nr_items,
> > +                                          tm->seq, n, max_slot);
> > +
> > +                       }
> >                         memmove_extent_buffer(eb, o_dst, o_src,
> >                                               tm->move.nr_items * p_size);
> >                         break;
> > --
> > 2.40.1
> >

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2023-06-01 15:07 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-05-31 16:22 [PATCH v2 0/2] btrfs: fix logical_to_ino panic in btrfs_map_bio Boris Burkov
2023-05-31 16:22 ` [PATCH v2 1/2] btrfs: warn on invalid slot in tree mod log rewind Boris Burkov
2023-06-01  9:28   ` Filipe Manana
2023-06-01 15:04     ` Boris Burkov
2023-05-31 16:22 ` [PATCH v2 2/2] btrfs: insert tree mod log move in push_node_left Boris Burkov
2023-06-01  9:37   ` Filipe Manana
2023-06-01 10:26   ` Filipe Manana

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox