* [PATCH 0/8] btrfs: some speedups for io tree operations and cleanups
@ 2023-09-22 10:39 fdmanana
2023-09-22 10:39 ` [PATCH 1/8] btrfs: make extent state merges more efficient during insertions fdmanana
` (9 more replies)
0 siblings, 10 replies; 20+ messages in thread
From: fdmanana @ 2023-09-22 10:39 UTC (permalink / raw)
To: linux-btrfs
From: Filipe Manana <fdmanana@suse.com>
These are some changes to make some io tree operations more efficient and
some cleanups. More details on the changelogs.
Filipe Manana (8):
btrfs: make extent state merges more efficient during insertions
btrfs: update stale comment at extent_io_tree_release()
btrfs: make wait_extent_bit() static
btrfs: remove pointless memory barrier from extent_io_tree_release()
btrfs: collapse wait_on_state() to its caller wait_extent_bit()
btrfs: make extent_io_tree_release() more efficient
btrfs: use extent_io_tree_release() to empty dirty log pages
btrfs: make sure we cache next state in find_first_extent_bit()
fs/btrfs/extent-io-tree.c | 201 ++++++++++++++++++++++++--------------
fs/btrfs/extent-io-tree.h | 2 -
fs/btrfs/tree-log.c | 3 +-
3 files changed, 126 insertions(+), 80 deletions(-)
--
2.40.1
^ permalink raw reply [flat|nested] 20+ messages in thread
* [PATCH 1/8] btrfs: make extent state merges more efficient during insertions
2023-09-22 10:39 [PATCH 0/8] btrfs: some speedups for io tree operations and cleanups fdmanana
@ 2023-09-22 10:39 ` fdmanana
2023-09-22 10:39 ` [PATCH 2/8] btrfs: update stale comment at extent_io_tree_release() fdmanana
` (8 subsequent siblings)
9 siblings, 0 replies; 20+ messages in thread
From: fdmanana @ 2023-09-22 10:39 UTC (permalink / raw)
To: linux-btrfs
From: Filipe Manana <fdmanana@suse.com>
When inserting a new extent state record into an io tree that happens to
be mergeable, we currently do the following:
1) Insert the extent state record in the io tree's rbtree. This requires
going down the tree to find where to insert it, and during the
insertion we often need to balance the rbtree;
2) We then check if the previous node is mergeable, so we call rb_prev()
to find it, which requires some looping to find the previous node;
3) If the previous node is mergeable, we adjust our node to include the
range of the previous node and then delete the previous node from the
rbtree, which again may need to balance the rbtree;
4) Then we check if the next node is mergeable with the node we inserted,
so we call rb_next(), which requires some looping too. If the next node
is indeed mergeable, we expand the range of our node to include the
next node's range and then delete the next node from the rbtree, which
again may need to balance the tree.
So these are quite of lot of iterations and looping over the rbtree, and
some of the operations may need to rebalance the rb tree. This can be made
a bit more efficient by:
1) When iterating the rbtree, once we find a node that is mergeable with
the node we want to insert, we can just adjust that node's range with
the range of the node to insert - this avoids continuing iterating
over the tree and deleting a node from the rbtree;
2) If we expand the range of a mergeable node, then we find the next or
the previous node, depending on other we merged a range to the right or
to the left of the node we are currently at during the iteration. This
merging is as before, we find the next or previous node with rb_next()
or rb_prev() and if that other node is mergeable with the current one,
we adjust the range of the current node and remove the other node from
the rbtree;
3) Whenever we need to insert the new extent state record it's because
we don't have any extent state record in the rbtree which can be
merged, so we can remove the call to merge_state() after the insertion,
saving rb_next() and rb_prev() calls, which require some looping.
So update the insertion function insert_state() to have this behaviour.
Running dbench for 120 seconds and capturing the execution times of
set_extent_bit() at pin_down_extent(), resulted in the following data
(time values are in nanoseconds):
Before this change:
Count: 2278299
Range: 0.000 - 4003728.000; Mean: 713.436; Median: 612.000; Stddev: 3606.952
Percentiles: 90th: 1187.000; 95th: 1350.000; 99th: 1724.000
0.000 - 7.534: 5 |
7.534 - 35.418: 36 |
35.418 - 154.403: 273 |
154.403 - 662.138: 1244016 #####################################################
662.138 - 2828.745: 1031335 ############################################
2828.745 - 12074.102: 1395 |
12074.102 - 51525.930: 806 |
51525.930 - 219874.955: 162 |
219874.955 - 938254.688: 22 |
938254.688 - 4003728.000: 3 |
After this change:
Count: 2275862
Range: 0.000 - 1605175.000; Mean: 678.903; Median: 590.000; Stddev: 2149.785
Percentiles: 90th: 1105.000; 95th: 1245.000; 99th: 1590.000
0.000 - 10.219: 10 |
10.219 - 40.957: 36 |
40.957 - 155.907: 262 |
155.907 - 585.789: 1127214 ####################################################
585.789 - 2193.431: 1145134 #####################################################
2193.431 - 8205.578: 1648 |
8205.578 - 30689.378: 1039 |
30689.378 - 114772.699: 362 |
114772.699 - 429221.537: 52 |
429221.537 - 1605175.000: 10 |
Maximum duration (range), average duration, percentiles and standard
deviation are all better.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
fs/btrfs/extent-io-tree.c | 130 ++++++++++++++++++++++++++------------
1 file changed, 88 insertions(+), 42 deletions(-)
diff --git a/fs/btrfs/extent-io-tree.c b/fs/btrfs/extent-io-tree.c
index ff8e117a1ace..dd9dd473654d 100644
--- a/fs/btrfs/extent-io-tree.c
+++ b/fs/btrfs/extent-io-tree.c
@@ -327,6 +327,38 @@ static void extent_io_tree_panic(struct extent_io_tree *tree, int err)
"locking error: extent tree was modified by another thread while locked");
}
+static void merge_prev_state(struct extent_io_tree *tree, struct extent_state *state)
+{
+ struct extent_state *prev;
+
+ prev = prev_state(state);
+ if (prev && prev->end == state->start - 1 &&
+ prev->state == state->state) {
+ if (tree->inode)
+ btrfs_merge_delalloc_extent(tree->inode, state, prev);
+ state->start = prev->start;
+ rb_erase(&prev->rb_node, &tree->state);
+ RB_CLEAR_NODE(&prev->rb_node);
+ free_extent_state(prev);
+ }
+}
+
+static void merge_next_state(struct extent_io_tree *tree, struct extent_state *state)
+{
+ struct extent_state *next;
+
+ next = next_state(state);
+ if (next && next->start == state->end + 1 &&
+ next->state == state->state) {
+ if (tree->inode)
+ btrfs_merge_delalloc_extent(tree->inode, state, next);
+ state->end = next->end;
+ rb_erase(&next->rb_node, &tree->state);
+ RB_CLEAR_NODE(&next->rb_node);
+ free_extent_state(next);
+ }
+}
+
/*
* Utility function to look for merge candidates inside a given range. Any
* extents with matching state are merged together into a single extent in the
@@ -338,31 +370,11 @@ static void extent_io_tree_panic(struct extent_io_tree *tree, int err)
*/
static void merge_state(struct extent_io_tree *tree, struct extent_state *state)
{
- struct extent_state *other;
-
if (state->state & (EXTENT_LOCKED | EXTENT_BOUNDARY))
return;
- other = prev_state(state);
- if (other && other->end == state->start - 1 &&
- other->state == state->state) {
- if (tree->inode)
- btrfs_merge_delalloc_extent(tree->inode, state, other);
- state->start = other->start;
- rb_erase(&other->rb_node, &tree->state);
- RB_CLEAR_NODE(&other->rb_node);
- free_extent_state(other);
- }
- other = next_state(state);
- if (other && other->start == state->end + 1 &&
- other->state == state->state) {
- if (tree->inode)
- btrfs_merge_delalloc_extent(tree->inode, state, other);
- state->end = other->end;
- rb_erase(&other->rb_node, &tree->state);
- RB_CLEAR_NODE(&other->rb_node);
- free_extent_state(other);
- }
+ merge_prev_state(tree, state);
+ merge_next_state(tree, state);
}
static void set_state_bits(struct extent_io_tree *tree,
@@ -384,19 +396,26 @@ static void set_state_bits(struct extent_io_tree *tree,
* Insert an extent_state struct into the tree. 'bits' are set on the
* struct before it is inserted.
*
- * This may return -EEXIST if the extent is already there, in which case the
- * state struct is freed.
+ * Returns a pointer to the struct extent_state record containing the range
+ * requested for insertion, which may be the same as the given struct or it
+ * may be an existing record in the tree that was expanded to accommodate the
+ * requested range. In case of an extent_state different from the one that was
+ * given, the later can be freed or reused by the caller.
+ * On error it returns an error pointer.
*
* The tree lock is not taken internally. This is a utility function and
* probably isn't what you want to call (see set/clear_extent_bit).
*/
-static int insert_state(struct extent_io_tree *tree,
- struct extent_state *state,
- u32 bits, struct extent_changeset *changeset)
+static struct extent_state *insert_state(struct extent_io_tree *tree,
+ struct extent_state *state,
+ u32 bits,
+ struct extent_changeset *changeset)
{
struct rb_node **node;
struct rb_node *parent = NULL;
- const u64 end = state->end;
+ const u64 start = state->start - 1;
+ const u64 end = state->end + 1;
+ const bool try_merge = !(bits & (EXTENT_LOCKED | EXTENT_BOUNDARY));
set_state_bits(tree, state, bits, changeset);
@@ -407,23 +426,40 @@ static int insert_state(struct extent_io_tree *tree,
parent = *node;
entry = rb_entry(parent, struct extent_state, rb_node);
- if (end < entry->start) {
+ if (state->end < entry->start) {
+ if (try_merge && end == entry->start &&
+ state->state == entry->state) {
+ if (tree->inode)
+ btrfs_merge_delalloc_extent(tree->inode, state, entry);
+ entry->start = state->start;
+ merge_prev_state(tree, entry);
+ state->state = 0;
+ return entry;
+ }
node = &(*node)->rb_left;
- } else if (end > entry->end) {
+ } else if (state->end > entry->end) {
+ if (try_merge && entry->end == start &&
+ state->state == entry->state) {
+ if (tree->inode)
+ btrfs_merge_delalloc_extent(tree->inode, state, entry);
+ entry->end = state->end;
+ merge_next_state(tree, entry);
+ state->state = 0;
+ return entry;
+ }
node = &(*node)->rb_right;
} else {
btrfs_err(tree->fs_info,
"found node %llu %llu on insert of %llu %llu",
- entry->start, entry->end, state->start, end);
- return -EEXIST;
+ entry->start, entry->end, state->start, state->end);
+ return ERR_PTR(-EEXIST);
}
}
rb_link_node(&state->rb_node, parent, node);
rb_insert_color(&state->rb_node, &tree->state);
- merge_state(tree, state);
- return 0;
+ return state;
}
/*
@@ -1133,6 +1169,8 @@ static int __set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
*/
if (state->start > start) {
u64 this_end;
+ struct extent_state *inserted_state;
+
if (end < last_start)
this_end = end;
else
@@ -1148,12 +1186,15 @@ static int __set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
*/
prealloc->start = start;
prealloc->end = this_end;
- err = insert_state(tree, prealloc, bits, changeset);
- if (err)
+ inserted_state = insert_state(tree, prealloc, bits, changeset);
+ if (IS_ERR(inserted_state)) {
+ err = PTR_ERR(inserted_state);
extent_io_tree_panic(tree, err);
+ }
- cache_state(prealloc, cached_state);
- prealloc = NULL;
+ cache_state(inserted_state, cached_state);
+ if (inserted_state == prealloc)
+ prealloc = NULL;
start = this_end + 1;
goto search_again;
}
@@ -1356,6 +1397,8 @@ int convert_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
*/
if (state->start > start) {
u64 this_end;
+ struct extent_state *inserted_state;
+
if (end < last_start)
this_end = end;
else
@@ -1373,11 +1416,14 @@ int convert_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
*/
prealloc->start = start;
prealloc->end = this_end;
- err = insert_state(tree, prealloc, bits, NULL);
- if (err)
+ inserted_state = insert_state(tree, prealloc, bits, NULL);
+ if (IS_ERR(inserted_state)) {
+ err = PTR_ERR(inserted_state);
extent_io_tree_panic(tree, err);
- cache_state(prealloc, cached_state);
- prealloc = NULL;
+ }
+ cache_state(inserted_state, cached_state);
+ if (inserted_state == prealloc)
+ prealloc = NULL;
start = this_end + 1;
goto search_again;
}
--
2.40.1
^ permalink raw reply related [flat|nested] 20+ messages in thread
* [PATCH 2/8] btrfs: update stale comment at extent_io_tree_release()
2023-09-22 10:39 [PATCH 0/8] btrfs: some speedups for io tree operations and cleanups fdmanana
2023-09-22 10:39 ` [PATCH 1/8] btrfs: make extent state merges more efficient during insertions fdmanana
@ 2023-09-22 10:39 ` fdmanana
2023-09-22 10:39 ` [PATCH 3/8] btrfs: make wait_extent_bit() static fdmanana
` (7 subsequent siblings)
9 siblings, 0 replies; 20+ messages in thread
From: fdmanana @ 2023-09-22 10:39 UTC (permalink / raw)
To: linux-btrfs
From: Filipe Manana <fdmanana@suse.com>
There's this comment at extent_io_tree_release() when mentions io btrees,
but this function is no longer used only for io btrees. Originally it was
added as a static function named clear_btree_io_tree() at transaction.c,
in commit 663dfbb07774 ("Btrfs: deal with convert_extent_bit errors to
avoid fs corruption"), as it was used only for cleaning one of the io
trees that track dirty extent buffers, the dirty_log_pages io tree of a
a root and the dirty_pages io tree of a transaction. Later it was renamed
and exported and now it's used to cleanup other io trees such as the
allocation state io tree of a device or the cums range io tree of a log
root.
So remove that comment and replace it with one at the top of the function
that is more complete, mentioning what the function does and that it's
expected to be called only when a task is sure no one else will need to
use the tree anymore, as well as there should be no locked ranges in the
tree and therefore no waiters on its extent state records. Also add an
assertion to check that there are no locked extent state records in the
tree.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
fs/btrfs/extent-io-tree.c | 12 ++++++++----
1 file changed, 8 insertions(+), 4 deletions(-)
diff --git a/fs/btrfs/extent-io-tree.c b/fs/btrfs/extent-io-tree.c
index dd9dd473654d..1ca0827493a6 100644
--- a/fs/btrfs/extent-io-tree.c
+++ b/fs/btrfs/extent-io-tree.c
@@ -105,6 +105,13 @@ void extent_io_tree_init(struct btrfs_fs_info *fs_info,
lockdep_set_class(&tree->lock, &file_extent_tree_class);
}
+/*
+ * Empty an io tree, removing and freeing every extent state record from the
+ * tree. This should be called once we are sure no other task can access the
+ * tree anymore, so no tree updates happen after we empty the tree and there
+ * aren't any waiters on any extent state record (EXTENT_LOCKED bit is never
+ * set on any extent state when calling this function).
+ */
void extent_io_tree_release(struct extent_io_tree *tree)
{
spin_lock(&tree->lock);
@@ -122,10 +129,7 @@ void extent_io_tree_release(struct extent_io_tree *tree)
state = rb_entry(node, struct extent_state, rb_node);
rb_erase(&state->rb_node, &tree->state);
RB_CLEAR_NODE(&state->rb_node);
- /*
- * btree io trees aren't supposed to have tasks waiting for
- * changes in the flags of extent states ever.
- */
+ ASSERT(!(state->state & EXTENT_LOCKED));
ASSERT(!waitqueue_active(&state->wq));
free_extent_state(state);
--
2.40.1
^ permalink raw reply related [flat|nested] 20+ messages in thread
* [PATCH 3/8] btrfs: make wait_extent_bit() static
2023-09-22 10:39 [PATCH 0/8] btrfs: some speedups for io tree operations and cleanups fdmanana
2023-09-22 10:39 ` [PATCH 1/8] btrfs: make extent state merges more efficient during insertions fdmanana
2023-09-22 10:39 ` [PATCH 2/8] btrfs: update stale comment at extent_io_tree_release() fdmanana
@ 2023-09-22 10:39 ` fdmanana
2023-09-22 22:49 ` Anand Jain
2023-09-22 10:39 ` [PATCH 4/8] btrfs: remove pointless memory barrier from extent_io_tree_release() fdmanana
` (6 subsequent siblings)
9 siblings, 1 reply; 20+ messages in thread
From: fdmanana @ 2023-09-22 10:39 UTC (permalink / raw)
To: linux-btrfs
From: Filipe Manana <fdmanana@suse.com>
The function wait_extent_bit() is not used outside extent-io-tree.c so
make it static. Furthermore the function doesn't have the 'btrfs_' prefix.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
fs/btrfs/extent-io-tree.c | 4 ++--
fs/btrfs/extent-io-tree.h | 2 --
2 files changed, 2 insertions(+), 4 deletions(-)
diff --git a/fs/btrfs/extent-io-tree.c b/fs/btrfs/extent-io-tree.c
index 1ca0827493a6..033544f79e2b 100644
--- a/fs/btrfs/extent-io-tree.c
+++ b/fs/btrfs/extent-io-tree.c
@@ -766,8 +766,8 @@ static void wait_on_state(struct extent_io_tree *tree,
* The range [start, end] is inclusive.
* The tree lock is taken by this function
*/
-void wait_extent_bit(struct extent_io_tree *tree, u64 start, u64 end, u32 bits,
- struct extent_state **cached_state)
+static void wait_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
+ u32 bits, struct extent_state **cached_state)
{
struct extent_state *state;
diff --git a/fs/btrfs/extent-io-tree.h b/fs/btrfs/extent-io-tree.h
index 28c23a23d121..ddcc8bbf1a05 100644
--- a/fs/btrfs/extent-io-tree.h
+++ b/fs/btrfs/extent-io-tree.h
@@ -192,7 +192,5 @@ int find_contiguous_extent_bit(struct extent_io_tree *tree, u64 start,
bool btrfs_find_delalloc_range(struct extent_io_tree *tree, u64 *start,
u64 *end, u64 max_bytes,
struct extent_state **cached_state);
-void wait_extent_bit(struct extent_io_tree *tree, u64 start, u64 end, u32 bits,
- struct extent_state **cached_state);
#endif /* BTRFS_EXTENT_IO_TREE_H */
--
2.40.1
^ permalink raw reply related [flat|nested] 20+ messages in thread
* [PATCH 4/8] btrfs: remove pointless memory barrier from extent_io_tree_release()
2023-09-22 10:39 [PATCH 0/8] btrfs: some speedups for io tree operations and cleanups fdmanana
` (2 preceding siblings ...)
2023-09-22 10:39 ` [PATCH 3/8] btrfs: make wait_extent_bit() static fdmanana
@ 2023-09-22 10:39 ` fdmanana
2023-09-29 15:25 ` David Sterba
2023-09-22 10:39 ` [PATCH 5/8] btrfs: collapse wait_on_state() to its caller wait_extent_bit() fdmanana
` (5 subsequent siblings)
9 siblings, 1 reply; 20+ messages in thread
From: fdmanana @ 2023-09-22 10:39 UTC (permalink / raw)
To: linux-btrfs
From: Filipe Manana <fdmanana@suse.com>
The memory barrier at extent_io_tree_release() is pointless because:
1) We have just called spin_lock() and that implies a memory barrier;
2) We only change the waitqueue of an extent state record while holding
the tree lock - see wait_on_state()
So remove the memory barrier.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
fs/btrfs/extent-io-tree.c | 11 +++++------
1 file changed, 5 insertions(+), 6 deletions(-)
diff --git a/fs/btrfs/extent-io-tree.c b/fs/btrfs/extent-io-tree.c
index 033544f79e2b..c939c2bc88e5 100644
--- a/fs/btrfs/extent-io-tree.c
+++ b/fs/btrfs/extent-io-tree.c
@@ -115,12 +115,6 @@ void extent_io_tree_init(struct btrfs_fs_info *fs_info,
void extent_io_tree_release(struct extent_io_tree *tree)
{
spin_lock(&tree->lock);
- /*
- * Do a single barrier for the waitqueue_active check here, the state
- * of the waitqueue should not change once extent_io_tree_release is
- * called.
- */
- smp_mb();
while (!RB_EMPTY_ROOT(&tree->state)) {
struct rb_node *node;
struct extent_state *state;
@@ -130,6 +124,11 @@ void extent_io_tree_release(struct extent_io_tree *tree)
rb_erase(&state->rb_node, &tree->state);
RB_CLEAR_NODE(&state->rb_node);
ASSERT(!(state->state & EXTENT_LOCKED));
+ /*
+ * No need for a memory barrier here, as we are holding the tree
+ * lock and we only change the waitqueue while holding that lock
+ * (see wait_on_state()).
+ */
ASSERT(!waitqueue_active(&state->wq));
free_extent_state(state);
--
2.40.1
^ permalink raw reply related [flat|nested] 20+ messages in thread
* [PATCH 5/8] btrfs: collapse wait_on_state() to its caller wait_extent_bit()
2023-09-22 10:39 [PATCH 0/8] btrfs: some speedups for io tree operations and cleanups fdmanana
` (3 preceding siblings ...)
2023-09-22 10:39 ` [PATCH 4/8] btrfs: remove pointless memory barrier from extent_io_tree_release() fdmanana
@ 2023-09-22 10:39 ` fdmanana
2023-09-22 22:54 ` Anand Jain
2023-09-22 10:39 ` [PATCH 6/8] btrfs: make extent_io_tree_release() more efficient fdmanana
` (4 subsequent siblings)
9 siblings, 1 reply; 20+ messages in thread
From: fdmanana @ 2023-09-22 10:39 UTC (permalink / raw)
To: linux-btrfs
From: Filipe Manana <fdmanana@suse.com>
The wait_on_state() function is very short and has a single caller, which
is wait_extent_bit(), so remove the function and put its code into the
caller.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
fs/btrfs/extent-io-tree.c | 23 ++++++++---------------
1 file changed, 8 insertions(+), 15 deletions(-)
diff --git a/fs/btrfs/extent-io-tree.c b/fs/btrfs/extent-io-tree.c
index c939c2bc88e5..700b84fc1588 100644
--- a/fs/btrfs/extent-io-tree.c
+++ b/fs/btrfs/extent-io-tree.c
@@ -127,7 +127,7 @@ void extent_io_tree_release(struct extent_io_tree *tree)
/*
* No need for a memory barrier here, as we are holding the tree
* lock and we only change the waitqueue while holding that lock
- * (see wait_on_state()).
+ * (see wait_extent_bit()).
*/
ASSERT(!waitqueue_active(&state->wq));
free_extent_state(state);
@@ -747,19 +747,6 @@ int __clear_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
}
-static void wait_on_state(struct extent_io_tree *tree,
- struct extent_state *state)
- __releases(tree->lock)
- __acquires(tree->lock)
-{
- DEFINE_WAIT(wait);
- prepare_to_wait(&state->wq, &wait, TASK_UNINTERRUPTIBLE);
- spin_unlock(&tree->lock);
- schedule();
- spin_lock(&tree->lock);
- finish_wait(&state->wq, &wait);
-}
-
/*
* Wait for one or more bits to clear on a range in the state tree.
* The range [start, end] is inclusive.
@@ -797,9 +784,15 @@ static void wait_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
goto out;
if (state->state & bits) {
+ DEFINE_WAIT(wait);
+
start = state->start;
refcount_inc(&state->refs);
- wait_on_state(tree, state);
+ prepare_to_wait(&state->wq, &wait, TASK_UNINTERRUPTIBLE);
+ spin_unlock(&tree->lock);
+ schedule();
+ spin_lock(&tree->lock);
+ finish_wait(&state->wq, &wait);
free_extent_state(state);
goto again;
}
--
2.40.1
^ permalink raw reply related [flat|nested] 20+ messages in thread
* [PATCH 6/8] btrfs: make extent_io_tree_release() more efficient
2023-09-22 10:39 [PATCH 0/8] btrfs: some speedups for io tree operations and cleanups fdmanana
` (4 preceding siblings ...)
2023-09-22 10:39 ` [PATCH 5/8] btrfs: collapse wait_on_state() to its caller wait_extent_bit() fdmanana
@ 2023-09-22 10:39 ` fdmanana
2023-09-22 10:39 ` [PATCH 7/8] btrfs: use extent_io_tree_release() to empty dirty log pages fdmanana
` (3 subsequent siblings)
9 siblings, 0 replies; 20+ messages in thread
From: fdmanana @ 2023-09-22 10:39 UTC (permalink / raw)
To: linux-btrfs
From: Filipe Manana <fdmanana@suse.com>
Currently extent_io_tree_release() is a loop that keeps getting the first
node in the io tree, using rb_first() which is a loop that gets to the
leftmost node of the rbtree, and then for each node it calls rb_erase(),
which often requires rebalancing the rbtree.
We can make this more efficient by using
rbtree_postorder_for_each_entry_safe() to free each node without having
to delete it from the rbtree and without looping to get the first node.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
fs/btrfs/extent-io-tree.c | 16 +++++++---------
1 file changed, 7 insertions(+), 9 deletions(-)
diff --git a/fs/btrfs/extent-io-tree.c b/fs/btrfs/extent-io-tree.c
index 700b84fc1588..f2372d6cd304 100644
--- a/fs/btrfs/extent-io-tree.c
+++ b/fs/btrfs/extent-io-tree.c
@@ -114,14 +114,12 @@ void extent_io_tree_init(struct btrfs_fs_info *fs_info,
*/
void extent_io_tree_release(struct extent_io_tree *tree)
{
- spin_lock(&tree->lock);
- while (!RB_EMPTY_ROOT(&tree->state)) {
- struct rb_node *node;
- struct extent_state *state;
+ struct extent_state *state;
+ struct extent_state *tmp;
- node = rb_first(&tree->state);
- state = rb_entry(node, struct extent_state, rb_node);
- rb_erase(&state->rb_node, &tree->state);
+ spin_lock(&tree->lock);
+ rbtree_postorder_for_each_entry_safe(state, tmp, &tree->state, rb_node) {
+ /* Clear node to keep free_extent_state() happy. */
RB_CLEAR_NODE(&state->rb_node);
ASSERT(!(state->state & EXTENT_LOCKED));
/*
@@ -131,9 +129,9 @@ void extent_io_tree_release(struct extent_io_tree *tree)
*/
ASSERT(!waitqueue_active(&state->wq));
free_extent_state(state);
-
- cond_resched_lock(&tree->lock);
+ cond_resched();
}
+ tree->state = RB_ROOT;
spin_unlock(&tree->lock);
}
--
2.40.1
^ permalink raw reply related [flat|nested] 20+ messages in thread
* [PATCH 7/8] btrfs: use extent_io_tree_release() to empty dirty log pages
2023-09-22 10:39 [PATCH 0/8] btrfs: some speedups for io tree operations and cleanups fdmanana
` (5 preceding siblings ...)
2023-09-22 10:39 ` [PATCH 6/8] btrfs: make extent_io_tree_release() more efficient fdmanana
@ 2023-09-22 10:39 ` fdmanana
2023-09-26 6:25 ` kernel test robot
2023-09-22 10:39 ` [PATCH 8/8] btrfs: make sure we cache next state in find_first_extent_bit() fdmanana
` (2 subsequent siblings)
9 siblings, 1 reply; 20+ messages in thread
From: fdmanana @ 2023-09-22 10:39 UTC (permalink / raw)
To: linux-btrfs
From: Filipe Manana <fdmanana@suse.com>
When freeing a log tree, during a transaction commit, we clear its dirty
log pages io tree by calling clear_extent_bits() using a range from 0 to
(u64)-1. This will iterate the io tree's rbtree and call rb_erase() on
each node before freeing it, which will often trigger rebalance operations
on the rbtree. A better alternative it to use extent_io_tree_release(),
which will not do deletions and trigger rebalances.
So use extent_io_tree_release() instead of clear_extent_bits().
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
fs/btrfs/tree-log.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index f4257be56bd3..8687c944451f 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -3207,8 +3207,7 @@ static void free_log_tree(struct btrfs_trans_handle *trans,
}
}
- clear_extent_bits(&log->dirty_log_pages, 0, (u64)-1,
- EXTENT_DIRTY | EXTENT_NEW | EXTENT_NEED_WAIT);
+ extent_io_tree_release(&log->dirty_log_pages);
extent_io_tree_release(&log->log_csum_range);
btrfs_put_root(log);
--
2.40.1
^ permalink raw reply related [flat|nested] 20+ messages in thread
* [PATCH 8/8] btrfs: make sure we cache next state in find_first_extent_bit()
2023-09-22 10:39 [PATCH 0/8] btrfs: some speedups for io tree operations and cleanups fdmanana
` (6 preceding siblings ...)
2023-09-22 10:39 ` [PATCH 7/8] btrfs: use extent_io_tree_release() to empty dirty log pages fdmanana
@ 2023-09-22 10:39 ` fdmanana
2023-09-22 18:18 ` [PATCH 0/8] btrfs: some speedups for io tree operations and cleanups David Sterba
2023-09-29 16:47 ` David Sterba
9 siblings, 0 replies; 20+ messages in thread
From: fdmanana @ 2023-09-22 10:39 UTC (permalink / raw)
To: linux-btrfs
From: Filipe Manana <fdmanana@suse.com>
Currently, at find_first_extent_bit(), when we are given a cached extent
state that happens to have its end offset match the desired range start,
we find the next extent state using that cached state, with next_state()
calls, and then return it.
We then try to cache that next state by calling cache_state_if_flags(),
but that will not cache the state because we haven't reset *cached_state
to NULL, so we end up with the cached_state unchanged, and if the caller
is iterating over extent states in the io tree, its next call to
find_first_extent_bit() will not use the current cached state as its end
offset does not match the minimum start range offset, therefore the cached
state is reset and we have to search the rbtree to find the next suitable
extent state record.
So fix this by resetting the cached state to NULL (and dropping our ref
on it) when we have a suitable cached state and we found a next state by
using next_state() starting from the cached state. This makes use cases
of calling find_first_extent_bit() to go over all ranges in the io tree
to do a single rbtree full search, only on the first call, and the next
calls will just do next_state() (rb_next() wrapper) calls, which is more
efficient.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
fs/btrfs/extent-io-tree.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)
diff --git a/fs/btrfs/extent-io-tree.c b/fs/btrfs/extent-io-tree.c
index f2372d6cd304..6e3645afaa38 100644
--- a/fs/btrfs/extent-io-tree.c
+++ b/fs/btrfs/extent-io-tree.c
@@ -877,10 +877,19 @@ bool find_first_extent_bit(struct extent_io_tree *tree, u64 start,
if (state->end == start - 1 && extent_state_in_tree(state)) {
while ((state = next_state(state)) != NULL) {
if (state->state & bits)
- goto got_it;
+ break;
}
+ /*
+ * If we found the next extent state, clear cached_state
+ * so that we can cache the next extent state below and
+ * avoid future calls going over the same extent state
+ * again. If we haven't found any, clear as well since
+ * it's now useless.
+ */
free_extent_state(*cached_state);
*cached_state = NULL;
+ if (state)
+ goto got_it;
goto out;
}
free_extent_state(*cached_state);
--
2.40.1
^ permalink raw reply related [flat|nested] 20+ messages in thread
* Re: [PATCH 0/8] btrfs: some speedups for io tree operations and cleanups
2023-09-22 10:39 [PATCH 0/8] btrfs: some speedups for io tree operations and cleanups fdmanana
` (7 preceding siblings ...)
2023-09-22 10:39 ` [PATCH 8/8] btrfs: make sure we cache next state in find_first_extent_bit() fdmanana
@ 2023-09-22 18:18 ` David Sterba
2023-09-22 18:43 ` Filipe Manana
2023-09-29 16:47 ` David Sterba
9 siblings, 1 reply; 20+ messages in thread
From: David Sterba @ 2023-09-22 18:18 UTC (permalink / raw)
To: fdmanana; +Cc: linux-btrfs
On Fri, Sep 22, 2023 at 11:39:01AM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> These are some changes to make some io tree operations more efficient and
> some cleanups. More details on the changelogs.
>
> Filipe Manana (8):
> btrfs: make extent state merges more efficient during insertions
> btrfs: update stale comment at extent_io_tree_release()
> btrfs: make wait_extent_bit() static
> btrfs: remove pointless memory barrier from extent_io_tree_release()
> btrfs: collapse wait_on_state() to its caller wait_extent_bit()
> btrfs: make extent_io_tree_release() more efficient
> btrfs: use extent_io_tree_release() to empty dirty log pages
> btrfs: make sure we cache next state in find_first_extent_bit()
I see a lot of reports like:
btrfs/004 [16:14:23][ 468.732077] run fstests btrfs/004 at 2023-09-22 16:14:24
[ 470.921989] BTRFS: device fsid f7d57de2-899a-4b33-b77a-084058ac36e9 devid 1 transid 11 /dev/vda scanned by mount (31993)
[ 470.926438] BTRFS info (device vda): using crc32c (crc32c-generic) checksum algorithm
[ 470.928013] BTRFS info (device vda): using free space tree
[ 470.952723] BTRFS info (device vda): auto enabling async discard
[ 472.385556] BTRFS: device fsid 097a012d-8e9b-4bd8-960d-87577821cbbe devid 1 transid 6 /dev/vdb scanned by mount (32061)
[ 472.395192] BTRFS info (device vdb): using crc32c (crc32c-generic) checksum algorithm
[ 472.398895] BTRFS info (device vdb): using free space tree
[ 472.438755] BTRFS info (device vdb): auto enabling async discard
[ 472.440900] BTRFS info (device vdb): checking UUID tree
[ 472.534254] BUG: sleeping function called from invalid context at fs/btrfs/extent-io-tree.c:132
[ 472.539305] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 32079, name: fsstress
[ 472.542685] preempt_count: 1, expected: 0
[ 472.543641] RCU nest depth: 0, expected: 0
[ 472.544778] 6 locks held by fsstress/32079:
[ 472.546916] #0: ffff888017ad4648 (sb_internal#2){.+.+}-{0:0}, at: btrfs_sync_file+0x594/0xa20 [btrfs]
[ 472.551474] #1: ffff888014c72790 (btrfs_trans_completed){.+.+}-{0:0}, at: btrfs_commit_transaction+0x840/0x1580 [btrfs]
[ 472.556334] #2: ffff888014c72760 (btrfs_trans_super_committed){.+.+}-{0:0}, at: btrfs_commit_transaction+0x840/0x1580 [btrfs]
[ 472.561372] #3: ffff888014c72730 (btrfs_trans_unblocked){++++}-{0:0}, at: btrfs_commit_transaction+0x840/0x1580 [btrfs]
[ 472.565793] #4: ffff888014c70de0 (&fs_info->reloc_mutex){+.+.}-{3:3}, at: btrfs_commit_transaction+0x8ed/0x1580 [btrfs]
[ 472.569931] #5: ffff88802ec1c248 (&tree->lock#2){+.+.}-{2:2}, at: extent_io_tree_release+0x1c/0x120 [btrfs]
[ 472.573099] Preemption disabled at:
[ 472.573110] [<0000000000000000>] 0x0
[ 472.575200] CPU: 0 PID: 32079 Comm: fsstress Not tainted 6.6.0-rc2-default+ #2197
[ 472.577440] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-3-gd478f380-rebuilt.opensuse.org 04/01/2014
[ 472.580902] Call Trace:
[ 472.581738] <TASK>
[ 472.582480] dump_stack_lvl+0x60/0x70
[ 472.583621] __might_resched+0x224/0x360
[ 472.584591] extent_io_tree_release+0xa5/0x120 [btrfs]
[ 472.585864] free_log_tree+0xec/0x250 [btrfs]
[ 472.587007] ? walk_log_tree+0x4a0/0x4a0 [btrfs]
[ 472.588116] ? reacquire_held_locks+0x280/0x280
[ 472.589104] ? btrfs_log_holes+0x430/0x430 [btrfs]
[ 472.590296] ? node_tag_clear+0xf4/0x160
[ 472.591292] btrfs_free_log+0x2c/0x60 [btrfs]
[ 472.592468] commit_fs_roots+0x1e0/0x440 [btrfs]
[ 472.593569] ? __lock_release.isra.0+0x14e/0x510
[ 472.594470] ? percpu_up_read+0xe0/0xe0 [btrfs]
[ 472.595496] ? btrfs_run_delayed_refs+0xf6/0x180 [btrfs]
[ 472.596758] ? btrfs_assert_delayed_root_empty+0x2d/0xd0 [btrfs]
[ 472.598077] ? btrfs_commit_transaction+0x932/0x1580 [btrfs]
[ 472.599437] btrfs_commit_transaction+0x94e/0x1580 [btrfs]
[ 472.600845] ? cleanup_transaction+0x650/0x650 [btrfs]
[ 472.602164] ? preempt_count_sub+0x18/0xc0
[ 472.603111] ? __rcu_read_unlock+0x67/0x90
[ 472.604011] ? preempt_count_add+0x71/0xd0
[ 472.604840] ? preempt_count_sub+0x18/0xc0
[ 472.605664] ? preempt_count_add+0x71/0xd0
[ 472.606454] ? preempt_count_sub+0x18/0xc0
[ 472.607251] ? __up_write+0x125/0x300
[ 472.608059] btrfs_sync_file+0x794/0xa20 [btrfs]
[ 472.609105] ? start_ordered_ops.constprop.0+0xd0/0xd0 [btrfs]
[ 472.610392] ? mark_held_locks+0x1a/0x80
[ 472.611179] __x64_sys_fdatasync+0x70/0xb0
[ 472.614054] do_syscall_64+0x3d/0x90
[ 472.614584] entry_SYSCALL_64_after_hwframe+0x46/0xb0
[ 472.615522] RIP: 0033:0x7f9bbbe983d4
115 void extent_io_tree_release(struct extent_io_tree *tree)
116 {
117 struct extent_state *state;
118 struct extent_state *tmp;
119
120 spin_lock(&tree->lock);
121 rbtree_postorder_for_each_entry_safe(state, tmp, &tree->state, rb_node) {
122 /* Clear node to keep free_extent_state() happy. */
123 RB_CLEAR_NODE(&state->rb_node);
124 ASSERT(!(state->state & EXTENT_LOCKED));
125 /*
126 * No need for a memory barrier here, as we are holding the tree
127 * lock and we only change the waitqueue while holding that lock
128 * (see wait_extent_bit()).
129 */
130 ASSERT(!waitqueue_active(&state->wq));
131 free_extent_state(state);
132 cond_resched();
^^^
should be cond_resched_lock() as it's under spinlock but then I'm not
sure if relocking is still safe in the middle of the tree traversal.
133 }
134 tree->state = RB_ROOT;
135 spin_unlock(&tree->lock);
136 }
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 0/8] btrfs: some speedups for io tree operations and cleanups
2023-09-22 18:18 ` [PATCH 0/8] btrfs: some speedups for io tree operations and cleanups David Sterba
@ 2023-09-22 18:43 ` Filipe Manana
2023-09-22 19:08 ` David Sterba
2023-09-22 19:13 ` Filipe Manana
0 siblings, 2 replies; 20+ messages in thread
From: Filipe Manana @ 2023-09-22 18:43 UTC (permalink / raw)
To: dsterba; +Cc: linux-btrfs
On Fri, Sep 22, 2023 at 7:24 PM David Sterba <dsterba@suse.cz> wrote:
>
> On Fri, Sep 22, 2023 at 11:39:01AM +0100, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > These are some changes to make some io tree operations more efficient and
> > some cleanups. More details on the changelogs.
> >
> > Filipe Manana (8):
> > btrfs: make extent state merges more efficient during insertions
> > btrfs: update stale comment at extent_io_tree_release()
> > btrfs: make wait_extent_bit() static
> > btrfs: remove pointless memory barrier from extent_io_tree_release()
> > btrfs: collapse wait_on_state() to its caller wait_extent_bit()
> > btrfs: make extent_io_tree_release() more efficient
> > btrfs: use extent_io_tree_release() to empty dirty log pages
> > btrfs: make sure we cache next state in find_first_extent_bit()
>
> I see a lot of reports like:
>
> btrfs/004 [16:14:23][ 468.732077] run fstests btrfs/004 at 2023-09-22 16:14:24
> [ 470.921989] BTRFS: device fsid f7d57de2-899a-4b33-b77a-084058ac36e9 devid 1 transid 11 /dev/vda scanned by mount (31993)
> [ 470.926438] BTRFS info (device vda): using crc32c (crc32c-generic) checksum algorithm
> [ 470.928013] BTRFS info (device vda): using free space tree
> [ 470.952723] BTRFS info (device vda): auto enabling async discard
> [ 472.385556] BTRFS: device fsid 097a012d-8e9b-4bd8-960d-87577821cbbe devid 1 transid 6 /dev/vdb scanned by mount (32061)
> [ 472.395192] BTRFS info (device vdb): using crc32c (crc32c-generic) checksum algorithm
> [ 472.398895] BTRFS info (device vdb): using free space tree
> [ 472.438755] BTRFS info (device vdb): auto enabling async discard
> [ 472.440900] BTRFS info (device vdb): checking UUID tree
> [ 472.534254] BUG: sleeping function called from invalid context at fs/btrfs/extent-io-tree.c:132
> [ 472.539305] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 32079, name: fsstress
> [ 472.542685] preempt_count: 1, expected: 0
> [ 472.543641] RCU nest depth: 0, expected: 0
> [ 472.544778] 6 locks held by fsstress/32079:
> [ 472.546916] #0: ffff888017ad4648 (sb_internal#2){.+.+}-{0:0}, at: btrfs_sync_file+0x594/0xa20 [btrfs]
> [ 472.551474] #1: ffff888014c72790 (btrfs_trans_completed){.+.+}-{0:0}, at: btrfs_commit_transaction+0x840/0x1580 [btrfs]
> [ 472.556334] #2: ffff888014c72760 (btrfs_trans_super_committed){.+.+}-{0:0}, at: btrfs_commit_transaction+0x840/0x1580 [btrfs]
> [ 472.561372] #3: ffff888014c72730 (btrfs_trans_unblocked){++++}-{0:0}, at: btrfs_commit_transaction+0x840/0x1580 [btrfs]
> [ 472.565793] #4: ffff888014c70de0 (&fs_info->reloc_mutex){+.+.}-{3:3}, at: btrfs_commit_transaction+0x8ed/0x1580 [btrfs]
> [ 472.569931] #5: ffff88802ec1c248 (&tree->lock#2){+.+.}-{2:2}, at: extent_io_tree_release+0x1c/0x120 [btrfs]
> [ 472.573099] Preemption disabled at:
> [ 472.573110] [<0000000000000000>] 0x0
> [ 472.575200] CPU: 0 PID: 32079 Comm: fsstress Not tainted 6.6.0-rc2-default+ #2197
> [ 472.577440] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-3-gd478f380-rebuilt.opensuse.org 04/01/2014
> [ 472.580902] Call Trace:
> [ 472.581738] <TASK>
> [ 472.582480] dump_stack_lvl+0x60/0x70
> [ 472.583621] __might_resched+0x224/0x360
> [ 472.584591] extent_io_tree_release+0xa5/0x120 [btrfs]
> [ 472.585864] free_log_tree+0xec/0x250 [btrfs]
> [ 472.587007] ? walk_log_tree+0x4a0/0x4a0 [btrfs]
> [ 472.588116] ? reacquire_held_locks+0x280/0x280
> [ 472.589104] ? btrfs_log_holes+0x430/0x430 [btrfs]
> [ 472.590296] ? node_tag_clear+0xf4/0x160
> [ 472.591292] btrfs_free_log+0x2c/0x60 [btrfs]
> [ 472.592468] commit_fs_roots+0x1e0/0x440 [btrfs]
> [ 472.593569] ? __lock_release.isra.0+0x14e/0x510
> [ 472.594470] ? percpu_up_read+0xe0/0xe0 [btrfs]
> [ 472.595496] ? btrfs_run_delayed_refs+0xf6/0x180 [btrfs]
> [ 472.596758] ? btrfs_assert_delayed_root_empty+0x2d/0xd0 [btrfs]
> [ 472.598077] ? btrfs_commit_transaction+0x932/0x1580 [btrfs]
> [ 472.599437] btrfs_commit_transaction+0x94e/0x1580 [btrfs]
> [ 472.600845] ? cleanup_transaction+0x650/0x650 [btrfs]
> [ 472.602164] ? preempt_count_sub+0x18/0xc0
> [ 472.603111] ? __rcu_read_unlock+0x67/0x90
> [ 472.604011] ? preempt_count_add+0x71/0xd0
> [ 472.604840] ? preempt_count_sub+0x18/0xc0
> [ 472.605664] ? preempt_count_add+0x71/0xd0
> [ 472.606454] ? preempt_count_sub+0x18/0xc0
> [ 472.607251] ? __up_write+0x125/0x300
> [ 472.608059] btrfs_sync_file+0x794/0xa20 [btrfs]
> [ 472.609105] ? start_ordered_ops.constprop.0+0xd0/0xd0 [btrfs]
> [ 472.610392] ? mark_held_locks+0x1a/0x80
> [ 472.611179] __x64_sys_fdatasync+0x70/0xb0
> [ 472.614054] do_syscall_64+0x3d/0x90
> [ 472.614584] entry_SYSCALL_64_after_hwframe+0x46/0xb0
> [ 472.615522] RIP: 0033:0x7f9bbbe983d4
>
> 115 void extent_io_tree_release(struct extent_io_tree *tree)
> 116 {
> 117 struct extent_state *state;
> 118 struct extent_state *tmp;
> 119
> 120 spin_lock(&tree->lock);
> 121 rbtree_postorder_for_each_entry_safe(state, tmp, &tree->state, rb_node) {
> 122 /* Clear node to keep free_extent_state() happy. */
> 123 RB_CLEAR_NODE(&state->rb_node);
> 124 ASSERT(!(state->state & EXTENT_LOCKED));
> 125 /*
> 126 * No need for a memory barrier here, as we are holding the tree
> 127 * lock and we only change the waitqueue while holding that lock
> 128 * (see wait_extent_bit()).
> 129 */
> 130 ASSERT(!waitqueue_active(&state->wq));
> 131 free_extent_state(state);
> 132 cond_resched();
> ^^^
>
> should be cond_resched_lock() as it's under spinlock but then I'm not
> sure if relocking is still safe in the middle of the tree traversal.
cond_resched_lock() works here under the assumption that at this point no other
tasks access the tree (as documented in an earlier patch) - if the
assumption is broken in
the future, then another task can access a node that was freed before
the cond_resched_lock()
call and result in a use-after-free or other weird issues.
But I think it's best to remove the cond_resched() call. I removed it
and then added it again
but didn't think properly and ran on a vm without several debugging
configs turned on, likely
why I didn't get that splat.
Do you want me to send a v2 with the cond_resched() line deleted?
Thanks.
>
> 133 }
> 134 tree->state = RB_ROOT;
> 135 spin_unlock(&tree->lock);
> 136 }
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 0/8] btrfs: some speedups for io tree operations and cleanups
2023-09-22 18:43 ` Filipe Manana
@ 2023-09-22 19:08 ` David Sterba
2023-09-22 19:13 ` Filipe Manana
1 sibling, 0 replies; 20+ messages in thread
From: David Sterba @ 2023-09-22 19:08 UTC (permalink / raw)
To: Filipe Manana; +Cc: dsterba, linux-btrfs
On Fri, Sep 22, 2023 at 07:43:24PM +0100, Filipe Manana wrote:
> On Fri, Sep 22, 2023 at 7:24 PM David Sterba <dsterba@suse.cz> wrote:
> >
> > On Fri, Sep 22, 2023 at 11:39:01AM +0100, fdmanana@kernel.org wrote:
> > > From: Filipe Manana <fdmanana@suse.com>
> > >
> > > These are some changes to make some io tree operations more efficient and
> > > some cleanups. More details on the changelogs.
> > >
> > > Filipe Manana (8):
> > > btrfs: make extent state merges more efficient during insertions
> > > btrfs: update stale comment at extent_io_tree_release()
> > > btrfs: make wait_extent_bit() static
> > > btrfs: remove pointless memory barrier from extent_io_tree_release()
> > > btrfs: collapse wait_on_state() to its caller wait_extent_bit()
> > > btrfs: make extent_io_tree_release() more efficient
> > > btrfs: use extent_io_tree_release() to empty dirty log pages
> > > btrfs: make sure we cache next state in find_first_extent_bit()
> >
> > I see a lot of reports like:
> >
> > btrfs/004 [16:14:23][ 468.732077] run fstests btrfs/004 at 2023-09-22 16:14:24
> > [ 470.921989] BTRFS: device fsid f7d57de2-899a-4b33-b77a-084058ac36e9 devid 1 transid 11 /dev/vda scanned by mount (31993)
> > [ 470.926438] BTRFS info (device vda): using crc32c (crc32c-generic) checksum algorithm
> > [ 470.928013] BTRFS info (device vda): using free space tree
> > [ 470.952723] BTRFS info (device vda): auto enabling async discard
> > [ 472.385556] BTRFS: device fsid 097a012d-8e9b-4bd8-960d-87577821cbbe devid 1 transid 6 /dev/vdb scanned by mount (32061)
> > [ 472.395192] BTRFS info (device vdb): using crc32c (crc32c-generic) checksum algorithm
> > [ 472.398895] BTRFS info (device vdb): using free space tree
> > [ 472.438755] BTRFS info (device vdb): auto enabling async discard
> > [ 472.440900] BTRFS info (device vdb): checking UUID tree
> > [ 472.534254] BUG: sleeping function called from invalid context at fs/btrfs/extent-io-tree.c:132
> > [ 472.539305] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 32079, name: fsstress
> > [ 472.542685] preempt_count: 1, expected: 0
> > [ 472.543641] RCU nest depth: 0, expected: 0
> > [ 472.544778] 6 locks held by fsstress/32079:
> > [ 472.546916] #0: ffff888017ad4648 (sb_internal#2){.+.+}-{0:0}, at: btrfs_sync_file+0x594/0xa20 [btrfs]
> > [ 472.551474] #1: ffff888014c72790 (btrfs_trans_completed){.+.+}-{0:0}, at: btrfs_commit_transaction+0x840/0x1580 [btrfs]
> > [ 472.556334] #2: ffff888014c72760 (btrfs_trans_super_committed){.+.+}-{0:0}, at: btrfs_commit_transaction+0x840/0x1580 [btrfs]
> > [ 472.561372] #3: ffff888014c72730 (btrfs_trans_unblocked){++++}-{0:0}, at: btrfs_commit_transaction+0x840/0x1580 [btrfs]
> > [ 472.565793] #4: ffff888014c70de0 (&fs_info->reloc_mutex){+.+.}-{3:3}, at: btrfs_commit_transaction+0x8ed/0x1580 [btrfs]
> > [ 472.569931] #5: ffff88802ec1c248 (&tree->lock#2){+.+.}-{2:2}, at: extent_io_tree_release+0x1c/0x120 [btrfs]
> > [ 472.573099] Preemption disabled at:
> > [ 472.573110] [<0000000000000000>] 0x0
> > [ 472.575200] CPU: 0 PID: 32079 Comm: fsstress Not tainted 6.6.0-rc2-default+ #2197
> > [ 472.577440] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-3-gd478f380-rebuilt.opensuse.org 04/01/2014
> > [ 472.580902] Call Trace:
> > [ 472.581738] <TASK>
> > [ 472.582480] dump_stack_lvl+0x60/0x70
> > [ 472.583621] __might_resched+0x224/0x360
> > [ 472.584591] extent_io_tree_release+0xa5/0x120 [btrfs]
> > [ 472.585864] free_log_tree+0xec/0x250 [btrfs]
> > [ 472.587007] ? walk_log_tree+0x4a0/0x4a0 [btrfs]
> > [ 472.588116] ? reacquire_held_locks+0x280/0x280
> > [ 472.589104] ? btrfs_log_holes+0x430/0x430 [btrfs]
> > [ 472.590296] ? node_tag_clear+0xf4/0x160
> > [ 472.591292] btrfs_free_log+0x2c/0x60 [btrfs]
> > [ 472.592468] commit_fs_roots+0x1e0/0x440 [btrfs]
> > [ 472.593569] ? __lock_release.isra.0+0x14e/0x510
> > [ 472.594470] ? percpu_up_read+0xe0/0xe0 [btrfs]
> > [ 472.595496] ? btrfs_run_delayed_refs+0xf6/0x180 [btrfs]
> > [ 472.596758] ? btrfs_assert_delayed_root_empty+0x2d/0xd0 [btrfs]
> > [ 472.598077] ? btrfs_commit_transaction+0x932/0x1580 [btrfs]
> > [ 472.599437] btrfs_commit_transaction+0x94e/0x1580 [btrfs]
> > [ 472.600845] ? cleanup_transaction+0x650/0x650 [btrfs]
> > [ 472.602164] ? preempt_count_sub+0x18/0xc0
> > [ 472.603111] ? __rcu_read_unlock+0x67/0x90
> > [ 472.604011] ? preempt_count_add+0x71/0xd0
> > [ 472.604840] ? preempt_count_sub+0x18/0xc0
> > [ 472.605664] ? preempt_count_add+0x71/0xd0
> > [ 472.606454] ? preempt_count_sub+0x18/0xc0
> > [ 472.607251] ? __up_write+0x125/0x300
> > [ 472.608059] btrfs_sync_file+0x794/0xa20 [btrfs]
> > [ 472.609105] ? start_ordered_ops.constprop.0+0xd0/0xd0 [btrfs]
> > [ 472.610392] ? mark_held_locks+0x1a/0x80
> > [ 472.611179] __x64_sys_fdatasync+0x70/0xb0
> > [ 472.614054] do_syscall_64+0x3d/0x90
> > [ 472.614584] entry_SYSCALL_64_after_hwframe+0x46/0xb0
> > [ 472.615522] RIP: 0033:0x7f9bbbe983d4
> >
> > 115 void extent_io_tree_release(struct extent_io_tree *tree)
> > 116 {
> > 117 struct extent_state *state;
> > 118 struct extent_state *tmp;
> > 119
> > 120 spin_lock(&tree->lock);
> > 121 rbtree_postorder_for_each_entry_safe(state, tmp, &tree->state, rb_node) {
> > 122 /* Clear node to keep free_extent_state() happy. */
> > 123 RB_CLEAR_NODE(&state->rb_node);
> > 124 ASSERT(!(state->state & EXTENT_LOCKED));
> > 125 /*
> > 126 * No need for a memory barrier here, as we are holding the tree
> > 127 * lock and we only change the waitqueue while holding that lock
> > 128 * (see wait_extent_bit()).
> > 129 */
> > 130 ASSERT(!waitqueue_active(&state->wq));
> > 131 free_extent_state(state);
> > 132 cond_resched();
> > ^^^
> >
> > should be cond_resched_lock() as it's under spinlock but then I'm not
> > sure if relocking is still safe in the middle of the tree traversal.
>
> cond_resched_lock() works here under the assumption that at this point no other
> tasks access the tree (as documented in an earlier patch) - if the
> assumption is broken in
> the future, then another task can access a node that was freed before
> the cond_resched_lock()
> call and result in a use-after-free or other weird issues.
>
> But I think it's best to remove the cond_resched() call. I removed it
> and then added it again
> but didn't think properly and ran on a vm without several debugging
> configs turned on, likely
> why I didn't get that splat.
>
> Do you want me to send a v2 with the cond_resched() line deleted?
No need to resend for that.
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 0/8] btrfs: some speedups for io tree operations and cleanups
2023-09-22 18:43 ` Filipe Manana
2023-09-22 19:08 ` David Sterba
@ 2023-09-22 19:13 ` Filipe Manana
2023-09-22 19:15 ` David Sterba
1 sibling, 1 reply; 20+ messages in thread
From: Filipe Manana @ 2023-09-22 19:13 UTC (permalink / raw)
To: dsterba; +Cc: linux-btrfs
On Fri, Sep 22, 2023 at 07:43:24PM +0100, Filipe Manana wrote:
> On Fri, Sep 22, 2023 at 7:24 PM David Sterba <dsterba@suse.cz> wrote:
> >
> > On Fri, Sep 22, 2023 at 11:39:01AM +0100, fdmanana@kernel.org wrote:
> > > From: Filipe Manana <fdmanana@suse.com>
> > >
> > > These are some changes to make some io tree operations more efficient and
> > > some cleanups. More details on the changelogs.
> > >
> > > Filipe Manana (8):
> > > btrfs: make extent state merges more efficient during insertions
> > > btrfs: update stale comment at extent_io_tree_release()
> > > btrfs: make wait_extent_bit() static
> > > btrfs: remove pointless memory barrier from extent_io_tree_release()
> > > btrfs: collapse wait_on_state() to its caller wait_extent_bit()
> > > btrfs: make extent_io_tree_release() more efficient
> > > btrfs: use extent_io_tree_release() to empty dirty log pages
> > > btrfs: make sure we cache next state in find_first_extent_bit()
> >
> > I see a lot of reports like:
> >
> > btrfs/004 [16:14:23][ 468.732077] run fstests btrfs/004 at 2023-09-22 16:14:24
> > [ 470.921989] BTRFS: device fsid f7d57de2-899a-4b33-b77a-084058ac36e9 devid 1 transid 11 /dev/vda scanned by mount (31993)
> > [ 470.926438] BTRFS info (device vda): using crc32c (crc32c-generic) checksum algorithm
> > [ 470.928013] BTRFS info (device vda): using free space tree
> > [ 470.952723] BTRFS info (device vda): auto enabling async discard
> > [ 472.385556] BTRFS: device fsid 097a012d-8e9b-4bd8-960d-87577821cbbe devid 1 transid 6 /dev/vdb scanned by mount (32061)
> > [ 472.395192] BTRFS info (device vdb): using crc32c (crc32c-generic) checksum algorithm
> > [ 472.398895] BTRFS info (device vdb): using free space tree
> > [ 472.438755] BTRFS info (device vdb): auto enabling async discard
> > [ 472.440900] BTRFS info (device vdb): checking UUID tree
> > [ 472.534254] BUG: sleeping function called from invalid context at fs/btrfs/extent-io-tree.c:132
> > [ 472.539305] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 32079, name: fsstress
> > [ 472.542685] preempt_count: 1, expected: 0
> > [ 472.543641] RCU nest depth: 0, expected: 0
> > [ 472.544778] 6 locks held by fsstress/32079:
> > [ 472.546916] #0: ffff888017ad4648 (sb_internal#2){.+.+}-{0:0}, at: btrfs_sync_file+0x594/0xa20 [btrfs]
> > [ 472.551474] #1: ffff888014c72790 (btrfs_trans_completed){.+.+}-{0:0}, at: btrfs_commit_transaction+0x840/0x1580 [btrfs]
> > [ 472.556334] #2: ffff888014c72760 (btrfs_trans_super_committed){.+.+}-{0:0}, at: btrfs_commit_transaction+0x840/0x1580 [btrfs]
> > [ 472.561372] #3: ffff888014c72730 (btrfs_trans_unblocked){++++}-{0:0}, at: btrfs_commit_transaction+0x840/0x1580 [btrfs]
> > [ 472.565793] #4: ffff888014c70de0 (&fs_info->reloc_mutex){+.+.}-{3:3}, at: btrfs_commit_transaction+0x8ed/0x1580 [btrfs]
> > [ 472.569931] #5: ffff88802ec1c248 (&tree->lock#2){+.+.}-{2:2}, at: extent_io_tree_release+0x1c/0x120 [btrfs]
> > [ 472.573099] Preemption disabled at:
> > [ 472.573110] [<0000000000000000>] 0x0
> > [ 472.575200] CPU: 0 PID: 32079 Comm: fsstress Not tainted 6.6.0-rc2-default+ #2197
> > [ 472.577440] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-3-gd478f380-rebuilt.opensuse.org 04/01/2014
> > [ 472.580902] Call Trace:
> > [ 472.581738] <TASK>
> > [ 472.582480] dump_stack_lvl+0x60/0x70
> > [ 472.583621] __might_resched+0x224/0x360
> > [ 472.584591] extent_io_tree_release+0xa5/0x120 [btrfs]
> > [ 472.585864] free_log_tree+0xec/0x250 [btrfs]
> > [ 472.587007] ? walk_log_tree+0x4a0/0x4a0 [btrfs]
> > [ 472.588116] ? reacquire_held_locks+0x280/0x280
> > [ 472.589104] ? btrfs_log_holes+0x430/0x430 [btrfs]
> > [ 472.590296] ? node_tag_clear+0xf4/0x160
> > [ 472.591292] btrfs_free_log+0x2c/0x60 [btrfs]
> > [ 472.592468] commit_fs_roots+0x1e0/0x440 [btrfs]
> > [ 472.593569] ? __lock_release.isra.0+0x14e/0x510
> > [ 472.594470] ? percpu_up_read+0xe0/0xe0 [btrfs]
> > [ 472.595496] ? btrfs_run_delayed_refs+0xf6/0x180 [btrfs]
> > [ 472.596758] ? btrfs_assert_delayed_root_empty+0x2d/0xd0 [btrfs]
> > [ 472.598077] ? btrfs_commit_transaction+0x932/0x1580 [btrfs]
> > [ 472.599437] btrfs_commit_transaction+0x94e/0x1580 [btrfs]
> > [ 472.600845] ? cleanup_transaction+0x650/0x650 [btrfs]
> > [ 472.602164] ? preempt_count_sub+0x18/0xc0
> > [ 472.603111] ? __rcu_read_unlock+0x67/0x90
> > [ 472.604011] ? preempt_count_add+0x71/0xd0
> > [ 472.604840] ? preempt_count_sub+0x18/0xc0
> > [ 472.605664] ? preempt_count_add+0x71/0xd0
> > [ 472.606454] ? preempt_count_sub+0x18/0xc0
> > [ 472.607251] ? __up_write+0x125/0x300
> > [ 472.608059] btrfs_sync_file+0x794/0xa20 [btrfs]
> > [ 472.609105] ? start_ordered_ops.constprop.0+0xd0/0xd0 [btrfs]
> > [ 472.610392] ? mark_held_locks+0x1a/0x80
> > [ 472.611179] __x64_sys_fdatasync+0x70/0xb0
> > [ 472.614054] do_syscall_64+0x3d/0x90
> > [ 472.614584] entry_SYSCALL_64_after_hwframe+0x46/0xb0
> > [ 472.615522] RIP: 0033:0x7f9bbbe983d4
> >
> > 115 void extent_io_tree_release(struct extent_io_tree *tree)
> > 116 {
> > 117 struct extent_state *state;
> > 118 struct extent_state *tmp;
> > 119
> > 120 spin_lock(&tree->lock);
> > 121 rbtree_postorder_for_each_entry_safe(state, tmp, &tree->state, rb_node) {
> > 122 /* Clear node to keep free_extent_state() happy. */
> > 123 RB_CLEAR_NODE(&state->rb_node);
> > 124 ASSERT(!(state->state & EXTENT_LOCKED));
> > 125 /*
> > 126 * No need for a memory barrier here, as we are holding the tree
> > 127 * lock and we only change the waitqueue while holding that lock
> > 128 * (see wait_extent_bit()).
> > 129 */
> > 130 ASSERT(!waitqueue_active(&state->wq));
> > 131 free_extent_state(state);
> > 132 cond_resched();
> > ^^^
> >
> > should be cond_resched_lock() as it's under spinlock but then I'm not
> > sure if relocking is still safe in the middle of the tree traversal.
>
> cond_resched_lock() works here under the assumption that at this point no other
> tasks access the tree (as documented in an earlier patch) - if the
> assumption is broken in
> the future, then another task can access a node that was freed before
> the cond_resched_lock()
> call and result in a use-after-free or other weird issues.
>
> But I think it's best to remove the cond_resched() call. I removed it
> and then added it again
> but didn't think properly and ran on a vm without several debugging
> configs turned on, likely
> why I didn't get that splat.
>
> Do you want me to send a v2 with the cond_resched() line deleted?
Better, we can keep the cond_resched_lock() safely with this incremental on
top of the patch:
diff --git a/fs/btrfs/extent-io-tree.c b/fs/btrfs/extent-io-tree.c
index 6e3645afaa38..32788fb7837e 100644
--- a/fs/btrfs/extent-io-tree.c
+++ b/fs/btrfs/extent-io-tree.c
@@ -114,11 +114,14 @@ void extent_io_tree_init(struct btrfs_fs_info *fs_info,
*/
void extent_io_tree_release(struct extent_io_tree *tree)
{
+ struct rb_root root;
struct extent_state *state;
struct extent_state *tmp;
spin_lock(&tree->lock);
- rbtree_postorder_for_each_entry_safe(state, tmp, &tree->state, rb_node) {
+ root = tree->state;
+ tree->state = RB_ROOT;
+ rbtree_postorder_for_each_entry_safe(state, tmp, &root, rb_node) {
/* Clear node to keep free_extent_state() happy. */
RB_CLEAR_NODE(&state->rb_node);
ASSERT(!(state->state & EXTENT_LOCKED));
@@ -129,9 +132,13 @@ void extent_io_tree_release(struct extent_io_tree *tree)
*/
ASSERT(!waitqueue_active(&state->wq));
free_extent_state(state);
- cond_resched();
+ cond_resched_lock(&tree->lock);
}
- tree->state = RB_ROOT;
+ /*
+ * Should still be empty even after a reschedule, no other task should
+ * be accessing the tree anymore.
+ */
+ ASSERT(RB_EMPTY_ROOT(&tree->state));
spin_unlock(&tree->lock);
}
Also pasted at: https://pastebin.com/raw/uVMP2e5b
Let me know if you prefer I send a v2 or squash this patch.
Thanks.
>
> Thanks.
>
> >
> > 133 }
> > 134 tree->state = RB_ROOT;
> > 135 spin_unlock(&tree->lock);
> > 136 }
^ permalink raw reply related [flat|nested] 20+ messages in thread
* Re: [PATCH 0/8] btrfs: some speedups for io tree operations and cleanups
2023-09-22 19:13 ` Filipe Manana
@ 2023-09-22 19:15 ` David Sterba
0 siblings, 0 replies; 20+ messages in thread
From: David Sterba @ 2023-09-22 19:15 UTC (permalink / raw)
To: Filipe Manana; +Cc: dsterba, linux-btrfs
On Fri, Sep 22, 2023 at 08:13:50PM +0100, Filipe Manana wrote:
> On Fri, Sep 22, 2023 at 07:43:24PM +0100, Filipe Manana wrote:
> > On Fri, Sep 22, 2023 at 7:24 PM David Sterba <dsterba@suse.cz> wrote:
> > >
> > > On Fri, Sep 22, 2023 at 11:39:01AM +0100, fdmanana@kernel.org wrote:
> > > > From: Filipe Manana <fdmanana@suse.com>
> > > >
> > > > These are some changes to make some io tree operations more efficient and
> > > > some cleanups. More details on the changelogs.
> > > >
> > > > Filipe Manana (8):
> > > > btrfs: make extent state merges more efficient during insertions
> > > > btrfs: update stale comment at extent_io_tree_release()
> > > > btrfs: make wait_extent_bit() static
> > > > btrfs: remove pointless memory barrier from extent_io_tree_release()
> > > > btrfs: collapse wait_on_state() to its caller wait_extent_bit()
> > > > btrfs: make extent_io_tree_release() more efficient
> > > > btrfs: use extent_io_tree_release() to empty dirty log pages
> > > > btrfs: make sure we cache next state in find_first_extent_bit()
> > >
> > > I see a lot of reports like:
> > >
> > > btrfs/004 [16:14:23][ 468.732077] run fstests btrfs/004 at 2023-09-22 16:14:24
> > > [ 470.921989] BTRFS: device fsid f7d57de2-899a-4b33-b77a-084058ac36e9 devid 1 transid 11 /dev/vda scanned by mount (31993)
> > > [ 470.926438] BTRFS info (device vda): using crc32c (crc32c-generic) checksum algorithm
> > > [ 470.928013] BTRFS info (device vda): using free space tree
> > > [ 470.952723] BTRFS info (device vda): auto enabling async discard
> > > [ 472.385556] BTRFS: device fsid 097a012d-8e9b-4bd8-960d-87577821cbbe devid 1 transid 6 /dev/vdb scanned by mount (32061)
> > > [ 472.395192] BTRFS info (device vdb): using crc32c (crc32c-generic) checksum algorithm
> > > [ 472.398895] BTRFS info (device vdb): using free space tree
> > > [ 472.438755] BTRFS info (device vdb): auto enabling async discard
> > > [ 472.440900] BTRFS info (device vdb): checking UUID tree
> > > [ 472.534254] BUG: sleeping function called from invalid context at fs/btrfs/extent-io-tree.c:132
> > > [ 472.539305] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 32079, name: fsstress
> > > [ 472.542685] preempt_count: 1, expected: 0
> > > [ 472.543641] RCU nest depth: 0, expected: 0
> > > [ 472.544778] 6 locks held by fsstress/32079:
> > > [ 472.546916] #0: ffff888017ad4648 (sb_internal#2){.+.+}-{0:0}, at: btrfs_sync_file+0x594/0xa20 [btrfs]
> > > [ 472.551474] #1: ffff888014c72790 (btrfs_trans_completed){.+.+}-{0:0}, at: btrfs_commit_transaction+0x840/0x1580 [btrfs]
> > > [ 472.556334] #2: ffff888014c72760 (btrfs_trans_super_committed){.+.+}-{0:0}, at: btrfs_commit_transaction+0x840/0x1580 [btrfs]
> > > [ 472.561372] #3: ffff888014c72730 (btrfs_trans_unblocked){++++}-{0:0}, at: btrfs_commit_transaction+0x840/0x1580 [btrfs]
> > > [ 472.565793] #4: ffff888014c70de0 (&fs_info->reloc_mutex){+.+.}-{3:3}, at: btrfs_commit_transaction+0x8ed/0x1580 [btrfs]
> > > [ 472.569931] #5: ffff88802ec1c248 (&tree->lock#2){+.+.}-{2:2}, at: extent_io_tree_release+0x1c/0x120 [btrfs]
> > > [ 472.573099] Preemption disabled at:
> > > [ 472.573110] [<0000000000000000>] 0x0
> > > [ 472.575200] CPU: 0 PID: 32079 Comm: fsstress Not tainted 6.6.0-rc2-default+ #2197
> > > [ 472.577440] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-3-gd478f380-rebuilt.opensuse.org 04/01/2014
> > > [ 472.580902] Call Trace:
> > > [ 472.581738] <TASK>
> > > [ 472.582480] dump_stack_lvl+0x60/0x70
> > > [ 472.583621] __might_resched+0x224/0x360
> > > [ 472.584591] extent_io_tree_release+0xa5/0x120 [btrfs]
> > > [ 472.585864] free_log_tree+0xec/0x250 [btrfs]
> > > [ 472.587007] ? walk_log_tree+0x4a0/0x4a0 [btrfs]
> > > [ 472.588116] ? reacquire_held_locks+0x280/0x280
> > > [ 472.589104] ? btrfs_log_holes+0x430/0x430 [btrfs]
> > > [ 472.590296] ? node_tag_clear+0xf4/0x160
> > > [ 472.591292] btrfs_free_log+0x2c/0x60 [btrfs]
> > > [ 472.592468] commit_fs_roots+0x1e0/0x440 [btrfs]
> > > [ 472.593569] ? __lock_release.isra.0+0x14e/0x510
> > > [ 472.594470] ? percpu_up_read+0xe0/0xe0 [btrfs]
> > > [ 472.595496] ? btrfs_run_delayed_refs+0xf6/0x180 [btrfs]
> > > [ 472.596758] ? btrfs_assert_delayed_root_empty+0x2d/0xd0 [btrfs]
> > > [ 472.598077] ? btrfs_commit_transaction+0x932/0x1580 [btrfs]
> > > [ 472.599437] btrfs_commit_transaction+0x94e/0x1580 [btrfs]
> > > [ 472.600845] ? cleanup_transaction+0x650/0x650 [btrfs]
> > > [ 472.602164] ? preempt_count_sub+0x18/0xc0
> > > [ 472.603111] ? __rcu_read_unlock+0x67/0x90
> > > [ 472.604011] ? preempt_count_add+0x71/0xd0
> > > [ 472.604840] ? preempt_count_sub+0x18/0xc0
> > > [ 472.605664] ? preempt_count_add+0x71/0xd0
> > > [ 472.606454] ? preempt_count_sub+0x18/0xc0
> > > [ 472.607251] ? __up_write+0x125/0x300
> > > [ 472.608059] btrfs_sync_file+0x794/0xa20 [btrfs]
> > > [ 472.609105] ? start_ordered_ops.constprop.0+0xd0/0xd0 [btrfs]
> > > [ 472.610392] ? mark_held_locks+0x1a/0x80
> > > [ 472.611179] __x64_sys_fdatasync+0x70/0xb0
> > > [ 472.614054] do_syscall_64+0x3d/0x90
> > > [ 472.614584] entry_SYSCALL_64_after_hwframe+0x46/0xb0
> > > [ 472.615522] RIP: 0033:0x7f9bbbe983d4
> > >
> > > 115 void extent_io_tree_release(struct extent_io_tree *tree)
> > > 116 {
> > > 117 struct extent_state *state;
> > > 118 struct extent_state *tmp;
> > > 119
> > > 120 spin_lock(&tree->lock);
> > > 121 rbtree_postorder_for_each_entry_safe(state, tmp, &tree->state, rb_node) {
> > > 122 /* Clear node to keep free_extent_state() happy. */
> > > 123 RB_CLEAR_NODE(&state->rb_node);
> > > 124 ASSERT(!(state->state & EXTENT_LOCKED));
> > > 125 /*
> > > 126 * No need for a memory barrier here, as we are holding the tree
> > > 127 * lock and we only change the waitqueue while holding that lock
> > > 128 * (see wait_extent_bit()).
> > > 129 */
> > > 130 ASSERT(!waitqueue_active(&state->wq));
> > > 131 free_extent_state(state);
> > > 132 cond_resched();
> > > ^^^
> > >
> > > should be cond_resched_lock() as it's under spinlock but then I'm not
> > > sure if relocking is still safe in the middle of the tree traversal.
> >
> > cond_resched_lock() works here under the assumption that at this point no other
> > tasks access the tree (as documented in an earlier patch) - if the
> > assumption is broken in
> > the future, then another task can access a node that was freed before
> > the cond_resched_lock()
> > call and result in a use-after-free or other weird issues.
> >
> > But I think it's best to remove the cond_resched() call. I removed it
> > and then added it again
> > but didn't think properly and ran on a vm without several debugging
> > configs turned on, likely
> > why I didn't get that splat.
> >
> > Do you want me to send a v2 with the cond_resched() line deleted?
>
> Better, we can keep the cond_resched_lock() safely with this incremental on
> top of the patch:
>
> diff --git a/fs/btrfs/extent-io-tree.c b/fs/btrfs/extent-io-tree.c
> index 6e3645afaa38..32788fb7837e 100644
> --- a/fs/btrfs/extent-io-tree.c
> +++ b/fs/btrfs/extent-io-tree.c
> @@ -114,11 +114,14 @@ void extent_io_tree_init(struct btrfs_fs_info *fs_info,
> */
> void extent_io_tree_release(struct extent_io_tree *tree)
> {
> + struct rb_root root;
> struct extent_state *state;
> struct extent_state *tmp;
>
> spin_lock(&tree->lock);
> - rbtree_postorder_for_each_entry_safe(state, tmp, &tree->state, rb_node) {
> + root = tree->state;
> + tree->state = RB_ROOT;
> + rbtree_postorder_for_each_entry_safe(state, tmp, &root, rb_node) {
> /* Clear node to keep free_extent_state() happy. */
> RB_CLEAR_NODE(&state->rb_node);
> ASSERT(!(state->state & EXTENT_LOCKED));
> @@ -129,9 +132,13 @@ void extent_io_tree_release(struct extent_io_tree *tree)
> */
> ASSERT(!waitqueue_active(&state->wq));
> free_extent_state(state);
> - cond_resched();
> + cond_resched_lock(&tree->lock);
> }
> - tree->state = RB_ROOT;
> + /*
> + * Should still be empty even after a reschedule, no other task should
> + * be accessing the tree anymore.
> + */
> + ASSERT(RB_EMPTY_ROOT(&tree->state));
> spin_unlock(&tree->lock);
> }
>
> Also pasted at: https://pastebin.com/raw/uVMP2e5b
>
> Let me know if you prefer I send a v2 or squash this patch.
No problem with the fixups, this is easier for me. Good that we can keep
the cond_resched, thanks.
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 3/8] btrfs: make wait_extent_bit() static
2023-09-22 10:39 ` [PATCH 3/8] btrfs: make wait_extent_bit() static fdmanana
@ 2023-09-22 22:49 ` Anand Jain
0 siblings, 0 replies; 20+ messages in thread
From: Anand Jain @ 2023-09-22 22:49 UTC (permalink / raw)
To: fdmanana, linux-btrfs
On 22/09/2023 18:39, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> The function wait_extent_bit() is not used outside extent-io-tree.c so
> make it static. Furthermore the function doesn't have the 'btrfs_' prefix.
>
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
LGTM
Reviewed-by: Anand Jain <anand.jain@oracle.com>
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 5/8] btrfs: collapse wait_on_state() to its caller wait_extent_bit()
2023-09-22 10:39 ` [PATCH 5/8] btrfs: collapse wait_on_state() to its caller wait_extent_bit() fdmanana
@ 2023-09-22 22:54 ` Anand Jain
0 siblings, 0 replies; 20+ messages in thread
From: Anand Jain @ 2023-09-22 22:54 UTC (permalink / raw)
To: fdmanana, linux-btrfs
On 22/09/2023 18:39, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> The wait_on_state() function is very short and has a single caller, which
> is wait_extent_bit(), so remove the function and put its code into the
> caller.
>
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
LGTM
Reviewed-by: Anand Jain <anand.jain@oracle.com>
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 7/8] btrfs: use extent_io_tree_release() to empty dirty log pages
2023-09-22 10:39 ` [PATCH 7/8] btrfs: use extent_io_tree_release() to empty dirty log pages fdmanana
@ 2023-09-26 6:25 ` kernel test robot
2023-09-26 16:27 ` David Sterba
0 siblings, 1 reply; 20+ messages in thread
From: kernel test robot @ 2023-09-26 6:25 UTC (permalink / raw)
To: fdmanana; +Cc: oe-lkp, lkp, linux-btrfs, ltp, oliver.sang
Hello,
kernel test robot noticed "BUG:sleeping_function_called_from_invalid_context_at_fs/btrfs/extent-io-tree.c" on:
commit: 84b23544b95acd2e4c05fc473816d19b749fe17b ("[PATCH 7/8] btrfs: use extent_io_tree_release() to empty dirty log pages")
url: https://github.com/intel-lab-lkp/linux/commits/fdmanana-kernel-org/btrfs-make-extent-state-merges-more-efficient-during-insertions/20230922-184038
base: https://git.kernel.org/cgit/linux/kernel/git/kdave/linux.git for-next
patch link: https://lore.kernel.org/all/459c0d25abdfecdc7c57192fa656c6abda11af31.1695333278.git.fdmanana@suse.com/
patch subject: [PATCH 7/8] btrfs: use extent_io_tree_release() to empty dirty log pages
in testcase: ltp
version: ltp-x86_64-14c1f76-1_20230715
with following parameters:
disk: 1HDD
fs: btrfs
test: ltp-aiodio.part2-01
compiler: gcc-12
test machine: 4 threads 1 sockets Intel(R) Core(TM) i3-3220 CPU @ 3.30GHz (Ivy Bridge) with 8G memory
(please refer to attached dmesg/kmsg for entire log/backtrace)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202309261438.d1bebb50-oliver.sang@intel.com
kern :err : [ 179.301033] BUG: sleeping function called from invalid context at fs/btrfs/extent-io-tree.c:132
kern :err : [ 179.301247] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 3670, name: aiodio_sparse
kern :err : [ 179.301432] preempt_count: 1, expected: 0
kern :err : [ 179.301544] RCU nest depth: 0, expected: 0
kern :warn : [ 179.301708] CPU: 1 PID: 3670 Comm: aiodio_sparse Not tainted 6.6.0-rc2-00143-g84b23544b95a #1
kern :warn : [ 179.301924] Hardware name: Hewlett-Packard p6-1451cx/2ADA, BIOS 8.15 02/05/2013
kern :warn : [ 179.302126] Call Trace:
kern :warn : [ 179.302242] <TASK>
kern :warn : [ 179.302351] dump_stack_lvl (lib/dump_stack.c:107 (discriminator 1))
kern :warn : [ 179.302489] __might_resched (kernel/sched/core.c:10188)
kern :warn : [ 179.302630] ? preempt_notifier_dec (kernel/sched/core.c:10142)
kern :warn : [ 179.302781] ? extent_io_tree_release (fs/btrfs/extent-io-tree.c:132) btrfs
kern :warn : [ 179.303082] ? walk_down_log_tree (fs/btrfs/tree-log.c:2711) btrfs
kern :warn : [ 179.303368] extent_io_tree_release (include/linux/sched.h:2097 fs/btrfs/extent-io-tree.c:132) btrfs
kern :warn : [ 179.303657] free_log_tree (fs/btrfs/tree-log.c:3214) btrfs
kern :warn : [ 179.303922] ? walk_log_tree (fs/btrfs/tree-log.c:3173) btrfs
kern :warn : [ 179.304123] ? free_log_tree (fs/btrfs/tree-log.c:330) btrfs
kern :warn : [ 179.304323] btrfs_free_log (fs/btrfs/tree-log.c:3227) btrfs
kern :warn : [ 179.304519] commit_fs_roots (fs/btrfs/transaction.c:1539) btrfs
kern :warn : [ 179.304721] ? btrfs_read_block_groups (fs/btrfs/block-group.c:2668) btrfs
kern :warn : [ 179.304934] ? __btrfs_run_delayed_refs (fs/btrfs/extent-tree.c:2135) btrfs
kern :warn : [ 179.305142] ? trace_btrfs_transaction_commit (fs/btrfs/transaction.c:1501) btrfs
kern :warn : [ 179.305357] btrfs_commit_transaction (fs/btrfs/transaction.c:2508 (discriminator 3)) btrfs
kern :warn : [ 179.305566] ? cleanup_transaction (fs/btrfs/transaction.c:2243) btrfs
kern :warn : [ 179.305769] ? dput (fs/dcache.c:900)
kern :warn : [ 179.305879] ? dget_parent (fs/dcache.c:971)
kern :warn : [ 179.305988] btrfs_sync_file (fs/btrfs/file.c:1987) btrfs
kern :warn : [ 179.306184] ? start_ordered_ops+0x100/0x100 btrfs
kern :warn : [ 179.306400] ? find_vma (mm/mmap.c:1853)
kern :warn : [ 179.306506] ? find_vma_intersection (mm/mmap.c:1853)
kern :warn : [ 179.306630] __do_sys_msync (mm/msync.c:97)
kern :warn : [ 179.306742] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
kern :warn : [ 179.306852] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)
kern :warn : [ 179.306986] RIP: 0033:0x7f3b83d45740
kern :warn : [ 179.307093] Code: 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 80 3d a1 8e 0d 00 00 74 17 b8 1a 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 48 83 ec 28 89 54
All code
========
0: 00 f7 add %dh,%bh
2: d8 64 89 01 fsubs 0x1(%rcx,%rcx,4)
6: 48 83 c8 ff or $0xffffffffffffffff,%rax
a: c3 retq
b: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
12: 00 00 00
15: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
1a: 80 3d a1 8e 0d 00 00 cmpb $0x0,0xd8ea1(%rip) # 0xd8ec2
21: 74 17 je 0x3a
23: b8 1a 00 00 00 mov $0x1a,%eax
28: 0f 05 syscall
2a:* 48 3d 00 f0 ff ff cmp $0xfffffffffffff000,%rax <-- trapping instruction
30: 77 58 ja 0x8a
32: c3 retq
33: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
3a: 48 83 ec 28 sub $0x28,%rsp
3e: 89 .byte 0x89
3f: 54 push %rsp
Code starting with the faulting instruction
===========================================
0: 48 3d 00 f0 ff ff cmp $0xfffffffffffff000,%rax
6: 77 58 ja 0x60
8: c3 retq
9: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
10: 48 83 ec 28 sub $0x28,%rsp
14: 89 .byte 0x89
15: 54 push %rsp
kern :warn : [ 179.307439] RSP: 002b:00007ffdafb85f58 EFLAGS: 00000202 ORIG_RAX: 000000000000001a
kern :warn : [ 179.307615] RAX: ffffffffffffffda RBX: 0000000006400000 RCX: 00007f3b83d45740
kern :warn : [ 179.307777] RDX: 0000000000000004 RSI: 0000000006400000 RDI: 00007f3b7d800000
kern :warn : [ 179.307938] RBP: 00007f3b7d800000 R08: 0000000000000004 R09: 0000000000000000
kern :warn : [ 179.308098] R10: 00007f3b83c50168 R11: 0000000000000202 R12: 0000000000000004
kern :warn : [ 179.308287] R13: 000055b86e53d033 R14: 000055b86e53d140 R15: 0000000000000000
kern :warn : [ 179.308454] </TASK>
The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20230926/202309261438.d1bebb50-oliver.sang@intel.com
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 7/8] btrfs: use extent_io_tree_release() to empty dirty log pages
2023-09-26 6:25 ` kernel test robot
@ 2023-09-26 16:27 ` David Sterba
0 siblings, 0 replies; 20+ messages in thread
From: David Sterba @ 2023-09-26 16:27 UTC (permalink / raw)
To: kernel test robot; +Cc: fdmanana, oe-lkp, lkp, linux-btrfs, ltp
On Tue, Sep 26, 2023 at 02:25:06PM +0800, kernel test robot wrote:
> Hello,
>
> kernel test robot noticed "BUG:sleeping_function_called_from_invalid_context_at_fs/btrfs/extent-io-tree.c" on:
>
> commit: 84b23544b95acd2e4c05fc473816d19b749fe17b ("[PATCH 7/8] btrfs: use extent_io_tree_release() to empty dirty log pages")
> url: https://github.com/intel-lab-lkp/linux/commits/fdmanana-kernel-org/btrfs-make-extent-state-merges-more-efficient-during-insertions/20230922-184038
> base: https://git.kernel.org/cgit/linux/kernel/git/kdave/linux.git for-next
> patch link: https://lore.kernel.org/all/459c0d25abdfecdc7c57192fa656c6abda11af31.1695333278.git.fdmanana@suse.com/
> patch subject: [PATCH 7/8] btrfs: use extent_io_tree_release() to empty dirty log pages
Known and fixed in the git branch.
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 4/8] btrfs: remove pointless memory barrier from extent_io_tree_release()
2023-09-22 10:39 ` [PATCH 4/8] btrfs: remove pointless memory barrier from extent_io_tree_release() fdmanana
@ 2023-09-29 15:25 ` David Sterba
0 siblings, 0 replies; 20+ messages in thread
From: David Sterba @ 2023-09-29 15:25 UTC (permalink / raw)
To: fdmanana; +Cc: linux-btrfs
On Fri, Sep 22, 2023 at 11:39:05AM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> The memory barrier at extent_io_tree_release() is pointless because:
>
> 1) We have just called spin_lock() and that implies a memory barrier;
Spin lock ins not a full memory barrier, only half (one direction of
updates), full memory barrier is implied by spin_unlock-spin_lock
sequence, so I this reasoning is a bit imprecise.
>
> 2) We only change the waitqueue of an extent state record while holding
> the tree lock - see wait_on_state()
This looks like the real reason why it can be reomved. What
waitqueue_active cares about are updates that must not be lost between
CPUs for the wakeup.
Here the updates under spin lock mean that there's always the
unlock/lock sequence when one thread does the updates (and unlocks),
then the other one (fist locks) and then checks.
> So remove the memory barrier.
Yeah, it's redundant, I would not call it pointless because there's a
legit reason in connection with waitqueue_active().
I'll reword the changelog along the reasoning above.
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 0/8] btrfs: some speedups for io tree operations and cleanups
2023-09-22 10:39 [PATCH 0/8] btrfs: some speedups for io tree operations and cleanups fdmanana
` (8 preceding siblings ...)
2023-09-22 18:18 ` [PATCH 0/8] btrfs: some speedups for io tree operations and cleanups David Sterba
@ 2023-09-29 16:47 ` David Sterba
9 siblings, 0 replies; 20+ messages in thread
From: David Sterba @ 2023-09-29 16:47 UTC (permalink / raw)
To: fdmanana; +Cc: linux-btrfs
On Fri, Sep 22, 2023 at 11:39:01AM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> These are some changes to make some io tree operations more efficient and
> some cleanups. More details on the changelogs.
>
> Filipe Manana (8):
> btrfs: make extent state merges more efficient during insertions
> btrfs: update stale comment at extent_io_tree_release()
> btrfs: make wait_extent_bit() static
> btrfs: remove pointless memory barrier from extent_io_tree_release()
> btrfs: collapse wait_on_state() to its caller wait_extent_bit()
> btrfs: make extent_io_tree_release() more efficient
> btrfs: use extent_io_tree_release() to empty dirty log pages
> btrfs: make sure we cache next state in find_first_extent_bit()
Added to misc-next, thanks. The commit message for patch 4 was changed.
^ permalink raw reply [flat|nested] 20+ messages in thread
end of thread, other threads:[~2023-09-29 16:53 UTC | newest]
Thread overview: 20+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-09-22 10:39 [PATCH 0/8] btrfs: some speedups for io tree operations and cleanups fdmanana
2023-09-22 10:39 ` [PATCH 1/8] btrfs: make extent state merges more efficient during insertions fdmanana
2023-09-22 10:39 ` [PATCH 2/8] btrfs: update stale comment at extent_io_tree_release() fdmanana
2023-09-22 10:39 ` [PATCH 3/8] btrfs: make wait_extent_bit() static fdmanana
2023-09-22 22:49 ` Anand Jain
2023-09-22 10:39 ` [PATCH 4/8] btrfs: remove pointless memory barrier from extent_io_tree_release() fdmanana
2023-09-29 15:25 ` David Sterba
2023-09-22 10:39 ` [PATCH 5/8] btrfs: collapse wait_on_state() to its caller wait_extent_bit() fdmanana
2023-09-22 22:54 ` Anand Jain
2023-09-22 10:39 ` [PATCH 6/8] btrfs: make extent_io_tree_release() more efficient fdmanana
2023-09-22 10:39 ` [PATCH 7/8] btrfs: use extent_io_tree_release() to empty dirty log pages fdmanana
2023-09-26 6:25 ` kernel test robot
2023-09-26 16:27 ` David Sterba
2023-09-22 10:39 ` [PATCH 8/8] btrfs: make sure we cache next state in find_first_extent_bit() fdmanana
2023-09-22 18:18 ` [PATCH 0/8] btrfs: some speedups for io tree operations and cleanups David Sterba
2023-09-22 18:43 ` Filipe Manana
2023-09-22 19:08 ` David Sterba
2023-09-22 19:13 ` Filipe Manana
2023-09-22 19:15 ` David Sterba
2023-09-29 16:47 ` David Sterba
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox