* [PATCH 0/2] btrfs: some updates to the dev replace finishing part
From: fdmanana @ 2024-07-29 14:51 UTC
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Trivial changes, details in the change logs.

Filipe Manana (2):
  btrfs: reschedule when updating chunk maps at the end of a device replace
  btrfs: more efficient chunk map iteration when device replace finishes

 fs/btrfs/dev-replace.c | 43 ++++++++++++++++++++++++++++++++----------
 1 file changed, 33 insertions(+), 10 deletions(-)

-- 
2.43.0



* [PATCH 1/2] btrfs: reschedule when updating chunk maps at the end of a device replace
From: fdmanana @ 2024-07-29 14:51 UTC
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

At the end of a device replace we must go over all the chunk maps and
update their stripes to point to the target device instead of the source
device. We iterate over the chunk maps while holding a write lock and we
never reschedule, which can result in monopolizing a CPU for too long and
in blocking readers for too long (the mapping tree is protected by a rw
spinlock, so readers spin, without sleeping, while we hold it for writing).

So improve on this by rescheduling when necessary. This is safe because at
this point we are holding the chunk mutex, which means no new chunks can be
allocated, and therefore we don't risk missing a new chunk map that covers
a range before the last one we processed when we dropped the lock to
reschedule.
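
For reference, here is a minimal sketch (an approximation of the semantics,
not the kernel's exact implementation) of what cond_resched_rwlock_write()
does: if a reschedule is due, it drops the write lock, yields the CPU and
then reacquires the lock, returning nonzero when it did reschedule:

  /* Sketch only: approximates cond_resched_rwlock_write()'s behaviour. */
  static inline int cond_resched_rwlock_write_sketch(rwlock_t *lock)
  {
  	int rescheduled = 0;

  	if (need_resched()) {
  		/* Drop the lock so spinning readers can make progress. */
  		write_unlock(lock);
  		cond_resched();
  		write_lock(lock);
  		rescheduled = 1;
  	}
  	return rescheduled;
  }

The nonzero return value matters: it tells the caller that the lock was
dropped and any pointers into the protected structure may be stale.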

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/dev-replace.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index f638c458d285..20cf5e95f2bc 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -827,6 +827,14 @@ static void btrfs_dev_replace_update_device_in_mapping_tree(
 	u64 start = 0;
 	int i;
 
+	/*
+	 * The chunk mutex must be held so that no new chunks can be created
+	 * while we are updating existing chunks. This guarantees we don't miss
+	 * any new chunk that gets created for a range that falls before the
+	 * range of the last chunk we processed.
+	 */
+	lockdep_assert_held(&fs_info->chunk_mutex);
+
 	write_lock(&fs_info->mapping_tree_lock);
 	do {
 		struct btrfs_chunk_map *map;
@@ -839,6 +847,7 @@ static void btrfs_dev_replace_update_device_in_mapping_tree(
 				map->stripes[i].dev = tgtdev;
 		start = map->start + map->chunk_len;
 		btrfs_free_chunk_map(map);
+		cond_resched_rwlock_write(&fs_info->mapping_tree_lock);
 	} while (start);
 	write_unlock(&fs_info->mapping_tree_lock);
 }
-- 
2.43.0



* [PATCH 2/2] btrfs: more efficient chunk map iteration when device replace finishes
From: fdmanana @ 2024-07-29 14:51 UTC
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

When iterating the chunk maps after a device replace finishes, we do a full
rbtree search for each chunk map, which is not the most efficient way to do
it and wastes CPU time. As we are holding a write lock on the tree during
the whole iteration, we can simply start from the first node in the tree
and move to the next chunk map with a rb_next() call. The only exception is
when we need to reschedule, in which case we have to do a full rbtree
search again, since we dropped the write lock and the tree may have changed
in the meantime (chunk maps may have been removed and the tree may have
been rebalanced). So just do that.
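
To illustrate the resulting pattern, here is a simplified sketch (with a
hypothetical item type and a hypothetical lookup_first_from() helper, not
the btrfs code itself): iterate with rb_next() while the write lock is
held, and redo a full tree search only when the lock was dropped to
reschedule:

  struct item {
  	u64 start;
  	u64 len;
  	struct rb_node rb_node;
  };

  static void update_all_items(struct rb_root_cached *tree, rwlock_t *lock)
  {
  	struct rb_node *node;

  	write_lock(lock);
  	node = rb_first_cached(tree);	/* O(1), the cached leftmost node */
  	while (node) {
  		struct rb_node *next = rb_next(node);
  		struct item *it = rb_entry(node, struct item, rb_node);
  		u64 next_start = it->start + it->len;

  		/* ... update 'it' while the lock still pins it ... */

  		if (cond_resched_rwlock_write(lock)) {
  			/*
  			 * The lock was dropped, so 'next' may be stale.
  			 * Redo a full O(log n) search starting at
  			 * next_start (lookup_first_from() is hypothetical).
  			 */
  			node = lookup_first_from(tree, next_start);
  		} else {
  			node = next;
  		}
  	}
  	write_unlock(lock);
  }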

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/dev-replace.c | 36 +++++++++++++++++++++++++-----------
 1 file changed, 25 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 20cf5e95f2bc..83d5cdd77f29 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -824,8 +824,7 @@ static void btrfs_dev_replace_update_device_in_mapping_tree(
 						struct btrfs_device *srcdev,
 						struct btrfs_device *tgtdev)
 {
-	u64 start = 0;
-	int i;
+	struct rb_node *node;
 
 	/*
 	 * The chunk mutex must be held so that no new chunks can be created
@@ -836,19 +835,34 @@ static void btrfs_dev_replace_update_device_in_mapping_tree(
 	lockdep_assert_held(&fs_info->chunk_mutex);
 
 	write_lock(&fs_info->mapping_tree_lock);
-	do {
+	node = rb_first_cached(&fs_info->mapping_tree);
+	while (node) {
+		struct rb_node *next = rb_next(node);
 		struct btrfs_chunk_map *map;
+		u64 next_start;
 
-		map = btrfs_find_chunk_map_nolock(fs_info, start, U64_MAX);
-		if (!map)
-			break;
-		for (i = 0; i < map->num_stripes; i++)
+		map = rb_entry(node, struct btrfs_chunk_map, rb_node);
+		next_start = map->start + map->chunk_len;
+
+		for (int i = 0; i < map->num_stripes; i++)
 			if (srcdev == map->stripes[i].dev)
 				map->stripes[i].dev = tgtdev;
-		start = map->start + map->chunk_len;
-		btrfs_free_chunk_map(map);
-		cond_resched_rwlock_write(&fs_info->mapping_tree_lock);
-	} while (start);
+
+		if (cond_resched_rwlock_write(&fs_info->mapping_tree_lock)) {
+			map = btrfs_find_chunk_map_nolock(fs_info, next_start, U64_MAX);
+			if (!map)
+				break;
+			node = &map->rb_node;
+			/*
+			 * Drop the lookup reference since we are holding the
+			 * lock in write mode and no one can remove the chunk
+			 * map from the tree and drop its tree reference.
+			 */
+			btrfs_free_chunk_map(map);
+		} else {
+			node = next;
+		}
+	}
 	write_unlock(&fs_info->mapping_tree_lock);
 }
 
-- 
2.43.0



* Re: [PATCH 0/2] btrfs: some updates to the dev replace finishing part
From: David Sterba @ 2024-07-30 13:36 UTC
  To: fdmanana; +Cc: linux-btrfs

On Mon, Jul 29, 2024 at 03:51:21PM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> Trivial changes, details in the change logs.
> 
> Filipe Manana (2):
>   btrfs: reschedule when updating chunk maps at the end of a device replace
>   btrfs: more efficient chunk map iteration when device replace finishes

Reviewed-by: David Sterba <dsterba@suse.com>

