* [PATCH 1/5] btrfs: add and use helper to remove extent map from its inode's tree
2024-09-24 10:45 [PATCH 0/5] btrfs: make extent map shrinker more efficient and re-enable it fdmanana
@ 2024-09-24 10:45 ` fdmanana
2024-09-25 22:00 ` Qu Wenruo
2024-09-24 10:45 ` [PATCH 2/5] btrfs: make the extent map shrinker run asynchronously as a work queue job fdmanana
` (5 subsequent siblings)
6 siblings, 1 reply; 19+ messages in thread
From: fdmanana @ 2024-09-24 10:45 UTC (permalink / raw)
To: linux-btrfs
From: Filipe Manana <fdmanana@suse.com>
Move the common code to remove an extent map from its inode's tree into a
helper function and use it, reducing duplicated code.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
fs/btrfs/extent_map.c | 18 +++++++-----------
1 file changed, 7 insertions(+), 11 deletions(-)
diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
index 25d191f1ac10..cb2a6f5dce2b 100644
--- a/fs/btrfs/extent_map.c
+++ b/fs/btrfs/extent_map.c
@@ -77,10 +77,13 @@ static u64 range_end(u64 start, u64 len)
return start + len;
}
-static void dec_evictable_extent_maps(struct btrfs_inode *inode)
+static void remove_em(struct btrfs_inode *inode, struct extent_map *em)
{
struct btrfs_fs_info *fs_info = inode->root->fs_info;
+ rb_erase(&em->rb_node, &inode->extent_tree.root);
+ RB_CLEAR_NODE(&em->rb_node);
+
if (!btrfs_is_testing(fs_info) && is_fstree(btrfs_root_id(inode->root)))
percpu_counter_dec(&fs_info->evictable_extent_maps);
}
@@ -333,7 +336,6 @@ static void validate_extent_map(struct btrfs_fs_info *fs_info, struct extent_map
static void try_merge_map(struct btrfs_inode *inode, struct extent_map *em)
{
struct btrfs_fs_info *fs_info = inode->root->fs_info;
- struct extent_map_tree *tree = &inode->extent_tree;
struct extent_map *merge = NULL;
struct rb_node *rb;
@@ -365,10 +367,8 @@ static void try_merge_map(struct btrfs_inode *inode, struct extent_map *em)
em->flags |= EXTENT_FLAG_MERGED;
validate_extent_map(fs_info, em);
- rb_erase(&merge->rb_node, &tree->root);
- RB_CLEAR_NODE(&merge->rb_node);
+ remove_em(inode, merge);
free_extent_map(merge);
- dec_evictable_extent_maps(inode);
}
}
@@ -380,12 +380,10 @@ static void try_merge_map(struct btrfs_inode *inode, struct extent_map *em)
if (em->disk_bytenr < EXTENT_MAP_LAST_BYTE)
merge_ondisk_extents(em, merge);
validate_extent_map(fs_info, em);
- rb_erase(&merge->rb_node, &tree->root);
- RB_CLEAR_NODE(&merge->rb_node);
em->generation = max(em->generation, merge->generation);
em->flags |= EXTENT_FLAG_MERGED;
+ remove_em(inode, merge);
free_extent_map(merge);
- dec_evictable_extent_maps(inode);
}
}
@@ -582,12 +580,10 @@ void remove_extent_mapping(struct btrfs_inode *inode, struct extent_map *em)
lockdep_assert_held_write(&tree->lock);
WARN_ON(em->flags & EXTENT_FLAG_PINNED);
- rb_erase(&em->rb_node, &tree->root);
if (!(em->flags & EXTENT_FLAG_LOGGING))
list_del_init(&em->list);
- RB_CLEAR_NODE(&em->rb_node);
- dec_evictable_extent_maps(inode);
+ remove_em(inode, em);
}
static void replace_extent_mapping(struct btrfs_inode *inode,
--
2.43.0
^ permalink raw reply related [flat|nested] 19+ messages in thread* Re: [PATCH 1/5] btrfs: add and use helper to remove extent map from its inode's tree
2024-09-24 10:45 ` [PATCH 1/5] btrfs: add and use helper to remove extent map from its inode's tree fdmanana
@ 2024-09-25 22:00 ` Qu Wenruo
0 siblings, 0 replies; 19+ messages in thread
From: Qu Wenruo @ 2024-09-25 22:00 UTC (permalink / raw)
To: fdmanana, linux-btrfs
在 2024/9/24 20:15, fdmanana@kernel.org 写道:
> From: Filipe Manana <fdmanana@suse.com>
>
> Move the common code to remove an extent map from its inode's tree into a
> helper function and use it, reducing duplicated code.
>
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Thanks,
Qu
> ---
> fs/btrfs/extent_map.c | 18 +++++++-----------
> 1 file changed, 7 insertions(+), 11 deletions(-)
>
> diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
> index 25d191f1ac10..cb2a6f5dce2b 100644
> --- a/fs/btrfs/extent_map.c
> +++ b/fs/btrfs/extent_map.c
> @@ -77,10 +77,13 @@ static u64 range_end(u64 start, u64 len)
> return start + len;
> }
>
> -static void dec_evictable_extent_maps(struct btrfs_inode *inode)
> +static void remove_em(struct btrfs_inode *inode, struct extent_map *em)
> {
> struct btrfs_fs_info *fs_info = inode->root->fs_info;
>
> + rb_erase(&em->rb_node, &inode->extent_tree.root);
> + RB_CLEAR_NODE(&em->rb_node);
> +
> if (!btrfs_is_testing(fs_info) && is_fstree(btrfs_root_id(inode->root)))
> percpu_counter_dec(&fs_info->evictable_extent_maps);
> }
> @@ -333,7 +336,6 @@ static void validate_extent_map(struct btrfs_fs_info *fs_info, struct extent_map
> static void try_merge_map(struct btrfs_inode *inode, struct extent_map *em)
> {
> struct btrfs_fs_info *fs_info = inode->root->fs_info;
> - struct extent_map_tree *tree = &inode->extent_tree;
> struct extent_map *merge = NULL;
> struct rb_node *rb;
>
> @@ -365,10 +367,8 @@ static void try_merge_map(struct btrfs_inode *inode, struct extent_map *em)
> em->flags |= EXTENT_FLAG_MERGED;
>
> validate_extent_map(fs_info, em);
> - rb_erase(&merge->rb_node, &tree->root);
> - RB_CLEAR_NODE(&merge->rb_node);
> + remove_em(inode, merge);
> free_extent_map(merge);
> - dec_evictable_extent_maps(inode);
> }
> }
>
> @@ -380,12 +380,10 @@ static void try_merge_map(struct btrfs_inode *inode, struct extent_map *em)
> if (em->disk_bytenr < EXTENT_MAP_LAST_BYTE)
> merge_ondisk_extents(em, merge);
> validate_extent_map(fs_info, em);
> - rb_erase(&merge->rb_node, &tree->root);
> - RB_CLEAR_NODE(&merge->rb_node);
> em->generation = max(em->generation, merge->generation);
> em->flags |= EXTENT_FLAG_MERGED;
> + remove_em(inode, merge);
> free_extent_map(merge);
> - dec_evictable_extent_maps(inode);
> }
> }
>
> @@ -582,12 +580,10 @@ void remove_extent_mapping(struct btrfs_inode *inode, struct extent_map *em)
> lockdep_assert_held_write(&tree->lock);
>
> WARN_ON(em->flags & EXTENT_FLAG_PINNED);
> - rb_erase(&em->rb_node, &tree->root);
> if (!(em->flags & EXTENT_FLAG_LOGGING))
> list_del_init(&em->list);
> - RB_CLEAR_NODE(&em->rb_node);
>
> - dec_evictable_extent_maps(inode);
> + remove_em(inode, em);
> }
>
> static void replace_extent_mapping(struct btrfs_inode *inode,
^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH 2/5] btrfs: make the extent map shrinker run asynchronously as a work queue job
2024-09-24 10:45 [PATCH 0/5] btrfs: make extent map shrinker more efficient and re-enable it fdmanana
2024-09-24 10:45 ` [PATCH 1/5] btrfs: add and use helper to remove extent map from its inode's tree fdmanana
@ 2024-09-24 10:45 ` fdmanana
2024-09-25 22:08 ` Qu Wenruo
2024-09-24 10:45 ` [PATCH 3/5] btrfs: simplify tracking progress for the extent map shrinker fdmanana
` (4 subsequent siblings)
6 siblings, 1 reply; 19+ messages in thread
From: fdmanana @ 2024-09-24 10:45 UTC (permalink / raw)
To: linux-btrfs
From: Filipe Manana <fdmanana@suse.com>
Currently the extent map shrinker is run synchronously for kswapd tasks
that end up calling the fs shrinker (fs/super.c:super_cache_scan()).
This has some disadvantages and for some heavy workloads with memory
pressure it can cause some delays and stalls that make a machine
unresponsive for some periods. This happens because:
1) We can have several kswapd tasks on machines with multiple NUMA zones,
and running the extent map shrinker concurrently can cause high
contention on some spin locks, namely the spin locks that protect
the radix tree that tracks roots, the per root xarray that tracks
open inodes and the list of delayed iputs. This not only delays the
shrinker but also causes high CPU consumption and makes the task
running the shrinker monopolize a core, resulting in the symptoms
of an unresponsive system. This was noted in previous commits such as
commit ae1e766f623f ("btrfs: only run the extent map shrinker from
kswapd tasks");
2) The extent map shrinker's iteration over inodes can often be slow, even
after changing the data structure that tracks open inodes for a root
from a red black tree (up to kernel 6.10) to an xarray (kernel 6.10+).
While the transition to the xarray made things a bit faster, it's
still somewhat slow - for example in a test scenario with 10000 inodes
that have no extent maps loaded, the extent map shrinker took between
5ms to 8ms, using a release, non-debug kernel. Iterating over the
extent maps of an inode can also be slow if we have an inode with many
thousands of extent maps, since we use a red black tree to track and
search extent maps. So having the extent map shrinker run synchronously
adds extra delay for other things a kswapd task does.
So make the extent map shrinker run asynchronously as a job for the
system unbounded workqueue, just like what we do for data and metadata
space reclaim jobs.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
fs/btrfs/disk-io.c | 2 ++
fs/btrfs/extent_map.c | 51 ++++++++++++++++++++++++++++++++++++-------
fs/btrfs/extent_map.h | 3 ++-
fs/btrfs/fs.h | 2 ++
fs/btrfs/super.c | 13 +++--------
5 files changed, 52 insertions(+), 19 deletions(-)
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 25d768e67e37..2148147c5257 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2786,6 +2786,7 @@ void btrfs_init_fs_info(struct btrfs_fs_info *fs_info)
btrfs_init_scrub(fs_info);
btrfs_init_balance(fs_info);
btrfs_init_async_reclaim_work(fs_info);
+ btrfs_init_extent_map_shrinker_work(fs_info);
rwlock_init(&fs_info->block_group_cache_lock);
fs_info->block_group_cache_tree = RB_ROOT_CACHED;
@@ -4283,6 +4284,7 @@ void __cold close_ctree(struct btrfs_fs_info *fs_info)
cancel_work_sync(&fs_info->async_reclaim_work);
cancel_work_sync(&fs_info->async_data_reclaim_work);
cancel_work_sync(&fs_info->preempt_reclaim_work);
+ cancel_work_sync(&fs_info->extent_map_shrinker_work);
/* Cancel or finish ongoing discard work */
btrfs_discard_cleanup(fs_info);
diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
index cb2a6f5dce2b..e2eeb94aa349 100644
--- a/fs/btrfs/extent_map.c
+++ b/fs/btrfs/extent_map.c
@@ -1118,7 +1118,8 @@ struct btrfs_em_shrink_ctx {
static long btrfs_scan_inode(struct btrfs_inode *inode, struct btrfs_em_shrink_ctx *ctx)
{
- const u64 cur_fs_gen = btrfs_get_fs_generation(inode->root->fs_info);
+ struct btrfs_fs_info *fs_info = inode->root->fs_info;
+ const u64 cur_fs_gen = btrfs_get_fs_generation(fs_info);
struct extent_map_tree *tree = &inode->extent_tree;
long nr_dropped = 0;
struct rb_node *node;
@@ -1191,7 +1192,8 @@ static long btrfs_scan_inode(struct btrfs_inode *inode, struct btrfs_em_shrink_c
* lock. This is to avoid slowing other tasks trying to take the
* lock.
*/
- if (need_resched() || rwlock_needbreak(&tree->lock))
+ if (need_resched() || rwlock_needbreak(&tree->lock) ||
+ btrfs_fs_closing(fs_info))
break;
node = next;
}
@@ -1215,7 +1217,8 @@ static long btrfs_scan_root(struct btrfs_root *root, struct btrfs_em_shrink_ctx
ctx->last_ino = btrfs_ino(inode);
btrfs_add_delayed_iput(inode);
- if (ctx->scanned >= ctx->nr_to_scan)
+ if (ctx->scanned >= ctx->nr_to_scan ||
+ btrfs_fs_closing(inode->root->fs_info))
break;
cond_resched();
@@ -1244,16 +1247,19 @@ static long btrfs_scan_root(struct btrfs_root *root, struct btrfs_em_shrink_ctx
return nr_dropped;
}
-long btrfs_free_extent_maps(struct btrfs_fs_info *fs_info, long nr_to_scan)
+static void btrfs_extent_map_shrinker_worker(struct work_struct *work)
{
+ struct btrfs_fs_info *fs_info;
struct btrfs_em_shrink_ctx ctx;
u64 start_root_id;
u64 next_root_id;
bool cycled = false;
long nr_dropped = 0;
+ fs_info = container_of(work, struct btrfs_fs_info, extent_map_shrinker_work);
+
ctx.scanned = 0;
- ctx.nr_to_scan = nr_to_scan;
+ ctx.nr_to_scan = atomic64_read(&fs_info->extent_map_shrinker_nr_to_scan);
/*
* In case we have multiple tasks running this shrinker, make the next
@@ -1271,12 +1277,12 @@ long btrfs_free_extent_maps(struct btrfs_fs_info *fs_info, long nr_to_scan)
if (trace_btrfs_extent_map_shrinker_scan_enter_enabled()) {
s64 nr = percpu_counter_sum_positive(&fs_info->evictable_extent_maps);
- trace_btrfs_extent_map_shrinker_scan_enter(fs_info, nr_to_scan,
+ trace_btrfs_extent_map_shrinker_scan_enter(fs_info, ctx.nr_to_scan,
nr, ctx.last_root,
ctx.last_ino);
}
- while (ctx.scanned < ctx.nr_to_scan) {
+ while (ctx.scanned < ctx.nr_to_scan && !btrfs_fs_closing(fs_info)) {
struct btrfs_root *root;
unsigned long count;
@@ -1334,5 +1340,34 @@ long btrfs_free_extent_maps(struct btrfs_fs_info *fs_info, long nr_to_scan)
ctx.last_ino);
}
- return nr_dropped;
+ atomic64_set(&fs_info->extent_map_shrinker_nr_to_scan, 0);
+}
+
+void btrfs_free_extent_maps(struct btrfs_fs_info *fs_info, long nr_to_scan)
+{
+ /*
+ * Do nothing if the shrinker is already running. In case of high memory
+ * pressure we can have a lot of tasks calling us and all passing the
+ * same nr_to_scan value, but in reality we may need only to free
+ * nr_to_scan extent maps (or less). In case we need to free more than
+ * that, we will be called again by the fs shrinker, so no worries about
+ * not doing enough work to reclaim memory from extent maps.
+ * We can also be repeatedly called with the same nr_to_scan value
+ * simply because the shrinker runs asynchronously and multiple calls
+ * to this function are made before the shrinker does enough progress.
+ *
+ * That's why we set the atomic counter to nr_to_scan only if its
+ * current value is zero, instead of incrementing the counter by
+ * nr_to_scan.
+ */
+ if (atomic64_cmpxchg(&fs_info->extent_map_shrinker_nr_to_scan, 0, nr_to_scan) != 0)
+ return;
+
+ queue_work(system_unbound_wq, &fs_info->extent_map_shrinker_work);
+}
+
+void btrfs_init_extent_map_shrinker_work(struct btrfs_fs_info *fs_info)
+{
+ atomic64_set(&fs_info->extent_map_shrinker_nr_to_scan, 0);
+ INIT_WORK(&fs_info->extent_map_shrinker_work, btrfs_extent_map_shrinker_worker);
}
diff --git a/fs/btrfs/extent_map.h b/fs/btrfs/extent_map.h
index 5154a8f1d26c..cd123b266b64 100644
--- a/fs/btrfs/extent_map.h
+++ b/fs/btrfs/extent_map.h
@@ -189,6 +189,7 @@ void btrfs_drop_extent_map_range(struct btrfs_inode *inode,
int btrfs_replace_extent_map_range(struct btrfs_inode *inode,
struct extent_map *new_em,
bool modified);
-long btrfs_free_extent_maps(struct btrfs_fs_info *fs_info, long nr_to_scan);
+void btrfs_free_extent_maps(struct btrfs_fs_info *fs_info, long nr_to_scan);
+void btrfs_init_extent_map_shrinker_work(struct btrfs_fs_info *fs_info);
#endif
diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
index 785ec15c1b84..a246d8dc0b20 100644
--- a/fs/btrfs/fs.h
+++ b/fs/btrfs/fs.h
@@ -638,6 +638,8 @@ struct btrfs_fs_info {
spinlock_t extent_map_shrinker_lock;
u64 extent_map_shrinker_last_root;
u64 extent_map_shrinker_last_ino;
+ atomic64_t extent_map_shrinker_nr_to_scan;
+ struct work_struct extent_map_shrinker_work;
/* Protected by 'trans_lock'. */
struct list_head dirty_cowonly_roots;
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index e8a5bf4af918..e9e209dd8e05 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -28,7 +28,6 @@
#include <linux/btrfs.h>
#include <linux/security.h>
#include <linux/fs_parser.h>
-#include <linux/swap.h>
#include "messages.h"
#include "delayed-inode.h"
#include "ctree.h"
@@ -2416,16 +2415,10 @@ static long btrfs_free_cached_objects(struct super_block *sb, struct shrink_cont
const long nr_to_scan = min_t(unsigned long, LONG_MAX, sc->nr_to_scan);
struct btrfs_fs_info *fs_info = btrfs_sb(sb);
- /*
- * We may be called from any task trying to allocate memory and we don't
- * want to slow it down with scanning and dropping extent maps. It would
- * also cause heavy lock contention if many tasks concurrently enter
- * here. Therefore only allow kswapd tasks to scan and drop extent maps.
- */
- if (!current_is_kswapd())
- return 0;
+ btrfs_free_extent_maps(fs_info, nr_to_scan);
- return btrfs_free_extent_maps(fs_info, nr_to_scan);
+ /* The extent map shrinker runs asynchronously, so always return 0. */
+ return 0;
}
static const struct super_operations btrfs_super_ops = {
--
2.43.0
^ permalink raw reply related [flat|nested] 19+ messages in thread* Re: [PATCH 2/5] btrfs: make the extent map shrinker run asynchronously as a work queue job
2024-09-24 10:45 ` [PATCH 2/5] btrfs: make the extent map shrinker run asynchronously as a work queue job fdmanana
@ 2024-09-25 22:08 ` Qu Wenruo
2024-09-26 9:01 ` Filipe Manana
0 siblings, 1 reply; 19+ messages in thread
From: Qu Wenruo @ 2024-09-25 22:08 UTC (permalink / raw)
To: fdmanana, linux-btrfs
在 2024/9/24 20:15, fdmanana@kernel.org 写道:
> From: Filipe Manana <fdmanana@suse.com>
>
> Currently the extent map shrinker is run synchronously for kswapd tasks
> that end up calling the fs shrinker (fs/super.c:super_cache_scan()).
> This has some disadvantages and for some heavy workloads with memory
> pressure it can cause some delays and stalls that make a machine
> unresponsive for some periods. This happens because:
>
> 1) We can have several kswapd tasks on machines with multiple NUMA zones,
> and running the extent map shrinker concurrently can cause high
> contention on some spin locks, namely the spin locks that protect
> the radix tree that tracks roots, the per root xarray that tracks
> open inodes and the list of delayed iputs. This not only delays the
> shrinker but also causes high CPU consumption and makes the task
> running the shrinker monopolize a core, resulting in the symptoms
> of an unresponsive system. This was noted in previous commits such as
> commit ae1e766f623f ("btrfs: only run the extent map shrinker from
> kswapd tasks");
>
> 2) The extent map shrinker's iteration over inodes can often be slow, even
> after changing the data structure that tracks open inodes for a root
> from a red black tree (up to kernel 6.10) to an xarray (kernel 6.10+).
> While the transition to the xarray made things a bit faster, it's
> still somewhat slow - for example in a test scenario with 10000 inodes
> that have no extent maps loaded, the extent map shrinker took between
> 5ms to 8ms, using a release, non-debug kernel. Iterating over the
> extent maps of an inode can also be slow if we have an inode with many
> thousands of extent maps, since we use a red black tree to track and
> search extent maps. So having the extent map shrinker run synchronously
> adds extra delay for other things a kswapd task does.
>
> So make the extent map shrinker run asynchronously as a job for the
> system unbounded workqueue, just like what we do for data and metadata
> space reclaim jobs.
>
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
> fs/btrfs/disk-io.c | 2 ++
> fs/btrfs/extent_map.c | 51 ++++++++++++++++++++++++++++++++++++-------
> fs/btrfs/extent_map.h | 3 ++-
> fs/btrfs/fs.h | 2 ++
> fs/btrfs/super.c | 13 +++--------
> 5 files changed, 52 insertions(+), 19 deletions(-)
>
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 25d768e67e37..2148147c5257 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -2786,6 +2786,7 @@ void btrfs_init_fs_info(struct btrfs_fs_info *fs_info)
> btrfs_init_scrub(fs_info);
> btrfs_init_balance(fs_info);
> btrfs_init_async_reclaim_work(fs_info);
> + btrfs_init_extent_map_shrinker_work(fs_info);
>
> rwlock_init(&fs_info->block_group_cache_lock);
> fs_info->block_group_cache_tree = RB_ROOT_CACHED;
> @@ -4283,6 +4284,7 @@ void __cold close_ctree(struct btrfs_fs_info *fs_info)
> cancel_work_sync(&fs_info->async_reclaim_work);
> cancel_work_sync(&fs_info->async_data_reclaim_work);
> cancel_work_sync(&fs_info->preempt_reclaim_work);
> + cancel_work_sync(&fs_info->extent_map_shrinker_work);
>
> /* Cancel or finish ongoing discard work */
> btrfs_discard_cleanup(fs_info);
> diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
> index cb2a6f5dce2b..e2eeb94aa349 100644
> --- a/fs/btrfs/extent_map.c
> +++ b/fs/btrfs/extent_map.c
> @@ -1118,7 +1118,8 @@ struct btrfs_em_shrink_ctx {
>
> static long btrfs_scan_inode(struct btrfs_inode *inode, struct btrfs_em_shrink_ctx *ctx)
> {
> - const u64 cur_fs_gen = btrfs_get_fs_generation(inode->root->fs_info);
> + struct btrfs_fs_info *fs_info = inode->root->fs_info;
> + const u64 cur_fs_gen = btrfs_get_fs_generation(fs_info);
> struct extent_map_tree *tree = &inode->extent_tree;
> long nr_dropped = 0;
> struct rb_node *node;
> @@ -1191,7 +1192,8 @@ static long btrfs_scan_inode(struct btrfs_inode *inode, struct btrfs_em_shrink_c
> * lock. This is to avoid slowing other tasks trying to take the
> * lock.
> */
> - if (need_resched() || rwlock_needbreak(&tree->lock))
> + if (need_resched() || rwlock_needbreak(&tree->lock) ||
> + btrfs_fs_closing(fs_info))
> break;
> node = next;
> }
> @@ -1215,7 +1217,8 @@ static long btrfs_scan_root(struct btrfs_root *root, struct btrfs_em_shrink_ctx
> ctx->last_ino = btrfs_ino(inode);
> btrfs_add_delayed_iput(inode);
>
> - if (ctx->scanned >= ctx->nr_to_scan)
> + if (ctx->scanned >= ctx->nr_to_scan ||
> + btrfs_fs_closing(inode->root->fs_info))
> break;
>
> cond_resched();
> @@ -1244,16 +1247,19 @@ static long btrfs_scan_root(struct btrfs_root *root, struct btrfs_em_shrink_ctx
> return nr_dropped;
> }
>
> -long btrfs_free_extent_maps(struct btrfs_fs_info *fs_info, long nr_to_scan)
> +static void btrfs_extent_map_shrinker_worker(struct work_struct *work)
> {
> + struct btrfs_fs_info *fs_info;
> struct btrfs_em_shrink_ctx ctx;
> u64 start_root_id;
> u64 next_root_id;
> bool cycled = false;
> long nr_dropped = 0;
>
> + fs_info = container_of(work, struct btrfs_fs_info, extent_map_shrinker_work);
> +
> ctx.scanned = 0;
> - ctx.nr_to_scan = nr_to_scan;
> + ctx.nr_to_scan = atomic64_read(&fs_info->extent_map_shrinker_nr_to_scan);
>
> /*
> * In case we have multiple tasks running this shrinker, make the next
> @@ -1271,12 +1277,12 @@ long btrfs_free_extent_maps(struct btrfs_fs_info *fs_info, long nr_to_scan)
> if (trace_btrfs_extent_map_shrinker_scan_enter_enabled()) {
> s64 nr = percpu_counter_sum_positive(&fs_info->evictable_extent_maps);
>
> - trace_btrfs_extent_map_shrinker_scan_enter(fs_info, nr_to_scan,
> + trace_btrfs_extent_map_shrinker_scan_enter(fs_info, ctx.nr_to_scan,
> nr, ctx.last_root,
> ctx.last_ino);
> }
>
> - while (ctx.scanned < ctx.nr_to_scan) {
> + while (ctx.scanned < ctx.nr_to_scan && !btrfs_fs_closing(fs_info)) {
> struct btrfs_root *root;
> unsigned long count;
>
> @@ -1334,5 +1340,34 @@ long btrfs_free_extent_maps(struct btrfs_fs_info *fs_info, long nr_to_scan)
> ctx.last_ino);
> }
>
> - return nr_dropped;
> + atomic64_set(&fs_info->extent_map_shrinker_nr_to_scan, 0);
> +}
> +
> +void btrfs_free_extent_maps(struct btrfs_fs_info *fs_info, long nr_to_scan)
> +{
> + /*
> + * Do nothing if the shrinker is already running. In case of high memory
> + * pressure we can have a lot of tasks calling us and all passing the
> + * same nr_to_scan value, but in reality we may need only to free
> + * nr_to_scan extent maps (or less). In case we need to free more than
> + * that, we will be called again by the fs shrinker, so no worries about
> + * not doing enough work to reclaim memory from extent maps.
> + * We can also be repeatedly called with the same nr_to_scan value
> + * simply because the shrinker runs asynchronously and multiple calls
> + * to this function are made before the shrinker does enough progress.
> + *
> + * That's why we set the atomic counter to nr_to_scan only if its
> + * current value is zero, instead of incrementing the counter by
> + * nr_to_scan.
> + */
Since the shrinker can be called frequently, even if we only keep one
shrink work running, will the shrinker be kept running for a long time?
(one queued work done, then immediately queued again)
XFS is queuing the work with a delay, although their reason is that
the reclaim needs to be run more frequently than syncd interval (30s).
Do we also need some delay to prevent such overly frequent reclaim work?
Thanks,
Qu
> + if (atomic64_cmpxchg(&fs_info->extent_map_shrinker_nr_to_scan, 0, nr_to_scan) != 0)
> + return;
> +
> + queue_work(system_unbound_wq, &fs_info->extent_map_shrinker_work);
> +}
> +
> +void btrfs_init_extent_map_shrinker_work(struct btrfs_fs_info *fs_info)
> +{
> + atomic64_set(&fs_info->extent_map_shrinker_nr_to_scan, 0);
> + INIT_WORK(&fs_info->extent_map_shrinker_work, btrfs_extent_map_shrinker_worker);
> }
> diff --git a/fs/btrfs/extent_map.h b/fs/btrfs/extent_map.h
> index 5154a8f1d26c..cd123b266b64 100644
> --- a/fs/btrfs/extent_map.h
> +++ b/fs/btrfs/extent_map.h
> @@ -189,6 +189,7 @@ void btrfs_drop_extent_map_range(struct btrfs_inode *inode,
> int btrfs_replace_extent_map_range(struct btrfs_inode *inode,
> struct extent_map *new_em,
> bool modified);
> -long btrfs_free_extent_maps(struct btrfs_fs_info *fs_info, long nr_to_scan);
> +void btrfs_free_extent_maps(struct btrfs_fs_info *fs_info, long nr_to_scan);
> +void btrfs_init_extent_map_shrinker_work(struct btrfs_fs_info *fs_info);
>
> #endif
> diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
> index 785ec15c1b84..a246d8dc0b20 100644
> --- a/fs/btrfs/fs.h
> +++ b/fs/btrfs/fs.h
> @@ -638,6 +638,8 @@ struct btrfs_fs_info {
> spinlock_t extent_map_shrinker_lock;
> u64 extent_map_shrinker_last_root;
> u64 extent_map_shrinker_last_ino;
> + atomic64_t extent_map_shrinker_nr_to_scan;
> + struct work_struct extent_map_shrinker_work;
>
> /* Protected by 'trans_lock'. */
> struct list_head dirty_cowonly_roots;
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index e8a5bf4af918..e9e209dd8e05 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -28,7 +28,6 @@
> #include <linux/btrfs.h>
> #include <linux/security.h>
> #include <linux/fs_parser.h>
> -#include <linux/swap.h>
> #include "messages.h"
> #include "delayed-inode.h"
> #include "ctree.h"
> @@ -2416,16 +2415,10 @@ static long btrfs_free_cached_objects(struct super_block *sb, struct shrink_cont
> const long nr_to_scan = min_t(unsigned long, LONG_MAX, sc->nr_to_scan);
> struct btrfs_fs_info *fs_info = btrfs_sb(sb);
>
> - /*
> - * We may be called from any task trying to allocate memory and we don't
> - * want to slow it down with scanning and dropping extent maps. It would
> - * also cause heavy lock contention if many tasks concurrently enter
> - * here. Therefore only allow kswapd tasks to scan and drop extent maps.
> - */
> - if (!current_is_kswapd())
> - return 0;
> + btrfs_free_extent_maps(fs_info, nr_to_scan);
>
> - return btrfs_free_extent_maps(fs_info, nr_to_scan);
> + /* The extent map shrinker runs asynchronously, so always return 0. */
> + return 0;
> }
>
> static const struct super_operations btrfs_super_ops = {
^ permalink raw reply [flat|nested] 19+ messages in thread* Re: [PATCH 2/5] btrfs: make the extent map shrinker run asynchronously as a work queue job
2024-09-25 22:08 ` Qu Wenruo
@ 2024-09-26 9:01 ` Filipe Manana
2024-09-26 9:55 ` Qu Wenruo
0 siblings, 1 reply; 19+ messages in thread
From: Filipe Manana @ 2024-09-26 9:01 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs
On Wed, Sep 25, 2024 at 11:08 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> 在 2024/9/24 20:15, fdmanana@kernel.org 写道:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > Currently the extent map shrinker is run synchronously for kswapd tasks
> > that end up calling the fs shrinker (fs/super.c:super_cache_scan()).
> > This has some disadvantages and for some heavy workloads with memory
> > pressure it can cause some delays and stalls that make a machine
> > unresponsive for some periods. This happens because:
> >
> > 1) We can have several kswapd tasks on machines with multiple NUMA zones,
> > and running the extent map shrinker concurrently can cause high
> > contention on some spin locks, namely the spin locks that protect
> > the radix tree that tracks roots, the per root xarray that tracks
> > open inodes and the list of delayed iputs. This not only delays the
> > shrinker but also causes high CPU consumption and makes the task
> > running the shrinker monopolize a core, resulting in the symptoms
> > of an unresponsive system. This was noted in previous commits such as
> > commit ae1e766f623f ("btrfs: only run the extent map shrinker from
> > kswapd tasks");
> >
> > 2) The extent map shrinker's iteration over inodes can often be slow, even
> > after changing the data structure that tracks open inodes for a root
> > from a red black tree (up to kernel 6.10) to an xarray (kernel 6.10+).
> > The transition to the xarray while it made things a bit faster, it's
> > still somewhat slow - for example in a test scenario with 10000 inodes
> > that have no extent maps loaded, the extent map shrinker took between
> > 5ms to 8ms, using a release, non-debug kernel. Iterating over the
> > extent maps of an inode can also be slow if have an inode with many
> > thousands of extent maps, since we use a red black tree to track and
> > search extent maps. So having the extent map shrinker run synchronously
> > adds extra delay for other things a kswapd task does.
> >
> > So make the extent map shrinker run asynchronously as a job for the
> > system unbounded workqueue, just like what we do for data and metadata
> > space reclaim jobs.
> >
> > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> > ---
> > fs/btrfs/disk-io.c | 2 ++
> > fs/btrfs/extent_map.c | 51 ++++++++++++++++++++++++++++++++++++-------
> > fs/btrfs/extent_map.h | 3 ++-
> > fs/btrfs/fs.h | 2 ++
> > fs/btrfs/super.c | 13 +++--------
> > 5 files changed, 52 insertions(+), 19 deletions(-)
> >
> > diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> > index 25d768e67e37..2148147c5257 100644
> > --- a/fs/btrfs/disk-io.c
> > +++ b/fs/btrfs/disk-io.c
> > @@ -2786,6 +2786,7 @@ void btrfs_init_fs_info(struct btrfs_fs_info *fs_info)
> > btrfs_init_scrub(fs_info);
> > btrfs_init_balance(fs_info);
> > btrfs_init_async_reclaim_work(fs_info);
> > + btrfs_init_extent_map_shrinker_work(fs_info);
> >
> > rwlock_init(&fs_info->block_group_cache_lock);
> > fs_info->block_group_cache_tree = RB_ROOT_CACHED;
> > @@ -4283,6 +4284,7 @@ void __cold close_ctree(struct btrfs_fs_info *fs_info)
> > cancel_work_sync(&fs_info->async_reclaim_work);
> > cancel_work_sync(&fs_info->async_data_reclaim_work);
> > cancel_work_sync(&fs_info->preempt_reclaim_work);
> > + cancel_work_sync(&fs_info->extent_map_shrinker_work);
> >
> > /* Cancel or finish ongoing discard work */
> > btrfs_discard_cleanup(fs_info);
> > diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
> > index cb2a6f5dce2b..e2eeb94aa349 100644
> > --- a/fs/btrfs/extent_map.c
> > +++ b/fs/btrfs/extent_map.c
> > @@ -1118,7 +1118,8 @@ struct btrfs_em_shrink_ctx {
> >
> > static long btrfs_scan_inode(struct btrfs_inode *inode, struct btrfs_em_shrink_ctx *ctx)
> > {
> > - const u64 cur_fs_gen = btrfs_get_fs_generation(inode->root->fs_info);
> > + struct btrfs_fs_info *fs_info = inode->root->fs_info;
> > + const u64 cur_fs_gen = btrfs_get_fs_generation(fs_info);
> > struct extent_map_tree *tree = &inode->extent_tree;
> > long nr_dropped = 0;
> > struct rb_node *node;
> > @@ -1191,7 +1192,8 @@ static long btrfs_scan_inode(struct btrfs_inode *inode, struct btrfs_em_shrink_c
> > * lock. This is to avoid slowing other tasks trying to take the
> > * lock.
> > */
> > - if (need_resched() || rwlock_needbreak(&tree->lock))
> > + if (need_resched() || rwlock_needbreak(&tree->lock) ||
> > + btrfs_fs_closing(fs_info))
> > break;
> > node = next;
> > }
> > @@ -1215,7 +1217,8 @@ static long btrfs_scan_root(struct btrfs_root *root, struct btrfs_em_shrink_ctx
> > ctx->last_ino = btrfs_ino(inode);
> > btrfs_add_delayed_iput(inode);
> >
> > - if (ctx->scanned >= ctx->nr_to_scan)
> > + if (ctx->scanned >= ctx->nr_to_scan ||
> > + btrfs_fs_closing(inode->root->fs_info))
> > break;
> >
> > cond_resched();
> > @@ -1244,16 +1247,19 @@ static long btrfs_scan_root(struct btrfs_root *root, struct btrfs_em_shrink_ctx
> > return nr_dropped;
> > }
> >
> > -long btrfs_free_extent_maps(struct btrfs_fs_info *fs_info, long nr_to_scan)
> > +static void btrfs_extent_map_shrinker_worker(struct work_struct *work)
> > {
> > + struct btrfs_fs_info *fs_info;
> > struct btrfs_em_shrink_ctx ctx;
> > u64 start_root_id;
> > u64 next_root_id;
> > bool cycled = false;
> > long nr_dropped = 0;
> >
> > + fs_info = container_of(work, struct btrfs_fs_info, extent_map_shrinker_work);
> > +
> > ctx.scanned = 0;
> > - ctx.nr_to_scan = nr_to_scan;
> > + ctx.nr_to_scan = atomic64_read(&fs_info->extent_map_shrinker_nr_to_scan);
> >
> > /*
> > * In case we have multiple tasks running this shrinker, make the next
> > @@ -1271,12 +1277,12 @@ long btrfs_free_extent_maps(struct btrfs_fs_info *fs_info, long nr_to_scan)
> > if (trace_btrfs_extent_map_shrinker_scan_enter_enabled()) {
> > s64 nr = percpu_counter_sum_positive(&fs_info->evictable_extent_maps);
> >
> > - trace_btrfs_extent_map_shrinker_scan_enter(fs_info, nr_to_scan,
> > + trace_btrfs_extent_map_shrinker_scan_enter(fs_info, ctx.nr_to_scan,
> > nr, ctx.last_root,
> > ctx.last_ino);
> > }
> >
> > - while (ctx.scanned < ctx.nr_to_scan) {
> > + while (ctx.scanned < ctx.nr_to_scan && !btrfs_fs_closing(fs_info)) {
> > struct btrfs_root *root;
> > unsigned long count;
> >
> > @@ -1334,5 +1340,34 @@ long btrfs_free_extent_maps(struct btrfs_fs_info *fs_info, long nr_to_scan)
> > ctx.last_ino);
> > }
> >
> > - return nr_dropped;
> > + atomic64_set(&fs_info->extent_map_shrinker_nr_to_scan, 0);
> > +}
> > +
> > +void btrfs_free_extent_maps(struct btrfs_fs_info *fs_info, long nr_to_scan)
> > +{
> > + /*
> > + * Do nothing if the shrinker is already running. In case of high memory
> > + * pressure we can have a lot of tasks calling us and all passing the
> > + * same nr_to_scan value, but in reality we may need only to free
> > + * nr_to_scan extent maps (or less). In case we need to free more than
> > + * that, we will be called again by the fs shrinker, so no worries about
> > + * not doing enough work to reclaim memory from extent maps.
> > + * We can also be repeatedly called with the same nr_to_scan value
> > + * simply because the shrinker runs asynchronously and multiple calls
> > + * to this function are made before the shrinker does enough progress.
> > + *
> > + * That's why we set the atomic counter to nr_to_scan only if its
> > + * current value is zero, instead of incrementing the counter by
> > + * nr_to_scan.
> > + */
>
> Since the shrinker can be called frequently, even if we only keep one
> shrink work running, will the shrinker be kept running for a long time?
> (one queued work done, then immiedately be queued again)
What's the problem?
The use of the atomic and not incrementing it, as said in the comment,
makes sure we don't do more work than necessary.
It's also running in the system unbound queue and has plenty of
explicit reschedule calls to ensure it doesn't monopolize a cpu and
doesn't block other tasks for long.
So how can it cause any problem?
>
> The XFS is queuing the work with an delay, although their reason is that
> the reclaim needs to be run more frequently than syncd interval (30s).
>
> Do we also need some delay to prevent such too frequent reclaim work?
See the comment above.
Without the increment of the atomic counter, if it keeps getting
scheduled it's because we're under memory pressure and need to free
memory as quickly as possible.
Thanks.
>
> Thanks,
> Qu
>
> > + if (atomic64_cmpxchg(&fs_info->extent_map_shrinker_nr_to_scan, 0, nr_to_scan) != 0)
> > + return;
> > +
> > + queue_work(system_unbound_wq, &fs_info->extent_map_shrinker_work);
> > +}
> > +
> > +void btrfs_init_extent_map_shrinker_work(struct btrfs_fs_info *fs_info)
> > +{
> > + atomic64_set(&fs_info->extent_map_shrinker_nr_to_scan, 0);
> > + INIT_WORK(&fs_info->extent_map_shrinker_work, btrfs_extent_map_shrinker_worker);
> > }
> > diff --git a/fs/btrfs/extent_map.h b/fs/btrfs/extent_map.h
> > index 5154a8f1d26c..cd123b266b64 100644
> > --- a/fs/btrfs/extent_map.h
> > +++ b/fs/btrfs/extent_map.h
> > @@ -189,6 +189,7 @@ void btrfs_drop_extent_map_range(struct btrfs_inode *inode,
> > int btrfs_replace_extent_map_range(struct btrfs_inode *inode,
> > struct extent_map *new_em,
> > bool modified);
> > -long btrfs_free_extent_maps(struct btrfs_fs_info *fs_info, long nr_to_scan);
> > +void btrfs_free_extent_maps(struct btrfs_fs_info *fs_info, long nr_to_scan);
> > +void btrfs_init_extent_map_shrinker_work(struct btrfs_fs_info *fs_info);
> >
> > #endif
> > diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
> > index 785ec15c1b84..a246d8dc0b20 100644
> > --- a/fs/btrfs/fs.h
> > +++ b/fs/btrfs/fs.h
> > @@ -638,6 +638,8 @@ struct btrfs_fs_info {
> > spinlock_t extent_map_shrinker_lock;
> > u64 extent_map_shrinker_last_root;
> > u64 extent_map_shrinker_last_ino;
> > + atomic64_t extent_map_shrinker_nr_to_scan;
> > + struct work_struct extent_map_shrinker_work;
> >
> > /* Protected by 'trans_lock'. */
> > struct list_head dirty_cowonly_roots;
> > diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> > index e8a5bf4af918..e9e209dd8e05 100644
> > --- a/fs/btrfs/super.c
> > +++ b/fs/btrfs/super.c
> > @@ -28,7 +28,6 @@
> > #include <linux/btrfs.h>
> > #include <linux/security.h>
> > #include <linux/fs_parser.h>
> > -#include <linux/swap.h>
> > #include "messages.h"
> > #include "delayed-inode.h"
> > #include "ctree.h"
> > @@ -2416,16 +2415,10 @@ static long btrfs_free_cached_objects(struct super_block *sb, struct shrink_cont
> > const long nr_to_scan = min_t(unsigned long, LONG_MAX, sc->nr_to_scan);
> > struct btrfs_fs_info *fs_info = btrfs_sb(sb);
> >
> > - /*
> > - * We may be called from any task trying to allocate memory and we don't
> > - * want to slow it down with scanning and dropping extent maps. It would
> > - * also cause heavy lock contention if many tasks concurrently enter
> > - * here. Therefore only allow kswapd tasks to scan and drop extent maps.
> > - */
> > - if (!current_is_kswapd())
> > - return 0;
> > + btrfs_free_extent_maps(fs_info, nr_to_scan);
> >
> > - return btrfs_free_extent_maps(fs_info, nr_to_scan);
> > + /* The extent map shrinker runs asynchronously, so always return 0. */
> > + return 0;
> > }
> >
> > static const struct super_operations btrfs_super_ops = {
>
^ permalink raw reply [flat|nested] 19+ messages in thread* Re: [PATCH 2/5] btrfs: make the extent map shrinker run asynchronously as a work queue job
2024-09-26 9:01 ` Filipe Manana
@ 2024-09-26 9:55 ` Qu Wenruo
2024-09-26 10:41 ` Filipe Manana
0 siblings, 1 reply; 19+ messages in thread
From: Qu Wenruo @ 2024-09-26 9:55 UTC (permalink / raw)
To: Filipe Manana; +Cc: linux-btrfs
在 2024/9/26 18:31, Filipe Manana 写道:
> On Wed, Sep 25, 2024 at 11:08 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>
>>
>>
>> 在 2024/9/24 20:15, fdmanana@kernel.org 写道:
>>> From: Filipe Manana <fdmanana@suse.com>
>>>
>>> Currently the extent map shrinker is run synchronously for kswapd tasks
>>> that end up calling the fs shrinker (fs/super.c:super_cache_scan()).
>>> This has some disadvantages and for some heavy workloads with memory
>>> pressure it can cause some delays and stalls that make a machine
>>> unresponsive for some periods. This happens because:
>>>
>>> 1) We can have several kswapd tasks on machines with multiple NUMA zones,
>>> and running the extent map shrinker concurrently can cause high
>>> contention on some spin locks, namely the spin locks that protect
>>> the radix tree that tracks roots, the per root xarray that tracks
>>> open inodes and the list of delayed iputs. This not only delays the
>>> shrinker but also causes high CPU consumption and makes the task
>>> running the shrinker monopolize a core, resulting in the symptoms
>>> of an unresponsive system. This was noted in previous commits such as
>>> commit ae1e766f623f ("btrfs: only run the extent map shrinker from
>>> kswapd tasks");
>>>
>>> 2) The extent map shrinker's iteration over inodes can often be slow, even
>>> after changing the data structure that tracks open inodes for a root
>>> from a red black tree (up to kernel 6.10) to an xarray (kernel 6.10+).
>>> The transition to the xarray while it made things a bit faster, it's
>>> still somewhat slow - for example in a test scenario with 10000 inodes
>>> that have no extent maps loaded, the extent map shrinker took between
>>> 5ms to 8ms, using a release, non-debug kernel. Iterating over the
>>> extent maps of an inode can also be slow if have an inode with many
>>> thousands of extent maps, since we use a red black tree to track and
>>> search extent maps. So having the extent map shrinker run synchronously
>>> adds extra delay for other things a kswapd task does.
>>>
>>> So make the extent map shrinker run asynchronously as a job for the
>>> system unbounded workqueue, just like what we do for data and metadata
>>> space reclaim jobs.
>>>
>>> Signed-off-by: Filipe Manana <fdmanana@suse.com>
>>> ---
>>> fs/btrfs/disk-io.c | 2 ++
>>> fs/btrfs/extent_map.c | 51 ++++++++++++++++++++++++++++++++++++-------
>>> fs/btrfs/extent_map.h | 3 ++-
>>> fs/btrfs/fs.h | 2 ++
>>> fs/btrfs/super.c | 13 +++--------
>>> 5 files changed, 52 insertions(+), 19 deletions(-)
>>>
>>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>>> index 25d768e67e37..2148147c5257 100644
>>> --- a/fs/btrfs/disk-io.c
>>> +++ b/fs/btrfs/disk-io.c
>>> @@ -2786,6 +2786,7 @@ void btrfs_init_fs_info(struct btrfs_fs_info *fs_info)
>>> btrfs_init_scrub(fs_info);
>>> btrfs_init_balance(fs_info);
>>> btrfs_init_async_reclaim_work(fs_info);
>>> + btrfs_init_extent_map_shrinker_work(fs_info);
>>>
>>> rwlock_init(&fs_info->block_group_cache_lock);
>>> fs_info->block_group_cache_tree = RB_ROOT_CACHED;
>>> @@ -4283,6 +4284,7 @@ void __cold close_ctree(struct btrfs_fs_info *fs_info)
>>> cancel_work_sync(&fs_info->async_reclaim_work);
>>> cancel_work_sync(&fs_info->async_data_reclaim_work);
>>> cancel_work_sync(&fs_info->preempt_reclaim_work);
>>> + cancel_work_sync(&fs_info->extent_map_shrinker_work);
>>>
>>> /* Cancel or finish ongoing discard work */
>>> btrfs_discard_cleanup(fs_info);
>>> diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
>>> index cb2a6f5dce2b..e2eeb94aa349 100644
>>> --- a/fs/btrfs/extent_map.c
>>> +++ b/fs/btrfs/extent_map.c
>>> @@ -1118,7 +1118,8 @@ struct btrfs_em_shrink_ctx {
>>>
>>> static long btrfs_scan_inode(struct btrfs_inode *inode, struct btrfs_em_shrink_ctx *ctx)
>>> {
>>> - const u64 cur_fs_gen = btrfs_get_fs_generation(inode->root->fs_info);
>>> + struct btrfs_fs_info *fs_info = inode->root->fs_info;
>>> + const u64 cur_fs_gen = btrfs_get_fs_generation(fs_info);
>>> struct extent_map_tree *tree = &inode->extent_tree;
>>> long nr_dropped = 0;
>>> struct rb_node *node;
>>> @@ -1191,7 +1192,8 @@ static long btrfs_scan_inode(struct btrfs_inode *inode, struct btrfs_em_shrink_c
>>> * lock. This is to avoid slowing other tasks trying to take the
>>> * lock.
>>> */
>>> - if (need_resched() || rwlock_needbreak(&tree->lock))
>>> + if (need_resched() || rwlock_needbreak(&tree->lock) ||
>>> + btrfs_fs_closing(fs_info))
>>> break;
>>> node = next;
>>> }
>>> @@ -1215,7 +1217,8 @@ static long btrfs_scan_root(struct btrfs_root *root, struct btrfs_em_shrink_ctx
>>> ctx->last_ino = btrfs_ino(inode);
>>> btrfs_add_delayed_iput(inode);
>>>
>>> - if (ctx->scanned >= ctx->nr_to_scan)
>>> + if (ctx->scanned >= ctx->nr_to_scan ||
>>> + btrfs_fs_closing(inode->root->fs_info))
>>> break;
>>>
>>> cond_resched();
>>> @@ -1244,16 +1247,19 @@ static long btrfs_scan_root(struct btrfs_root *root, struct btrfs_em_shrink_ctx
>>> return nr_dropped;
>>> }
>>>
>>> -long btrfs_free_extent_maps(struct btrfs_fs_info *fs_info, long nr_to_scan)
>>> +static void btrfs_extent_map_shrinker_worker(struct work_struct *work)
>>> {
>>> + struct btrfs_fs_info *fs_info;
>>> struct btrfs_em_shrink_ctx ctx;
>>> u64 start_root_id;
>>> u64 next_root_id;
>>> bool cycled = false;
>>> long nr_dropped = 0;
>>>
>>> + fs_info = container_of(work, struct btrfs_fs_info, extent_map_shrinker_work);
>>> +
>>> ctx.scanned = 0;
>>> - ctx.nr_to_scan = nr_to_scan;
>>> + ctx.nr_to_scan = atomic64_read(&fs_info->extent_map_shrinker_nr_to_scan);
>>>
>>> /*
>>> * In case we have multiple tasks running this shrinker, make the next
>>> @@ -1271,12 +1277,12 @@ long btrfs_free_extent_maps(struct btrfs_fs_info *fs_info, long nr_to_scan)
>>> if (trace_btrfs_extent_map_shrinker_scan_enter_enabled()) {
>>> s64 nr = percpu_counter_sum_positive(&fs_info->evictable_extent_maps);
>>>
>>> - trace_btrfs_extent_map_shrinker_scan_enter(fs_info, nr_to_scan,
>>> + trace_btrfs_extent_map_shrinker_scan_enter(fs_info, ctx.nr_to_scan,
>>> nr, ctx.last_root,
>>> ctx.last_ino);
>>> }
>>>
>>> - while (ctx.scanned < ctx.nr_to_scan) {
>>> + while (ctx.scanned < ctx.nr_to_scan && !btrfs_fs_closing(fs_info)) {
>>> struct btrfs_root *root;
>>> unsigned long count;
>>>
>>> @@ -1334,5 +1340,34 @@ long btrfs_free_extent_maps(struct btrfs_fs_info *fs_info, long nr_to_scan)
>>> ctx.last_ino);
>>> }
>>>
>>> - return nr_dropped;
>>> + atomic64_set(&fs_info->extent_map_shrinker_nr_to_scan, 0);
>>> +}
>>> +
>>> +void btrfs_free_extent_maps(struct btrfs_fs_info *fs_info, long nr_to_scan)
>>> +{
>>> + /*
>>> + * Do nothing if the shrinker is already running. In case of high memory
>>> + * pressure we can have a lot of tasks calling us and all passing the
>>> + * same nr_to_scan value, but in reality we may need only to free
>>> + * nr_to_scan extent maps (or less). In case we need to free more than
>>> + * that, we will be called again by the fs shrinker, so no worries about
>>> + * not doing enough work to reclaim memory from extent maps.
>>> + * We can also be repeatedly called with the same nr_to_scan value
>>> + * simply because the shrinker runs asynchronously and multiple calls
>>> + * to this function are made before the shrinker does enough progress.
>>> + *
>>> + * That's why we set the atomic counter to nr_to_scan only if its
>>> + * current value is zero, instead of incrementing the counter by
>>> + * nr_to_scan.
>>> + */
>>
>> Since the shrinker can be called frequently, even if we only keep one
>> shrink work running, will the shrinker be kept running for a long time?
>> (one queued work done, then immiedately be queued again)
>
> What's the problem?
> The use of the atomic and not incrementing it, as said in the comment,
> makes sure we don't do more work than necessary.
>
> It's also running in the system unbound queue and has plenty of
> explicit reschedule calls to ensure it doesn't monopolize a cpu and
> doesn't block other tasks for long.
>
> So how can it cause any problem?
Then it will be a workqueue taking CPU 100% (or near that),
even if there is only one running work.
The first call queues X items to reclaim and the work gets queued; after
the X items are freed, the next call triggers, queuing another Y items to reclaim.
Then we get the same near-100% CPU usage; it may be rescheduled, but that
makes not much difference.
We will always have one reclaim work item running at any moment.
>
>>
>> The XFS is queuing the work with an delay, although their reason is that
>> the reclaim needs to be run more frequently than syncd interval (30s).
>>
>> Do we also need some delay to prevent such too frequent reclaim work?
>
> See the comment above.
>
> Without the increment of the atomic counter, if it keeps getting
> scheduled it's because we're under memory pressure and need to free
> memory as quickly as possible.
Even if the atomic stays the same until the current work finishes, we just
get a new number of items to reclaim next.
Furthermore, from our existing experience, we didn't really hit a
real-world case where the em cache is causing a huge problem, so the
reclaim for em should be a very low priority thing.
Thus I still believe a delayed work will be much safer, just like what
XFS is doing for decades, and also like our cleaner kthread.
Thanks,
Qu
>
> Thanks.
>
>>
>> Thanks,
>> Qu
>>
>>> + if (atomic64_cmpxchg(&fs_info->extent_map_shrinker_nr_to_scan, 0, nr_to_scan) != 0)
>>> + return;
>>> +
>>> + queue_work(system_unbound_wq, &fs_info->extent_map_shrinker_work);
>>> +}
>>> +
>>> +void btrfs_init_extent_map_shrinker_work(struct btrfs_fs_info *fs_info)
>>> +{
>>> + atomic64_set(&fs_info->extent_map_shrinker_nr_to_scan, 0);
>>> + INIT_WORK(&fs_info->extent_map_shrinker_work, btrfs_extent_map_shrinker_worker);
>>> }
>>> diff --git a/fs/btrfs/extent_map.h b/fs/btrfs/extent_map.h
>>> index 5154a8f1d26c..cd123b266b64 100644
>>> --- a/fs/btrfs/extent_map.h
>>> +++ b/fs/btrfs/extent_map.h
>>> @@ -189,6 +189,7 @@ void btrfs_drop_extent_map_range(struct btrfs_inode *inode,
>>> int btrfs_replace_extent_map_range(struct btrfs_inode *inode,
>>> struct extent_map *new_em,
>>> bool modified);
>>> -long btrfs_free_extent_maps(struct btrfs_fs_info *fs_info, long nr_to_scan);
>>> +void btrfs_free_extent_maps(struct btrfs_fs_info *fs_info, long nr_to_scan);
>>> +void btrfs_init_extent_map_shrinker_work(struct btrfs_fs_info *fs_info);
>>>
>>> #endif
>>> diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
>>> index 785ec15c1b84..a246d8dc0b20 100644
>>> --- a/fs/btrfs/fs.h
>>> +++ b/fs/btrfs/fs.h
>>> @@ -638,6 +638,8 @@ struct btrfs_fs_info {
>>> spinlock_t extent_map_shrinker_lock;
>>> u64 extent_map_shrinker_last_root;
>>> u64 extent_map_shrinker_last_ino;
>>> + atomic64_t extent_map_shrinker_nr_to_scan;
>>> + struct work_struct extent_map_shrinker_work;
>>>
>>> /* Protected by 'trans_lock'. */
>>> struct list_head dirty_cowonly_roots;
>>> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
>>> index e8a5bf4af918..e9e209dd8e05 100644
>>> --- a/fs/btrfs/super.c
>>> +++ b/fs/btrfs/super.c
>>> @@ -28,7 +28,6 @@
>>> #include <linux/btrfs.h>
>>> #include <linux/security.h>
>>> #include <linux/fs_parser.h>
>>> -#include <linux/swap.h>
>>> #include "messages.h"
>>> #include "delayed-inode.h"
>>> #include "ctree.h"
>>> @@ -2416,16 +2415,10 @@ static long btrfs_free_cached_objects(struct super_block *sb, struct shrink_cont
>>> const long nr_to_scan = min_t(unsigned long, LONG_MAX, sc->nr_to_scan);
>>> struct btrfs_fs_info *fs_info = btrfs_sb(sb);
>>>
>>> - /*
>>> - * We may be called from any task trying to allocate memory and we don't
>>> - * want to slow it down with scanning and dropping extent maps. It would
>>> - * also cause heavy lock contention if many tasks concurrently enter
>>> - * here. Therefore only allow kswapd tasks to scan and drop extent maps.
>>> - */
>>> - if (!current_is_kswapd())
>>> - return 0;
>>> + btrfs_free_extent_maps(fs_info, nr_to_scan);
>>>
>>> - return btrfs_free_extent_maps(fs_info, nr_to_scan);
>>> + /* The extent map shrinker runs asynchronously, so always return 0. */
>>> + return 0;
>>> }
>>>
>>> static const struct super_operations btrfs_super_ops = {
>>
^ permalink raw reply [flat|nested] 19+ messages in thread* Re: [PATCH 2/5] btrfs: make the extent map shrinker run asynchronously as a work queue job
2024-09-26 9:55 ` Qu Wenruo
@ 2024-09-26 10:41 ` Filipe Manana
2024-09-26 22:02 ` Qu Wenruo
0 siblings, 1 reply; 19+ messages in thread
From: Filipe Manana @ 2024-09-26 10:41 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs
On Thu, Sep 26, 2024 at 10:55 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> 在 2024/9/26 18:31, Filipe Manana 写道:
> > On Wed, Sep 25, 2024 at 11:08 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> >>
> >>
> >>
> >> 在 2024/9/24 20:15, fdmanana@kernel.org 写道:
> >>> From: Filipe Manana <fdmanana@suse.com>
> >>>
> >>> Currently the extent map shrinker is run synchronously for kswapd tasks
> >>> that end up calling the fs shrinker (fs/super.c:super_cache_scan()).
> >>> This has some disadvantages and for some heavy workloads with memory
> >>> pressure it can cause some delays and stalls that make a machine
> >>> unresponsive for some periods. This happens because:
> >>>
> >>> 1) We can have several kswapd tasks on machines with multiple NUMA zones,
> >>> and running the extent map shrinker concurrently can cause high
> >>> contention on some spin locks, namely the spin locks that protect
> >>> the radix tree that tracks roots, the per root xarray that tracks
> >>> open inodes and the list of delayed iputs. This not only delays the
> >>> shrinker but also causes high CPU consumption and makes the task
> >>> running the shrinker monopolize a core, resulting in the symptoms
> >>> of an unresponsive system. This was noted in previous commits such as
> >>> commit ae1e766f623f ("btrfs: only run the extent map shrinker from
> >>> kswapd tasks");
> >>>
> >>> 2) The extent map shrinker's iteration over inodes can often be slow, even
> >>> after changing the data structure that tracks open inodes for a root
> >>> from a red black tree (up to kernel 6.10) to an xarray (kernel 6.10+).
> >>> The transition to the xarray while it made things a bit faster, it's
> >>> still somewhat slow - for example in a test scenario with 10000 inodes
> >>> that have no extent maps loaded, the extent map shrinker took between
> >>> 5ms to 8ms, using a release, non-debug kernel. Iterating over the
> >>> extent maps of an inode can also be slow if have an inode with many
> >>> thousands of extent maps, since we use a red black tree to track and
> >>> search extent maps. So having the extent map shrinker run synchronously
> >>> adds extra delay for other things a kswapd task does.
> >>>
> >>> So make the extent map shrinker run asynchronously as a job for the
> >>> system unbounded workqueue, just like what we do for data and metadata
> >>> space reclaim jobs.
> >>>
> >>> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> >>> ---
> >>> fs/btrfs/disk-io.c | 2 ++
> >>> fs/btrfs/extent_map.c | 51 ++++++++++++++++++++++++++++++++++++-------
> >>> fs/btrfs/extent_map.h | 3 ++-
> >>> fs/btrfs/fs.h | 2 ++
> >>> fs/btrfs/super.c | 13 +++--------
> >>> 5 files changed, 52 insertions(+), 19 deletions(-)
> >>>
> >>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> >>> index 25d768e67e37..2148147c5257 100644
> >>> --- a/fs/btrfs/disk-io.c
> >>> +++ b/fs/btrfs/disk-io.c
> >>> @@ -2786,6 +2786,7 @@ void btrfs_init_fs_info(struct btrfs_fs_info *fs_info)
> >>> btrfs_init_scrub(fs_info);
> >>> btrfs_init_balance(fs_info);
> >>> btrfs_init_async_reclaim_work(fs_info);
> >>> + btrfs_init_extent_map_shrinker_work(fs_info);
> >>>
> >>> rwlock_init(&fs_info->block_group_cache_lock);
> >>> fs_info->block_group_cache_tree = RB_ROOT_CACHED;
> >>> @@ -4283,6 +4284,7 @@ void __cold close_ctree(struct btrfs_fs_info *fs_info)
> >>> cancel_work_sync(&fs_info->async_reclaim_work);
> >>> cancel_work_sync(&fs_info->async_data_reclaim_work);
> >>> cancel_work_sync(&fs_info->preempt_reclaim_work);
> >>> + cancel_work_sync(&fs_info->extent_map_shrinker_work);
> >>>
> >>> /* Cancel or finish ongoing discard work */
> >>> btrfs_discard_cleanup(fs_info);
> >>> diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
> >>> index cb2a6f5dce2b..e2eeb94aa349 100644
> >>> --- a/fs/btrfs/extent_map.c
> >>> +++ b/fs/btrfs/extent_map.c
> >>> @@ -1118,7 +1118,8 @@ struct btrfs_em_shrink_ctx {
> >>>
> >>> static long btrfs_scan_inode(struct btrfs_inode *inode, struct btrfs_em_shrink_ctx *ctx)
> >>> {
> >>> - const u64 cur_fs_gen = btrfs_get_fs_generation(inode->root->fs_info);
> >>> + struct btrfs_fs_info *fs_info = inode->root->fs_info;
> >>> + const u64 cur_fs_gen = btrfs_get_fs_generation(fs_info);
> >>> struct extent_map_tree *tree = &inode->extent_tree;
> >>> long nr_dropped = 0;
> >>> struct rb_node *node;
> >>> @@ -1191,7 +1192,8 @@ static long btrfs_scan_inode(struct btrfs_inode *inode, struct btrfs_em_shrink_c
> >>> * lock. This is to avoid slowing other tasks trying to take the
> >>> * lock.
> >>> */
> >>> - if (need_resched() || rwlock_needbreak(&tree->lock))
> >>> + if (need_resched() || rwlock_needbreak(&tree->lock) ||
> >>> + btrfs_fs_closing(fs_info))
> >>> break;
> >>> node = next;
> >>> }
> >>> @@ -1215,7 +1217,8 @@ static long btrfs_scan_root(struct btrfs_root *root, struct btrfs_em_shrink_ctx
> >>> ctx->last_ino = btrfs_ino(inode);
> >>> btrfs_add_delayed_iput(inode);
> >>>
> >>> - if (ctx->scanned >= ctx->nr_to_scan)
> >>> + if (ctx->scanned >= ctx->nr_to_scan ||
> >>> + btrfs_fs_closing(inode->root->fs_info))
> >>> break;
> >>>
> >>> cond_resched();
> >>> @@ -1244,16 +1247,19 @@ static long btrfs_scan_root(struct btrfs_root *root, struct btrfs_em_shrink_ctx
> >>> return nr_dropped;
> >>> }
> >>>
> >>> -long btrfs_free_extent_maps(struct btrfs_fs_info *fs_info, long nr_to_scan)
> >>> +static void btrfs_extent_map_shrinker_worker(struct work_struct *work)
> >>> {
> >>> + struct btrfs_fs_info *fs_info;
> >>> struct btrfs_em_shrink_ctx ctx;
> >>> u64 start_root_id;
> >>> u64 next_root_id;
> >>> bool cycled = false;
> >>> long nr_dropped = 0;
> >>>
> >>> + fs_info = container_of(work, struct btrfs_fs_info, extent_map_shrinker_work);
> >>> +
> >>> ctx.scanned = 0;
> >>> - ctx.nr_to_scan = nr_to_scan;
> >>> + ctx.nr_to_scan = atomic64_read(&fs_info->extent_map_shrinker_nr_to_scan);
> >>>
> >>> /*
> >>> * In case we have multiple tasks running this shrinker, make the next
> >>> @@ -1271,12 +1277,12 @@ long btrfs_free_extent_maps(struct btrfs_fs_info *fs_info, long nr_to_scan)
> >>> if (trace_btrfs_extent_map_shrinker_scan_enter_enabled()) {
> >>> s64 nr = percpu_counter_sum_positive(&fs_info->evictable_extent_maps);
> >>>
> >>> - trace_btrfs_extent_map_shrinker_scan_enter(fs_info, nr_to_scan,
> >>> + trace_btrfs_extent_map_shrinker_scan_enter(fs_info, ctx.nr_to_scan,
> >>> nr, ctx.last_root,
> >>> ctx.last_ino);
> >>> }
> >>>
> >>> - while (ctx.scanned < ctx.nr_to_scan) {
> >>> + while (ctx.scanned < ctx.nr_to_scan && !btrfs_fs_closing(fs_info)) {
> >>> struct btrfs_root *root;
> >>> unsigned long count;
> >>>
> >>> @@ -1334,5 +1340,34 @@ long btrfs_free_extent_maps(struct btrfs_fs_info *fs_info, long nr_to_scan)
> >>> ctx.last_ino);
> >>> }
> >>>
> >>> - return nr_dropped;
> >>> + atomic64_set(&fs_info->extent_map_shrinker_nr_to_scan, 0);
> >>> +}
> >>> +
> >>> +void btrfs_free_extent_maps(struct btrfs_fs_info *fs_info, long nr_to_scan)
> >>> +{
> >>> + /*
> >>> + * Do nothing if the shrinker is already running. In case of high memory
> >>> + * pressure we can have a lot of tasks calling us and all passing the
> >>> + * same nr_to_scan value, but in reality we may need only to free
> >>> + * nr_to_scan extent maps (or less). In case we need to free more than
> >>> + * that, we will be called again by the fs shrinker, so no worries about
> >>> + * not doing enough work to reclaim memory from extent maps.
> >>> + * We can also be repeatedly called with the same nr_to_scan value
> >>> + * simply because the shrinker runs asynchronously and multiple calls
> >>> + * to this function are made before the shrinker does enough progress.
> >>> + *
> >>> + * That's why we set the atomic counter to nr_to_scan only if its
> >>> + * current value is zero, instead of incrementing the counter by
> >>> + * nr_to_scan.
> >>> + */
> >>
> >> Since the shrinker can be called frequently, even if we only keep one
> >> shrink work running, will the shrinker be kept running for a long time?
> >> (one queued work done, then immiedately be queued again)
> >
> > What's the problem?
> > The use of the atomic and not incrementing it, as said in the comment,
> > makes sure we don't do more work than necessary.
> >
> > It's also running in the system unbound queue and has plenty of
> > explicit reschedule calls to ensure it doesn't monopolize a cpu and
> > doesn't block other tasks for long.
> >
> > So how can it cause any problem?
>
> Then it will be a workqueue taking CPU 100% (or near that).
> Even there is only one running work.
No it won't.
Besides the very frequent reschedule points, the shrinker is always
called with a small percentage of the total number of objects to free.
>
> The first one queued the X number to do, and the work got queued, after
> freed X items, the next call triggered, queuing another Y number to reclaim.
And? Work queue jobs are distributed across cores as needed, so that
work queue won't be monopolized with extent map shrinking.
>
> The we get the same near-100% CPU usage, it may be rescheduled, but not
> much difference.
> We will always have one reclaim work item running at any moment.
And if that happens it's because it's needed.
The same goes for space reclaim, it's exactly what we do for space
reclaim. We don't add delays there and neither for delayed iputs.
>
> >
> >>
> >> The XFS is queuing the work with an delay, although their reason is that
> >> the reclaim needs to be run more frequently than syncd interval (30s).
> >>
> >> Do we also need some delay to prevent such too frequent reclaim work?
> >
> > See the comment above.
> >
> > Without the increment of the atomic counter, if it keeps getting
> > scheduled it's because we're under memory pressure and need to free
> > memory as quickly as possible.
>
> Even the atomic stays the same until the current one finished, we just
> get a new number of items to reclaim next.
If we do get it's because we still need to free memory, and we have to do it.
>
> Furthermore, from our existing experience, we didn't really hit a
> realworld case where the em cache is causing a huge problem, so the
> relaim for em should be a very low priority thing.
We had 1 person reporting it. And now that the problem is publicly
known, it can be exploited by someone with bad intentions - no root
privileges are required.
>
> Thus I still believe a delayed work will be much safer, just like what
> XFS is doing for decades, and also like our cleaner kthread.
Not a specialist on XFS, and maybe they have their reasons for doing
what they do.
But the case with our cleaner kthread is very different:
1) First for delayed iputs that delay doesn't exist.
When doing a delayed iput we always wake up the cleaner if it's not
currently running.
We want them to run asap to prevent inode eviction from happening and
holding memory and space reservations for too long.
2) Otherwise the delay and sleep is both because it's a kthread we
manually manage and because deletion of dead roots is IO heavy.
Extent map shrinking doesn't do any IO. Any concern about CPU I've
already addressed above.
>
> Thanks,
> Qu
>
> >
> > Thanks.
> >
> >>
> >> Thanks,
> >> Qu
> >>
> >>> + if (atomic64_cmpxchg(&fs_info->extent_map_shrinker_nr_to_scan, 0, nr_to_scan) != 0)
> >>> + return;
> >>> +
> >>> + queue_work(system_unbound_wq, &fs_info->extent_map_shrinker_work);
> >>> +}
> >>> +
> >>> +void btrfs_init_extent_map_shrinker_work(struct btrfs_fs_info *fs_info)
> >>> +{
> >>> + atomic64_set(&fs_info->extent_map_shrinker_nr_to_scan, 0);
> >>> + INIT_WORK(&fs_info->extent_map_shrinker_work, btrfs_extent_map_shrinker_worker);
> >>> }
> >>> diff --git a/fs/btrfs/extent_map.h b/fs/btrfs/extent_map.h
> >>> index 5154a8f1d26c..cd123b266b64 100644
> >>> --- a/fs/btrfs/extent_map.h
> >>> +++ b/fs/btrfs/extent_map.h
> >>> @@ -189,6 +189,7 @@ void btrfs_drop_extent_map_range(struct btrfs_inode *inode,
> >>> int btrfs_replace_extent_map_range(struct btrfs_inode *inode,
> >>> struct extent_map *new_em,
> >>> bool modified);
> >>> -long btrfs_free_extent_maps(struct btrfs_fs_info *fs_info, long nr_to_scan);
> >>> +void btrfs_free_extent_maps(struct btrfs_fs_info *fs_info, long nr_to_scan);
> >>> +void btrfs_init_extent_map_shrinker_work(struct btrfs_fs_info *fs_info);
> >>>
> >>> #endif
> >>> diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
> >>> index 785ec15c1b84..a246d8dc0b20 100644
> >>> --- a/fs/btrfs/fs.h
> >>> +++ b/fs/btrfs/fs.h
> >>> @@ -638,6 +638,8 @@ struct btrfs_fs_info {
> >>> spinlock_t extent_map_shrinker_lock;
> >>> u64 extent_map_shrinker_last_root;
> >>> u64 extent_map_shrinker_last_ino;
> >>> + atomic64_t extent_map_shrinker_nr_to_scan;
> >>> + struct work_struct extent_map_shrinker_work;
> >>>
> >>> /* Protected by 'trans_lock'. */
> >>> struct list_head dirty_cowonly_roots;
> >>> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> >>> index e8a5bf4af918..e9e209dd8e05 100644
> >>> --- a/fs/btrfs/super.c
> >>> +++ b/fs/btrfs/super.c
> >>> @@ -28,7 +28,6 @@
> >>> #include <linux/btrfs.h>
> >>> #include <linux/security.h>
> >>> #include <linux/fs_parser.h>
> >>> -#include <linux/swap.h>
> >>> #include "messages.h"
> >>> #include "delayed-inode.h"
> >>> #include "ctree.h"
> >>> @@ -2416,16 +2415,10 @@ static long btrfs_free_cached_objects(struct super_block *sb, struct shrink_cont
> >>> const long nr_to_scan = min_t(unsigned long, LONG_MAX, sc->nr_to_scan);
> >>> struct btrfs_fs_info *fs_info = btrfs_sb(sb);
> >>>
> >>> - /*
> >>> - * We may be called from any task trying to allocate memory and we don't
> >>> - * want to slow it down with scanning and dropping extent maps. It would
> >>> - * also cause heavy lock contention if many tasks concurrently enter
> >>> - * here. Therefore only allow kswapd tasks to scan and drop extent maps.
> >>> - */
> >>> - if (!current_is_kswapd())
> >>> - return 0;
> >>> + btrfs_free_extent_maps(fs_info, nr_to_scan);
> >>>
> >>> - return btrfs_free_extent_maps(fs_info, nr_to_scan);
> >>> + /* The extent map shrinker runs asynchronously, so always return 0. */
> >>> + return 0;
> >>> }
> >>>
> >>> static const struct super_operations btrfs_super_ops = {
> >>
>
^ permalink raw reply [flat|nested] 19+ messages in thread* Re: [PATCH 2/5] btrfs: make the extent map shrinker run asynchronously as a work queue job
2024-09-26 10:41 ` Filipe Manana
@ 2024-09-26 22:02 ` Qu Wenruo
2024-09-27 8:10 ` Filipe Manana
0 siblings, 1 reply; 19+ messages in thread
From: Qu Wenruo @ 2024-09-26 22:02 UTC (permalink / raw)
To: Filipe Manana; +Cc: linux-btrfs
在 2024/9/26 20:11, Filipe Manana 写道:
> On Thu, Sep 26, 2024 at 10:55 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
[...]
>>> What's the problem?
>>> The use of the atomic and not incrementing it, as said in the comment,
>>> makes sure we don't do more work than necessary.
>>>
>>> It's also running in the system unbound queue and has plenty of
>>> explicit reschedule calls to ensure it doesn't monopolize a cpu and
>>> doesn't block other tasks for long.
>>>
>>> So how can it cause any problem?
>>
>> Then it will be a workqueue taking CPU 100% (or near that).
>> Even there is only one running work.
>
> No it won't.
> Besides the very frequent reschedule points, the shrinker is always
> called with a small percentage of the total number of objects to free.
And there is no guarantee that a small number to reclaim is not heavy.
The latency of a scan is not directly related to the passed-in number,
but to the amount of roots/inodes/ems.
E.g. we have tens of thousands of inodes to go through. Even most of
them have no extent maps, it may still take milliseconds to go through.
And if we always have a small number of ems pinned for IO, then even after the
current reclaim done, the next reclaim work will still get some small
number queued to reclaim, again takes several milliseconds to reclaim
may be a few extent maps, then IO creates new ems bumping up the em
counts again.
>
>>
>> The first one queued the X number to do, and the work got queued, after
>> freed X items, the next call triggered, queuing another Y number to reclaim.
>
> And? Work queue jobs are distributed across cores as needed, so that
> work queue won't be monopolized with extent map shrinking.
Scheduled across different CPUs won't make any difference if the reclaim
workload is queued and run again and again.
Let me be clear, reschedule and core distribution are not changing the
nature of a long running CPU heavy workload.
Just take gcc as an example, it's a user space program, which can always
be preempted, and kernel can always reschedule the user space program to
whatever CPU core.
But it's still a long running CPU heavy workload.
The same is for the em shrinker, the difference is the em shrinker is
just woken up again and again, rather than a long running one.
>
>>
>> The we get the same near-100% CPU usage, it may be rescheduled, but not
>> much difference.
>> We will always have one reclaim work item running at any moment.
>
> And if that happens it's because it's needed.
> The same goes for space reclaim, it's exactly what we do for space
> reclaim. We don't add delays there and neither for delayed iputs.
Unlike delayed inodes, em shrinker needs to go through all roots and all
the inodes if we didn't free enough ems.
While delayed inodes have a simple list of all the delayed inodes, not
really much scanning.
And just as you mentioned in the changelog, the scanning itself can be
the real problem.
>
>>
>>>
>>>>
>>>> The XFS is queuing the work with an delay, although their reason is that
>>>> the reclaim needs to be run more frequently than syncd interval (30s).
>>>>
>>>> Do we also need some delay to prevent such too frequent reclaim work?
>>>
>>> See the comment above.
>>>
>>> Without the increment of the atomic counter, if it keeps getting
>>> scheduled it's because we're under memory pressure and need to free
>>> memory as quickly as possible.
>>
>> Even the atomic stays the same until the current one finished, we just
>> get a new number of items to reclaim next.
>
> If we do get it's because we still need to free memory, and we have to do it.
>
>>
>> Furthermore, from our existing experience, we didn't really hit a
>> realworld case where the em cache is causing a huge problem, so the
>> relaim for em should be a very low priority thing.
>
> We had 1 person reporting it. And now that the problem is publicly
> known, it can be exploited by someone with bad intentions - no root
> privileges are required.
I do not see how delayed reclaim will cause new problem in that case.
Yes, it will not reclaim those ems fast enough, but it's still doing
proper reclaim.
In fact we have more end users affected by the too aggressive shrinker
behavior than not reclaiming those ems.
And it will be a slap in the face if we move the feature into
experimental, then move it out and have to move it back again.
So I'm very cautious on any possibility that this shrinker can be
triggered without any extra wait.
>
>>
>> Thus I still believe a delayed work will be much safer, just like what
>> XFS is doing for decades, and also like our cleaner kthread.
>
> Not a specialist on XFS, and maybe they have their reasons for doing
> what they do.
>
> But the case with our cleaner kthread is very different:
>
> 1) First for delayed iputs that delay doesn't exist.
> When doing a delayed iput we always wake up the cleaner if it's not
> currently running.
> We want them to run asap to prevent inode eviction from happening and
> holding memory and space reservations for too long.
But the delayed iput doesn't need to scan through all the roots/inodes.
Thanks,
Qu
>
> 2) Otherwise the delay and sleep is both because it's a kthread we
> manually manage and because deletion of dead roots is IO heavy.
>
> Extent map shrinking doesn't do any IO. Any concern about CPU I've
> already addressed above.
>
>
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 2/5] btrfs: make the extent map shrinker run asynchronously as a work queue job
2024-09-26 22:02 ` Qu Wenruo
@ 2024-09-27 8:10 ` Filipe Manana
0 siblings, 0 replies; 19+ messages in thread
From: Filipe Manana @ 2024-09-27 8:10 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs
On Thu, Sep 26, 2024 at 11:02 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> 在 2024/9/26 20:11, Filipe Manana 写道:
> > On Thu, Sep 26, 2024 at 10:55 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> [...]
> >>> What's the problem?
> >>> The use of the atomic and not incrementing it, as said in the comment,
> >>> makes sure we don't do more work than necessary.
> >>>
> >>> It's also running in the system unbound queue and has plenty of
> >>> explicit reschedule calls to ensure it doesn't monopolize a cpu and
> >>> doesn't block other tasks for long.
> >>>
> >>> So how can it cause any problem?
> >>
> >> Then it will be a workqueue taking CPU 100% (or near that).
> >> Even there is only one running work.
> >
> > No it won't.
> > Besides the very frequent reschedule points, the shrinker is always
> > called with a small percentage of the total number of objects to free.
>
> And there is no guarantee that a small number to reclaim is not heavy.
>
> The latency of a scan is not directly related to the passed-in number,
> but the amount of roots/inodes/ems.
>
> E.g. we have tens of thousands of inodes to go through. Even most of
> them have no extent maps, it may still take milliseconds to go through.
You're taking what I've written in the changelog as if I'm not aware of it.
Yes I talk about several milliseconds for 10000 inodes without any
extent maps, on a very busy system and the
times measured were done with bpftrace which include reschedule times,
so accounting for periods the shrinker was not running.
The point of running it asynchronously in the system unbounded queue
is precisely so that in case it takes a long time it doesn't affect
other tasks.
>
> And if we always have a small ems pinned for IO, so even after the
Small ems? How does the size of an extent map matter here?
> current reclaim done, the next reclaim work will still get some small
> number queued to reclaim, again takes several milliseconds to reclaim
And that's precisely why it's run asynchronously in the system
unbounded workqueue, so that no matter how long it takes, it doesn't
affect (delays) other tasks.
> may be a few extent maps, then IO creates new ems bumping up the em
> counts again.
And adding the delay will solve any of that?
Please explain how.
>
> >
> >>
> >> The first one queued the X number to do, and the work got queued, after
> >> freed X items, the next call triggered, queuing another Y number to reclaim.
> >
> > And? Work queue jobs are distributed across cores as needed, so that
> > work queue won't be monopolized with extent map shrinking.
>
> Scheduled across different CPUs won't make any difference if the reclaim
> workload is queued and run again and again.
>
> Let me be clear, reschedule and core distribution are not changing the
> nature of a long running CPU heavy workload.
And how does your delay make that better?
> Just take gcc as an example, it's a user space program, which can always
> be preempted, and kernel can always reschedule the user space program to
> whatever CPU core.
> But it's still a long running CPU heavy workload.
>
> The same is for the em shrinker, the difference is the em shrinker is
> just waked up again and again, other than a long running one.
If needed, yes, that's why the shrinker (at a higher level) calls into
the fs shrinker for a small portion of objects at a time.
>
> >
> >>
> >> The we get the same near-100% CPU usage, it may be rescheduled, but not
> >> much difference.
> >> We will always have one reclaim work item running at any moment.
> >
> > And if that happens it's because it's needed.
> > The same goes for space reclaim, it's exactly what we do for space
> > reclaim. We don't add delays there and neither for delayed iputs.
>
> Unlike delayed inodes, em shrinker needs to go through all roots and all
> the inodes if we didn't free enough ems.
For a portion of the total extent maps.
Every time it's called it visits a portion of the total extent maps, a
few roots, a few inodes, sometimes just 1 inode and 1 root.
>
> While delayed inodes has a simple list of all the delayed inodes, not
> really much scanning.
The time to move from one element to the next is shorter, yes, but if
you have many thousands of elements it's still time consuming.
>
> And just as you mentioned in the changelog, the scanning itself can be
> the real problem.
Yes, and that's why it is being done by a single task in the system
unbounded queue, so that no matter how slow or fast it is, it doesn't
cause delay for other tasks.
The problem was that it ran synchronously, through memory allocation
paths (and kswapd), and that time introduced delays and made any
concurrent calls spend a lot of time busy on spin locks, preventing
other tasks from being scheduled and therefore unresponsive.
>
> >
> >>
> >>>
> >>>>
> >>>> The XFS is queuing the work with an delay, although their reason is that
> >>>> the reclaim needs to be run more frequently than syncd interval (30s).
> >>>>
> >>>> Do we also need some delay to prevent such too frequent reclaim work?
> >>>
> >>> See the comment above.
> >>>
> >>> Without the increment of the atomic counter, if it keeps getting
> >>> scheduled it's because we're under memory pressure and need to free
> >>> memory as quickly as possible.
> >>
> >> Even the atomic stays the same until the current one finished, we just
> >> get a new number of items to reclaim next.
> >
> > If we do get it's because we still need to free memory, and we have to do it.
> >
> >>
> >> Furthermore, from our existing experience, we didn't really hit a
> >> realworld case where the em cache is causing a huge problem, so the
> >> relaim for em should be a very low priority thing.
> >
> > We had 1 person reporting it. And now that the problem is publicly
> > known, it can be exploited by someone with bad intentions - no root
> > privileges are required.
>
> I do not see how delayed reclaim will cause new problem in that case.
> Yes, it will not reclaim those ems fast enough, but it's still doing
> proper reclaim.
If it's not fast enough it can often have consequences such as
allocations failing anywhere or taking longer for example, or making
the OOM killer start killing processes.
>
> In fact we have more end users affected by the too aggressive shrinker
> behavior than not reclaiming those ems.
Yes, when it ran synchronously, before this patchset.
I don't think you understand why it was aggressive.
This was already explained in the change log and comments, but let me
rephrase it:
1) Multiple tasks, during memory allocations or kswapd, end up
concurrently triggering the fs shrinker which enters the extent map
shrinker, let's say 100 tasks;
2) There's for example 1 000 000 extent maps, and the shrinker decides
for example to try to free 1000 extent maps (the nr_to_scan argument);
3) The decision for each calling task was made before any other task
was able to make any progress (or very little) dropping extent maps;
4) So while 1000 extent maps may have been enough to release memory,
we end up releasing (or trying to) 100 * 1000 extent maps
concurrently, causing the heavy spin lock waits and CPU usage.
Making it run asynchronously in the system unbound queue, ensuring
there's only one queued work, and using the atomic without adding
(just swapping if it's 0), solves all that.
In the example above it ensures only 1 task is doing the extent map
shrinking and for at most 1000 extent maps, and not 100 * 1000.
>
>
> And it will be a slap in the face if we move the feature into
> experimental, then move it out and have to move it back again.
Regressions and problems happen often, yes, but we always move forward
by analysing things and fixing them.
I'm sure you've gone through the same in the past.
>
> So I'm very cautious on any possibility that this shrinker can be
> triggered without any extra wait.
I'm very cautious as well, and have been very responsive helping users
ASAP, even spending too often personal time like weekends, evenings,
and holidays.
The last report happened at the only time of the year where I was away
on vacation and had no possibility of having access to a computer and
doing anything, and right at the last 6.10-rc before 6.10 final.
So please don't assume or insinuate I don't care enough. I do care, and a lot.
>
> >
> >>
> >> Thus I still believe a delayed work will be much safer, just like what
> >> XFS is doing for decades, and also like our cleaner kthread.
> >
> > Not a specialist on XFS, and maybe they have their reasons for doing
> > what they do.
> >
> > But the case with our cleaner kthread is very different:
> >
> > 1) First for delayed iputs that delay doesn't exist.
> > When doing a delayed iput we always wake up the cleaner if it's not
> > currently running.
> > We want them to run asap to prevent inode eviction from happening and
> > holding memory and space reservations for too long.
>
> But the delayed iput doesn't need to scan through all the roots/inodes.
And neither does the extent map shrinker, it goes through a small
portion at a time, in a dedicated task run by the system unbounded
queue.
And running delayed iputs still takes time going through a list. If
there are always delayed iputs being added, there's constant work.
Yet in this case you're not worried about adding a delay.
>
> Thanks,
> Qu
>
> >
> > 2) Otherwise the delay and sleep is both because it's a kthread we
> > manually manage and because deletion of dead roots is IO heavy.
> >
> > Extent map shrinking doesn't do any IO. Any concern about CPU I've
> > already addressed above.
> >
> >
>
^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH 3/5] btrfs: simplify tracking progress for the extent map shrinker
2024-09-24 10:45 [PATCH 0/5] btrfs: make extent map shrinker more efficient and re-enable it fdmanana
2024-09-24 10:45 ` [PATCH 1/5] btrfs: add and use helper to remove extent map from its inode's tree fdmanana
2024-09-24 10:45 ` [PATCH 2/5] btrfs: make the extent map shrinker run asynchronously as a work queue job fdmanana
@ 2024-09-24 10:45 ` fdmanana
2024-09-24 10:45 ` [PATCH 4/5] btrfs: rename extent map shrinker members from struct btrfs_fs_info fdmanana
` (3 subsequent siblings)
6 siblings, 0 replies; 19+ messages in thread
From: fdmanana @ 2024-09-24 10:45 UTC (permalink / raw)
To: linux-btrfs
From: Filipe Manana <fdmanana@suse.com>
Now that the extent map shrinker can only be run by a single task (as a
work queue item) there is no need to keep the progress of the shrinker
protected by a spinlock and passing the progress to trace events as
parameters. So remove the lock and simplify the arguments for the trace
events.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
fs/btrfs/disk-io.c | 2 --
fs/btrfs/extent_map.c | 55 ++++++++----------------------------
fs/btrfs/fs.h | 1 -
include/trace/events/btrfs.h | 21 +++++++-------
4 files changed, 22 insertions(+), 57 deletions(-)
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 2148147c5257..85c6b14cbf76 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2853,8 +2853,6 @@ static int init_mount_fs_info(struct btrfs_fs_info *fs_info, struct super_block
if (ret)
return ret;
- spin_lock_init(&fs_info->extent_map_shrinker_lock);
-
ret = percpu_counter_init(&fs_info->dirty_metadata_bytes, 0, GFP_KERNEL);
if (ret)
return ret;
diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
index e2eeb94aa349..767f0804f504 100644
--- a/fs/btrfs/extent_map.c
+++ b/fs/btrfs/extent_map.c
@@ -1112,8 +1112,6 @@ int split_extent_map(struct btrfs_inode *inode, u64 start, u64 len, u64 pre,
struct btrfs_em_shrink_ctx {
long nr_to_scan;
long scanned;
- u64 last_ino;
- u64 last_root;
};
static long btrfs_scan_inode(struct btrfs_inode *inode, struct btrfs_em_shrink_ctx *ctx)
@@ -1205,16 +1203,17 @@ static long btrfs_scan_inode(struct btrfs_inode *inode, struct btrfs_em_shrink_c
static long btrfs_scan_root(struct btrfs_root *root, struct btrfs_em_shrink_ctx *ctx)
{
+ struct btrfs_fs_info *fs_info = root->fs_info;
struct btrfs_inode *inode;
long nr_dropped = 0;
- u64 min_ino = ctx->last_ino + 1;
+ u64 min_ino = fs_info->extent_map_shrinker_last_ino + 1;
inode = btrfs_find_first_inode(root, min_ino);
while (inode) {
nr_dropped += btrfs_scan_inode(inode, ctx);
min_ino = btrfs_ino(inode) + 1;
- ctx->last_ino = btrfs_ino(inode);
+ fs_info->extent_map_shrinker_last_ino = btrfs_ino(inode);
btrfs_add_delayed_iput(inode);
if (ctx->scanned >= ctx->nr_to_scan ||
@@ -1234,14 +1233,14 @@ static long btrfs_scan_root(struct btrfs_root *root, struct btrfs_em_shrink_ctx
* inode if there is one or we will find out this was the last
* one and move to the next root.
*/
- ctx->last_root = btrfs_root_id(root);
+ fs_info->extent_map_shrinker_last_root = btrfs_root_id(root);
} else {
/*
* No more inodes in this root, set extent_map_shrinker_last_ino to 0 so
* that when processing the next root we start from its first inode.
*/
- ctx->last_ino = 0;
- ctx->last_root = btrfs_root_id(root) + 1;
+ fs_info->extent_map_shrinker_last_ino = 0;
+ fs_info->extent_map_shrinker_last_root = btrfs_root_id(root) + 1;
}
return nr_dropped;
@@ -1261,25 +1260,13 @@ static void btrfs_extent_map_shrinker_worker(struct work_struct *work)
ctx.scanned = 0;
ctx.nr_to_scan = atomic64_read(&fs_info->extent_map_shrinker_nr_to_scan);
- /*
- * In case we have multiple tasks running this shrinker, make the next
- * one start from the next inode in case it starts before we finish.
- */
- spin_lock(&fs_info->extent_map_shrinker_lock);
- ctx.last_ino = fs_info->extent_map_shrinker_last_ino;
- fs_info->extent_map_shrinker_last_ino++;
- ctx.last_root = fs_info->extent_map_shrinker_last_root;
- spin_unlock(&fs_info->extent_map_shrinker_lock);
-
- start_root_id = ctx.last_root;
- next_root_id = ctx.last_root;
+ start_root_id = fs_info->extent_map_shrinker_last_root;
+ next_root_id = fs_info->extent_map_shrinker_last_root;
if (trace_btrfs_extent_map_shrinker_scan_enter_enabled()) {
s64 nr = percpu_counter_sum_positive(&fs_info->evictable_extent_maps);
- trace_btrfs_extent_map_shrinker_scan_enter(fs_info, ctx.nr_to_scan,
- nr, ctx.last_root,
- ctx.last_ino);
+ trace_btrfs_extent_map_shrinker_scan_enter(fs_info, nr);
}
while (ctx.scanned < ctx.nr_to_scan && !btrfs_fs_closing(fs_info)) {
@@ -1296,8 +1283,8 @@ static void btrfs_extent_map_shrinker_worker(struct work_struct *work)
spin_unlock(&fs_info->fs_roots_radix_lock);
if (start_root_id > 0 && !cycled) {
next_root_id = 0;
- ctx.last_root = 0;
- ctx.last_ino = 0;
+ fs_info->extent_map_shrinker_last_root = 0;
+ fs_info->extent_map_shrinker_last_ino = 0;
cycled = true;
continue;
}
@@ -1316,28 +1303,10 @@ static void btrfs_extent_map_shrinker_worker(struct work_struct *work)
btrfs_put_root(root);
}
- /*
- * In case of multiple tasks running this extent map shrinking code this
- * isn't perfect but it's simple and silences things like KCSAN. It's
- * not possible to know which task made more progress because we can
- * cycle back to the first root and first inode if it's not the first
- * time the shrinker ran, see the above logic. Also a task that started
- * later may finish ealier than another task and made less progress. So
- * make this simple and update to the progress of the last task that
- * finished, with the occasional possiblity of having two consecutive
- * runs of the shrinker process the same inodes.
- */
- spin_lock(&fs_info->extent_map_shrinker_lock);
- fs_info->extent_map_shrinker_last_ino = ctx.last_ino;
- fs_info->extent_map_shrinker_last_root = ctx.last_root;
- spin_unlock(&fs_info->extent_map_shrinker_lock);
-
if (trace_btrfs_extent_map_shrinker_scan_exit_enabled()) {
s64 nr = percpu_counter_sum_positive(&fs_info->evictable_extent_maps);
- trace_btrfs_extent_map_shrinker_scan_exit(fs_info, nr_dropped,
- nr, ctx.last_root,
- ctx.last_ino);
+ trace_btrfs_extent_map_shrinker_scan_exit(fs_info, nr_dropped, nr);
}
atomic64_set(&fs_info->extent_map_shrinker_nr_to_scan, 0);
diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
index a246d8dc0b20..6639e873b8db 100644
--- a/fs/btrfs/fs.h
+++ b/fs/btrfs/fs.h
@@ -635,7 +635,6 @@ struct btrfs_fs_info {
s32 delalloc_batch;
struct percpu_counter evictable_extent_maps;
- spinlock_t extent_map_shrinker_lock;
u64 extent_map_shrinker_last_root;
u64 extent_map_shrinker_last_ino;
atomic64_t extent_map_shrinker_nr_to_scan;
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index bf60ad50011e..957f3a2b31d4 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -2553,10 +2553,9 @@ TRACE_EVENT(btrfs_extent_map_shrinker_count,
TRACE_EVENT(btrfs_extent_map_shrinker_scan_enter,
- TP_PROTO(const struct btrfs_fs_info *fs_info, long nr_to_scan, long nr,
- u64 last_root_id, u64 last_ino),
+ TP_PROTO(const struct btrfs_fs_info *fs_info, long nr),
- TP_ARGS(fs_info, nr_to_scan, nr, last_root_id, last_ino),
+ TP_ARGS(fs_info, nr),
TP_STRUCT__entry_btrfs(
__field( long, nr_to_scan )
@@ -2566,10 +2565,11 @@ TRACE_EVENT(btrfs_extent_map_shrinker_scan_enter,
),
TP_fast_assign_btrfs(fs_info,
- __entry->nr_to_scan = nr_to_scan;
+ __entry->nr_to_scan = \
+ atomic64_read(&fs_info->extent_map_shrinker_nr_to_scan);
__entry->nr = nr;
- __entry->last_root_id = last_root_id;
- __entry->last_ino = last_ino;
+ __entry->last_root_id = fs_info->extent_map_shrinker_last_root;
+ __entry->last_ino = fs_info->extent_map_shrinker_last_ino;
),
TP_printk_btrfs("nr_to_scan=%ld nr=%ld last_root=%llu(%s) last_ino=%llu",
@@ -2579,10 +2579,9 @@ TRACE_EVENT(btrfs_extent_map_shrinker_scan_enter,
TRACE_EVENT(btrfs_extent_map_shrinker_scan_exit,
- TP_PROTO(const struct btrfs_fs_info *fs_info, long nr_dropped, long nr,
- u64 last_root_id, u64 last_ino),
+ TP_PROTO(const struct btrfs_fs_info *fs_info, long nr_dropped, long nr),
- TP_ARGS(fs_info, nr_dropped, nr, last_root_id, last_ino),
+ TP_ARGS(fs_info, nr_dropped, nr),
TP_STRUCT__entry_btrfs(
__field( long, nr_dropped )
@@ -2594,8 +2593,8 @@ TRACE_EVENT(btrfs_extent_map_shrinker_scan_exit,
TP_fast_assign_btrfs(fs_info,
__entry->nr_dropped = nr_dropped;
__entry->nr = nr;
- __entry->last_root_id = last_root_id;
- __entry->last_ino = last_ino;
+ __entry->last_root_id = fs_info->extent_map_shrinker_last_root;
+ __entry->last_ino = fs_info->extent_map_shrinker_last_ino;
),
TP_printk_btrfs("nr_dropped=%ld nr=%ld last_root=%llu(%s) last_ino=%llu",
--
2.43.0
^ permalink raw reply related [flat|nested] 19+ messages in thread* [PATCH 4/5] btrfs: rename extent map shrinker members from struct btrfs_fs_info
2024-09-24 10:45 [PATCH 0/5] btrfs: make extent map shrinker more efficient and re-enable it fdmanana
` (2 preceding siblings ...)
2024-09-24 10:45 ` [PATCH 3/5] btrfs: simplify tracking progress for the extent map shrinker fdmanana
@ 2024-09-24 10:45 ` fdmanana
2024-09-24 10:45 ` [PATCH 5/5] btrfs: re-enable the extent map shrinker fdmanana
` (2 subsequent siblings)
6 siblings, 0 replies; 19+ messages in thread
From: fdmanana @ 2024-09-24 10:45 UTC (permalink / raw)
To: linux-btrfs
From: Filipe Manana <fdmanana@suse.com>
The names for the members of struct btrfs_fs_info related to the extent
map shrinker are a bit too long, so rename them to be shorter by replacing
the "extent_map_" prefix with the "em_" prefix.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
fs/btrfs/disk-io.c | 2 +-
fs/btrfs/extent_map.c | 32 ++++++++++++++++----------------
fs/btrfs/fs.h | 8 ++++----
include/trace/events/btrfs.h | 10 +++++-----
4 files changed, 26 insertions(+), 26 deletions(-)
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 85c6b14cbf76..2d8053b39203 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -4282,7 +4282,7 @@ void __cold close_ctree(struct btrfs_fs_info *fs_info)
cancel_work_sync(&fs_info->async_reclaim_work);
cancel_work_sync(&fs_info->async_data_reclaim_work);
cancel_work_sync(&fs_info->preempt_reclaim_work);
- cancel_work_sync(&fs_info->extent_map_shrinker_work);
+ cancel_work_sync(&fs_info->em_shrinker_work);
/* Cancel or finish ongoing discard work */
btrfs_discard_cleanup(fs_info);
diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
index 767f0804f504..d58d6fc40da1 100644
--- a/fs/btrfs/extent_map.c
+++ b/fs/btrfs/extent_map.c
@@ -1206,14 +1206,14 @@ static long btrfs_scan_root(struct btrfs_root *root, struct btrfs_em_shrink_ctx
struct btrfs_fs_info *fs_info = root->fs_info;
struct btrfs_inode *inode;
long nr_dropped = 0;
- u64 min_ino = fs_info->extent_map_shrinker_last_ino + 1;
+ u64 min_ino = fs_info->em_shrinker_last_ino + 1;
inode = btrfs_find_first_inode(root, min_ino);
while (inode) {
nr_dropped += btrfs_scan_inode(inode, ctx);
min_ino = btrfs_ino(inode) + 1;
- fs_info->extent_map_shrinker_last_ino = btrfs_ino(inode);
+ fs_info->em_shrinker_last_ino = btrfs_ino(inode);
btrfs_add_delayed_iput(inode);
if (ctx->scanned >= ctx->nr_to_scan ||
@@ -1233,14 +1233,14 @@ static long btrfs_scan_root(struct btrfs_root *root, struct btrfs_em_shrink_ctx
* inode if there is one or we will find out this was the last
* one and move to the next root.
*/
- fs_info->extent_map_shrinker_last_root = btrfs_root_id(root);
+ fs_info->em_shrinker_last_root = btrfs_root_id(root);
} else {
/*
* No more inodes in this root, set extent_map_shrinker_last_ino to 0 so
* that when processing the next root we start from its first inode.
*/
- fs_info->extent_map_shrinker_last_ino = 0;
- fs_info->extent_map_shrinker_last_root = btrfs_root_id(root) + 1;
+ fs_info->em_shrinker_last_ino = 0;
+ fs_info->em_shrinker_last_root = btrfs_root_id(root) + 1;
}
return nr_dropped;
@@ -1255,13 +1255,13 @@ static void btrfs_extent_map_shrinker_worker(struct work_struct *work)
bool cycled = false;
long nr_dropped = 0;
- fs_info = container_of(work, struct btrfs_fs_info, extent_map_shrinker_work);
+ fs_info = container_of(work, struct btrfs_fs_info, em_shrinker_work);
ctx.scanned = 0;
- ctx.nr_to_scan = atomic64_read(&fs_info->extent_map_shrinker_nr_to_scan);
+ ctx.nr_to_scan = atomic64_read(&fs_info->em_shrinker_nr_to_scan);
- start_root_id = fs_info->extent_map_shrinker_last_root;
- next_root_id = fs_info->extent_map_shrinker_last_root;
+ start_root_id = fs_info->em_shrinker_last_root;
+ next_root_id = fs_info->em_shrinker_last_root;
if (trace_btrfs_extent_map_shrinker_scan_enter_enabled()) {
s64 nr = percpu_counter_sum_positive(&fs_info->evictable_extent_maps);
@@ -1283,8 +1283,8 @@ static void btrfs_extent_map_shrinker_worker(struct work_struct *work)
spin_unlock(&fs_info->fs_roots_radix_lock);
if (start_root_id > 0 && !cycled) {
next_root_id = 0;
- fs_info->extent_map_shrinker_last_root = 0;
- fs_info->extent_map_shrinker_last_ino = 0;
+ fs_info->em_shrinker_last_root = 0;
+ fs_info->em_shrinker_last_ino = 0;
cycled = true;
continue;
}
@@ -1309,7 +1309,7 @@ static void btrfs_extent_map_shrinker_worker(struct work_struct *work)
trace_btrfs_extent_map_shrinker_scan_exit(fs_info, nr_dropped, nr);
}
- atomic64_set(&fs_info->extent_map_shrinker_nr_to_scan, 0);
+ atomic64_set(&fs_info->em_shrinker_nr_to_scan, 0);
}
void btrfs_free_extent_maps(struct btrfs_fs_info *fs_info, long nr_to_scan)
@@ -1329,14 +1329,14 @@ void btrfs_free_extent_maps(struct btrfs_fs_info *fs_info, long nr_to_scan)
* current value is zero, instead of incrementing the counter by
* nr_to_scan.
*/
- if (atomic64_cmpxchg(&fs_info->extent_map_shrinker_nr_to_scan, 0, nr_to_scan) != 0)
+ if (atomic64_cmpxchg(&fs_info->em_shrinker_nr_to_scan, 0, nr_to_scan) != 0)
return;
- queue_work(system_unbound_wq, &fs_info->extent_map_shrinker_work);
+ queue_work(system_unbound_wq, &fs_info->em_shrinker_work);
}
void btrfs_init_extent_map_shrinker_work(struct btrfs_fs_info *fs_info)
{
- atomic64_set(&fs_info->extent_map_shrinker_nr_to_scan, 0);
- INIT_WORK(&fs_info->extent_map_shrinker_work, btrfs_extent_map_shrinker_worker);
+ atomic64_set(&fs_info->em_shrinker_nr_to_scan, 0);
+ INIT_WORK(&fs_info->em_shrinker_work, btrfs_extent_map_shrinker_worker);
}
diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
index 6639e873b8db..b64d97759e86 100644
--- a/fs/btrfs/fs.h
+++ b/fs/btrfs/fs.h
@@ -635,10 +635,10 @@ struct btrfs_fs_info {
s32 delalloc_batch;
struct percpu_counter evictable_extent_maps;
- u64 extent_map_shrinker_last_root;
- u64 extent_map_shrinker_last_ino;
- atomic64_t extent_map_shrinker_nr_to_scan;
- struct work_struct extent_map_shrinker_work;
+ u64 em_shrinker_last_root;
+ u64 em_shrinker_last_ino;
+ atomic64_t em_shrinker_nr_to_scan;
+ struct work_struct em_shrinker_work;
/* Protected by 'trans_lock'. */
struct list_head dirty_cowonly_roots;
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index 957f3a2b31d4..f85bf421a6ae 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -2566,10 +2566,10 @@ TRACE_EVENT(btrfs_extent_map_shrinker_scan_enter,
TP_fast_assign_btrfs(fs_info,
__entry->nr_to_scan = \
- atomic64_read(&fs_info->extent_map_shrinker_nr_to_scan);
+ atomic64_read(&fs_info->em_shrinker_nr_to_scan);
__entry->nr = nr;
- __entry->last_root_id = fs_info->extent_map_shrinker_last_root;
- __entry->last_ino = fs_info->extent_map_shrinker_last_ino;
+ __entry->last_root_id = fs_info->em_shrinker_last_root;
+ __entry->last_ino = fs_info->em_shrinker_last_ino;
),
TP_printk_btrfs("nr_to_scan=%ld nr=%ld last_root=%llu(%s) last_ino=%llu",
@@ -2593,8 +2593,8 @@ TRACE_EVENT(btrfs_extent_map_shrinker_scan_exit,
TP_fast_assign_btrfs(fs_info,
__entry->nr_dropped = nr_dropped;
__entry->nr = nr;
- __entry->last_root_id = fs_info->extent_map_shrinker_last_root;
- __entry->last_ino = fs_info->extent_map_shrinker_last_ino;
+ __entry->last_root_id = fs_info->em_shrinker_last_root;
+ __entry->last_ino = fs_info->em_shrinker_last_ino;
),
TP_printk_btrfs("nr_dropped=%ld nr=%ld last_root=%llu(%s) last_ino=%llu",
--
2.43.0
^ permalink raw reply related [flat|nested] 19+ messages in thread* [PATCH 5/5] btrfs: re-enable the extent map shrinker
2024-09-24 10:45 [PATCH 0/5] btrfs: make extent map shrinker more efficient and re-enable it fdmanana
` (3 preceding siblings ...)
2024-09-24 10:45 ` [PATCH 4/5] btrfs: rename extent map shrinker members from struct btrfs_fs_info fdmanana
@ 2024-09-24 10:45 ` fdmanana
2024-09-26 10:00 ` Qu Wenruo
2024-09-25 22:11 ` [PATCH 0/5] btrfs: make extent map shrinker more efficient and re-enable it Qu Wenruo
2024-10-10 14:48 ` Josef Bacik
6 siblings, 1 reply; 19+ messages in thread
From: fdmanana @ 2024-09-24 10:45 UTC (permalink / raw)
To: linux-btrfs
From: Filipe Manana <fdmanana@suse.com>
Now that the extent map shrinker can only be run by a single task and runs
asynchronously as a work queue job, enable it as it can no longer cause
stalls on tasks allocating memory and entering the extent map shrinker
through the fs shrinker (implemented by btrfs_free_cached_objects()).
This is crucial to prevent exhaustion of memory due to unbounded extent
map creation, primarily with direct IO but also for buffered IO on files
with holes. This problem, for the direct IO case, was first reported in
the Link tag below. That report was added to a Link tag of the first patch
that introduced the extent map shrinker, commit 956a17d9d050 ("btrfs: add
a shrinker for extent maps"), however the Link tag disappeared somehow
from the committed patch (but was included in the submitted patch to the
mailing list), so adding it below for future reference.
Link: https://lore.kernel.org/linux-btrfs/13f94633dcf04d29aaf1f0a43d42c55e@amazon.com/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
fs/btrfs/super.c | 8 +-------
1 file changed, 1 insertion(+), 7 deletions(-)
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index e9e209dd8e05..7e20b5e8386c 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -2401,13 +2401,7 @@ static long btrfs_nr_cached_objects(struct super_block *sb, struct shrink_contro
trace_btrfs_extent_map_shrinker_count(fs_info, nr);
- /*
- * Only report the real number for EXPERIMENTAL builds, as there are
- * reports of serious performance degradation caused by too frequent shrinks.
- */
- if (IS_ENABLED(CONFIG_BTRFS_EXPERIMENTAL))
- return nr;
- return 0;
+ return nr;
}
static long btrfs_free_cached_objects(struct super_block *sb, struct shrink_control *sc)
--
2.43.0
^ permalink raw reply related [flat|nested] 19+ messages in thread* Re: [PATCH 5/5] btrfs: re-enable the extent map shrinker
2024-09-24 10:45 ` [PATCH 5/5] btrfs: re-enable the extent map shrinker fdmanana
@ 2024-09-26 10:00 ` Qu Wenruo
2024-09-26 10:52 ` Filipe Manana
0 siblings, 1 reply; 19+ messages in thread
From: Qu Wenruo @ 2024-09-26 10:00 UTC (permalink / raw)
To: fdmanana, linux-btrfs
在 2024/9/24 20:15, fdmanana@kernel.org 写道:
> From: Filipe Manana <fdmanana@suse.com>
>
> Now that the extent map shrinker can only be run by a single task and runs
> asynchronously as a work queue job, enable it as it can no longer cause
> stalls on tasks allocating memory and entering the extent map shrinker
> through the fs shrinker (implemented by btrfs_free_cached_objects()).
>
> This is crucial to prevent exhaustion of memory due to unbounded extent
> map creation, primarily with direct IO but also for buffered IO on files
> with holes. This problem, for the direct IO case, was first reported in
> the Link tag below. That report was added to a Link tag of the first patch
> that introduced the extent map shrinker, commit 956a17d9d050 ("btrfs: add
> a shrinker for extent maps"), however the Link tag disappeared somehow
> from the committed patch (but was included in the submitted patch to the
> mailing list), so adding it below for future reference.
>
> Link: https://lore.kernel.org/linux-btrfs/13f94633dcf04d29aaf1f0a43d42c55e@amazon.com/
Forgot to mention, I'd prefer the enablement patch to be merged in a
later release cycle, not at the same time inside the series.
We need more tests, especially from the original reporters, and that's
why we have EXPERIMENTAL flags.
Thanks,
Qu
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
> fs/btrfs/super.c | 8 +-------
> 1 file changed, 1 insertion(+), 7 deletions(-)
>
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index e9e209dd8e05..7e20b5e8386c 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -2401,13 +2401,7 @@ static long btrfs_nr_cached_objects(struct super_block *sb, struct shrink_contro
>
> trace_btrfs_extent_map_shrinker_count(fs_info, nr);
>
> - /*
> - * Only report the real number for EXPERIMENTAL builds, as there are
> - * reports of serious performance degradation caused by too frequent shrinks.
> - */
> - if (IS_ENABLED(CONFIG_BTRFS_EXPERIMENTAL))
> - return nr;
> - return 0;
> + return nr;
> }
>
> static long btrfs_free_cached_objects(struct super_block *sb, struct shrink_control *sc)
^ permalink raw reply [flat|nested] 19+ messages in thread* Re: [PATCH 5/5] btrfs: re-enable the extent map shrinker
2024-09-26 10:00 ` Qu Wenruo
@ 2024-09-26 10:52 ` Filipe Manana
2024-09-26 11:18 ` Filipe Manana
0 siblings, 1 reply; 19+ messages in thread
From: Filipe Manana @ 2024-09-26 10:52 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs
On Thu, Sep 26, 2024 at 11:00 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> 在 2024/9/24 20:15, fdmanana@kernel.org 写道:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > Now that the extent map shrinker can only be run by a single task and runs
> > asynchronously as a work queue job, enable it as it can no longer cause
> > stalls on tasks allocating memory and entering the extent map shrinker
> > through the fs shrinker (implemented by btrfs_free_cached_objects()).
> >
> > This is crucial to prevent exhaustion of memory due to unbounded extent
> > map creation, primarily with direct IO but also for buffered IO on files
> > with holes. This problem, for the direct IO case, was first reported in
> > the Link tag below. That report was added to a Link tag of the first patch
> > that introduced the extent map shrinker, commit 956a17d9d050 ("btrfs: add
> > a shrinker for extent maps"), however the Link tag disappeared somehow
> > from the committed patch (but was included in the submitted patch to the
> > mailing list), so adding it below for future reference.
> >
> > Link: https://lore.kernel.org/linux-btrfs/13f94633dcf04d29aaf1f0a43d42c55e@amazon.com/
>
> Forgot to mention, I'd prefer the enablement patch to be merged in a
> later release cycle, not at the same time inside the series.
>
> We need more tests, especially from the original reporters, and that's
> why we have EXPERIMENTAL flags.
Sure I can ping them and see if they have the availability to test and report.
But expecting every single user to be able to test may not be possible.
But I don't think we ever had to have explicit ack from users to move
things out of experimental, especially for things that don't affect
disk format changes or incompatibility issues, for things that are
fully transparent.
>
> Thanks,
> Qu
>
> > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> > ---
> > fs/btrfs/super.c | 8 +-------
> > 1 file changed, 1 insertion(+), 7 deletions(-)
> >
> > diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> > index e9e209dd8e05..7e20b5e8386c 100644
> > --- a/fs/btrfs/super.c
> > +++ b/fs/btrfs/super.c
> > @@ -2401,13 +2401,7 @@ static long btrfs_nr_cached_objects(struct super_block *sb, struct shrink_contro
> >
> > trace_btrfs_extent_map_shrinker_count(fs_info, nr);
> >
> > - /*
> > - * Only report the real number for EXPERIMENTAL builds, as there are
> > - * reports of serious performance degradation caused by too frequent shrinks.
> > - */
> > - if (IS_ENABLED(CONFIG_BTRFS_EXPERIMENTAL))
> > - return nr;
> > - return 0;
> > + return nr;
> > }
> >
> > static long btrfs_free_cached_objects(struct super_block *sb, struct shrink_control *sc)
>
^ permalink raw reply [flat|nested] 19+ messages in thread* Re: [PATCH 5/5] btrfs: re-enable the extent map shrinker
2024-09-26 10:52 ` Filipe Manana
@ 2024-09-26 11:18 ` Filipe Manana
0 siblings, 0 replies; 19+ messages in thread
From: Filipe Manana @ 2024-09-26 11:18 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs
On Thu, Sep 26, 2024 at 11:52 AM Filipe Manana <fdmanana@kernel.org> wrote:
>
> On Thu, Sep 26, 2024 at 11:00 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> >
> >
> >
> > 在 2024/9/24 20:15, fdmanana@kernel.org 写道:
> > > From: Filipe Manana <fdmanana@suse.com>
> > >
> > > Now that the extent map shrinker can only be run by a single task and runs
> > > asynchronously as a work queue job, enable it as it can no longer cause
> > > stalls on tasks allocating memory and entering the extent map shrinker
> > > through the fs shrinker (implemented by btrfs_free_cached_objects()).
> > >
> > > This is crucial to prevent exhaustion of memory due to unbounded extent
> > > map creation, primarily with direct IO but also for buffered IO on files
> > > with holes. This problem, for the direct IO case, was first reported in
> > > the Link tag below. That report was added to a Link tag of the first patch
> > > that introduced the extent map shrinker, commit 956a17d9d050 ("btrfs: add
> > > a shrinker for extent maps"), however the Link tag disappeared somehow
> > > from the committed patch (but was included in the submitted patch to the
> > > mailing list), so adding it below for future reference.
> > >
> > > Link: https://lore.kernel.org/linux-btrfs/13f94633dcf04d29aaf1f0a43d42c55e@amazon.com/
> >
> > Forgot to mention, I'd prefer the enablement patch to be merged in a
> > later release cycle, not at the same time inside the series.
> >
> > We need more tests, especially from the original reporters, and that's
> > why we have EXPERIMENTAL flags.
>
> Sure I can ping them and see if they have the availability to test and report.
> But expecting every single user to be able to test may not be possible.
>
> But I don't think we ever had to have explicit ack from users to move
> things out of experimental, especially for things that don't affect
> disk format changes or incompatibility issues, for things that are
> fully transparent.
Further while this is under the experimental flags (previously debug),
it's not a feature but a bug fix (functional and security). It just
happened that while I was on vacation it was placed in that
category just to disable it, instead of commenting out code or other
alternatives.
>
> >
> > Thanks,
> > Qu
> >
> > > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> > > ---
> > > fs/btrfs/super.c | 8 +-------
> > > 1 file changed, 1 insertion(+), 7 deletions(-)
> > >
> > > diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> > > index e9e209dd8e05..7e20b5e8386c 100644
> > > --- a/fs/btrfs/super.c
> > > +++ b/fs/btrfs/super.c
> > > @@ -2401,13 +2401,7 @@ static long btrfs_nr_cached_objects(struct super_block *sb, struct shrink_contro
> > >
> > > trace_btrfs_extent_map_shrinker_count(fs_info, nr);
> > >
> > > - /*
> > > - * Only report the real number for EXPERIMENTAL builds, as there are
> > > - * reports of serious performance degradation caused by too frequent shrinks.
> > > - */
> > > - if (IS_ENABLED(CONFIG_BTRFS_EXPERIMENTAL))
> > > - return nr;
> > > - return 0;
> > > + return nr;
> > > }
> > >
> > > static long btrfs_free_cached_objects(struct super_block *sb, struct shrink_control *sc)
> >
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 0/5] btrfs: make extent map shrinker more efficient and re-enable it
2024-09-24 10:45 [PATCH 0/5] btrfs: make extent map shrinker more efficient and re-enable it fdmanana
` (4 preceding siblings ...)
2024-09-24 10:45 ` [PATCH 5/5] btrfs: re-enable the extent map shrinker fdmanana
@ 2024-09-25 22:11 ` Qu Wenruo
2024-10-10 14:48 ` Josef Bacik
6 siblings, 0 replies; 19+ messages in thread
From: Qu Wenruo @ 2024-09-25 22:11 UTC (permalink / raw)
To: fdmanana, linux-btrfs
在 2024/9/24 20:15, fdmanana@kernel.org 写道:
> From: Filipe Manana <fdmanana@suse.com>
>
> This makes the extent map shrinker run by a single task and asynchronously
> in a work queue, re-enables it by default and some cleanups. More details
> in the changelogs of each patch.
>
> Filipe Manana (5):
> btrfs: add and use helper to remove extent map from its inode's tree
> btrfs: make the extent map shrinker run asynchronously as a work queue job
I only have a small concern on whether we should introduce a delay to
the reclaim work.
The remaining patches all look good to me.
Thanks,
Qu
> btrfs: simplify tracking progress for the extent map shrinker
> btrfs: rename extent map shrinker members from struct btrfs_fs_info
> btrfs: re-enable the extent map shrinker
>
> fs/btrfs/disk-io.c | 4 +-
> fs/btrfs/extent_map.c | 122 +++++++++++++++++------------------
> fs/btrfs/extent_map.h | 3 +-
> fs/btrfs/fs.h | 7 +-
> fs/btrfs/super.c | 21 ++----
> include/trace/events/btrfs.h | 21 +++---
> 6 files changed, 83 insertions(+), 95 deletions(-)
>
^ permalink raw reply [flat|nested] 19+ messages in thread* Re: [PATCH 0/5] btrfs: make extent map shrinker more efficient and re-enable it
2024-09-24 10:45 [PATCH 0/5] btrfs: make extent map shrinker more efficient and re-enable it fdmanana
` (5 preceding siblings ...)
2024-09-25 22:11 ` [PATCH 0/5] btrfs: make extent map shrinker more efficient and re-enable it Qu Wenruo
@ 2024-10-10 14:48 ` Josef Bacik
2024-10-10 16:49 ` Filipe Manana
6 siblings, 1 reply; 19+ messages in thread
From: Josef Bacik @ 2024-10-10 14:48 UTC (permalink / raw)
To: fdmanana; +Cc: linux-btrfs
On Tue, Sep 24, 2024 at 11:45:40AM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> This makes the extent map shrinker run by a single task and asynchronously
> in a work queue, re-enables it by default and some cleanups. More details
> in the changelogs of each patch.
>
> Filipe Manana (5):
> btrfs: add and use helper to remove extent map from its inode's tree
> btrfs: make the extent map shrinker run asynchronously as a work queue job
> btrfs: simplify tracking progress for the extent map shrinker
> btrfs: rename extent map shrinker members from struct btrfs_fs_info
> btrfs: re-enable the extent map shrinker
>
I think as a first step this is good.
I am concerned that async shrinking under heavy memory pressure will
lead to spurious OOM's because we're never waiting for the async
shrinker to do work. I think that a next step would be to investigate
that possibility, and possibly use the nr_to_scan or some other
mechanism to figure out that we're getting called multiple times and
then wait for the shrinker to make progress. This will keep the VM from
deciding we aren't making progress and OOM'ing.
That being said this series looks good to me, you can add
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
to the whole thing. Thanks,
Josef
^ permalink raw reply [flat|nested] 19+ messages in thread* Re: [PATCH 0/5] btrfs: make extent map shrinker more efficient and re-enable it
2024-10-10 14:48 ` Josef Bacik
@ 2024-10-10 16:49 ` Filipe Manana
0 siblings, 0 replies; 19+ messages in thread
From: Filipe Manana @ 2024-10-10 16:49 UTC (permalink / raw)
To: Josef Bacik; +Cc: linux-btrfs
On Thu, Oct 10, 2024 at 3:48 PM Josef Bacik <josef@toxicpanda.com> wrote:
>
> On Tue, Sep 24, 2024 at 11:45:40AM +0100, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > This makes the extent map shrinker run by a single task and asynchronously
> > in a work queue, re-enables it by default and some cleanups. More details
> > in the changelogs of each patch.
> >
> > Filipe Manana (5):
> > btrfs: add and use helper to remove extent map from its inode's tree
> > btrfs: make the extent map shrinker run asynchronously as a work queue job
> > btrfs: simplify tracking progress for the extent map shrinker
> > btrfs: rename extent map shrinker members from struct btrfs_fs_info
> > btrfs: re-enable the extent map shrinker
> >
>
> I think as a first step this is good.
>
> I am concerned that async shrinking under heavy memory pressure will
> lead to spurious OOM's because we're never waiting for the async
> shrinker to do work. I think that a next step would be to investigate
> that possibility, and possibly use the nr_to_scan or some other
> mechanism to figure out that we're getting called multiple times and
> then wait for the shrinker to make progress. This will keep the VM from
> deciding we aren't making progress and OOM'ing.
Right.
The problem is how to decide when to wait and if the waiting will
introduce latencies - maybe wait only if the current task is a kswapd
task.
If you have concrete ideas, I'm all for it.
Thanks.
>
> That being said this series looks good to me, you can add
>
> Reviewed-by: Josef Bacik <josef@toxicpanda.com>
>
> to the whole thing. Thanks,
>
> Josef
^ permalink raw reply [flat|nested] 19+ messages in thread