* [RFC PATCH] bcache: reduce gc latency by processing fewer nodes and sleeping less
@ 2025-08-28 16:16 colyli
From: colyli @ 2025-08-28 16:16 UTC (permalink / raw)
To: linux-bcache; +Cc: Coly Li, Coly Li, Robert Pang, Mingzhe Zou
From: Coly Li <colyli@suse.de>
When the bcache device is busy with high I/O load, there are two methods
to reduce the garbage collection latency,
- Process fewer nodes in each loop of incremental garbage collection in
  btree_gc_recurse().
- Sleep less time between two incremental garbage collection loops in
  bch_btree_gc().
This patch introduces two helper routines to provide different garbage
collection node counts and sleep interval times,
- btree_gc_min_nodes()
  If there is no front-end I/O, return 1/128 of the total btree nodes
  (at least 10) to process in each incremental loop, otherwise only 10
  nodes are returned. Then front-end I/O is able to access the btree
  earlier.
- btree_gc_sleep_ms()
  If there is no synchronized wait for bucket allocation, sleep 100 ms
  between two incremental GC loops. Otherwise only sleep 10 ms before
  the next incremental GC loop. Then a faster GC may provide available
  buckets earlier, to avoid most of the bcache working threads being
  starved by bucket allocation.
The idea is inspired by work from Mingzhe Zou and Robert Pang, but it is
much simpler and its expected behavior is more predictable.
Signed-off-by: Coly Li <colyli@fnnas.com>
Cc: Robert Pang <robertpang@google.com>
Cc: Mingzhe Zou <mingzhe.zou@easystack.cn>
---
drivers/md/bcache/alloc.c | 4 ++++
drivers/md/bcache/bcache.h | 1 +
drivers/md/bcache/btree.c | 47 +++++++++++++++++++-------------------
3 files changed, 28 insertions(+), 24 deletions(-)
diff --git a/drivers/md/bcache/alloc.c b/drivers/md/bcache/alloc.c
index 48ce750bf70a..db519e1678c2 100644
--- a/drivers/md/bcache/alloc.c
+++ b/drivers/md/bcache/alloc.c
@@ -412,7 +412,11 @@ long bch_bucket_alloc(struct cache *ca, unsigned int reserve, bool wait)
TASK_UNINTERRUPTIBLE);
mutex_unlock(&ca->set->bucket_lock);
+
+ atomic_inc(&ca->set->bucket_wait_cnt);
schedule();
+ atomic_dec(&ca->set->bucket_wait_cnt);
+
mutex_lock(&ca->set->bucket_lock);
} while (!fifo_pop(&ca->free[RESERVE_NONE], r) &&
!fifo_pop(&ca->free[reserve], r));
diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 1d33e40d26ea..d43fcccf297c 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -607,6 +607,7 @@ struct cache_set {
*/
atomic_t prio_blocked;
wait_queue_head_t bucket_wait;
+ atomic_t bucket_wait_cnt;
/*
* For any bio we don't skip we subtract the number of sectors from
diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 210b59007d98..f79a229d5728 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -89,8 +89,9 @@
* Test module load/unload
*/
-#define MAX_GC_TIMES 100
-#define MIN_GC_NODES 100
+#define MAX_GC_TIMES_SHIFT 7 /* 128 loops */
+#define GC_NODES_MIN 10
+#define GC_SLEEP_MS_MIN 10
#define GC_SLEEP_MS 100
#define PTR_DIRTY_BIT (((uint64_t) 1 << 36))
@@ -1578,29 +1579,28 @@ static unsigned int btree_gc_count_keys(struct btree *b)
static size_t btree_gc_min_nodes(struct cache_set *c)
{
- size_t min_nodes;
+ size_t min_nodes = GC_NODES_MIN;
- /*
- * Since incremental GC would stop 100ms when front
- * side I/O comes, so when there are many btree nodes,
- * if GC only processes constant (100) nodes each time,
- * GC would last a long time, and the front side I/Os
- * would run out of the buckets (since no new bucket
- * can be allocated during GC), and be blocked again.
- * So GC should not process constant nodes, but varied
- * nodes according to the number of btree nodes, which
- * realized by dividing GC into constant(100) times,
- * so when there are many btree nodes, GC can process
- * more nodes each time, otherwise, GC will process less
- * nodes each time (but no less than MIN_GC_NODES)
- */
- min_nodes = c->gc_stats.nodes / MAX_GC_TIMES;
- if (min_nodes < MIN_GC_NODES)
- min_nodes = MIN_GC_NODES;
+ if (atomic_read(&c->search_inflight) == 0) {
+ size_t n = c->gc_stats.nodes >> MAX_GC_TIMES_SHIFT;
+ if (min_nodes < n)
+ min_nodes = n;
+ }
return min_nodes;
}
+static uint64_t btree_gc_sleep_ms(struct cache_set *c)
+{
+ uint64_t sleep_ms;
+
+ if (atomic_read(&c->bucket_wait_cnt) > 0)
+ sleep_ms = GC_SLEEP_MS_MIN;
+ else
+ sleep_ms = GC_SLEEP_MS;
+
+ return sleep_ms;
+}
static int btree_gc_recurse(struct btree *b, struct btree_op *op,
struct closure *writes, struct gc_stat *gc)
@@ -1668,8 +1668,7 @@ static int btree_gc_recurse(struct btree *b, struct btree_op *op,
memmove(r + 1, r, sizeof(r[0]) * (GC_MERGE_NODES - 1));
r->b = NULL;
- if (atomic_read(&b->c->search_inflight) &&
- gc->nodes >= gc->nodes_pre + btree_gc_min_nodes(b->c)) {
+ if (gc->nodes >= (gc->nodes_pre + btree_gc_min_nodes(b->c))) {
gc->nodes_pre = gc->nodes;
ret = -EAGAIN;
break;
@@ -1846,8 +1845,8 @@ static void bch_btree_gc(struct cache_set *c)
cond_resched();
if (ret == -EAGAIN)
- schedule_timeout_interruptible(msecs_to_jiffies
- (GC_SLEEP_MS));
+ schedule_timeout_interruptible(
+ msecs_to_jiffies(btree_gc_sleep_ms(c)));
else if (ret)
pr_warn("gc failed!\n");
} while (ret && !test_bit(CACHE_SET_IO_DISABLE, &c->flags));
--
2.47.2
* Re: [RFC PATCH] bcache: reduce gc latency by processing fewer nodes and sleeping less
@ 2025-09-05 22:43 ` Robert Pang
From: Robert Pang @ 2025-09-05 22:43 UTC (permalink / raw)
To: Coly Li; +Cc: Mingzhe Zou, linux-bcache
Hi Coly,
Thank you for your new patch. I have reviewed the patch, and it is
much simpler and cleaner indeed.
In addition, I ran the 24-hour stress tests and can confirm that it
brings a noticeable performance improvement. For our 4 KB writethrough
workload, your patch further reduces the median (P50) latency during
garbage collection from about 20 ms in the previous approach [1] to 4
ms [2]. I did not observe any issues during my testing.
This is a great step forward. I look forward to seeing this patch
submitted soon. And please feel free to add me to the sign-off list if
it fits.
Best regards,
Robert Pang
[1] https://gist.github.com/robert-pang/a22b7c5dee2600be3260f4db57e5776d
[2] https://gist.github.com/robert-pang/05b17921a83d59afc8aab28b5d9e9e0d
On Thu, Aug 28, 2025 at 9:25 AM Coly Li <i@coly.li> wrote:
>
> Hi Robert,
>
> Your patch breaks the ewma_add() method used to maintain the gc stats numbers, so I had to look for another approach that still achieves benchmark results similar to yours.
>
> Hi Mingzhe,
>
> The dynamic sleep interval and gc nodes patch is somewhat over-complicated IMHO, so I composed a simplified one based on the ideas from you and Robert.
>
>
> Can you all help review and test this RFC patch? I hope it works out. Thanks in advance for your help.
>
> Coly Li
>
>
>
> > On Aug 29, 2025, at 00:16, colyli@kernel.org wrote:
> >
> > When the bcache device is busy with high I/O load, there are two methods to
> > reduce the garbage collection latency,
> > - Process fewer nodes in each loop of incremental garbage collection in
> >   btree_gc_recurse().
> > - Sleep less time between two incremental garbage collection loops in
> >   bch_btree_gc().
> >
> > This patch introduces two helper routines to provide different garbage
> > collection node counts and sleep interval times,
> > - btree_gc_min_nodes()
> >   If there is no front-end I/O, return 1/128 of the total btree nodes
> >   (at least 10) to process in each incremental loop, otherwise only 10
> >   nodes are returned. Then front-end I/O is able to access the btree
> >   earlier.
> > - btree_gc_sleep_ms()
> >   If there is no synchronized wait for bucket allocation, sleep 100 ms
> >   between two incremental GC loops. Otherwise only sleep 10 ms before
> >   the next incremental GC loop. Then a faster GC may provide available
> >   buckets earlier, to avoid most of the bcache working threads being
> >   starved by bucket allocation.
> >
> > The idea is inspired by work from Mingzhe Zou and Robert Pang, but it is
> > much simpler and its expected behavior is more predictable.
> >
> > Signed-off-by: Coly Li <colyli@fnnas.com>
> > Cc: Robert Pang <robertpang@google.com>
> > Cc: Mingzhe Zou <mingzhe.zou@easystack.cn>
> > ---
>
> [snipped]
>