* [PATCH 0/1] bcache: reduce front IO latency during GC
@ 2025-04-14 22:44 Robert Pang
2025-04-14 22:44 ` [PATCH 1/1] bcache: process fewer btree nodes in incremental GC cycles Robert Pang
` (2 more replies)
0 siblings, 3 replies; 27+ messages in thread
From: Robert Pang @ 2025-04-14 22:44 UTC (permalink / raw)
To: Coly Li, Kent Overstreet, linux-bcache; +Cc: Robert Pang, Mingzhe Zou
In performance benchmarks on disks with bcache using the Linux 6.6 kernel, we
observe a noticeable IO latency increase during btree garbage collection. The
increase ranges from high tens to hundreds of milliseconds, depending on the
size of the cache device. Further investigation reveals that this is the same
issue reported in [1], where the large number of nodes processed in each
incremental GC cycle causes the front IO latency.
Building upon the approach suggested in [1], this patch decomposes the
incremental GC process into more, smaller cycles. In contrast to [1], this
implementation adopts a simpler strategy: it sets a lower limit of 10 nodes
per cycle to reduce front IO delay and introduces a fixed 10ms sleep per cycle
when front IO is in progress. Furthermore, when garbage collection statistics
are available, the number of nodes processed per cycle is dynamically rescaled
based on the average GC frequency to ensure GC completes well before the next
scheduled GC is due.
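[Editorial note: the rescaling heuristic described above can be modeled in
plain C as follows. This is an illustrative sketch, not the kernel code: the
function name and parameters are invented, and the in-kernel version reads
c->gc_stats.nodes and the averaged GC interval from btree_gc_time.]

```c
#include <stdint.h>
#include <stddef.h>

#define GC_MIN_NODES 10	/* floor: nodes processed per cycle */
#define GC_SLEEP_MS  10	/* sleep between cycles while front IO is active */

/*
 * Model of the per-cycle node budget: the GC duration budget is half the
 * observed average interval between GCs. If processing GC_MIN_NODES per
 * cycle would blow that budget (one GC_SLEEP_MS sleep per cycle), scale
 * the per-cycle count up so GC still finishes in time.
 */
static size_t gc_nodes_per_cycle(size_t total_nodes, uint64_t avg_gc_interval_ms)
{
	size_t min_nodes = GC_MIN_NODES;
	uint64_t gc_max_ms = avg_gc_interval_ms / 2;

	/* gc_max_ms >= GC_SLEEP_MS also guards the division below:
	 * it is 0 right after cache setup, before any stats exist. */
	if (gc_max_ms >= GC_SLEEP_MS &&
	    GC_SLEEP_MS * (total_nodes / min_nodes) > gc_max_ms)
		min_nodes = total_nodes / (gc_max_ms / GC_SLEEP_MS);

	return min_nodes;
}
```

For example, with 100,000 btree nodes and an observed 20 s GC interval, the
budget is 10 s; at one 10 ms sleep per cycle that allows 1,000 cycles, so each
cycle must process 100 nodes instead of the floor of 10.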
Testing with a 750GB NVMe cache and 256KB bucket size using the following fio
configuration demonstrates that our patch reduces front IO latency during GC
without significantly increasing GC duration.
ioengine=libaio
direct=1
bs=4k
size=900G
iodepth=10
readwrite=randwrite
log_avg_msec=10
Before:
time-ms,latency-ns,,,
12170, 285016, 1, 0, 0
12183, 296581, 1, 0, 0
12207, 6542725, 1, 0, 0
12242, 24483604, 1, 0, 0
12250, 1895628, 1, 0, 0
12260, 284854, 1, 0, 0
12270, 275513, 1, 0, 0
/sys/block/bcache0/bcache/cache/internal/btree_gc_average_duration_ms:2880
/sys/block/bcache0/bcache/cache/internal/btree_gc_average_frequency_sec:133
/sys/block/bcache0/bcache/cache/internal/btree_gc_last_sec:121
/sys/block/bcache0/bcache/cache/internal/btree_gc_max_duration_ms:3456
After:
12690, 378494, 1, 0, 0
12700, 413934, 1, 0, 0
12710, 661217, 1, 0, 0
12727, 354510, 1, 0, 0
12730, 1100768, 1, 0, 0
12742, 382484, 1, 0, 0
12750, 532679, 1, 0, 0
12760, 572758, 1, 0, 0
12773, 283416, 1, 0, 0
/sys/block/bcache0/bcache/cache/internal/btree_gc_average_duration_ms:3619
/sys/block/bcache0/bcache/cache/internal/btree_gc_average_frequency_sec:58
/sys/block/bcache0/bcache/cache/internal/btree_gc_last_sec:23
/sys/block/bcache0/bcache/cache/internal/btree_gc_max_duration_ms:3866
[1] https://lore.kernel.org/all/20220511073903.13568-1-mingzhe.zou@easystack.cn/
Robert Pang (1):
bcache: process fewer btree nodes in incremental GC cycles
drivers/md/bcache/btree.c | 38 +++++++++++++++++---------------------
drivers/md/bcache/util.h | 3 +++
2 files changed, 20 insertions(+), 21 deletions(-)
--
2.49.0.604.gff1f9ca942-goog
^ permalink raw reply [flat|nested] 27+ messages in thread* [PATCH 1/1] bcache: process fewer btree nodes in incremental GC cycles 2025-04-14 22:44 [PATCH 0/1] bcache: reduce front IO latency during GC Robert Pang @ 2025-04-14 22:44 ` Robert Pang 2025-04-15 2:08 ` Coly Li 2025-04-15 1:57 ` [PATCH 0/1] bcache: reduce front IO latency during GC Coly Li 2025-05-16 17:10 ` Kent Overstreet 2 siblings, 1 reply; 27+ messages in thread From: Robert Pang @ 2025-04-14 22:44 UTC (permalink / raw) To: Coly Li, Kent Overstreet, linux-bcache; +Cc: Robert Pang, Mingzhe Zou Current incremental GC processes a minimum of 100 btree nodes per cycle, followed by a 100ms sleep. For NVMe cache devices, where the average node processing time is ~1ms, this leads to front-side I/O latency potentially reaching tens or hundreds of milliseconds during GC execution. This commit resolves this latency issue by reducing the minimum node processing count per cycle to 10 and the inter-cycle sleep duration to 10ms. It also integrates a check of existing GC statistics to re-scale the number of nodes processed per sleep interval when needed, ensuring GC finishes well before the next GC is due. 
Signed-off-by: Robert Pang <robertpang@google.com> --- drivers/md/bcache/btree.c | 38 +++++++++++++++++--------------------- drivers/md/bcache/util.h | 3 +++ 2 files changed, 20 insertions(+), 21 deletions(-) diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c index ed40d8600656..093e1edcaa53 100644 --- a/drivers/md/bcache/btree.c +++ b/drivers/md/bcache/btree.c @@ -88,11 +88,8 @@ * Test module load/unload */ -#define MAX_NEED_GC 64 -#define MAX_SAVE_PRIO 72 -#define MAX_GC_TIMES 100 -#define MIN_GC_NODES 100 -#define GC_SLEEP_MS 100 +#define GC_MIN_NODES 10 +#define GC_SLEEP_MS 10 #define PTR_DIRTY_BIT (((uint64_t) 1 << 36)) @@ -1585,25 +1582,24 @@ static unsigned int btree_gc_count_keys(struct btree *b) static size_t btree_gc_min_nodes(struct cache_set *c) { - size_t min_nodes; + size_t min_nodes = GC_MIN_NODES; + uint64_t gc_max_ms = time_stat_average(&c->btree_gc_time, frequency, ms) / 2; /* - * Since incremental GC would stop 100ms when front - * side I/O comes, so when there are many btree nodes, - * if GC only processes constant (100) nodes each time, - * GC would last a long time, and the front side I/Os - * would run out of the buckets (since no new bucket - * can be allocated during GC), and be blocked again. - * So GC should not process constant nodes, but varied - * nodes according to the number of btree nodes, which - * realized by dividing GC into constant(100) times, - * so when there are many btree nodes, GC can process - * more nodes each time, otherwise, GC will process less - * nodes each time (but no less than MIN_GC_NODES) + * The incremental garbage collector operates by processing + * GC_MIN_NODES at a time, pausing for GC_SLEEP_MS between + * each interval. If historical garbage collection statistics + * (btree_gc_time) is available, the maximum allowable GC + * duration is set to half of this observed frequency. 
To + * prevent exceeding this maximum duration, the number of + * nodes processed in the current step may be increased if + * the projected completion time based on the current pace + * extends beyond the allowed limit. This ensures timely GC + * completion before the next GC is due. */ - min_nodes = c->gc_stats.nodes / MAX_GC_TIMES; - if (min_nodes < MIN_GC_NODES) - min_nodes = MIN_GC_NODES; + if ((gc_max_ms >= GC_SLEEP_MS) && + (GC_SLEEP_MS * (c->gc_stats.nodes / min_nodes)) > gc_max_ms) + min_nodes = c->gc_stats.nodes / (gc_max_ms / GC_SLEEP_MS); return min_nodes; } diff --git a/drivers/md/bcache/util.h b/drivers/md/bcache/util.h index 539454d8e2d0..21a370f444b7 100644 --- a/drivers/md/bcache/util.h +++ b/drivers/md/bcache/util.h @@ -305,6 +305,9 @@ static inline unsigned int local_clock_us(void) #define NSEC_PER_ms NSEC_PER_MSEC #define NSEC_PER_sec NSEC_PER_SEC +#define time_stat_average(stats, stat, units) \ + div_u64((stats)->average_ ## stat >> 8, NSEC_PER_ ## units) + #define __print_time_stat(stats, name, stat, units) \ sysfs_print(name ## _ ## stat ## _ ## units, \ div_u64((stats)->stat >> 8, NSEC_PER_ ## units)) -- 2.49.0.604.gff1f9ca942-goog ^ permalink raw reply related [flat|nested] 27+ messages in thread
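[Editorial note on the util.h hunk above: the `>> 8` in time_stat_average()
mirrors the existing __print_time_stat() macro, because bcache time statistics
store averaged values in nanoseconds with 8 fractional bits (scaled by 256).
A minimal userspace model of the conversion follows; the helper name is
illustrative, not part of the kernel API.]

```c
#include <stdint.h>

#define NSEC_PER_MS 1000000ULL

/*
 * bcache keeps averaged time stats as fixed-point nanoseconds with 8
 * fractional bits. Dropping the fraction (>> 8) and dividing by the
 * unit size yields the value sysfs reports, e.g. in milliseconds.
 */
static uint64_t stat_to_ms(uint64_t fixed_point_ns)
{
	return (fixed_point_ns >> 8) / NSEC_PER_MS;
}
```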
* Re: [PATCH 1/1] bcache: process fewer btree nodes in incremental GC cycles
  2025-04-14 22:44 ` [PATCH 1/1] bcache: process fewer btree nodes in incremental GC cycles Robert Pang
@ 2025-04-15  2:08 ` Coly Li
  2025-04-15 17:25 ` Robert Pang
  0 siblings, 1 reply; 27+ messages in thread
From: Coly Li @ 2025-04-15 2:08 UTC (permalink / raw)
To: Robert Pang; +Cc: Kent Overstreet, linux-bcache, Mingzhe Zou

On Mon, Apr 14, 2025 at 03:44:04PM +0800, Robert Pang wrote:
> Current incremental GC processes a minimum of 100 btree nodes per cycle,
> followed by a 100ms sleep. For NVMe cache devices, where the average node
> processing time is ~1ms, this leads to front-side I/O latency potentially
> reaching tens or hundreds of milliseconds during GC execution.
>
> This commit resolves this latency issue by reducing the minimum node processing
> count per cycle to 10 and the inter-cycle sleep duration to 10ms. It also
> integrates a check of existing GC statistics to re-scale the number of nodes
> processed per sleep interval when needed, ensuring GC finishes well before the
> next GC is due.
>
> Signed-off-by: Robert Pang <robertpang@google.com>
> ---
>  drivers/md/bcache/btree.c | 38 +++++++++++++++++---------------------
>  drivers/md/bcache/util.h  |  3 +++
>  2 files changed, 20 insertions(+), 21 deletions(-)
>
> diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
> index ed40d8600656..093e1edcaa53 100644
> --- a/drivers/md/bcache/btree.c
> +++ b/drivers/md/bcache/btree.c
> @@ -88,11 +88,8 @@
>   * Test module load/unload
>   */
>
> -#define MAX_NEED_GC		64
> -#define MAX_SAVE_PRIO		72

You may compose another patch for the above changes, to separate them
from the main idea of this patch.
> -#define MAX_GC_TIMES 100 > -#define MIN_GC_NODES 100 > -#define GC_SLEEP_MS 100 > +#define GC_MIN_NODES 10 > +#define GC_SLEEP_MS 10 > > #define PTR_DIRTY_BIT (((uint64_t) 1 << 36)) > > @@ -1585,25 +1582,24 @@ static unsigned int btree_gc_count_keys(struct btree *b) > > static size_t btree_gc_min_nodes(struct cache_set *c) > { > - size_t min_nodes; > + size_t min_nodes = GC_MIN_NODES; > + uint64_t gc_max_ms = time_stat_average(&c->btree_gc_time, frequency, ms) / 2; > > /* > - * Since incremental GC would stop 100ms when front > - * side I/O comes, so when there are many btree nodes, > - * if GC only processes constant (100) nodes each time, > - * GC would last a long time, and the front side I/Os > - * would run out of the buckets (since no new bucket > - * can be allocated during GC), and be blocked again. > - * So GC should not process constant nodes, but varied > - * nodes according to the number of btree nodes, which > - * realized by dividing GC into constant(100) times, > - * so when there are many btree nodes, GC can process > - * more nodes each time, otherwise, GC will process less > - * nodes each time (but no less than MIN_GC_NODES) > + * The incremental garbage collector operates by processing > + * GC_MIN_NODES at a time, pausing for GC_SLEEP_MS between > + * each interval. If historical garbage collection statistics > + * (btree_gc_time) is available, the maximum allowable GC > + * duration is set to half of this observed frequency. To > + * prevent exceeding this maximum duration, the number of > + * nodes processed in the current step may be increased if > + * the projected completion time based on the current pace > + * extends beyond the allowed limit. This ensures timely GC > + * completion before the next GC is due. 
> */ > - min_nodes = c->gc_stats.nodes / MAX_GC_TIMES; > - if (min_nodes < MIN_GC_NODES) > - min_nodes = MIN_GC_NODES; > + if ((gc_max_ms >= GC_SLEEP_MS) && > + (GC_SLEEP_MS * (c->gc_stats.nodes / min_nodes)) > gc_max_ms) > + min_nodes = c->gc_stats.nodes / (gc_max_ms / GC_SLEEP_MS); > Is it possible that gc_max_ms becomes 0? > return min_nodes; > } > diff --git a/drivers/md/bcache/util.h b/drivers/md/bcache/util.h > index 539454d8e2d0..21a370f444b7 100644 > --- a/drivers/md/bcache/util.h > +++ b/drivers/md/bcache/util.h > @@ -305,6 +305,9 @@ static inline unsigned int local_clock_us(void) > #define NSEC_PER_ms NSEC_PER_MSEC > #define NSEC_PER_sec NSEC_PER_SEC > > +#define time_stat_average(stats, stat, units) \ > + div_u64((stats)->average_ ## stat >> 8, NSEC_PER_ ## units) > + Could you please add a few code comments here to explain what does time_stat_average() do? Thanks in advance. > #define __print_time_stat(stats, name, stat, units) \ > sysfs_print(name ## _ ## stat ## _ ## units, \ > div_u64((stats)->stat >> 8, NSEC_PER_ ## units)) > -- > 2.49.0.604.gff1f9ca942-goog > -- Coly Li ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 1/1] bcache: process fewer btree nodes in incremental GC cycles 2025-04-15 2:08 ` Coly Li @ 2025-04-15 17:25 ` Robert Pang 2025-04-16 9:44 ` Coly Li 0 siblings, 1 reply; 27+ messages in thread From: Robert Pang @ 2025-04-15 17:25 UTC (permalink / raw) To: Coly Li; +Cc: Kent Overstreet, linux-bcache, Mingzhe Zou Thank you for your prompt feedback, Coly. On Mon, Apr 14, 2025 at 7:08 PM Coly Li <colyli@kernel.org> wrote: > > On Mon, Apr 14, 2025 at 03:44:04PM +0800, Robert Pang wrote: > > Current incremental GC processes a minimum of 100 btree nodes per cycle, > > followed by a 100ms sleep. For NVMe cache devices, where the average node > > processing time is ~1ms, this leads to front-side I/O latency potentially > > reaching tens or hundreds of milliseconds during GC execution. > > > > This commit resolves this latency issue by reducing the minimum node processing > > count per cycle to 10 and the inter-cycle sleep duration to 10ms. It also > > integrates a check of existing GC statistics to re-scale the number of nodes > > processed per sleep interval when needed, ensuring GC finishes well before the > > next GC is due. > > > > Signed-off-by: Robert Pang <robertpang@google.com> > > --- > > drivers/md/bcache/btree.c | 38 +++++++++++++++++--------------------- > > drivers/md/bcache/util.h | 3 +++ > > 2 files changed, 20 insertions(+), 21 deletions(-) > > > > diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c > > index ed40d8600656..093e1edcaa53 100644 > > --- a/drivers/md/bcache/btree.c > > +++ b/drivers/md/bcache/btree.c > > @@ -88,11 +88,8 @@ > > * Test module load/unload > > */ > > > > -#define MAX_NEED_GC 64 > > -#define MAX_SAVE_PRIO 72 > > You may compose another patch for the above changes, to separte them > from main idea of this patch. Certainly, Just sent this as a separate patch. 
> > -#define MAX_GC_TIMES 100 > > -#define MIN_GC_NODES 100 > > -#define GC_SLEEP_MS 100 > > +#define GC_MIN_NODES 10 > > +#define GC_SLEEP_MS 10 > > > > #define PTR_DIRTY_BIT (((uint64_t) 1 << 36)) > > > > @@ -1585,25 +1582,24 @@ static unsigned int btree_gc_count_keys(struct btree *b) > > > > static size_t btree_gc_min_nodes(struct cache_set *c) > > { > > - size_t min_nodes; > > + size_t min_nodes = GC_MIN_NODES; > > + uint64_t gc_max_ms = time_stat_average(&c->btree_gc_time, frequency, ms) / 2; > > > > /* > > - * Since incremental GC would stop 100ms when front > > - * side I/O comes, so when there are many btree nodes, > > - * if GC only processes constant (100) nodes each time, > > - * GC would last a long time, and the front side I/Os > > - * would run out of the buckets (since no new bucket > > - * can be allocated during GC), and be blocked again. > > - * So GC should not process constant nodes, but varied > > - * nodes according to the number of btree nodes, which > > - * realized by dividing GC into constant(100) times, > > - * so when there are many btree nodes, GC can process > > - * more nodes each time, otherwise, GC will process less > > - * nodes each time (but no less than MIN_GC_NODES) > > + * The incremental garbage collector operates by processing > > + * GC_MIN_NODES at a time, pausing for GC_SLEEP_MS between > > + * each interval. If historical garbage collection statistics > > + * (btree_gc_time) is available, the maximum allowable GC > > + * duration is set to half of this observed frequency. To > > + * prevent exceeding this maximum duration, the number of > > + * nodes processed in the current step may be increased if > > + * the projected completion time based on the current pace > > + * extends beyond the allowed limit. This ensures timely GC > > + * completion before the next GC is due. 
> > */ > > - min_nodes = c->gc_stats.nodes / MAX_GC_TIMES; > > - if (min_nodes < MIN_GC_NODES) > > - min_nodes = MIN_GC_NODES; > > + if ((gc_max_ms >= GC_SLEEP_MS) && > > + (GC_SLEEP_MS * (c->gc_stats.nodes / min_nodes)) > gc_max_ms) > > + min_nodes = c->gc_stats.nodes / (gc_max_ms / GC_SLEEP_MS); > > > > Is it possible that gc_max_ms becomes 0? Yes, gc_max_ms can be 0 initially when the cache is set up and stats are not collected yet. In that case, the check "gc_max_ms >= GC_SLEEP_MS" fails and we process at the default rate of 10 nodes per cycle. Importantly, this same check also serves to prevent a division-by-zero error when the number of nodes is re-scaled using the following calculation: min_nodes = c->gc_stats.nodes / (gc_max_ms / GC_SLEEP_MS); > > > return min_nodes; > > } > > diff --git a/drivers/md/bcache/util.h b/drivers/md/bcache/util.h > > index 539454d8e2d0..21a370f444b7 100644 > > --- a/drivers/md/bcache/util.h > > +++ b/drivers/md/bcache/util.h > > @@ -305,6 +305,9 @@ static inline unsigned int local_clock_us(void) > > #define NSEC_PER_ms NSEC_PER_MSEC > > #define NSEC_PER_sec NSEC_PER_SEC > > > > +#define time_stat_average(stats, stat, units) \ > > + div_u64((stats)->average_ ## stat >> 8, NSEC_PER_ ## units) > > + > > Could you please add a few code comments here to explain what does > time_stat_average() do? Will add the code comments with explanation in a new version promptly. > > Thanks in advance. > > > #define __print_time_stat(stats, name, stat, units) \ > > sysfs_print(name ## _ ## stat ## _ ## units, \ > > div_u64((stats)->stat >> 8, NSEC_PER_ ## units)) > > -- > > 2.49.0.604.gff1f9ca942-goog > > > > -- > Coly Li ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 1/1] bcache: process fewer btree nodes in incremental GC cycles 2025-04-15 17:25 ` Robert Pang @ 2025-04-16 9:44 ` Coly Li 2025-04-17 5:44 ` Robert Pang 0 siblings, 1 reply; 27+ messages in thread From: Coly Li @ 2025-04-16 9:44 UTC (permalink / raw) To: Robert Pang; +Cc: Coly Li, Kent Overstreet, linux-bcache, Mingzhe Zou > 2025年4月16日 01:25,Robert Pang <robertpang@google.com> 写道: > > Thank you for your prompt feedback, Coly. > > On Mon, Apr 14, 2025 at 7:08 PM Coly Li <colyli@kernel.org> wrote: >> >> On Mon, Apr 14, 2025 at 03:44:04PM +0800, Robert Pang wrote: >>> Current incremental GC processes a minimum of 100 btree nodes per cycle, >>> followed by a 100ms sleep. For NVMe cache devices, where the average node >>> processing time is ~1ms, this leads to front-side I/O latency potentially >>> reaching tens or hundreds of milliseconds during GC execution. >>> >>> This commit resolves this latency issue by reducing the minimum node processing >>> count per cycle to 10 and the inter-cycle sleep duration to 10ms. It also >>> integrates a check of existing GC statistics to re-scale the number of nodes >>> processed per sleep interval when needed, ensuring GC finishes well before the >>> next GC is due. >>> >>> Signed-off-by: Robert Pang <robertpang@google.com> >>> --- >>> drivers/md/bcache/btree.c | 38 +++++++++++++++++--------------------- >>> drivers/md/bcache/util.h | 3 +++ >>> 2 files changed, 20 insertions(+), 21 deletions(-) >>> >>> diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c >>> index ed40d8600656..093e1edcaa53 100644 >>> --- a/drivers/md/bcache/btree.c >>> +++ b/drivers/md/bcache/btree.c >>> @@ -88,11 +88,8 @@ >>> * Test module load/unload >>> */ >>> >>> -#define MAX_NEED_GC 64 >>> -#define MAX_SAVE_PRIO 72 >> >> You may compose another patch for the above changes, to separte them >> from main idea of this patch. > > Certainly, Just sent this as a separate patch. 
> >>> -#define MAX_GC_TIMES 100 >>> -#define MIN_GC_NODES 100 >>> -#define GC_SLEEP_MS 100 >>> +#define GC_MIN_NODES 10 >>> +#define GC_SLEEP_MS 10 >>> >>> #define PTR_DIRTY_BIT (((uint64_t) 1 << 36)) >>> >>> @@ -1585,25 +1582,24 @@ static unsigned int btree_gc_count_keys(struct btree *b) >>> >>> static size_t btree_gc_min_nodes(struct cache_set *c) >>> { >>> - size_t min_nodes; >>> + size_t min_nodes = GC_MIN_NODES; >>> + uint64_t gc_max_ms = time_stat_average(&c->btree_gc_time, frequency, ms) / 2; >>> >>> /* >>> - * Since incremental GC would stop 100ms when front >>> - * side I/O comes, so when there are many btree nodes, >>> - * if GC only processes constant (100) nodes each time, >>> - * GC would last a long time, and the front side I/Os >>> - * would run out of the buckets (since no new bucket >>> - * can be allocated during GC), and be blocked again. >>> - * So GC should not process constant nodes, but varied >>> - * nodes according to the number of btree nodes, which >>> - * realized by dividing GC into constant(100) times, >>> - * so when there are many btree nodes, GC can process >>> - * more nodes each time, otherwise, GC will process less >>> - * nodes each time (but no less than MIN_GC_NODES) >>> + * The incremental garbage collector operates by processing >>> + * GC_MIN_NODES at a time, pausing for GC_SLEEP_MS between >>> + * each interval. If historical garbage collection statistics >>> + * (btree_gc_time) is available, the maximum allowable GC >>> + * duration is set to half of this observed frequency. To >>> + * prevent exceeding this maximum duration, the number of >>> + * nodes processed in the current step may be increased if >>> + * the projected completion time based on the current pace >>> + * extends beyond the allowed limit. This ensures timely GC >>> + * completion before the next GC is due. 
>>> */ >>> - min_nodes = c->gc_stats.nodes / MAX_GC_TIMES; >>> - if (min_nodes < MIN_GC_NODES) >>> - min_nodes = MIN_GC_NODES; >>> + if ((gc_max_ms >= GC_SLEEP_MS) && >>> + (GC_SLEEP_MS * (c->gc_stats.nodes / min_nodes)) > gc_max_ms) >>> + min_nodes = c->gc_stats.nodes / (gc_max_ms / GC_SLEEP_MS); >>> >> >> Is it possible that gc_max_ms becomes 0? > > Yes, gc_max_ms can be 0 initially when the cache is set up and stats are not > collected yet. In that case, the check "gc_max_ms >= GC_SLEEP_MS" fails and > we process at the default rate of 10 nodes per cycle. > > Importantly, this same check also serves to prevent a division-by-zero error > when the number of nodes is re-scaled using the following calculation: > > min_nodes = c->gc_stats.nodes / (gc_max_ms / GC_SLEEP_MS); > >> Thanks. So the code itself I don’t have more comment, just look forward to more testing and benchmark results. Thanks. Coly Li >>> return min_nodes; >>> } >>> diff --git a/drivers/md/bcache/util.h b/drivers/md/bcache/util.h >>> index 539454d8e2d0..21a370f444b7 100644 >>> --- a/drivers/md/bcache/util.h >>> +++ b/drivers/md/bcache/util.h >>> @@ -305,6 +305,9 @@ static inline unsigned int local_clock_us(void) >>> #define NSEC_PER_ms NSEC_PER_MSEC >>> #define NSEC_PER_sec NSEC_PER_SEC >>> >>> +#define time_stat_average(stats, stat, units) \ >>> + div_u64((stats)->average_ ## stat >> 8, NSEC_PER_ ## units) >>> + >> >> Could you please add a few code comments here to explain what does >> time_stat_average() do? > > Will add the code comments with explanation in a new version promptly. > >> >> Thanks in advance. >> >>> #define __print_time_stat(stats, name, stat, units) \ >>> sysfs_print(name ## _ ## stat ## _ ## units, \ >>> div_u64((stats)->stat >> 8, NSEC_PER_ ## units)) >>> -- >>> 2.49.0.604.gff1f9ca942-goog >>> >> >> -- >> Coly Li ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 1/1] bcache: process fewer btree nodes in incremental GC cycles 2025-04-16 9:44 ` Coly Li @ 2025-04-17 5:44 ` Robert Pang 0 siblings, 0 replies; 27+ messages in thread From: Robert Pang @ 2025-04-17 5:44 UTC (permalink / raw) To: Coly Li; +Cc: Coly Li, Kent Overstreet, linux-bcache, Mingzhe Zou Thank you for the positive feedback and your review. Regarding testing, the 24-hour fio test is currently underway on a 6TB SSD with an iodepth set to 128. I will provide an update with the findings upon its conclusion. On Wed, Apr 16, 2025 at 2:44 AM Coly Li <i@coly.li> wrote: > > > > > 2025年4月16日 01:25,Robert Pang <robertpang@google.com> 写道: > > > > Thank you for your prompt feedback, Coly. > > > > On Mon, Apr 14, 2025 at 7:08 PM Coly Li <colyli@kernel.org> wrote: > >> > >> On Mon, Apr 14, 2025 at 03:44:04PM +0800, Robert Pang wrote: > >>> Current incremental GC processes a minimum of 100 btree nodes per cycle, > >>> followed by a 100ms sleep. For NVMe cache devices, where the average node > >>> processing time is ~1ms, this leads to front-side I/O latency potentially > >>> reaching tens or hundreds of milliseconds during GC execution. > >>> > >>> This commit resolves this latency issue by reducing the minimum node processing > >>> count per cycle to 10 and the inter-cycle sleep duration to 10ms. It also > >>> integrates a check of existing GC statistics to re-scale the number of nodes > >>> processed per sleep interval when needed, ensuring GC finishes well before the > >>> next GC is due. 
> >>> > >>> Signed-off-by: Robert Pang <robertpang@google.com> > >>> --- > >>> drivers/md/bcache/btree.c | 38 +++++++++++++++++--------------------- > >>> drivers/md/bcache/util.h | 3 +++ > >>> 2 files changed, 20 insertions(+), 21 deletions(-) > >>> > >>> diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c > >>> index ed40d8600656..093e1edcaa53 100644 > >>> --- a/drivers/md/bcache/btree.c > >>> +++ b/drivers/md/bcache/btree.c > >>> @@ -88,11 +88,8 @@ > >>> * Test module load/unload > >>> */ > >>> > >>> -#define MAX_NEED_GC 64 > >>> -#define MAX_SAVE_PRIO 72 > >> > >> You may compose another patch for the above changes, to separte them > >> from main idea of this patch. > > > > Certainly, Just sent this as a separate patch. > > > >>> -#define MAX_GC_TIMES 100 > >>> -#define MIN_GC_NODES 100 > >>> -#define GC_SLEEP_MS 100 > >>> +#define GC_MIN_NODES 10 > >>> +#define GC_SLEEP_MS 10 > >>> > >>> #define PTR_DIRTY_BIT (((uint64_t) 1 << 36)) > >>> > >>> @@ -1585,25 +1582,24 @@ static unsigned int btree_gc_count_keys(struct btree *b) > >>> > >>> static size_t btree_gc_min_nodes(struct cache_set *c) > >>> { > >>> - size_t min_nodes; > >>> + size_t min_nodes = GC_MIN_NODES; > >>> + uint64_t gc_max_ms = time_stat_average(&c->btree_gc_time, frequency, ms) / 2; > >>> > >>> /* > >>> - * Since incremental GC would stop 100ms when front > >>> - * side I/O comes, so when there are many btree nodes, > >>> - * if GC only processes constant (100) nodes each time, > >>> - * GC would last a long time, and the front side I/Os > >>> - * would run out of the buckets (since no new bucket > >>> - * can be allocated during GC), and be blocked again. 
> >>> - * So GC should not process constant nodes, but varied > >>> - * nodes according to the number of btree nodes, which > >>> - * realized by dividing GC into constant(100) times, > >>> - * so when there are many btree nodes, GC can process > >>> - * more nodes each time, otherwise, GC will process less > >>> - * nodes each time (but no less than MIN_GC_NODES) > >>> + * The incremental garbage collector operates by processing > >>> + * GC_MIN_NODES at a time, pausing for GC_SLEEP_MS between > >>> + * each interval. If historical garbage collection statistics > >>> + * (btree_gc_time) is available, the maximum allowable GC > >>> + * duration is set to half of this observed frequency. To > >>> + * prevent exceeding this maximum duration, the number of > >>> + * nodes processed in the current step may be increased if > >>> + * the projected completion time based on the current pace > >>> + * extends beyond the allowed limit. This ensures timely GC > >>> + * completion before the next GC is due. > >>> */ > >>> - min_nodes = c->gc_stats.nodes / MAX_GC_TIMES; > >>> - if (min_nodes < MIN_GC_NODES) > >>> - min_nodes = MIN_GC_NODES; > >>> + if ((gc_max_ms >= GC_SLEEP_MS) && > >>> + (GC_SLEEP_MS * (c->gc_stats.nodes / min_nodes)) > gc_max_ms) > >>> + min_nodes = c->gc_stats.nodes / (gc_max_ms / GC_SLEEP_MS); > >>> > >> > >> Is it possible that gc_max_ms becomes 0? > > > > Yes, gc_max_ms can be 0 initially when the cache is set up and stats are not > > collected yet. In that case, the check "gc_max_ms >= GC_SLEEP_MS" fails and > > we process at the default rate of 10 nodes per cycle. > > > > Importantly, this same check also serves to prevent a division-by-zero error > > when the number of nodes is re-scaled using the following calculation: > > > > min_nodes = c->gc_stats.nodes / (gc_max_ms / GC_SLEEP_MS); > > > >> > > Thanks. So the code itself I don’t have more comment, just look forward to more testing and benchmark results. > > Thanks. 
> > Coly Li > > > > > >>> return min_nodes; > >>> } > >>> diff --git a/drivers/md/bcache/util.h b/drivers/md/bcache/util.h > >>> index 539454d8e2d0..21a370f444b7 100644 > >>> --- a/drivers/md/bcache/util.h > >>> +++ b/drivers/md/bcache/util.h > >>> @@ -305,6 +305,9 @@ static inline unsigned int local_clock_us(void) > >>> #define NSEC_PER_ms NSEC_PER_MSEC > >>> #define NSEC_PER_sec NSEC_PER_SEC > >>> > >>> +#define time_stat_average(stats, stat, units) \ > >>> + div_u64((stats)->average_ ## stat >> 8, NSEC_PER_ ## units) > >>> + > >> > >> Could you please add a few code comments here to explain what does > >> time_stat_average() do? > > > > Will add the code comments with explanation in a new version promptly. > > > >> > >> Thanks in advance. > >> > >>> #define __print_time_stat(stats, name, stat, units) \ > >>> sysfs_print(name ## _ ## stat ## _ ## units, \ > >>> div_u64((stats)->stat >> 8, NSEC_PER_ ## units)) > >>> -- > >>> 2.49.0.604.gff1f9ca942-goog > >>> > >> > >> -- > >> Coly Li > > ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 0/1] bcache: reduce front IO latency during GC 2025-04-14 22:44 [PATCH 0/1] bcache: reduce front IO latency during GC Robert Pang 2025-04-14 22:44 ` [PATCH 1/1] bcache: process fewer btree nodes in incremental GC cycles Robert Pang @ 2025-04-15 1:57 ` Coly Li 2025-04-22 1:44 ` Robert Pang 2025-05-16 17:10 ` Kent Overstreet 2 siblings, 1 reply; 27+ messages in thread From: Coly Li @ 2025-04-15 1:57 UTC (permalink / raw) To: Robert Pang; +Cc: Coly Li, Kent Overstreet, linux-bcache, Mingzhe Zou Hi Robert, Thanks for the fix up :-) > 2025年4月15日 06:44,Robert Pang <robertpang@google.com> 写道: > > In performance benchmarks on disks with bcache using the Linux 6.6 kernel, we > observe noticeable IO latency increase during btree garbage collection. The > increase ranges from high tens to hundreds of milliseconds, depending on the > size of the cache device. Further investigation reveals that it is the same > issue reported in [1], where the large number of nodes processed in each > incremental GC cycle causes the front IO latency. > > Building upon the approach suggested in [1], this patch decomposes the > incremental GC process into more but smaller cycles. In contrast to [1], this > implementation adopts a simpler strategy by setting a lower limit of 10 nodes > per cycle to reduce front IO delay and introducing a fixed 10ms sleep per cycle > when front IO is in progress. Furthermore, when garbage collection statistics > are available, the number of nodes processed per cycle is dynamically rescaled > based on the average GC frequency to ensure GC completes well within the next > subsequent scheduled interval. > > Testing with a 750GB NVMe cache and 256KB bucket size using the following fio > configuration demonstrates that our patch reduces front IO latency during GC > without significantly increasing GC duration. 
> > ioengine=libaio > direct=1 > bs=4k > size=900G > iodepth=10 > readwrite=randwrite > log_avg_msec=10 > > Before: > > time-ms,latency-ns,,, > > 12170, 285016, 1, 0, 0 > 12183, 296581, 1, 0, 0 > 12207, 6542725, 1, 0, 0 > 12242, 24483604, 1, 0, 0 > 12250, 1895628, 1, 0, 0 > 12260, 284854, 1, 0, 0 > 12270, 275513, 1, 0, 0 > > /sys/block/bcache0/bcache/cache/internal/btree_gc_average_duration_ms:2880 > /sys/block/bcache0/bcache/cache/internal/btree_gc_average_frequency_sec:133 > /sys/block/bcache0/bcache/cache/internal/btree_gc_last_sec:121 > /sys/block/bcache0/bcache/cache/internal/btree_gc_max_duration_ms:3456 > > After: > > 12690, 378494, 1, 0, 0 > 12700, 413934, 1, 0, 0 > 12710, 661217, 1, 0, 0 > 12727, 354510, 1, 0, 0 > 12730, 1100768, 1, 0, 0 > 12742, 382484, 1, 0, 0 > 12750, 532679, 1, 0, 0 > 12760, 572758, 1, 0, 0 > 12773, 283416, 1, 0, 0 > > /sys/block/bcache0/bcache/cache/internal/btree_gc_average_duration_ms:3619 > /sys/block/bcache0/bcache/cache/internal/btree_gc_average_frequency_sec:58 > /sys/block/bcache0/bcache/cache/internal/btree_gc_last_sec:23 > /sys/block/bcache0/bcache/cache/internal/btree_gc_max_duration_ms:3866 > > [1] https://lore.kernel.org/all/20220511073903.13568-1-mingzhe.zou@easystack.cn/ I see the data, it makes sense. I’d like to see more testing data, e.g. 1) Larger SSD (4T or 8T) 2) Higher I/O pressure (randomwrite) 3) Much longer time with high I/O pressure (24 hours+) Then it can help me to understand the optimization better and make decision easier. Of course it will save a lot of testing time from my side. Thank you in advance. Coly Li ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 0/1] bcache: reduce front IO latency during GC 2025-04-15 1:57 ` [PATCH 0/1] bcache: reduce front IO latency during GC Coly Li @ 2025-04-22 1:44 ` Robert Pang 2025-05-02 1:01 ` Robert Pang 0 siblings, 1 reply; 27+ messages in thread From: Robert Pang @ 2025-04-22 1:44 UTC (permalink / raw) To: Coly Li; +Cc: Coly Li, Kent Overstreet, linux-bcache, Mingzhe Zou I conducted a 24-hour fio random write test on a 6TB local SSD cache using 256kb buckets and the following parameters: ioengine=libaio direct=1 bs=4k size=12T iodepth=128 readwrite=randwrite log_avg_msec=10 The results show some improvement in average write latency, resulting in increased IOPS (from 11.8k to 17.1k) and throughput (from 45.9MiB/s to 66.7MiB/s). However, the maximum write latency exhibited a considerable degradation. Before: latency_test: (groupid=0, jobs=1): err= 0: pid=14917: Mon Apr 21 06:50:45 2025 write: IOPS=11.8k, BW=45.9MiB/s (48.1MB/s)(4035GiB/90000002msec); 0 zone resets slat (usec): min=9, max=4003.9k, avg=74.37, stdev=3734.09 clat (usec): min=2, max=5319.2k, avg=10815.66, stdev=46998.95 lat (usec): min=239, max=5319.3k, avg=10890.15, stdev=47167.41 /sys/block/bcache0/bcache/cache/internal/btree_gc_average_duration_ms:381138 /sys/block/bcache0/bcache/cache/internal/btree_gc_average_frequency_sec:3834 /sys/block/bcache0/bcache/cache/internal/btree_gc_last_sec:60668 /sys/block/bcache0/bcache/cache/internal/btree_gc_max_duration_ms:825228 /sys/block/bcache0/bcache/cache/internal/bset_tree_stats: btree nodes: 144330 written sets: 283733 unwritten sets: 144329 written key bytes: 24993783392 unwritten key bytes: 11777400 floats: 30936844345385 failed: 5776 After: latency_test: (groupid=0, jobs=1): err= 0: pid=15158: Mon Apr 21 17:22:13 2025 write: IOPS=17.1k, BW=66.7MiB/s (69.9MB/s)(5859GiB/90000004msec); 0 zone resets slat (usec): min=7, max=469348k, avg=46.71, stdev=20576.47 clat (usec): min=3, max=560660k, avg=7453.04, stdev=458965.53 lat (usec): min=313, max=560660k, 
avg=7499.88, stdev=459426.83 /sys/block/bcache0/bcache/cache/internal/btree_gc_average_duration_ms:540350 /sys/block/bcache0/bcache/cache/internal/btree_gc_average_frequency_sec:3235 /sys/block/bcache0/bcache/cache/internal/btree_gc_last_sec:22304 /sys/block/bcache0/bcache/cache/internal/btree_gc_max_duration_ms:719097 /sys/block/bcache0/bcache/cache/internal/bset_tree_stats: btree nodes: 239590 written sets: 466470 unwritten sets: 239587 written key bytes: 36120573032 unwritten key bytes: 16776624 floats: 95606253613657 failed: 4522 Further investigation pointed to the 10ms sleep time being too restrictive for this intensive workload. Examining tail latency percentiles outside garbage collection revealed that a significant number of I/O requests remained in the queue for extended periods. The observed percentiles are: Percentile Latency (ms) P85 10 P86 11 P87 13 P88 16 P89 22 P90 36 P91 60 P92 104 P93 174 P94 231 P95 257 P96 282 P97 308 P98 337 P99 377 P100 472 This data suggests that while a 10ms sleep time might be suitable for lighter workloads, it becomes a bottleneck under heavier stress. In light of these test results, I propose keeping the original sleep time and minimum nodes per cycle as defaults, while also exposing them as configurable attributes through sysfs so users can tune them according to their workload and latency SLO. I would appreciate your insights on this potential enhancement. On Mon, Apr 14, 2025 at 6:57 PM Coly Li <i@coly.li> wrote: > > Hi Robert, > > Thanks for the fix up :-) > > > On Apr 15, 2025, at 06:44, Robert Pang <robertpang@google.com> wrote: > > > > In performance benchmarks on disks with bcache using the Linux 6.6 kernel, we > > observe noticeable IO latency increase during btree garbage collection. The > > increase ranges from high tens to hundreds of milliseconds, depending on the > > size of the cache device. 
Further investigation reveals that it is the same > > issue reported in [1], where the large number of nodes processed in each > > incremental GC cycle causes the front IO latency. > > > > Building upon the approach suggested in [1], this patch decomposes the > > incremental GC process into more but smaller cycles. In contrast to [1], this > > implementation adopts a simpler strategy by setting a lower limit of 10 nodes > > per cycle to reduce front IO delay and introducing a fixed 10ms sleep per cycle > > when front IO is in progress. Furthermore, when garbage collection statistics > > are available, the number of nodes processed per cycle is dynamically rescaled > > based on the average GC frequency to ensure GC completes well within the next > > subsequent scheduled interval. > > > > Testing with a 750GB NVMe cache and 256KB bucket size using the following fio > > configuration demonstrates that our patch reduces front IO latency during GC > > without significantly increasing GC duration. 
> > > > ioengine=libaio > > direct=1 > > bs=4k > > size=900G > > iodepth=10 > > readwrite=randwrite > > log_avg_msec=10 > > > > Before: > > > > time-ms,latency-ns,,, > > > > 12170, 285016, 1, 0, 0 > > 12183, 296581, 1, 0, 0 > > 12207, 6542725, 1, 0, 0 > > 12242, 24483604, 1, 0, 0 > > 12250, 1895628, 1, 0, 0 > > 12260, 284854, 1, 0, 0 > > 12270, 275513, 1, 0, 0 > > > > /sys/block/bcache0/bcache/cache/internal/btree_gc_average_duration_ms:2880 > > /sys/block/bcache0/bcache/cache/internal/btree_gc_average_frequency_sec:133 > > /sys/block/bcache0/bcache/cache/internal/btree_gc_last_sec:121 > > /sys/block/bcache0/bcache/cache/internal/btree_gc_max_duration_ms:3456 > > > > After: > > > > 12690, 378494, 1, 0, 0 > > 12700, 413934, 1, 0, 0 > > 12710, 661217, 1, 0, 0 > > 12727, 354510, 1, 0, 0 > > 12730, 1100768, 1, 0, 0 > > 12742, 382484, 1, 0, 0 > > 12750, 532679, 1, 0, 0 > > 12760, 572758, 1, 0, 0 > > 12773, 283416, 1, 0, 0 > > > > /sys/block/bcache0/bcache/cache/internal/btree_gc_average_duration_ms:3619 > > /sys/block/bcache0/bcache/cache/internal/btree_gc_average_frequency_sec:58 > > /sys/block/bcache0/bcache/cache/internal/btree_gc_last_sec:23 > > /sys/block/bcache0/bcache/cache/internal/btree_gc_max_duration_ms:3866 > > > > [1] https://lore.kernel.org/all/20220511073903.13568-1-mingzhe.zou@easystack.cn/ > > > I see the data, it makes sense. I’d like to see more testing data, e.g. > 1) Larger SSD (4T or 8T) > 2) Higher I/O pressure (randomwrite) > 3) Much longer time with high I/O pressure (24 hours+) > > Then it can help me to understand the optimization better and make decision easier. Of course it will save a lot of testing time from my side. > > Thank you in advance. > > Coly Li > ^ permalink raw reply [flat|nested] 27+ messages in thread
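A back-of-envelope model explains why GC duration grows under sustained load: when front IO keeps the cache busy, every incremental cycle pays the fixed 10ms sleep, so the sleeps alone put a floor under the total GC time. A hedged sketch (illustrative helper, not code from the patch):

```c
#include <assert.h>

/* Illustrative model, not bcache code: lower bound on incremental GC
 * duration contributed purely by the per-cycle sleep, assuming front IO
 * stays active so that GC sleeps after every cycle. */
static unsigned long gc_sleep_floor_ms(unsigned long nodes,
                                       unsigned long nodes_per_cycle,
                                       unsigned long sleep_ms)
{
    /* Round up: a partial last batch still costs a full cycle. */
    unsigned long cycles = (nodes + nodes_per_cycle - 1) / nodes_per_cycle;

    return cycles * sleep_ms;
}
```

For the roughly 240k btree nodes in the "After" run above, 10 nodes per cycle at 10ms per sleep yields a floor of about 240 seconds of sleeping per GC pass, which is one reason btree_gc_average_duration_ms rises even as front IO latency falls.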
* Re: [PATCH 0/1] bcache: reduce front IO latency during GC 2025-04-22 1:44 ` Robert Pang @ 2025-05-02 1:01 ` Robert Pang 2025-05-03 17:33 ` Coly Li 0 siblings, 1 reply; 27+ messages in thread From: Robert Pang @ 2025-05-02 1:01 UTC (permalink / raw) To: Coly Li; +Cc: Coly Li, Kent Overstreet, linux-bcache, Mingzhe Zou Hi Coly, Please disregard the test results I shared over a week ago. After digging deeper into the recent latency spikes with various workloads and by instrumenting the garbage collector, I realized that the earlier GC latency patch, "bcache: allow allocator to invalidate bucket in gc" [1], wasn't backported to the Linux 6.6 branch I tested my patch against. This omission explains the much higher latency observed during the extended test because the allocator was blocked for the entire GC. My sincere apologies for the inconsistent results and any confusion this has caused. With patch [1] back-patched and after a 24-hour re-test, the fio results clearly demonstrate that this patch effectively reduces front IO latency during GC due to the smaller incremental GC cycles, while the GC duration increase is still well within bounds. Here's a summary of the improved latency: Before: Median latency (P50): 210 ms Max latency (P100): 3.5 sec btree_gc_average_duration_ms:381138 btree_gc_average_frequency_sec:3834 btree_gc_last_sec:60668 btree_gc_max_duration_ms:825228 bset_tree_stats: btree nodes: 144330 written sets: 283733 unwritten sets: 144329 written key bytes: 24993783392 unwritten key bytes: 11777400 floats: 30936844345385 failed: 5776 After: Median latency (P50): 25 ms Max latency (P100): 0.8 sec btree_gc_average_duration_ms:622274 btree_gc_average_frequency_sec:3518 btree_gc_last_sec:8931 btree_gc_max_duration_ms:953146 bset_tree_stats: btree nodes: 175491 written sets: 339078 unwritten sets: 175488 written key bytes: 29821314856 unwritten key bytes: 14076504 floats: 90520963280544 failed: 6462 The complete latency data is available at [2]. 
I will be glad to run further tests to solidify these findings for the inclusion of this patch in the coming merge window. Let me know if you'd like me to conduct any specific tests. Best regards Robert Pang [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a14a68b76954e73031ca6399abace17dcb77c17a [2] https://gist.github.com/robert-pang/cc7c88f356293ea6d43103e6e5f9180f On Mon, Apr 21, 2025 at 6:44 PM Robert Pang <robertpang@google.com> wrote: > > I conducted a 24-hour fio random write test on a 6TB local SSD cache using > 256kb buckets and the following parameters: > > ioengine=libaio > direct=1 > bs=4k > size=12T > iodepth=128 > readwrite=randwrite > log_avg_msec=10 > > The results show some improvement in average write latency, resulting in > increased IOPS (from 11.8k to 17.1k) and throughput (from 45.9MiB/s to > 66.7MiB/s). However, the maximum write latency exhibited a considerable > degradation. > > Before: > > latency_test: (groupid=0, jobs=1): err= 0: pid=14917: Mon Apr 21 06:50:45 2025 > write: IOPS=11.8k, BW=45.9MiB/s (48.1MB/s)(4035GiB/90000002msec); 0 > zone resets > slat (usec): min=9, max=4003.9k, avg=74.37, stdev=3734.09 > clat (usec): min=2, max=5319.2k, avg=10815.66, stdev=46998.95 > lat (usec): min=239, max=5319.3k, avg=10890.15, stdev=47167.41 > > /sys/block/bcache0/bcache/cache/internal/btree_gc_average_duration_ms:381138 > /sys/block/bcache0/bcache/cache/internal/btree_gc_average_frequency_sec:3834 > /sys/block/bcache0/bcache/cache/internal/btree_gc_last_sec:60668 > /sys/block/bcache0/bcache/cache/internal/btree_gc_max_duration_ms:825228 > /sys/block/bcache0/bcache/cache/internal/bset_tree_stats: > btree nodes: 144330 > written sets: 283733 > unwritten sets: 144329 > written key bytes: 24993783392 > unwritten key bytes: 11777400 > floats: 30936844345385 > failed: 5776 > > > After: > > latency_test: (groupid=0, jobs=1): err= 0: pid=15158: Mon Apr 21 17:22:13 2025 > write: IOPS=17.1k, BW=66.7MiB/s 
(69.9MB/s)(5859GiB/90000004msec); 0 > zone resets > slat (usec): min=7, max=469348k, avg=46.71, stdev=20576.47 > clat (usec): min=3, max=560660k, avg=7453.04, stdev=458965.53 > lat (usec): min=313, max=560660k, avg=7499.88, stdev=459426.83 > > /sys/block/bcache0/bcache/cache/internal/btree_gc_average_duration_ms:540350 > /sys/block/bcache0/bcache/cache/internal/btree_gc_average_frequency_sec:3235 > /sys/block/bcache0/bcache/cache/internal/btree_gc_last_sec:22304 > /sys/block/bcache0/bcache/cache/internal/btree_gc_max_duration_ms:719097 > /sys/block/bcache0/bcache/cache/internal/bset_tree_stats: > btree nodes: 239590 > written sets: 466470 > unwritten sets: 239587 > written key bytes: 36120573032 > unwritten key bytes: 16776624 > floats: 95606253613657 > failed: 4522 > > Further investigation pointed to the 10ms sleep time being too restrictive for > this intensive workload. Examining tail latency percentiles outside garbage > collection revealed that a significant number of I/O requests remained in the > queue for extended periods. The observed percentiles are: > > Percentile Latency (ms) > P85 10 > P86 11 > P87 13 > P88 16 > P89 22 > P90 36 > P91 60 > P92 104 > P93 174 > P94 231 > P95 257 > P96 282 > P97 308 > P98 337 > P99 377 > P100 472 > > This data suggests that while a 10ms sleep time might be suitable for lighter > workloads, it becomes a bottleneck under heavier stress. > > In consideration of the test results, I propose the feasibility of keeping the > original sleep time and min nodes per cycle, but also exposing them as > configurable attributes through sysfs so users can tune according to their > workload and latency SLO. I would appreciate your insights on this potential > enhancement. 
> > On Mon, Apr 14, 2025 at 6:57 PM Coly Li <i@coly.li> wrote: > > > > Hi Robert, > > > > Thanks for the fix up :-) > > > > > 2025年4月15日 06:44,Robert Pang <robertpang@google.com> 写道: > > > > > > In performance benchmarks on disks with bcache using the Linux 6.6 kernel, we > > > observe noticeable IO latency increase during btree garbage collection. The > > > increase ranges from high tens to hundreds of milliseconds, depending on the > > > size of the cache device. Further investigation reveals that it is the same > > > issue reported in [1], where the large number of nodes processed in each > > > incremental GC cycle causes the front IO latency. > > > > > > Building upon the approach suggested in [1], this patch decomposes the > > > incremental GC process into more but smaller cycles. In contrast to [1], this > > > implementation adopts a simpler strategy by setting a lower limit of 10 nodes > > > per cycle to reduce front IO delay and introducing a fixed 10ms sleep per cycle > > > when front IO is in progress. Furthermore, when garbage collection statistics > > > are available, the number of nodes processed per cycle is dynamically rescaled > > > based on the average GC frequency to ensure GC completes well within the next > > > subsequent scheduled interval. > > > > > > Testing with a 750GB NVMe cache and 256KB bucket size using the following fio > > > configuration demonstrates that our patch reduces front IO latency during GC > > > without significantly increasing GC duration. 
> > > > > > ioengine=libaio > > > direct=1 > > > bs=4k > > > size=900G > > > iodepth=10 > > > readwrite=randwrite > > > log_avg_msec=10 > > > > > > Before: > > > > > > time-ms,latency-ns,,, > > > > > > 12170, 285016, 1, 0, 0 > > > 12183, 296581, 1, 0, 0 > > > 12207, 6542725, 1, 0, 0 > > > 12242, 24483604, 1, 0, 0 > > > 12250, 1895628, 1, 0, 0 > > > 12260, 284854, 1, 0, 0 > > > 12270, 275513, 1, 0, 0 > > > > > > /sys/block/bcache0/bcache/cache/internal/btree_gc_average_duration_ms:2880 > > > /sys/block/bcache0/bcache/cache/internal/btree_gc_average_frequency_sec:133 > > > /sys/block/bcache0/bcache/cache/internal/btree_gc_last_sec:121 > > > /sys/block/bcache0/bcache/cache/internal/btree_gc_max_duration_ms:3456 > > > > > > After: > > > > > > 12690, 378494, 1, 0, 0 > > > 12700, 413934, 1, 0, 0 > > > 12710, 661217, 1, 0, 0 > > > 12727, 354510, 1, 0, 0 > > > 12730, 1100768, 1, 0, 0 > > > 12742, 382484, 1, 0, 0 > > > 12750, 532679, 1, 0, 0 > > > 12760, 572758, 1, 0, 0 > > > 12773, 283416, 1, 0, 0 > > > > > > /sys/block/bcache0/bcache/cache/internal/btree_gc_average_duration_ms:3619 > > > /sys/block/bcache0/bcache/cache/internal/btree_gc_average_frequency_sec:58 > > > /sys/block/bcache0/bcache/cache/internal/btree_gc_last_sec:23 > > > /sys/block/bcache0/bcache/cache/internal/btree_gc_max_duration_ms:3866 > > > > > > [1] https://lore.kernel.org/all/20220511073903.13568-1-mingzhe.zou@easystack.cn/ > > > > > > I see the data, it makes sense. I’d like to see more testing data, e.g. > > 1) Larger SSD (4T or 8T) > > 2) Higher I/O pressure (randomwrite) > > 3) Much longer time with high I/O pressure (24 hours+) > > > > Then it can help me to understand the optimization better and make decision easier. Of course it will save a lot of testing time from my side. > > > > Thank you in advance. > > > > Coly Li > > ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 0/1] bcache: reduce front IO latency during GC 2025-05-02 1:01 ` Robert Pang @ 2025-05-03 17:33 ` Coly Li 2025-05-15 0:58 ` Robert Pang 0 siblings, 1 reply; 27+ messages in thread From: Coly Li @ 2025-05-03 17:33 UTC (permalink / raw) To: Robert Pang; +Cc: Coly Li, Kent Overstreet, linux-bcache, Mingzhe Zou On Thu, May 01, 2025 at 06:01:09PM +0800, Robert Pang wrote: > Hi Coly, > > Please disregard the test results I shared over a week ago. After digging > deeper into the recent latency spikes with various workloads and by > instrumenting the garbage collector, I realized that the earlier GC latency > patch, "bcache: allow allocator to invalidate bucket in gc" [1], wasn't > backported to the Linux 6.6 branch I tested my patch against. This omission > explains the much higher latency observed during the extended test because the > allocator was blocked for the entire GC. My sincere apologies for the > inconsistent results and any confusion this has caused. > Did you also backport commit 05356938a4be ("bcache: call force_wake_up_gc() if necessary in check_should_bypass()")? Last time when you pushed me to add commit a14a68b76954 into the mainline kernel, I tested a regression from this patch and fixed it. Please add this fix if you didn't, otherwise the testing might not be complete. > With patch [1] back-patched and after a 24-hour re-test, the fio results clearly > demonstrate that this patch effectively reduces front IO latency during GC due > to the smaller incremental GC cycles, while the GC duration increase is still > well within bounds. > From the performance result in [2], it seems the max latency is reduced, but the high-latency periods are longer. I am not sure whether this is a happy result. Can I have a download link for the whole log? Then I can look at the performance numbers more closely. 
> Here's a summary of the improved latency: > > Before: > > Median latency (P50): 210 ms > Max latency (P100): 3.5 sec > > btree_gc_average_duration_ms:381138 > btree_gc_average_frequency_sec:3834 > btree_gc_last_sec:60668 > btree_gc_max_duration_ms:825228 > bset_tree_stats: > btree nodes: 144330 > written sets: 283733 > unwritten sets: 144329 > written key bytes: 24993783392 > unwritten key bytes: 11777400 > floats: 30936844345385 > failed: 5776 > > After: > > Median latency (P50): 25 ms > Max latency (P100): 0.8 sec > > btree_gc_average_duration_ms:622274 > btree_gc_average_frequency_sec:3518 > btree_gc_last_sec:8931 > btree_gc_max_duration_ms:953146 > bset_tree_stats: > btree nodes: 175491 > written sets: 339078 > unwritten sets: 175488 > written key bytes: 29821314856 > unwritten key bytes: 14076504 > floats: 90520963280544 > failed: 6462 > > The complete latency data is available at [2]. > > I will be glad to run further tests to solidify these findings for the inclusion > of this patch in the coming merge window. Let me know if you'd like me to > conduct any specific tests. Yes, more testing is necessary, from 512-byte block size up to 1 MiB or 8 MiB block sizes. We need to make sure it won't introduce a performance regression in other workloads or circumstances. I have no plan to submit this patch in this merge window, so please don't push me. For a performance improvement change, I prefer that the default configuration covers most workloads, so more testing and performance data are desired. E.g. the patch you mentioned (commit a14a68b76954 "bcache: allow allocator to invalidate bucket in gc") had been deployed in the EasyStack production environment for 20+ months before it got merged. Thanks. 
> > [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a14a68b76954e73031ca6399abace17dcb77c17a > [2] https://gist.github.com/robert-pang/cc7c88f356293ea6d43103e6e5f9180f [snipped] -- Coly Li ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 0/1] bcache: reduce front IO latency during GC 2025-05-03 17:33 ` Coly Li @ 2025-05-15 0:58 ` Robert Pang 2025-05-15 1:04 ` Robert Pang ` (2 more replies) 0 siblings, 3 replies; 27+ messages in thread From: Robert Pang @ 2025-05-15 0:58 UTC (permalink / raw) To: Coly Li; +Cc: Coly Li, Kent Overstreet, linux-bcache, Mingzhe Zou, Kuan-Wei Chiu Hi Coly, My apologies for the delay in providing this update; comprehensive testing takes some time to complete. As you suggested, I conducted extensive tests for 24 hours against the latest 6.14.5 Linux kernel, exploring more configurations to get a complete picture: 1. 4KB block size with writethrough mode 2. 4KB block size with writeback mode (70% dirty) 3. 1MB block size with writethrough mode The detailed results, available at [1], consistently demonstrate that our patch is effective in significantly reducing latency during garbage collection. This holds true for both the default writethrough mode and the 70% writeback mode. As anticipated, with 1MB block sizes, we observed no difference in latency because the number of btree nodes is much smaller. [1] https://gist.github.com/robert-pang/817fa7c11ece99d25aabc0467a9427d8 However, during these tests, we've uncovered a new and distinct latency problem that appears to be introduced in the recent Linux kernel. This issue manifests as frequent and periodic latency spikes that occur outside of garbage collection. 
Below is a snippet of the latency data illustrating this: time (s) median (ms) max (ms) 60810 2.28 679.37 60840 2.32 2,434.24 * 60870 2.46 2,434.24 * 60900 2.52 2,434.24 * 60930 2.63 566.15 60960 2.82 566.15 60990 2.82 566.15 61020 2.78 471.79 61050 2.93 2,028.54 * 61080 3.11 2,028.54 * 61110 3.29 2,028.54 * 61140 3.42 679.37 61170 3.42 679.37 61200 3.41 679.37 61230 3.30 566.15 61260 2.93 1,690.45 * 61290 2.75 1,690.45 * 61320 2.72 1,690.45 * 61350 2.88 1,408.71 * 61380 5.07 1,408.71 * 61410 107.94 1,408.71 ** 61440 65.28 1,408.71 ** 61470 45.41 2,028.54 ** 61500 72.45 2,028.54 ** 61530 55.37 2,028.54 ** 61560 40.73 1,408.71 ** 61590 11.48 1,690.45 ** 61620 2.92 1,690.45 * 61650 2.54 1,690.45 * 61680 2.58 679.37 61710 2.78 679.37 ** garbage collection * cache replacement Based on the consistent periodicity of these spikes, we deduce that they are linked to the invalidate_buckets_lru() function during cache replacement. This function was recently modified to use min heap operations [2]. To confirm our hypothesis, we reverted the relevant commits and re-ran the tests. Results show that the latency spikes completely disappeared, confirming that the min heap changes introduced this regression. Furthermore, these changes also reduce the effectiveness of our GC patch. It appears that the min heap changes somehow reduce heap sort speed in invalidate_buckets_lru() and in GC. [2] https://lore.kernel.org/linux-bcache/ZxzkLJmhn3a%2F1ALQ@visitorckw-System-Product-Name/T/#m0dd24ba0c63615de465d3fec72dc73febb0f7a94 You may download the full test result data from these links. 
https://gist.github.com/robert-pang/5df1d595ee77756c0a01d6479bdf8e34#file-bcache-latency-4kb-no-patch-csv https://gist.github.com/robert-pang/5df1d595ee77756c0a01d6479bdf8e34#file-bcache-latency-4kb-with-patch-csv https://gist.github.com/robert-pang/bcc26a3aa90dc95a083799cf4fd48116#file-bcache-latency-4kb-wb-no-patch-csv https://gist.github.com/robert-pang/bcc26a3aa90dc95a083799cf4fd48116#file-bcache-latency-4kb-wb-with-patch-csv https://gist.github.com/robert-pang/7036b06b66c8de7e958cdbddcd92a3f5#file-bcache-latency-1mb-no-patch-csv https://gist.github.com/robert-pang/7036b06b66c8de7e958cdbddcd92a3f5#file-bcache-latency-1mb-with-patch-csv https://gist.github.com/robert-pang/40f90afdea2d2a8c3f6e22ff959eff03#file-bcache-latency-4kb-no-patch-min-heap-reverted-csv https://gist.github.com/robert-pang/40f90afdea2d2a8c3f6e22ff959eff03#file-bcache-latency-4kb-with-patch-min-heap-reverted-csv Best regards Robert Pang On Sat, May 3, 2025 at 10:33 AM Coly Li <colyli@kernel.org> wrote: > > On Thu, May 01, 2025 at 06:01:09PM +0800, Robert Pang wrote: > > Hi Coly, > > > > Please disregard the test results I shared over a week ago. After digging > > deeper into the recent latency spikes with various workloads and by > > instrumenting the garbage collector, I realized that the earlier GC latency > > patch, "bcache: allow allocator to invalidate bucket in gc" [1], wasn't > > backported to the Linux 6.6 branch I tested my patch against. This omission > > explains the much higher latency observed during the extended test because the > > allocator was blocked for the entire GC. My sincere apologies for the > > inconsistent results and any confusion this has caused. > > > > Did you also backport commit 05356938a4be ("bcache: call force_wake_up_gc() > if necessary in check_should_bypass()") ? Last time when you pushed me to > add commit a14a68b76954 into mainline kernel, I tested a regression from this > patch and fixed it. 
Please add this fix if you didn't, otherwise the testing > might not be completed. > > > > With patch [1] back-patched and after a 24-hour re-test, the fio results clearly > > demonstrate that this patch effectively reduces front IO latency during GC due > > to the smaller incremental GC cycles, while the GC duration increase is still > > well within bounds. > > > > From the performance result in [2], it seems the max latency are reduced, > but higher latency period are longer. I am not sure whether this is a happy > result. > > Can I have a download link for the whole log? Then I can look at the > performance numbers more close. > > > Here's a summary of the improved latency: > > > > Before: > > > > Median latency (P50): 210 ms > > Max latency (P100): 3.5 sec > > > > btree_gc_average_duration_ms:381138 > > btree_gc_average_frequency_sec:3834 > > btree_gc_last_sec:60668 > > btree_gc_max_duration_ms:825228 > > bset_tree_stats: > > btree nodes: 144330 > > written sets: 283733 > > unwritten sets: 144329 > > written key bytes: 24993783392 > > unwritten key bytes: 11777400 > > floats: 30936844345385 > > failed: 5776 > > > > After: > > > > Median latency (P50): 25 ms > > Max latency (P100): 0.8 sec > > > > btree_gc_average_duration_ms:622274 > > btree_gc_average_frequency_sec:3518 > > btree_gc_last_sec:8931 > > btree_gc_max_duration_ms:953146 > > bset_tree_stats: > > btree nodes: 175491 > > written sets: 339078 > > unwritten sets: 175488 > > written key bytes: 29821314856 > > unwritten key bytes: 14076504 > > floats: 90520963280544 > > failed: 6462 > > > > The complete latency data is available at [2]. > > > > I will be glad to run further tests to solidify these findings for the inclusion > > of this patch in the coming merge window. Let me know if you'd like me to > > conduct any specific tests. > > Yes, more testing are necessary, from 512 Bytes block size to 1 MiB or > 8MiB block size. 
We need to make sure it won't introduce performance > regression in other workload or circumstances. > > I don't have plan to submit this patch in this merge window, and please don't > push me. For performance improvement change, I prefer the defalt > configuration will cover most of work loads, so more testing and perforamce > data are desired. E.g. the patch you mentioned (commit a14a68b76954 "bcache: > allow allocator to invalidate bucket in gc"), it had been deployed in Easy > Stack product environment for 20+ months before it got merged. > > Thanks. > > > > > [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a14a68b76954e73031ca6399abace17dcb77c17a > > [2[ https://gist.github.com/robert-pang/cc7c88f356293ea6d43103e6e5f9180f > > [snipped] > > -- > Coly Li ^ permalink raw reply [flat|nested] 27+ messages in thread
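For context on why a min heap conversion can slow a hot path: the generic min_heap API dispatches each comparison and swap through function pointers, while an open-coded heap compares bucket priorities inline. The sketch below is a simplified userspace illustration of LRU-style candidate selection with an inline comparison; the bucket layout and priority field are assumptions, and this is not bcache code:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative sketch, not bcache code: conceptually, selecting replacement
 * candidates builds a heap over the bucket array keyed by priority and pops
 * the best candidates to invalidate.  Here the comparison is a direct
 * inline expression; the generic min_heap API instead calls a comparator
 * through a function pointer on every comparison, which is the suspected
 * source of overhead in the reported regression. */
struct bucket { unsigned int prio; };   /* lower prio = better candidate */

static void sift_down(struct bucket *h, size_t n, size_t i)
{
    for (;;) {
        size_t l = 2 * i + 1, r = l + 1, best = i;

        if (l < n && h[l].prio < h[best].prio) best = l;
        if (r < n && h[r].prio < h[best].prio) best = r;
        if (best == i)
            return;
        struct bucket tmp = h[i];
        h[i] = h[best];
        h[best] = tmp;
        i = best;
    }
}

static void heapify(struct bucket *h, size_t n)
{
    for (size_t i = n / 2; i-- > 0;)
        sift_down(h, n, i);
}

/* Pop the best (lowest-prio) bucket; returns its prio. */
static unsigned int heap_pop(struct bucket *h, size_t *n)
{
    unsigned int best = h[0].prio;

    h[0] = h[--*n];
    sift_down(h, *n, 0);
    return best;
}
```

The algorithm is identical either way; only the per-comparison call overhead differs, which matters because invalidate_buckets_lru() and GC run these comparisons millions of times on a large cache.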
* Re: [PATCH 0/1] bcache: reduce front IO latency during GC 2025-05-15 0:58 ` Robert Pang @ 2025-05-15 1:04 ` Robert Pang 2025-05-15 7:12 ` Kuan-Wei Chiu 2025-07-03 15:28 ` Robert Pang 2 siblings, 0 replies; 27+ messages in thread From: Robert Pang @ 2025-05-15 1:04 UTC (permalink / raw) To: Coly Li; +Cc: Coly Li, Kent Overstreet, linux-bcache, Mingzhe Zou, Kuan-Wei Chiu Here is the link to the changes to bcache and invalidate_buckets_lru() specifically. https://lore.kernel.org/linux-bcache/20240524152958.919343-16-visitorckw@gmail.com/T/#u On Wed, May 14, 2025 at 5:58 PM Robert Pang <robertpang@google.com> wrote: > > Hi Coly, > > My apologies for the delay in providing this update; comprehensive testing > takes some time to complete. > > As you suggested, I conducted extensive tests for 24 hours against the > latest 6.14.5 Linux kernel, exploring more configurations to get a complete > picture: > > 1. 4KB block size with writethrough mode > 2. 4KB block size with writeback mode (70% dirty) > 3. 1MB block size with writethrough mode > > The detailed results, available at [1], consistently demonstrate that our patch > is effective in significantly reducing latency during garbage collection. This > holds true for both the default writethrough mode and the 70% writeback mode. > As anticipated, with 1MB block sizes, we observed no difference in latency > because the number of btree nodes is much smaller. > > [1] https://gist.github.com/robert-pang/817fa7c11ece99d25aabc0467a9427d8 > > However, during these tests, we've uncovered a new and distinct latency problem > that appears to be introduced in the recent Linux kernel. This issue manifests > as frequent and periodic latency spikes that occur outside of garbage > collection. 
> Below is a snippet of the latency data illustrating this: > > time (s) median (ms) max (ms) > 60810 2.28 679.37 > 60840 2.32 2,434.24 * > 60870 2.46 2,434.24 * > 60900 2.52 2,434.24 * > 60930 2.63 566.15 > 60960 2.82 566.15 > 60990 2.82 566.15 > 61020 2.78 471.79 > 61050 2.93 2,028.54 * > 61080 3.11 2,028.54 * > 61110 3.29 2,028.54 * > 61140 3.42 679.37 > 61170 3.42 679.37 > 61200 3.41 679.37 > 61230 3.30 566.15 > 61260 2.93 1,690.45 * > 61290 2.75 1,690.45 * > 61320 2.72 1,690.45 * > 61350 2.88 1,408.71 * > 61380 5.07 1,408.71 * > 61410 107.94 1,408.71 ** > 61440 65.28 1,408.71 ** > 61470 45.41 2,028.54 ** > 61500 72.45 2,028.54 ** > 61530 55.37 2,028.54 ** > 61560 40.73 1,408.71 ** > 61590 11.48 1,690.45 ** > 61620 2.92 1,690.45 * > 61650 2.54 1,690.45 * > 61680 2.58 679.37 > 61710 2.78 679.37 > > ** garbage collection > * cache replacement > > Based on the consistent periodicity of these spikes, we deduce that they are > linked to the invalidate_buckets_lru() function during cache replacement. This > function was recently modified to use min heap operations [2]. To confirm our > hypothesis, we reverted the relevant commits and re-ran the tests. Results show > that the latency spikes completely disappeared, positively confirming that the > min heap changes introduce this regression. Furthermore, these changes also > reduce the effectiveness of our GC patch. It appears that the min heap changes > reduce heap sort speed somehow in invalidate_buckets_lr() and in GC. > > [2] https://lore.kernel.org/linux-bcache/ZxzkLJmhn3a%2F1ALQ@visitorckw-System-Product-Name/T/#m0dd24ba0c63615de465d3fec72dc73febb0f7a94 > > You may download the full test result data from these links. 
> > https://gist.github.com/robert-pang/5df1d595ee77756c0a01d6479bdf8e34#file-bcache-latency-4kb-no-patch-csv > https://gist.github.com/robert-pang/5df1d595ee77756c0a01d6479bdf8e34#file-bcache-latency-4kb-with-patch-csv > https://gist.github.com/robert-pang/bcc26a3aa90dc95a083799cf4fd48116#file-bcache-latency-4kb-wb-no-patch-csv > https://gist.github.com/robert-pang/bcc26a3aa90dc95a083799cf4fd48116#file-bcache-latency-4kb-wb-with-patch-csv > https://gist.github.com/robert-pang/7036b06b66c8de7e958cdbddcd92a3f5#file-bcache-latency-1mb-no-patch-csv > https://gist.github.com/robert-pang/7036b06b66c8de7e958cdbddcd92a3f5#file-bcache-latency-1mb-with-patch-csv > https://gist.github.com/robert-pang/40f90afdea2d2a8c3f6e22ff959eff03#file-bcache-latency-4kb-no-patch-min-heap-reverted-csv > https://gist.github.com/robert-pang/40f90afdea2d2a8c3f6e22ff959eff03#file-bcache-latency-4kb-with-patch-min-heap-reverted-csv > > Best regards > Robert Pang > > On Sat, May 3, 2025 at 10:33 AM Coly Li <colyli@kernel.org> wrote: > > > > On Thu, May 01, 2025 at 06:01:09PM +0800, Robert Pang wrote: > > > Hi Coly, > > > > > > Please disregard the test results I shared over a week ago. After digging > > > deeper into the recent latency spikes with various workloads and by > > > instrumenting the garbage collector, I realized that the earlier GC latency > > > patch, "bcache: allow allocator to invalidate bucket in gc" [1], wasn't > > > backported to the Linux 6.6 branch I tested my patch against. This omission > > > explains the much higher latency observed during the extended test because the > > > allocator was blocked for the entire GC. My sincere apologies for the > > > inconsistent results and any confusion this has caused. > > > > > > > Did you also backport commit 05356938a4be ("bcache: call force_wake_up_gc() > > if necessary in check_should_bypass()") ? 
Last time, when you pushed me to > > add commit a14a68b76954 into the mainline kernel, I found a regression from this > > patch during testing and fixed it. Please add this fix if you didn't, otherwise the testing > > might not be complete. > > > > > > > With patch [1] back-patched and after a 24-hour re-test, the fio results clearly > > > demonstrate that this patch effectively reduces front IO latency during GC due > > > to the smaller incremental GC cycles, while the GC duration increase is still > > > well within bounds. > > > > > > > From the performance result in [2], it seems the max latency is reduced, > > but the higher-latency periods are longer. I am not sure whether this is a happy > > result. > > > > Can I have a download link for the whole log? Then I can look at the > > performance numbers more closely. > > > > > Here's a summary of the improved latency: > > > > > > Before: > > > > > > Median latency (P50): 210 ms > > > Max latency (P100): 3.5 sec > > > > > > btree_gc_average_duration_ms:381138 > > > btree_gc_average_frequency_sec:3834 > > > btree_gc_last_sec:60668 > > > btree_gc_max_duration_ms:825228 > > > bset_tree_stats: > > > btree nodes: 144330 > > > written sets: 283733 > > > unwritten sets: 144329 > > > written key bytes: 24993783392 > > > unwritten key bytes: 11777400 > > > floats: 30936844345385 > > > failed: 5776 > > > > > > After: > > > > > > Median latency (P50): 25 ms > > > Max latency (P100): 0.8 sec > > > > > > btree_gc_average_duration_ms:622274 > > > btree_gc_average_frequency_sec:3518 > > > btree_gc_last_sec:8931 > > > btree_gc_max_duration_ms:953146 > > > bset_tree_stats: > > > btree nodes: 175491 > > > written sets: 339078 > > > unwritten sets: 175488 > > > written key bytes: 29821314856 > > > unwritten key bytes: 14076504 > > > floats: 90520963280544 > > > failed: 6462 > > > > > > The complete latency data is available at [2]. 
> > > > > > I will be glad to run further tests to solidify these findings for the inclusion > > > of this patch in the coming merge window. Let me know if you'd like me to > > > conduct any specific tests. > > > > Yes, more testing is necessary, from a 512-byte block size to a 1 MiB or > > 8 MiB block size. We need to make sure it won't introduce a performance > > regression in other workloads or circumstances. > > > > I don't have a plan to submit this patch in this merge window, so please don't > > push me. For a performance improvement change, I prefer that the default > > configuration cover most workloads, so more testing and performance > > data are desired. E.g. the patch you mentioned (commit a14a68b76954 "bcache: > > allow allocator to invalidate bucket in gc") had been deployed in the Easy > > Stack production environment for 20+ months before it got merged. > > > > Thanks. > > > > > > > > [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a14a68b76954e73031ca6399abace17dcb77c17a > > > [2] https://gist.github.com/robert-pang/cc7c88f356293ea6d43103e6e5f9180f > > > > [snipped] > > > > -- > > Coly Li ^ permalink raw reply [flat|nested] 27+ messages in thread
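The selection pattern at the center of this thread — invalidate_buckets_lru() collecting the most invalidation-worthy buckets in a fixed-size heap while scanning, then draining the heap in order — can be sketched in miniature. This is an illustrative, index-based sketch with made-up names (kheap, kheap_offer); bcache's real code works on struct bucket pointers through the byte-offset min_heap API:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical miniature of the bounded-heap selection idea: keep the k
 * highest "priorities" seen so far in a k-sized min-heap (weakest survivor
 * at the root), then drain them smallest-first. */

struct kheap {
	int *data;
	size_t nr, size;
};

static void sift_down_min(struct kheap *h, size_t pos)
{
	for (;;) {
		size_t l = 2 * pos + 1, r = l + 1, s = pos;
		int tmp;

		if (l < h->nr && h->data[l] < h->data[s])
			s = l;
		if (r < h->nr && h->data[r] < h->data[s])
			s = r;
		if (s == pos)
			break;
		tmp = h->data[s]; h->data[s] = h->data[pos]; h->data[pos] = tmp;
		pos = s;
	}
}

/* Offer one candidate: push while the heap has room, otherwise replace the
 * root only if the newcomer beats the current weakest survivor. */
static void kheap_offer(struct kheap *h, int prio)
{
	if (h->nr < h->size) {
		size_t i = h->nr++;

		h->data[i] = prio;
		while (i && h->data[(i - 1) / 2] > h->data[i]) {
			int tmp = h->data[i];

			h->data[i] = h->data[(i - 1) / 2];
			h->data[(i - 1) / 2] = tmp;
			i = (i - 1) / 2;
		}
	} else if (prio > h->data[0]) {
		h->data[0] = prio;
		sift_down_min(h, 0);
	}
}

static int kheap_pop_min(struct kheap *h)
{
	int min = h->data[0];

	h->data[0] = h->data[--h->nr];
	sift_down_min(h, 0);
	return min;
}
```

Feeding priorities 0 through 9 into a 3-slot heap leaves 7, 8, 9, popped smallest-first — the "track the best k, then consume in order" shape of the kernel function, without its locking or bucket bookkeeping.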
* Re: [PATCH 0/1] bcache: reduce front IO latency during GC 2025-05-15 0:58 ` Robert Pang 2025-05-15 1:04 ` Robert Pang @ 2025-05-15 7:12 ` Kuan-Wei Chiu 2025-05-16 3:58 ` Robert Pang 2025-07-03 15:28 ` Robert Pang 2 siblings, 1 reply; 27+ messages in thread From: Kuan-Wei Chiu @ 2025-05-15 7:12 UTC (permalink / raw) To: Robert Pang; +Cc: Coly Li, Coly Li, Kent Overstreet, linux-bcache, Mingzhe Zou Hi Robert, On Wed, May 14, 2025 at 05:58:01PM -0700, Robert Pang wrote: > Hi Coly, > > My apologies for the delay in providing this update; comprehensive testing > takes some time to complete. > > As you suggested, I conducted extensive tests for 24 hours against the > latest 6.14.5 Linux kernel, exploring more configurations to get a complete > picture: > > 1. 4KB block size with writethrough mode > 2. 4KB block size with writeback mode (70% dirty) > 3. 1MB block size with writethrough mode > > The detailed results, available at [1], consistently demonstrate that our patch > is effective in significantly reducing latency during garbage collection. This > holds true for both the default writethrough mode and the 70% writeback mode. > As anticipated, with 1MB block sizes, we observed no difference in latency > because the number of btree nodes is much smaller. > > [1] https://gist.github.com/robert-pang/817fa7c11ece99d25aabc0467a9427d8 > > However, during these tests, we've uncovered a new and distinct latency problem > that appears to be introduced in the recent Linux kernel. This issue manifests > as frequent and periodic latency spikes that occur outside of garbage > collection. 
> Below is a snippet of the latency data illustrating this: > > time (s) median (ms) max (ms) > 60810 2.28 679.37 > 60840 2.32 2,434.24 * > 60870 2.46 2,434.24 * > 60900 2.52 2,434.24 * > 60930 2.63 566.15 > 60960 2.82 566.15 > 60990 2.82 566.15 > 61020 2.78 471.79 > 61050 2.93 2,028.54 * > 61080 3.11 2,028.54 * > 61110 3.29 2,028.54 * > 61140 3.42 679.37 > 61170 3.42 679.37 > 61200 3.41 679.37 > 61230 3.30 566.15 > 61260 2.93 1,690.45 * > 61290 2.75 1,690.45 * > 61320 2.72 1,690.45 * > 61350 2.88 1,408.71 * > 61380 5.07 1,408.71 * > 61410 107.94 1,408.71 ** > 61440 65.28 1,408.71 ** > 61470 45.41 2,028.54 ** > 61500 72.45 2,028.54 ** > 61530 55.37 2,028.54 ** > 61560 40.73 1,408.71 ** > 61590 11.48 1,690.45 ** > 61620 2.92 1,690.45 * > 61650 2.54 1,690.45 * > 61680 2.58 679.37 > 61710 2.78 679.37 > > ** garbage collection > * cache replacement > > Based on the consistent periodicity of these spikes, we deduce that they are > linked to the invalidate_buckets_lru() function during cache replacement. This > function was recently modified to use min heap operations [2]. To confirm our > hypothesis, we reverted the relevant commits and re-ran the tests. Results show > that the latency spikes completely disappeared, positively confirming that the > min heap changes introduce this regression. Furthermore, these changes also > reduce the effectiveness of our GC patch. It appears that the min heap changes > reduce heap sort speed somehow in invalidate_buckets_lr() and in GC. Thank you for reporting this regression and for taking the time to perform such thorough testing. My current hypothesis is that the root cause may be related to the earlier change where the min heap API was converted from inline to non-inline [1]. 
As a result, the comparison function used by the min heap is now invoked via an indirect function call instead of a direct one, which introduces significant overhead - especially when CONFIG_MITIGATION_RETPOLINE is enabled, as the cost of indirect calls can be quite substantial in that case. I understand that running these tests takes considerable time, but could you try the attached diff to see if it addresses the regression? Regards, Kuan-Wei [1]: https://lore.kernel.org/lkml/20241020040200.939973-2-visitorckw@gmail.com/ diff --git a/drivers/md/bcache/alloc.c b/drivers/md/bcache/alloc.c index 8998e61efa40..862e7118f81d 100644 --- a/drivers/md/bcache/alloc.c +++ b/drivers/md/bcache/alloc.c @@ -207,15 +207,15 @@ static void invalidate_buckets_lru(struct cache *ca) if (!bch_can_invalidate_bucket(ca, b)) continue; - if (!min_heap_full(&ca->heap)) - min_heap_push(&ca->heap, &b, &bucket_max_cmp_callback, ca); + if (!min_heap_full_inline(&ca->heap)) + min_heap_push_inline(&ca->heap, &b, &bucket_max_cmp_callback, ca); else if (!new_bucket_max_cmp(&b, min_heap_peek(&ca->heap), ca)) { ca->heap.data[0] = b; - min_heap_sift_down(&ca->heap, 0, &bucket_max_cmp_callback, ca); + min_heap_sift_down_inline(&ca->heap, 0, &bucket_max_cmp_callback, ca); } } - min_heapify_all(&ca->heap, &bucket_min_cmp_callback, ca); + min_heapify_all_inline(&ca->heap, &bucket_min_cmp_callback, ca); while (!fifo_full(&ca->free_inc)) { if (!ca->heap.nr) { @@ -227,8 +227,8 @@ static void invalidate_buckets_lru(struct cache *ca) wake_up_gc(ca->set); return; } - b = min_heap_peek(&ca->heap)[0]; - min_heap_pop(&ca->heap, &bucket_min_cmp_callback, ca); + b = min_heap_peek_inline(&ca->heap)[0]; + min_heap_pop_inline(&ca->heap, &bucket_min_cmp_callback, ca); bch_invalidate_one_bucket(ca, b); } > > [2] https://lore.kernel.org/linux-bcache/ZxzkLJmhn3a%2F1ALQ@visitorckw-System-Product-Name/T/#m0dd24ba0c63615de465d3fec72dc73febb0f7a94 > > You may download the full test result data from these links. 
> > https://gist.github.com/robert-pang/5df1d595ee77756c0a01d6479bdf8e34#file-bcache-latency-4kb-no-patch-csv > https://gist.github.com/robert-pang/5df1d595ee77756c0a01d6479bdf8e34#file-bcache-latency-4kb-with-patch-csv > https://gist.github.com/robert-pang/bcc26a3aa90dc95a083799cf4fd48116#file-bcache-latency-4kb-wb-no-patch-csv > https://gist.github.com/robert-pang/bcc26a3aa90dc95a083799cf4fd48116#file-bcache-latency-4kb-wb-with-patch-csv > https://gist.github.com/robert-pang/7036b06b66c8de7e958cdbddcd92a3f5#file-bcache-latency-1mb-no-patch-csv > https://gist.github.com/robert-pang/7036b06b66c8de7e958cdbddcd92a3f5#file-bcache-latency-1mb-with-patch-csv > https://gist.github.com/robert-pang/40f90afdea2d2a8c3f6e22ff959eff03#file-bcache-latency-4kb-no-patch-min-heap-reverted-csv > https://gist.github.com/robert-pang/40f90afdea2d2a8c3f6e22ff959eff03#file-bcache-latency-4kb-with-patch-min-heap-reverted-csv > > Best regards > Robert Pang > > On Sat, May 3, 2025 at 10:33 AM Coly Li <colyli@kernel.org> wrote: > > > > On Thu, May 01, 2025 at 06:01:09PM +0800, Robert Pang wrote: > > > Hi Coly, > > > > > > Please disregard the test results I shared over a week ago. After digging > > > deeper into the recent latency spikes with various workloads and by > > > instrumenting the garbage collector, I realized that the earlier GC latency > > > patch, "bcache: allow allocator to invalidate bucket in gc" [1], wasn't > > > backported to the Linux 6.6 branch I tested my patch against. This omission > > > explains the much higher latency observed during the extended test because the > > > allocator was blocked for the entire GC. My sincere apologies for the > > > inconsistent results and any confusion this has caused. > > > > > > > Did you also backport commit 05356938a4be ("bcache: call force_wake_up_gc() > > if necessary in check_should_bypass()") ? 
Last time, when you pushed me to > > add commit a14a68b76954 into the mainline kernel, I found a regression from this > > patch during testing and fixed it. Please add this fix if you didn't, otherwise the testing > > might not be complete. > > > > > > > With patch [1] back-patched and after a 24-hour re-test, the fio results clearly > > > demonstrate that this patch effectively reduces front IO latency during GC due > > > to the smaller incremental GC cycles, while the GC duration increase is still > > > well within bounds. > > > > > > > From the performance result in [2], it seems the max latency is reduced, > > but the higher-latency periods are longer. I am not sure whether this is a happy > > result. > > > > Can I have a download link for the whole log? Then I can look at the > > performance numbers more closely. > > > > > Here's a summary of the improved latency: > > > > > > Before: > > > > > > Median latency (P50): 210 ms > > > Max latency (P100): 3.5 sec > > > > > > btree_gc_average_duration_ms:381138 > > > btree_gc_average_frequency_sec:3834 > > > btree_gc_last_sec:60668 > > > btree_gc_max_duration_ms:825228 > > > bset_tree_stats: > > > btree nodes: 144330 > > > written sets: 283733 > > > unwritten sets: 144329 > > > written key bytes: 24993783392 > > > unwritten key bytes: 11777400 > > > floats: 30936844345385 > > > failed: 5776 > > > > > > After: > > > > > > Median latency (P50): 25 ms > > > Max latency (P100): 0.8 sec > > > > > > btree_gc_average_duration_ms:622274 > > > btree_gc_average_frequency_sec:3518 > > > btree_gc_last_sec:8931 > > > btree_gc_max_duration_ms:953146 > > > bset_tree_stats: > > > btree nodes: 175491 > > > written sets: 339078 > > > unwritten sets: 175488 > > > written key bytes: 29821314856 > > > unwritten key bytes: 14076504 > > > floats: 90520963280544 > > > failed: 6462 > > > > > > The complete latency data is available at [2]. 
> > > > > > > I will be glad to run further tests to solidify these findings for the inclusion > > > of this patch in the coming merge window. Let me know if you'd like me to > > > conduct any specific tests. > > > > Yes, more testing is necessary, from a 512-byte block size to a 1 MiB or > > 8 MiB block size. We need to make sure it won't introduce a performance > > regression in other workloads or circumstances. > > > > I don't have a plan to submit this patch in this merge window, so please don't > > push me. For a performance improvement change, I prefer that the default > > configuration cover most workloads, so more testing and performance > > data are desired. E.g. the patch you mentioned (commit a14a68b76954 "bcache: > > allow allocator to invalidate bucket in gc") had been deployed in the Easy > > Stack production environment for 20+ months before it got merged. > > > > Thanks. > > > > > > > > [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a14a68b76954e73031ca6399abace17dcb77c17a > > > [2] https://gist.github.com/robert-pang/cc7c88f356293ea6d43103e6e5f9180f > > > > [snipped] > > > > -- > > Coly Li ^ permalink raw reply related [flat|nested] 27+ messages in thread
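Kuan-Wei's inline hypothesis can be sketched outside the kernel. The two functions below are behaviorally identical min-heap sift-downs; the first reaches its comparator through a function pointer, so every comparison is an indirect call (the overhead he suspects, especially with retpolines enabled), while the second is forced inline so a known comparator compiles to direct, predictable code. Names are illustrative, not the kernel's min_heap API:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Generic flavour: the comparator arrives as a function pointer, so every
 * comparison costs an indirect call. */
static void sift_down_indirect(int *data, size_t nr, size_t pos,
			       int (*less)(const int *, const int *))
{
	for (;;) {
		size_t l = 2 * pos + 1, r = l + 1, s = pos;
		int tmp;

		if (l < nr && less(&data[l], &data[s]))
			s = l;
		if (r < nr && less(&data[r], &data[s]))
			s = r;
		if (s == pos)
			break;
		tmp = data[s]; data[s] = data[pos]; data[pos] = tmp;
		pos = s;
	}
}

static int int_less(const int *a, const int *b)
{
	return *a < *b;
}

/* Inline flavour: expanding the whole walk at the call site lets the
 * compiler see the comparison and emit direct code -- the same reason the
 * *_inline min-heap helpers exist. */
static inline __attribute__((always_inline))
void sift_down_direct(int *data, size_t nr, size_t pos)
{
	for (;;) {
		size_t l = 2 * pos + 1, r = l + 1, s = pos;
		int tmp;

		if (l < nr && data[l] < data[s])
			s = l;
		if (r < nr && data[r] < data[s])
			s = r;
		if (s == pos)
			break;
		tmp = data[s]; data[s] = data[pos]; data[pos] = tmp;
		pos = s;
	}
}
```

Both produce identical heaps; only the call pattern at each comparison differs, which is precisely the cost the proposed *_inline conversion targets.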
* Re: [PATCH 0/1] bcache: reduce front IO latency during GC
2025-05-15 7:12 ` Kuan-Wei Chiu
@ 2025-05-16 3:58 ` Robert Pang
2025-05-16 16:14 ` Kuan-Wei Chiu
0 siblings, 1 reply; 27+ messages in thread
From: Robert Pang @ 2025-05-16 3:58 UTC (permalink / raw)
To: Kuan-Wei Chiu
Cc: Coly Li, Coly Li, Kent Overstreet, linux-bcache, Mingzhe Zou

Hi Kuan-Wei,

Thank you for your prompt response. I tested your suggested patch to inline
the min heap operations for 8 hours and it is still ongoing. Unfortunately,
based on the results so far, it didn't resolve the regression, suggesting
inlining isn't the issue.

After reviewing the commits in lib/min_heap.h, I noticed commit c641722
("lib min_heap: optimize number of comparisons in min_heapify()") and it
looked like a potential candidate. I reverted this commit (below) and ran
the tests. While the test is still ongoing, the results for the past 3 hours
show that the latency spikes during invalidate_buckets_lru() disappeared
after this change, indicating that this commit is likely the root cause of
the regression.

My hypothesis is that while commit c641722 was designed to reduce comparisons
with randomized input [1], it might inadvertently increase comparisons when
the input isn't as random. A scenario where this could happen is within
invalidate_buckets_lru() before the cache is fully populated. In such cases,
many buckets are unfilled, causing new_bucket_prio() to return zero, leading
to more frequent compare-equal operations with other unfilled buckets. Even
when the cache is populated, the bucket priorities fall in a range with many
duplicates. How will heap_sift() behave in such cases?
[1] https://lore.kernel.org/linux-bcache/20240121153649.2733274-6-visitorckw@gmail.com/ Best regards, Robert --- include/linux/min_heap.h | 46 +++++++++++++++++++--------------------- 1 file changed, 22 insertions(+), 24 deletions(-) diff --git a/include/linux/min_heap.h b/include/linux/min_heap.h index 1160bed6579e..2d16e6747e36 100644 --- a/include/linux/min_heap.h +++ b/include/linux/min_heap.h @@ -257,34 +257,32 @@ static __always_inline void __min_heap_sift_down_inline(min_heap_char *heap, int pos, size_t elem_size, const struct min_heap_callbacks *func, void *args) { - const unsigned long lsbit = elem_size & -elem_size; + void *left, *right, *parent, *smallest; void *data = heap->data; void (*swp)(void *lhs, void *rhs, void *args) = func->swp; - /* pre-scale counters for performance */ - size_t a = pos * elem_size; - size_t b, c, d; - size_t n = heap->nr * elem_size; - - if (!swp) - swp = select_swap_func(data, elem_size); - /* Find the sift-down path all the way to the leaves. */ - for (b = a; c = 2 * b + elem_size, (d = c + elem_size) < n;) - b = func->less(data + c, data + d, args) ? c : d; - - /* Special case for the last leaf with no sibling. */ - if (d == n) - b = c; - - /* Backtrack to the correct location. */ - while (b != a && func->less(data + a, data + b, args)) - b = parent(b, lsbit, elem_size); + for (;;) { + if (pos * 2 + 1 >= heap->nr) + break; - /* Shift the element into its correct place. 
*/ - c = b; - while (b != a) { - b = parent(b, lsbit, elem_size); - do_swap(data + b, data + c, elem_size, swp, args); + left = data + ((pos * 2 + 1) * elem_size); + parent = data + (pos * elem_size); + smallest = parent; + if (func->less(left, smallest, args)) + smallest = left; + + if (pos * 2 + 2 < heap->nr) { + right = data + ((pos * 2 + 2) * elem_size); + if (func->less(right, smallest, args)) + smallest = right; + } + if (smallest == parent) + break; + do_swap(smallest, parent, elem_size, swp, args); + if (smallest == left) + pos = (pos * 2) + 1; + else + pos = (pos * 2) + 2; } } -- On Thu, May 15, 2025 at 12:12 AM Kuan-Wei Chiu <visitorckw@gmail.com> wrote: > > Hi Robert, > > On Wed, May 14, 2025 at 05:58:01PM -0700, Robert Pang wrote: > > Hi Coly, > > > > My apologies for the delay in providing this update; comprehensive testing > > takes some time to complete. > > > > As you suggested, I conducted extensive tests for 24 hours against the > > latest 6.14.5 Linux kernel, exploring more configurations to get a complete > > picture: > > > > 1. 4KB block size with writethrough mode > > 2. 4KB block size with writeback mode (70% dirty) > > 3. 1MB block size with writethrough mode > > > > The detailed results, available at [1], consistently demonstrate that our patch > > is effective in significantly reducing latency during garbage collection. This > > holds true for both the default writethrough mode and the 70% writeback mode. > > As anticipated, with 1MB block sizes, we observed no difference in latency > > because the number of btree nodes is much smaller. > > > > [1] https://gist.github.com/robert-pang/817fa7c11ece99d25aabc0467a9427d8 > > > > However, during these tests, we've uncovered a new and distinct latency problem > > that appears to be introduced in the recent Linux kernel. This issue manifests > > as frequent and periodic latency spikes that occur outside of garbage > > collection. 
> > Below is a snippet of the latency data illustrating this: > > > > time (s) median (ms) max (ms) > > 60810 2.28 679.37 > > 60840 2.32 2,434.24 * > > 60870 2.46 2,434.24 * > > 60900 2.52 2,434.24 * > > 60930 2.63 566.15 > > 60960 2.82 566.15 > > 60990 2.82 566.15 > > 61020 2.78 471.79 > > 61050 2.93 2,028.54 * > > 61080 3.11 2,028.54 * > > 61110 3.29 2,028.54 * > > 61140 3.42 679.37 > > 61170 3.42 679.37 > > 61200 3.41 679.37 > > 61230 3.30 566.15 > > 61260 2.93 1,690.45 * > > 61290 2.75 1,690.45 * > > 61320 2.72 1,690.45 * > > 61350 2.88 1,408.71 * > > 61380 5.07 1,408.71 * > > 61410 107.94 1,408.71 ** > > 61440 65.28 1,408.71 ** > > 61470 45.41 2,028.54 ** > > 61500 72.45 2,028.54 ** > > 61530 55.37 2,028.54 ** > > 61560 40.73 1,408.71 ** > > 61590 11.48 1,690.45 ** > > 61620 2.92 1,690.45 * > > 61650 2.54 1,690.45 * > > 61680 2.58 679.37 > > 61710 2.78 679.37 > > > > ** garbage collection > > * cache replacement > > > > Based on the consistent periodicity of these spikes, we deduce that they are > > linked to the invalidate_buckets_lru() function during cache replacement. This > > function was recently modified to use min heap operations [2]. To confirm our > > hypothesis, we reverted the relevant commits and re-ran the tests. Results show > > that the latency spikes completely disappeared, positively confirming that the > > min heap changes introduce this regression. Furthermore, these changes also > > reduce the effectiveness of our GC patch. It appears that the min heap changes > > reduce heap sort speed somehow in invalidate_buckets_lr() and in GC. > > Thank you for reporting this regression and for taking the time to > perform such thorough testing. > > My current hypothesis is that the root cause may be related to the > earlier change where the min heap API was converted from inline to > non-inline [1]. 
As a result, the comparison function used by the min > heap is now invoked via an indirect function call instead of a direct > one, which introduces significant overhead - especially when > CONFIG_MITIGATION_RETPOLINE is enabled, as the cost of indirect calls > can be quite substantial in that case. > > I understand that running these tests takes considerable time, but > could you try the attached diff to see if it addresses the regression? > > Regards, > Kuan-Wei > > [1]: https://lore.kernel.org/lkml/20241020040200.939973-2-visitorckw@gmail.com/ > > diff --git a/drivers/md/bcache/alloc.c b/drivers/md/bcache/alloc.c > index 8998e61efa40..862e7118f81d 100644 > --- a/drivers/md/bcache/alloc.c > +++ b/drivers/md/bcache/alloc.c > @@ -207,15 +207,15 @@ static void invalidate_buckets_lru(struct cache *ca) > if (!bch_can_invalidate_bucket(ca, b)) > continue; > > - if (!min_heap_full(&ca->heap)) > - min_heap_push(&ca->heap, &b, &bucket_max_cmp_callback, ca); > + if (!min_heap_full_inline(&ca->heap)) > + min_heap_push_inline(&ca->heap, &b, &bucket_max_cmp_callback, ca); > else if (!new_bucket_max_cmp(&b, min_heap_peek(&ca->heap), ca)) { > ca->heap.data[0] = b; > - min_heap_sift_down(&ca->heap, 0, &bucket_max_cmp_callback, ca); > + min_heap_sift_down_inline(&ca->heap, 0, &bucket_max_cmp_callback, ca); > } > } > > - min_heapify_all(&ca->heap, &bucket_min_cmp_callback, ca); > + min_heapify_all_inline(&ca->heap, &bucket_min_cmp_callback, ca); > > while (!fifo_full(&ca->free_inc)) { > if (!ca->heap.nr) { > @@ -227,8 +227,8 @@ static void invalidate_buckets_lru(struct cache *ca) > wake_up_gc(ca->set); > return; > } > - b = min_heap_peek(&ca->heap)[0]; > - min_heap_pop(&ca->heap, &bucket_min_cmp_callback, ca); > + b = min_heap_peek_inline(&ca->heap)[0]; > + min_heap_pop_inline(&ca->heap, &bucket_min_cmp_callback, ca); > > bch_invalidate_one_bucket(ca, b); > } > > > > > [2] 
https://lore.kernel.org/linux-bcache/ZxzkLJmhn3a%2F1ALQ@visitorckw-System-Product-Name/T/#m0dd24ba0c63615de465d3fec72dc73febb0f7a94 > > > > You may download the full test result data from these links. > > > > https://gist.github.com/robert-pang/5df1d595ee77756c0a01d6479bdf8e34#file-bcache-latency-4kb-no-patch-csv > > https://gist.github.com/robert-pang/5df1d595ee77756c0a01d6479bdf8e34#file-bcache-latency-4kb-with-patch-csv > > https://gist.github.com/robert-pang/bcc26a3aa90dc95a083799cf4fd48116#file-bcache-latency-4kb-wb-no-patch-csv > > https://gist.github.com/robert-pang/bcc26a3aa90dc95a083799cf4fd48116#file-bcache-latency-4kb-wb-with-patch-csv > > https://gist.github.com/robert-pang/7036b06b66c8de7e958cdbddcd92a3f5#file-bcache-latency-1mb-no-patch-csv > > https://gist.github.com/robert-pang/7036b06b66c8de7e958cdbddcd92a3f5#file-bcache-latency-1mb-with-patch-csv > > https://gist.github.com/robert-pang/40f90afdea2d2a8c3f6e22ff959eff03#file-bcache-latency-4kb-no-patch-min-heap-reverted-csv > > https://gist.github.com/robert-pang/40f90afdea2d2a8c3f6e22ff959eff03#file-bcache-latency-4kb-with-patch-min-heap-reverted-csv > > > > Best regards > > Robert Pang > > > > On Sat, May 3, 2025 at 10:33 AM Coly Li <colyli@kernel.org> wrote: > > > > > > On Thu, May 01, 2025 at 06:01:09PM +0800, Robert Pang wrote: > > > > Hi Coly, > > > > > > > > Please disregard the test results I shared over a week ago. After digging > > > > deeper into the recent latency spikes with various workloads and by > > > > instrumenting the garbage collector, I realized that the earlier GC latency > > > > patch, "bcache: allow allocator to invalidate bucket in gc" [1], wasn't > > > > backported to the Linux 6.6 branch I tested my patch against. This omission > > > > explains the much higher latency observed during the extended test because the > > > > allocator was blocked for the entire GC. My sincere apologies for the > > > > inconsistent results and any confusion this has caused. 
> > > > > > > > > Did you also backport commit 05356938a4be ("bcache: call force_wake_up_gc() > > > if necessary in check_should_bypass()") ? Last time, when you pushed me to > > > add commit a14a68b76954 into the mainline kernel, I found a regression from this > > > patch during testing and fixed it. Please add this fix if you didn't, otherwise the testing > > > might not be complete. > > > > > > > > > With patch [1] back-patched and after a 24-hour re-test, the fio results clearly > > > > demonstrate that this patch effectively reduces front IO latency during GC due > > > > to the smaller incremental GC cycles, while the GC duration increase is still > > > > well within bounds. > > > > > > > > > From the performance result in [2], it seems the max latency is reduced, > > > but the higher-latency periods are longer. I am not sure whether this is a happy > > > result. > > > > > > Can I have a download link for the whole log? Then I can look at the > > > performance numbers more closely. > > > > > > > Here's a summary of the improved latency: > > > > > > > > Before: > > > > > > > > Median latency (P50): 210 ms > > > > Max latency (P100): 3.5 sec > > > > > > > > btree_gc_average_duration_ms:381138 > > > > btree_gc_average_frequency_sec:3834 > > > > btree_gc_last_sec:60668 > > > > btree_gc_max_duration_ms:825228 > > > > bset_tree_stats: > > > > btree nodes: 144330 > > > > written sets: 283733 > > > > unwritten sets: 144329 > > > > written key bytes: 24993783392 > > > > unwritten key bytes: 11777400 > > > > floats: 30936844345385 > > > > failed: 5776 > > > > > > > > After: > > > > > > > > Median latency (P50): 25 ms > > > > Max latency (P100): 0.8 sec > > > > > > > > btree_gc_average_duration_ms:622274 > > > > btree_gc_average_frequency_sec:3518 > > > > btree_gc_last_sec:8931 > > > > btree_gc_max_duration_ms:953146 > > > > bset_tree_stats: > > > > btree nodes: 175491 > > > > written sets: 339078 > > > > unwritten sets: 175488 > > > > written key bytes: 29821314856 > > > > unwritten key 
bytes: 14076504 > > > > floats: 90520963280544 > > > > failed: 6462 > > > > > > > > The complete latency data is available at [2]. > > > > > > > > I will be glad to run further tests to solidify these findings for the inclusion > > > > of this patch in the coming merge window. Let me know if you'd like me to > > > > conduct any specific tests. > > > > > > Yes, more testing is necessary, from a 512-byte block size to a 1 MiB or > > > 8 MiB block size. We need to make sure it won't introduce a performance > > > regression in other workloads or circumstances. > > > > > > I don't have a plan to submit this patch in this merge window, so please don't > > > push me. For a performance improvement change, I prefer that the default > > > configuration cover most workloads, so more testing and performance > > > data are desired. E.g. the patch you mentioned (commit a14a68b76954 "bcache: > > > allow allocator to invalidate bucket in gc") had been deployed in the Easy > > > Stack production environment for 20+ months before it got merged. > > > > > > Thanks. > > > > > > > > > > > [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a14a68b76954e73031ca6399abace17dcb77c17a > > > > [2] https://gist.github.com/robert-pang/cc7c88f356293ea6d43103e6e5f9180f > > > > > > [snipped] > > > > > > -- > > > Coly Li ^ permalink raw reply related [flat|nested] 27+ messages in thread
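Robert's duplicate-priority hypothesis can be checked with a small counting sketch. On an array whose keys are all equal, the classic top-down sift-down stops after its first pair of child comparisons, while the bottom-up strategy of commit c641722 walks the chosen-child path all the way to a leaf before its backtrack comparison fails. This is an index-based model with hypothetical helper names, not the kernel's byte-offset code; because every "less than" test on equal keys is false, the control flow is fully determined and no actual data is needed:

```c
#include <assert.h>
#include <stddef.h>

/* Comparisons made when sifting down the root of a min-heap of nr
 * ALL-EQUAL keys, top-down style: test the two children of the root,
 * find neither smaller, and stop immediately. */
static size_t topdown_cmps_all_equal(size_t nr)
{
	size_t cmps = 0;

	if (nr > 1)
		cmps++;		/* left child vs root: "less" is false */
	if (nr > 2)
		cmps++;		/* right child vs root: also false */
	return cmps;		/* nothing smaller than the root: done */
}

/* Same sift-down, bottom-up style (the commit c641722 strategy): choose
 * a child at every level down to a leaf, then backtrack; with equal keys
 * the backtrack test fails at once, but only after roughly log2(nr)
 * path comparisons have already been spent. */
static size_t bottomup_cmps_all_equal(size_t nr)
{
	size_t a = 0, b, c, d, cmps = 0;

	for (b = a; c = 2 * b + 1, (d = c + 1) < nr;) {
		cmps++;		/* left child vs right child */
		b = d;		/* equal keys: "less" false, take d */
	}
	if (d == nr)
		b = c;		/* last leaf with no sibling */
	if (b != a)
		cmps++;		/* backtrack test, false immediately */
	return cmps;
}
```

For 1,024 equal keys the sketch counts 2 comparisons top-down versus 10 bottom-up — the constant-versus-logarithmic gap Kuan-Wei confirms in his reply (and the bottom-up variant additionally shifts the element clear down to the leaf, swapping at every level).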
* Re: [PATCH 0/1] bcache: reduce front IO latency during GC 2025-05-16 3:58 ` Robert Pang @ 2025-05-16 16:14 ` Kuan-Wei Chiu 2025-05-17 11:02 ` Coly Li 0 siblings, 1 reply; 27+ messages in thread From: Kuan-Wei Chiu @ 2025-05-16 16:14 UTC (permalink / raw) To: Robert Pang; +Cc: Coly Li, Coly Li, Kent Overstreet, linux-bcache, Mingzhe Zou On Thu, May 15, 2025 at 08:58:44PM -0700, Robert Pang wrote: > Hi Kuan-Wei, > > Thank you for your prompt response. I tested your suggested patch to > inline the min heap operations for 8 hours and it is still ongoing. > Unfortunately, based on the results so far, it didn't resolve the > regression, suggesting inlining isn't the issue. > > After reviewing the commits in lib/min_heap.h, I noticed commit > c641722 ("lib min_heap: optimize number of comparisons in > min_heapify()") and it looked like a potential candidate. I reverted > this commit (below) and ran the tests. While the test is still > ongoing, the results for the past 3 hours show that the latency spikes > during invalidate_buckets_lru() disappeared after this change, > indicating that this commit is likely the root cause of the > regression. > > My hypothesis is that while commit c641722 was designed to reduce > comparisons with randomized input [1], it might inadvertently increase > comparisons when the input isn't as random. A scenario where this > could happen is within invalidate_buckets_lru() before the cache is > fully populated. In such cases, many buckets are unfilled, causing > new_bucket_prio() to return zero, leading to more frequent > compare-equal operations with other unfilled buckets. Even when > the cache is populated, the bucket priorities fall in a range with > many duplicates. How will heap_sift() behave in such cases? > > [1] https://lore.kernel.org/linux-bcache/20240121153649.2733274-6-visitorckw@gmail.com/ > You're very likely correct. 
In scenarios where the majority of elements in the heap are identical, the traditional top-down version of heapify finishes after just 2 comparisons. However, with the bottom-up version introduced by that commit, it ends up performing roughly 2 * log₂(n) comparisons in the same case. That said, reverting the commit would increase the number of comparisons by about 2x in cases where all elements in the heap are distinct, which was the original motivation for the change. I'm not entirely sure what the best way would be to fix this regression without negatively impacting the performance of the other use cases. Regards, Kuan-Wei > Best regards, > Robert > > --- > include/linux/min_heap.h | 46 +++++++++++++++++++--------------------- > 1 file changed, 22 insertions(+), 24 deletions(-) > > diff --git a/include/linux/min_heap.h b/include/linux/min_heap.h > index 1160bed6579e..2d16e6747e36 100644 > --- a/include/linux/min_heap.h > +++ b/include/linux/min_heap.h > @@ -257,34 +257,32 @@ static __always_inline > void __min_heap_sift_down_inline(min_heap_char *heap, int pos, size_t > elem_size, > const struct min_heap_callbacks *func, void *args) > { > - const unsigned long lsbit = elem_size & -elem_size; > + void *left, *right, *parent, *smallest; > void *data = heap->data; > void (*swp)(void *lhs, void *rhs, void *args) = func->swp; > - /* pre-scale counters for performance */ > - size_t a = pos * elem_size; > - size_t b, c, d; > - size_t n = heap->nr * elem_size; > - > - if (!swp) > - swp = select_swap_func(data, elem_size); > > - /* Find the sift-down path all the way to the leaves. */ > - for (b = a; c = 2 * b + elem_size, (d = c + elem_size) < n;) > - b = func->less(data + c, data + d, args) ? c : d; > - > - /* Special case for the last leaf with no sibling. */ > - if (d == n) > - b = c; > - > - /* Backtrack to the correct location. 
*/ > - while (b != a && func->less(data + a, data + b, args)) > - b = parent(b, lsbit, elem_size); > + for (;;) { > + if (pos * 2 + 1 >= heap->nr) > + break; > > - /* Shift the element into its correct place. */ > - c = b; > - while (b != a) { > - b = parent(b, lsbit, elem_size); > - do_swap(data + b, data + c, elem_size, swp, args); > + left = data + ((pos * 2 + 1) * elem_size); > + parent = data + (pos * elem_size); > + smallest = parent; > + if (func->less(left, smallest, args)) > + smallest = left; > + > + if (pos * 2 + 2 < heap->nr) { > + right = data + ((pos * 2 + 2) * elem_size); > + if (func->less(right, smallest, args)) > + smallest = right; > + } > + if (smallest == parent) > + break; > + do_swap(smallest, parent, elem_size, swp, args); > + if (smallest == left) > + pos = (pos * 2) + 1; > + else > + pos = (pos * 2) + 2; > } > } > > -- > > On Thu, May 15, 2025 at 12:12 AM Kuan-Wei Chiu <visitorckw@gmail.com> wrote: > > > > Hi Robert, > > > > On Wed, May 14, 2025 at 05:58:01PM -0700, Robert Pang wrote: > > > Hi Coly, > > > > > > My apologies for the delay in providing this update; comprehensive testing > > > takes some time to complete. > > > > > > As you suggested, I conducted extensive tests for 24 hours against the > > > latest 6.14.5 Linux kernel, exploring more configurations to get a complete > > > picture: > > > > > > 1. 4KB block size with writethrough mode > > > 2. 4KB block size with writeback mode (70% dirty) > > > 3. 1MB block size with writethrough mode > > > > > > The detailed results, available at [1], consistently demonstrate that our patch > > > is effective in significantly reducing latency during garbage collection. This > > > holds true for both the default writethrough mode and the 70% writeback mode. > > > As anticipated, with 1MB block sizes, we observed no difference in latency > > > because the number of btree nodes is much smaller. 
> > >
> > > [1] https://gist.github.com/robert-pang/817fa7c11ece99d25aabc0467a9427d8
> > >
> > > However, during these tests, we've uncovered a new and distinct latency problem
> > > that appears to be introduced in the recent Linux kernel. This issue manifests
> > > as frequent and periodic latency spikes that occur outside of garbage
> > > collection. Below is a snippet of the latency data illustrating this:
> > >
> > > time (s)  median (ms)  max (ms)
> > > 60810        2.28        679.37
> > > 60840        2.32      2,434.24 *
> > > 60870        2.46      2,434.24 *
> > > 60900        2.52      2,434.24 *
> > > 60930        2.63        566.15
> > > 60960        2.82        566.15
> > > 60990        2.82        566.15
> > > 61020        2.78        471.79
> > > 61050        2.93      2,028.54 *
> > > 61080        3.11      2,028.54 *
> > > 61110        3.29      2,028.54 *
> > > 61140        3.42        679.37
> > > 61170        3.42        679.37
> > > 61200        3.41        679.37
> > > 61230        3.30        566.15
> > > 61260        2.93      1,690.45 *
> > > 61290        2.75      1,690.45 *
> > > 61320        2.72      1,690.45 *
> > > 61350        2.88      1,408.71 *
> > > 61380        5.07      1,408.71 *
> > > 61410      107.94      1,408.71 **
> > > 61440       65.28      1,408.71 **
> > > 61470       45.41      2,028.54 **
> > > 61500       72.45      2,028.54 **
> > > 61530       55.37      2,028.54 **
> > > 61560       40.73      1,408.71 **
> > > 61590       11.48      1,690.45 **
> > > 61620        2.92      1,690.45 *
> > > 61650        2.54      1,690.45 *
> > > 61680        2.58        679.37
> > > 61710        2.78        679.37
> > >
> > > ** garbage collection
> > > * cache replacement
> > >
> > > Based on the consistent periodicity of these spikes, we deduce that they are
> > > linked to the invalidate_buckets_lru() function during cache replacement. This
> > > function was recently modified to use min heap operations [2]. To confirm our
> > > hypothesis, we reverted the relevant commits and re-ran the tests. Results show
> > > that the latency spikes completely disappeared, confirming that the
> > > min heap changes introduce this regression. Furthermore, these changes also
> > > reduce the effectiveness of our GC patch. It appears that the min heap changes
> > > somehow reduce heap sort speed in invalidate_buckets_lru() and in GC.
> >
> > Thank you for reporting this regression and for taking the time to
> > perform such thorough testing.
> >
> > My current hypothesis is that the root cause may be related to the
> > earlier change where the min heap API was converted from inline to
> > non-inline [1]. As a result, the comparison function used by the min
> > heap is now invoked via an indirect function call instead of a direct
> > one, which introduces significant overhead - especially when
> > CONFIG_MITIGATION_RETPOLINE is enabled, as the cost of indirect calls
> > can be quite substantial in that case.
> >
> > I understand that running these tests takes considerable time, but
> > could you try the attached diff to see if it addresses the regression?
> >
> > Regards,
> > Kuan-Wei
> >
> > [1]: https://lore.kernel.org/lkml/20241020040200.939973-2-visitorckw@gmail.com/
> >
> > diff --git a/drivers/md/bcache/alloc.c b/drivers/md/bcache/alloc.c
> > index 8998e61efa40..862e7118f81d 100644
> > --- a/drivers/md/bcache/alloc.c
> > +++ b/drivers/md/bcache/alloc.c
> > @@ -207,15 +207,15 @@ static void invalidate_buckets_lru(struct cache *ca)
> >  		if (!bch_can_invalidate_bucket(ca, b))
> >  			continue;
> >
> > -		if (!min_heap_full(&ca->heap))
> > -			min_heap_push(&ca->heap, &b, &bucket_max_cmp_callback, ca);
> > +		if (!min_heap_full_inline(&ca->heap))
> > +			min_heap_push_inline(&ca->heap, &b, &bucket_max_cmp_callback, ca);
> >  		else if (!new_bucket_max_cmp(&b, min_heap_peek(&ca->heap), ca)) {
> >  			ca->heap.data[0] = b;
> > -			min_heap_sift_down(&ca->heap, 0, &bucket_max_cmp_callback, ca);
> > +			min_heap_sift_down_inline(&ca->heap, 0, &bucket_max_cmp_callback, ca);
> >  		}
> >  	}
> >
> > -	min_heapify_all(&ca->heap, &bucket_min_cmp_callback, ca);
> > +	min_heapify_all_inline(&ca->heap, &bucket_min_cmp_callback, ca);
> >
> >  	while (!fifo_full(&ca->free_inc)) {
> >  		if (!ca->heap.nr) {
> > @@ -227,8 +227,8 @@ static void invalidate_buckets_lru(struct cache *ca)
> >  			wake_up_gc(ca->set);
> >  			return;
> >  		}
> > -		b = min_heap_peek(&ca->heap)[0];
> > -		min_heap_pop(&ca->heap, &bucket_min_cmp_callback, ca);
> > +		b = min_heap_peek_inline(&ca->heap)[0];
> > +		min_heap_pop_inline(&ca->heap, &bucket_min_cmp_callback, ca);
> >
> >  		bch_invalidate_one_bucket(ca, b);
> >  	}
> >
> > > [2] https://lore.kernel.org/linux-bcache/ZxzkLJmhn3a%2F1ALQ@visitorckw-System-Product-Name/T/#m0dd24ba0c63615de465d3fec72dc73febb0f7a94
> > >
> > > You may download the full test result data from these links.
> > >
> > > https://gist.github.com/robert-pang/5df1d595ee77756c0a01d6479bdf8e34#file-bcache-latency-4kb-no-patch-csv
> > > https://gist.github.com/robert-pang/5df1d595ee77756c0a01d6479bdf8e34#file-bcache-latency-4kb-with-patch-csv
> > > https://gist.github.com/robert-pang/bcc26a3aa90dc95a083799cf4fd48116#file-bcache-latency-4kb-wb-no-patch-csv
> > > https://gist.github.com/robert-pang/bcc26a3aa90dc95a083799cf4fd48116#file-bcache-latency-4kb-wb-with-patch-csv
> > > https://gist.github.com/robert-pang/7036b06b66c8de7e958cdbddcd92a3f5#file-bcache-latency-1mb-no-patch-csv
> > > https://gist.github.com/robert-pang/7036b06b66c8de7e958cdbddcd92a3f5#file-bcache-latency-1mb-with-patch-csv
> > > https://gist.github.com/robert-pang/40f90afdea2d2a8c3f6e22ff959eff03#file-bcache-latency-4kb-no-patch-min-heap-reverted-csv
> > > https://gist.github.com/robert-pang/40f90afdea2d2a8c3f6e22ff959eff03#file-bcache-latency-4kb-with-patch-min-heap-reverted-csv
> > >
> > > Best regards,
> > > Robert Pang
> > >
> > > On Sat, May 3, 2025 at 10:33 AM Coly Li <colyli@kernel.org> wrote:
> > > >
> > > > On Thu, May 01, 2025 at 06:01:09PM +0800, Robert Pang wrote:
> > > > > Hi Coly,
> > > > >
> > > > > Please disregard the test results I shared over a week ago.
> > > > > After digging deeper into the recent latency spikes with various workloads and by
> > > > > instrumenting the garbage collector, I realized that the earlier GC latency
> > > > > patch, "bcache: allow allocator to invalidate bucket in gc" [1], wasn't
> > > > > backported to the Linux 6.6 branch I tested my patch against. This omission
> > > > > explains the much higher latency observed during the extended test, because the
> > > > > allocator was blocked for the entire GC. My sincere apologies for the
> > > > > inconsistent results and any confusion this has caused.
> > > > >
> > > >
> > > > Did you also backport commit 05356938a4be ("bcache: call force_wake_up_gc()
> > > > if necessary in check_should_bypass()")? Last time, when you pushed me to
> > > > add commit a14a68b76954 into the mainline kernel, I found a regression from that
> > > > patch and fixed it. Please add this fix if you didn't; otherwise the testing
> > > > might not be complete.
> > > >
> > > > > With patch [1] back-patched and after a 24-hour re-test, the fio results clearly
> > > > > demonstrate that this patch effectively reduces front IO latency during GC due
> > > > > to the smaller incremental GC cycles, while the GC duration increase is still
> > > > > well within bounds.
> > > > >
> > > >
> > > > From the performance result in [2], it seems the max latency is reduced,
> > > > but the higher-latency periods are longer. I am not sure whether this is a happy
> > > > result.
> > > >
> > > > Can I have a download link for the whole log? Then I can look at the
> > > > performance numbers more closely.
> > > >
> > > > > Here's a summary of the improved latency:
> > > > >
> > > > > Before:
> > > > >
> > > > > Median latency (P50): 210 ms
> > > > > Max latency (P100): 3.5 sec
> > > > >
> > > > > btree_gc_average_duration_ms:381138
> > > > > btree_gc_average_frequency_sec:3834
> > > > > btree_gc_last_sec:60668
> > > > > btree_gc_max_duration_ms:825228
> > > > > bset_tree_stats:
> > > > >   btree nodes:         144330
> > > > >   written sets:        283733
> > > > >   unwritten sets:      144329
> > > > >   written key bytes:   24993783392
> > > > >   unwritten key bytes: 11777400
> > > > >   floats:              30936844345385
> > > > >   failed:              5776
> > > > >
> > > > > After:
> > > > >
> > > > > Median latency (P50): 25 ms
> > > > > Max latency (P100): 0.8 sec
> > > > >
> > > > > btree_gc_average_duration_ms:622274
> > > > > btree_gc_average_frequency_sec:3518
> > > > > btree_gc_last_sec:8931
> > > > > btree_gc_max_duration_ms:953146
> > > > > bset_tree_stats:
> > > > >   btree nodes:         175491
> > > > >   written sets:        339078
> > > > >   unwritten sets:      175488
> > > > >   written key bytes:   29821314856
> > > > >   unwritten key bytes: 14076504
> > > > >   floats:              90520963280544
> > > > >   failed:              6462
> > > > >
> > > > > The complete latency data is available at [2].
> > > > >
> > > > > I will be glad to run further tests to solidify these findings for the inclusion
> > > > > of this patch in the coming merge window. Let me know if you'd like me to
> > > > > conduct any specific tests.
> > > >
> > > > Yes, more testing is necessary, from 512-byte block size to 1 MiB or
> > > > 8 MiB block size. We need to make sure it won't introduce performance
> > > > regression in other workloads or circumstances.
> > > >
> > > > I don't have a plan to submit this patch in this merge window, so please don't
> > > > push me. For a performance-improvement change, I prefer that the default
> > > > configuration cover most workloads, so more testing and performance
> > > > data are desired. E.g. the patch you mentioned (commit a14a68b76954 "bcache:
> > > > allow allocator to invalidate bucket in gc") had been deployed in the Easy
> > > > Stack product environment for 20+ months before it got merged.
> > > >
> > > > Thanks.
> > > >
> > > > >
> > > > > [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a14a68b76954e73031ca6399abace17dcb77c17a
> > > > > [2] https://gist.github.com/robert-pang/cc7c88f356293ea6d43103e6e5f9180f
> > > >
> > > > [snipped]
> > > >
> > > > --
> > > > Coly Li

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: [PATCH 0/1] bcache: reduce front IO latency during GC
  2025-05-16 16:14         ` Kuan-Wei Chiu
@ 2025-05-17 11:02           ` Coly Li
  2025-05-20 11:51             ` Kuan-Wei Chiu
  0 siblings, 1 reply; 27+ messages in thread
From: Coly Li @ 2025-05-17 11:02 UTC (permalink / raw)
To: Kuan-Wei Chiu
Cc: Robert Pang, Coly Li, Kent Overstreet, linux-bcache, Mingzhe Zou

> On May 17, 2025, at 00:14, Kuan-Wei Chiu <visitorckw@gmail.com> wrote:
>
> On Thu, May 15, 2025 at 08:58:44PM -0700, Robert Pang wrote:
>> Hi Kuan-Wei,
>>
>> Thank you for your prompt response. I tested your suggested patch to
>> inline the min heap operations for 8 hours and it is still ongoing.
>> Unfortunately, based on the results so far, it didn't resolve the
>> regression, suggesting inlining isn't the issue.
>>
>> After reviewing the commits in lib/min_heap.h, I noticed commit
>> c641722 ("lib min_heap: optimize number of comparisons in
>> min_heapify()") and it looked like a potential candidate. I reverted
>> this commit (below) and ran the tests. While the test is still
>> ongoing, the results for the past 3 hours show that the latency spikes
>> during invalidate_buckets_lru() disappeared after this change,
>> indicating that this commit is likely the root cause of the
>> regression.
>>
>> My hypothesis is that while commit c641722 was designed to reduce
>> comparisons with randomized input [1], it might inadvertently increase
>> comparisons when the input isn't as random. A scenario where this
>> could happen is within invalidate_buckets_lru() before the cache is
>> fully populated. In such cases, many buckets are unfilled, causing
>> new_bucket_prio() to return zero, leading to more frequent
>> compare-equal operations with other unfilled buckets. In the case when
>> the cache is populated, the bucket priorities fall in a range with
>> many duplicates. How will heap_sift() behave in such cases?
>>
>> [1] https://lore.kernel.org/linux-bcache/20240121153649.2733274-6-visitorckw@gmail.com/
>>
>
> You're very likely correct.
>
> In scenarios where the majority of elements in the heap are identical,
> the traditional top-down version of heapify finishes after just 2
> comparisons. However, with the bottom-up version introduced by that
> commit, it ends up performing roughly 2 * log₂(n) comparisons in the
> same case.

In the ideal bcache scenario, for best performance the cached data and
subsequent requests should have spatial or temporal locality.

I guess this means that for the heap usage, the input might not be
typically random.

> That said, reverting the commit would increase the number of
> comparisons by about 2x in cases where all elements in the heap are
> distinct, which was the original motivation for the change. I'm not
> entirely sure what the best way would be to fix this regression without
> negatively impacting the performance of the other use cases.

If the data read pattern is fully sequential or fully random, bcache
cannot help much.

So I guess maybe we still need the old heapify code? The new version is
for fully random input, and the previous version for input that is not
that random.

Thanks.
Coly Li

[snipped]

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: [PATCH 0/1] bcache: reduce front IO latency during GC
  2025-05-17 11:02           ` Coly Li
@ 2025-05-20 11:51             ` Kuan-Wei Chiu
  2025-05-20 12:13               ` Coly Li
  0 siblings, 1 reply; 27+ messages in thread
From: Kuan-Wei Chiu @ 2025-05-20 11:51 UTC (permalink / raw)
To: Coly Li
Cc: Robert Pang, Coly Li, Kent Overstreet, linux-bcache, Mingzhe Zou

On Sat, May 17, 2025 at 07:02:06PM +0800, Coly Li wrote:
>
> > On May 17, 2025, at 00:14, Kuan-Wei Chiu <visitorckw@gmail.com> wrote:
> >
> > On Thu, May 15, 2025 at 08:58:44PM -0700, Robert Pang wrote:
> >> [snipped]
> >
> > You're very likely correct.
> >
> > In scenarios where the majority of elements in the heap are identical,
> > the traditional top-down version of heapify finishes after just 2
> > comparisons. However, with the bottom-up version introduced by that
> > commit, it ends up performing roughly 2 * log₂(n) comparisons in the
> > same case.
>
> In the ideal bcache scenario, for best performance the cached data and
> subsequent requests should have spatial or temporal locality.
>
> I guess this means that for the heap usage, the input might not be
> typically random.
>
> > That said, reverting the commit would increase the number of
> > comparisons by about 2x in cases where all elements in the heap are
> > distinct, which was the original motivation for the change. I'm not
> > entirely sure what the best way would be to fix this regression without
> > negatively impacting the performance of the other use cases.
>
> If the data read pattern is fully sequential or fully random, bcache
> cannot help much.
>
> So I guess maybe we still need the old heapify code? The new version is
> for fully random input, and the previous version for input that is not
> that random.

I think we have two options here. One is to add a classic heapify
function to min_heap.h, allowing users to choose based on whether they
expect many duplicate elements in the heap. While having two heapify
variants might be confusing from a library design perspective, we could
mitigate that with clear kernel-doc comments.

The other option is to revert to the old bcache heap code. I'm not sure
which approach is better.

Regards,
Kuan-Wei

> Thanks.
> > Coly Li > > > > > > Regards, > > Kuan-Wei > > > >> Best regards, > >> Robert > >> > >> --- > >> include/linux/min_heap.h | 46 +++++++++++++++++++--------------------- > >> 1 file changed, 22 insertions(+), 24 deletions(-) > >> > >> diff --git a/include/linux/min_heap.h b/include/linux/min_heap.h > >> index 1160bed6579e..2d16e6747e36 100644 > >> --- a/include/linux/min_heap.h > >> +++ b/include/linux/min_heap.h > >> @@ -257,34 +257,32 @@ static __always_inline > >> void __min_heap_sift_down_inline(min_heap_char *heap, int pos, size_t > >> elem_size, > >> const struct min_heap_callbacks *func, void *args) > >> { > >> - const unsigned long lsbit = elem_size & -elem_size; > >> + void *left, *right, *parent, *smallest; > >> void *data = heap->data; > >> void (*swp)(void *lhs, void *rhs, void *args) = func->swp; > >> - /* pre-scale counters for performance */ > >> - size_t a = pos * elem_size; > >> - size_t b, c, d; > >> - size_t n = heap->nr * elem_size; > >> - > >> - if (!swp) > >> - swp = select_swap_func(data, elem_size); > >> > >> - /* Find the sift-down path all the way to the leaves. */ > >> - for (b = a; c = 2 * b + elem_size, (d = c + elem_size) < n;) > >> - b = func->less(data + c, data + d, args) ? c : d; > >> - > >> - /* Special case for the last leaf with no sibling. */ > >> - if (d == n) > >> - b = c; > >> - > >> - /* Backtrack to the correct location. */ > >> - while (b != a && func->less(data + a, data + b, args)) > >> - b = parent(b, lsbit, elem_size); > >> + for (;;) { > >> + if (pos * 2 + 1 >= heap->nr) > >> + break; > >> > >> - /* Shift the element into its correct place. 
*/ > >> - c = b; > >> - while (b != a) { > >> - b = parent(b, lsbit, elem_size); > >> - do_swap(data + b, data + c, elem_size, swp, args); > >> + left = data + ((pos * 2 + 1) * elem_size); > >> + parent = data + (pos * elem_size); > >> + smallest = parent; > >> + if (func->less(left, smallest, args)) > >> + smallest = left; > >> + > >> + if (pos * 2 + 2 < heap->nr) { > >> + right = data + ((pos * 2 + 2) * elem_size); > >> + if (func->less(right, smallest, args)) > >> + smallest = right; > >> + } > >> + if (smallest == parent) > >> + break; > >> + do_swap(smallest, parent, elem_size, swp, args); > >> + if (smallest == left) > >> + pos = (pos * 2) + 1; > >> + else > >> + pos = (pos * 2) + 2; > >> } > >> } > >> > >> -- > >> > >> On Thu, May 15, 2025 at 12:12 AM Kuan-Wei Chiu <visitorckw@gmail.com> wrote: > >>> > >>> Hi Robert, > >>> > >>> On Wed, May 14, 2025 at 05:58:01PM -0700, Robert Pang wrote: > >>>> Hi Coly, > >>>> > >>>> My apologies for the delay in providing this update; comprehensive testing > >>>> takes some time to complete. > >>>> > >>>> As you suggested, I conducted extensive tests for 24 hours against the > >>>> latest 6.14.5 Linux kernel, exploring more configurations to get a complete > >>>> picture: > >>>> > >>>> 1. 4KB block size with writethrough mode > >>>> 2. 4KB block size with writeback mode (70% dirty) > >>>> 3. 1MB block size with writethrough mode > >>>> > >>>> The detailed results, available at [1], consistently demonstrate that our patch > >>>> is effective in significantly reducing latency during garbage collection. This > >>>> holds true for both the default writethrough mode and the 70% writeback mode. > >>>> As anticipated, with 1MB block sizes, we observed no difference in latency > >>>> because the number of btree nodes is much smaller. 
> >>>> > >>>> [1] https://gist.github.com/robert-pang/817fa7c11ece99d25aabc0467a9427d8 > >>>> > >>>> However, during these tests, we've uncovered a new and distinct latency problem > >>>> that appears to be introduced in the recent Linux kernel. This issue manifests > >>>> as frequent and periodic latency spikes that occur outside of garbage > >>>> collection. > >>>> Below is a snippet of the latency data illustrating this: > >>>> > >>>> time (s) median (ms) max (ms) > >>>> 60810 2.28 679.37 > >>>> 60840 2.32 2,434.24 * > >>>> 60870 2.46 2,434.24 * > >>>> 60900 2.52 2,434.24 * > >>>> 60930 2.63 566.15 > >>>> 60960 2.82 566.15 > >>>> 60990 2.82 566.15 > >>>> 61020 2.78 471.79 > >>>> 61050 2.93 2,028.54 * > >>>> 61080 3.11 2,028.54 * > >>>> 61110 3.29 2,028.54 * > >>>> 61140 3.42 679.37 > >>>> 61170 3.42 679.37 > >>>> 61200 3.41 679.37 > >>>> 61230 3.30 566.15 > >>>> 61260 2.93 1,690.45 * > >>>> 61290 2.75 1,690.45 * > >>>> 61320 2.72 1,690.45 * > >>>> 61350 2.88 1,408.71 * > >>>> 61380 5.07 1,408.71 * > >>>> 61410 107.94 1,408.71 ** > >>>> 61440 65.28 1,408.71 ** > >>>> 61470 45.41 2,028.54 ** > >>>> 61500 72.45 2,028.54 ** > >>>> 61530 55.37 2,028.54 ** > >>>> 61560 40.73 1,408.71 ** > >>>> 61590 11.48 1,690.45 ** > >>>> 61620 2.92 1,690.45 * > >>>> 61650 2.54 1,690.45 * > >>>> 61680 2.58 679.37 > >>>> 61710 2.78 679.37 > >>>> > >>>> ** garbage collection > >>>> * cache replacement > >>>> > >>>> Based on the consistent periodicity of these spikes, we deduce that they are > >>>> linked to the invalidate_buckets_lru() function during cache replacement. This > >>>> function was recently modified to use min heap operations [2]. To confirm our > >>>> hypothesis, we reverted the relevant commits and re-ran the tests. Results show > >>>> that the latency spikes completely disappeared, positively confirming that the > >>>> min heap changes introduce this regression. Furthermore, these changes also > >>>> reduce the effectiveness of our GC patch. 
It appears that the min heap changes > >>>> reduce heap sort speed somehow in invalidate_buckets_lru() and in GC. > >>> > >>> Thank you for reporting this regression and for taking the time to > >>> perform such thorough testing. > >>> > >>> My current hypothesis is that the root cause may be related to the > >>> earlier change where the min heap API was converted from inline to > >>> non-inline [1]. As a result, the comparison function used by the min > >>> heap is now invoked via an indirect function call instead of a direct > >>> one, which introduces significant overhead - especially when > >>> CONFIG_MITIGATION_RETPOLINE is enabled, as the cost of indirect calls > >>> can be quite substantial in that case. > >>> > >>> I understand that running these tests takes considerable time, but > >>> could you try the attached diff to see if it addresses the regression? > >>> > >>> Regards, > >>> Kuan-Wei > >>> > >>> [1]: https://lore.kernel.org/lkml/20241020040200.939973-2-visitorckw@gmail.com/ > >>> > >>> diff --git a/drivers/md/bcache/alloc.c b/drivers/md/bcache/alloc.c > >>> index 8998e61efa40..862e7118f81d 100644 > >>> --- a/drivers/md/bcache/alloc.c > >>> +++ b/drivers/md/bcache/alloc.c > >>> @@ -207,15 +207,15 @@ static void invalidate_buckets_lru(struct cache *ca) > >>> if (!bch_can_invalidate_bucket(ca, b)) > >>> continue; > >>> > >>> - if (!min_heap_full(&ca->heap)) > >>> - min_heap_push(&ca->heap, &b, &bucket_max_cmp_callback, ca); > >>> + if (!min_heap_full_inline(&ca->heap)) > >>> + min_heap_push_inline(&ca->heap, &b, &bucket_max_cmp_callback, ca); > >>> else if (!new_bucket_max_cmp(&b, min_heap_peek(&ca->heap), ca)) { > >>> ca->heap.data[0] = b; > >>> - min_heap_sift_down(&ca->heap, 0, &bucket_max_cmp_callback, ca); > >>> + min_heap_sift_down_inline(&ca->heap, 0, &bucket_max_cmp_callback, ca); > >>> } > >>> } > >>> > >>> - min_heapify_all(&ca->heap, &bucket_min_cmp_callback, ca); > >>> + min_heapify_all_inline(&ca->heap, &bucket_min_cmp_callback, ca); > >>> 
> >>> while (!fifo_full(&ca->free_inc)) { > >>> if (!ca->heap.nr) { > >>> @@ -227,8 +227,8 @@ static void invalidate_buckets_lru(struct cache *ca) > >>> wake_up_gc(ca->set); > >>> return; > >>> } > >>> - b = min_heap_peek(&ca->heap)[0]; > >>> - min_heap_pop(&ca->heap, &bucket_min_cmp_callback, ca); > >>> + b = min_heap_peek_inline(&ca->heap)[0]; > >>> + min_heap_pop_inline(&ca->heap, &bucket_min_cmp_callback, ca); > >>> > >>> bch_invalidate_one_bucket(ca, b); > >>> } > >>> > >>>> > >>>> [2] https://lore.kernel.org/linux-bcache/ZxzkLJmhn3a%2F1ALQ@visitorckw-System-Product-Name/T/#m0dd24ba0c63615de465d3fec72dc73febb0f7a94 > >>>> > >>>> You may download the full test result data from these links. > >>>> > >>>> https://gist.github.com/robert-pang/5df1d595ee77756c0a01d6479bdf8e34#file-bcache-latency-4kb-no-patch-csv > >>>> https://gist.github.com/robert-pang/5df1d595ee77756c0a01d6479bdf8e34#file-bcache-latency-4kb-with-patch-csv > >>>> https://gist.github.com/robert-pang/bcc26a3aa90dc95a083799cf4fd48116#file-bcache-latency-4kb-wb-no-patch-csv > >>>> https://gist.github.com/robert-pang/bcc26a3aa90dc95a083799cf4fd48116#file-bcache-latency-4kb-wb-with-patch-csv > >>>> https://gist.github.com/robert-pang/7036b06b66c8de7e958cdbddcd92a3f5#file-bcache-latency-1mb-no-patch-csv > >>>> https://gist.github.com/robert-pang/7036b06b66c8de7e958cdbddcd92a3f5#file-bcache-latency-1mb-with-patch-csv > >>>> https://gist.github.com/robert-pang/40f90afdea2d2a8c3f6e22ff959eff03#file-bcache-latency-4kb-no-patch-min-heap-reverted-csv > >>>> https://gist.github.com/robert-pang/40f90afdea2d2a8c3f6e22ff959eff03#file-bcache-latency-4kb-with-patch-min-heap-reverted-csv > >>>> > >>>> Best regards > >>>> Robert Pang > >>>> > >>>> On Sat, May 3, 2025 at 10:33 AM Coly Li <colyli@kernel.org> wrote: > >>>>> > >>>>> On Thu, May 01, 2025 at 06:01:09PM +0800, Robert Pang wrote: > >>>>>> Hi Coly, > >>>>>> > >>>>>> Please disregard the test results I shared over a week ago. 
After digging > >>>>>> deeper into the recent latency spikes with various workloads and by > >>>>>> instrumenting the garbage collector, I realized that the earlier GC latency > >>>>>> patch, "bcache: allow allocator to invalidate bucket in gc" [1], wasn't > >>>>>> backported to the Linux 6.6 branch I tested my patch against. This omission > >>>>>> explains the much higher latency observed during the extended test because the > >>>>>> allocator was blocked for the entire GC. My sincere apologies for the > >>>>>> inconsistent results and any confusion this has caused. > >>>>>> > >>>>> > >>>>> Did you also backport commit 05356938a4be ("bcache: call force_wake_up_gc() > >>>>> if necessary in check_should_bypass()")? Last time, when you pushed me to > >>>>> add commit a14a68b76954 into the mainline kernel, I tested a regression from this > >>>>> patch and fixed it. Please add this fix if you didn't; otherwise the testing > >>>>> might not be complete. > >>>>> > >>>>> > >>>>>> With patch [1] backported and after a 24-hour re-test, the fio results clearly > >>>>>> demonstrate that this patch effectively reduces front IO latency during GC due > >>>>>> to the smaller incremental GC cycles, while the GC duration increase is still > >>>>>> well within bounds. > >>>>>> > >>>>> > >>>>> From the performance results in [2], it seems the max latency is reduced, > >>>>> but the higher-latency periods are longer. I am not sure whether this is a happy > >>>>> result. > >>>>> > >>>>> Can I have a download link for the whole log? Then I can look at the > >>>>> performance numbers more closely. 
> >>>>> > >>>>>> Here's a summary of the improved latency: > >>>>>> > >>>>>> Before: > >>>>>> > >>>>>> Median latency (P50): 210 ms > >>>>>> Max latency (P100): 3.5 sec > >>>>>> > >>>>>> btree_gc_average_duration_ms:381138 > >>>>>> btree_gc_average_frequency_sec:3834 > >>>>>> btree_gc_last_sec:60668 > >>>>>> btree_gc_max_duration_ms:825228 > >>>>>> bset_tree_stats: > >>>>>> btree nodes: 144330 > >>>>>> written sets: 283733 > >>>>>> unwritten sets: 144329 > >>>>>> written key bytes: 24993783392 > >>>>>> unwritten key bytes: 11777400 > >>>>>> floats: 30936844345385 > >>>>>> failed: 5776 > >>>>>> > >>>>>> After: > >>>>>> > >>>>>> Median latency (P50): 25 ms > >>>>>> Max latency (P100): 0.8 sec > >>>>>> > >>>>>> btree_gc_average_duration_ms:622274 > >>>>>> btree_gc_average_frequency_sec:3518 > >>>>>> btree_gc_last_sec:8931 > >>>>>> btree_gc_max_duration_ms:953146 > >>>>>> bset_tree_stats: > >>>>>> btree nodes: 175491 > >>>>>> written sets: 339078 > >>>>>> unwritten sets: 175488 > >>>>>> written key bytes: 29821314856 > >>>>>> unwritten key bytes: 14076504 > >>>>>> floats: 90520963280544 > >>>>>> failed: 6462 > >>>>>> > >>>>>> The complete latency data is available at [2]. > >>>>>> > >>>>>> I will be glad to run further tests to solidify these findings for the inclusion > >>>>>> of this patch in the coming merge window. Let me know if you'd like me to > >>>>>> conduct any specific tests. > >>>>> > >>>>> Yes, more testing is necessary, from 512-byte block sizes to 1 MiB or > >>>>> 8 MiB block sizes. We need to make sure it won't introduce performance > >>>>> regressions in other workloads or circumstances. > >>>>> > >>>>> I don't plan to submit this patch in this merge window, so please don't > >>>>> push me. For a performance improvement change, I prefer that the default > >>>>> configuration covers most workloads, so more testing and performance > >>>>> data are desired. E.g. 
the patch you mentioned (commit a14a68b76954 "bcache: > >>>>> allow allocator to invalidate bucket in gc") had been deployed in the EasyStack > >>>>> product environment for 20+ months before it got merged. > >>>>> > >>>>> Thanks. > >>>>> > >>>>>> > >>>>>> [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a14a68b76954e73031ca6399abace17dcb77c17a > >>>>>> [2] https://gist.github.com/robert-pang/cc7c88f356293ea6d43103e6e5f9180f > >>>>> > >>>>> [snipped] > >>>>> > >>>>> -- > >>>>> Coly Li > > ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 0/1] bcache: reduce front IO latency during GC 2025-05-20 11:51 ` Kuan-Wei Chiu @ 2025-05-20 12:13 ` Coly Li 2025-05-20 12:26 ` Kuan-Wei Chiu 0 siblings, 1 reply; 27+ messages in thread From: Coly Li @ 2025-05-20 12:13 UTC (permalink / raw) To: Kuan-Wei Chiu, Kent Overstreet Cc: Robert Pang, Coly Li, linux-bcache, Mingzhe Zou > 2025年5月20日 19:51,Kuan-Wei Chiu <visitorckw@gmail.com> 写道: > > On Sat, May 17, 2025 at 07:02:06PM +0800, Coly Li wrote: >> >> >>> 2025年5月17日 00:14,Kuan-Wei Chiu <visitorckw@gmail.com> 写道: >>> >>> On Thu, May 15, 2025 at 08:58:44PM -0700, Robert Pang wrote: >>>> Hi Kuan-Wei, >>>> >>>> Thank you for your prompt response. I tested your suggested patch to >>>> inline the min heap operations for 8 hours and it is still ongoing. >>>> Unfortunately, basing on the results so far, it didn't resolve the >>>> regression, suggesting inlining isn't the issue. >>>> >>>> After reviewing the commits in lib/min_heap.h, I noticed commit >>>> c641722 ("lib min_heap: optimize number of comparisons in >>>> min_heapify()") and it looked like a potential candidate. I reverted >>>> this commit (below) and ran the tests. While the test is still >>>> ongoing, the results for the past 3 hours show that the latency spikes >>>> during invalidate_buckets_lru() disappeared after this change, >>>> indicating that this commit is likely the root cause of the >>>> regression. >>>> >>>> My hypothesis is that while commit c641722 was designed to reduce >>>> comparisons with randomized input [1], it might inadvertently increase >>>> comparisons when the input isn't as random. A scenario where this >>>> could happen is within invalidate_buckets_lru() before the cache is >>>> fully populated. In such cases, many buckets are unfilled, causing >>>> new_bucket_prio() to return zero, leading to more frequent >>>> compare-equal operations with other unfilled buckets. 
In the case when >>>> the cache is populated, the bucket priorities fall in a range with >>>> many duplicates. How will heap_sift() behave in such cases? >>>> >>>> [1] https://lore.kernel.org/linux-bcache/20240121153649.2733274-6-visitorckw@gmail.com/ >>>> >>> >>> You're very likely correct. >>> >>> In scenarios where the majority of elements in the heap are identical, >>> the traditional top-down version of heapify finishes after just 2 >>> comparisons. However, with the bottom-up version introduced by that >>> commit, it ends up performing roughly 2 * log₂(n) comparisons in the >>> same case. >> >> For bcache scenario for ideal circumstances and best performance, the cached data >> and following requests should have spatial or temporal locality. >> >> I guess it means for the heap usage, the input might not be typical random. >> >> >>> >>> That said, reverting the commit would increase the number of >>> comparisons by about 2x in cases where all elements in the heap are >>> distinct, which was the original motivation for the change. I'm not >>> entirely sure what the best way would be to fix this regression without >>> negatively impacting the performance of the other use cases. >> >> If the data read model are fully sequential or random, bcache cannot help too much. >> >> So I guess maybe we still need to old heapify code? The new version is for full random input, >> and previous version for not that much random input. >> > > I think we have two options here. One is to add a classic heapify > function to min_heap.h, allowing users to choose based on whether they > expect many duplicate elements in the heap. While having two heapify > variants might be confusing from a library design perspective, we could > mitigate that with clear kernel-doc comments. The other option is to > revert to the old bcache heap code. I'm not sure which approach is > better. > I prefer to have two min_heap APIs, but how to name them, this is a question from me. 
Also, if the full-random min_heap version has no user in the kernel, whether to keep it in the kernel is also a question. Kent, Could you please offer your opinion? Thanks. Coly Li ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 0/1] bcache: reduce front IO latency during GC 2025-05-20 12:13 ` Coly Li @ 2025-05-20 12:26 ` Kuan-Wei Chiu 2025-05-20 13:13 ` Coly Li 0 siblings, 1 reply; 27+ messages in thread From: Kuan-Wei Chiu @ 2025-05-20 12:26 UTC (permalink / raw) To: Coly Li; +Cc: Kent Overstreet, Robert Pang, Coly Li, linux-bcache, Mingzhe Zou On Tue, May 20, 2025 at 08:13:47PM +0800, Coly Li wrote: > > > > 2025年5月20日 19:51,Kuan-Wei Chiu <visitorckw@gmail.com> 写道: > > > > On Sat, May 17, 2025 at 07:02:06PM +0800, Coly Li wrote: > >> > >> > >>> 2025年5月17日 00:14,Kuan-Wei Chiu <visitorckw@gmail.com> 写道: > >>> > >>> On Thu, May 15, 2025 at 08:58:44PM -0700, Robert Pang wrote: > >>>> Hi Kuan-Wei, > >>>> > >>>> Thank you for your prompt response. I tested your suggested patch to > >>>> inline the min heap operations for 8 hours and it is still ongoing. > >>>> Unfortunately, basing on the results so far, it didn't resolve the > >>>> regression, suggesting inlining isn't the issue. > >>>> > >>>> After reviewing the commits in lib/min_heap.h, I noticed commit > >>>> c641722 ("lib min_heap: optimize number of comparisons in > >>>> min_heapify()") and it looked like a potential candidate. I reverted > >>>> this commit (below) and ran the tests. While the test is still > >>>> ongoing, the results for the past 3 hours show that the latency spikes > >>>> during invalidate_buckets_lru() disappeared after this change, > >>>> indicating that this commit is likely the root cause of the > >>>> regression. > >>>> > >>>> My hypothesis is that while commit c641722 was designed to reduce > >>>> comparisons with randomized input [1], it might inadvertently increase > >>>> comparisons when the input isn't as random. A scenario where this > >>>> could happen is within invalidate_buckets_lru() before the cache is > >>>> fully populated. 
In such cases, many buckets are unfilled, causing > >>>> new_bucket_prio() to return zero, leading to more frequent > >>>> compare-equal operations with other unfilled buckets. In the case when > >>>> the cache is populated, the bucket priorities fall in a range with > >>>> many duplicates. How will heap_sift() behave in such cases? > >>>> > >>>> [1] https://lore.kernel.org/linux-bcache/20240121153649.2733274-6-visitorckw@gmail.com/ > >>>> > >>> > >>> You're very likely correct. > >>> > >>> In scenarios where the majority of elements in the heap are identical, > >>> the traditional top-down version of heapify finishes after just 2 > >>> comparisons. However, with the bottom-up version introduced by that > >>> commit, it ends up performing roughly 2 * log₂(n) comparisons in the > >>> same case. > >> > >> For bcache scenario for ideal circumstances and best performance, the cached data > >> and following requests should have spatial or temporal locality. > >> > >> I guess it means for the heap usage, the input might not be typical random. > >> > >> > >>> > >>> That said, reverting the commit would increase the number of > >>> comparisons by about 2x in cases where all elements in the heap are > >>> distinct, which was the original motivation for the change. I'm not > >>> entirely sure what the best way would be to fix this regression without > >>> negatively impacting the performance of the other use cases. > >> > >> If the data read model are fully sequential or random, bcache cannot help too much. > >> > >> So I guess maybe we still need to old heapify code? The new version is for full random input, > >> and previous version for not that much random input. > >> > > > > I think we have two options here. One is to add a classic heapify > > function to min_heap.h, allowing users to choose based on whether they > > expect many duplicate elements in the heap. 
While having two heapify > > variants might be confusing from a library design perspective, we could > > mitigate that with clear kernel-doc comments. The other option is to > > revert to the old bcache heap code. I'm not sure which approach is > > better. > > > > I prefer to have two min_heap APIs, but how to name them, this is a question from me. > > Also if the full-random min_heap version has no user in kernel, whether to keep it in kernel also is a question. From the perspective of the number of comparisons in heapify, what matters more is whether the data contains many equal elements, rather than whether it's truly random. I assume that for most other kernel users, their use cases don't typically involve a large number of equal elements? Regards, Kuan-Wei > > Kent, > Could you please offer your opinion? > > Thanks. > > Coly Li > ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 0/1] bcache: reduce front IO latency during GC 2025-05-20 12:26 ` Kuan-Wei Chiu @ 2025-05-20 13:13 ` Coly Li 2025-05-21 14:40 ` Kuan-Wei Chiu 0 siblings, 1 reply; 27+ messages in thread From: Coly Li @ 2025-05-20 13:13 UTC (permalink / raw) To: Kuan-Wei Chiu Cc: Kent Overstreet, Robert Pang, Coly Li, linux-bcache, Mingzhe Zou > 2025年5月20日 20:26,Kuan-Wei Chiu <visitorckw@gmail.com> 写道: > > On Tue, May 20, 2025 at 08:13:47PM +0800, Coly Li wrote: >> >> >>> 2025年5月20日 19:51,Kuan-Wei Chiu <visitorckw@gmail.com> 写道: >>> >>> On Sat, May 17, 2025 at 07:02:06PM +0800, Coly Li wrote: >>>> >>>> >>>>> 2025年5月17日 00:14,Kuan-Wei Chiu <visitorckw@gmail.com> 写道: >>>>> >>>>> On Thu, May 15, 2025 at 08:58:44PM -0700, Robert Pang wrote: >>>>>> Hi Kuan-Wei, >>>>>> >>>>>> Thank you for your prompt response. I tested your suggested patch to >>>>>> inline the min heap operations for 8 hours and it is still ongoing. >>>>>> Unfortunately, basing on the results so far, it didn't resolve the >>>>>> regression, suggesting inlining isn't the issue. >>>>>> >>>>>> After reviewing the commits in lib/min_heap.h, I noticed commit >>>>>> c641722 ("lib min_heap: optimize number of comparisons in >>>>>> min_heapify()") and it looked like a potential candidate. I reverted >>>>>> this commit (below) and ran the tests. While the test is still >>>>>> ongoing, the results for the past 3 hours show that the latency spikes >>>>>> during invalidate_buckets_lru() disappeared after this change, >>>>>> indicating that this commit is likely the root cause of the >>>>>> regression. >>>>>> >>>>>> My hypothesis is that while commit c641722 was designed to reduce >>>>>> comparisons with randomized input [1], it might inadvertently increase >>>>>> comparisons when the input isn't as random. A scenario where this >>>>>> could happen is within invalidate_buckets_lru() before the cache is >>>>>> fully populated. 
In such cases, many buckets are unfilled, causing >>>>>> new_bucket_prio() to return zero, leading to more frequent >>>>>> compare-equal operations with other unfilled buckets. In the case when >>>>>> the cache is populated, the bucket priorities fall in a range with >>>>>> many duplicates. How will heap_sift() behave in such cases? >>>>>> >>>>>> [1] https://lore.kernel.org/linux-bcache/20240121153649.2733274-6-visitorckw@gmail.com/ >>>>>> >>>>> >>>>> You're very likely correct. >>>>> >>>>> In scenarios where the majority of elements in the heap are identical, >>>>> the traditional top-down version of heapify finishes after just 2 >>>>> comparisons. However, with the bottom-up version introduced by that >>>>> commit, it ends up performing roughly 2 * log₂(n) comparisons in the >>>>> same case. >>>> >>>> For bcache scenario for ideal circumstances and best performance, the cached data >>>> and following requests should have spatial or temporal locality. >>>> >>>> I guess it means for the heap usage, the input might not be typical random. >>>> >>>> >>>>> >>>>> That said, reverting the commit would increase the number of >>>>> comparisons by about 2x in cases where all elements in the heap are >>>>> distinct, which was the original motivation for the change. I'm not >>>>> entirely sure what the best way would be to fix this regression without >>>>> negatively impacting the performance of the other use cases. >>>> >>>> If the data read model are fully sequential or random, bcache cannot help too much. >>>> >>>> So I guess maybe we still need to old heapify code? The new version is for full random input, >>>> and previous version for not that much random input. >>>> >>> >>> I think we have two options here. One is to add a classic heapify >>> function to min_heap.h, allowing users to choose based on whether they >>> expect many duplicate elements in the heap. 
While having two heapify >>> variants might be confusing from a library design perspective, we could >>> mitigate that with clear kernel-doc comments. The other option is to >>> revert to the old bcache heap code. I'm not sure which approach is >>> better. >>> >> >> I prefer to have two min_heap APIs, but how to name them, this is a question from me. >> >> Also if the full-random min_heap version has no user in kernel, whether to keep it in kernel also is a question. > > From the perspective of the number of comparisons in heapify, what > matters more is whether the data contains many equal elements, rather > than whether it's truly random. I assume that for most other kernel > users, their use cases don't typically involve a large number of equal > elements? > Yes, you are right. Maybe dm-vdo also has similar I/O pattern? Deduplication may also have duplicated items in heap I guess. Thanks. >> >> Kent, >> Could you please offer your opinion? >> >> Thanks. >> >> Coly Li ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 0/1] bcache: reduce front IO latency during GC 2025-05-20 13:13 ` Coly Li @ 2025-05-21 14:40 ` Kuan-Wei Chiu 2025-06-06 7:39 ` Robert Pang 0 siblings, 1 reply; 27+ messages in thread From: Kuan-Wei Chiu @ 2025-05-21 14:40 UTC (permalink / raw) To: Coly Li; +Cc: Kent Overstreet, Robert Pang, Coly Li, linux-bcache, Mingzhe Zou On Tue, May 20, 2025 at 09:13:09PM +0800, Coly Li wrote: > > > > 2025年5月20日 20:26,Kuan-Wei Chiu <visitorckw@gmail.com> 写道: > > > > On Tue, May 20, 2025 at 08:13:47PM +0800, Coly Li wrote: > >> > >> > >>> 2025年5月20日 19:51,Kuan-Wei Chiu <visitorckw@gmail.com> 写道: > >>> > >>> On Sat, May 17, 2025 at 07:02:06PM +0800, Coly Li wrote: > >>>> > >>>> > >>>>> 2025年5月17日 00:14,Kuan-Wei Chiu <visitorckw@gmail.com> 写道: > >>>>> > >>>>> On Thu, May 15, 2025 at 08:58:44PM -0700, Robert Pang wrote: > >>>>>> Hi Kuan-Wei, > >>>>>> > >>>>>> Thank you for your prompt response. I tested your suggested patch to > >>>>>> inline the min heap operations for 8 hours and it is still ongoing. > >>>>>> Unfortunately, basing on the results so far, it didn't resolve the > >>>>>> regression, suggesting inlining isn't the issue. > >>>>>> > >>>>>> After reviewing the commits in lib/min_heap.h, I noticed commit > >>>>>> c641722 ("lib min_heap: optimize number of comparisons in > >>>>>> min_heapify()") and it looked like a potential candidate. I reverted > >>>>>> this commit (below) and ran the tests. While the test is still > >>>>>> ongoing, the results for the past 3 hours show that the latency spikes > >>>>>> during invalidate_buckets_lru() disappeared after this change, > >>>>>> indicating that this commit is likely the root cause of the > >>>>>> regression. > >>>>>> > >>>>>> My hypothesis is that while commit c641722 was designed to reduce > >>>>>> comparisons with randomized input [1], it might inadvertently increase > >>>>>> comparisons when the input isn't as random. 
A scenario where this > >>>>>> could happen is within invalidate_buckets_lru() before the cache is > >>>>>> fully populated. In such cases, many buckets are unfilled, causing > >>>>>> new_bucket_prio() to return zero, leading to more frequent > >>>>>> compare-equal operations with other unfilled buckets. In the case when > >>>>>> the cache is populated, the bucket priorities fall in a range with > >>>>>> many duplicates. How will heap_sift() behave in such cases? > >>>>>> > >>>>>> [1] https://lore.kernel.org/linux-bcache/20240121153649.2733274-6-visitorckw@gmail.com/ > >>>>>> > >>>>> > >>>>> You're very likely correct. > >>>>> > >>>>> In scenarios where the majority of elements in the heap are identical, > >>>>> the traditional top-down version of heapify finishes after just 2 > >>>>> comparisons. However, with the bottom-up version introduced by that > >>>>> commit, it ends up performing roughly 2 * log₂(n) comparisons in the > >>>>> same case. > >>>> > >>>> For bcache scenario for ideal circumstances and best performance, the cached data > >>>> and following requests should have spatial or temporal locality. > >>>> > >>>> I guess it means for the heap usage, the input might not be typical random. > >>>> > >>>> > >>>>> > >>>>> That said, reverting the commit would increase the number of > >>>>> comparisons by about 2x in cases where all elements in the heap are > >>>>> distinct, which was the original motivation for the change. I'm not > >>>>> entirely sure what the best way would be to fix this regression without > >>>>> negatively impacting the performance of the other use cases. > >>>> > >>>> If the data read model are fully sequential or random, bcache cannot help too much. > >>>> > >>>> So I guess maybe we still need to old heapify code? The new version is for full random input, > >>>> and previous version for not that much random input. > >>>> > >>> > >>> I think we have two options here. 
One is to add a classic heapify > >>> function to min_heap.h, allowing users to choose based on whether they > >>> expect many duplicate elements in the heap. While having two heapify > >>> variants might be confusing from a library design perspective, we could > >>> mitigate that with clear kernel-doc comments. The other option is to > >>> revert to the old bcache heap code. I'm not sure which approach is > >>> better. > >>> > >> > >> I prefer to have two min_heap APIs, but how to name them, this is a question from me. > >> > >> Also if the full-random min_heap version has no user in kernel, whether to keep it in kernel also is a question. > > > > From the perspective of the number of comparisons in heapify, what > > matters more is whether the data contains many equal elements, rather > > than whether it's truly random. I assume that for most other kernel > > users, their use cases don't typically involve a large number of equal > > elements? > > > > Yes, you are right. Maybe dm-vdo also has similar I/O pattern? > > Deduplication may also have duplicated items in heap I guess. > Thanks for pointing out this potential issue. I'll check with Matthew to confirm. Regards, Kuan-Wei > Thanks. > > > >> > >> Kent, > >> Could you please offer your opinion? > >> > >> Thanks. > >> > >> Coly Li > > ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 0/1] bcache: reduce front IO latency during GC 2025-05-21 14:40 ` Kuan-Wei Chiu @ 2025-06-06 7:39 ` Robert Pang 2025-06-06 12:37 ` Kuan-Wei Chiu 0 siblings, 1 reply; 27+ messages in thread From: Robert Pang @ 2025-06-06 7:39 UTC (permalink / raw) To: Kuan-Wei Chiu Cc: Coly Li, Kent Overstreet, Coly Li, linux-bcache, Mingzhe Zou Hi Kuan-Wei, I'm circling back on our plan to address this regression. Based on our discussions to date, it seems a separate min_heap API that uses the conventional top-down sift-down strategy is the preferred approach. I've prototyped this idea and sent out a patch series [1] to kick off the discussion on this approach. The patch has been running for over 12 hours and is looking promising. The API name reflects only my initial thoughts, and suggestions are welcome. Given that this regression was introduced in Linux 6.11 and affects versions up to 6.15 (including 6.12 LTS), a timely solution will be important. Best regards Robert Pang [1] https://lore.kernel.org/linux-bcache/20250606071959.1685079-1-robertpang@google.com/T/#t On Wed, May 21, 2025 at 7:40 AM Kuan-Wei Chiu <visitorckw@gmail.com> wrote: > > On Tue, May 20, 2025 at 09:13:09PM +0800, Coly Li wrote: > > > > > > > 2025年5月20日 20:26,Kuan-Wei Chiu <visitorckw@gmail.com> 写道: > > > > > > On Tue, May 20, 2025 at 08:13:47PM +0800, Coly Li wrote: > > >> > > >> > > >>> 2025年5月20日 19:51,Kuan-Wei Chiu <visitorckw@gmail.com> 写道: > > >>> > > >>> On Sat, May 17, 2025 at 07:02:06PM +0800, Coly Li wrote: > > >>>> > > >>>> > > >>>>> 2025年5月17日 00:14,Kuan-Wei Chiu <visitorckw@gmail.com> 写道: > > >>>>> > > >>>>> On Thu, May 15, 2025 at 08:58:44PM -0700, Robert Pang wrote: > > >>>>>> Hi Kuan-Wei, > > >>>>>> > > >>>>>> Thank you for your prompt response. I tested your suggested patch to > > >>>>>> inline the min heap operations for 8 hours and it is still ongoing. 
> > >>>>>> Unfortunately, basing on the results so far, it didn't resolve the > > >>>>>> regression, suggesting inlining isn't the issue. > > >>>>>> > > >>>>>> After reviewing the commits in lib/min_heap.h, I noticed commit > > >>>>>> c641722 ("lib min_heap: optimize number of comparisons in > > >>>>>> min_heapify()") and it looked like a potential candidate. I reverted > > >>>>>> this commit (below) and ran the tests. While the test is still > > >>>>>> ongoing, the results for the past 3 hours show that the latency spikes > > >>>>>> during invalidate_buckets_lru() disappeared after this change, > > >>>>>> indicating that this commit is likely the root cause of the > > >>>>>> regression. > > >>>>>> > > >>>>>> My hypothesis is that while commit c641722 was designed to reduce > > >>>>>> comparisons with randomized input [1], it might inadvertently increase > > >>>>>> comparisons when the input isn't as random. A scenario where this > > >>>>>> could happen is within invalidate_buckets_lru() before the cache is > > >>>>>> fully populated. In such cases, many buckets are unfilled, causing > > >>>>>> new_bucket_prio() to return zero, leading to more frequent > > >>>>>> compare-equal operations with other unfilled buckets. In the case when > > >>>>>> the cache is populated, the bucket priorities fall in a range with > > >>>>>> many duplicates. How will heap_sift() behave in such cases? > > >>>>>> > > >>>>>> [1] https://lore.kernel.org/linux-bcache/20240121153649.2733274-6-visitorckw@gmail.com/ > > >>>>>> > > >>>>> > > >>>>> You're very likely correct. > > >>>>> > > >>>>> In scenarios where the majority of elements in the heap are identical, > > >>>>> the traditional top-down version of heapify finishes after just 2 > > >>>>> comparisons. However, with the bottom-up version introduced by that > > >>>>> commit, it ends up performing roughly 2 * log₂(n) comparisons in the > > >>>>> same case. 
> > >>>> > > >>>> For bcache scenario for ideal circumstances and best performance, the cached data > > >>>> and following requests should have spatial or temporal locality. > > >>>> > > >>>> I guess it means for the heap usage, the input might not be typical random. > > >>>> > > >>>> > > >>>>> > > >>>>> That said, reverting the commit would increase the number of > > >>>>> comparisons by about 2x in cases where all elements in the heap are > > >>>>> distinct, which was the original motivation for the change. I'm not > > >>>>> entirely sure what the best way would be to fix this regression without > > >>>>> negatively impacting the performance of the other use cases. > > >>>> > > >>>> If the data read model are fully sequential or random, bcache cannot help too much. > > >>>> > > >>>> So I guess maybe we still need to old heapify code? The new version is for full random input, > > >>>> and previous version for not that much random input. > > >>>> > > >>> > > >>> I think we have two options here. One is to add a classic heapify > > >>> function to min_heap.h, allowing users to choose based on whether they > > >>> expect many duplicate elements in the heap. While having two heapify > > >>> variants might be confusing from a library design perspective, we could > > >>> mitigate that with clear kernel-doc comments. The other option is to > > >>> revert to the old bcache heap code. I'm not sure which approach is > > >>> better. > > >>> > > >> > > >> I prefer to have two min_heap APIs, but how to name them, this is a question from me. > > >> > > >> Also if the full-random min_heap version has no user in kernel, whether to keep it in kernel also is a question. > > > > > > From the perspective of the number of comparisons in heapify, what > > > matters more is whether the data contains many equal elements, rather > > > than whether it's truly random. 
I assume that for most other kernel > > > users, their use cases don't typically involve a large number of equal > > > elements? > > > > > > > Yes, you are right. Maybe dm-vdo also has similar I/O pattern? > > > > Deduplication may also have duplicated items in heap I guess. > > > > Thanks for pointing out this potential issue. > I'll check with Matthew to confirm. > > Regards, > Kuan-Wei > > > Thanks. > > > > > > >> > > >> Kent, > > >> Could you please offer your opinion? > > >> > > >> Thanks. > > >> > > >> Coly Li > > > > ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 0/1] bcache: reduce front IO latency during GC 2025-06-06 7:39 ` Robert Pang @ 2025-06-06 12:37 ` Kuan-Wei Chiu 0 siblings, 0 replies; 27+ messages in thread From: Kuan-Wei Chiu @ 2025-06-06 12:37 UTC (permalink / raw) To: Robert Pang; +Cc: Coly Li, Kent Overstreet, Coly Li, linux-bcache, Mingzhe Zou On Fri, Jun 06, 2025 at 12:39:52AM -0700, Robert Pang wrote: > Hi Kuan-Wei, > > I'm circling back on our plan to address this regression. Based on our > discussions to-date, it seems a separate min_heap API that uses the > conventional top-down sift-down strategy is the preferred approach. > > I've prototyped this idea and sent out a patch series [1] to kick off > the discussion on this approach. The patch has been running for over > 12 hours and is looking promising. The API name is only my initial > thoughts and suggestions are welcome. > > Given that this regression was introduced in Linux 6.11 and affects > versions up to 6.15 (including 6.12 LTS), a timely solution will be > important. > I'm terribly sorry for not submitting the patch to fix this regression earlier. I had something drafted, but I got sick, and the issue unfortunately slipped off my radar and I ended up forgetting about it. 
Regards, Kuan-Wei > Best regress > Robert Pang > > [1] https://lore.kernel.org/linux-bcache/20250606071959.1685079-1-robertpang@google.com/T/#t > > > On Wed, May 21, 2025 at 7:40 AM Kuan-Wei Chiu <visitorckw@gmail.com> wrote: > > > > On Tue, May 20, 2025 at 09:13:09PM +0800, Coly Li wrote: > > > > > > > > > > 2025年5月20日 20:26,Kuan-Wei Chiu <visitorckw@gmail.com> 写道: > > > > > > > > On Tue, May 20, 2025 at 08:13:47PM +0800, Coly Li wrote: > > > >> > > > >> > > > >>> 2025年5月20日 19:51,Kuan-Wei Chiu <visitorckw@gmail.com> 写道: > > > >>> > > > >>> On Sat, May 17, 2025 at 07:02:06PM +0800, Coly Li wrote: > > > >>>> > > > >>>> > > > >>>>> 2025年5月17日 00:14,Kuan-Wei Chiu <visitorckw@gmail.com> 写道: > > > >>>>> > > > >>>>> On Thu, May 15, 2025 at 08:58:44PM -0700, Robert Pang wrote: > > > >>>>>> Hi Kuan-Wei, > > > >>>>>> > > > >>>>>> Thank you for your prompt response. I tested your suggested patch to > > > >>>>>> inline the min heap operations for 8 hours and it is still ongoing. > > > >>>>>> Unfortunately, basing on the results so far, it didn't resolve the > > > >>>>>> regression, suggesting inlining isn't the issue. > > > >>>>>> > > > >>>>>> After reviewing the commits in lib/min_heap.h, I noticed commit > > > >>>>>> c641722 ("lib min_heap: optimize number of comparisons in > > > >>>>>> min_heapify()") and it looked like a potential candidate. I reverted > > > >>>>>> this commit (below) and ran the tests. While the test is still > > > >>>>>> ongoing, the results for the past 3 hours show that the latency spikes > > > >>>>>> during invalidate_buckets_lru() disappeared after this change, > > > >>>>>> indicating that this commit is likely the root cause of the > > > >>>>>> regression. > > > >>>>>> > > > >>>>>> My hypothesis is that while commit c641722 was designed to reduce > > > >>>>>> comparisons with randomized input [1], it might inadvertently increase > > > >>>>>> comparisons when the input isn't as random. 
A scenario where this > > > >>>>>> could happen is within invalidate_buckets_lru() before the cache is > > > >>>>>> fully populated. In such cases, many buckets are unfilled, causing > > > >>>>>> new_bucket_prio() to return zero, leading to more frequent > > > >>>>>> compare-equal operations with other unfilled buckets. In the case when > > > >>>>>> the cache is populated, the bucket priorities fall in a range with > > > >>>>>> many duplicates. How will heap_sift() behave in such cases? > > > >>>>>> > > > >>>>>> [1] https://lore.kernel.org/linux-bcache/20240121153649.2733274-6-visitorckw@gmail.com/ > > > >>>>>> > > > >>>>> > > > >>>>> You're very likely correct. > > > >>>>> > > > >>>>> In scenarios where the majority of elements in the heap are identical, > > > >>>>> the traditional top-down version of heapify finishes after just 2 > > > >>>>> comparisons. However, with the bottom-up version introduced by that > > > >>>>> commit, it ends up performing roughly 2 * log₂(n) comparisons in the > > > >>>>> same case. > > > >>>> > > > >>>> For bcache scenario for ideal circumstances and best performance, the cached data > > > >>>> and following requests should have spatial or temporal locality. > > > >>>> > > > >>>> I guess it means for the heap usage, the input might not be typical random. > > > >>>> > > > >>>> > > > >>>>> > > > >>>>> That said, reverting the commit would increase the number of > > > >>>>> comparisons by about 2x in cases where all elements in the heap are > > > >>>>> distinct, which was the original motivation for the change. I'm not > > > >>>>> entirely sure what the best way would be to fix this regression without > > > >>>>> negatively impacting the performance of the other use cases. > > > >>>> > > > >>>> If the data read model are fully sequential or random, bcache cannot help too much. > > > >>>> > > > >>>> So I guess maybe we still need to old heapify code? 
The new version is for full random input, > > > >>>> and previous version for not that much random input. > > > >>>> > > > >>> > > > >>> I think we have two options here. One is to add a classic heapify > > > >>> function to min_heap.h, allowing users to choose based on whether they > > > >>> expect many duplicate elements in the heap. While having two heapify > > > >>> variants might be confusing from a library design perspective, we could > > > >>> mitigate that with clear kernel-doc comments. The other option is to > > > >>> revert to the old bcache heap code. I'm not sure which approach is > > > >>> better. > > > >>> > > > >> > > > >> I prefer to have two min_heap APIs, but how to name them, this is a question from me. > > > >> > > > >> Also if the full-random min_heap version has no user in kernel, whether to keep it in kernel also is a question. > > > > > > > > From the perspective of the number of comparisons in heapify, what > > > > matters more is whether the data contains many equal elements, rather > > > > than whether it's truly random. I assume that for most other kernel > > > > users, their use cases don't typically involve a large number of equal > > > > elements? > > > > > > > > > > Yes, you are right. Maybe dm-vdo also has similar I/O pattern? > > > > > > Deduplication may also have duplicated items in heap I guess. > > > > > > > Thanks for pointing out this potential issue. > > I'll check with Matthew to confirm. > > > > Regards, > > Kuan-Wei > > > > > Thanks. > > > > > > > > > >> > > > >> Kent, > > > >> Could you please offer your opinion? > > > >> > > > >> Thanks. > > > >> > > > >> Coly Li > > > > > > ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 0/1] bcache: reduce front IO latency during GC
  2025-05-15  0:58 ` Robert Pang
  2025-05-15  1:04 ` Robert Pang
  2025-05-15  7:12 ` Kuan-Wei Chiu
@ 2025-07-03 15:28 ` Robert Pang
  [not found] ` <10F3457F-FC71-4D21-B2CA-05346B68D873@coly.li>
  2 siblings, 1 reply; 27+ messages in thread
From: Robert Pang @ 2025-07-03 15:28 UTC (permalink / raw)
To: Coly Li; +Cc: Coly Li, Kent Overstreet, linux-bcache, Mingzhe Zou, Kuan-Wei Chiu

Hi Coly,

I'm writing to provide an update on this bcache GC latency patch.

With the min heap regression addressed by the recent revert [1], I am
going to initiate a rerun of our last test set [2]. I will share the
results as soon as they are available.

To ensure this patch is fully qualified for the next merge window, I'm
eager to perform any further tests you deem necessary. What additional
test scenarios would you recommend?

Best regards
Robert Pang

[1] https://lore.kernel.org/linux-bcache/20250614202353.1632957-1-visitorckw@gmail.com/T/#m86a71f8030c26f8f366f9442e33f2a236a5a2eac
[2] https://lore.kernel.org/linux-bcache/wtfuhfntbi6yorxqtpcs4vg5w67mvyckp2a6jmxuzt2hvbw65t@gznwsae5653d/T/#me50a9ddd0386ce602b2f17415e02d33b8e29f533

On Thu, May 15, 2025 at 8:58 AM Robert Pang <robertpang@google.com> wrote:
>
> Hi Coly,
>
> My apologies for the delay in providing this update; comprehensive testing
> takes some time to complete.
>
> As you suggested, I conducted extensive tests for 24 hours against the
> latest 6.14.5 Linux kernel, exploring more configurations to get a complete
> picture:
>
> 1. 4KB block size with writethrough mode
> 2. 4KB block size with writeback mode (70% dirty)
> 3. 1MB block size with writethrough mode
>
> The detailed results, available at [1], consistently demonstrate that our patch
> is effective in significantly reducing latency during garbage collection. This
> holds true for both the default writethrough mode and the 70% writeback mode.
> As anticipated, with 1MB block sizes, we observed no difference in latency
> because the number of btree nodes is much smaller.
>
> [1] https://gist.github.com/robert-pang/817fa7c11ece99d25aabc0467a9427d8
>
> However, during these tests, we've uncovered a new and distinct latency problem
> that appears to have been introduced in a recent Linux kernel. This issue manifests
> as frequent and periodic latency spikes that occur outside of garbage
> collection.
> Below is a snippet of the latency data illustrating this:
>
> time (s)  median (ms)  max (ms)
> 60810     2.28           679.37
> 60840     2.32         2,434.24 *
> 60870     2.46         2,434.24 *
> 60900     2.52         2,434.24 *
> 60930     2.63           566.15
> 60960     2.82           566.15
> 60990     2.82           566.15
> 61020     2.78           471.79
> 61050     2.93         2,028.54 *
> 61080     3.11         2,028.54 *
> 61110     3.29         2,028.54 *
> 61140     3.42           679.37
> 61170     3.42           679.37
> 61200     3.41           679.37
> 61230     3.30           566.15
> 61260     2.93         1,690.45 *
> 61290     2.75         1,690.45 *
> 61320     2.72         1,690.45 *
> 61350     2.88         1,408.71 *
> 61380     5.07         1,408.71 *
> 61410   107.94         1,408.71 **
> 61440    65.28         1,408.71 **
> 61470    45.41         2,028.54 **
> 61500    72.45         2,028.54 **
> 61530    55.37         2,028.54 **
> 61560    40.73         1,408.71 **
> 61590    11.48         1,690.45 **
> 61620     2.92         1,690.45 *
> 61650     2.54         1,690.45 *
> 61680     2.58           679.37
> 61710     2.78           679.37
>
> ** garbage collection
> * cache replacement
>
> Based on the consistent periodicity of these spikes, we deduce that they are
> linked to the invalidate_buckets_lru() function during cache replacement. This
> function was recently modified to use min heap operations [2]. To confirm our
> hypothesis, we reverted the relevant commits and re-ran the tests. Results show
> that the latency spikes completely disappeared, confirming that the
> min heap changes introduce this regression. Furthermore, these changes also
> reduce the effectiveness of our GC patch. It appears that the min heap changes
> somehow slow down heap sorting in invalidate_buckets_lru() and in GC.
> > [2] https://lore.kernel.org/linux-bcache/ZxzkLJmhn3a%2F1ALQ@visitorckw-System-Product-Name/T/#m0dd24ba0c63615de465d3fec72dc73febb0f7a94 > > You may download the full test result data from these links. > > https://gist.github.com/robert-pang/5df1d595ee77756c0a01d6479bdf8e34#file-bcache-latency-4kb-no-patch-csv > https://gist.github.com/robert-pang/5df1d595ee77756c0a01d6479bdf8e34#file-bcache-latency-4kb-with-patch-csv > https://gist.github.com/robert-pang/bcc26a3aa90dc95a083799cf4fd48116#file-bcache-latency-4kb-wb-no-patch-csv > https://gist.github.com/robert-pang/bcc26a3aa90dc95a083799cf4fd48116#file-bcache-latency-4kb-wb-with-patch-csv > https://gist.github.com/robert-pang/7036b06b66c8de7e958cdbddcd92a3f5#file-bcache-latency-1mb-no-patch-csv > https://gist.github.com/robert-pang/7036b06b66c8de7e958cdbddcd92a3f5#file-bcache-latency-1mb-with-patch-csv > https://gist.github.com/robert-pang/40f90afdea2d2a8c3f6e22ff959eff03#file-bcache-latency-4kb-no-patch-min-heap-reverted-csv > https://gist.github.com/robert-pang/40f90afdea2d2a8c3f6e22ff959eff03#file-bcache-latency-4kb-with-patch-min-heap-reverted-csv > > Best regards > Robert Pang > > On Sat, May 3, 2025 at 10:33 AM Coly Li <colyli@kernel.org> wrote: > > > > On Thu, May 01, 2025 at 06:01:09PM +0800, Robert Pang wrote: > > > Hi Coly, > > > > > > Please disregard the test results I shared over a week ago. After digging > > > deeper into the recent latency spikes with various workloads and by > > > instrumenting the garbage collector, I realized that the earlier GC latency > > > patch, "bcache: allow allocator to invalidate bucket in gc" [1], wasn't > > > backported to the Linux 6.6 branch I tested my patch against. This omission > > > explains the much higher latency observed during the extended test because the > > > allocator was blocked for the entire GC. My sincere apologies for the > > > inconsistent results and any confusion this has caused. 
> > > > > > > Did you also backport commit 05356938a4be ("bcache: call force_wake_up_gc() > > if necessary in check_should_bypass()") ? Last time when you pushed me to > > add commit a14a68b76954 into mainline kernel, I tested a regression from this > > patch and fixed it. Please add this fix if you didn't, otherwise the testing > > might not be completed. > > > > > > > With patch [1] back-patched and after a 24-hour re-test, the fio results clearly > > > demonstrate that this patch effectively reduces front IO latency during GC due > > > to the smaller incremental GC cycles, while the GC duration increase is still > > > well within bounds. > > > > > > > From the performance result in [2], it seems the max latency are reduced, > > but higher latency period are longer. I am not sure whether this is a happy > > result. > > > > Can I have a download link for the whole log? Then I can look at the > > performance numbers more close. > > > > > Here's a summary of the improved latency: > > > > > > Before: > > > > > > Median latency (P50): 210 ms > > > Max latency (P100): 3.5 sec > > > > > > btree_gc_average_duration_ms:381138 > > > btree_gc_average_frequency_sec:3834 > > > btree_gc_last_sec:60668 > > > btree_gc_max_duration_ms:825228 > > > bset_tree_stats: > > > btree nodes: 144330 > > > written sets: 283733 > > > unwritten sets: 144329 > > > written key bytes: 24993783392 > > > unwritten key bytes: 11777400 > > > floats: 30936844345385 > > > failed: 5776 > > > > > > After: > > > > > > Median latency (P50): 25 ms > > > Max latency (P100): 0.8 sec > > > > > > btree_gc_average_duration_ms:622274 > > > btree_gc_average_frequency_sec:3518 > > > btree_gc_last_sec:8931 > > > btree_gc_max_duration_ms:953146 > > > bset_tree_stats: > > > btree nodes: 175491 > > > written sets: 339078 > > > unwritten sets: 175488 > > > written key bytes: 29821314856 > > > unwritten key bytes: 14076504 > > > floats: 90520963280544 > > > failed: 6462 > > > > > > The complete latency data is available 
at [2].
> > >
> > > I will be glad to run further tests to solidify these findings for the inclusion
> > > of this patch in the coming merge window. Let me know if you'd like me to
> > > conduct any specific tests.
> >
> > Yes, more testing is necessary, from 512-byte block size up to 1 MiB or
> > 8 MiB block size. We need to make sure it won't introduce performance
> > regression in other workloads or circumstances.
> >
> > I don't have a plan to submit this patch in this merge window, so please don't
> > push me. For a performance improvement change, I prefer that the default
> > configuration cover most workloads, so more testing and performance
> > data are desired. E.g. the patch you mentioned (commit a14a68b76954 "bcache:
> > allow allocator to invalidate bucket in gc") had been deployed in the EasyStack
> > production environment for 20+ months before it got merged.
> >
> > Thanks.
> >
> > > [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a14a68b76954e73031ca6399abace17dcb77c17a
> > > [2] https://gist.github.com/robert-pang/cc7c88f356293ea6d43103e6e5f9180f
> >
> > [snipped]
> >
> > --
> > Coly Li

^ permalink raw reply	[flat|nested] 27+ messages in thread
[parent not found: <10F3457F-FC71-4D21-B2CA-05346B68D873@coly.li>]
[parent not found: <A52B16E2-DF9C-4C8A-9853-464385D1A7FA@coly.li>]
* Re: [PATCH 0/1] bcache: reduce front IO latency during GC [not found] ` <A52B16E2-DF9C-4C8A-9853-464385D1A7FA@coly.li> @ 2025-07-17 22:52 ` Robert Pang 2025-07-18 6:09 ` Coly Li 0 siblings, 1 reply; 27+ messages in thread From: Robert Pang @ 2025-07-17 22:52 UTC (permalink / raw) To: Coly Li; +Cc: Coly Li, Kent Overstreet, linux-bcache, Mingzhe Zou, Kuan-Wei Chiu Hi Coly >> With the min heap regression addressed by the recent revert [1], I am >> going to initiate a rerun of our last test set [2]. I will share the >> results as soon as they are available. My apologies for the delay in sharing the results due to my work schedule. I've completed the rerun of the tests, and the results are available for your review in [1]. As expected, the data confirms the desired latency improvements. Furthermore, I've checked for any unintended side effects and can confirm that no regressions were observed. Please take a look and let me know what you think. If there are any other specific scenarios or tests you would like me to execute to further validate the changes, please don't hesitate to let me know. Best regards Robert Pang [1] https://gist.github.com/robert-pang/a22b7c5dee2600be3260f4db57e5776d#file-test-results-md On Thu, Jul 3, 2025 at 9:47 AM Coly Li <i@coly.li> wrote: > > > > > 2025年7月3日 23:34,Coly Li <i@coly.li> 写道: > > > > Hi Robert, > > > >> 2025年7月3日 23:28,Robert Pang <robertpang@google.com> 写道: > >> > >> Hi Coly > >> > >> I'm writing to provide an update on this bcache GC latency patch. > >> > >> With the min heap regression addressed by the recent revert [1], I am > >> going to initiate a rerun of our last test set [2]. I will share the > >> results as soon as they are available. > >> > >> To ensure this patch is fully qualified for the next merge window, I'm > >> eager to perform any further tests you deem necessary. What additional > >> test scenarios would you recommend?" 
> >> > > > > Can you also do some full random write (4K and 64K block size) performance testing? > > > > If there is no performance regression (and I believe improvement will be observed), then let’s go ahead. > > BTW, I don’t commit that this patch will go into next merge window. > > Thanks. > > Coly Li > > > ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 0/1] bcache: reduce front IO latency during GC 2025-07-17 22:52 ` Robert Pang @ 2025-07-18 6:09 ` Coly Li 0 siblings, 0 replies; 27+ messages in thread From: Coly Li @ 2025-07-18 6:09 UTC (permalink / raw) To: Robert Pang Cc: Coly Li, Kent Overstreet, linux-bcache, Mingzhe Zou, Kuan-Wei Chiu Hi Robert, Thanks for the great effort! Let me take a look and respond later. Coly Li > 2025年7月18日 06:52,Robert Pang <robertpang@google.com> 写道: > > Hi Coly > >>> With the min heap regression addressed by the recent revert [1], I am >>> going to initiate a rerun of our last test set [2]. I will share the >>> results as soon as they are available. > > My apologies for the delay in sharing the results due to my work schedule. > > I've completed the rerun of the tests, and the results are available > for your review in [1]. As expected, the data confirms the desired > latency improvements. Furthermore, I've checked for any unintended > side effects and can confirm that no regressions were observed. > > Please take a look and let me know what you think. If there are any > other specific scenarios or tests you would like me to execute to > further validate the changes, please don't hesitate to let me know. > > Best regards > Robert Pang > > [1] https://gist.github.com/robert-pang/a22b7c5dee2600be3260f4db57e5776d#file-test-results-md > > On Thu, Jul 3, 2025 at 9:47 AM Coly Li <i@coly.li> wrote: >> >> >> >>> 2025年7月3日 23:34,Coly Li <i@coly.li> 写道: >>> >>> Hi Robert, >>> >>>> 2025年7月3日 23:28,Robert Pang <robertpang@google.com> 写道: >>>> >>>> Hi Coly >>>> >>>> I'm writing to provide an update on this bcache GC latency patch. >>>> >>>> With the min heap regression addressed by the recent revert [1], I am >>>> going to initiate a rerun of our last test set [2]. I will share the >>>> results as soon as they are available. >>>> >>>> To ensure this patch is fully qualified for the next merge window, I'm >>>> eager to perform any further tests you deem necessary. 
What additional >>>> test scenarios would you recommend?" >>>> >>> >>> Can you also do some full random write (4K and 64K block size) performance testing? >>> >>> If there is no performance regression (and I believe improvement will be observed), then let’s go ahead. >> >> BTW, I don’t commit that this patch will go into next merge window. >> >> Thanks. >> >> Coly Li >> >> >> > ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 0/1] bcache: reduce front IO latency during GC 2025-04-14 22:44 [PATCH 0/1] bcache: reduce front IO latency during GC Robert Pang 2025-04-14 22:44 ` [PATCH 1/1] bcache: process fewer btree nodes in incremental GC cycles Robert Pang 2025-04-15 1:57 ` [PATCH 0/1] bcache: reduce front IO latency during GC Coly Li @ 2025-05-16 17:10 ` Kent Overstreet 2 siblings, 0 replies; 27+ messages in thread From: Kent Overstreet @ 2025-05-16 17:10 UTC (permalink / raw) To: Robert Pang; +Cc: Coly Li, linux-bcache, Mingzhe Zou On Mon, Apr 14, 2025 at 03:44:03PM -0700, Robert Pang wrote: > In performance benchmarks on disks with bcache using the Linux 6.6 kernel, we > observe noticeable IO latency increase during btree garbage collection. The > increase ranges from high tens to hundreds of milliseconds, depending on the > size of the cache device. Further investigation reveals that it is the same > issue reported in [1], where the large number of nodes processed in each > incremental GC cycle causes the front IO latency. > > Building upon the approach suggested in [1], this patch decomposes the > incremental GC process into more but smaller cycles. In contrast to [1], this > implementation adopts a simpler strategy by setting a lower limit of 10 nodes > per cycle to reduce front IO delay and introducing a fixed 10ms sleep per cycle > when front IO is in progress. Furthermore, when garbage collection statistics > are available, the number of nodes processed per cycle is dynamically rescaled > based on the average GC frequency to ensure GC completes well within the next > subsequent scheduled interval. > > Testing with a 750GB NVMe cache and 256KB bucket size using the following fio > configuration demonstrates that our patch reduces front IO latency during GC > without significantly increasing GC duration. Have you been looking at bcachefs yet? It solves all the latency issues in bcache. ^ permalink raw reply [flat|nested] 27+ messages in thread
Thread overview: 27+ messages
2025-04-14 22:44 [PATCH 0/1] bcache: reduce front IO latency during GC Robert Pang
2025-04-14 22:44 ` [PATCH 1/1] bcache: process fewer btree nodes in incremental GC cycles Robert Pang
2025-04-15 2:08 ` Coly Li
2025-04-15 17:25 ` Robert Pang
2025-04-16 9:44 ` Coly Li
2025-04-17 5:44 ` Robert Pang
2025-04-15 1:57 ` [PATCH 0/1] bcache: reduce front IO latency during GC Coly Li
2025-04-22 1:44 ` Robert Pang
2025-05-02 1:01 ` Robert Pang
2025-05-03 17:33 ` Coly Li
2025-05-15 0:58 ` Robert Pang
2025-05-15 1:04 ` Robert Pang
2025-05-15 7:12 ` Kuan-Wei Chiu
2025-05-16 3:58 ` Robert Pang
2025-05-16 16:14 ` Kuan-Wei Chiu
2025-05-17 11:02 ` Coly Li
2025-05-20 11:51 ` Kuan-Wei Chiu
2025-05-20 12:13 ` Coly Li
2025-05-20 12:26 ` Kuan-Wei Chiu
2025-05-20 13:13 ` Coly Li
2025-05-21 14:40 ` Kuan-Wei Chiu
2025-06-06 7:39 ` Robert Pang
2025-06-06 12:37 ` Kuan-Wei Chiu
2025-07-03 15:28 ` Robert Pang
[not found] ` <10F3457F-FC71-4D21-B2CA-05346B68D873@coly.li>
[not found] ` <A52B16E2-DF9C-4C8A-9853-464385D1A7FA@coly.li>
2025-07-17 22:52 ` Robert Pang
2025-07-18 6:09 ` Coly Li
2025-05-16 17:10 ` Kent Overstreet