* [PATCH 0/1] bcache: reduce front IO latency during GC
@ 2025-04-14 22:44 Robert Pang
2025-04-14 22:44 ` [PATCH 1/1] bcache: process fewer btree nodes in incremental GC cycles Robert Pang
` (2 more replies)
0 siblings, 3 replies; 27+ messages in thread
From: Robert Pang @ 2025-04-14 22:44 UTC (permalink / raw)
To: Coly Li, Kent Overstreet, linux-bcache; +Cc: Robert Pang, Mingzhe Zou
In performance benchmarks on disks with bcache using the Linux 6.6 kernel, we
observe a noticeable IO latency increase during btree garbage collection. The
increase ranges from high tens to hundreds of milliseconds, depending on the
size of the cache device. Further investigation reveals that this is the same
issue reported in [1], where the large number of nodes processed in each
incremental GC cycle causes the front IO latency.
Building upon the approach suggested in [1], this patch decomposes the
incremental GC process into more, smaller cycles. In contrast to [1], this
implementation adopts a simpler strategy: it sets a lower limit of 10 nodes
per cycle to reduce front IO delay and introduces a fixed 10ms sleep per cycle
when front IO is in progress. Furthermore, when garbage collection statistics
are available, the number of nodes processed per cycle is dynamically rescaled
based on the average GC frequency to ensure GC completes well before the next
scheduled GC is due.
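[Editorial note: the rescaling heuristic described above can be modeled in
plain C as follows. This is an illustrative sketch, not the kernel code: the
function name and parameters are invented, and the in-kernel version reads
c->gc_stats.nodes and the averaged GC interval from btree_gc_time.]

```c
#include <stdint.h>
#include <stddef.h>

#define GC_MIN_NODES 10	/* floor: nodes processed per cycle */
#define GC_SLEEP_MS  10	/* sleep between cycles while front IO is active */

/*
 * Model of the per-cycle node budget: the GC duration budget is half the
 * observed average interval between GCs. If processing GC_MIN_NODES per
 * cycle would blow that budget (one GC_SLEEP_MS sleep per cycle), scale
 * the per-cycle count up so GC still finishes in time.
 */
static size_t gc_nodes_per_cycle(size_t total_nodes, uint64_t avg_gc_interval_ms)
{
	size_t min_nodes = GC_MIN_NODES;
	uint64_t gc_max_ms = avg_gc_interval_ms / 2;

	/* gc_max_ms >= GC_SLEEP_MS also guards the division below:
	 * it is 0 right after cache setup, before any stats exist. */
	if (gc_max_ms >= GC_SLEEP_MS &&
	    GC_SLEEP_MS * (total_nodes / min_nodes) > gc_max_ms)
		min_nodes = total_nodes / (gc_max_ms / GC_SLEEP_MS);

	return min_nodes;
}
```

For example, with 100,000 btree nodes and an observed 20 s GC interval, the
budget is 10 s; at one 10 ms sleep per cycle that allows 1,000 cycles, so each
cycle must process 100 nodes instead of the floor of 10.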
Testing with a 750GB NVMe cache and 256KB bucket size using the following fio
configuration demonstrates that our patch reduces front IO latency during GC
without significantly increasing GC duration.
ioengine=libaio
direct=1
bs=4k
size=900G
iodepth=10
readwrite=randwrite
log_avg_msec=10
Before:
time-ms,latency-ns,,,
12170, 285016, 1, 0, 0
12183, 296581, 1, 0, 0
12207, 6542725, 1, 0, 0
12242, 24483604, 1, 0, 0
12250, 1895628, 1, 0, 0
12260, 284854, 1, 0, 0
12270, 275513, 1, 0, 0
/sys/block/bcache0/bcache/cache/internal/btree_gc_average_duration_ms:2880
/sys/block/bcache0/bcache/cache/internal/btree_gc_average_frequency_sec:133
/sys/block/bcache0/bcache/cache/internal/btree_gc_last_sec:121
/sys/block/bcache0/bcache/cache/internal/btree_gc_max_duration_ms:3456
After:
12690, 378494, 1, 0, 0
12700, 413934, 1, 0, 0
12710, 661217, 1, 0, 0
12727, 354510, 1, 0, 0
12730, 1100768, 1, 0, 0
12742, 382484, 1, 0, 0
12750, 532679, 1, 0, 0
12760, 572758, 1, 0, 0
12773, 283416, 1, 0, 0
/sys/block/bcache0/bcache/cache/internal/btree_gc_average_duration_ms:3619
/sys/block/bcache0/bcache/cache/internal/btree_gc_average_frequency_sec:58
/sys/block/bcache0/bcache/cache/internal/btree_gc_last_sec:23
/sys/block/bcache0/bcache/cache/internal/btree_gc_max_duration_ms:3866
[1] https://lore.kernel.org/all/20220511073903.13568-1-mingzhe.zou@easystack.cn/
Robert Pang (1):
bcache: process fewer btree nodes in incremental GC cycles
drivers/md/bcache/btree.c | 38 +++++++++++++++++---------------------
drivers/md/bcache/util.h | 3 +++
2 files changed, 20 insertions(+), 21 deletions(-)
--
2.49.0.604.gff1f9ca942-goog
^ permalink raw reply [flat|nested] 27+ messages in thread* [PATCH 1/1] bcache: process fewer btree nodes in incremental GC cycles 2025-04-14 22:44 [PATCH 0/1] bcache: reduce front IO latency during GC Robert Pang @ 2025-04-14 22:44 ` Robert Pang 2025-04-15 2:08 ` Coly Li 2025-04-15 1:57 ` [PATCH 0/1] bcache: reduce front IO latency during GC Coly Li 2025-05-16 17:10 ` Kent Overstreet 2 siblings, 1 reply; 27+ messages in thread From: Robert Pang @ 2025-04-14 22:44 UTC (permalink / raw) To: Coly Li, Kent Overstreet, linux-bcache; +Cc: Robert Pang, Mingzhe Zou Current incremental GC processes a minimum of 100 btree nodes per cycle, followed by a 100ms sleep. For NVMe cache devices, where the average node processing time is ~1ms, this leads to front-side I/O latency potentially reaching tens or hundreds of milliseconds during GC execution. This commit resolves this latency issue by reducing the minimum node processing count per cycle to 10 and the inter-cycle sleep duration to 10ms. It also integrates a check of existing GC statistics to re-scale the number of nodes processed per sleep interval when needed, ensuring GC finishes well before the next GC is due. 
Signed-off-by: Robert Pang <robertpang@google.com> --- drivers/md/bcache/btree.c | 38 +++++++++++++++++--------------------- drivers/md/bcache/util.h | 3 +++ 2 files changed, 20 insertions(+), 21 deletions(-) diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c index ed40d8600656..093e1edcaa53 100644 --- a/drivers/md/bcache/btree.c +++ b/drivers/md/bcache/btree.c @@ -88,11 +88,8 @@ * Test module load/unload */ -#define MAX_NEED_GC 64 -#define MAX_SAVE_PRIO 72 -#define MAX_GC_TIMES 100 -#define MIN_GC_NODES 100 -#define GC_SLEEP_MS 100 +#define GC_MIN_NODES 10 +#define GC_SLEEP_MS 10 #define PTR_DIRTY_BIT (((uint64_t) 1 << 36)) @@ -1585,25 +1582,24 @@ static unsigned int btree_gc_count_keys(struct btree *b) static size_t btree_gc_min_nodes(struct cache_set *c) { - size_t min_nodes; + size_t min_nodes = GC_MIN_NODES; + uint64_t gc_max_ms = time_stat_average(&c->btree_gc_time, frequency, ms) / 2; /* - * Since incremental GC would stop 100ms when front - * side I/O comes, so when there are many btree nodes, - * if GC only processes constant (100) nodes each time, - * GC would last a long time, and the front side I/Os - * would run out of the buckets (since no new bucket - * can be allocated during GC), and be blocked again. - * So GC should not process constant nodes, but varied - * nodes according to the number of btree nodes, which - * realized by dividing GC into constant(100) times, - * so when there are many btree nodes, GC can process - * more nodes each time, otherwise, GC will process less - * nodes each time (but no less than MIN_GC_NODES) + * The incremental garbage collector operates by processing + * GC_MIN_NODES at a time, pausing for GC_SLEEP_MS between + * each interval. If historical garbage collection statistics + * (btree_gc_time) is available, the maximum allowable GC + * duration is set to half of this observed frequency. 
To + * prevent exceeding this maximum duration, the number of + * nodes processed in the current step may be increased if + * the projected completion time based on the current pace + * extends beyond the allowed limit. This ensures timely GC + * completion before the next GC is due. */ - min_nodes = c->gc_stats.nodes / MAX_GC_TIMES; - if (min_nodes < MIN_GC_NODES) - min_nodes = MIN_GC_NODES; + if ((gc_max_ms >= GC_SLEEP_MS) && + (GC_SLEEP_MS * (c->gc_stats.nodes / min_nodes)) > gc_max_ms) + min_nodes = c->gc_stats.nodes / (gc_max_ms / GC_SLEEP_MS); return min_nodes; } diff --git a/drivers/md/bcache/util.h b/drivers/md/bcache/util.h index 539454d8e2d0..21a370f444b7 100644 --- a/drivers/md/bcache/util.h +++ b/drivers/md/bcache/util.h @@ -305,6 +305,9 @@ static inline unsigned int local_clock_us(void) #define NSEC_PER_ms NSEC_PER_MSEC #define NSEC_PER_sec NSEC_PER_SEC +#define time_stat_average(stats, stat, units) \ + div_u64((stats)->average_ ## stat >> 8, NSEC_PER_ ## units) + #define __print_time_stat(stats, name, stat, units) \ sysfs_print(name ## _ ## stat ## _ ## units, \ div_u64((stats)->stat >> 8, NSEC_PER_ ## units)) -- 2.49.0.604.gff1f9ca942-goog ^ permalink raw reply related [flat|nested] 27+ messages in thread
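[Editorial note on the util.h hunk above: the `>> 8` in time_stat_average()
mirrors the existing __print_time_stat() macro, because bcache time statistics
store averaged values in nanoseconds with 8 fractional bits (scaled by 256).
A minimal userspace model of the conversion follows; the helper name is
illustrative, not part of the kernel API.]

```c
#include <stdint.h>

#define NSEC_PER_MS 1000000ULL

/*
 * bcache keeps averaged time stats as fixed-point nanoseconds with 8
 * fractional bits. Dropping the fraction (>> 8) and dividing by the
 * unit size yields the value sysfs reports, e.g. in milliseconds.
 */
static uint64_t stat_to_ms(uint64_t fixed_point_ns)
{
	return (fixed_point_ns >> 8) / NSEC_PER_MS;
}
```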
* Re: [PATCH 1/1] bcache: process fewer btree nodes in incremental GC cycles
  2025-04-14 22:44 ` [PATCH 1/1] bcache: process fewer btree nodes in incremental GC cycles Robert Pang
@ 2025-04-15  2:08 ` Coly Li
  2025-04-15 17:25 ` Robert Pang
  0 siblings, 1 reply; 27+ messages in thread
From: Coly Li @ 2025-04-15 2:08 UTC (permalink / raw)
To: Robert Pang; +Cc: Kent Overstreet, linux-bcache, Mingzhe Zou

On Mon, Apr 14, 2025 at 03:44:04PM +0800, Robert Pang wrote:
> Current incremental GC processes a minimum of 100 btree nodes per cycle,
> followed by a 100ms sleep. For NVMe cache devices, where the average node
> processing time is ~1ms, this leads to front-side I/O latency potentially
> reaching tens or hundreds of milliseconds during GC execution.
>
> This commit resolves this latency issue by reducing the minimum node processing
> count per cycle to 10 and the inter-cycle sleep duration to 10ms. It also
> integrates a check of existing GC statistics to re-scale the number of nodes
> processed per sleep interval when needed, ensuring GC finishes well before the
> next GC is due.
>
> Signed-off-by: Robert Pang <robertpang@google.com>
> ---
>  drivers/md/bcache/btree.c | 38 +++++++++++++++++---------------------
>  drivers/md/bcache/util.h  |  3 +++
>  2 files changed, 20 insertions(+), 21 deletions(-)
>
> diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
> index ed40d8600656..093e1edcaa53 100644
> --- a/drivers/md/bcache/btree.c
> +++ b/drivers/md/bcache/btree.c
> @@ -88,11 +88,8 @@
>   * Test module load/unload
>   */
>
> -#define MAX_NEED_GC		64
> -#define MAX_SAVE_PRIO		72

You may compose another patch for the above changes, to separate them
from the main idea of this patch.
> -#define MAX_GC_TIMES 100 > -#define MIN_GC_NODES 100 > -#define GC_SLEEP_MS 100 > +#define GC_MIN_NODES 10 > +#define GC_SLEEP_MS 10 > > #define PTR_DIRTY_BIT (((uint64_t) 1 << 36)) > > @@ -1585,25 +1582,24 @@ static unsigned int btree_gc_count_keys(struct btree *b) > > static size_t btree_gc_min_nodes(struct cache_set *c) > { > - size_t min_nodes; > + size_t min_nodes = GC_MIN_NODES; > + uint64_t gc_max_ms = time_stat_average(&c->btree_gc_time, frequency, ms) / 2; > > /* > - * Since incremental GC would stop 100ms when front > - * side I/O comes, so when there are many btree nodes, > - * if GC only processes constant (100) nodes each time, > - * GC would last a long time, and the front side I/Os > - * would run out of the buckets (since no new bucket > - * can be allocated during GC), and be blocked again. > - * So GC should not process constant nodes, but varied > - * nodes according to the number of btree nodes, which > - * realized by dividing GC into constant(100) times, > - * so when there are many btree nodes, GC can process > - * more nodes each time, otherwise, GC will process less > - * nodes each time (but no less than MIN_GC_NODES) > + * The incremental garbage collector operates by processing > + * GC_MIN_NODES at a time, pausing for GC_SLEEP_MS between > + * each interval. If historical garbage collection statistics > + * (btree_gc_time) is available, the maximum allowable GC > + * duration is set to half of this observed frequency. To > + * prevent exceeding this maximum duration, the number of > + * nodes processed in the current step may be increased if > + * the projected completion time based on the current pace > + * extends beyond the allowed limit. This ensures timely GC > + * completion before the next GC is due. 
> */ > - min_nodes = c->gc_stats.nodes / MAX_GC_TIMES; > - if (min_nodes < MIN_GC_NODES) > - min_nodes = MIN_GC_NODES; > + if ((gc_max_ms >= GC_SLEEP_MS) && > + (GC_SLEEP_MS * (c->gc_stats.nodes / min_nodes)) > gc_max_ms) > + min_nodes = c->gc_stats.nodes / (gc_max_ms / GC_SLEEP_MS); > Is it possible that gc_max_ms becomes 0? > return min_nodes; > } > diff --git a/drivers/md/bcache/util.h b/drivers/md/bcache/util.h > index 539454d8e2d0..21a370f444b7 100644 > --- a/drivers/md/bcache/util.h > +++ b/drivers/md/bcache/util.h > @@ -305,6 +305,9 @@ static inline unsigned int local_clock_us(void) > #define NSEC_PER_ms NSEC_PER_MSEC > #define NSEC_PER_sec NSEC_PER_SEC > > +#define time_stat_average(stats, stat, units) \ > + div_u64((stats)->average_ ## stat >> 8, NSEC_PER_ ## units) > + Could you please add a few code comments here to explain what does time_stat_average() do? Thanks in advance. > #define __print_time_stat(stats, name, stat, units) \ > sysfs_print(name ## _ ## stat ## _ ## units, \ > div_u64((stats)->stat >> 8, NSEC_PER_ ## units)) > -- > 2.49.0.604.gff1f9ca942-goog > -- Coly Li ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 1/1] bcache: process fewer btree nodes in incremental GC cycles 2025-04-15 2:08 ` Coly Li @ 2025-04-15 17:25 ` Robert Pang 2025-04-16 9:44 ` Coly Li 0 siblings, 1 reply; 27+ messages in thread From: Robert Pang @ 2025-04-15 17:25 UTC (permalink / raw) To: Coly Li; +Cc: Kent Overstreet, linux-bcache, Mingzhe Zou Thank you for your prompt feedback, Coly. On Mon, Apr 14, 2025 at 7:08 PM Coly Li <colyli@kernel.org> wrote: > > On Mon, Apr 14, 2025 at 03:44:04PM +0800, Robert Pang wrote: > > Current incremental GC processes a minimum of 100 btree nodes per cycle, > > followed by a 100ms sleep. For NVMe cache devices, where the average node > > processing time is ~1ms, this leads to front-side I/O latency potentially > > reaching tens or hundreds of milliseconds during GC execution. > > > > This commit resolves this latency issue by reducing the minimum node processing > > count per cycle to 10 and the inter-cycle sleep duration to 10ms. It also > > integrates a check of existing GC statistics to re-scale the number of nodes > > processed per sleep interval when needed, ensuring GC finishes well before the > > next GC is due. > > > > Signed-off-by: Robert Pang <robertpang@google.com> > > --- > > drivers/md/bcache/btree.c | 38 +++++++++++++++++--------------------- > > drivers/md/bcache/util.h | 3 +++ > > 2 files changed, 20 insertions(+), 21 deletions(-) > > > > diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c > > index ed40d8600656..093e1edcaa53 100644 > > --- a/drivers/md/bcache/btree.c > > +++ b/drivers/md/bcache/btree.c > > @@ -88,11 +88,8 @@ > > * Test module load/unload > > */ > > > > -#define MAX_NEED_GC 64 > > -#define MAX_SAVE_PRIO 72 > > You may compose another patch for the above changes, to separte them > from main idea of this patch. Certainly, Just sent this as a separate patch. 
> > -#define MAX_GC_TIMES 100 > > -#define MIN_GC_NODES 100 > > -#define GC_SLEEP_MS 100 > > +#define GC_MIN_NODES 10 > > +#define GC_SLEEP_MS 10 > > > > #define PTR_DIRTY_BIT (((uint64_t) 1 << 36)) > > > > @@ -1585,25 +1582,24 @@ static unsigned int btree_gc_count_keys(struct btree *b) > > > > static size_t btree_gc_min_nodes(struct cache_set *c) > > { > > - size_t min_nodes; > > + size_t min_nodes = GC_MIN_NODES; > > + uint64_t gc_max_ms = time_stat_average(&c->btree_gc_time, frequency, ms) / 2; > > > > /* > > - * Since incremental GC would stop 100ms when front > > - * side I/O comes, so when there are many btree nodes, > > - * if GC only processes constant (100) nodes each time, > > - * GC would last a long time, and the front side I/Os > > - * would run out of the buckets (since no new bucket > > - * can be allocated during GC), and be blocked again. > > - * So GC should not process constant nodes, but varied > > - * nodes according to the number of btree nodes, which > > - * realized by dividing GC into constant(100) times, > > - * so when there are many btree nodes, GC can process > > - * more nodes each time, otherwise, GC will process less > > - * nodes each time (but no less than MIN_GC_NODES) > > + * The incremental garbage collector operates by processing > > + * GC_MIN_NODES at a time, pausing for GC_SLEEP_MS between > > + * each interval. If historical garbage collection statistics > > + * (btree_gc_time) is available, the maximum allowable GC > > + * duration is set to half of this observed frequency. To > > + * prevent exceeding this maximum duration, the number of > > + * nodes processed in the current step may be increased if > > + * the projected completion time based on the current pace > > + * extends beyond the allowed limit. This ensures timely GC > > + * completion before the next GC is due. 
> > */ > > - min_nodes = c->gc_stats.nodes / MAX_GC_TIMES; > > - if (min_nodes < MIN_GC_NODES) > > - min_nodes = MIN_GC_NODES; > > + if ((gc_max_ms >= GC_SLEEP_MS) && > > + (GC_SLEEP_MS * (c->gc_stats.nodes / min_nodes)) > gc_max_ms) > > + min_nodes = c->gc_stats.nodes / (gc_max_ms / GC_SLEEP_MS); > > > > Is it possible that gc_max_ms becomes 0? Yes, gc_max_ms can be 0 initially when the cache is set up and stats are not collected yet. In that case, the check "gc_max_ms >= GC_SLEEP_MS" fails and we process at the default rate of 10 nodes per cycle. Importantly, this same check also serves to prevent a division-by-zero error when the number of nodes is re-scaled using the following calculation: min_nodes = c->gc_stats.nodes / (gc_max_ms / GC_SLEEP_MS); > > > return min_nodes; > > } > > diff --git a/drivers/md/bcache/util.h b/drivers/md/bcache/util.h > > index 539454d8e2d0..21a370f444b7 100644 > > --- a/drivers/md/bcache/util.h > > +++ b/drivers/md/bcache/util.h > > @@ -305,6 +305,9 @@ static inline unsigned int local_clock_us(void) > > #define NSEC_PER_ms NSEC_PER_MSEC > > #define NSEC_PER_sec NSEC_PER_SEC > > > > +#define time_stat_average(stats, stat, units) \ > > + div_u64((stats)->average_ ## stat >> 8, NSEC_PER_ ## units) > > + > > Could you please add a few code comments here to explain what does > time_stat_average() do? Will add the code comments with explanation in a new version promptly. > > Thanks in advance. > > > #define __print_time_stat(stats, name, stat, units) \ > > sysfs_print(name ## _ ## stat ## _ ## units, \ > > div_u64((stats)->stat >> 8, NSEC_PER_ ## units)) > > -- > > 2.49.0.604.gff1f9ca942-goog > > > > -- > Coly Li ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 1/1] bcache: process fewer btree nodes in incremental GC cycles 2025-04-15 17:25 ` Robert Pang @ 2025-04-16 9:44 ` Coly Li 2025-04-17 5:44 ` Robert Pang 0 siblings, 1 reply; 27+ messages in thread From: Coly Li @ 2025-04-16 9:44 UTC (permalink / raw) To: Robert Pang; +Cc: Coly Li, Kent Overstreet, linux-bcache, Mingzhe Zou > 2025年4月16日 01:25,Robert Pang <robertpang@google.com> 写道: > > Thank you for your prompt feedback, Coly. > > On Mon, Apr 14, 2025 at 7:08 PM Coly Li <colyli@kernel.org> wrote: >> >> On Mon, Apr 14, 2025 at 03:44:04PM +0800, Robert Pang wrote: >>> Current incremental GC processes a minimum of 100 btree nodes per cycle, >>> followed by a 100ms sleep. For NVMe cache devices, where the average node >>> processing time is ~1ms, this leads to front-side I/O latency potentially >>> reaching tens or hundreds of milliseconds during GC execution. >>> >>> This commit resolves this latency issue by reducing the minimum node processing >>> count per cycle to 10 and the inter-cycle sleep duration to 10ms. It also >>> integrates a check of existing GC statistics to re-scale the number of nodes >>> processed per sleep interval when needed, ensuring GC finishes well before the >>> next GC is due. >>> >>> Signed-off-by: Robert Pang <robertpang@google.com> >>> --- >>> drivers/md/bcache/btree.c | 38 +++++++++++++++++--------------------- >>> drivers/md/bcache/util.h | 3 +++ >>> 2 files changed, 20 insertions(+), 21 deletions(-) >>> >>> diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c >>> index ed40d8600656..093e1edcaa53 100644 >>> --- a/drivers/md/bcache/btree.c >>> +++ b/drivers/md/bcache/btree.c >>> @@ -88,11 +88,8 @@ >>> * Test module load/unload >>> */ >>> >>> -#define MAX_NEED_GC 64 >>> -#define MAX_SAVE_PRIO 72 >> >> You may compose another patch for the above changes, to separte them >> from main idea of this patch. > > Certainly, Just sent this as a separate patch. 
> >>> -#define MAX_GC_TIMES 100 >>> -#define MIN_GC_NODES 100 >>> -#define GC_SLEEP_MS 100 >>> +#define GC_MIN_NODES 10 >>> +#define GC_SLEEP_MS 10 >>> >>> #define PTR_DIRTY_BIT (((uint64_t) 1 << 36)) >>> >>> @@ -1585,25 +1582,24 @@ static unsigned int btree_gc_count_keys(struct btree *b) >>> >>> static size_t btree_gc_min_nodes(struct cache_set *c) >>> { >>> - size_t min_nodes; >>> + size_t min_nodes = GC_MIN_NODES; >>> + uint64_t gc_max_ms = time_stat_average(&c->btree_gc_time, frequency, ms) / 2; >>> >>> /* >>> - * Since incremental GC would stop 100ms when front >>> - * side I/O comes, so when there are many btree nodes, >>> - * if GC only processes constant (100) nodes each time, >>> - * GC would last a long time, and the front side I/Os >>> - * would run out of the buckets (since no new bucket >>> - * can be allocated during GC), and be blocked again. >>> - * So GC should not process constant nodes, but varied >>> - * nodes according to the number of btree nodes, which >>> - * realized by dividing GC into constant(100) times, >>> - * so when there are many btree nodes, GC can process >>> - * more nodes each time, otherwise, GC will process less >>> - * nodes each time (but no less than MIN_GC_NODES) >>> + * The incremental garbage collector operates by processing >>> + * GC_MIN_NODES at a time, pausing for GC_SLEEP_MS between >>> + * each interval. If historical garbage collection statistics >>> + * (btree_gc_time) is available, the maximum allowable GC >>> + * duration is set to half of this observed frequency. To >>> + * prevent exceeding this maximum duration, the number of >>> + * nodes processed in the current step may be increased if >>> + * the projected completion time based on the current pace >>> + * extends beyond the allowed limit. This ensures timely GC >>> + * completion before the next GC is due. 
>>> */ >>> - min_nodes = c->gc_stats.nodes / MAX_GC_TIMES; >>> - if (min_nodes < MIN_GC_NODES) >>> - min_nodes = MIN_GC_NODES; >>> + if ((gc_max_ms >= GC_SLEEP_MS) && >>> + (GC_SLEEP_MS * (c->gc_stats.nodes / min_nodes)) > gc_max_ms) >>> + min_nodes = c->gc_stats.nodes / (gc_max_ms / GC_SLEEP_MS); >>> >> >> Is it possible that gc_max_ms becomes 0? > > Yes, gc_max_ms can be 0 initially when the cache is set up and stats are not > collected yet. In that case, the check "gc_max_ms >= GC_SLEEP_MS" fails and > we process at the default rate of 10 nodes per cycle. > > Importantly, this same check also serves to prevent a division-by-zero error > when the number of nodes is re-scaled using the following calculation: > > min_nodes = c->gc_stats.nodes / (gc_max_ms / GC_SLEEP_MS); > >> Thanks. So the code itself I don’t have more comment, just look forward to more testing and benchmark results. Thanks. Coly Li >>> return min_nodes; >>> } >>> diff --git a/drivers/md/bcache/util.h b/drivers/md/bcache/util.h >>> index 539454d8e2d0..21a370f444b7 100644 >>> --- a/drivers/md/bcache/util.h >>> +++ b/drivers/md/bcache/util.h >>> @@ -305,6 +305,9 @@ static inline unsigned int local_clock_us(void) >>> #define NSEC_PER_ms NSEC_PER_MSEC >>> #define NSEC_PER_sec NSEC_PER_SEC >>> >>> +#define time_stat_average(stats, stat, units) \ >>> + div_u64((stats)->average_ ## stat >> 8, NSEC_PER_ ## units) >>> + >> >> Could you please add a few code comments here to explain what does >> time_stat_average() do? > > Will add the code comments with explanation in a new version promptly. > >> >> Thanks in advance. >> >>> #define __print_time_stat(stats, name, stat, units) \ >>> sysfs_print(name ## _ ## stat ## _ ## units, \ >>> div_u64((stats)->stat >> 8, NSEC_PER_ ## units)) >>> -- >>> 2.49.0.604.gff1f9ca942-goog >>> >> >> -- >> Coly Li ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 1/1] bcache: process fewer btree nodes in incremental GC cycles 2025-04-16 9:44 ` Coly Li @ 2025-04-17 5:44 ` Robert Pang 0 siblings, 0 replies; 27+ messages in thread From: Robert Pang @ 2025-04-17 5:44 UTC (permalink / raw) To: Coly Li; +Cc: Coly Li, Kent Overstreet, linux-bcache, Mingzhe Zou Thank you for the positive feedback and your review. Regarding testing, the 24-hour fio test is currently underway on a 6TB SSD with an iodepth set to 128. I will provide an update with the findings upon its conclusion. On Wed, Apr 16, 2025 at 2:44 AM Coly Li <i@coly.li> wrote: > > > > > 2025年4月16日 01:25,Robert Pang <robertpang@google.com> 写道: > > > > Thank you for your prompt feedback, Coly. > > > > On Mon, Apr 14, 2025 at 7:08 PM Coly Li <colyli@kernel.org> wrote: > >> > >> On Mon, Apr 14, 2025 at 03:44:04PM +0800, Robert Pang wrote: > >>> Current incremental GC processes a minimum of 100 btree nodes per cycle, > >>> followed by a 100ms sleep. For NVMe cache devices, where the average node > >>> processing time is ~1ms, this leads to front-side I/O latency potentially > >>> reaching tens or hundreds of milliseconds during GC execution. > >>> > >>> This commit resolves this latency issue by reducing the minimum node processing > >>> count per cycle to 10 and the inter-cycle sleep duration to 10ms. It also > >>> integrates a check of existing GC statistics to re-scale the number of nodes > >>> processed per sleep interval when needed, ensuring GC finishes well before the > >>> next GC is due. 
> >>> > >>> Signed-off-by: Robert Pang <robertpang@google.com> > >>> --- > >>> drivers/md/bcache/btree.c | 38 +++++++++++++++++--------------------- > >>> drivers/md/bcache/util.h | 3 +++ > >>> 2 files changed, 20 insertions(+), 21 deletions(-) > >>> > >>> diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c > >>> index ed40d8600656..093e1edcaa53 100644 > >>> --- a/drivers/md/bcache/btree.c > >>> +++ b/drivers/md/bcache/btree.c > >>> @@ -88,11 +88,8 @@ > >>> * Test module load/unload > >>> */ > >>> > >>> -#define MAX_NEED_GC 64 > >>> -#define MAX_SAVE_PRIO 72 > >> > >> You may compose another patch for the above changes, to separte them > >> from main idea of this patch. > > > > Certainly, Just sent this as a separate patch. > > > >>> -#define MAX_GC_TIMES 100 > >>> -#define MIN_GC_NODES 100 > >>> -#define GC_SLEEP_MS 100 > >>> +#define GC_MIN_NODES 10 > >>> +#define GC_SLEEP_MS 10 > >>> > >>> #define PTR_DIRTY_BIT (((uint64_t) 1 << 36)) > >>> > >>> @@ -1585,25 +1582,24 @@ static unsigned int btree_gc_count_keys(struct btree *b) > >>> > >>> static size_t btree_gc_min_nodes(struct cache_set *c) > >>> { > >>> - size_t min_nodes; > >>> + size_t min_nodes = GC_MIN_NODES; > >>> + uint64_t gc_max_ms = time_stat_average(&c->btree_gc_time, frequency, ms) / 2; > >>> > >>> /* > >>> - * Since incremental GC would stop 100ms when front > >>> - * side I/O comes, so when there are many btree nodes, > >>> - * if GC only processes constant (100) nodes each time, > >>> - * GC would last a long time, and the front side I/Os > >>> - * would run out of the buckets (since no new bucket > >>> - * can be allocated during GC), and be blocked again. 
> >>> - * So GC should not process constant nodes, but varied > >>> - * nodes according to the number of btree nodes, which > >>> - * realized by dividing GC into constant(100) times, > >>> - * so when there are many btree nodes, GC can process > >>> - * more nodes each time, otherwise, GC will process less > >>> - * nodes each time (but no less than MIN_GC_NODES) > >>> + * The incremental garbage collector operates by processing > >>> + * GC_MIN_NODES at a time, pausing for GC_SLEEP_MS between > >>> + * each interval. If historical garbage collection statistics > >>> + * (btree_gc_time) is available, the maximum allowable GC > >>> + * duration is set to half of this observed frequency. To > >>> + * prevent exceeding this maximum duration, the number of > >>> + * nodes processed in the current step may be increased if > >>> + * the projected completion time based on the current pace > >>> + * extends beyond the allowed limit. This ensures timely GC > >>> + * completion before the next GC is due. > >>> */ > >>> - min_nodes = c->gc_stats.nodes / MAX_GC_TIMES; > >>> - if (min_nodes < MIN_GC_NODES) > >>> - min_nodes = MIN_GC_NODES; > >>> + if ((gc_max_ms >= GC_SLEEP_MS) && > >>> + (GC_SLEEP_MS * (c->gc_stats.nodes / min_nodes)) > gc_max_ms) > >>> + min_nodes = c->gc_stats.nodes / (gc_max_ms / GC_SLEEP_MS); > >>> > >> > >> Is it possible that gc_max_ms becomes 0? > > > > Yes, gc_max_ms can be 0 initially when the cache is set up and stats are not > > collected yet. In that case, the check "gc_max_ms >= GC_SLEEP_MS" fails and > > we process at the default rate of 10 nodes per cycle. > > > > Importantly, this same check also serves to prevent a division-by-zero error > > when the number of nodes is re-scaled using the following calculation: > > > > min_nodes = c->gc_stats.nodes / (gc_max_ms / GC_SLEEP_MS); > > > >> > > Thanks. So the code itself I don’t have more comment, just look forward to more testing and benchmark results. > > Thanks. 
> > Coly Li > > > > > >>> return min_nodes; > >>> } > >>> diff --git a/drivers/md/bcache/util.h b/drivers/md/bcache/util.h > >>> index 539454d8e2d0..21a370f444b7 100644 > >>> --- a/drivers/md/bcache/util.h > >>> +++ b/drivers/md/bcache/util.h > >>> @@ -305,6 +305,9 @@ static inline unsigned int local_clock_us(void) > >>> #define NSEC_PER_ms NSEC_PER_MSEC > >>> #define NSEC_PER_sec NSEC_PER_SEC > >>> > >>> +#define time_stat_average(stats, stat, units) \ > >>> + div_u64((stats)->average_ ## stat >> 8, NSEC_PER_ ## units) > >>> + > >> > >> Could you please add a few code comments here to explain what does > >> time_stat_average() do? > > > > Will add the code comments with explanation in a new version promptly. > > > >> > >> Thanks in advance. > >> > >>> #define __print_time_stat(stats, name, stat, units) \ > >>> sysfs_print(name ## _ ## stat ## _ ## units, \ > >>> div_u64((stats)->stat >> 8, NSEC_PER_ ## units)) > >>> -- > >>> 2.49.0.604.gff1f9ca942-goog > >>> > >> > >> -- > >> Coly Li > > ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 0/1] bcache: reduce front IO latency during GC 2025-04-14 22:44 [PATCH 0/1] bcache: reduce front IO latency during GC Robert Pang 2025-04-14 22:44 ` [PATCH 1/1] bcache: process fewer btree nodes in incremental GC cycles Robert Pang @ 2025-04-15 1:57 ` Coly Li 2025-04-22 1:44 ` Robert Pang 2025-05-16 17:10 ` Kent Overstreet 2 siblings, 1 reply; 27+ messages in thread From: Coly Li @ 2025-04-15 1:57 UTC (permalink / raw) To: Robert Pang; +Cc: Coly Li, Kent Overstreet, linux-bcache, Mingzhe Zou Hi Robert, Thanks for the fix up :-) > 2025年4月15日 06:44,Robert Pang <robertpang@google.com> 写道: > > In performance benchmarks on disks with bcache using the Linux 6.6 kernel, we > observe noticeable IO latency increase during btree garbage collection. The > increase ranges from high tens to hundreds of milliseconds, depending on the > size of the cache device. Further investigation reveals that it is the same > issue reported in [1], where the large number of nodes processed in each > incremental GC cycle causes the front IO latency. > > Building upon the approach suggested in [1], this patch decomposes the > incremental GC process into more but smaller cycles. In contrast to [1], this > implementation adopts a simpler strategy by setting a lower limit of 10 nodes > per cycle to reduce front IO delay and introducing a fixed 10ms sleep per cycle > when front IO is in progress. Furthermore, when garbage collection statistics > are available, the number of nodes processed per cycle is dynamically rescaled > based on the average GC frequency to ensure GC completes well within the next > subsequent scheduled interval. > > Testing with a 750GB NVMe cache and 256KB bucket size using the following fio > configuration demonstrates that our patch reduces front IO latency during GC > without significantly increasing GC duration. 
> > ioengine=libaio > direct=1 > bs=4k > size=900G > iodepth=10 > readwrite=randwrite > log_avg_msec=10 > > Before: > > time-ms,latency-ns,,, > > 12170, 285016, 1, 0, 0 > 12183, 296581, 1, 0, 0 > 12207, 6542725, 1, 0, 0 > 12242, 24483604, 1, 0, 0 > 12250, 1895628, 1, 0, 0 > 12260, 284854, 1, 0, 0 > 12270, 275513, 1, 0, 0 > > /sys/block/bcache0/bcache/cache/internal/btree_gc_average_duration_ms:2880 > /sys/block/bcache0/bcache/cache/internal/btree_gc_average_frequency_sec:133 > /sys/block/bcache0/bcache/cache/internal/btree_gc_last_sec:121 > /sys/block/bcache0/bcache/cache/internal/btree_gc_max_duration_ms:3456 > > After: > > 12690, 378494, 1, 0, 0 > 12700, 413934, 1, 0, 0 > 12710, 661217, 1, 0, 0 > 12727, 354510, 1, 0, 0 > 12730, 1100768, 1, 0, 0 > 12742, 382484, 1, 0, 0 > 12750, 532679, 1, 0, 0 > 12760, 572758, 1, 0, 0 > 12773, 283416, 1, 0, 0 > > /sys/block/bcache0/bcache/cache/internal/btree_gc_average_duration_ms:3619 > /sys/block/bcache0/bcache/cache/internal/btree_gc_average_frequency_sec:58 > /sys/block/bcache0/bcache/cache/internal/btree_gc_last_sec:23 > /sys/block/bcache0/bcache/cache/internal/btree_gc_max_duration_ms:3866 > > [1] https://lore.kernel.org/all/20220511073903.13568-1-mingzhe.zou@easystack.cn/ I see the data, it makes sense. I’d like to see more testing data, e.g. 1) Larger SSD (4T or 8T) 2) Higher I/O pressure (randomwrite) 3) Much longer time with high I/O pressure (24 hours+) Then it can help me to understand the optimization better and make decision easier. Of course it will save a lot of testing time from my side. Thank you in advance. Coly Li ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 0/1] bcache: reduce front IO latency during GC 2025-04-15 1:57 ` [PATCH 0/1] bcache: reduce front IO latency during GC Coly Li @ 2025-04-22 1:44 ` Robert Pang 2025-05-02 1:01 ` Robert Pang 0 siblings, 1 reply; 27+ messages in thread From: Robert Pang @ 2025-04-22 1:44 UTC (permalink / raw) To: Coly Li; +Cc: Coly Li, Kent Overstreet, linux-bcache, Mingzhe Zou I conducted a 24-hour fio random write test on a 6TB local SSD cache using 256kb buckets and the following parameters: ioengine=libaio direct=1 bs=4k size=12T iodepth=128 readwrite=randwrite log_avg_msec=10 The results show some improvement in average write latency, resulting in increased IOPS (from 11.8k to 17.1k) and throughput (from 45.9MiB/s to 66.7MiB/s). However, the maximum write latency exhibited a considerable degradation. Before: latency_test: (groupid=0, jobs=1): err= 0: pid=14917: Mon Apr 21 06:50:45 2025 write: IOPS=11.8k, BW=45.9MiB/s (48.1MB/s)(4035GiB/90000002msec); 0 zone resets slat (usec): min=9, max=4003.9k, avg=74.37, stdev=3734.09 clat (usec): min=2, max=5319.2k, avg=10815.66, stdev=46998.95 lat (usec): min=239, max=5319.3k, avg=10890.15, stdev=47167.41 /sys/block/bcache0/bcache/cache/internal/btree_gc_average_duration_ms:381138 /sys/block/bcache0/bcache/cache/internal/btree_gc_average_frequency_sec:3834 /sys/block/bcache0/bcache/cache/internal/btree_gc_last_sec:60668 /sys/block/bcache0/bcache/cache/internal/btree_gc_max_duration_ms:825228 /sys/block/bcache0/bcache/cache/internal/bset_tree_stats: btree nodes: 144330 written sets: 283733 unwritten sets: 144329 written key bytes: 24993783392 unwritten key bytes: 11777400 floats: 30936844345385 failed: 5776 After: latency_test: (groupid=0, jobs=1): err= 0: pid=15158: Mon Apr 21 17:22:13 2025 write: IOPS=17.1k, BW=66.7MiB/s (69.9MB/s)(5859GiB/90000004msec); 0 zone resets slat (usec): min=7, max=469348k, avg=46.71, stdev=20576.47 clat (usec): min=3, max=560660k, avg=7453.04, stdev=458965.53 lat (usec): min=313, max=560660k, 
avg=7499.88, stdev=459426.83 /sys/block/bcache0/bcache/cache/internal/btree_gc_average_duration_ms:540350 /sys/block/bcache0/bcache/cache/internal/btree_gc_average_frequency_sec:3235 /sys/block/bcache0/bcache/cache/internal/btree_gc_last_sec:22304 /sys/block/bcache0/bcache/cache/internal/btree_gc_max_duration_ms:719097 /sys/block/bcache0/bcache/cache/internal/bset_tree_stats: btree nodes: 239590 written sets: 466470 unwritten sets: 239587 written key bytes: 36120573032 unwritten key bytes: 16776624 floats: 95606253613657 failed: 4522 Further investigation pointed to the 10ms sleep time being too restrictive for this intensive workload. Examining tail latency percentiles outside garbage collection revealed that a significant number of I/O requests remained in the queue for extended periods. The observed percentiles are: Percentile Latency (ms) P85 10 P86 11 P87 13 P88 16 P89 22 P90 36 P91 60 P92 104 P93 174 P94 231 P95 257 P96 282 P97 308 P98 337 P99 377 P100 472 This data suggests that while a 10ms sleep time might be suitable for lighter workloads, it becomes a bottleneck under heavier stress. In light of these test results, I propose keeping the original sleep time and minimum nodes per cycle as defaults, while also exposing them as configurable attributes through sysfs so users can tune them according to their workload and latency SLO. I would appreciate your insights on this potential enhancement. On Mon, Apr 14, 2025 at 6:57 PM Coly Li <i@coly.li> wrote: > > Hi Robert, > > Thanks for the fix up :-) > > > On Apr 15, 2025, at 06:44, Robert Pang <robertpang@google.com> wrote: > > > > In performance benchmarks on disks with bcache using the Linux 6.6 kernel, we > > observe noticeable IO latency increase during btree garbage collection. The > > increase ranges from high tens to hundreds of milliseconds, depending on the > > size of the cache device. 
Further investigation reveals that it is the same > > issue reported in [1], where the large number of nodes processed in each > > incremental GC cycle causes the front IO latency. > > > > Building upon the approach suggested in [1], this patch decomposes the > > incremental GC process into more but smaller cycles. In contrast to [1], this > > implementation adopts a simpler strategy by setting a lower limit of 10 nodes > > per cycle to reduce front IO delay and introducing a fixed 10ms sleep per cycle > > when front IO is in progress. Furthermore, when garbage collection statistics > > are available, the number of nodes processed per cycle is dynamically rescaled > > based on the average GC frequency to ensure GC completes well within the next > > subsequent scheduled interval. > > > > Testing with a 750GB NVMe cache and 256KB bucket size using the following fio > > configuration demonstrates that our patch reduces front IO latency during GC > > without significantly increasing GC duration. 
> > > > ioengine=libaio > > direct=1 > > bs=4k > > size=900G > > iodepth=10 > > readwrite=randwrite > > log_avg_msec=10 > > > > Before: > > > > time-ms,latency-ns,,, > > > > 12170, 285016, 1, 0, 0 > > 12183, 296581, 1, 0, 0 > > 12207, 6542725, 1, 0, 0 > > 12242, 24483604, 1, 0, 0 > > 12250, 1895628, 1, 0, 0 > > 12260, 284854, 1, 0, 0 > > 12270, 275513, 1, 0, 0 > > > > /sys/block/bcache0/bcache/cache/internal/btree_gc_average_duration_ms:2880 > > /sys/block/bcache0/bcache/cache/internal/btree_gc_average_frequency_sec:133 > > /sys/block/bcache0/bcache/cache/internal/btree_gc_last_sec:121 > > /sys/block/bcache0/bcache/cache/internal/btree_gc_max_duration_ms:3456 > > > > After: > > > > 12690, 378494, 1, 0, 0 > > 12700, 413934, 1, 0, 0 > > 12710, 661217, 1, 0, 0 > > 12727, 354510, 1, 0, 0 > > 12730, 1100768, 1, 0, 0 > > 12742, 382484, 1, 0, 0 > > 12750, 532679, 1, 0, 0 > > 12760, 572758, 1, 0, 0 > > 12773, 283416, 1, 0, 0 > > > > /sys/block/bcache0/bcache/cache/internal/btree_gc_average_duration_ms:3619 > > /sys/block/bcache0/bcache/cache/internal/btree_gc_average_frequency_sec:58 > > /sys/block/bcache0/bcache/cache/internal/btree_gc_last_sec:23 > > /sys/block/bcache0/bcache/cache/internal/btree_gc_max_duration_ms:3866 > > > > [1] https://lore.kernel.org/all/20220511073903.13568-1-mingzhe.zou@easystack.cn/ > > > I see the data, it makes sense. I’d like to see more testing data, e.g. > 1) Larger SSD (4T or 8T) > 2) Higher I/O pressure (randomwrite) > 3) Much longer time with high I/O pressure (24 hours+) > > Then it can help me to understand the optimization better and make decision easier. Of course it will save a lot of testing time from my side. > > Thank you in advance. > > Coly Li > ^ permalink raw reply [flat|nested] 27+ messages in thread
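A back-of-envelope model explains why GC duration grows under sustained load: when front IO keeps the cache busy, every incremental cycle pays the fixed 10ms sleep, so the sleeps alone put a floor under the total GC time. A hedged sketch (illustrative helper, not code from the patch):

```c
#include <assert.h>

/* Illustrative model, not bcache code: lower bound on incremental GC
 * duration contributed purely by the per-cycle sleep, assuming front IO
 * stays active so that GC sleeps after every cycle. */
static unsigned long gc_sleep_floor_ms(unsigned long nodes,
                                       unsigned long nodes_per_cycle,
                                       unsigned long sleep_ms)
{
    /* Round up: a partial last batch still costs a full cycle. */
    unsigned long cycles = (nodes + nodes_per_cycle - 1) / nodes_per_cycle;

    return cycles * sleep_ms;
}
```

For the roughly 240k btree nodes in the "After" run above, 10 nodes per cycle at 10ms per sleep yields a floor of about 240 seconds of sleeping per GC pass, which is one reason btree_gc_average_duration_ms rises even as front IO latency falls.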
* Re: [PATCH 0/1] bcache: reduce front IO latency during GC 2025-04-22 1:44 ` Robert Pang @ 2025-05-02 1:01 ` Robert Pang 2025-05-03 17:33 ` Coly Li 0 siblings, 1 reply; 27+ messages in thread From: Robert Pang @ 2025-05-02 1:01 UTC (permalink / raw) To: Coly Li; +Cc: Coly Li, Kent Overstreet, linux-bcache, Mingzhe Zou Hi Coly, Please disregard the test results I shared over a week ago. After digging deeper into the recent latency spikes with various workloads and by instrumenting the garbage collector, I realized that the earlier GC latency patch, "bcache: allow allocator to invalidate bucket in gc" [1], wasn't backported to the Linux 6.6 branch I tested my patch against. This omission explains the much higher latency observed during the extended test because the allocator was blocked for the entire GC. My sincere apologies for the inconsistent results and any confusion this has caused. With patch [1] back-patched and after a 24-hour re-test, the fio results clearly demonstrate that this patch effectively reduces front IO latency during GC due to the smaller incremental GC cycles, while the GC duration increase is still well within bounds. Here's a summary of the improved latency: Before: Median latency (P50): 210 ms Max latency (P100): 3.5 sec btree_gc_average_duration_ms:381138 btree_gc_average_frequency_sec:3834 btree_gc_last_sec:60668 btree_gc_max_duration_ms:825228 bset_tree_stats: btree nodes: 144330 written sets: 283733 unwritten sets: 144329 written key bytes: 24993783392 unwritten key bytes: 11777400 floats: 30936844345385 failed: 5776 After: Median latency (P50): 25 ms Max latency (P100): 0.8 sec btree_gc_average_duration_ms:622274 btree_gc_average_frequency_sec:3518 btree_gc_last_sec:8931 btree_gc_max_duration_ms:953146 bset_tree_stats: btree nodes: 175491 written sets: 339078 unwritten sets: 175488 written key bytes: 29821314856 unwritten key bytes: 14076504 floats: 90520963280544 failed: 6462 The complete latency data is available at [2]. 
I will be glad to run further tests to solidify these findings for the inclusion of this patch in the coming merge window. Let me know if you'd like me to conduct any specific tests. Best regards Robert Pang [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a14a68b76954e73031ca6399abace17dcb77c17a [2] https://gist.github.com/robert-pang/cc7c88f356293ea6d43103e6e5f9180f On Mon, Apr 21, 2025 at 6:44 PM Robert Pang <robertpang@google.com> wrote: > > I conducted a 24-hour fio random write test on a 6TB local SSD cache using > 256kb buckets and the following parameters: > > ioengine=libaio > direct=1 > bs=4k > size=12T > iodepth=128 > readwrite=randwrite > log_avg_msec=10 > > The results show some improvement in average write latency, resulting in > increased IOPS (from 11.8k to 17.1k) and throughput (from 45.9MiB/s to > 66.7MiB/s). However, the maximum write latency exhibited a considerable > degradation. > > Before: > > latency_test: (groupid=0, jobs=1): err= 0: pid=14917: Mon Apr 21 06:50:45 2025 > write: IOPS=11.8k, BW=45.9MiB/s (48.1MB/s)(4035GiB/90000002msec); 0 > zone resets > slat (usec): min=9, max=4003.9k, avg=74.37, stdev=3734.09 > clat (usec): min=2, max=5319.2k, avg=10815.66, stdev=46998.95 > lat (usec): min=239, max=5319.3k, avg=10890.15, stdev=47167.41 > > /sys/block/bcache0/bcache/cache/internal/btree_gc_average_duration_ms:381138 > /sys/block/bcache0/bcache/cache/internal/btree_gc_average_frequency_sec:3834 > /sys/block/bcache0/bcache/cache/internal/btree_gc_last_sec:60668 > /sys/block/bcache0/bcache/cache/internal/btree_gc_max_duration_ms:825228 > /sys/block/bcache0/bcache/cache/internal/bset_tree_stats: > btree nodes: 144330 > written sets: 283733 > unwritten sets: 144329 > written key bytes: 24993783392 > unwritten key bytes: 11777400 > floats: 30936844345385 > failed: 5776 > > > After: > > latency_test: (groupid=0, jobs=1): err= 0: pid=15158: Mon Apr 21 17:22:13 2025 > write: IOPS=17.1k, BW=66.7MiB/s 
(69.9MB/s)(5859GiB/90000004msec); 0 > zone resets > slat (usec): min=7, max=469348k, avg=46.71, stdev=20576.47 > clat (usec): min=3, max=560660k, avg=7453.04, stdev=458965.53 > lat (usec): min=313, max=560660k, avg=7499.88, stdev=459426.83 > > /sys/block/bcache0/bcache/cache/internal/btree_gc_average_duration_ms:540350 > /sys/block/bcache0/bcache/cache/internal/btree_gc_average_frequency_sec:3235 > /sys/block/bcache0/bcache/cache/internal/btree_gc_last_sec:22304 > /sys/block/bcache0/bcache/cache/internal/btree_gc_max_duration_ms:719097 > /sys/block/bcache0/bcache/cache/internal/bset_tree_stats: > btree nodes: 239590 > written sets: 466470 > unwritten sets: 239587 > written key bytes: 36120573032 > unwritten key bytes: 16776624 > floats: 95606253613657 > failed: 4522 > > Further investigation pointed to the 10ms sleep time being too restrictive for > this intensive workload. Examining tail latency percentiles outside garbage > collection revealed that a significant number of I/O requests remained in the > queue for extended periods. The observed percentiles are: > > Percentile Latency (ms) > P85 10 > P86 11 > P87 13 > P88 16 > P89 22 > P90 36 > P91 60 > P92 104 > P93 174 > P94 231 > P95 257 > P96 282 > P97 308 > P98 337 > P99 377 > P100 472 > > This data suggests that while a 10ms sleep time might be suitable for lighter > workloads, it becomes a bottleneck under heavier stress. > > In consideration of the test results, I propose the feasibility of keeping the > original sleep time and min nodes per cycle, but also exposing them as > configurable attributes through sysfs so users can tune according to their > workload and latency SLO. I would appreciate your insights on this potential > enhancement. 
> > On Mon, Apr 14, 2025 at 6:57 PM Coly Li <i@coly.li> wrote: > > > > Hi Robert, > > > > Thanks for the fix up :-) > > > > > 2025年4月15日 06:44,Robert Pang <robertpang@google.com> 写道: > > > > > > In performance benchmarks on disks with bcache using the Linux 6.6 kernel, we > > > observe noticeable IO latency increase during btree garbage collection. The > > > increase ranges from high tens to hundreds of milliseconds, depending on the > > > size of the cache device. Further investigation reveals that it is the same > > > issue reported in [1], where the large number of nodes processed in each > > > incremental GC cycle causes the front IO latency. > > > > > > Building upon the approach suggested in [1], this patch decomposes the > > > incremental GC process into more but smaller cycles. In contrast to [1], this > > > implementation adopts a simpler strategy by setting a lower limit of 10 nodes > > > per cycle to reduce front IO delay and introducing a fixed 10ms sleep per cycle > > > when front IO is in progress. Furthermore, when garbage collection statistics > > > are available, the number of nodes processed per cycle is dynamically rescaled > > > based on the average GC frequency to ensure GC completes well within the next > > > subsequent scheduled interval. > > > > > > Testing with a 750GB NVMe cache and 256KB bucket size using the following fio > > > configuration demonstrates that our patch reduces front IO latency during GC > > > without significantly increasing GC duration. 
> > > > > > ioengine=libaio > > > direct=1 > > > bs=4k > > > size=900G > > > iodepth=10 > > > readwrite=randwrite > > > log_avg_msec=10 > > > > > > Before: > > > > > > time-ms,latency-ns,,, > > > > > > 12170, 285016, 1, 0, 0 > > > 12183, 296581, 1, 0, 0 > > > 12207, 6542725, 1, 0, 0 > > > 12242, 24483604, 1, 0, 0 > > > 12250, 1895628, 1, 0, 0 > > > 12260, 284854, 1, 0, 0 > > > 12270, 275513, 1, 0, 0 > > > > > > /sys/block/bcache0/bcache/cache/internal/btree_gc_average_duration_ms:2880 > > > /sys/block/bcache0/bcache/cache/internal/btree_gc_average_frequency_sec:133 > > > /sys/block/bcache0/bcache/cache/internal/btree_gc_last_sec:121 > > > /sys/block/bcache0/bcache/cache/internal/btree_gc_max_duration_ms:3456 > > > > > > After: > > > > > > 12690, 378494, 1, 0, 0 > > > 12700, 413934, 1, 0, 0 > > > 12710, 661217, 1, 0, 0 > > > 12727, 354510, 1, 0, 0 > > > 12730, 1100768, 1, 0, 0 > > > 12742, 382484, 1, 0, 0 > > > 12750, 532679, 1, 0, 0 > > > 12760, 572758, 1, 0, 0 > > > 12773, 283416, 1, 0, 0 > > > > > > /sys/block/bcache0/bcache/cache/internal/btree_gc_average_duration_ms:3619 > > > /sys/block/bcache0/bcache/cache/internal/btree_gc_average_frequency_sec:58 > > > /sys/block/bcache0/bcache/cache/internal/btree_gc_last_sec:23 > > > /sys/block/bcache0/bcache/cache/internal/btree_gc_max_duration_ms:3866 > > > > > > [1] https://lore.kernel.org/all/20220511073903.13568-1-mingzhe.zou@easystack.cn/ > > > > > > I see the data, it makes sense. I’d like to see more testing data, e.g. > > 1) Larger SSD (4T or 8T) > > 2) Higher I/O pressure (randomwrite) > > 3) Much longer time with high I/O pressure (24 hours+) > > > > Then it can help me to understand the optimization better and make decision easier. Of course it will save a lot of testing time from my side. > > > > Thank you in advance. > > > > Coly Li > > ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 0/1] bcache: reduce front IO latency during GC 2025-05-02 1:01 ` Robert Pang @ 2025-05-03 17:33 ` Coly Li 2025-05-15 0:58 ` Robert Pang 0 siblings, 1 reply; 27+ messages in thread From: Coly Li @ 2025-05-03 17:33 UTC (permalink / raw) To: Robert Pang; +Cc: Coly Li, Kent Overstreet, linux-bcache, Mingzhe Zou On Thu, May 01, 2025 at 06:01:09PM +0800, Robert Pang wrote: > Hi Coly, > > Please disregard the test results I shared over a week ago. After digging > deeper into the recent latency spikes with various workloads and by > instrumenting the garbage collector, I realized that the earlier GC latency > patch, "bcache: allow allocator to invalidate bucket in gc" [1], wasn't > backported to the Linux 6.6 branch I tested my patch against. This omission > explains the much higher latency observed during the extended test because the > allocator was blocked for the entire GC. My sincere apologies for the > inconsistent results and any confusion this has caused. > Did you also backport commit 05356938a4be ("bcache: call force_wake_up_gc() if necessary in check_should_bypass()")? Last time when you pushed me to add commit a14a68b76954 into the mainline kernel, I tested a regression from this patch and fixed it. Please add this fix if you didn't, otherwise the testing might not be complete. > With patch [1] back-patched and after a 24-hour re-test, the fio results clearly > demonstrate that this patch effectively reduces front IO latency during GC due > to the smaller incremental GC cycles, while the GC duration increase is still > well within bounds. > From the performance result in [2], it seems the max latency is reduced, but the high-latency periods are longer. I am not sure whether this is a happy result. Can I have a download link for the whole log? Then I can look at the performance numbers more closely. 
> Here's a summary of the improved latency: > > Before: > > Median latency (P50): 210 ms > Max latency (P100): 3.5 sec > > btree_gc_average_duration_ms:381138 > btree_gc_average_frequency_sec:3834 > btree_gc_last_sec:60668 > btree_gc_max_duration_ms:825228 > bset_tree_stats: > btree nodes: 144330 > written sets: 283733 > unwritten sets: 144329 > written key bytes: 24993783392 > unwritten key bytes: 11777400 > floats: 30936844345385 > failed: 5776 > > After: > > Median latency (P50): 25 ms > Max latency (P100): 0.8 sec > > btree_gc_average_duration_ms:622274 > btree_gc_average_frequency_sec:3518 > btree_gc_last_sec:8931 > btree_gc_max_duration_ms:953146 > bset_tree_stats: > btree nodes: 175491 > written sets: 339078 > unwritten sets: 175488 > written key bytes: 29821314856 > unwritten key bytes: 14076504 > floats: 90520963280544 > failed: 6462 > > The complete latency data is available at [2]. > > I will be glad to run further tests to solidify these findings for the inclusion > of this patch in the coming merge window. Let me know if you'd like me to > conduct any specific tests. Yes, more testing is necessary, from 512-byte block size up to 1 MiB or 8 MiB block sizes. We need to make sure it won't introduce a performance regression in other workloads or circumstances. I have no plan to submit this patch in this merge window, so please don't push me. For a performance improvement change, I prefer that the default configuration covers most workloads, so more testing and performance data are desired. E.g. the patch you mentioned (commit a14a68b76954 "bcache: allow allocator to invalidate bucket in gc") had been deployed in the EasyStack production environment for 20+ months before it got merged. Thanks. 
> > [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a14a68b76954e73031ca6399abace17dcb77c17a > [2] https://gist.github.com/robert-pang/cc7c88f356293ea6d43103e6e5f9180f [snipped] -- Coly Li ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 0/1] bcache: reduce front IO latency during GC 2025-05-03 17:33 ` Coly Li @ 2025-05-15 0:58 ` Robert Pang 2025-05-15 1:04 ` Robert Pang ` (2 more replies) 0 siblings, 3 replies; 27+ messages in thread From: Robert Pang @ 2025-05-15 0:58 UTC (permalink / raw) To: Coly Li; +Cc: Coly Li, Kent Overstreet, linux-bcache, Mingzhe Zou, Kuan-Wei Chiu Hi Coly, My apologies for the delay in providing this update; comprehensive testing takes some time to complete. As you suggested, I conducted extensive tests for 24 hours against the latest 6.14.5 Linux kernel, exploring more configurations to get a complete picture: 1. 4KB block size with writethrough mode 2. 4KB block size with writeback mode (70% dirty) 3. 1MB block size with writethrough mode The detailed results, available at [1], consistently demonstrate that our patch is effective in significantly reducing latency during garbage collection. This holds true for both the default writethrough mode and the 70% writeback mode. As anticipated, with 1MB block sizes, we observed no difference in latency because the number of btree nodes is much smaller. [1] https://gist.github.com/robert-pang/817fa7c11ece99d25aabc0467a9427d8 However, during these tests, we've uncovered a new and distinct latency problem that appears to be introduced in the recent Linux kernel. This issue manifests as frequent and periodic latency spikes that occur outside of garbage collection. 
Below is a snippet of the latency data illustrating this: time (s) median (ms) max (ms) 60810 2.28 679.37 60840 2.32 2,434.24 * 60870 2.46 2,434.24 * 60900 2.52 2,434.24 * 60930 2.63 566.15 60960 2.82 566.15 60990 2.82 566.15 61020 2.78 471.79 61050 2.93 2,028.54 * 61080 3.11 2,028.54 * 61110 3.29 2,028.54 * 61140 3.42 679.37 61170 3.42 679.37 61200 3.41 679.37 61230 3.30 566.15 61260 2.93 1,690.45 * 61290 2.75 1,690.45 * 61320 2.72 1,690.45 * 61350 2.88 1,408.71 * 61380 5.07 1,408.71 * 61410 107.94 1,408.71 ** 61440 65.28 1,408.71 ** 61470 45.41 2,028.54 ** 61500 72.45 2,028.54 ** 61530 55.37 2,028.54 ** 61560 40.73 1,408.71 ** 61590 11.48 1,690.45 ** 61620 2.92 1,690.45 * 61650 2.54 1,690.45 * 61680 2.58 679.37 61710 2.78 679.37 ** garbage collection * cache replacement Based on the consistent periodicity of these spikes, we deduce that they are linked to the invalidate_buckets_lru() function during cache replacement. This function was recently modified to use min heap operations [2]. To confirm our hypothesis, we reverted the relevant commits and re-ran the tests. Results show that the latency spikes completely disappeared, confirming that the min heap changes introduced this regression. Furthermore, these changes also reduce the effectiveness of our GC patch. It appears that the min heap changes somehow reduce heap sort speed in invalidate_buckets_lru() and in GC. [2] https://lore.kernel.org/linux-bcache/ZxzkLJmhn3a%2F1ALQ@visitorckw-System-Product-Name/T/#m0dd24ba0c63615de465d3fec72dc73febb0f7a94 You may download the full test result data from these links. 
https://gist.github.com/robert-pang/5df1d595ee77756c0a01d6479bdf8e34#file-bcache-latency-4kb-no-patch-csv https://gist.github.com/robert-pang/5df1d595ee77756c0a01d6479bdf8e34#file-bcache-latency-4kb-with-patch-csv https://gist.github.com/robert-pang/bcc26a3aa90dc95a083799cf4fd48116#file-bcache-latency-4kb-wb-no-patch-csv https://gist.github.com/robert-pang/bcc26a3aa90dc95a083799cf4fd48116#file-bcache-latency-4kb-wb-with-patch-csv https://gist.github.com/robert-pang/7036b06b66c8de7e958cdbddcd92a3f5#file-bcache-latency-1mb-no-patch-csv https://gist.github.com/robert-pang/7036b06b66c8de7e958cdbddcd92a3f5#file-bcache-latency-1mb-with-patch-csv https://gist.github.com/robert-pang/40f90afdea2d2a8c3f6e22ff959eff03#file-bcache-latency-4kb-no-patch-min-heap-reverted-csv https://gist.github.com/robert-pang/40f90afdea2d2a8c3f6e22ff959eff03#file-bcache-latency-4kb-with-patch-min-heap-reverted-csv Best regards Robert Pang On Sat, May 3, 2025 at 10:33 AM Coly Li <colyli@kernel.org> wrote: > > On Thu, May 01, 2025 at 06:01:09PM +0800, Robert Pang wrote: > > Hi Coly, > > > > Please disregard the test results I shared over a week ago. After digging > > deeper into the recent latency spikes with various workloads and by > > instrumenting the garbage collector, I realized that the earlier GC latency > > patch, "bcache: allow allocator to invalidate bucket in gc" [1], wasn't > > backported to the Linux 6.6 branch I tested my patch against. This omission > > explains the much higher latency observed during the extended test because the > > allocator was blocked for the entire GC. My sincere apologies for the > > inconsistent results and any confusion this has caused. > > > > Did you also backport commit 05356938a4be ("bcache: call force_wake_up_gc() > if necessary in check_should_bypass()") ? Last time when you pushed me to > add commit a14a68b76954 into mainline kernel, I tested a regression from this > patch and fixed it. 
Please add this fix if you didn't, otherwise the testing > might not be completed. > > > > With patch [1] back-patched and after a 24-hour re-test, the fio results clearly > > demonstrate that this patch effectively reduces front IO latency during GC due > > to the smaller incremental GC cycles, while the GC duration increase is still > > well within bounds. > > > > From the performance result in [2], it seems the max latency are reduced, > but higher latency period are longer. I am not sure whether this is a happy > result. > > Can I have a download link for the whole log? Then I can look at the > performance numbers more close. > > > Here's a summary of the improved latency: > > > > Before: > > > > Median latency (P50): 210 ms > > Max latency (P100): 3.5 sec > > > > btree_gc_average_duration_ms:381138 > > btree_gc_average_frequency_sec:3834 > > btree_gc_last_sec:60668 > > btree_gc_max_duration_ms:825228 > > bset_tree_stats: > > btree nodes: 144330 > > written sets: 283733 > > unwritten sets: 144329 > > written key bytes: 24993783392 > > unwritten key bytes: 11777400 > > floats: 30936844345385 > > failed: 5776 > > > > After: > > > > Median latency (P50): 25 ms > > Max latency (P100): 0.8 sec > > > > btree_gc_average_duration_ms:622274 > > btree_gc_average_frequency_sec:3518 > > btree_gc_last_sec:8931 > > btree_gc_max_duration_ms:953146 > > bset_tree_stats: > > btree nodes: 175491 > > written sets: 339078 > > unwritten sets: 175488 > > written key bytes: 29821314856 > > unwritten key bytes: 14076504 > > floats: 90520963280544 > > failed: 6462 > > > > The complete latency data is available at [2]. > > > > I will be glad to run further tests to solidify these findings for the inclusion > > of this patch in the coming merge window. Let me know if you'd like me to > > conduct any specific tests. > > Yes, more testing are necessary, from 512 Bytes block size to 1 MiB or > 8MiB block size. 
We need to make sure it won't introduce performance > regression in other workload or circumstances. > > I don't have plan to submit this patch in this merge window, and please don't > push me. For performance improvement change, I prefer the defalt > configuration will cover most of work loads, so more testing and perforamce > data are desired. E.g. the patch you mentioned (commit a14a68b76954 "bcache: > allow allocator to invalidate bucket in gc"), it had been deployed in Easy > Stack product environment for 20+ months before it got merged. > > Thanks. > > > > > [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a14a68b76954e73031ca6399abace17dcb77c17a > > [2[ https://gist.github.com/robert-pang/cc7c88f356293ea6d43103e6e5f9180f > > [snipped] > > -- > Coly Li ^ permalink raw reply [flat|nested] 27+ messages in thread
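For context on why a min heap conversion can slow a hot path: the generic min_heap API dispatches each comparison and swap through function pointers, while an open-coded heap compares bucket priorities inline. The sketch below is a simplified userspace illustration of LRU-style candidate selection with an inline comparison; the bucket layout and priority field are assumptions, and this is not bcache code:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative sketch, not bcache code: conceptually, selecting replacement
 * candidates builds a heap over the bucket array keyed by priority and pops
 * the best candidates to invalidate.  Here the comparison is a direct
 * inline expression; the generic min_heap API instead calls a comparator
 * through a function pointer on every comparison, which is the suspected
 * source of overhead in the reported regression. */
struct bucket { unsigned int prio; };   /* lower prio = better candidate */

static void sift_down(struct bucket *h, size_t n, size_t i)
{
    for (;;) {
        size_t l = 2 * i + 1, r = l + 1, best = i;

        if (l < n && h[l].prio < h[best].prio) best = l;
        if (r < n && h[r].prio < h[best].prio) best = r;
        if (best == i)
            return;
        struct bucket tmp = h[i];
        h[i] = h[best];
        h[best] = tmp;
        i = best;
    }
}

static void heapify(struct bucket *h, size_t n)
{
    for (size_t i = n / 2; i-- > 0;)
        sift_down(h, n, i);
}

/* Pop the best (lowest-prio) bucket; returns its prio. */
static unsigned int heap_pop(struct bucket *h, size_t *n)
{
    unsigned int best = h[0].prio;

    h[0] = h[--*n];
    sift_down(h, *n, 0);
    return best;
}
```

The algorithm is identical either way; only the per-comparison call overhead differs, which matters because invalidate_buckets_lru() and GC run these comparisons millions of times on a large cache.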
* Re: [PATCH 0/1] bcache: reduce front IO latency during GC 2025-05-15 0:58 ` Robert Pang @ 2025-05-15 1:04 ` Robert Pang 2025-05-15 7:12 ` Kuan-Wei Chiu 2025-07-03 15:28 ` Robert Pang 2 siblings, 0 replies; 27+ messages in thread From: Robert Pang @ 2025-05-15 1:04 UTC (permalink / raw) To: Coly Li; +Cc: Coly Li, Kent Overstreet, linux-bcache, Mingzhe Zou, Kuan-Wei Chiu Here is the link to the changes to bcache and invalidate_buckets_lru() specifically. https://lore.kernel.org/linux-bcache/20240524152958.919343-16-visitorckw@gmail.com/T/#u On Wed, May 14, 2025 at 5:58 PM Robert Pang <robertpang@google.com> wrote: > > Hi Coly, > > My apologies for the delay in providing this update; comprehensive testing > takes some time to complete. > > As you suggested, I conducted extensive tests for 24 hours against the > latest 6.14.5 Linux kernel, exploring more configurations to get a complete > picture: > > 1. 4KB block size with writethrough mode > 2. 4KB block size with writeback mode (70% dirty) > 3. 1MB block size with writethrough mode > > The detailed results, available at [1], consistently demonstrate that our patch > is effective in significantly reducing latency during garbage collection. This > holds true for both the default writethrough mode and the 70% writeback mode. > As anticipated, with 1MB block sizes, we observed no difference in latency > because the number of btree nodes is much smaller. > > [1] https://gist.github.com/robert-pang/817fa7c11ece99d25aabc0467a9427d8 > > However, during these tests, we've uncovered a new and distinct latency problem > that appears to be introduced in the recent Linux kernel. This issue manifests > as frequent and periodic latency spikes that occur outside of garbage > collection. 
> Below is a snippet of the latency data illustrating this: > > time (s) median (ms) max (ms) > 60810 2.28 679.37 > 60840 2.32 2,434.24 * > 60870 2.46 2,434.24 * > 60900 2.52 2,434.24 * > 60930 2.63 566.15 > 60960 2.82 566.15 > 60990 2.82 566.15 > 61020 2.78 471.79 > 61050 2.93 2,028.54 * > 61080 3.11 2,028.54 * > 61110 3.29 2,028.54 * > 61140 3.42 679.37 > 61170 3.42 679.37 > 61200 3.41 679.37 > 61230 3.30 566.15 > 61260 2.93 1,690.45 * > 61290 2.75 1,690.45 * > 61320 2.72 1,690.45 * > 61350 2.88 1,408.71 * > 61380 5.07 1,408.71 * > 61410 107.94 1,408.71 ** > 61440 65.28 1,408.71 ** > 61470 45.41 2,028.54 ** > 61500 72.45 2,028.54 ** > 61530 55.37 2,028.54 ** > 61560 40.73 1,408.71 ** > 61590 11.48 1,690.45 ** > 61620 2.92 1,690.45 * > 61650 2.54 1,690.45 * > 61680 2.58 679.37 > 61710 2.78 679.37 > > ** garbage collection > * cache replacement > > Based on the consistent periodicity of these spikes, we deduce that they are > linked to the invalidate_buckets_lru() function during cache replacement. This > function was recently modified to use min heap operations [2]. To confirm our > hypothesis, we reverted the relevant commits and re-ran the tests. Results show > that the latency spikes completely disappeared, positively confirming that the > min heap changes introduce this regression. Furthermore, these changes also > reduce the effectiveness of our GC patch. It appears that the min heap changes > reduce heap sort speed somehow in invalidate_buckets_lr() and in GC. > > [2] https://lore.kernel.org/linux-bcache/ZxzkLJmhn3a%2F1ALQ@visitorckw-System-Product-Name/T/#m0dd24ba0c63615de465d3fec72dc73febb0f7a94 > > You may download the full test result data from these links. 
> > https://gist.github.com/robert-pang/5df1d595ee77756c0a01d6479bdf8e34#file-bcache-latency-4kb-no-patch-csv > https://gist.github.com/robert-pang/5df1d595ee77756c0a01d6479bdf8e34#file-bcache-latency-4kb-with-patch-csv > https://gist.github.com/robert-pang/bcc26a3aa90dc95a083799cf4fd48116#file-bcache-latency-4kb-wb-no-patch-csv > https://gist.github.com/robert-pang/bcc26a3aa90dc95a083799cf4fd48116#file-bcache-latency-4kb-wb-with-patch-csv > https://gist.github.com/robert-pang/7036b06b66c8de7e958cdbddcd92a3f5#file-bcache-latency-1mb-no-patch-csv > https://gist.github.com/robert-pang/7036b06b66c8de7e958cdbddcd92a3f5#file-bcache-latency-1mb-with-patch-csv > https://gist.github.com/robert-pang/40f90afdea2d2a8c3f6e22ff959eff03#file-bcache-latency-4kb-no-patch-min-heap-reverted-csv > https://gist.github.com/robert-pang/40f90afdea2d2a8c3f6e22ff959eff03#file-bcache-latency-4kb-with-patch-min-heap-reverted-csv > > Best regards > Robert Pang > > On Sat, May 3, 2025 at 10:33 AM Coly Li <colyli@kernel.org> wrote: > > > > On Thu, May 01, 2025 at 06:01:09PM +0800, Robert Pang wrote: > > > Hi Coly, > > > > > > Please disregard the test results I shared over a week ago. After digging > > > deeper into the recent latency spikes with various workloads and by > > > instrumenting the garbage collector, I realized that the earlier GC latency > > > patch, "bcache: allow allocator to invalidate bucket in gc" [1], wasn't > > > backported to the Linux 6.6 branch I tested my patch against. This omission > > > explains the much higher latency observed during the extended test because the > > > allocator was blocked for the entire GC. My sincere apologies for the > > > inconsistent results and any confusion this has caused. > > > > > > > Did you also backport commit 05356938a4be ("bcache: call force_wake_up_gc() > > if necessary in check_should_bypass()") ? 
Last time, when you pushed me to > > add commit a14a68b76954 into the mainline kernel, I found a regression from this > > patch during testing and fixed it. Please add this fix if you didn't, otherwise the testing > > might not be complete. > > > > > > > With patch [1] back-patched and after a 24-hour re-test, the fio results clearly > > > demonstrate that this patch effectively reduces front IO latency during GC due > > > to the smaller incremental GC cycles, while the GC duration increase is still > > > well within bounds. > > > > > > > From the performance result in [2], it seems the max latency is reduced, > > but the higher-latency periods are longer. I am not sure whether this is a happy > > result. > > > > Can I have a download link for the whole log? Then I can look at the > > performance numbers more closely. > > > > > Here's a summary of the improved latency: > > > > > > Before: > > > > > > Median latency (P50): 210 ms > > > Max latency (P100): 3.5 sec > > > > > > btree_gc_average_duration_ms:381138 > > > btree_gc_average_frequency_sec:3834 > > > btree_gc_last_sec:60668 > > > btree_gc_max_duration_ms:825228 > > > bset_tree_stats: > > > btree nodes: 144330 > > > written sets: 283733 > > > unwritten sets: 144329 > > > written key bytes: 24993783392 > > > unwritten key bytes: 11777400 > > > floats: 30936844345385 > > > failed: 5776 > > > > > > After: > > > > > > Median latency (P50): 25 ms > > > Max latency (P100): 0.8 sec > > > > > > btree_gc_average_duration_ms:622274 > > > btree_gc_average_frequency_sec:3518 > > > btree_gc_last_sec:8931 > > > btree_gc_max_duration_ms:953146 > > > bset_tree_stats: > > > btree nodes: 175491 > > > written sets: 339078 > > > unwritten sets: 175488 > > > written key bytes: 29821314856 > > > unwritten key bytes: 14076504 > > > floats: 90520963280544 > > > failed: 6462 > > > > > > The complete latency data is available at [2]. 
> > > > > > I will be glad to run further tests to solidify these findings for the inclusion > > > of this patch in the coming merge window. Let me know if you'd like me to > > > conduct any specific tests. > > > > Yes, more testing is necessary, from a 512-byte block size to a 1 MiB or > > 8 MiB block size. We need to make sure it won't introduce a performance > > regression in other workloads or circumstances. > > > > I don't have a plan to submit this patch in this merge window, so please don't > > push me. For a performance improvement change, I prefer that the default > > configuration cover most workloads, so more testing and performance > > data are desired. E.g. the patch you mentioned (commit a14a68b76954 "bcache: > > allow allocator to invalidate bucket in gc") had been deployed in the Easy > > Stack production environment for 20+ months before it got merged. > > > > Thanks. > > > > > > > > [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a14a68b76954e73031ca6399abace17dcb77c17a > > > [2] https://gist.github.com/robert-pang/cc7c88f356293ea6d43103e6e5f9180f > > > > [snipped] > > > > -- > > Coly Li ^ permalink raw reply [flat|nested] 27+ messages in thread
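The selection pattern at the center of this thread — invalidate_buckets_lru() collecting the most invalidation-worthy buckets in a fixed-size heap while scanning, then draining the heap in order — can be sketched in miniature. This is an illustrative, index-based sketch with made-up names (kheap, kheap_offer); bcache's real code works on struct bucket pointers through the byte-offset min_heap API:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical miniature of the bounded-heap selection idea: keep the k
 * highest "priorities" seen so far in a k-sized min-heap (weakest survivor
 * at the root), then drain them smallest-first. */

struct kheap {
	int *data;
	size_t nr, size;
};

static void sift_down_min(struct kheap *h, size_t pos)
{
	for (;;) {
		size_t l = 2 * pos + 1, r = l + 1, s = pos;
		int tmp;

		if (l < h->nr && h->data[l] < h->data[s])
			s = l;
		if (r < h->nr && h->data[r] < h->data[s])
			s = r;
		if (s == pos)
			break;
		tmp = h->data[s]; h->data[s] = h->data[pos]; h->data[pos] = tmp;
		pos = s;
	}
}

/* Offer one candidate: push while the heap has room, otherwise replace the
 * root only if the newcomer beats the current weakest survivor. */
static void kheap_offer(struct kheap *h, int prio)
{
	if (h->nr < h->size) {
		size_t i = h->nr++;

		h->data[i] = prio;
		while (i && h->data[(i - 1) / 2] > h->data[i]) {
			int tmp = h->data[i];

			h->data[i] = h->data[(i - 1) / 2];
			h->data[(i - 1) / 2] = tmp;
			i = (i - 1) / 2;
		}
	} else if (prio > h->data[0]) {
		h->data[0] = prio;
		sift_down_min(h, 0);
	}
}

static int kheap_pop_min(struct kheap *h)
{
	int min = h->data[0];

	h->data[0] = h->data[--h->nr];
	sift_down_min(h, 0);
	return min;
}
```

Feeding priorities 0 through 9 into a 3-slot heap leaves 7, 8, 9, popped smallest-first — the "track the best k, then consume in order" shape of the kernel function, without its locking or bucket bookkeeping.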
* Re: [PATCH 0/1] bcache: reduce front IO latency during GC 2025-05-15 0:58 ` Robert Pang 2025-05-15 1:04 ` Robert Pang @ 2025-05-15 7:12 ` Kuan-Wei Chiu 2025-05-16 3:58 ` Robert Pang 2025-07-03 15:28 ` Robert Pang 2 siblings, 1 reply; 27+ messages in thread From: Kuan-Wei Chiu @ 2025-05-15 7:12 UTC (permalink / raw) To: Robert Pang; +Cc: Coly Li, Coly Li, Kent Overstreet, linux-bcache, Mingzhe Zou Hi Robert, On Wed, May 14, 2025 at 05:58:01PM -0700, Robert Pang wrote: > Hi Coly, > > My apologies for the delay in providing this update; comprehensive testing > takes some time to complete. > > As you suggested, I conducted extensive tests for 24 hours against the > latest 6.14.5 Linux kernel, exploring more configurations to get a complete > picture: > > 1. 4KB block size with writethrough mode > 2. 4KB block size with writeback mode (70% dirty) > 3. 1MB block size with writethrough mode > > The detailed results, available at [1], consistently demonstrate that our patch > is effective in significantly reducing latency during garbage collection. This > holds true for both the default writethrough mode and the 70% writeback mode. > As anticipated, with 1MB block sizes, we observed no difference in latency > because the number of btree nodes is much smaller. > > [1] https://gist.github.com/robert-pang/817fa7c11ece99d25aabc0467a9427d8 > > However, during these tests, we've uncovered a new and distinct latency problem > that appears to be introduced in the recent Linux kernel. This issue manifests > as frequent and periodic latency spikes that occur outside of garbage > collection. 
> Below is a snippet of the latency data illustrating this: > > time (s) median (ms) max (ms) > 60810 2.28 679.37 > 60840 2.32 2,434.24 * > 60870 2.46 2,434.24 * > 60900 2.52 2,434.24 * > 60930 2.63 566.15 > 60960 2.82 566.15 > 60990 2.82 566.15 > 61020 2.78 471.79 > 61050 2.93 2,028.54 * > 61080 3.11 2,028.54 * > 61110 3.29 2,028.54 * > 61140 3.42 679.37 > 61170 3.42 679.37 > 61200 3.41 679.37 > 61230 3.30 566.15 > 61260 2.93 1,690.45 * > 61290 2.75 1,690.45 * > 61320 2.72 1,690.45 * > 61350 2.88 1,408.71 * > 61380 5.07 1,408.71 * > 61410 107.94 1,408.71 ** > 61440 65.28 1,408.71 ** > 61470 45.41 2,028.54 ** > 61500 72.45 2,028.54 ** > 61530 55.37 2,028.54 ** > 61560 40.73 1,408.71 ** > 61590 11.48 1,690.45 ** > 61620 2.92 1,690.45 * > 61650 2.54 1,690.45 * > 61680 2.58 679.37 > 61710 2.78 679.37 > > ** garbage collection > * cache replacement > > Based on the consistent periodicity of these spikes, we deduce that they are > linked to the invalidate_buckets_lru() function during cache replacement. This > function was recently modified to use min heap operations [2]. To confirm our > hypothesis, we reverted the relevant commits and re-ran the tests. Results show > that the latency spikes completely disappeared, positively confirming that the > min heap changes introduce this regression. Furthermore, these changes also > reduce the effectiveness of our GC patch. It appears that the min heap changes > reduce heap sort speed somehow in invalidate_buckets_lr() and in GC. Thank you for reporting this regression and for taking the time to perform such thorough testing. My current hypothesis is that the root cause may be related to the earlier change where the min heap API was converted from inline to non-inline [1]. 
As a result, the comparison function used by the min heap is now invoked via an indirect function call instead of a direct one, which introduces significant overhead - especially when CONFIG_MITIGATION_RETPOLINE is enabled, as the cost of indirect calls can be quite substantial in that case. I understand that running these tests takes considerable time, but could you try the attached diff to see if it addresses the regression? Regards, Kuan-Wei [1]: https://lore.kernel.org/lkml/20241020040200.939973-2-visitorckw@gmail.com/ diff --git a/drivers/md/bcache/alloc.c b/drivers/md/bcache/alloc.c index 8998e61efa40..862e7118f81d 100644 --- a/drivers/md/bcache/alloc.c +++ b/drivers/md/bcache/alloc.c @@ -207,15 +207,15 @@ static void invalidate_buckets_lru(struct cache *ca) if (!bch_can_invalidate_bucket(ca, b)) continue; - if (!min_heap_full(&ca->heap)) - min_heap_push(&ca->heap, &b, &bucket_max_cmp_callback, ca); + if (!min_heap_full_inline(&ca->heap)) + min_heap_push_inline(&ca->heap, &b, &bucket_max_cmp_callback, ca); else if (!new_bucket_max_cmp(&b, min_heap_peek(&ca->heap), ca)) { ca->heap.data[0] = b; - min_heap_sift_down(&ca->heap, 0, &bucket_max_cmp_callback, ca); + min_heap_sift_down_inline(&ca->heap, 0, &bucket_max_cmp_callback, ca); } } - min_heapify_all(&ca->heap, &bucket_min_cmp_callback, ca); + min_heapify_all_inline(&ca->heap, &bucket_min_cmp_callback, ca); while (!fifo_full(&ca->free_inc)) { if (!ca->heap.nr) { @@ -227,8 +227,8 @@ static void invalidate_buckets_lru(struct cache *ca) wake_up_gc(ca->set); return; } - b = min_heap_peek(&ca->heap)[0]; - min_heap_pop(&ca->heap, &bucket_min_cmp_callback, ca); + b = min_heap_peek_inline(&ca->heap)[0]; + min_heap_pop_inline(&ca->heap, &bucket_min_cmp_callback, ca); bch_invalidate_one_bucket(ca, b); } > > [2] https://lore.kernel.org/linux-bcache/ZxzkLJmhn3a%2F1ALQ@visitorckw-System-Product-Name/T/#m0dd24ba0c63615de465d3fec72dc73febb0f7a94 > > You may download the full test result data from these links. 
> > https://gist.github.com/robert-pang/5df1d595ee77756c0a01d6479bdf8e34#file-bcache-latency-4kb-no-patch-csv > https://gist.github.com/robert-pang/5df1d595ee77756c0a01d6479bdf8e34#file-bcache-latency-4kb-with-patch-csv > https://gist.github.com/robert-pang/bcc26a3aa90dc95a083799cf4fd48116#file-bcache-latency-4kb-wb-no-patch-csv > https://gist.github.com/robert-pang/bcc26a3aa90dc95a083799cf4fd48116#file-bcache-latency-4kb-wb-with-patch-csv > https://gist.github.com/robert-pang/7036b06b66c8de7e958cdbddcd92a3f5#file-bcache-latency-1mb-no-patch-csv > https://gist.github.com/robert-pang/7036b06b66c8de7e958cdbddcd92a3f5#file-bcache-latency-1mb-with-patch-csv > https://gist.github.com/robert-pang/40f90afdea2d2a8c3f6e22ff959eff03#file-bcache-latency-4kb-no-patch-min-heap-reverted-csv > https://gist.github.com/robert-pang/40f90afdea2d2a8c3f6e22ff959eff03#file-bcache-latency-4kb-with-patch-min-heap-reverted-csv > > Best regards > Robert Pang > > On Sat, May 3, 2025 at 10:33 AM Coly Li <colyli@kernel.org> wrote: > > > > On Thu, May 01, 2025 at 06:01:09PM +0800, Robert Pang wrote: > > > Hi Coly, > > > > > > Please disregard the test results I shared over a week ago. After digging > > > deeper into the recent latency spikes with various workloads and by > > > instrumenting the garbage collector, I realized that the earlier GC latency > > > patch, "bcache: allow allocator to invalidate bucket in gc" [1], wasn't > > > backported to the Linux 6.6 branch I tested my patch against. This omission > > > explains the much higher latency observed during the extended test because the > > > allocator was blocked for the entire GC. My sincere apologies for the > > > inconsistent results and any confusion this has caused. > > > > > > > Did you also backport commit 05356938a4be ("bcache: call force_wake_up_gc() > > if necessary in check_should_bypass()") ? 
Last time, when you pushed me to > > add commit a14a68b76954 into the mainline kernel, I found a regression from this > > patch during testing and fixed it. Please add this fix if you didn't, otherwise the testing > > might not be complete. > > > > > > > With patch [1] back-patched and after a 24-hour re-test, the fio results clearly > > > demonstrate that this patch effectively reduces front IO latency during GC due > > > to the smaller incremental GC cycles, while the GC duration increase is still > > > well within bounds. > > > > > > > From the performance result in [2], it seems the max latency is reduced, > > but the higher-latency periods are longer. I am not sure whether this is a happy > > result. > > > > Can I have a download link for the whole log? Then I can look at the > > performance numbers more closely. > > > > > Here's a summary of the improved latency: > > > > > > Before: > > > > > > Median latency (P50): 210 ms > > > Max latency (P100): 3.5 sec > > > > > > btree_gc_average_duration_ms:381138 > > > btree_gc_average_frequency_sec:3834 > > > btree_gc_last_sec:60668 > > > btree_gc_max_duration_ms:825228 > > > bset_tree_stats: > > > btree nodes: 144330 > > > written sets: 283733 > > > unwritten sets: 144329 > > > written key bytes: 24993783392 > > > unwritten key bytes: 11777400 > > > floats: 30936844345385 > > > failed: 5776 > > > > > > After: > > > > > > Median latency (P50): 25 ms > > > Max latency (P100): 0.8 sec > > > > > > btree_gc_average_duration_ms:622274 > > > btree_gc_average_frequency_sec:3518 > > > btree_gc_last_sec:8931 > > > btree_gc_max_duration_ms:953146 > > > bset_tree_stats: > > > btree nodes: 175491 > > > written sets: 339078 > > > unwritten sets: 175488 > > > written key bytes: 29821314856 > > > unwritten key bytes: 14076504 > > > floats: 90520963280544 > > > failed: 6462 > > > > > > The complete latency data is available at [2]. 
> > > > > > > I will be glad to run further tests to solidify these findings for the inclusion > > > of this patch in the coming merge window. Let me know if you'd like me to > > > conduct any specific tests. > > > > Yes, more testing is necessary, from a 512-byte block size to a 1 MiB or > > 8 MiB block size. We need to make sure it won't introduce a performance > > regression in other workloads or circumstances. > > > > I don't have a plan to submit this patch in this merge window, so please don't > > push me. For a performance improvement change, I prefer that the default > > configuration cover most workloads, so more testing and performance > > data are desired. E.g. the patch you mentioned (commit a14a68b76954 "bcache: > > allow allocator to invalidate bucket in gc") had been deployed in the Easy > > Stack production environment for 20+ months before it got merged. > > > > Thanks. > > > > > > > > [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a14a68b76954e73031ca6399abace17dcb77c17a > > > [2] https://gist.github.com/robert-pang/cc7c88f356293ea6d43103e6e5f9180f > > > > [snipped] > > > > -- > > Coly Li ^ permalink raw reply related [flat|nested] 27+ messages in thread
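Kuan-Wei's inline hypothesis can be sketched outside the kernel. The two functions below are behaviorally identical min-heap sift-downs; the first reaches its comparator through a function pointer, so every comparison is an indirect call (the overhead he suspects, especially with retpolines enabled), while the second is forced inline so a known comparator compiles to direct, predictable code. Names are illustrative, not the kernel's min_heap API:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Generic flavour: the comparator arrives as a function pointer, so every
 * comparison costs an indirect call. */
static void sift_down_indirect(int *data, size_t nr, size_t pos,
			       int (*less)(const int *, const int *))
{
	for (;;) {
		size_t l = 2 * pos + 1, r = l + 1, s = pos;
		int tmp;

		if (l < nr && less(&data[l], &data[s]))
			s = l;
		if (r < nr && less(&data[r], &data[s]))
			s = r;
		if (s == pos)
			break;
		tmp = data[s]; data[s] = data[pos]; data[pos] = tmp;
		pos = s;
	}
}

static int int_less(const int *a, const int *b)
{
	return *a < *b;
}

/* Inline flavour: expanding the whole walk at the call site lets the
 * compiler see the comparison and emit direct code -- the same reason the
 * *_inline min-heap helpers exist. */
static inline __attribute__((always_inline))
void sift_down_direct(int *data, size_t nr, size_t pos)
{
	for (;;) {
		size_t l = 2 * pos + 1, r = l + 1, s = pos;
		int tmp;

		if (l < nr && data[l] < data[s])
			s = l;
		if (r < nr && data[r] < data[s])
			s = r;
		if (s == pos)
			break;
		tmp = data[s]; data[s] = data[pos]; data[pos] = tmp;
		pos = s;
	}
}
```

Both produce identical heaps; only the call pattern at each comparison differs, which is precisely the cost the proposed *_inline conversion targets.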
* Re: [PATCH 0/1] bcache: reduce front IO latency during GC
2025-05-15 7:12 ` Kuan-Wei Chiu
@ 2025-05-16 3:58 ` Robert Pang
2025-05-16 16:14 ` Kuan-Wei Chiu
0 siblings, 1 reply; 27+ messages in thread
From: Robert Pang @ 2025-05-16 3:58 UTC (permalink / raw)
To: Kuan-Wei Chiu
Cc: Coly Li, Coly Li, Kent Overstreet, linux-bcache, Mingzhe Zou

Hi Kuan-Wei,

Thank you for your prompt response. I tested your suggested patch to inline
the min heap operations for 8 hours and it is still ongoing. Unfortunately,
based on the results so far, it didn't resolve the regression, suggesting
inlining isn't the issue.

After reviewing the commits in lib/min_heap.h, I noticed commit c641722
("lib min_heap: optimize number of comparisons in min_heapify()") and it
looked like a potential candidate. I reverted this commit (below) and ran
the tests. While the test is still ongoing, the results for the past 3 hours
show that the latency spikes during invalidate_buckets_lru() disappeared
after this change, indicating that this commit is likely the root cause of
the regression.

My hypothesis is that while commit c641722 was designed to reduce comparisons
with randomized input [1], it might inadvertently increase comparisons when
the input isn't as random. A scenario where this could happen is within
invalidate_buckets_lru() before the cache is fully populated. In such cases,
many buckets are unfilled, causing new_bucket_prio() to return zero, leading
to more frequent compare-equal operations with other unfilled buckets. Even
when the cache is populated, the bucket priorities fall in a range with many
duplicates. How will heap_sift() behave in such cases?
[1] https://lore.kernel.org/linux-bcache/20240121153649.2733274-6-visitorckw@gmail.com/ Best regards, Robert --- include/linux/min_heap.h | 46 +++++++++++++++++++--------------------- 1 file changed, 22 insertions(+), 24 deletions(-) diff --git a/include/linux/min_heap.h b/include/linux/min_heap.h index 1160bed6579e..2d16e6747e36 100644 --- a/include/linux/min_heap.h +++ b/include/linux/min_heap.h @@ -257,34 +257,32 @@ static __always_inline void __min_heap_sift_down_inline(min_heap_char *heap, int pos, size_t elem_size, const struct min_heap_callbacks *func, void *args) { - const unsigned long lsbit = elem_size & -elem_size; + void *left, *right, *parent, *smallest; void *data = heap->data; void (*swp)(void *lhs, void *rhs, void *args) = func->swp; - /* pre-scale counters for performance */ - size_t a = pos * elem_size; - size_t b, c, d; - size_t n = heap->nr * elem_size; - - if (!swp) - swp = select_swap_func(data, elem_size); - /* Find the sift-down path all the way to the leaves. */ - for (b = a; c = 2 * b + elem_size, (d = c + elem_size) < n;) - b = func->less(data + c, data + d, args) ? c : d; - - /* Special case for the last leaf with no sibling. */ - if (d == n) - b = c; - - /* Backtrack to the correct location. */ - while (b != a && func->less(data + a, data + b, args)) - b = parent(b, lsbit, elem_size); + for (;;) { + if (pos * 2 + 1 >= heap->nr) + break; - /* Shift the element into its correct place. 
*/ - c = b; - while (b != a) { - b = parent(b, lsbit, elem_size); - do_swap(data + b, data + c, elem_size, swp, args); + left = data + ((pos * 2 + 1) * elem_size); + parent = data + (pos * elem_size); + smallest = parent; + if (func->less(left, smallest, args)) + smallest = left; + + if (pos * 2 + 2 < heap->nr) { + right = data + ((pos * 2 + 2) * elem_size); + if (func->less(right, smallest, args)) + smallest = right; + } + if (smallest == parent) + break; + do_swap(smallest, parent, elem_size, swp, args); + if (smallest == left) + pos = (pos * 2) + 1; + else + pos = (pos * 2) + 2; } } -- On Thu, May 15, 2025 at 12:12 AM Kuan-Wei Chiu <visitorckw@gmail.com> wrote: > > Hi Robert, > > On Wed, May 14, 2025 at 05:58:01PM -0700, Robert Pang wrote: > > Hi Coly, > > > > My apologies for the delay in providing this update; comprehensive testing > > takes some time to complete. > > > > As you suggested, I conducted extensive tests for 24 hours against the > > latest 6.14.5 Linux kernel, exploring more configurations to get a complete > > picture: > > > > 1. 4KB block size with writethrough mode > > 2. 4KB block size with writeback mode (70% dirty) > > 3. 1MB block size with writethrough mode > > > > The detailed results, available at [1], consistently demonstrate that our patch > > is effective in significantly reducing latency during garbage collection. This > > holds true for both the default writethrough mode and the 70% writeback mode. > > As anticipated, with 1MB block sizes, we observed no difference in latency > > because the number of btree nodes is much smaller. > > > > [1] https://gist.github.com/robert-pang/817fa7c11ece99d25aabc0467a9427d8 > > > > However, during these tests, we've uncovered a new and distinct latency problem > > that appears to be introduced in the recent Linux kernel. This issue manifests > > as frequent and periodic latency spikes that occur outside of garbage > > collection. 
> > Below is a snippet of the latency data illustrating this: > > > > time (s) median (ms) max (ms) > > 60810 2.28 679.37 > > 60840 2.32 2,434.24 * > > 60870 2.46 2,434.24 * > > 60900 2.52 2,434.24 * > > 60930 2.63 566.15 > > 60960 2.82 566.15 > > 60990 2.82 566.15 > > 61020 2.78 471.79 > > 61050 2.93 2,028.54 * > > 61080 3.11 2,028.54 * > > 61110 3.29 2,028.54 * > > 61140 3.42 679.37 > > 61170 3.42 679.37 > > 61200 3.41 679.37 > > 61230 3.30 566.15 > > 61260 2.93 1,690.45 * > > 61290 2.75 1,690.45 * > > 61320 2.72 1,690.45 * > > 61350 2.88 1,408.71 * > > 61380 5.07 1,408.71 * > > 61410 107.94 1,408.71 ** > > 61440 65.28 1,408.71 ** > > 61470 45.41 2,028.54 ** > > 61500 72.45 2,028.54 ** > > 61530 55.37 2,028.54 ** > > 61560 40.73 1,408.71 ** > > 61590 11.48 1,690.45 ** > > 61620 2.92 1,690.45 * > > 61650 2.54 1,690.45 * > > 61680 2.58 679.37 > > 61710 2.78 679.37 > > > > ** garbage collection > > * cache replacement > > > > Based on the consistent periodicity of these spikes, we deduce that they are > > linked to the invalidate_buckets_lru() function during cache replacement. This > > function was recently modified to use min heap operations [2]. To confirm our > > hypothesis, we reverted the relevant commits and re-ran the tests. Results show > > that the latency spikes completely disappeared, positively confirming that the > > min heap changes introduce this regression. Furthermore, these changes also > > reduce the effectiveness of our GC patch. It appears that the min heap changes > > reduce heap sort speed somehow in invalidate_buckets_lr() and in GC. > > Thank you for reporting this regression and for taking the time to > perform such thorough testing. > > My current hypothesis is that the root cause may be related to the > earlier change where the min heap API was converted from inline to > non-inline [1]. 
As a result, the comparison function used by the min > heap is now invoked via an indirect function call instead of a direct > one, which introduces significant overhead - especially when > CONFIG_MITIGATION_RETPOLINE is enabled, as the cost of indirect calls > can be quite substantial in that case. > > I understand that running these tests takes considerable time, but > could you try the attached diff to see if it addresses the regression? > > Regards, > Kuan-Wei > > [1]: https://lore.kernel.org/lkml/20241020040200.939973-2-visitorckw@gmail.com/ > > diff --git a/drivers/md/bcache/alloc.c b/drivers/md/bcache/alloc.c > index 8998e61efa40..862e7118f81d 100644 > --- a/drivers/md/bcache/alloc.c > +++ b/drivers/md/bcache/alloc.c > @@ -207,15 +207,15 @@ static void invalidate_buckets_lru(struct cache *ca) > if (!bch_can_invalidate_bucket(ca, b)) > continue; > > - if (!min_heap_full(&ca->heap)) > - min_heap_push(&ca->heap, &b, &bucket_max_cmp_callback, ca); > + if (!min_heap_full_inline(&ca->heap)) > + min_heap_push_inline(&ca->heap, &b, &bucket_max_cmp_callback, ca); > else if (!new_bucket_max_cmp(&b, min_heap_peek(&ca->heap), ca)) { > ca->heap.data[0] = b; > - min_heap_sift_down(&ca->heap, 0, &bucket_max_cmp_callback, ca); > + min_heap_sift_down_inline(&ca->heap, 0, &bucket_max_cmp_callback, ca); > } > } > > - min_heapify_all(&ca->heap, &bucket_min_cmp_callback, ca); > + min_heapify_all_inline(&ca->heap, &bucket_min_cmp_callback, ca); > > while (!fifo_full(&ca->free_inc)) { > if (!ca->heap.nr) { > @@ -227,8 +227,8 @@ static void invalidate_buckets_lru(struct cache *ca) > wake_up_gc(ca->set); > return; > } > - b = min_heap_peek(&ca->heap)[0]; > - min_heap_pop(&ca->heap, &bucket_min_cmp_callback, ca); > + b = min_heap_peek_inline(&ca->heap)[0]; > + min_heap_pop_inline(&ca->heap, &bucket_min_cmp_callback, ca); > > bch_invalidate_one_bucket(ca, b); > } > > > > > [2] 
https://lore.kernel.org/linux-bcache/ZxzkLJmhn3a%2F1ALQ@visitorckw-System-Product-Name/T/#m0dd24ba0c63615de465d3fec72dc73febb0f7a94 > > > > You may download the full test result data from these links. > > > > https://gist.github.com/robert-pang/5df1d595ee77756c0a01d6479bdf8e34#file-bcache-latency-4kb-no-patch-csv > > https://gist.github.com/robert-pang/5df1d595ee77756c0a01d6479bdf8e34#file-bcache-latency-4kb-with-patch-csv > > https://gist.github.com/robert-pang/bcc26a3aa90dc95a083799cf4fd48116#file-bcache-latency-4kb-wb-no-patch-csv > > https://gist.github.com/robert-pang/bcc26a3aa90dc95a083799cf4fd48116#file-bcache-latency-4kb-wb-with-patch-csv > > https://gist.github.com/robert-pang/7036b06b66c8de7e958cdbddcd92a3f5#file-bcache-latency-1mb-no-patch-csv > > https://gist.github.com/robert-pang/7036b06b66c8de7e958cdbddcd92a3f5#file-bcache-latency-1mb-with-patch-csv > > https://gist.github.com/robert-pang/40f90afdea2d2a8c3f6e22ff959eff03#file-bcache-latency-4kb-no-patch-min-heap-reverted-csv > > https://gist.github.com/robert-pang/40f90afdea2d2a8c3f6e22ff959eff03#file-bcache-latency-4kb-with-patch-min-heap-reverted-csv > > > > Best regards > > Robert Pang > > > > On Sat, May 3, 2025 at 10:33 AM Coly Li <colyli@kernel.org> wrote: > > > > > > On Thu, May 01, 2025 at 06:01:09PM +0800, Robert Pang wrote: > > > > Hi Coly, > > > > > > > > Please disregard the test results I shared over a week ago. After digging > > > > deeper into the recent latency spikes with various workloads and by > > > > instrumenting the garbage collector, I realized that the earlier GC latency > > > > patch, "bcache: allow allocator to invalidate bucket in gc" [1], wasn't > > > > backported to the Linux 6.6 branch I tested my patch against. This omission > > > > explains the much higher latency observed during the extended test because the > > > > allocator was blocked for the entire GC. My sincere apologies for the > > > > inconsistent results and any confusion this has caused. 
> > > > > > > > > Did you also backport commit 05356938a4be ("bcache: call force_wake_up_gc() > > > if necessary in check_should_bypass()") ? Last time, when you pushed me to > > > add commit a14a68b76954 into the mainline kernel, I found a regression from this > > > patch during testing and fixed it. Please add this fix if you didn't, otherwise the testing > > > might not be complete. > > > > > > > > > With patch [1] back-patched and after a 24-hour re-test, the fio results clearly > > > > demonstrate that this patch effectively reduces front IO latency during GC due > > > > to the smaller incremental GC cycles, while the GC duration increase is still > > > > well within bounds. > > > > > > > > > From the performance result in [2], it seems the max latency is reduced, > > > but the higher-latency periods are longer. I am not sure whether this is a happy > > > result. > > > > > > Can I have a download link for the whole log? Then I can look at the > > > performance numbers more closely. > > > > > > > Here's a summary of the improved latency: > > > > > > > > Before: > > > > > > > > Median latency (P50): 210 ms > > > > Max latency (P100): 3.5 sec > > > > > > > > btree_gc_average_duration_ms:381138 > > > > btree_gc_average_frequency_sec:3834 > > > > btree_gc_last_sec:60668 > > > > btree_gc_max_duration_ms:825228 > > > > bset_tree_stats: > > > > btree nodes: 144330 > > > > written sets: 283733 > > > > unwritten sets: 144329 > > > > written key bytes: 24993783392 > > > > unwritten key bytes: 11777400 > > > > floats: 30936844345385 > > > > failed: 5776 > > > > > > > > After: > > > > > > > > Median latency (P50): 25 ms > > > > Max latency (P100): 0.8 sec > > > > > > > > btree_gc_average_duration_ms:622274 > > > > btree_gc_average_frequency_sec:3518 > > > > btree_gc_last_sec:8931 > > > > btree_gc_max_duration_ms:953146 > > > > bset_tree_stats: > > > > btree nodes: 175491 > > > > written sets: 339078 > > > > unwritten sets: 175488 > > > > written key bytes: 29821314856 > > > > unwritten key 
bytes: 14076504 > > > > floats: 90520963280544 > > > > failed: 6462 > > > > > > > > The complete latency data is available at [2]. > > > > > > > > I will be glad to run further tests to solidify these findings for the inclusion > > > > of this patch in the coming merge window. Let me know if you'd like me to > > > > conduct any specific tests. > > > > > > Yes, more testing is necessary, from a 512-byte block size to a 1 MiB or > > > 8 MiB block size. We need to make sure it won't introduce a performance > > > regression in other workloads or circumstances. > > > > > > I don't have a plan to submit this patch in this merge window, so please don't > > > push me. For a performance improvement change, I prefer that the default > > > configuration cover most workloads, so more testing and performance > > > data are desired. E.g. the patch you mentioned (commit a14a68b76954 "bcache: > > > allow allocator to invalidate bucket in gc") had been deployed in the Easy > > > Stack production environment for 20+ months before it got merged. > > > > > > Thanks. > > > > > > > > > > > [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a14a68b76954e73031ca6399abace17dcb77c17a > > > > [2] https://gist.github.com/robert-pang/cc7c88f356293ea6d43103e6e5f9180f > > > > > > [snipped] > > > > > > -- > > > Coly Li ^ permalink raw reply related [flat|nested] 27+ messages in thread
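Robert's duplicate-priority hypothesis can be checked with a small counting sketch. On an array whose keys are all equal, the classic top-down sift-down stops after its first pair of child comparisons, while the bottom-up strategy of commit c641722 walks the chosen-child path all the way to a leaf before its backtrack comparison fails. This is an index-based model with hypothetical helper names, not the kernel's byte-offset code; because every "less than" test on equal keys is false, the control flow is fully determined and no actual data is needed:

```c
#include <assert.h>
#include <stddef.h>

/* Comparisons made when sifting down the root of a min-heap of nr
 * ALL-EQUAL keys, top-down style: test the two children of the root,
 * find neither smaller, and stop immediately. */
static size_t topdown_cmps_all_equal(size_t nr)
{
	size_t cmps = 0;

	if (nr > 1)
		cmps++;		/* left child vs root: "less" is false */
	if (nr > 2)
		cmps++;		/* right child vs root: also false */
	return cmps;		/* nothing smaller than the root: done */
}

/* Same sift-down, bottom-up style (the commit c641722 strategy): choose
 * a child at every level down to a leaf, then backtrack; with equal keys
 * the backtrack test fails at once, but only after roughly log2(nr)
 * path comparisons have already been spent. */
static size_t bottomup_cmps_all_equal(size_t nr)
{
	size_t a = 0, b, c, d, cmps = 0;

	for (b = a; c = 2 * b + 1, (d = c + 1) < nr;) {
		cmps++;		/* left child vs right child */
		b = d;		/* equal keys: "less" false, take d */
	}
	if (d == nr)
		b = c;		/* last leaf with no sibling */
	if (b != a)
		cmps++;		/* backtrack test, false immediately */
	return cmps;
}
```

For 1,024 equal keys the sketch counts 2 comparisons top-down versus 10 bottom-up — the constant-versus-logarithmic gap Kuan-Wei confirms in his reply (and the bottom-up variant additionally shifts the element clear down to the leaf, swapping at every level).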
* Re: [PATCH 0/1] bcache: reduce front IO latency during GC 2025-05-16 3:58 ` Robert Pang @ 2025-05-16 16:14 ` Kuan-Wei Chiu 2025-05-17 11:02 ` Coly Li 0 siblings, 1 reply; 27+ messages in thread From: Kuan-Wei Chiu @ 2025-05-16 16:14 UTC (permalink / raw) To: Robert Pang; +Cc: Coly Li, Coly Li, Kent Overstreet, linux-bcache, Mingzhe Zou On Thu, May 15, 2025 at 08:58:44PM -0700, Robert Pang wrote: > Hi Kuan-Wei, > > Thank you for your prompt response. I tested your suggested patch to > inline the min heap operations for 8 hours and it is still ongoing. > Unfortunately, based on the results so far, it didn't resolve the > regression, suggesting inlining isn't the issue. > > After reviewing the commits in lib/min_heap.h, I noticed commit > c641722 ("lib min_heap: optimize number of comparisons in > min_heapify()") and it looked like a potential candidate. I reverted > this commit (below) and ran the tests. While the test is still > ongoing, the results for the past 3 hours show that the latency spikes > during invalidate_buckets_lru() disappeared after this change, > indicating that this commit is likely the root cause of the > regression. > > My hypothesis is that while commit c641722 was designed to reduce > comparisons with randomized input [1], it might inadvertently increase > comparisons when the input isn't as random. A scenario where this > could happen is within invalidate_buckets_lru() before the cache is > fully populated. In such cases, many buckets are unfilled, causing > new_bucket_prio() to return zero, leading to more frequent > compare-equal operations with other unfilled buckets. Even when > the cache is populated, the bucket priorities fall in a range with > many duplicates. How will heap_sift() behave in such cases? > > [1] https://lore.kernel.org/linux-bcache/20240121153649.2733274-6-visitorckw@gmail.com/ > You're very likely correct. 
In scenarios where the majority of elements in the heap are identical, the traditional top-down version of heapify finishes after just 2 comparisons. However, with the bottom-up version introduced by that commit, it ends up performing roughly 2 * log₂(n) comparisons in the same case. That said, reverting the commit would increase the number of comparisons by about 2x in cases where all elements in the heap are distinct, which was the original motivation for the change. I'm not entirely sure what the best way would be to fix this regression without negatively impacting the performance of the other use cases. Regards, Kuan-Wei > Best regards, > Robert > > --- > include/linux/min_heap.h | 46 +++++++++++++++++++--------------------- > 1 file changed, 22 insertions(+), 24 deletions(-) > > diff --git a/include/linux/min_heap.h b/include/linux/min_heap.h > index 1160bed6579e..2d16e6747e36 100644 > --- a/include/linux/min_heap.h > +++ b/include/linux/min_heap.h > @@ -257,34 +257,32 @@ static __always_inline > void __min_heap_sift_down_inline(min_heap_char *heap, int pos, size_t > elem_size, > const struct min_heap_callbacks *func, void *args) > { > - const unsigned long lsbit = elem_size & -elem_size; > + void *left, *right, *parent, *smallest; > void *data = heap->data; > void (*swp)(void *lhs, void *rhs, void *args) = func->swp; > - /* pre-scale counters for performance */ > - size_t a = pos * elem_size; > - size_t b, c, d; > - size_t n = heap->nr * elem_size; > - > - if (!swp) > - swp = select_swap_func(data, elem_size); > > - /* Find the sift-down path all the way to the leaves. */ > - for (b = a; c = 2 * b + elem_size, (d = c + elem_size) < n;) > - b = func->less(data + c, data + d, args) ? c : d; > - > - /* Special case for the last leaf with no sibling. */ > - if (d == n) > - b = c; > - > - /* Backtrack to the correct location. 
*/ > - while (b != a && func->less(data + a, data + b, args)) > - b = parent(b, lsbit, elem_size); > + for (;;) { > + if (pos * 2 + 1 >= heap->nr) > + break; > > - /* Shift the element into its correct place. */ > - c = b; > - while (b != a) { > - b = parent(b, lsbit, elem_size); > - do_swap(data + b, data + c, elem_size, swp, args); > + left = data + ((pos * 2 + 1) * elem_size); > + parent = data + (pos * elem_size); > + smallest = parent; > + if (func->less(left, smallest, args)) > + smallest = left; > + > + if (pos * 2 + 2 < heap->nr) { > + right = data + ((pos * 2 + 2) * elem_size); > + if (func->less(right, smallest, args)) > + smallest = right; > + } > + if (smallest == parent) > + break; > + do_swap(smallest, parent, elem_size, swp, args); > + if (smallest == left) > + pos = (pos * 2) + 1; > + else > + pos = (pos * 2) + 2; > } > } > > -- > > On Thu, May 15, 2025 at 12:12 AM Kuan-Wei Chiu <visitorckw@gmail.com> wrote: > > > > Hi Robert, > > > > On Wed, May 14, 2025 at 05:58:01PM -0700, Robert Pang wrote: > > > Hi Coly, > > > > > > My apologies for the delay in providing this update; comprehensive testing > > > takes some time to complete. > > > > > > As you suggested, I conducted extensive tests for 24 hours against the > > > latest 6.14.5 Linux kernel, exploring more configurations to get a complete > > > picture: > > > > > > 1. 4KB block size with writethrough mode > > > 2. 4KB block size with writeback mode (70% dirty) > > > 3. 1MB block size with writethrough mode > > > > > > The detailed results, available at [1], consistently demonstrate that our patch > > > is effective in significantly reducing latency during garbage collection. This > > > holds true for both the default writethrough mode and the 70% writeback mode. > > > As anticipated, with 1MB block sizes, we observed no difference in latency > > > because the number of btree nodes is much smaller. 
> > >
> > > [1] https://gist.github.com/robert-pang/817fa7c11ece99d25aabc0467a9427d8
> > >
> > > However, during these tests, we've uncovered a new and distinct latency problem
> > > that appears to be introduced in the recent Linux kernel. This issue manifests
> > > as frequent and periodic latency spikes that occur outside of garbage
> > > collection. Below is a snippet of the latency data illustrating this:
> > >
> > > time (s)  median (ms)  max (ms)
> > > 60810        2.28        679.37
> > > 60840        2.32      2,434.24 *
> > > 60870        2.46      2,434.24 *
> > > 60900        2.52      2,434.24 *
> > > 60930        2.63        566.15
> > > 60960        2.82        566.15
> > > 60990        2.82        566.15
> > > 61020        2.78        471.79
> > > 61050        2.93      2,028.54 *
> > > 61080        3.11      2,028.54 *
> > > 61110        3.29      2,028.54 *
> > > 61140        3.42        679.37
> > > 61170        3.42        679.37
> > > 61200        3.41        679.37
> > > 61230        3.30        566.15
> > > 61260        2.93      1,690.45 *
> > > 61290        2.75      1,690.45 *
> > > 61320        2.72      1,690.45 *
> > > 61350        2.88      1,408.71 *
> > > 61380        5.07      1,408.71 *
> > > 61410      107.94      1,408.71 **
> > > 61440       65.28      1,408.71 **
> > > 61470       45.41      2,028.54 **
> > > 61500       72.45      2,028.54 **
> > > 61530       55.37      2,028.54 **
> > > 61560       40.73      1,408.71 **
> > > 61590       11.48      1,690.45 **
> > > 61620        2.92      1,690.45 *
> > > 61650        2.54      1,690.45 *
> > > 61680        2.58        679.37
> > > 61710        2.78        679.37
> > >
> > > ** garbage collection
> > > * cache replacement
> > >
> > > Based on the consistent periodicity of these spikes, we deduce that they are
> > > linked to the invalidate_buckets_lru() function during cache replacement. This
> > > function was recently modified to use min heap operations [2]. To confirm our
> > > hypothesis, we reverted the relevant commits and re-ran the tests. Results show
> > > that the latency spikes completely disappeared, confirming that the
> > > min heap changes introduce this regression. Furthermore, these changes also
> > > reduce the effectiveness of our GC patch. It appears that the min heap changes
> > > somehow reduce heap sort speed in invalidate_buckets_lru() and in GC.
> >
> > Thank you for reporting this regression and for taking the time to
> > perform such thorough testing.
> >
> > My current hypothesis is that the root cause may be related to the
> > earlier change where the min heap API was converted from inline to
> > non-inline [1]. As a result, the comparison function used by the min
> > heap is now invoked via an indirect function call instead of a direct
> > one, which introduces significant overhead - especially when
> > CONFIG_MITIGATION_RETPOLINE is enabled, as the cost of indirect calls
> > can be quite substantial in that case.
> >
> > I understand that running these tests takes considerable time, but
> > could you try the attached diff to see if it addresses the regression?
> >
> > Regards,
> > Kuan-Wei
> >
> > [1]: https://lore.kernel.org/lkml/20241020040200.939973-2-visitorckw@gmail.com/
> >
> > diff --git a/drivers/md/bcache/alloc.c b/drivers/md/bcache/alloc.c
> > index 8998e61efa40..862e7118f81d 100644
> > --- a/drivers/md/bcache/alloc.c
> > +++ b/drivers/md/bcache/alloc.c
> > @@ -207,15 +207,15 @@ static void invalidate_buckets_lru(struct cache *ca)
> >  		if (!bch_can_invalidate_bucket(ca, b))
> >  			continue;
> >
> > -		if (!min_heap_full(&ca->heap))
> > -			min_heap_push(&ca->heap, &b, &bucket_max_cmp_callback, ca);
> > +		if (!min_heap_full_inline(&ca->heap))
> > +			min_heap_push_inline(&ca->heap, &b, &bucket_max_cmp_callback, ca);
> >  		else if (!new_bucket_max_cmp(&b, min_heap_peek(&ca->heap), ca)) {
> >  			ca->heap.data[0] = b;
> > -			min_heap_sift_down(&ca->heap, 0, &bucket_max_cmp_callback, ca);
> > +			min_heap_sift_down_inline(&ca->heap, 0, &bucket_max_cmp_callback, ca);
> >  		}
> >  	}
> >
> > -	min_heapify_all(&ca->heap, &bucket_min_cmp_callback, ca);
> > +	min_heapify_all_inline(&ca->heap, &bucket_min_cmp_callback, ca);
> >
> >  	while (!fifo_full(&ca->free_inc)) {
> >  		if (!ca->heap.nr) {
> > @@ -227,8 +227,8 @@ static void invalidate_buckets_lru(struct cache *ca)
> >  			wake_up_gc(ca->set);
> >  			return;
> >  		}
> > -		b = min_heap_peek(&ca->heap)[0];
> > -		min_heap_pop(&ca->heap, &bucket_min_cmp_callback, ca);
> > +		b = min_heap_peek_inline(&ca->heap)[0];
> > +		min_heap_pop_inline(&ca->heap, &bucket_min_cmp_callback, ca);
> >
> >  		bch_invalidate_one_bucket(ca, b);
> >  	}
> >
> > > [2] https://lore.kernel.org/linux-bcache/ZxzkLJmhn3a%2F1ALQ@visitorckw-System-Product-Name/T/#m0dd24ba0c63615de465d3fec72dc73febb0f7a94
> > >
> > > You may download the full test result data from these links.
> > >
> > > https://gist.github.com/robert-pang/5df1d595ee77756c0a01d6479bdf8e34#file-bcache-latency-4kb-no-patch-csv
> > > https://gist.github.com/robert-pang/5df1d595ee77756c0a01d6479bdf8e34#file-bcache-latency-4kb-with-patch-csv
> > > https://gist.github.com/robert-pang/bcc26a3aa90dc95a083799cf4fd48116#file-bcache-latency-4kb-wb-no-patch-csv
> > > https://gist.github.com/robert-pang/bcc26a3aa90dc95a083799cf4fd48116#file-bcache-latency-4kb-wb-with-patch-csv
> > > https://gist.github.com/robert-pang/7036b06b66c8de7e958cdbddcd92a3f5#file-bcache-latency-1mb-no-patch-csv
> > > https://gist.github.com/robert-pang/7036b06b66c8de7e958cdbddcd92a3f5#file-bcache-latency-1mb-with-patch-csv
> > > https://gist.github.com/robert-pang/40f90afdea2d2a8c3f6e22ff959eff03#file-bcache-latency-4kb-no-patch-min-heap-reverted-csv
> > > https://gist.github.com/robert-pang/40f90afdea2d2a8c3f6e22ff959eff03#file-bcache-latency-4kb-with-patch-min-heap-reverted-csv
> > >
> > > Best regards,
> > > Robert Pang
> > >
> > > On Sat, May 3, 2025 at 10:33 AM Coly Li <colyli@kernel.org> wrote:
> > > >
> > > > On Thu, May 01, 2025 at 06:01:09PM +0800, Robert Pang wrote:
> > > > > Hi Coly,
> > > > >
> > > > > Please disregard the test results I shared over a week ago.
> > > > > After digging deeper into the recent latency spikes with various workloads and by
> > > > > instrumenting the garbage collector, I realized that the earlier GC latency
> > > > > patch, "bcache: allow allocator to invalidate bucket in gc" [1], wasn't
> > > > > backported to the Linux 6.6 branch I tested my patch against. This omission
> > > > > explains the much higher latency observed during the extended test, because the
> > > > > allocator was blocked for the entire GC. My sincere apologies for the
> > > > > inconsistent results and any confusion this has caused.
> > > > >
> > > >
> > > > Did you also backport commit 05356938a4be ("bcache: call force_wake_up_gc()
> > > > if necessary in check_should_bypass()")? Last time, when you pushed me to
> > > > add commit a14a68b76954 into the mainline kernel, I found a regression from that
> > > > patch and fixed it. Please add this fix if you didn't; otherwise the testing
> > > > might not be complete.
> > > >
> > > > > With patch [1] back-patched and after a 24-hour re-test, the fio results clearly
> > > > > demonstrate that this patch effectively reduces front IO latency during GC due
> > > > > to the smaller incremental GC cycles, while the GC duration increase is still
> > > > > well within bounds.
> > > > >
> > > >
> > > > From the performance result in [2], it seems the max latency is reduced,
> > > > but the higher-latency periods are longer. I am not sure whether this is a happy
> > > > result.
> > > >
> > > > Can I have a download link for the whole log? Then I can look at the
> > > > performance numbers more closely.
> > > >
> > > > > Here's a summary of the improved latency:
> > > > >
> > > > > Before:
> > > > >
> > > > > Median latency (P50): 210 ms
> > > > > Max latency (P100): 3.5 sec
> > > > >
> > > > > btree_gc_average_duration_ms:381138
> > > > > btree_gc_average_frequency_sec:3834
> > > > > btree_gc_last_sec:60668
> > > > > btree_gc_max_duration_ms:825228
> > > > > bset_tree_stats:
> > > > >   btree nodes:         144330
> > > > >   written sets:        283733
> > > > >   unwritten sets:      144329
> > > > >   written key bytes:   24993783392
> > > > >   unwritten key bytes: 11777400
> > > > >   floats:              30936844345385
> > > > >   failed:              5776
> > > > >
> > > > > After:
> > > > >
> > > > > Median latency (P50): 25 ms
> > > > > Max latency (P100): 0.8 sec
> > > > >
> > > > > btree_gc_average_duration_ms:622274
> > > > > btree_gc_average_frequency_sec:3518
> > > > > btree_gc_last_sec:8931
> > > > > btree_gc_max_duration_ms:953146
> > > > > bset_tree_stats:
> > > > >   btree nodes:         175491
> > > > >   written sets:        339078
> > > > >   unwritten sets:      175488
> > > > >   written key bytes:   29821314856
> > > > >   unwritten key bytes: 14076504
> > > > >   floats:              90520963280544
> > > > >   failed:              6462
> > > > >
> > > > > The complete latency data is available at [2].
> > > > >
> > > > > I will be glad to run further tests to solidify these findings for the inclusion
> > > > > of this patch in the coming merge window. Let me know if you'd like me to
> > > > > conduct any specific tests.
> > > >
> > > > Yes, more testing is necessary, from 512-byte block size to 1 MiB or
> > > > 8 MiB block size. We need to make sure it won't introduce performance
> > > > regression in other workloads or circumstances.
> > > >
> > > > I don't have a plan to submit this patch in this merge window, so please don't
> > > > push me. For a performance-improvement change, I prefer that the default
> > > > configuration cover most workloads, so more testing and performance
> > > > data are desired. E.g. the patch you mentioned (commit a14a68b76954 "bcache:
> > > > allow allocator to invalidate bucket in gc") had been deployed in the Easy
> > > > Stack product environment for 20+ months before it got merged.
> > > >
> > > > Thanks.
> > > >
> > > > >
> > > > > [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a14a68b76954e73031ca6399abace17dcb77c17a
> > > > > [2] https://gist.github.com/robert-pang/cc7c88f356293ea6d43103e6e5f9180f
> > > >
> > > > [snipped]
> > > >
> > > > --
> > > > Coly Li

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: [PATCH 0/1] bcache: reduce front IO latency during GC
  2025-05-16 16:14         ` Kuan-Wei Chiu
@ 2025-05-17 11:02           ` Coly Li
  2025-05-20 11:51             ` Kuan-Wei Chiu
  0 siblings, 1 reply; 27+ messages in thread
From: Coly Li @ 2025-05-17 11:02 UTC (permalink / raw)
To: Kuan-Wei Chiu
Cc: Robert Pang, Coly Li, Kent Overstreet, linux-bcache, Mingzhe Zou

> On May 17, 2025, at 00:14, Kuan-Wei Chiu <visitorckw@gmail.com> wrote:
>
> On Thu, May 15, 2025 at 08:58:44PM -0700, Robert Pang wrote:
>> Hi Kuan-Wei,
>>
>> Thank you for your prompt response. I tested your suggested patch to
>> inline the min heap operations for 8 hours and it is still ongoing.
>> Unfortunately, based on the results so far, it didn't resolve the
>> regression, suggesting inlining isn't the issue.
>>
>> After reviewing the commits in lib/min_heap.h, I noticed commit
>> c641722 ("lib min_heap: optimize number of comparisons in
>> min_heapify()") and it looked like a potential candidate. I reverted
>> this commit (below) and ran the tests. While the test is still
>> ongoing, the results for the past 3 hours show that the latency spikes
>> during invalidate_buckets_lru() disappeared after this change,
>> indicating that this commit is likely the root cause of the
>> regression.
>>
>> My hypothesis is that while commit c641722 was designed to reduce
>> comparisons with randomized input [1], it might inadvertently increase
>> comparisons when the input isn't as random. A scenario where this
>> could happen is within invalidate_buckets_lru() before the cache is
>> fully populated. In such cases, many buckets are unfilled, causing
>> new_bucket_prio() to return zero, leading to more frequent
>> compare-equal operations with other unfilled buckets. In the case when
>> the cache is populated, the bucket priorities fall in a range with
>> many duplicates. How will heap_sift() behave in such cases?
>>
>> [1] https://lore.kernel.org/linux-bcache/20240121153649.2733274-6-visitorckw@gmail.com/
>>
>
> You're very likely correct.
>
> In scenarios where the majority of elements in the heap are identical,
> the traditional top-down version of heapify finishes after just 2
> comparisons. However, with the bottom-up version introduced by that
> commit, it ends up performing roughly 2 * log₂(n) comparisons in the
> same case.

In the ideal bcache scenario, for best performance the cached data and
subsequent requests should have spatial or temporal locality.

I guess this means that for the heap usage, the input might not be
typically random.

> That said, reverting the commit would increase the number of
> comparisons by about 2x in cases where all elements in the heap are
> distinct, which was the original motivation for the change. I'm not
> entirely sure what the best way would be to fix this regression without
> negatively impacting the performance of the other use cases.

If the data read pattern is fully sequential or fully random, bcache
cannot help much.

So I guess maybe we still need the old heapify code? The new version is
for fully random input, and the previous version for input that is not
that random.

Thanks.
Coly Li

[snipped]

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: [PATCH 0/1] bcache: reduce front IO latency during GC
  2025-05-17 11:02           ` Coly Li
@ 2025-05-20 11:51             ` Kuan-Wei Chiu
  2025-05-20 12:13               ` Coly Li
  0 siblings, 1 reply; 27+ messages in thread
From: Kuan-Wei Chiu @ 2025-05-20 11:51 UTC (permalink / raw)
To: Coly Li
Cc: Robert Pang, Coly Li, Kent Overstreet, linux-bcache, Mingzhe Zou

On Sat, May 17, 2025 at 07:02:06PM +0800, Coly Li wrote:
>
> > On May 17, 2025, at 00:14, Kuan-Wei Chiu <visitorckw@gmail.com> wrote:
> >
> > On Thu, May 15, 2025 at 08:58:44PM -0700, Robert Pang wrote:
> >> [snipped]
> >
> > You're very likely correct.
> >
> > In scenarios where the majority of elements in the heap are identical,
> > the traditional top-down version of heapify finishes after just 2
> > comparisons. However, with the bottom-up version introduced by that
> > commit, it ends up performing roughly 2 * log₂(n) comparisons in the
> > same case.
>
> In the ideal bcache scenario, for best performance the cached data and
> subsequent requests should have spatial or temporal locality.
>
> I guess this means that for the heap usage, the input might not be
> typically random.
>
> > That said, reverting the commit would increase the number of
> > comparisons by about 2x in cases where all elements in the heap are
> > distinct, which was the original motivation for the change. I'm not
> > entirely sure what the best way would be to fix this regression without
> > negatively impacting the performance of the other use cases.
>
> If the data read pattern is fully sequential or fully random, bcache
> cannot help much.
>
> So I guess maybe we still need the old heapify code? The new version is
> for fully random input, and the previous version for input that is not
> that random.

I think we have two options here. One is to add a classic heapify
function to min_heap.h, allowing users to choose based on whether they
expect many duplicate elements in the heap. While having two heapify
variants might be confusing from a library design perspective, we could
mitigate that with clear kernel-doc comments.

The other option is to revert to the old bcache heap code. I'm not sure
which approach is better.

Regards,
Kuan-Wei

> Thanks.
> > Coly Li > > > > > > Regards, > > Kuan-Wei > > > >> Best regards, > >> Robert > >> > >> --- > >> include/linux/min_heap.h | 46 +++++++++++++++++++--------------------- > >> 1 file changed, 22 insertions(+), 24 deletions(-) > >> > >> diff --git a/include/linux/min_heap.h b/include/linux/min_heap.h > >> index 1160bed6579e..2d16e6747e36 100644 > >> --- a/include/linux/min_heap.h > >> +++ b/include/linux/min_heap.h > >> @@ -257,34 +257,32 @@ static __always_inline > >> void __min_heap_sift_down_inline(min_heap_char *heap, int pos, size_t > >> elem_size, > >> const struct min_heap_callbacks *func, void *args) > >> { > >> - const unsigned long lsbit = elem_size & -elem_size; > >> + void *left, *right, *parent, *smallest; > >> void *data = heap->data; > >> void (*swp)(void *lhs, void *rhs, void *args) = func->swp; > >> - /* pre-scale counters for performance */ > >> - size_t a = pos * elem_size; > >> - size_t b, c, d; > >> - size_t n = heap->nr * elem_size; > >> - > >> - if (!swp) > >> - swp = select_swap_func(data, elem_size); > >> > >> - /* Find the sift-down path all the way to the leaves. */ > >> - for (b = a; c = 2 * b + elem_size, (d = c + elem_size) < n;) > >> - b = func->less(data + c, data + d, args) ? c : d; > >> - > >> - /* Special case for the last leaf with no sibling. */ > >> - if (d == n) > >> - b = c; > >> - > >> - /* Backtrack to the correct location. */ > >> - while (b != a && func->less(data + a, data + b, args)) > >> - b = parent(b, lsbit, elem_size); > >> + for (;;) { > >> + if (pos * 2 + 1 >= heap->nr) > >> + break; > >> > >> - /* Shift the element into its correct place. 
*/ > >> - c = b; > >> - while (b != a) { > >> - b = parent(b, lsbit, elem_size); > >> - do_swap(data + b, data + c, elem_size, swp, args); > >> + left = data + ((pos * 2 + 1) * elem_size); > >> + parent = data + (pos * elem_size); > >> + smallest = parent; > >> + if (func->less(left, smallest, args)) > >> + smallest = left; > >> + > >> + if (pos * 2 + 2 < heap->nr) { > >> + right = data + ((pos * 2 + 2) * elem_size); > >> + if (func->less(right, smallest, args)) > >> + smallest = right; > >> + } > >> + if (smallest == parent) > >> + break; > >> + do_swap(smallest, parent, elem_size, swp, args); > >> + if (smallest == left) > >> + pos = (pos * 2) + 1; > >> + else > >> + pos = (pos * 2) + 2; > >> } > >> } > >> > >> -- > >> > >> On Thu, May 15, 2025 at 12:12 AM Kuan-Wei Chiu <visitorckw@gmail.com> wrote: > >>> > >>> Hi Robert, > >>> > >>> On Wed, May 14, 2025 at 05:58:01PM -0700, Robert Pang wrote: > >>>> Hi Coly, > >>>> > >>>> My apologies for the delay in providing this update; comprehensive testing > >>>> takes some time to complete. > >>>> > >>>> As you suggested, I conducted extensive tests for 24 hours against the > >>>> latest 6.14.5 Linux kernel, exploring more configurations to get a complete > >>>> picture: > >>>> > >>>> 1. 4KB block size with writethrough mode > >>>> 2. 4KB block size with writeback mode (70% dirty) > >>>> 3. 1MB block size with writethrough mode > >>>> > >>>> The detailed results, available at [1], consistently demonstrate that our patch > >>>> is effective in significantly reducing latency during garbage collection. This > >>>> holds true for both the default writethrough mode and the 70% writeback mode. > >>>> As anticipated, with 1MB block sizes, we observed no difference in latency > >>>> because the number of btree nodes is much smaller. 
> >>>> > >>>> [1] https://gist.github.com/robert-pang/817fa7c11ece99d25aabc0467a9427d8 > >>>> > >>>> However, during these tests, we've uncovered a new and distinct latency problem > >>>> that appears to be introduced in the recent Linux kernel. This issue manifests > >>>> as frequent and periodic latency spikes that occur outside of garbage > >>>> collection. > >>>> Below is a snippet of the latency data illustrating this: > >>>> > >>>> time (s) median (ms) max (ms) > >>>> 60810 2.28 679.37 > >>>> 60840 2.32 2,434.24 * > >>>> 60870 2.46 2,434.24 * > >>>> 60900 2.52 2,434.24 * > >>>> 60930 2.63 566.15 > >>>> 60960 2.82 566.15 > >>>> 60990 2.82 566.15 > >>>> 61020 2.78 471.79 > >>>> 61050 2.93 2,028.54 * > >>>> 61080 3.11 2,028.54 * > >>>> 61110 3.29 2,028.54 * > >>>> 61140 3.42 679.37 > >>>> 61170 3.42 679.37 > >>>> 61200 3.41 679.37 > >>>> 61230 3.30 566.15 > >>>> 61260 2.93 1,690.45 * > >>>> 61290 2.75 1,690.45 * > >>>> 61320 2.72 1,690.45 * > >>>> 61350 2.88 1,408.71 * > >>>> 61380 5.07 1,408.71 * > >>>> 61410 107.94 1,408.71 ** > >>>> 61440 65.28 1,408.71 ** > >>>> 61470 45.41 2,028.54 ** > >>>> 61500 72.45 2,028.54 ** > >>>> 61530 55.37 2,028.54 ** > >>>> 61560 40.73 1,408.71 ** > >>>> 61590 11.48 1,690.45 ** > >>>> 61620 2.92 1,690.45 * > >>>> 61650 2.54 1,690.45 * > >>>> 61680 2.58 679.37 > >>>> 61710 2.78 679.37 > >>>> > >>>> ** garbage collection > >>>> * cache replacement > >>>> > >>>> Based on the consistent periodicity of these spikes, we deduce that they are > >>>> linked to the invalidate_buckets_lru() function during cache replacement. This > >>>> function was recently modified to use min heap operations [2]. To confirm our > >>>> hypothesis, we reverted the relevant commits and re-ran the tests. Results show > >>>> that the latency spikes completely disappeared, positively confirming that the > >>>> min heap changes introduce this regression. Furthermore, these changes also > >>>> reduce the effectiveness of our GC patch. 
It appears that the min heap changes > >>>> reduce heap sort speed somehow in invalidate_buckets_lru() and in GC. > >>> > >>> Thank you for reporting this regression and for taking the time to > >>> perform such thorough testing. > >>> > >>> My current hypothesis is that the root cause may be related to the > >>> earlier change where the min heap API was converted from inline to > >>> non-inline [1]. As a result, the comparison function used by the min > >>> heap is now invoked via an indirect function call instead of a direct > >>> one, which introduces significant overhead - especially when > >>> CONFIG_MITIGATION_RETPOLINE is enabled, as the cost of indirect calls > >>> can be quite substantial in that case. > >>> > >>> I understand that running these tests takes considerable time, but > >>> could you try the attached diff to see if it addresses the regression? > >>> > >>> Regards, > >>> Kuan-Wei > >>> > >>> [1]: https://lore.kernel.org/lkml/20241020040200.939973-2-visitorckw@gmail.com/ > >>> > >>> diff --git a/drivers/md/bcache/alloc.c b/drivers/md/bcache/alloc.c > >>> index 8998e61efa40..862e7118f81d 100644 > >>> --- a/drivers/md/bcache/alloc.c > >>> +++ b/drivers/md/bcache/alloc.c > >>> @@ -207,15 +207,15 @@ static void invalidate_buckets_lru(struct cache *ca) > >>> if (!bch_can_invalidate_bucket(ca, b)) > >>> continue; > >>> > >>> - if (!min_heap_full(&ca->heap)) > >>> - min_heap_push(&ca->heap, &b, &bucket_max_cmp_callback, ca); > >>> + if (!min_heap_full_inline(&ca->heap)) > >>> + min_heap_push_inline(&ca->heap, &b, &bucket_max_cmp_callback, ca); > >>> else if (!new_bucket_max_cmp(&b, min_heap_peek(&ca->heap), ca)) { > >>> ca->heap.data[0] = b; > >>> - min_heap_sift_down(&ca->heap, 0, &bucket_max_cmp_callback, ca); > >>> + min_heap_sift_down_inline(&ca->heap, 0, &bucket_max_cmp_callback, ca); > >>> } > >>> } > >>> > >>> - min_heapify_all(&ca->heap, &bucket_min_cmp_callback, ca); > >>> + min_heapify_all_inline(&ca->heap, &bucket_min_cmp_callback, ca); > >>> 
> >>> while (!fifo_full(&ca->free_inc)) { > >>> if (!ca->heap.nr) { > >>> @@ -227,8 +227,8 @@ static void invalidate_buckets_lru(struct cache *ca) > >>> wake_up_gc(ca->set); > >>> return; > >>> } > >>> - b = min_heap_peek(&ca->heap)[0]; > >>> - min_heap_pop(&ca->heap, &bucket_min_cmp_callback, ca); > >>> + b = min_heap_peek_inline(&ca->heap)[0]; > >>> + min_heap_pop_inline(&ca->heap, &bucket_min_cmp_callback, ca); > >>> > >>> bch_invalidate_one_bucket(ca, b); > >>> } > >>> > >>>> > >>>> [2] https://lore.kernel.org/linux-bcache/ZxzkLJmhn3a%2F1ALQ@visitorckw-System-Product-Name/T/#m0dd24ba0c63615de465d3fec72dc73febb0f7a94 > >>>> > >>>> You may download the full test result data from these links. > >>>> > >>>> https://gist.github.com/robert-pang/5df1d595ee77756c0a01d6479bdf8e34#file-bcache-latency-4kb-no-patch-csv > >>>> https://gist.github.com/robert-pang/5df1d595ee77756c0a01d6479bdf8e34#file-bcache-latency-4kb-with-patch-csv > >>>> https://gist.github.com/robert-pang/bcc26a3aa90dc95a083799cf4fd48116#file-bcache-latency-4kb-wb-no-patch-csv > >>>> https://gist.github.com/robert-pang/bcc26a3aa90dc95a083799cf4fd48116#file-bcache-latency-4kb-wb-with-patch-csv > >>>> https://gist.github.com/robert-pang/7036b06b66c8de7e958cdbddcd92a3f5#file-bcache-latency-1mb-no-patch-csv > >>>> https://gist.github.com/robert-pang/7036b06b66c8de7e958cdbddcd92a3f5#file-bcache-latency-1mb-with-patch-csv > >>>> https://gist.github.com/robert-pang/40f90afdea2d2a8c3f6e22ff959eff03#file-bcache-latency-4kb-no-patch-min-heap-reverted-csv > >>>> https://gist.github.com/robert-pang/40f90afdea2d2a8c3f6e22ff959eff03#file-bcache-latency-4kb-with-patch-min-heap-reverted-csv > >>>> > >>>> Best regards > >>>> Robert Pang > >>>> > >>>> On Sat, May 3, 2025 at 10:33 AM Coly Li <colyli@kernel.org> wrote: > >>>>> > >>>>> On Thu, May 01, 2025 at 06:01:09PM +0800, Robert Pang wrote: > >>>>>> Hi Coly, > >>>>>> > >>>>>> Please disregard the test results I shared over a week ago. 
After digging > >>>>>> deeper into the recent latency spikes with various workloads and by > >>>>>> instrumenting the garbage collector, I realized that the earlier GC latency > >>>>>> patch, "bcache: allow allocator to invalidate bucket in gc" [1], wasn't > >>>>>> backported to the Linux 6.6 branch I tested my patch against. This omission > >>>>>> explains the much higher latency observed during the extended test because the > >>>>>> allocator was blocked for the entire GC. My sincere apologies for the > >>>>>> inconsistent results and any confusion this has caused. > >>>>>> > >>>>> > >>>>> Did you also backport commit 05356938a4be ("bcache: call force_wake_up_gc() > >>>>> if necessary in check_should_bypass()")? Last time, when you pushed me to > >>>>> add commit a14a68b76954 into the mainline kernel, I tested a regression from this > >>>>> patch and fixed it. Please add this fix if you didn't; otherwise the testing > >>>>> might not be complete. > >>>>> > >>>>> > >>>>>> With patch [1] backported and after a 24-hour re-test, the fio results clearly > >>>>>> demonstrate that this patch effectively reduces front IO latency during GC due > >>>>>> to the smaller incremental GC cycles, while the GC duration increase is still > >>>>>> well within bounds. > >>>>>> > >>>>> > >>>>> From the performance results in [2], it seems the max latency is reduced, > >>>>> but the higher-latency periods are longer. I am not sure whether this is a happy > >>>>> result. > >>>>> > >>>>> Can I have a download link for the whole log? Then I can look at the > >>>>> performance numbers more closely. 
> >>>>> > >>>>>> Here's a summary of the improved latency: > >>>>>> > >>>>>> Before: > >>>>>> > >>>>>> Median latency (P50): 210 ms > >>>>>> Max latency (P100): 3.5 sec > >>>>>> > >>>>>> btree_gc_average_duration_ms:381138 > >>>>>> btree_gc_average_frequency_sec:3834 > >>>>>> btree_gc_last_sec:60668 > >>>>>> btree_gc_max_duration_ms:825228 > >>>>>> bset_tree_stats: > >>>>>> btree nodes: 144330 > >>>>>> written sets: 283733 > >>>>>> unwritten sets: 144329 > >>>>>> written key bytes: 24993783392 > >>>>>> unwritten key bytes: 11777400 > >>>>>> floats: 30936844345385 > >>>>>> failed: 5776 > >>>>>> > >>>>>> After: > >>>>>> > >>>>>> Median latency (P50): 25 ms > >>>>>> Max latency (P100): 0.8 sec > >>>>>> > >>>>>> btree_gc_average_duration_ms:622274 > >>>>>> btree_gc_average_frequency_sec:3518 > >>>>>> btree_gc_last_sec:8931 > >>>>>> btree_gc_max_duration_ms:953146 > >>>>>> bset_tree_stats: > >>>>>> btree nodes: 175491 > >>>>>> written sets: 339078 > >>>>>> unwritten sets: 175488 > >>>>>> written key bytes: 29821314856 > >>>>>> unwritten key bytes: 14076504 > >>>>>> floats: 90520963280544 > >>>>>> failed: 6462 > >>>>>> > >>>>>> The complete latency data is available at [2]. > >>>>>> > >>>>>> I will be glad to run further tests to solidify these findings for the inclusion > >>>>>> of this patch in the coming merge window. Let me know if you'd like me to > >>>>>> conduct any specific tests. > >>>>> > >>>>> Yes, more testing is necessary, from 512-byte block sizes to 1 MiB or > >>>>> 8 MiB block sizes. We need to make sure it won't introduce performance > >>>>> regressions in other workloads or circumstances. > >>>>> > >>>>> I don't plan to submit this patch in this merge window, so please don't > >>>>> push me. For a performance improvement change, I prefer that the default > >>>>> configuration covers most workloads, so more testing and performance > >>>>> data are desired. E.g. 
the patch you mentioned (commit a14a68b76954 "bcache: > >>>>> allow allocator to invalidate bucket in gc") had been deployed in the EasyStack > >>>>> product environment for 20+ months before it got merged. > >>>>> > >>>>> Thanks. > >>>>> > >>>>>> > >>>>>> [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a14a68b76954e73031ca6399abace17dcb77c17a > >>>>>> [2] https://gist.github.com/robert-pang/cc7c88f356293ea6d43103e6e5f9180f > >>>>> > >>>>> [snipped] > >>>>> > >>>>> -- > >>>>> Coly Li > > ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 0/1] bcache: reduce front IO latency during GC 2025-05-20 11:51 ` Kuan-Wei Chiu @ 2025-05-20 12:13 ` Coly Li 2025-05-20 12:26 ` Kuan-Wei Chiu 0 siblings, 1 reply; 27+ messages in thread From: Coly Li @ 2025-05-20 12:13 UTC (permalink / raw) To: Kuan-Wei Chiu, Kent Overstreet Cc: Robert Pang, Coly Li, linux-bcache, Mingzhe Zou > 2025年5月20日 19:51,Kuan-Wei Chiu <visitorckw@gmail.com> 写道: > > On Sat, May 17, 2025 at 07:02:06PM +0800, Coly Li wrote: >> >> >>> 2025年5月17日 00:14,Kuan-Wei Chiu <visitorckw@gmail.com> 写道: >>> >>> On Thu, May 15, 2025 at 08:58:44PM -0700, Robert Pang wrote: >>>> Hi Kuan-Wei, >>>> >>>> Thank you for your prompt response. I tested your suggested patch to >>>> inline the min heap operations for 8 hours and it is still ongoing. >>>> Unfortunately, basing on the results so far, it didn't resolve the >>>> regression, suggesting inlining isn't the issue. >>>> >>>> After reviewing the commits in lib/min_heap.h, I noticed commit >>>> c641722 ("lib min_heap: optimize number of comparisons in >>>> min_heapify()") and it looked like a potential candidate. I reverted >>>> this commit (below) and ran the tests. While the test is still >>>> ongoing, the results for the past 3 hours show that the latency spikes >>>> during invalidate_buckets_lru() disappeared after this change, >>>> indicating that this commit is likely the root cause of the >>>> regression. >>>> >>>> My hypothesis is that while commit c641722 was designed to reduce >>>> comparisons with randomized input [1], it might inadvertently increase >>>> comparisons when the input isn't as random. A scenario where this >>>> could happen is within invalidate_buckets_lru() before the cache is >>>> fully populated. In such cases, many buckets are unfilled, causing >>>> new_bucket_prio() to return zero, leading to more frequent >>>> compare-equal operations with other unfilled buckets. 
In the case when >>>> the cache is populated, the bucket priorities fall in a range with >>>> many duplicates. How will heap_sift() behave in such cases? >>>> >>>> [1] https://lore.kernel.org/linux-bcache/20240121153649.2733274-6-visitorckw@gmail.com/ >>>> >>> >>> You're very likely correct. >>> >>> In scenarios where the majority of elements in the heap are identical, >>> the traditional top-down version of heapify finishes after just 2 >>> comparisons. However, with the bottom-up version introduced by that >>> commit, it ends up performing roughly 2 * log₂(n) comparisons in the >>> same case. >> >> For bcache scenario for ideal circumstances and best performance, the cached data >> and following requests should have spatial or temporal locality. >> >> I guess it means for the heap usage, the input might not be typical random. >> >> >>> >>> That said, reverting the commit would increase the number of >>> comparisons by about 2x in cases where all elements in the heap are >>> distinct, which was the original motivation for the change. I'm not >>> entirely sure what the best way would be to fix this regression without >>> negatively impacting the performance of the other use cases. >> >> If the data read model are fully sequential or random, bcache cannot help too much. >> >> So I guess maybe we still need to old heapify code? The new version is for full random input, >> and previous version for not that much random input. >> > > I think we have two options here. One is to add a classic heapify > function to min_heap.h, allowing users to choose based on whether they > expect many duplicate elements in the heap. While having two heapify > variants might be confusing from a library design perspective, we could > mitigate that with clear kernel-doc comments. The other option is to > revert to the old bcache heap code. I'm not sure which approach is > better. > I prefer to have two min_heap APIs, but how to name them, this is a question from me. 
Also, if the full-random min_heap version has no user in the kernel, whether to keep it in the kernel is also a question. Kent, Could you please offer your opinion? Thanks. Coly Li ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 0/1] bcache: reduce front IO latency during GC 2025-05-20 12:13 ` Coly Li @ 2025-05-20 12:26 ` Kuan-Wei Chiu 2025-05-20 13:13 ` Coly Li 0 siblings, 1 reply; 27+ messages in thread From: Kuan-Wei Chiu @ 2025-05-20 12:26 UTC (permalink / raw) To: Coly Li; +Cc: Kent Overstreet, Robert Pang, Coly Li, linux-bcache, Mingzhe Zou On Tue, May 20, 2025 at 08:13:47PM +0800, Coly Li wrote: > > > > 2025年5月20日 19:51,Kuan-Wei Chiu <visitorckw@gmail.com> 写道: > > > > On Sat, May 17, 2025 at 07:02:06PM +0800, Coly Li wrote: > >> > >> > >>> 2025年5月17日 00:14,Kuan-Wei Chiu <visitorckw@gmail.com> 写道: > >>> > >>> On Thu, May 15, 2025 at 08:58:44PM -0700, Robert Pang wrote: > >>>> Hi Kuan-Wei, > >>>> > >>>> Thank you for your prompt response. I tested your suggested patch to > >>>> inline the min heap operations for 8 hours and it is still ongoing. > >>>> Unfortunately, basing on the results so far, it didn't resolve the > >>>> regression, suggesting inlining isn't the issue. > >>>> > >>>> After reviewing the commits in lib/min_heap.h, I noticed commit > >>>> c641722 ("lib min_heap: optimize number of comparisons in > >>>> min_heapify()") and it looked like a potential candidate. I reverted > >>>> this commit (below) and ran the tests. While the test is still > >>>> ongoing, the results for the past 3 hours show that the latency spikes > >>>> during invalidate_buckets_lru() disappeared after this change, > >>>> indicating that this commit is likely the root cause of the > >>>> regression. > >>>> > >>>> My hypothesis is that while commit c641722 was designed to reduce > >>>> comparisons with randomized input [1], it might inadvertently increase > >>>> comparisons when the input isn't as random. A scenario where this > >>>> could happen is within invalidate_buckets_lru() before the cache is > >>>> fully populated. 
In such cases, many buckets are unfilled, causing > >>>> new_bucket_prio() to return zero, leading to more frequent > >>>> compare-equal operations with other unfilled buckets. In the case when > >>>> the cache is populated, the bucket priorities fall in a range with > >>>> many duplicates. How will heap_sift() behave in such cases? > >>>> > >>>> [1] https://lore.kernel.org/linux-bcache/20240121153649.2733274-6-visitorckw@gmail.com/ > >>>> > >>> > >>> You're very likely correct. > >>> > >>> In scenarios where the majority of elements in the heap are identical, > >>> the traditional top-down version of heapify finishes after just 2 > >>> comparisons. However, with the bottom-up version introduced by that > >>> commit, it ends up performing roughly 2 * log₂(n) comparisons in the > >>> same case. > >> > >> For bcache scenario for ideal circumstances and best performance, the cached data > >> and following requests should have spatial or temporal locality. > >> > >> I guess it means for the heap usage, the input might not be typical random. > >> > >> > >>> > >>> That said, reverting the commit would increase the number of > >>> comparisons by about 2x in cases where all elements in the heap are > >>> distinct, which was the original motivation for the change. I'm not > >>> entirely sure what the best way would be to fix this regression without > >>> negatively impacting the performance of the other use cases. > >> > >> If the data read model are fully sequential or random, bcache cannot help too much. > >> > >> So I guess maybe we still need to old heapify code? The new version is for full random input, > >> and previous version for not that much random input. > >> > > > > I think we have two options here. One is to add a classic heapify > > function to min_heap.h, allowing users to choose based on whether they > > expect many duplicate elements in the heap. 
While having two heapify > > variants might be confusing from a library design perspective, we could > > mitigate that with clear kernel-doc comments. The other option is to > > revert to the old bcache heap code. I'm not sure which approach is > > better. > > > > I prefer to have two min_heap APIs, but how to name them, this is a question from me. > > Also if the full-random min_heap version has no user in kernel, whether to keep it in kernel also is a question. From the perspective of the number of comparisons in heapify, what matters more is whether the data contains many equal elements, rather than whether it's truly random. I assume that for most other kernel users, their use cases don't typically involve a large number of equal elements? Regards, Kuan-Wei > > Kent, > Could you please offer your opinion? > > Thanks. > > Coly Li > ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 0/1] bcache: reduce front IO latency during GC 2025-05-20 12:26 ` Kuan-Wei Chiu @ 2025-05-20 13:13 ` Coly Li 2025-05-21 14:40 ` Kuan-Wei Chiu 0 siblings, 1 reply; 27+ messages in thread From: Coly Li @ 2025-05-20 13:13 UTC (permalink / raw) To: Kuan-Wei Chiu Cc: Kent Overstreet, Robert Pang, Coly Li, linux-bcache, Mingzhe Zou > 2025年5月20日 20:26,Kuan-Wei Chiu <visitorckw@gmail.com> 写道: > > On Tue, May 20, 2025 at 08:13:47PM +0800, Coly Li wrote: >> >> >>> 2025年5月20日 19:51,Kuan-Wei Chiu <visitorckw@gmail.com> 写道: >>> >>> On Sat, May 17, 2025 at 07:02:06PM +0800, Coly Li wrote: >>>> >>>> >>>>> 2025年5月17日 00:14,Kuan-Wei Chiu <visitorckw@gmail.com> 写道: >>>>> >>>>> On Thu, May 15, 2025 at 08:58:44PM -0700, Robert Pang wrote: >>>>>> Hi Kuan-Wei, >>>>>> >>>>>> Thank you for your prompt response. I tested your suggested patch to >>>>>> inline the min heap operations for 8 hours and it is still ongoing. >>>>>> Unfortunately, basing on the results so far, it didn't resolve the >>>>>> regression, suggesting inlining isn't the issue. >>>>>> >>>>>> After reviewing the commits in lib/min_heap.h, I noticed commit >>>>>> c641722 ("lib min_heap: optimize number of comparisons in >>>>>> min_heapify()") and it looked like a potential candidate. I reverted >>>>>> this commit (below) and ran the tests. While the test is still >>>>>> ongoing, the results for the past 3 hours show that the latency spikes >>>>>> during invalidate_buckets_lru() disappeared after this change, >>>>>> indicating that this commit is likely the root cause of the >>>>>> regression. >>>>>> >>>>>> My hypothesis is that while commit c641722 was designed to reduce >>>>>> comparisons with randomized input [1], it might inadvertently increase >>>>>> comparisons when the input isn't as random. A scenario where this >>>>>> could happen is within invalidate_buckets_lru() before the cache is >>>>>> fully populated. 
In such cases, many buckets are unfilled, causing >>>>>> new_bucket_prio() to return zero, leading to more frequent >>>>>> compare-equal operations with other unfilled buckets. In the case when >>>>>> the cache is populated, the bucket priorities fall in a range with >>>>>> many duplicates. How will heap_sift() behave in such cases? >>>>>> >>>>>> [1] https://lore.kernel.org/linux-bcache/20240121153649.2733274-6-visitorckw@gmail.com/ >>>>>> >>>>> >>>>> You're very likely correct. >>>>> >>>>> In scenarios where the majority of elements in the heap are identical, >>>>> the traditional top-down version of heapify finishes after just 2 >>>>> comparisons. However, with the bottom-up version introduced by that >>>>> commit, it ends up performing roughly 2 * log₂(n) comparisons in the >>>>> same case. >>>> >>>> For bcache scenario for ideal circumstances and best performance, the cached data >>>> and following requests should have spatial or temporal locality. >>>> >>>> I guess it means for the heap usage, the input might not be typical random. >>>> >>>> >>>>> >>>>> That said, reverting the commit would increase the number of >>>>> comparisons by about 2x in cases where all elements in the heap are >>>>> distinct, which was the original motivation for the change. I'm not >>>>> entirely sure what the best way would be to fix this regression without >>>>> negatively impacting the performance of the other use cases. >>>> >>>> If the data read model are fully sequential or random, bcache cannot help too much. >>>> >>>> So I guess maybe we still need to old heapify code? The new version is for full random input, >>>> and previous version for not that much random input. >>>> >>> >>> I think we have two options here. One is to add a classic heapify >>> function to min_heap.h, allowing users to choose based on whether they >>> expect many duplicate elements in the heap. 
While having two heapify >>> variants might be confusing from a library design perspective, we could >>> mitigate that with clear kernel-doc comments. The other option is to >>> revert to the old bcache heap code. I'm not sure which approach is >>> better. >>> >> >> I prefer to have two min_heap APIs, but how to name them, this is a question from me. >> >> Also if the full-random min_heap version has no user in kernel, whether to keep it in kernel also is a question. > > From the perspective of the number of comparisons in heapify, what > matters more is whether the data contains many equal elements, rather > than whether it's truly random. I assume that for most other kernel > users, their use cases don't typically involve a large number of equal > elements? > Yes, you are right. Maybe dm-vdo also has similar I/O pattern? Deduplication may also have duplicated items in heap I guess. Thanks. >> >> Kent, >> Could you please offer your opinion? >> >> Thanks. >> >> Coly Li ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 0/1] bcache: reduce front IO latency during GC 2025-05-20 13:13 ` Coly Li @ 2025-05-21 14:40 ` Kuan-Wei Chiu 2025-06-06 7:39 ` Robert Pang 0 siblings, 1 reply; 27+ messages in thread From: Kuan-Wei Chiu @ 2025-05-21 14:40 UTC (permalink / raw) To: Coly Li; +Cc: Kent Overstreet, Robert Pang, Coly Li, linux-bcache, Mingzhe Zou On Tue, May 20, 2025 at 09:13:09PM +0800, Coly Li wrote: > > > > 2025年5月20日 20:26,Kuan-Wei Chiu <visitorckw@gmail.com> 写道: > > > > On Tue, May 20, 2025 at 08:13:47PM +0800, Coly Li wrote: > >> > >> > >>> 2025年5月20日 19:51,Kuan-Wei Chiu <visitorckw@gmail.com> 写道: > >>> > >>> On Sat, May 17, 2025 at 07:02:06PM +0800, Coly Li wrote: > >>>> > >>>> > >>>>> 2025年5月17日 00:14,Kuan-Wei Chiu <visitorckw@gmail.com> 写道: > >>>>> > >>>>> On Thu, May 15, 2025 at 08:58:44PM -0700, Robert Pang wrote: > >>>>>> Hi Kuan-Wei, > >>>>>> > >>>>>> Thank you for your prompt response. I tested your suggested patch to > >>>>>> inline the min heap operations for 8 hours and it is still ongoing. > >>>>>> Unfortunately, basing on the results so far, it didn't resolve the > >>>>>> regression, suggesting inlining isn't the issue. > >>>>>> > >>>>>> After reviewing the commits in lib/min_heap.h, I noticed commit > >>>>>> c641722 ("lib min_heap: optimize number of comparisons in > >>>>>> min_heapify()") and it looked like a potential candidate. I reverted > >>>>>> this commit (below) and ran the tests. While the test is still > >>>>>> ongoing, the results for the past 3 hours show that the latency spikes > >>>>>> during invalidate_buckets_lru() disappeared after this change, > >>>>>> indicating that this commit is likely the root cause of the > >>>>>> regression. > >>>>>> > >>>>>> My hypothesis is that while commit c641722 was designed to reduce > >>>>>> comparisons with randomized input [1], it might inadvertently increase > >>>>>> comparisons when the input isn't as random. 
A scenario where this > >>>>>> could happen is within invalidate_buckets_lru() before the cache is > >>>>>> fully populated. In such cases, many buckets are unfilled, causing > >>>>>> new_bucket_prio() to return zero, leading to more frequent > >>>>>> compare-equal operations with other unfilled buckets. In the case when > >>>>>> the cache is populated, the bucket priorities fall in a range with > >>>>>> many duplicates. How will heap_sift() behave in such cases? > >>>>>> > >>>>>> [1] https://lore.kernel.org/linux-bcache/20240121153649.2733274-6-visitorckw@gmail.com/ > >>>>>> > >>>>> > >>>>> You're very likely correct. > >>>>> > >>>>> In scenarios where the majority of elements in the heap are identical, > >>>>> the traditional top-down version of heapify finishes after just 2 > >>>>> comparisons. However, with the bottom-up version introduced by that > >>>>> commit, it ends up performing roughly 2 * log₂(n) comparisons in the > >>>>> same case. > >>>> > >>>> For bcache scenario for ideal circumstances and best performance, the cached data > >>>> and following requests should have spatial or temporal locality. > >>>> > >>>> I guess it means for the heap usage, the input might not be typical random. > >>>> > >>>> > >>>>> > >>>>> That said, reverting the commit would increase the number of > >>>>> comparisons by about 2x in cases where all elements in the heap are > >>>>> distinct, which was the original motivation for the change. I'm not > >>>>> entirely sure what the best way would be to fix this regression without > >>>>> negatively impacting the performance of the other use cases. > >>>> > >>>> If the data read model are fully sequential or random, bcache cannot help too much. > >>>> > >>>> So I guess maybe we still need to old heapify code? The new version is for full random input, > >>>> and previous version for not that much random input. > >>>> > >>> > >>> I think we have two options here. 
One is to add a classic heapify > >>> function to min_heap.h, allowing users to choose based on whether they > >>> expect many duplicate elements in the heap. While having two heapify > >>> variants might be confusing from a library design perspective, we could > >>> mitigate that with clear kernel-doc comments. The other option is to > >>> revert to the old bcache heap code. I'm not sure which approach is > >>> better. > >>> > >> > >> I prefer to have two min_heap APIs, but how to name them, this is a question from me. > >> > >> Also if the full-random min_heap version has no user in kernel, whether to keep it in kernel also is a question. > > > > From the perspective of the number of comparisons in heapify, what > > matters more is whether the data contains many equal elements, rather > > than whether it's truly random. I assume that for most other kernel > > users, their use cases don't typically involve a large number of equal > > elements? > > > > Yes, you are right. Maybe dm-vdo also has similar I/O pattern? > > Deduplication may also have duplicated items in heap I guess. > Thanks for pointing out this potential issue. I'll check with Matthew to confirm. Regards, Kuan-Wei > Thanks. > > > >> > >> Kent, > >> Could you please offer your opinion? > >> > >> Thanks. > >> > >> Coly Li > > ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 0/1] bcache: reduce front IO latency during GC 2025-05-21 14:40 ` Kuan-Wei Chiu @ 2025-06-06 7:39 ` Robert Pang 2025-06-06 12:37 ` Kuan-Wei Chiu 0 siblings, 1 reply; 27+ messages in thread From: Robert Pang @ 2025-06-06 7:39 UTC (permalink / raw) To: Kuan-Wei Chiu Cc: Coly Li, Kent Overstreet, Coly Li, linux-bcache, Mingzhe Zou Hi Kuan-Wei, I'm circling back on our plan to address this regression. Based on our discussions to date, it seems a separate min_heap API that uses the conventional top-down sift-down strategy is the preferred approach. I've prototyped this idea and sent out a patch series [1] to kick off the discussion on this approach. The patch has been running for over 12 hours and is looking promising. The API name reflects only my initial thoughts, and suggestions are welcome. Given that this regression was introduced in Linux 6.11 and affects versions up to 6.15 (including 6.12 LTS), a timely solution will be important. Best regards Robert Pang [1] https://lore.kernel.org/linux-bcache/20250606071959.1685079-1-robertpang@google.com/T/#t On Wed, May 21, 2025 at 7:40 AM Kuan-Wei Chiu <visitorckw@gmail.com> wrote: > > On Tue, May 20, 2025 at 09:13:09PM +0800, Coly Li wrote: > > > > > > > 2025年5月20日 20:26,Kuan-Wei Chiu <visitorckw@gmail.com> 写道: > > > > > > On Tue, May 20, 2025 at 08:13:47PM +0800, Coly Li wrote: > > >> > > >> > > >>> 2025年5月20日 19:51,Kuan-Wei Chiu <visitorckw@gmail.com> 写道: > > >>> > > >>> On Sat, May 17, 2025 at 07:02:06PM +0800, Coly Li wrote: > > >>>> > > >>>> > > >>>>> 2025年5月17日 00:14,Kuan-Wei Chiu <visitorckw@gmail.com> 写道: > > >>>>> > > >>>>> On Thu, May 15, 2025 at 08:58:44PM -0700, Robert Pang wrote: > > >>>>>> Hi Kuan-Wei, > > >>>>>> > > >>>>>> Thank you for your prompt response. I tested your suggested patch to > > >>>>>> inline the min heap operations for 8 hours and it is still ongoing. 
> > >>>>>> Unfortunately, basing on the results so far, it didn't resolve the > > >>>>>> regression, suggesting inlining isn't the issue. > > >>>>>> > > >>>>>> After reviewing the commits in lib/min_heap.h, I noticed commit > > >>>>>> c641722 ("lib min_heap: optimize number of comparisons in > > >>>>>> min_heapify()") and it looked like a potential candidate. I reverted > > >>>>>> this commit (below) and ran the tests. While the test is still > > >>>>>> ongoing, the results for the past 3 hours show that the latency spikes > > >>>>>> during invalidate_buckets_lru() disappeared after this change, > > >>>>>> indicating that this commit is likely the root cause of the > > >>>>>> regression. > > >>>>>> > > >>>>>> My hypothesis is that while commit c641722 was designed to reduce > > >>>>>> comparisons with randomized input [1], it might inadvertently increase > > >>>>>> comparisons when the input isn't as random. A scenario where this > > >>>>>> could happen is within invalidate_buckets_lru() before the cache is > > >>>>>> fully populated. In such cases, many buckets are unfilled, causing > > >>>>>> new_bucket_prio() to return zero, leading to more frequent > > >>>>>> compare-equal operations with other unfilled buckets. In the case when > > >>>>>> the cache is populated, the bucket priorities fall in a range with > > >>>>>> many duplicates. How will heap_sift() behave in such cases? > > >>>>>> > > >>>>>> [1] https://lore.kernel.org/linux-bcache/20240121153649.2733274-6-visitorckw@gmail.com/ > > >>>>>> > > >>>>> > > >>>>> You're very likely correct. > > >>>>> > > >>>>> In scenarios where the majority of elements in the heap are identical, > > >>>>> the traditional top-down version of heapify finishes after just 2 > > >>>>> comparisons. However, with the bottom-up version introduced by that > > >>>>> commit, it ends up performing roughly 2 * log₂(n) comparisons in the > > >>>>> same case. 
> > >>>> > > >>>> For bcache scenario for ideal circumstances and best performance, the cached data > > >>>> and following requests should have spatial or temporal locality. > > >>>> > > >>>> I guess it means for the heap usage, the input might not be typical random. > > >>>> > > >>>> > > >>>>> > > >>>>> That said, reverting the commit would increase the number of > > >>>>> comparisons by about 2x in cases where all elements in the heap are > > >>>>> distinct, which was the original motivation for the change. I'm not > > >>>>> entirely sure what the best way would be to fix this regression without > > >>>>> negatively impacting the performance of the other use cases. > > >>>> > > >>>> If the data read model are fully sequential or random, bcache cannot help too much. > > >>>> > > >>>> So I guess maybe we still need to old heapify code? The new version is for full random input, > > >>>> and previous version for not that much random input. > > >>>> > > >>> > > >>> I think we have two options here. One is to add a classic heapify > > >>> function to min_heap.h, allowing users to choose based on whether they > > >>> expect many duplicate elements in the heap. While having two heapify > > >>> variants might be confusing from a library design perspective, we could > > >>> mitigate that with clear kernel-doc comments. The other option is to > > >>> revert to the old bcache heap code. I'm not sure which approach is > > >>> better. > > >>> > > >> > > >> I prefer to have two min_heap APIs, but how to name them, this is a question from me. > > >> > > >> Also if the full-random min_heap version has no user in kernel, whether to keep it in kernel also is a question. > > > > > > From the perspective of the number of comparisons in heapify, what > > > matters more is whether the data contains many equal elements, rather > > > than whether it's truly random. 
I assume that for most other kernel > > > users, their use cases don't typically involve a large number of equal > > > elements? > > > > > > > Yes, you are right. Maybe dm-vdo also has similar I/O pattern? > > > > Deduplication may also have duplicated items in heap I guess. > > > > Thanks for pointing out this potential issue. > I'll check with Matthew to confirm. > > Regards, > Kuan-Wei > > > Thanks. > > > > > > >> > > >> Kent, > > >> Could you please offer your opinion? > > >> > > >> Thanks. > > >> > > >> Coly Li > > > > ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 0/1] bcache: reduce front IO latency during GC 2025-06-06 7:39 ` Robert Pang @ 2025-06-06 12:37 ` Kuan-Wei Chiu 0 siblings, 0 replies; 27+ messages in thread From: Kuan-Wei Chiu @ 2025-06-06 12:37 UTC (permalink / raw) To: Robert Pang; +Cc: Coly Li, Kent Overstreet, Coly Li, linux-bcache, Mingzhe Zou On Fri, Jun 06, 2025 at 12:39:52AM -0700, Robert Pang wrote: > Hi Kuan-Wei, > > I'm circling back on our plan to address this regression. Based on our > discussions to-date, it seems a separate min_heap API that uses the > conventional top-down sift-down strategy is the preferred approach. > > I've prototyped this idea and sent out a patch series [1] to kick off > the discussion on this approach. The patch has been running for over > 12 hours and is looking promising. The API name is only my initial > thoughts and suggestions are welcome. > > Given that this regression was introduced in Linux 6.11 and affects > versions up to 6.15 (including 6.12 LTS), a timely solution will be > important. > I'm terribly sorry for not submitting the patch to fix this regression earlier. I had something drafted, but I got sick, and the issue unfortunately slipped off my radar and I ended up forgetting about it. 
Regards, Kuan-Wei > Best regress > Robert Pang > > [1] https://lore.kernel.org/linux-bcache/20250606071959.1685079-1-robertpang@google.com/T/#t > > > On Wed, May 21, 2025 at 7:40 AM Kuan-Wei Chiu <visitorckw@gmail.com> wrote: > > > > On Tue, May 20, 2025 at 09:13:09PM +0800, Coly Li wrote: > > > > > > > > > > 2025年5月20日 20:26,Kuan-Wei Chiu <visitorckw@gmail.com> 写道: > > > > > > > > On Tue, May 20, 2025 at 08:13:47PM +0800, Coly Li wrote: > > > >> > > > >> > > > >>> 2025年5月20日 19:51,Kuan-Wei Chiu <visitorckw@gmail.com> 写道: > > > >>> > > > >>> On Sat, May 17, 2025 at 07:02:06PM +0800, Coly Li wrote: > > > >>>> > > > >>>> > > > >>>>> 2025年5月17日 00:14,Kuan-Wei Chiu <visitorckw@gmail.com> 写道: > > > >>>>> > > > >>>>> On Thu, May 15, 2025 at 08:58:44PM -0700, Robert Pang wrote: > > > >>>>>> Hi Kuan-Wei, > > > >>>>>> > > > >>>>>> Thank you for your prompt response. I tested your suggested patch to > > > >>>>>> inline the min heap operations for 8 hours and it is still ongoing. > > > >>>>>> Unfortunately, basing on the results so far, it didn't resolve the > > > >>>>>> regression, suggesting inlining isn't the issue. > > > >>>>>> > > > >>>>>> After reviewing the commits in lib/min_heap.h, I noticed commit > > > >>>>>> c641722 ("lib min_heap: optimize number of comparisons in > > > >>>>>> min_heapify()") and it looked like a potential candidate. I reverted > > > >>>>>> this commit (below) and ran the tests. While the test is still > > > >>>>>> ongoing, the results for the past 3 hours show that the latency spikes > > > >>>>>> during invalidate_buckets_lru() disappeared after this change, > > > >>>>>> indicating that this commit is likely the root cause of the > > > >>>>>> regression. > > > >>>>>> > > > >>>>>> My hypothesis is that while commit c641722 was designed to reduce > > > >>>>>> comparisons with randomized input [1], it might inadvertently increase > > > >>>>>> comparisons when the input isn't as random. 
A scenario where this > > > >>>>>> could happen is within invalidate_buckets_lru() before the cache is > > > >>>>>> fully populated. In such cases, many buckets are unfilled, causing > > > >>>>>> new_bucket_prio() to return zero, leading to more frequent > > > >>>>>> compare-equal operations with other unfilled buckets. In the case when > > > >>>>>> the cache is populated, the bucket priorities fall in a range with > > > >>>>>> many duplicates. How will heap_sift() behave in such cases? > > > >>>>>> > > > >>>>>> [1] https://lore.kernel.org/linux-bcache/20240121153649.2733274-6-visitorckw@gmail.com/ > > > >>>>>> > > > >>>>> > > > >>>>> You're very likely correct. > > > >>>>> > > > >>>>> In scenarios where the majority of elements in the heap are identical, > > > >>>>> the traditional top-down version of heapify finishes after just 2 > > > >>>>> comparisons. However, with the bottom-up version introduced by that > > > >>>>> commit, it ends up performing roughly 2 * log₂(n) comparisons in the > > > >>>>> same case. > > > >>>> > > > >>>> For bcache scenario for ideal circumstances and best performance, the cached data > > > >>>> and following requests should have spatial or temporal locality. > > > >>>> > > > >>>> I guess it means for the heap usage, the input might not be typical random. > > > >>>> > > > >>>> > > > >>>>> > > > >>>>> That said, reverting the commit would increase the number of > > > >>>>> comparisons by about 2x in cases where all elements in the heap are > > > >>>>> distinct, which was the original motivation for the change. I'm not > > > >>>>> entirely sure what the best way would be to fix this regression without > > > >>>>> negatively impacting the performance of the other use cases. > > > >>>> > > > >>>> If the data read model are fully sequential or random, bcache cannot help too much. > > > >>>> > > > >>>> So I guess maybe we still need to old heapify code? 
The new version is for full random input, > > > >>>> and previous version for not that much random input. > > > >>>> > > > >>> > > > >>> I think we have two options here. One is to add a classic heapify > > > >>> function to min_heap.h, allowing users to choose based on whether they > > > >>> expect many duplicate elements in the heap. While having two heapify > > > >>> variants might be confusing from a library design perspective, we could > > > >>> mitigate that with clear kernel-doc comments. The other option is to > > > >>> revert to the old bcache heap code. I'm not sure which approach is > > > >>> better. > > > >>> > > > >> > > > >> I prefer to have two min_heap APIs, but how to name them, this is a question from me. > > > >> > > > >> Also if the full-random min_heap version has no user in kernel, whether to keep it in kernel also is a question. > > > > > > > > From the perspective of the number of comparisons in heapify, what > > > > matters more is whether the data contains many equal elements, rather > > > > than whether it's truly random. I assume that for most other kernel > > > > users, their use cases don't typically involve a large number of equal > > > > elements? > > > > > > > > > > Yes, you are right. Maybe dm-vdo also has similar I/O pattern? > > > > > > Deduplication may also have duplicated items in heap I guess. > > > > > > > Thanks for pointing out this potential issue. > > I'll check with Matthew to confirm. > > > > Regards, > > Kuan-Wei > > > > > Thanks. > > > > > > > > > >> > > > >> Kent, > > > >> Could you please offer your opinion? > > > >> > > > >> Thanks. > > > >> > > > >> Coly Li > > > > > > ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 0/1] bcache: reduce front IO latency during GC
  2025-05-15  0:58 ` Robert Pang
  2025-05-15  1:04 ` Robert Pang
  2025-05-15  7:12 ` Kuan-Wei Chiu
@ 2025-07-03 15:28 ` Robert Pang
  [not found] ` <10F3457F-FC71-4D21-B2CA-05346B68D873@coly.li>
  2 siblings, 1 reply; 27+ messages in thread
From: Robert Pang @ 2025-07-03 15:28 UTC (permalink / raw)
To: Coly Li; +Cc: Coly Li, Kent Overstreet, linux-bcache, Mingzhe Zou, Kuan-Wei Chiu

Hi Coly,

I'm writing to provide an update on this bcache GC latency patch.

With the min heap regression addressed by the recent revert [1], I am
going to initiate a rerun of our last test set [2]. I will share the
results as soon as they are available.

To ensure this patch is fully qualified for the next merge window, I'm
eager to perform any further tests you deem necessary. What additional
test scenarios would you recommend?

Best regards
Robert Pang

[1] https://lore.kernel.org/linux-bcache/20250614202353.1632957-1-visitorckw@gmail.com/T/#m86a71f8030c26f8f366f9442e33f2a236a5a2eac
[2] https://lore.kernel.org/linux-bcache/wtfuhfntbi6yorxqtpcs4vg5w67mvyckp2a6jmxuzt2hvbw65t@gznwsae5653d/T/#me50a9ddd0386ce602b2f17415e02d33b8e29f533

On Thu, May 15, 2025 at 8:58 AM Robert Pang <robertpang@google.com> wrote:
>
> Hi Coly,
>
> My apologies for the delay in providing this update; comprehensive testing
> takes some time to complete.
>
> As you suggested, I conducted extensive tests for 24 hours against the
> latest 6.14.5 Linux kernel, exploring more configurations to get a complete
> picture:
>
> 1. 4KB block size with writethrough mode
> 2. 4KB block size with writeback mode (70% dirty)
> 3. 1MB block size with writethrough mode
>
> The detailed results, available at [1], consistently demonstrate that our patch
> is effective in significantly reducing latency during garbage collection. This
> holds true for both the default writethrough mode and the 70% writeback mode.
> As anticipated, with 1MB block sizes, we observed no difference in latency
> because the number of btree nodes is much smaller.
>
> [1] https://gist.github.com/robert-pang/817fa7c11ece99d25aabc0467a9427d8
>
> However, during these tests, we've uncovered a new and distinct latency problem
> that appears to have been introduced in a recent Linux kernel. This issue manifests
> as frequent and periodic latency spikes that occur outside of garbage
> collection.
> Below is a snippet of the latency data illustrating this:
>
> time (s)  median (ms)  max (ms)
> 60810     2.28           679.37
> 60840     2.32         2,434.24 *
> 60870     2.46         2,434.24 *
> 60900     2.52         2,434.24 *
> 60930     2.63           566.15
> 60960     2.82           566.15
> 60990     2.82           566.15
> 61020     2.78           471.79
> 61050     2.93         2,028.54 *
> 61080     3.11         2,028.54 *
> 61110     3.29         2,028.54 *
> 61140     3.42           679.37
> 61170     3.42           679.37
> 61200     3.41           679.37
> 61230     3.30           566.15
> 61260     2.93         1,690.45 *
> 61290     2.75         1,690.45 *
> 61320     2.72         1,690.45 *
> 61350     2.88         1,408.71 *
> 61380     5.07         1,408.71 *
> 61410   107.94         1,408.71 **
> 61440    65.28         1,408.71 **
> 61470    45.41         2,028.54 **
> 61500    72.45         2,028.54 **
> 61530    55.37         2,028.54 **
> 61560    40.73         1,408.71 **
> 61590    11.48         1,690.45 **
> 61620     2.92         1,690.45 *
> 61650     2.54         1,690.45 *
> 61680     2.58           679.37
> 61710     2.78           679.37
>
> ** garbage collection
> * cache replacement
>
> Based on the consistent periodicity of these spikes, we deduce that they are
> linked to the invalidate_buckets_lru() function during cache replacement. This
> function was recently modified to use min heap operations [2]. To confirm our
> hypothesis, we reverted the relevant commits and re-ran the tests. Results show
> that the latency spikes completely disappeared, confirming that the
> min heap changes introduce this regression. Furthermore, these changes also
> reduce the effectiveness of our GC patch. It appears that the min heap changes
> somehow slow down heap sorting in invalidate_buckets_lru() and in GC.
> > [2] https://lore.kernel.org/linux-bcache/ZxzkLJmhn3a%2F1ALQ@visitorckw-System-Product-Name/T/#m0dd24ba0c63615de465d3fec72dc73febb0f7a94 > > You may download the full test result data from these links. > > https://gist.github.com/robert-pang/5df1d595ee77756c0a01d6479bdf8e34#file-bcache-latency-4kb-no-patch-csv > https://gist.github.com/robert-pang/5df1d595ee77756c0a01d6479bdf8e34#file-bcache-latency-4kb-with-patch-csv > https://gist.github.com/robert-pang/bcc26a3aa90dc95a083799cf4fd48116#file-bcache-latency-4kb-wb-no-patch-csv > https://gist.github.com/robert-pang/bcc26a3aa90dc95a083799cf4fd48116#file-bcache-latency-4kb-wb-with-patch-csv > https://gist.github.com/robert-pang/7036b06b66c8de7e958cdbddcd92a3f5#file-bcache-latency-1mb-no-patch-csv > https://gist.github.com/robert-pang/7036b06b66c8de7e958cdbddcd92a3f5#file-bcache-latency-1mb-with-patch-csv > https://gist.github.com/robert-pang/40f90afdea2d2a8c3f6e22ff959eff03#file-bcache-latency-4kb-no-patch-min-heap-reverted-csv > https://gist.github.com/robert-pang/40f90afdea2d2a8c3f6e22ff959eff03#file-bcache-latency-4kb-with-patch-min-heap-reverted-csv > > Best regards > Robert Pang > > On Sat, May 3, 2025 at 10:33 AM Coly Li <colyli@kernel.org> wrote: > > > > On Thu, May 01, 2025 at 06:01:09PM +0800, Robert Pang wrote: > > > Hi Coly, > > > > > > Please disregard the test results I shared over a week ago. After digging > > > deeper into the recent latency spikes with various workloads and by > > > instrumenting the garbage collector, I realized that the earlier GC latency > > > patch, "bcache: allow allocator to invalidate bucket in gc" [1], wasn't > > > backported to the Linux 6.6 branch I tested my patch against. This omission > > > explains the much higher latency observed during the extended test because the > > > allocator was blocked for the entire GC. My sincere apologies for the > > > inconsistent results and any confusion this has caused. 
> > > > > > > Did you also backport commit 05356938a4be ("bcache: call force_wake_up_gc() > > if necessary in check_should_bypass()") ? Last time when you pushed me to > > add commit a14a68b76954 into mainline kernel, I tested a regression from this > > patch and fixed it. Please add this fix if you didn't, otherwise the testing > > might not be completed. > > > > > > > With patch [1] back-patched and after a 24-hour re-test, the fio results clearly > > > demonstrate that this patch effectively reduces front IO latency during GC due > > > to the smaller incremental GC cycles, while the GC duration increase is still > > > well within bounds. > > > > > > > From the performance result in [2], it seems the max latency are reduced, > > but higher latency period are longer. I am not sure whether this is a happy > > result. > > > > Can I have a download link for the whole log? Then I can look at the > > performance numbers more close. > > > > > Here's a summary of the improved latency: > > > > > > Before: > > > > > > Median latency (P50): 210 ms > > > Max latency (P100): 3.5 sec > > > > > > btree_gc_average_duration_ms:381138 > > > btree_gc_average_frequency_sec:3834 > > > btree_gc_last_sec:60668 > > > btree_gc_max_duration_ms:825228 > > > bset_tree_stats: > > > btree nodes: 144330 > > > written sets: 283733 > > > unwritten sets: 144329 > > > written key bytes: 24993783392 > > > unwritten key bytes: 11777400 > > > floats: 30936844345385 > > > failed: 5776 > > > > > > After: > > > > > > Median latency (P50): 25 ms > > > Max latency (P100): 0.8 sec > > > > > > btree_gc_average_duration_ms:622274 > > > btree_gc_average_frequency_sec:3518 > > > btree_gc_last_sec:8931 > > > btree_gc_max_duration_ms:953146 > > > bset_tree_stats: > > > btree nodes: 175491 > > > written sets: 339078 > > > unwritten sets: 175488 > > > written key bytes: 29821314856 > > > unwritten key bytes: 14076504 > > > floats: 90520963280544 > > > failed: 6462 > > > > > > The complete latency data is available 
at [2].
> > >
> > > I will be glad to run further tests to solidify these findings for the inclusion
> > > of this patch in the coming merge window. Let me know if you'd like me to
> > > conduct any specific tests.
> >
> > Yes, more testing is necessary, from 512-byte block size up to 1 MiB or
> > 8 MiB block size. We need to make sure it won't introduce performance
> > regression in other workloads or circumstances.
> >
> > I don't have a plan to submit this patch in this merge window, so please don't
> > push me. For a performance improvement change, I prefer that the default
> > configuration cover most workloads, so more testing and performance
> > data are desired. E.g. the patch you mentioned (commit a14a68b76954 "bcache:
> > allow allocator to invalidate bucket in gc") had been deployed in the EasyStack
> > production environment for 20+ months before it got merged.
> >
> > Thanks.
> >
> > > [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a14a68b76954e73031ca6399abace17dcb77c17a
> > > [2] https://gist.github.com/robert-pang/cc7c88f356293ea6d43103e6e5f9180f
> >
> > [snipped]
> >
> > --
> > Coly Li

^ permalink raw reply	[flat|nested] 27+ messages in thread
[parent not found: <10F3457F-FC71-4D21-B2CA-05346B68D873@coly.li>]
[parent not found: <A52B16E2-DF9C-4C8A-9853-464385D1A7FA@coly.li>]
* Re: [PATCH 0/1] bcache: reduce front IO latency during GC [not found] ` <A52B16E2-DF9C-4C8A-9853-464385D1A7FA@coly.li> @ 2025-07-17 22:52 ` Robert Pang 2025-07-18 6:09 ` Coly Li 0 siblings, 1 reply; 27+ messages in thread From: Robert Pang @ 2025-07-17 22:52 UTC (permalink / raw) To: Coly Li; +Cc: Coly Li, Kent Overstreet, linux-bcache, Mingzhe Zou, Kuan-Wei Chiu Hi Coly >> With the min heap regression addressed by the recent revert [1], I am >> going to initiate a rerun of our last test set [2]. I will share the >> results as soon as they are available. My apologies for the delay in sharing the results due to my work schedule. I've completed the rerun of the tests, and the results are available for your review in [1]. As expected, the data confirms the desired latency improvements. Furthermore, I've checked for any unintended side effects and can confirm that no regressions were observed. Please take a look and let me know what you think. If there are any other specific scenarios or tests you would like me to execute to further validate the changes, please don't hesitate to let me know. Best regards Robert Pang [1] https://gist.github.com/robert-pang/a22b7c5dee2600be3260f4db57e5776d#file-test-results-md On Thu, Jul 3, 2025 at 9:47 AM Coly Li <i@coly.li> wrote: > > > > > 2025年7月3日 23:34,Coly Li <i@coly.li> 写道: > > > > Hi Robert, > > > >> 2025年7月3日 23:28,Robert Pang <robertpang@google.com> 写道: > >> > >> Hi Coly > >> > >> I'm writing to provide an update on this bcache GC latency patch. > >> > >> With the min heap regression addressed by the recent revert [1], I am > >> going to initiate a rerun of our last test set [2]. I will share the > >> results as soon as they are available. > >> > >> To ensure this patch is fully qualified for the next merge window, I'm > >> eager to perform any further tests you deem necessary. What additional > >> test scenarios would you recommend?" 
> >> > > > > Can you also do some full random write (4K and 64K block size) performance testing? > > > > If there is no performance regression (and I believe improvement will be observed), then let’s go ahead. > > BTW, I don’t commit that this patch will go into next merge window. > > Thanks. > > Coly Li > > > ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 0/1] bcache: reduce front IO latency during GC 2025-07-17 22:52 ` Robert Pang @ 2025-07-18 6:09 ` Coly Li 0 siblings, 0 replies; 27+ messages in thread From: Coly Li @ 2025-07-18 6:09 UTC (permalink / raw) To: Robert Pang Cc: Coly Li, Kent Overstreet, linux-bcache, Mingzhe Zou, Kuan-Wei Chiu Hi Robert, Thanks for the great effort! Let me take a look and respond later. Coly Li > 2025年7月18日 06:52,Robert Pang <robertpang@google.com> 写道: > > Hi Coly > >>> With the min heap regression addressed by the recent revert [1], I am >>> going to initiate a rerun of our last test set [2]. I will share the >>> results as soon as they are available. > > My apologies for the delay in sharing the results due to my work schedule. > > I've completed the rerun of the tests, and the results are available > for your review in [1]. As expected, the data confirms the desired > latency improvements. Furthermore, I've checked for any unintended > side effects and can confirm that no regressions were observed. > > Please take a look and let me know what you think. If there are any > other specific scenarios or tests you would like me to execute to > further validate the changes, please don't hesitate to let me know. > > Best regards > Robert Pang > > [1] https://gist.github.com/robert-pang/a22b7c5dee2600be3260f4db57e5776d#file-test-results-md > > On Thu, Jul 3, 2025 at 9:47 AM Coly Li <i@coly.li> wrote: >> >> >> >>> 2025年7月3日 23:34,Coly Li <i@coly.li> 写道: >>> >>> Hi Robert, >>> >>>> 2025年7月3日 23:28,Robert Pang <robertpang@google.com> 写道: >>>> >>>> Hi Coly >>>> >>>> I'm writing to provide an update on this bcache GC latency patch. >>>> >>>> With the min heap regression addressed by the recent revert [1], I am >>>> going to initiate a rerun of our last test set [2]. I will share the >>>> results as soon as they are available. >>>> >>>> To ensure this patch is fully qualified for the next merge window, I'm >>>> eager to perform any further tests you deem necessary. 
What additional >>>> test scenarios would you recommend?" >>>> >>> >>> Can you also do some full random write (4K and 64K block size) performance testing? >>> >>> If there is no performance regression (and I believe improvement will be observed), then let’s go ahead. >> >> BTW, I don’t commit that this patch will go into next merge window. >> >> Thanks. >> >> Coly Li >> >> >> > ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 0/1] bcache: reduce front IO latency during GC 2025-04-14 22:44 [PATCH 0/1] bcache: reduce front IO latency during GC Robert Pang 2025-04-14 22:44 ` [PATCH 1/1] bcache: process fewer btree nodes in incremental GC cycles Robert Pang 2025-04-15 1:57 ` [PATCH 0/1] bcache: reduce front IO latency during GC Coly Li @ 2025-05-16 17:10 ` Kent Overstreet 2 siblings, 0 replies; 27+ messages in thread From: Kent Overstreet @ 2025-05-16 17:10 UTC (permalink / raw) To: Robert Pang; +Cc: Coly Li, linux-bcache, Mingzhe Zou On Mon, Apr 14, 2025 at 03:44:03PM -0700, Robert Pang wrote: > In performance benchmarks on disks with bcache using the Linux 6.6 kernel, we > observe noticeable IO latency increase during btree garbage collection. The > increase ranges from high tens to hundreds of milliseconds, depending on the > size of the cache device. Further investigation reveals that it is the same > issue reported in [1], where the large number of nodes processed in each > incremental GC cycle causes the front IO latency. > > Building upon the approach suggested in [1], this patch decomposes the > incremental GC process into more but smaller cycles. In contrast to [1], this > implementation adopts a simpler strategy by setting a lower limit of 10 nodes > per cycle to reduce front IO delay and introducing a fixed 10ms sleep per cycle > when front IO is in progress. Furthermore, when garbage collection statistics > are available, the number of nodes processed per cycle is dynamically rescaled > based on the average GC frequency to ensure GC completes well within the next > subsequent scheduled interval. > > Testing with a 750GB NVMe cache and 256KB bucket size using the following fio > configuration demonstrates that our patch reduces front IO latency during GC > without significantly increasing GC duration. Have you been looking at bcachefs yet? It solves all the latency issues in bcache. ^ permalink raw reply [flat|nested] 27+ messages in thread
Thread overview: 27+ messages
2025-04-14 22:44 [PATCH 0/1] bcache: reduce front IO latency during GC Robert Pang
2025-04-14 22:44 ` [PATCH 1/1] bcache: process fewer btree nodes in incremental GC cycles Robert Pang
2025-04-15 2:08 ` Coly Li
2025-04-15 17:25 ` Robert Pang
2025-04-16 9:44 ` Coly Li
2025-04-17 5:44 ` Robert Pang
2025-04-15 1:57 ` [PATCH 0/1] bcache: reduce front IO latency during GC Coly Li
2025-04-22 1:44 ` Robert Pang
2025-05-02 1:01 ` Robert Pang
2025-05-03 17:33 ` Coly Li
2025-05-15 0:58 ` Robert Pang
2025-05-15 1:04 ` Robert Pang
2025-05-15 7:12 ` Kuan-Wei Chiu
2025-05-16 3:58 ` Robert Pang
2025-05-16 16:14 ` Kuan-Wei Chiu
2025-05-17 11:02 ` Coly Li
2025-05-20 11:51 ` Kuan-Wei Chiu
2025-05-20 12:13 ` Coly Li
2025-05-20 12:26 ` Kuan-Wei Chiu
2025-05-20 13:13 ` Coly Li
2025-05-21 14:40 ` Kuan-Wei Chiu
2025-06-06 7:39 ` Robert Pang
2025-06-06 12:37 ` Kuan-Wei Chiu
2025-07-03 15:28 ` Robert Pang
[not found] ` <10F3457F-FC71-4D21-B2CA-05346B68D873@coly.li>
[not found] ` <A52B16E2-DF9C-4C8A-9853-464385D1A7FA@coly.li>
2025-07-17 22:52 ` Robert Pang
2025-07-18 6:09 ` Coly Li
2025-05-16 17:10 ` Kent Overstreet