* [PATCH v2 0/1] bcache: track active bypass writes to prevent stale cache reads
@ 2026-06-17 10:33 Ankit Kapoor
2026-06-17 10:33 ` [PATCH v2 1/1] " Ankit Kapoor
2026-06-17 10:41 ` [PATCH v2 0/1] " Coly Li
0 siblings, 2 replies; 3+ messages in thread
From: Ankit Kapoor @ 2026-06-17 10:33 UTC
To: linux-bcache; +Cc: colyli, kent.overstreet, linux-kernel, Ankit Kapoor
Hi Coly,
This is v2 of the patch to fix a race condition between read cache
misses and bypass writes.
Changes in v2:
Instead of deferring key invalidation, we now explicitly track active
bypass writes using dynamically allocated pages (modeled closely after
md-bitmap.c as suggested by Coly). We use this tracking information
during cache miss reads to determine if the read must also bypass the
cache.
Active bypass writes are tracked by dividing the backing device space
into 32 MB chunks and keeping a refcount of concurrent writes for each
chunk. The memory overhead is minimal: with 16-bit counters, a single
4 KB page holds 2048 counters and therefore covers 64 GB of backing
device space in the chosen approach.
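The mapping from a sector to its tracking counter can be sketched in
userspace C as below. This is an illustrative sketch, not code from the
patch: it assumes 512-byte sectors and a 4 KB page, and the names
`locate`, `struct chunk_pos`, and `bytes_per_tracking_page` are
hypothetical. Only the 2^16-sector chunk size and the 16-bit counter
width are taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumptions: 512-byte sectors and a 4 KB page (not stated by the patch). */
#define SECTOR_SHIFT      9
#define CHUNK_SHIFT       16      /* 2^16 sectors = 32 MiB, as in the patch */
#define PAGE_BYTES        4096u
#define COUNTERS_PER_PAGE (PAGE_BYTES / sizeof(uint16_t))   /* 2048 */

static_assert(PAGE_BYTES % sizeof(uint16_t) == 0,
              "a page must hold a whole number of counters");

/* Hypothetical helper type: where the counter for a sector lives. */
struct chunk_pos {
    uint64_t chunk;   /* index of the 32 MiB chunk */
    uint64_t page;    /* tracking page that holds the chunk's counter */
    uint32_t offset;  /* counter slot inside that page */
};

/* Map a sector number to its chunk, tracking page, and slot. */
static struct chunk_pos locate(uint64_t sector)
{
    struct chunk_pos pos;

    pos.chunk = sector >> CHUNK_SHIFT;
    pos.page = pos.chunk / COUNTERS_PER_PAGE;
    pos.offset = (uint32_t)(pos.chunk % COUNTERS_PER_PAGE);
    return pos;
}

/* Bytes of backing device covered by one fully used tracking page. */
static uint64_t bytes_per_tracking_page(void)
{
    return (uint64_t)COUNTERS_PER_PAGE << (CHUNK_SHIFT + SECTOR_SHIFT);
}
```

Under these assumptions `bytes_per_tracking_page()` evaluates to 64 GiB,
and the first sector past 64 GiB (sector 2^27) maps to page 1, offset 0.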
Implementation Approaches Evaluated:
When designing this, we evaluated three synchronization approaches:
1. Global Spinlock
2. Page-level Spinlock (Chosen Approach)
3. Atomic Counters (Lockless)
Memory Consumption:
Idle Memory Usage (No active bypass writes)
Backing Disk Size | Global Lk | Page Lk  | Atomic Counters
1 TB              | 256 B     | 256 B    | 1 KB
10 TB             | 2.5 KB    | 2.5 KB   | 10 KB
100 TB            | 25 KB     | 25 KB    | 100 KB
Peak Memory Usage (Tracking pages fully allocated)
Backing Disk Size | Global Lk | Page Lk  | Atomic Counters
1 TB              | 64.25 KB  | 64.25 KB | 257 KB
10 TB             | 642.5 KB  | 642.5 KB | 2.51 MB
100 TB            | 6.27 MB   | 6.27 MB  | 25.10 MB
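The table entries can be reproduced with a small userspace calculation.
The sketch below is illustrative only and rests on assumptions that the
cover letter does not state: binary units (1 TB = 2^40 bytes), a 4 KB
page, and a 16-byte per-page descriptor (8-byte pointer, 4-byte count,
4-byte spinlock on a 64-bit non-debug build). The names
`tracking_footprint` and `struct footprint` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define CHUNK_BYTES (32ull << 20)   /* 32 MiB chunk, as in the patch */
#define PAGE_BYTES  4096ull         /* assumed page size */

struct footprint {
    uint64_t idle_bytes;   /* descriptors only, no tracking pages allocated */
    uint64_t peak_bytes;   /* descriptors plus every tracking page allocated */
};

/*
 * disk_bytes:    size of the backing device
 * counter_bytes: width of one per-chunk counter (2 for u16)
 * desc_bytes:    size of one per-page descriptor (assumed 16)
 */
static struct footprint tracking_footprint(uint64_t disk_bytes,
                                           uint64_t counter_bytes,
                                           uint64_t desc_bytes)
{
    struct footprint fp;
    uint64_t chunks, per_page, pages;

    assert(counter_bytes > 0 && counter_bytes <= PAGE_BYTES);

    chunks = (disk_bytes + CHUNK_BYTES - 1) / CHUNK_BYTES;  /* round up */
    per_page = PAGE_BYTES / counter_bytes;
    pages = (chunks + per_page - 1) / per_page;             /* round up */

    fp.idle_bytes = pages * desc_bytes;
    fp.peak_bytes = fp.idle_bytes + pages * PAGE_BYTES;
    return fp;
}
```

For a 1 TB device, `tracking_footprint(1ULL << 40, 2, 16)` gives 256
bytes idle and 65,792 bytes (64.25 KB) at peak, matching the Page Lk
column. The Atomic Counters column is numerically consistent with 8-byte
counters (`counter_bytes = 8`), but that width is an inference from the
table, not something stated above.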
Performance Benchmarks (FIO):
Setup:
- CPU: 32 vCPU, Intel Cascade Lake x86_64 (n2-standard-32 GCP VM)
- Memory: 128 GB RAM
- OS: Linux 6.12.68 (Google COS)
- Storage: Google Cloud Extreme PD (1000 GB) + Local SSD (375 GB)
FIO config:
rw=randrw, bs=(R) 4096B-4096B, (W) 128KiB-128KiB, (T) 128KiB-128KiB,
ioengine=libaio, iodepth=32 - 16 jobs
10 GB file workload (1 tracking page active)
Metric       | Unpatched | Global Lk | Page Lk  | Atomic   | Analysis
Read IOPS    | 20,977    | 20,897    | 20,900   | 20,886   | Almost flat
Write IOPS   | 8,993     | 8,956     | 8,956    | 8,951    | Almost flat
Total IOPS   | 29,971    | 29,853    | 29,856   | 29,836   | Almost flat
Avg R Lat    | 16.31 ms  | 16.66 ms  | 16.66 ms | 16.68 ms | Almost flat
Avg W Lat    | 18.94 ms  | 18.33 ms  | 18.34 ms | 18.34 ms | Almost flat
Kernel CPU % | 63.02     | 77.94     | 72.88    | 75.41    | Overhead
Avg Ker CPU% | 3.94      | 4.87      | 4.56     | 4.71     | Overhead
NVMe Util %  | 66.89     | 0.62      | 0.62     | 0.63     | Offload success
320 GB file workload (5 tracking pages active)
Metric       | Unpatched | Global Lk | Page Lk  | Atomic   | Analysis
Read IOPS    | 20,974    | 20,898    | 20,897   | 20,898   | Almost flat
Write IOPS   | 8,988     | 8,957     | 8,956    | 8,956    | Almost flat
Total IOPS   | 29,963    | 29,855    | 29,853   | 29,854   | Almost flat
Avg R Lat    | 16.54 ms  | 16.76 ms  | 16.78 ms | 16.76 ms | Almost flat
Avg W Lat    | 18.41 ms  | 18.11 ms  | 18.07 ms | 18.10 ms | Almost flat
Kernel CPU % | 70.39     | 70.70     | 60.85    | 62.20    | Scaled efficiently
Avg Ker CPU% | 4.40      | 4.42      | 3.80     | 3.89     | Scaled efficiently
NVMe Util %  | 67.74     | 0.66      | 0.61     | 0.62     | Offload success
Rationale & Analysis:
1. Tracking Overhead: Tracking adds CPU overhead for smaller
workloads. In the 10 GB run, kernel CPU rises from 63.02% unpatched
to 72.88% with the chosen page-level lock, which is the lowest of
the three variants (77.94% global lock, 75.41% atomic counters).
2. Cache Device Offloading: The massive drop in NVMe cache utilization
(67% to 0.6%) occurs because the reads are safely bypassing the
cache instead of fetching and attempting to populate the cache with
stale data.
3. Superior Scaling Under Load: For larger files (320 GB), the added
overhead of page tracking is more than offset by the cache update
operations saved during reads: kernel CPU drops from 70.39%
unpatched to 60.85% with the page-level lock.
Questions for Coly:
1. If `kzalloc` fails on the fast path in `bch_bypass_write_start()`,
we currently skip incrementing the counter (leaving it untracked).
This opens a tiny edge case (a "stolen increment"): if a concurrent
bio successfully allocates the tracking page slightly later, the
initially untracked bio will blindly decrement that counter in
`bch_bypass_write_end()`. Because of the `> 0` check in
`bch_bypass_write_end()`, this doesn't cause a true integer
underflow, but it does prematurely drop the count to 0 (effectively
stealing the track). `md-bitmap.c` solves similar issues via a
hijacking fallback. Given that memory corruption is prevented by
the `> 0` check, do you think a similar fallback mechanism is
necessary here, or is the rate-limited warning and sysfs tracking
sufficient given how rare `kzalloc` failures are for 4KB pages?
2. We currently use `u16` for the tracking counters. Since standard
block layer queue depths rarely exceed a few thousand, a counter
overflow (65,535 concurrent writes to the exact same 32MB chunk)
seems practically impossible. However, if you'd prefer to be
absolutely defensive against overflows without doubling the memory
footprint to `u32`, I could clamp the counter at `U16_MAX` during
increments. This would safely prevent wrap-around memory leaks
while keeping our current memory efficiency. Would you prefer
leaving it as is, clamping it, or bumping to `u32`?
Looking forward to your feedback on the changes.
Ankit Kapoor (1):
bcache: track active bypass writes to prevent stale cache reads
Documentation/admin-guide/bcache.rst | 8 ++
drivers/md/bcache/bcache.h | 35 +++++++
drivers/md/bcache/request.c | 132 +++++++++++++++++++++++++++
drivers/md/bcache/stats.c | 14 +++
drivers/md/bcache/stats.h | 4 +
drivers/md/bcache/super.c | 30 ++++++
drivers/md/bcache/sysfs.c | 5 +
include/trace/events/bcache.h | 5 +
8 files changed, 233 insertions(+)
--
2.54.0.1136.gdb2ca164c4-goog
* [PATCH v2 1/1] bcache: track active bypass writes to prevent stale cache reads
From: Ankit Kapoor @ 2026-06-17 10:33 UTC
To: linux-bcache; +Cc: colyli, kent.overstreet, linux-kernel, Ankit Kapoor
A race exists between a read cache miss and a bypass write (triggered
by either congestion or sequential bypass) that target the same
sectors. If the read cache miss fetches data from the backing device
before the bypass write reaches the backing device, stale data
populates the cache.
The root cause is that bcache currently executes btree key
invalidation in parallel with (or prior to) writing the actual data
payload to the backing device. Under this sequence, a concurrent read
path can register a cache miss and insert a placeholder key. If the
write's btree key invalidation completes before the read finishes
fetching old data from the backing device, the read's subsequent key
replacement will not detect a collision, allowing stale data to
persist in the cache.
Fix this by tracking active bypass writes. We divide the backing
device space into 32MB chunks and track concurrent bypass writes
using refcounts. The tracking counters are stored in dynamically
allocated pages, minimizing memory overhead (a single 4KB page
supports 64GB of disk space). Synchronization is handled via
page-level spinlocks rather than a single global lock, ensuring
minimal contention across different regions of the disk.
On a cache miss read, bcache checks if there are any active bypass
writes overlapping the target sectors. If an active bypass write is
detected, the read is forced to bypass the cache to ensure data
consistency.
If the system fails to allocate a tracking page due to memory
pressure, the bypass write proceeds untracked. To provide
observability into this fallback, we print a rate-limited dmesg
warning and track the allocation failures using a new
`bypass_tracking_alloc_fails` sysfs counter.
Additionally, add a `cache_read_bypass_races` sysfs counter and a
corresponding tracepoint to monitor these occurrences.
Signed-off-by: Ankit Kapoor <ankitkap@google.com>
Suggested-by: Coly Li <colyli@fygo.io>
---
v2:
- Addressed feedback from Coly Li regarding SSD power failures.
- Implemented bypass write monitoring to force concurrent reads
to the bypass path.
- Referenced md RAID bitmap implementation for the architectural
approach as per Coly Li's suggestion.
Documentation/admin-guide/bcache.rst | 8 ++
drivers/md/bcache/bcache.h | 35 +++++++
drivers/md/bcache/request.c | 132 +++++++++++++++++++++++++++
drivers/md/bcache/stats.c | 14 +++
drivers/md/bcache/stats.h | 4 +
drivers/md/bcache/super.c | 30 ++++++
drivers/md/bcache/sysfs.c | 5 +
include/trace/events/bcache.h | 5 +
8 files changed, 233 insertions(+)
diff --git a/Documentation/admin-guide/bcache.rst b/Documentation/admin-guide/bcache.rst
index 325816edbdab..e08cfa6b0ea0 100644
--- a/Documentation/admin-guide/bcache.rst
+++ b/Documentation/admin-guide/bcache.rst
@@ -507,6 +507,10 @@ cache_miss_collisions
cache miss, but raced with a write and data was already present (usually 0
since the synchronization for cache misses was rewritten)
+cache_read_bypass_races
+ Counts instances where a cache miss read raced with a concurrent bypass
+ write, forcing the read to bypass the cache to prevent reading stale data.
+
Sysfs - cache set
~~~~~~~~~~~~~~~~~
@@ -592,6 +596,10 @@ bset_tree_stats
btree_cache_max_chain
Longest chain in the btree node cache's hash table
+bypass_tracking_alloc_fails
+ Counts instances where memory allocation for bypass write tracking
+ failed. When this occurs, a bypass write proceeds untracked.
+
cache_read_races
Counts instances where while data was being read from the cache, the bucket
was reused and invalidated - i.e. where the pointer was stale after the read
diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index ec9ff9715081..8e08503a698b 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -299,6 +299,12 @@ enum stop_on_failure {
BCH_CACHED_DEV_STOP_MODE_MAX,
};
+struct bch_bypass_page {
+	u16 *counts;
+	unsigned int active;
+	spinlock_t lock;
+};
+
struct cached_dev {
struct list_head list;
struct bcache_device disk;
@@ -407,8 +413,36 @@ struct cached_dev {
*/
#define BCH_WBRATE_UPDATE_MAX_SKIPS 15
unsigned int rate_update_retry;
+
+ /* For tracking active bypass writes */
+#define BCH_BYPASS_CHUNK_SHIFT 16 /* 2^16 sectors = 32MB */
+#define BCH_BYPASS_PAGE_COUNTERS (PAGE_SIZE / sizeof(u16))
+#define BCH_BYPASS_PAGE_SHIFT (PAGE_SHIFT - 1)
+#define BCH_BYPASS_PAGE_MASK ((1UL << BCH_BYPASS_PAGE_SHIFT) - 1)
+ struct bch_bypass_page *bypass_pages;
+ unsigned long bypass_num_pages;
};
+static inline unsigned long sector_to_bypass_chunk(sector_t sector)
+{
+	return sector >> BCH_BYPASS_CHUNK_SHIFT;
+}
+
+static inline unsigned long bypass_chunk_to_page(unsigned long chunk)
+{
+	return chunk >> BCH_BYPASS_PAGE_SHIFT;
+}
+
+static inline unsigned long bypass_chunk_to_offset(unsigned long chunk)
+{
+	return chunk & BCH_BYPASS_PAGE_MASK;
+}
+
+static inline sector_t bypass_chunk_to_sector(unsigned long chunk)
+{
+	return (sector_t)chunk << BCH_BYPASS_CHUNK_SHIFT;
+}
+
enum alloc_reserve {
RESERVE_BTREE,
RESERVE_PRIO,
@@ -714,6 +748,7 @@ struct cache_set {
struct time_stats btree_read_time;
atomic_long_t cache_read_races;
+ atomic_long_t bypass_tracking_alloc_fails;
atomic_long_t writeback_keys_done;
atomic_long_t writeback_keys_failed;
diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index 3fa3b13a410f..426665c07394 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -830,6 +830,126 @@ static CLOSURE_CALLBACK(cached_dev_cache_miss_done)
closure_put(&d->cl);
}
+static void bch_bypass_write_start(struct cached_dev *dc, sector_t sector, unsigned int sectors)
+{
+	unsigned long start_chunk = sector_to_bypass_chunk(sector);
+	unsigned long end_chunk = sector_to_bypass_chunk(sector + sectors - 1);
+	unsigned long end_pg_idx = bypass_chunk_to_page(end_chunk);
+	unsigned long chunk;
+
+	if (!dc->bypass_pages)
+		return;
+
+	if (WARN_ON_ONCE(end_pg_idx >= dc->bypass_num_pages))
+		return;
+
+	for (chunk = start_chunk; chunk <= end_chunk; chunk++) {
+		unsigned long pg_idx = bypass_chunk_to_page(chunk);
+		unsigned long pg_off = bypass_chunk_to_offset(chunk);
+		struct bch_bypass_page *pg = &dc->bypass_pages[pg_idx];
+		u16 *new_counts;
+		u16 *dup_counts = NULL;
+
+		spin_lock_irq(&pg->lock);
+		if (!pg->counts) {
+			spin_unlock_irq(&pg->lock);
+			new_counts = kzalloc(PAGE_SIZE, __GFP_NOWARN | GFP_NOIO);
+			spin_lock_irq(&pg->lock);
+			if (!new_counts) {
+				if (!pg->counts) {
+					spin_unlock_irq(&pg->lock);
+					pr_warn_ratelimited("failed to allocate bypass write tracking page, bypass write untracked for sectors %llu-%llu\n",
+							    (uint64_t)bypass_chunk_to_sector(chunk),
+							    (uint64_t)bypass_chunk_to_sector(
+								    chunk + 1) - 1);
+					if (dc->disk.c)
+						atomic_long_inc(
+							&dc->disk.c->bypass_tracking_alloc_fails);
+					continue;
+				}
+			}
+			if (new_counts) {
+				if (pg->counts)
+					dup_counts = new_counts;
+				else
+					pg->counts = new_counts;
+			}
+		}
+		pg->counts[pg_off]++;
+		pg->active++;
+		spin_unlock_irq(&pg->lock);
+
+		kfree(dup_counts);
+	}
+}
+
+static void bch_bypass_write_end(struct cached_dev *dc, sector_t sector, unsigned int sectors)
+{
+	unsigned long start_chunk = sector_to_bypass_chunk(sector);
+	unsigned long end_chunk = sector_to_bypass_chunk(sector + sectors - 1);
+	unsigned long end_pg_idx = bypass_chunk_to_page(end_chunk);
+	unsigned long chunk;
+
+	if (!dc->bypass_pages)
+		return;
+
+	if (WARN_ON_ONCE(end_pg_idx >= dc->bypass_num_pages))
+		return;
+
+	for (chunk = start_chunk; chunk <= end_chunk; chunk++) {
+		unsigned long pg_idx = bypass_chunk_to_page(chunk);
+		unsigned long pg_off = bypass_chunk_to_offset(chunk);
+		struct bch_bypass_page *pg = &dc->bypass_pages[pg_idx];
+		u16 *counts = NULL;
+		unsigned long flags;
+
+		spin_lock_irqsave(&pg->lock, flags);
+		if (pg->counts && pg->counts[pg_off] > 0) {
+			pg->counts[pg_off]--;
+			pg->active--;
+			if (!pg->active) {
+				counts = pg->counts;
+				pg->counts = NULL;
+			}
+		}
+		spin_unlock_irqrestore(&pg->lock, flags);
+
+		kfree(counts);
+	}
+}
+
+static bool bch_has_active_bypass_writes(struct cached_dev *dc, sector_t sector,
+					 unsigned int sectors)
+{
+	unsigned long start_chunk = sector_to_bypass_chunk(sector);
+	unsigned long end_chunk = sector_to_bypass_chunk(sector + sectors - 1);
+	unsigned long end_pg_idx = bypass_chunk_to_page(end_chunk);
+	unsigned long chunk;
+	bool has_active = false;
+
+	if (!dc->bypass_pages)
+		return false;
+
+	if (WARN_ON_ONCE(end_pg_idx >= dc->bypass_num_pages))
+		return false;
+
+	for (chunk = start_chunk; chunk <= end_chunk; chunk++) {
+		unsigned long pg_idx = bypass_chunk_to_page(chunk);
+		unsigned long pg_off = bypass_chunk_to_offset(chunk);
+		struct bch_bypass_page *pg = &dc->bypass_pages[pg_idx];
+
+		spin_lock_irq(&pg->lock);
+		if (pg->counts && pg->counts[pg_off] > 0) {
+			has_active = true;
+			spin_unlock_irq(&pg->lock);
+			break;
+		}
+		spin_unlock_irq(&pg->lock);
+	}
+
+	return has_active;
+}
+
static CLOSURE_CALLBACK(cached_dev_read_done)
{
closure_type(s, struct search, cl);
@@ -899,6 +1019,13 @@ static int cached_dev_cache_miss(struct btree *b, struct search *s,
s->cache_missed = 1;
+	if (bch_has_active_bypass_writes(dc, bio->bi_iter.bi_sector,
+					 min(sectors, bio_sectors(bio)))) {
+		s->iop.bypass = true;
+		trace_bcache_read_bypass_race(bio);
+		bch_mark_cache_read_bypass_race(s->iop.c, s->d);
+	}
+
if (s->cache_miss || s->iop.bypass) {
miss = bio_next_split(bio, sectors, GFP_NOIO, &s->d->bio_split);
ret = miss == bio ? MAP_DONE : MAP_CONTINUE;
@@ -974,6 +1101,9 @@ static CLOSURE_CALLBACK(cached_dev_write_complete)
closure_type(s, struct search, cl);
struct cached_dev *dc = container_of(s->d, struct cached_dev, disk);
+ if (s->iop.bypass)
+ bch_bypass_write_end(dc, s->iop.bio->bi_iter.bi_sector, bio_sectors(s->iop.bio));
+
up_read_non_owner(&dc->writeback_lock);
cached_dev_bio_complete(&cl->work);
}
@@ -1018,6 +1148,8 @@ static void cached_dev_write(struct cached_dev *dc, struct search *s)
s->iop.bio = s->orig_bio;
bio_get(s->iop.bio);
+ bch_bypass_write_start(dc, bio->bi_iter.bi_sector, bio_sectors(bio));
+
if (bio_op(bio) == REQ_OP_DISCARD &&
!bdev_max_discard_sectors(dc->bdev))
goto insert_data;
diff --git a/drivers/md/bcache/stats.c b/drivers/md/bcache/stats.c
index 0056106495a7..df21452e106a 100644
--- a/drivers/md/bcache/stats.c
+++ b/drivers/md/bcache/stats.c
@@ -47,6 +47,7 @@ read_attribute(cache_bypass_hits);
read_attribute(cache_bypass_misses);
read_attribute(cache_hit_ratio);
read_attribute(cache_miss_collisions);
+read_attribute(cache_read_bypass_races);
read_attribute(bypassed);
SHOW(bch_stats)
@@ -64,6 +65,7 @@ SHOW(bch_stats)
var(cache_hits) + var(cache_misses)));
var_print(cache_miss_collisions);
+ var_print(cache_read_bypass_races);
sysfs_hprint(bypassed, var(sectors_bypassed) << 9);
#undef var
return 0;
@@ -85,6 +87,7 @@ static struct attribute *bch_stats_attrs[] = {
&sysfs_cache_bypass_misses,
&sysfs_cache_hit_ratio,
&sysfs_cache_miss_collisions,
+ &sysfs_cache_read_bypass_races,
&sysfs_bypassed,
NULL
};
@@ -112,6 +115,7 @@ void bch_cache_accounting_clear(struct cache_accounting *acc)
acc->total.cache_bypass_hits = 0;
acc->total.cache_bypass_misses = 0;
acc->total.cache_miss_collisions = 0;
+ acc->total.cache_read_bypass_races = 0;
acc->total.sectors_bypassed = 0;
}
@@ -143,6 +147,7 @@ static void scale_stats(struct cache_stats *stats, unsigned long rescale_at)
scale_stat(&stats->cache_bypass_hits);
scale_stat(&stats->cache_bypass_misses);
scale_stat(&stats->cache_miss_collisions);
+ scale_stat(&stats->cache_read_bypass_races);
scale_stat(&stats->sectors_bypassed);
}
}
@@ -165,6 +170,7 @@ static void scale_accounting(struct timer_list *t)
move_stat(cache_bypass_hits);
move_stat(cache_bypass_misses);
move_stat(cache_miss_collisions);
+ move_stat(cache_read_bypass_races);
move_stat(sectors_bypassed);
scale_stats(&acc->total, 0);
@@ -212,6 +218,14 @@ void bch_mark_cache_miss_collision(struct cache_set *c, struct bcache_device *d)
atomic_inc(&c->accounting.collector.cache_miss_collisions);
}
+void bch_mark_cache_read_bypass_race(struct cache_set *c, struct bcache_device *d)
+{
+	struct cached_dev *dc = container_of(d, struct cached_dev, disk);
+
+	atomic_inc(&dc->accounting.collector.cache_read_bypass_races);
+	atomic_inc(&c->accounting.collector.cache_read_bypass_races);
+}
+
void bch_mark_sectors_bypassed(struct cache_set *c, struct cached_dev *dc,
int sectors)
{
diff --git a/drivers/md/bcache/stats.h b/drivers/md/bcache/stats.h
index 21b445f8af15..97d25e0d177c 100644
--- a/drivers/md/bcache/stats.h
+++ b/drivers/md/bcache/stats.h
@@ -8,6 +8,7 @@ struct cache_stat_collector {
atomic_t cache_bypass_hits;
atomic_t cache_bypass_misses;
atomic_t cache_miss_collisions;
+ atomic_t cache_read_bypass_races;
atomic_t sectors_bypassed;
};
@@ -19,6 +20,7 @@ struct cache_stats {
unsigned long cache_bypass_hits;
unsigned long cache_bypass_misses;
unsigned long cache_miss_collisions;
+ unsigned long cache_read_bypass_races;
unsigned long sectors_bypassed;
unsigned int rescale;
@@ -55,6 +57,8 @@ void bch_mark_cache_accounting(struct cache_set *c, struct bcache_device *d,
bool hit, bool bypass);
void bch_mark_cache_miss_collision(struct cache_set *c,
struct bcache_device *d);
+void bch_mark_cache_read_bypass_race(struct cache_set *c,
+ struct bcache_device *d);
void bch_mark_sectors_bypassed(struct cache_set *c,
struct cached_dev *dc,
int sectors);
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 97d9adb0bf96..dc584da349d9 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -1346,6 +1346,13 @@ void bch_cached_dev_release(struct kobject *kobj)
{
struct cached_dev *dc = container_of(kobj, struct cached_dev,
disk.kobj);
+	if (dc->bypass_pages) {
+		unsigned long i;
+
+		for (i = 0; i < dc->bypass_num_pages; i++)
+			kfree(dc->bypass_pages[i].counts);
+		kfree(dc->bypass_pages);
+	}
kfree(dc);
module_put(THIS_MODULE);
}
@@ -1407,6 +1414,25 @@ static CLOSURE_CALLBACK(cached_dev_flush)
continue_at(cl, cached_dev_free, system_percpu_wq);
}
+static int bch_cached_dev_bypass_init(struct cached_dev *dc, sector_t sectors)
+{
+	unsigned long chunks = (sectors + (1UL << BCH_BYPASS_CHUNK_SHIFT) - 1) >>
+			       BCH_BYPASS_CHUNK_SHIFT;
+	unsigned long i;
+
+	dc->bypass_num_pages = DIV_ROUND_UP(chunks, BCH_BYPASS_PAGE_COUNTERS);
+	dc->bypass_pages = kcalloc(dc->bypass_num_pages,
+				   sizeof(struct bch_bypass_page),
+				   GFP_KERNEL);
+	if (!dc->bypass_pages)
+		return -ENOMEM;
+
+	for (i = 0; i < dc->bypass_num_pages; i++)
+		spin_lock_init(&dc->bypass_pages[i].lock);
+
+	return 0;
+}
+
static int cached_dev_init(struct cached_dev *dc, unsigned int block_size)
{
int ret;
@@ -1447,6 +1473,10 @@ static int cached_dev_init(struct cached_dev *dc, unsigned int block_size)
/* default to auto */
dc->stop_when_cache_set_failed = BCH_CACHED_DEV_STOP_AUTO;
+ ret = bch_cached_dev_bypass_init(dc, bdev_nr_sectors(dc->bdev));
+ if (ret)
+ return ret;
+
bch_cached_dev_request_init(dc);
bch_cached_dev_writeback_init(dc);
return 0;
diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
index cfac56caa804..26b17edd5576 100644
--- a/drivers/md/bcache/sysfs.c
+++ b/drivers/md/bcache/sysfs.c
@@ -95,6 +95,7 @@ read_attribute(feature_incompat);
read_attribute(state);
read_attribute(cache_read_races);
+read_attribute(bypass_tracking_alloc_fails);
read_attribute(reclaim);
read_attribute(reclaimed_journal_buckets);
read_attribute(flush_write);
@@ -748,6 +749,9 @@ SHOW(__bch_cache_set)
sysfs_print(cache_read_races,
atomic_long_read(&c->cache_read_races));
+ sysfs_print(bypass_tracking_alloc_fails,
+ atomic_long_read(&c->bypass_tracking_alloc_fails));
+
sysfs_print(reclaim,
atomic_long_read(&c->reclaim));
@@ -993,6 +997,7 @@ static struct attribute *bch_cache_set_internal_attrs[] = {
&sysfs_bset_tree_stats,
&sysfs_cache_read_races,
+ &sysfs_bypass_tracking_alloc_fails,
&sysfs_reclaim,
&sysfs_reclaimed_journal_buckets,
&sysfs_flush_write,
diff --git a/include/trace/events/bcache.h b/include/trace/events/bcache.h
index 899fdacf57b9..b76bca0c5285 100644
--- a/include/trace/events/bcache.h
+++ b/include/trace/events/bcache.h
@@ -120,6 +120,11 @@ DEFINE_EVENT(bcache_bio, bcache_bypass_congested,
TP_ARGS(bio)
);
+DEFINE_EVENT(bcache_bio, bcache_read_bypass_race,
+ TP_PROTO(struct bio *bio),
+ TP_ARGS(bio)
+);
+
TRACE_EVENT(bcache_read,
TP_PROTO(struct bio *bio, bool hit, bool bypass),
TP_ARGS(bio, hit, bypass),
--
2.54.0.1136.gdb2ca164c4-goog
* Re: [PATCH v2 0/1] bcache: track active bypass writes to prevent stale cache reads
From: Coly Li @ 2026-06-17 10:41 UTC
To: Ankit Kapoor; +Cc: linux-bcache, kent.overstreet, linux-kernel
> On June 17, 2026, at 18:33, Ankit Kapoor <ankitkap@google.com> wrote:
>
> Hi Coly,
>
> This is v2 of the patch to fix a race condition between read cache
> misses and bypass writes.
>
> Changes in v2:
> Instead of deferring key invalidation, we now explicitly track active
> bypass writes using dynamically allocated pages (modeled closely after
> md-bitmap.c as suggested by Coly). We use this tracking information
> during cache miss reads to determine if the read must also bypass the
> cache.
>
> The active bypass writes are tracked by dividing the backing device
> space into 32MB chunks and maintaining concurrent write refcounts. The
> memory overhead is minimal; a single 4KB page covers 64 GB of backing
> device space in the chosen approach.
>
> Implementation Approaches Evaluated:
> When designing this, we evaluated three synchronization approaches:
> 1. Global Spinlock
> 2. Page-level Spinlock (Chosen Approach)
> 3. Atomic Counters (Lockless)
Thanks, patch received. I will respond after I have reviewed it.
Coly Li