From: Ankit Kapoor <ankitkap@google.com>
To: linux-bcache@vger.kernel.org
Cc: colyli@fygo.io, kent.overstreet@linux.dev,
linux-kernel@vger.kernel.org, Ankit Kapoor <ankitkap@google.com>
Subject: [PATCH v2 0/1] bcache: track active bypass writes to prevent stale cache reads
Date: Wed, 17 Jun 2026 10:33:55 +0000
Message-ID: <20260617103356.3287775-1-ankitkap@google.com>
Hi Coly,
This is v2 of the patch to fix a race condition between read cache
misses and bypass writes.
Changes in v2:
Instead of deferring key invalidation, we now explicitly track active
bypass writes using dynamically allocated pages (modeled closely after
md-bitmap.c as suggested by Coly). We use this tracking information
during cache miss reads to determine if the read must also bypass the
cache.
Active bypass writes are tracked by dividing the backing device space
into 32 MB chunks and keeping a per-chunk count of in-flight bypass
writes. The memory overhead is small: in the chosen approach a 4 KB page
holds 2048 u16 counters, so a single page covers 64 GB of backing device
space.
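
For readers who want to see the chunk arithmetic spelled out, here is a
minimal userspace sketch of the scheme described above. It is a model
under stated assumptions, not the patch code: the `track_model` type and
the `track_*` / `model_*` function names are hypothetical, locking is
omitted entirely, and `calloc` stands in for the kernel allocator. It
only shows how a sector range maps to 32 MB chunks, how pages of u16
counters are allocated on demand, and how a read miss could consult the
counters.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define SECTORS_PER_CHUNK  65536ULL  /* 32 MB chunk / 512-byte sectors */
#define COUNTERS_PER_PAGE  2048ULL   /* 4 KB page / sizeof(uint16_t) */

/* Hypothetical model type; the structures in the real patch may differ. */
struct track_model {
	uint16_t **pages;   /* pages[i] stays NULL until a write hits its 64 GB range */
	uint64_t nr_pages;
};

static int track_init(struct track_model *t, uint64_t dev_sectors)
{
	uint64_t chunks;

	assert(dev_sectors > 0);
	chunks = (dev_sectors + SECTORS_PER_CHUNK - 1) / SECTORS_PER_CHUNK;
	t->nr_pages = (chunks + COUNTERS_PER_PAGE - 1) / COUNTERS_PER_PAGE;
	t->pages = calloc(t->nr_pages, sizeof(*t->pages));
	return t->pages ? 0 : -1;
}

static void track_free(struct track_model *t)
{
	for (uint64_t i = 0; i < t->nr_pages; i++)
		free(t->pages[i]);
	free(t->pages);
}

/* Return the counter for a chunk, optionally allocating its page on demand. */
static uint16_t *track_counter(struct track_model *t, uint64_t chunk, bool alloc)
{
	uint64_t page = chunk / COUNTERS_PER_PAGE;

	assert(page < t->nr_pages);
	if (!t->pages[page] && alloc)
		t->pages[page] = calloc(COUNTERS_PER_PAGE, sizeof(uint16_t));
	if (!t->pages[page])
		return NULL;
	return &t->pages[page][chunk % COUNTERS_PER_PAGE];
}

/* A bypass write begins: bump every chunk the sector range touches. */
static void model_write_start(struct track_model *t, uint64_t sector, uint64_t nr)
{
	assert(nr > 0);
	for (uint64_t c = sector / SECTORS_PER_CHUNK;
	     c <= (sector + nr - 1) / SECTORS_PER_CHUNK; c++) {
		uint16_t *cnt = track_counter(t, c, true);

		if (cnt)
			(*cnt)++;
	}
}

/* The bypass write completes: drop the counts, never below zero. */
static void model_write_end(struct track_model *t, uint64_t sector, uint64_t nr)
{
	assert(nr > 0);
	for (uint64_t c = sector / SECTORS_PER_CHUNK;
	     c <= (sector + nr - 1) / SECTORS_PER_CHUNK; c++) {
		uint16_t *cnt = track_counter(t, c, false);

		if (cnt && *cnt > 0)
			(*cnt)--;
	}
}

/* A read miss must bypass the cache if any chunk it touches has a write in flight. */
static bool model_read_must_bypass(struct track_model *t, uint64_t sector, uint64_t nr)
{
	assert(nr > 0);
	for (uint64_t c = sector / SECTORS_PER_CHUNK;
	     c <= (sector + nr - 1) / SECTORS_PER_CHUNK; c++) {
		uint16_t *cnt = track_counter(t, c, false);

		if (cnt && *cnt > 0)
			return true;
	}
	return false;
}
```

In this model a write that straddles a chunk boundary bumps both chunks,
so a read miss anywhere in either 32 MB chunk is steered around the
cache until the write completes. The granularity is deliberately coarse:
unrelated reads in the same chunk also bypass.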
Implementation Approaches Evaluated:
When designing this, we evaluated three synchronization approaches:
1. Global Spinlock
2. Page-level Spinlock (Chosen Approach)
3. Atomic Counters (Lockless)
Memory Consumption:

Idle Memory Usage (no active bypass writes)
Backing Disk Size | Global Lock | Page Lock | Atomic Counters
1 TB              | 256 B       | 256 B     | 1 KB
10 TB             | 2.5 KB      | 2.5 KB    | 10 KB
100 TB            | 25 KB       | 25 KB     | 100 KB

Peak Memory Usage (tracking pages fully allocated)
Backing Disk Size | Global Lock | Page Lock | Atomic Counters
1 TB              | 64.25 KB    | 64.25 KB  | 257 KB
10 TB             | 642.5 KB    | 642.5 KB  | 2.51 MB
100 TB            | 6.27 MB     | 6.27 MB   | 25.10 MB
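
The figures for the two spinlock variants can be reproduced with a few
lines of arithmetic. The sketch below is only a cross-check, not code
from the patch. It assumes binary units (1 TB = 2^40 bytes), 32 MB
chunks, 2-byte counters, and 16 bytes of always-resident bookkeeping per
tracking page. That last value is inferred from the idle column
(256 B / 16 pages for 1 TB) and is not taken from the patch itself. The
function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define CHUNK_BYTES    (32ULL << 20)  /* 32 MB per tracked chunk */
#define PAGE_BYTES     4096ULL        /* one tracking page */
#define COUNTER_BYTES  2ULL           /* u16 counter */
#define SLOT_BYTES     16ULL          /* ASSUMPTION: inferred from the idle table */

/* Number of tracking pages needed to cover a backing device. */
static uint64_t tracking_pages(uint64_t dev_bytes)
{
	/* 2048 counters per page * 32 MB per counter = 64 GB per page */
	uint64_t bytes_per_page = (PAGE_BYTES / COUNTER_BYTES) * CHUNK_BYTES;

	assert(dev_bytes > 0);
	return (dev_bytes + bytes_per_page - 1) / bytes_per_page;
}

/* Idle: only the per-page bookkeeping slots exist. */
static uint64_t idle_bytes(uint64_t dev_bytes)
{
	return tracking_pages(dev_bytes) * SLOT_BYTES;
}

/* Peak: every tracking page has been allocated as well. */
static uint64_t peak_bytes(uint64_t dev_bytes)
{
	return tracking_pages(dev_bytes) * (SLOT_BYTES + PAGE_BYTES);
}
```

For 1 TB this gives 16 pages, 256 B idle and 65,792 B (64.25 KB) peak,
matching the Global Lock and Page Lock columns. The Atomic Counters
column uses a different per-page layout and is not modelled here.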
Performance Benchmarks (FIO):
Setup:
- CPU: 32 vCPU, Intel Cascade Lake x86_64 (n2-standard-32 GCP VM)
- Memory: 128 GB RAM
- OS: Linux 6.12.68 (Google COS)
- Storage: Google Cloud Extreme PD (1000 GB) + Local SSD (375 GB)
FIO config:
rw=randrw, bs=(R) 4096B-4096B, (W) 128KiB-128KiB, (T) 128KiB-128KiB,
ioengine=libaio, iodepth=32 - 16 jobs
10 GB file workload (1 tracking page active)
Metric        | Unpatched | Global Lock | Page Lock | Atomic   | Analysis
Read IOPS     | 20,977    | 20,897      | 20,900    | 20,886   | Almost flat
Write IOPS    | 8,993     | 8,956       | 8,956     | 8,951    | Almost flat
Total IOPS    | 29,971    | 29,853      | 29,856    | 29,836   | Almost flat
Avg R Lat     | 16.31 ms  | 16.66 ms    | 16.66 ms  | 16.68 ms | Almost flat
Avg W Lat     | 18.94 ms  | 18.33 ms    | 18.34 ms  | 18.34 ms | Almost flat
Kernel CPU %  | 63.02     | 77.94       | 72.88     | 75.41    | Overhead
Avg Ker CPU % | 3.94      | 4.87        | 4.56      | 4.71     | Overhead
NVMe Util %   | 66.89     | 0.62        | 0.62      | 0.63     | Offload success
320 GB file workload (5 tracking pages active)
Metric        | Unpatched | Global Lock | Page Lock | Atomic   | Analysis
Read IOPS     | 20,974    | 20,898      | 20,897    | 20,898   | Almost flat
Write IOPS    | 8,988     | 8,957       | 8,956     | 8,956    | Almost flat
Total IOPS    | 29,963    | 29,855      | 29,853    | 29,854   | Almost flat
Avg R Lat     | 16.54 ms  | 16.76 ms    | 16.78 ms  | 16.76 ms | Almost flat
Avg W Lat     | 18.41 ms  | 18.11 ms    | 18.07 ms  | 18.10 ms | Almost flat
Kernel CPU %  | 70.39     | 70.70       | 60.85     | 62.20    | Scales efficiently
Avg Ker CPU % | 4.40      | 4.42        | 3.80      | 3.89     | Scales efficiently
NVMe Util %   | 67.74     | 0.66        | 0.61      | 0.62     | Offload success
Rationale & Analysis:
1. Tracking Overhead: Tracking adds kernel CPU overhead in the smaller
10 GB workload (63.02% unpatched vs. 72.88% to 77.94% patched). Of
the three variants, the chosen page-level lock shows the smallest
increase.
2. Cache Device Offloading: The large drop in NVMe cache utilization
(67% to 0.6%) occurs because the reads are safely bypassing the
cache instead of fetching and attempting to populate the cache with
stale data.
3. Better Scaling Under Load: In the larger 320 GB workload, kernel CPU
for the page-level lock and atomic variants falls below the
unpatched baseline (60.85% and 62.20% vs. 70.39%), so the added
overhead of page tracking is more than compensated for by the saved
cache update operations during reads. The global lock variant stays
roughly level with the baseline (70.70%).
Questions for Coly:
1. If `kzalloc` fails on the fast path in `bch_bypass_write_start()`,
we currently skip incrementing the counter (leaving it untracked).
This opens a tiny edge case (a "stolen increment"): if a concurrent
bio successfully allocates the tracking page slightly later, the
initially untracked bio will blindly decrement that counter in
`bch_bypass_write_end()`. Because of the `> 0` check in
`bch_bypass_write_end()`, this doesn't cause a true integer
underflow, but it does prematurely drop the count to 0 (effectively
stealing the track). `md-bitmap.c` solves similar issues via a
hijacking fallback. Given that memory corruption is prevented by
the `> 0` check, do you think a similar fallback mechanism is
necessary here, or is the rate-limited warning and sysfs tracking
sufficient given how rare `kzalloc` failures are for 4KB pages?
2. We currently use `u16` for the tracking counters. Since standard
block layer queue depths rarely exceed a few thousand, a counter
overflow (65,535 concurrent writes to the exact same 32MB chunk)
seems practically impossible. However, if you'd prefer to be
absolutely defensive against overflows without doubling the memory
footprint to `u32`, I could clamp the counter at `U16_MAX` during
increments. This would safely prevent wrap-around memory leaks
while keeping our current memory efficiency. Would you prefer
leaving it as is, clamping it, or bumping to `u32`?
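
To make the two questions above concrete, here is a small userspace
model of a single chunk counter. It is a sketch under assumptions, not
the patch code: `chunk_model`, `model_start()`, `model_end()` and
`sat_inc_u16()` are hypothetical names, the `alloc_ok` argument stands
in for the result of the page allocation, and there is no locking. The
first part reproduces the "stolen increment" sequence from question 1;
the last function shows what clamping at `U16_MAX` during increments
(question 2) could look like.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one 32 MB chunk's tracking state. */
struct chunk_model {
	bool page_allocated;  /* has the tracking page for this chunk been allocated? */
	uint16_t count;       /* in-flight bypass writes, valid once the page exists */
};

/*
 * Start of a bypass write. alloc_ok simulates whether the allocation of
 * the tracking page would succeed. Returns true if the write was counted.
 */
static bool model_start(struct chunk_model *c, bool alloc_ok)
{
	assert(c);
	if (!c->page_allocated) {
		if (!alloc_ok)
			return false;   /* allocation failed: write goes untracked */
		c->page_allocated = true;
	}
	c->count++;
	return true;
}

/*
 * End of a bypass write. The write does not remember whether it was
 * counted, so it decrements blindly; the "> 0" check only prevents
 * wrap-around, not a premature drop to zero.
 */
static void model_end(struct chunk_model *c)
{
	assert(c);
	if (c->page_allocated && c->count > 0)
		c->count--;
}

/*
 * Question 2: a clamped increment that sticks at UINT16_MAX instead of
 * wrapping to 0. Once saturated, the true count is no longer known.
 */
static uint16_t sat_inc_u16(uint16_t v)
{
	return v == UINT16_MAX ? v : (uint16_t)(v + 1);
}
```

Running write A with a failed allocation, then write B with a
successful one, then A's completion leaves the count at 0 while B is
still in flight. That is the window in which a read miss would no
longer be told to bypass.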
Looking forward to your feedback on the changes.
Ankit Kapoor (1):
bcache: track active bypass writes to prevent stale cache reads
Documentation/admin-guide/bcache.rst | 8 ++
drivers/md/bcache/bcache.h | 35 +++++++
drivers/md/bcache/request.c | 132 +++++++++++++++++++++++++++
drivers/md/bcache/stats.c | 14 +++
drivers/md/bcache/stats.h | 4 +
drivers/md/bcache/super.c | 30 ++++++
drivers/md/bcache/sysfs.c | 5 +
include/trace/events/bcache.h | 5 +
8 files changed, 233 insertions(+)
--
2.54.0.1136.gdb2ca164c4-goog