From: "Coly Li" <colyli@fnnas.com>
To: "Ankit Kapoor" <ankitkap@google.com>
Cc: "Kent Overstreet" <kent.overstreet@linux.dev>,
<linux-bcache@vger.kernel.org>, <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 0/1] bcache: fix stale data race between read cache miss and bypass write
Date: Mon, 25 May 2026 00:12:53 +0800
Message-ID: <ahMi6qmGtJsf1X03@studio.local>
In-Reply-To: <20260521163925.178264-1-ankitkap@google.com>
On Thu, May 21, 2026 at 04:39:24PM +0800, Ankit Kapoor wrote:
Hi Ankit,
From your description and analysis, this looks like a real issue.
Let me look into it more deeply and respond to you later.
Thanks.
Coly Li
> Overview
> --------
> This series addresses a cache inconsistency issue with stale data in bcache
> that arises from a race condition between a read cache miss and a bypass
> write due to congestion or sequential cutoff. The fix involves sequencing
> the btree invalidation of the bypass write to occur strictly after the
> backing device write.
>
> Race Analysis
> -------------
> The following sequence illustrates how stale data is cached after a read
> cache miss when btree invalidation of a bypass write happens in parallel
> with a delayed write to the backing device:
>
>  Write IO Path (Parallel)             Read IO Path
>  ------------------------             ------------
>             |
>   [Btree Invalidation]
>             |
>             |                         [Cache Miss]
>             |                               |
>             |               [Btree Placeholder Key Insertion]
>             |                               |
>     (Delay in writing                       |
>  to the backing device)                     |
>             |             [Cache data from the backing device]
>             |                               |
>             +------------------------------>|  <-- No key collision detected!
>             |              [Btree Placeholder Key Replacement]
>             |                               |
>       [Write to the                         |
>      backing device]                  -------------
>                                       CRITICAL BUG:
>                                  Stale data gets cached
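
The interleaving above can be replayed as a toy model. This is only a sketch, not bcache code: `backing` and `cache` are plain shell variables standing in for the backing device and the cached key, and the function name `race_old_flow` is made up for illustration.

```shell
# Toy model of the old (parallel) flow. Assumption: one 4 KiB block,
# "0" is the old data and "1" is the data being written.
race_old_flow() {
    local backing=0 cache="" read_buf
    cache=""            # write path: btree invalidation runs first
    read_buf=$backing   # read path: cache miss, reads old data from backing device
    cache=$read_buf     # read path: placeholder replaced, no collision seen
    backing=1           # write path: the delayed write finally lands
    echo "backing=$backing cache=$cache"
}

race_old_flow   # prints "backing=1 cache=0"
```

The backing device ends up with the new data while the cache still holds the old block, which is the stale read shown in the reproduction results.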
>
> Reproduction Steps
> ------------------
> The bug can be reliably reproduced by injecting a 5-second delay into
> the backing device write path via dm-delay. Cache mode is set to
> writearound to simulate bypass write.
>
> 1. Data Preparation:
>    # printf -- '%.0s\0' {1..4096} > /tmp/0.txt
>    # printf -- '%.0s\1' {1..4096} > /tmp/1.txt
>    # echo writearound > /sys/block/bcache0/bcache/cache_mode
>    # dd if=/tmp/0.txt of=/media/bcache/data.txt oflag=direct \
>         bs=4096 count=1 conv=notrunc
>
> 2. Race Execution:
>    # dd if=/tmp/1.txt of=/media/bcache/data.txt oflag=direct \
>         bs=4096 count=1 conv=notrunc &
>    # sleep 1
>    # dd if=/media/bcache/data.txt iflag=direct bs=4096 count=1 \
>         status=none | hexdump > ./concurrent-read-result
>    # sleep 10
>    # dd if=/media/bcache/data.txt iflag=direct bs=4096 count=1 \
>         status=none | hexdump > ./second-read-result
>
> 3. Results (Without Patch):
>    # cat second-read-result
>    0000000 0000 0000 0000 0000 0000 0000 0000 0000 # <--- STALE READ
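
The steps above rely on a dm-delay device under the backing device, but the thread does not show how it was created. A minimal sketch of one way to build such a table follows; the helper name `dm_delay_table`, the device path `/dev/sdX`, and the mapping name `delayed-backing` are all placeholders, not taken from the original setup.

```shell
# Build a dm-delay table line: reads pass through undelayed, writes are
# delayed by write_ms milliseconds. Table layout per the dm-delay target:
#   <start> <sectors> delay <dev> <offset> <read_ms> <wdev> <woffset> <write_ms>
dm_delay_table() {
    local dev=$1 sectors=$2 write_ms=$3
    echo "0 $sectors delay $dev 0 0 $dev 0 $write_ms"
}

# Hypothetical usage (requires root; /dev/sdX is a placeholder):
#   dmsetup create delayed-backing \
#       --table "$(dm_delay_table /dev/sdX "$(blockdev --getsz /dev/sdX)" 5000)"
dm_delay_table /dev/sdX 2048 5000
```

The resulting /dev/mapper device would then be used as the bcache backing device, so only the write to the backing device is held back for 5 seconds.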
>
> Proposed Fix
> ------------
> The fix enforces strict total (sequential) order of btree invalidation
> after write to the backing device in a bypass write:
>
>             OLD FLOW                              NEW FLOW
>  -------------------------------      --------------------------------
>          [ Write Start ]                       [ Write Start ]
>                 |                                     |
>         +-------+-------+                             |
>         |               |                             v
>         v               v                       [ Write to ]
>     [ Btree ]     [ Write to ]               [ backing-device ]
> [ Invalidation ] [ backing-device]                    |
>         |               |                             v
>         +-------+-------+                         [ Btree ]
>                 |                             [ Invalidation ]
>                 v                                     |
>           [ Write End ]                               v
>                                                 [ Write End ]
>
> Enforcing this sequential execution ensures that either:
> 1. Stale data cached by a racing read is subsequently invalidated by
>    the deferred write invalidation, or
> 2. The write invalidation executes first, so that the read path's
>    placeholder key replacement properly catches the collision.
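
Both outcomes can be checked with the same kind of toy model. Again this is only a sketch under assumptions: the variables and function names are invented, and the placeholder check is a stand-in for the key collision detection in the read path, not the actual btree code.

```shell
# Case 1: the read caches stale data, then the write completes and the
# deferred invalidation removes the stale key.
new_flow_stale_then_invalidate() {
    local backing=0 cache="" read_buf
    cache="placeholder"                               # read: placeholder inserted
    read_buf=$backing                                 # read: old data fetched
    [ "$cache" = "placeholder" ] && cache=$read_buf   # read: stale data cached
    backing=1                                         # write: backing device updated
    cache=""                                          # write: invalidation runs last
    echo "backing=$backing cache=${cache:-none}"
}

# Case 2: the invalidation lands before the read replaces its placeholder,
# so the replacement sees the collision and caches nothing.
new_flow_invalidate_before_replace() {
    local backing=0 cache="" read_buf
    cache="placeholder"                               # read: placeholder inserted
    read_buf=$backing                                 # read: old data fetched
    backing=1                                         # write: backing device updated
    cache=""                                          # write: invalidation drops placeholder
    [ "$cache" = "placeholder" ] && cache=$read_buf   # read: collision, no insert
    echo "backing=$backing cache=${cache:-none}"
}

new_flow_stale_then_invalidate       # prints "backing=1 cache=none"
new_flow_invalidate_before_replace   # prints "backing=1 cache=none"
```

In both orderings the cache ends up empty for that block, so the next read goes to the backing device and returns the new data.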
>
> Failure Handling
> ----------------
> This patch keeps existing error-handling behavior intact. Although
> execution is now sequential, btree invalidation is still triggered
> regardless of whether the write to the backing device succeeds
> or fails.
>
> Verification and Performance
> ----------------------------
> Manual Results (With Patch):
> # cat second-read-result
> 0000000 0101 0101 0101 0101 0101 0101 0101 0101 # <--- CORRECT DATA
>
> Stress Verification:
> FIO was executed under a write-only workload (128 KB Write, libaio,
> iodepth=64, direct=1). Without the patch, FIO reported CRC errors
> due to stale read corruptions; with the patch, zero CRC errors or
> corruptions were reported.
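
The exact fio job is not included in the thread. The sketch below only prints a command line that matches the quoted parameters (128 KB writes, libaio, iodepth=64, direct=1). The job name, the sequential `--rw=write` pattern, the target path, and `--verify=crc32c` are assumptions; the helper `fio_write_cmd` is hypothetical.

```shell
# Print (not run) an fio command line for the workload described above.
# Running it against a real device overwrites that device's data.
fio_write_cmd() {
    local target=$1
    echo "fio --name=bypass-write --filename=$target --rw=write --bs=128k" \
         "--ioengine=libaio --iodepth=64 --direct=1 --verify=crc32c"
}

fio_write_cmd /dev/bcache0
```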
>
> Write-Only Workload (FIO Averages CSV):
> Metric,With Fix,Without Fix,Delta
>
> Write IOPS,1630,1630,0.00%
> Write Bandwidth (MiB/s),204,204,0.00%
> Write Avg Latency (micro second),39219.95,39219.58,0.00%
>
> Test Environment
> ----------------
> - CPU: 1 vCPU, Intel Haswell x86_64 (n1-standard-1 instance)
> - Memory: 3.75 GB RAM
> - OS: Linux 6.12.68 (Google COS)
> - Storage: Google Cloud SSD PD + Local SSD
>
> Ankit Kapoor (1):
> bcache: fix stale data race between read cache miss and bypass write
>
> drivers/md/bcache/request.c | 13 +++++++++++++
> 1 file changed, 13 insertions(+)
>
> --
> 2.54.0.669.g59709faab0-goog
Thread overview (6+ messages):
2026-05-21 16:39 [PATCH 0/1] bcache: fix stale data race between read cache miss and bypass write Ankit Kapoor
2026-05-21 16:39 ` [PATCH 1/1] " Ankit Kapoor
2026-05-25 13:41 ` Coly Li
2026-05-27 13:41 ` Ankit Kapoor
2026-05-27 15:27 ` Coly Li
2026-05-24 16:12 ` Coly Li [this message]