[PATCH 0/1] bcache: fix stale data race between read cache miss and bypass write

* [PATCH 0/1] bcache: fix stale data race between read cache miss and bypass write
From: Ankit Kapoor @ 2026-05-21 16:39 UTC
  To: Coly Li, Kent Overstreet; +Cc: linux-bcache, linux-kernel, Ankit Kapoor

Overview
--------
This series fixes a cache inconsistency in bcache: stale data can be
cached when a read cache miss races with a bypass write (a write that
skips the cache because of congestion or sequential cutoff). The fix
sequences the btree invalidation of the bypass write to occur strictly
after the backing device write.

Race Analysis
-------------
The following sequence illustrates how stale data is cached after a read
cache miss when btree invalidation of a bypass write happens in parallel
with a delayed write to the backing device:

Write IO Path (Parallel)            Read IO Path
------------------------            ------------
           |
 [Btree Invalidation]
           |
           |                      [Cache Miss]
           |                           |
           |                     [Btree Placeholder Key Insertion]
           |                           |
 (Delay in writing                     |
 to the backing device)                |
           |                     [Cache data from the backing device]
           |                           |
           +-------------------------->|  <-- No key collision detected!
           |                      [Btree Placeholder Key Replacement]
           |                           |
    [Write to the                      |
    backing device]                -------------
                                 CRITICAL BUG:
                             Stale data gets cached

Reproduction Steps
------------------
The bug can be reliably reproduced by injecting a 5-second delay into
the backing device write path via dm-delay. Cache mode is set to
writearound to simulate bypass write.

1. Data Preparation:
  # printf -- '%.0s\0' {1..4096} > /tmp/0.txt
  # printf -- '%.0s\1' {1..4096} > /tmp/1.txt
  # echo writearound > /sys/block/bcache0/bcache/cache_mode
  # dd if=/tmp/0.txt of=/media/bcache/data.txt oflag=direct \
    bs=4096 count=1 conv=notrunc

2. Race Execution:
  # dd if=/tmp/1.txt of=/media/bcache/data.txt oflag=direct \
    bs=4096 count=1 conv=notrunc &
  # sleep 1
  # dd if=/media/bcache/data.txt iflag=direct bs=4096 count=1 \
    status=none | hexdump > ./concurrent-read-result
  # sleep 10
  # dd if=/media/bcache/data.txt iflag=direct bs=4096 count=1 \
    status=none | hexdump > ./second-read-result

3. Results (Without Patch):
  # cat second-read-result
  0000000 0000 0000 0000 0000 0000 0000 0000 0000  # <--- STALE READ

Proposed Fix
------------
For a bypass write, the fix enforces a strict order in which the btree
invalidation runs only after the write to the backing device has
completed:

OLD FLOW                              NEW FLOW
-------------------------------       --------------------------------
        [ Write Start ]                       [ Write Start ]
               |                                     |
       +-------+-------+                             |
       |               |                             v
       v               v                    [     Write to   ]
 [    Btree     ] [   Write to    ]         [ backing-device ]
 [ Invalidation ] [ backing-device]                  |
       |               |                             v
       +-------+-------+                    [      Btree     ]
               |                            [  Invalidation  ]
               v                                     |
         [ Write End ]                               v
                                               [ Write End ]

Enforcing this sequential execution ensures that either:
1. The read caches stale data first, and the deferred write
   invalidation then removes it from the cache; or
2. The write invalidation executes first, so the read path's
   subsequent key replacement detects the collision and the stale
   data is not cached.
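
The orderings above can be replayed with a small single-threaded C model
of one sector. This is only a sketch under stated assumptions: the
struct, the key states, and every function name are hypothetical and are
not bcache APIs, and the real paths run concurrently through closures
and the btree rather than as straight-line calls.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of ONE sector. All names are hypothetical, not bcache APIs. */
enum key_state { KEY_NONE, KEY_PLACEHOLDER, KEY_DATA };

struct model {
	int backing;          /* value currently on the backing device */
	enum key_state key;   /* state of the btree key for the sector */
	int cached;           /* cached value, valid when key == KEY_DATA */
	int fetched;          /* value the in-flight read got from backing */
};

/* Bypass write: drop whatever key covers the sector. */
static void write_invalidate(struct model *m) { m->key = KEY_NONE; }

/* Bypass write: the new data reaches the backing device. */
static void write_backing(struct model *m, int v) { m->backing = v; }

/* Read miss, step 1: reserve the range with a placeholder key. */
static void read_insert_placeholder(struct model *m)
{
	assert(m->key == KEY_NONE);
	m->key = KEY_PLACEHOLDER;
}

/* Read miss, step 2: fetch the data from the backing device. */
static void read_fetch(struct model *m) { m->fetched = m->backing; }

/* Read miss, step 3: succeeds only if the placeholder is still there. */
static bool read_replace_placeholder(struct model *m)
{
	if (m->key != KEY_PLACEHOLDER)
		return false;   /* collision: the range was invalidated */
	m->key = KEY_DATA;
	m->cached = m->fetched;
	return true;
}

static bool cache_is_stale(const struct model *m)
{
	return m->key == KEY_DATA && m->cached != m->backing;
}

/* Old flow: invalidation runs before the delayed backing write. */
static bool old_flow_stale(void)
{
	struct model m = { .backing = 0, .key = KEY_NONE };

	write_invalidate(&m);
	read_insert_placeholder(&m);
	read_fetch(&m);
	(void)read_replace_placeholder(&m);
	write_backing(&m, 1);
	return cache_is_stale(&m);
}

/* New flow, case 1: the read finishes before the write invalidates. */
static bool new_flow_read_first_stale(void)
{
	struct model m = { .backing = 0, .key = KEY_NONE };

	read_insert_placeholder(&m);
	read_fetch(&m);
	(void)read_replace_placeholder(&m);
	write_backing(&m, 1);
	write_invalidate(&m);
	return cache_is_stale(&m);
}

/* New flow, case 2: invalidation lands before the key replacement. */
static bool new_flow_invalidate_first_stale(void)
{
	struct model m = { .backing = 0, .key = KEY_NONE };

	read_insert_placeholder(&m);
	read_fetch(&m);
	write_backing(&m, 1);
	write_invalidate(&m);
	(void)read_replace_placeholder(&m);
	return cache_is_stale(&m);
}
```

In this model old_flow_stale() returns true, while both new-flow
functions return false: the stale value is either invalidated after it
is cached or never inserted because the placeholder is gone.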

Failure Handling
----------------
This patch keeps existing error-handling behavior intact. Although
execution is now sequential, btree invalidation is still triggered
regardless of whether the write to the backing device succeeds
or fails.

Verification and Performance
----------------------------
Manual Results (With Patch):
  # cat second-read-result
  0000000 0101 0101 0101 0101 0101 0101 0101 0101  # <--- CORRECT DATA

Stress Verification:
FIO was executed under a write-only workload (128 KB Write, libaio,
iodepth=64, direct=1). Without the patch, FIO reported CRC errors
due to stale read corruptions; with the patch, zero CRC errors or
corruptions were reported.

Write-Only Workload (FIO averages):

Metric                    With Fix   Without Fix   Delta
Write IOPS                1630       1630          0.00%
Write Bandwidth (MiB/s)   204        204           0.00%
Write Avg Latency (us)    39219.95   39219.58      0.00%

Test Environment
----------------
- CPU: 1 vCPU, Intel Haswell x86_64 (n1-standard-1 instance)
- Memory: 3.75 GB RAM
- OS: Linux 6.12.68 (Google COS)
- Storage: Google Cloud SSD PD + Local SSD

Ankit Kapoor (1):
  bcache: fix stale data race between read cache miss and bypass write

 drivers/md/bcache/request.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

-- 
2.54.0.669.g59709faab0-goog



* [PATCH 1/1] bcache: fix stale data race between read cache miss and bypass write
From: Ankit Kapoor @ 2026-05-21 16:39 UTC
  To: Coly Li, Kent Overstreet; +Cc: linux-bcache, linux-kernel, Ankit Kapoor

A race condition exists between a read cache miss and a bypass write
(a write that bypasses the cache because of congestion or sequential
cutoff) targeting the same sectors. If the read cache miss fetches
data from the backing device before the bypass write reaches the
backing device, stale data populates the cache.

The root cause is that bcache currently executes btree key
invalidation in parallel with (or prior to) writing the actual data
payload to the backing device. If the invalidation completes first, a
concurrent read can register a cache miss, insert a placeholder key,
and fetch the old data from the backing device before the write lands.
Because the invalidation has already run, the read's subsequent key
replacement does not detect a collision, and the stale data persists
in the cache.

Fix this by deferring the btree key invalidation until after the
backing device write completes, whether it succeeds or fails.
Enforcing this sequential execution ensures that a stale read is
always detected and invalidated.

Signed-off-by: Ankit Kapoor <ankitkap@google.com>
---
 drivers/md/bcache/request.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index af345dc6fde1..ef2cf55df3bb 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -978,6 +978,14 @@ static CLOSURE_CALLBACK(cached_dev_write_complete)
 	cached_dev_bio_complete(&cl->work);
 }

+static CLOSURE_CALLBACK(backing_device_bypass_write_complete)
+{
+	closure_type(s, struct search, cl);
+
+	closure_call(&s->iop.cl, bch_data_insert, NULL, cl);
+	continue_at(cl, cached_dev_write_complete, NULL);
+}
+
 static void cached_dev_write(struct cached_dev *dc, struct search *s)
 {
 	struct closure *cl = &s->cl;
@@ -1058,6 +1066,11 @@ static void cached_dev_write(struct cached_dev *dc, struct search *s)
 	}

 insert_data:
+	if (s->iop.bypass) {
+		continue_at(cl, backing_device_bypass_write_complete, NULL);
+		return;
+	}
+
 	closure_call(&s->iop.cl, bch_data_insert, NULL, cl);
 	continue_at(cl, cached_dev_write_complete, NULL);
 }
-- 
2.54.0.669.g59709faab0-goog


* Re: [PATCH 0/1] bcache: fix stale data race between read cache miss and bypass write
From: Coly Li @ 2026-05-24 16:12 UTC
  To: Ankit Kapoor; +Cc: Kent Overstreet, linux-bcache, linux-kernel

On Thu, May 21, 2026 at 04:39:24PM +0800, Ankit Kapoor wrote:

Hi Ankit,

From your description and analysis, this looks like a real issue.
Let me study it in more depth and respond to you later.

Thanks.

Coly Li


* Re: [PATCH 1/1] bcache: fix stale data race between read cache miss and bypass write
From: Coly Li @ 2026-05-25 13:41 UTC
  To: Ankit Kapoor; +Cc: Kent Overstreet, linux-bcache, linux-kernel

Hi Ankit,

Yes, I confirm this is an issue that must be solved. Nice catch!

On Thu, May 21, 2026 at 04:39:25PM +0800, Ankit Kapoor wrote:
> A race condition exists between a read cache miss and a bypass write
> due to either congestion or sequential bypass, that causes stale data
> to be cached when the read cache miss runs concurrently with a bypass
> write targeting the same sectors. If the read cache miss fetches data
> from the backing device before the write to the backing device,
> stale data populates the cache.
> 
> The root cause is that bcache currently executes btree key
> invalidation in parallel with (or prior to) writing the actual data
> payload to the backing device. Under this sequence, a concurrent
> read path can register a cache miss and insert a placeholder key.
> If the write's btree key invalidation completes before the read finishes
> fetching old data from the backing device, the read's subsequent
> key replacement will not detect a collision, allowing stale data
> to persist in the cache.
> 
> Fix this by deferring the btree key invalidation until after the
> backing device write completes successfully. Enforcing this
> sequential execution ensures that a stale read is always detected
> and invalidated.
>

This patch fixes the stale data issue at run time, but if a power
failure happens inside the race window, the stale data is still in the
cache after the next boot and will be returned by subsequent read hits.

Also, your fix invalidates the key only after the on-disk bio has
completed, which makes the stale data window for a power failure longer.

To close the stale data race for both the run-time and the power-failure
case, could you please consider the following proposal?

Maintain a data structure that holds the invalidation ranges of all
in-flight bypass writes: record/insert the invalidation range before
bch_data_insert(), and clear/remove it after
cached_dev_write_complete().

For a cache-miss read, if any invalidation range has a non-zero
refcount, check all of those ranges; if any of them overlaps the
cache-miss read range, do NOT insert the missing bkey back into the
btree, and only read the data from the backing device.

Here you need to design a data structure that is efficient in both
performance and memory consumption. I would suggest maintaining chunk
refcounts that cover 32MB ranges on the cache device (the current max
key size, if I remember correctly). You may look at how md raid
maintains the legacy bitmap refcounts; I hope that code gives you some
hints.
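
As a rough illustration of the proposal above, the following userspace C
sketch keeps one counter per 32MB chunk and refuses to refill the cache
for a read that touches a chunk with an in-flight bypass write. It is a
sketch under assumptions, not bcache code: the names (bypass_tracker,
bypass_write_start, bypass_write_end, read_may_refill_cache), the fixed
device size, and the 512-byte sector size are all made up for the
example, and it is single-threaded, so a kernel version would also need
atomic counters or locking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names and sizes for illustration; not bcache code. */
#define CHUNK_SECTORS_SHIFT 16	/* 2^16 sectors * 512 B = 32 MiB */
#define NR_CHUNKS 1024		/* enough for a 32 GiB device */

struct bypass_tracker {
	uint16_t refcount[NR_CHUNKS];	/* in-flight bypass writes per chunk */
};

/* Record a bypass write before its invalidation and backing write start. */
static void bypass_write_start(struct bypass_tracker *t,
			       uint64_t sector, uint64_t nr_sectors)
{
	uint64_t last = (sector + nr_sectors - 1) >> CHUNK_SECTORS_SHIFT;

	assert(nr_sectors > 0 && last < NR_CHUNKS);
	for (uint64_t c = sector >> CHUNK_SECTORS_SHIFT; c <= last; c++) {
		assert(t->refcount[c] < UINT16_MAX);
		t->refcount[c]++;
	}
}

/* Drop the record once the bypass write has fully completed. */
static void bypass_write_end(struct bypass_tracker *t,
			     uint64_t sector, uint64_t nr_sectors)
{
	uint64_t last = (sector + nr_sectors - 1) >> CHUNK_SECTORS_SHIFT;

	assert(nr_sectors > 0 && last < NR_CHUNKS);
	for (uint64_t c = sector >> CHUNK_SECTORS_SHIFT; c <= last; c++) {
		assert(t->refcount[c] > 0);
		t->refcount[c]--;
	}
}

/* A cache-miss read may insert its key only if no touched chunk is busy. */
static bool read_may_refill_cache(const struct bypass_tracker *t,
				  uint64_t sector, uint64_t nr_sectors)
{
	uint64_t last = (sector + nr_sectors - 1) >> CHUNK_SECTORS_SHIFT;

	assert(nr_sectors > 0 && last < NR_CHUNKS);
	for (uint64_t c = sector >> CHUNK_SECTORS_SHIFT; c <= last; c++)
		if (t->refcount[c] != 0)
			return false;
	return true;
}
```

The check is deliberately coarse: a read anywhere in a busy 32MB chunk
skips the cache refill even if it does not overlap the write exactly,
which trades a few lost refills for a constant-time lookup per chunk.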

Finally, thank you for reporting this issue. Nice catch!

Coly Li


* Re: [PATCH 1/1] bcache: fix stale data race between read cache miss and bypass write
From: Ankit Kapoor @ 2026-05-27 13:41 UTC
  To: colyli; +Cc: ankitkap, kent.overstreet, linux-bcache, linux-kernel

Hi Coly,

Thank you for the feedback, for confirming the issue, and for the guidance.

> Hi Ankit,
> 
> Yes, I confirm this is an issue that must be solved. Nice catch!
> 
> On Thu, May 21, 2026 at 04:39:25PM +0800, Ankit Kapoor wrote:
>> A race condition exists between a read cache miss and a bypass write
>> due to either congestion or sequential bypass, that causes stale data
>> to be cached when the read cache miss runs concurrently with a bypass
>> write targeting the same sectors.
> 
> This patch fixes the stale data issue at run time, but if a power
> failure happens inside the race window, the stale data is still in the
> cache after the next boot and will be returned by subsequent read hits.
> 
> Also, your fix invalidates the key only after the on-disk bio has
> completed, which makes the stale data window for a power failure longer.

While I initially hoped that serializing the operations would suffice, I
completely agree with your point about the power-failure risk, which
needs to be addressed.

> To close the stale data race for both the run-time and the power-failure
> case, could you please consider the following proposal?
> 
> Maintain a data structure that holds the invalidation ranges of all
> in-flight bypass writes: record/insert the invalidation range before
> bch_data_insert(), and clear/remove it after
> cached_dev_write_complete().
> 
> For a cache-miss read, if any invalidation range has a non-zero
> refcount, check all of those ranges; if any of them overlaps the
> cache-miss read range, do NOT insert the missing bkey back into the
> btree, and only read the data from the backing device.

I am now working on a new implementation to track the in-flight 
sectors currently being written, exactly as you suggested here.

> Here you need to design a data structure that is efficient in both
> performance and memory consumption. I would suggest maintaining chunk
> refcounts that cover 32MB ranges on the cache device (the current max
> key size, if I remember correctly). You may look at how md raid
> maintains the legacy bitmap refcounts; I hope that code gives you some
> hints.

Thanks, I will look into the md raid legacy bitmap reference
implementation for hints. In the meantime, could you please recommend
any specific fio configurations or workloads you prefer for evaluating
the memory overhead and performance impact of this change?

I will send a v2 patch series as soon as the tracking mechanism is ready
and thoroughly tested.

Best regards,
Ankit Kapoor


* Re: [PATCH 1/1] bcache: fix stale data race between read cache miss and bypass write
From: Coly Li @ 2026-05-27 15:27 UTC
  To: Ankit Kapoor; +Cc: kent.overstreet, linux-bcache, linux-kernel

> On May 27, 2026, at 21:41, Ankit Kapoor <ankitkap@google.com> wrote:
> 
> Thanks, I will look into the md raid legacy bitmap reference implementation for
> hints. In the meantime, could you please recommend any specific fio
> configurations or workloads you prefer for evaluating the memory
> overhead and performance impact of this change?

Maybe you can use a large, fast SSD as the backing device and do fully
random I/O in writearound mode. Then try to set up the race window so
that the in-memory refcounts occupy more memory.

I don't suggest using a tree-like structure. Just use a refcount to
cover each 32MB range on the backing device; it can be faster. If a
cache-miss read overlaps a range covered by a non-zero refcount, change
it to a read that does not refill the cache. To keep the refcounts from
occupying too much memory, if all refcounts in a page are zero, you may
consider releasing that page. This is what I meant when I mentioned how
md bitmap manages its pages of bits. Maybe the idea helps a little.
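
The page-release idea above could look roughly like the following
userspace C sketch: counters live in lazily allocated pages, each page
tracks how many of its counters are non-zero, and a page is freed when
that number drops to zero. Everything here is an assumption made for the
example (the names paged_refcounts, chunk_get, chunk_put, chunk_busy,
the page geometry, and the use of calloc/free); it is only loosely
inspired by the description of md bitmap above, is not copied from it,
and leaves out all locking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical geometry for illustration; not bcache or md code. */
#define COUNTERS_PER_PAGE 2048	/* 2048 16-bit counters = 4 KiB of counters */
#define NR_PAGES 64

struct refcount_page {
	uint32_t in_use;			/* number of non-zero counters */
	uint16_t count[COUNTERS_PER_PAGE];	/* one counter per 32MB chunk */
};

struct paged_refcounts {
	struct refcount_page *pages[NR_PAGES];	/* NULL: all counters zero */
};

/* Take a reference on a chunk; returns false if allocation fails. */
static bool chunk_get(struct paged_refcounts *r, uint32_t chunk)
{
	uint32_t p = chunk / COUNTERS_PER_PAGE;
	uint32_t i = chunk % COUNTERS_PER_PAGE;

	assert(p < NR_PAGES);
	if (!r->pages[p]) {
		r->pages[p] = calloc(1, sizeof(*r->pages[p]));
		if (!r->pages[p])
			return false;
	}
	assert(r->pages[p]->count[i] < UINT16_MAX);
	if (r->pages[p]->count[i]++ == 0)
		r->pages[p]->in_use++;
	return true;
}

/* Drop a reference; free the page once every counter in it is zero. */
static void chunk_put(struct paged_refcounts *r, uint32_t chunk)
{
	uint32_t p = chunk / COUNTERS_PER_PAGE;
	uint32_t i = chunk % COUNTERS_PER_PAGE;
	struct refcount_page *pg;

	assert(p < NR_PAGES);
	pg = r->pages[p];
	assert(pg && pg->count[i] > 0);
	if (--pg->count[i] == 0 && --pg->in_use == 0) {
		free(pg);
		r->pages[p] = NULL;
	}
}

/* True if the chunk has at least one in-flight bypass write. */
static bool chunk_busy(const struct paged_refcounts *r, uint32_t chunk)
{
	uint32_t p = chunk / COUNTERS_PER_PAGE;

	assert(p < NR_PAGES);
	return r->pages[p] &&
	       r->pages[p]->count[chunk % COUNTERS_PER_PAGE] != 0;
}
```

With this layout an idle device costs only the pointer table, and a
missing page answers chunk_busy() without touching any counter memory.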


> 
> I will send a v2 patch series as soon as the tracking mechanism is ready
> and thoroughly tested.

Thank you for catching this issue and working on the fix.

Coly Li

