Linux bcache driver list
From: "Coly Li" <colyli@fnnas.com>
To: "Ankit Kapoor" <ankitkap@google.com>
Cc: "Kent Overstreet" <kent.overstreet@linux.dev>,
	 <linux-bcache@vger.kernel.org>, <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 0/1] bcache: fix stale data race between read cache miss and bypass write
Date: Mon, 25 May 2026 00:12:53 +0800
Message-ID: <ahMi6qmGtJsf1X03@studio.local>
In-Reply-To: <20260521163925.178264-1-ankitkap@google.com>

On Thu, May 21, 2026 at 04:39:24PM +0800, Ankit Kapoor wrote:

Hi Ankit,

From your description and analysis, this looks like a real issue.
Let me look into it more deeply and respond to you later.

Thanks.

Coly Li

> Overview
> --------
> This series fixes a cache inconsistency in bcache: stale data can be
> cached when a read cache miss races with a bypass write (a write that
> skips the cache because of congestion or the sequential cutoff). The fix
> sequences the bypass write's btree invalidation to occur strictly after
> the write to the backing device.
> 
> Race Analysis
> -------------
> The following sequence illustrates how stale data is cached after a read
> cache miss when btree invalidation of a bypass write happens in parallel
> with a delayed write to the backing device:
> 
> Write IO Path (Parallel)            Read IO Path
> ------------------------            ------------
>            |
>  [Btree Invalidation]
>            |
>            |                      [Cache Miss]
>            |                           |
>            |                     [Btree Placeholder Key Insertion]
>            |                           |
>  (Delay in writing                     |
>  to the backing device)                |
>            |                     [Cache data from the backing device]
>            |                           |
>            +-------------------------->|  <-- No key collision detected!
>            |                      [Btree Placeholder Key Replacement]
>            |                           |
>     [Write to the                      |
>     backing device]                -------------
>                                  CRITICAL BUG:
>                              Stale data gets cached
> 
> Reproduction Steps
> ------------------
> The bug can be reliably reproduced by injecting a 5-second delay into
> the backing device write path via dm-delay. Cache mode is set to
> writearound to simulate bypass write.
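
A minimal sketch of how a dm-delay table for such a setup might be built,
assuming the delay should apply to writes only. The function name
delay_table, the device path /dev/sdX and the mapping name delayed-backing
are hypothetical and not taken from the report; the table layout follows
the dm-delay target's six-argument form (read device, offset, delay in ms,
then write device, offset, delay in ms).

```shell
# delay_table DEV SECTORS WRITE_DELAY_MS
# Print a dm-delay table that passes reads through undelayed and delays
# writes to DEV by WRITE_DELAY_MS milliseconds.
delay_table() {
    dev=$1
    sectors=$2
    write_delay_ms=$3
    # <start> <len> delay <rdev> <roffset> <rdelay> <wdev> <woffset> <wdelay>
    echo "0 $sectors delay $dev 0 0 $dev 0 $write_delay_ms"
}

# Hypothetical usage (needs root; /dev/sdX is a placeholder device):
#   delay_table /dev/sdX "$(blockdev --getsz /dev/sdX)" 5000 |
#       dmsetup create delayed-backing
```

The resulting /dev/mapper device would presumably stand in for the backing
device when bcache is set up; the report does not say how this was done.
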
> 
> 1. Data Preparation:
>   # printf -- '%.0s\0' {1..4096} > /tmp/0.txt
>   # printf -- '%.0s\1' {1..4096} > /tmp/1.txt
>   # echo writearound > /sys/block/bcache0/bcache/cache_mode
>   # dd if=/tmp/0.txt of=/media/bcache/data.txt oflag=direct \
>     bs=4096 count=1 conv=notrunc
> 
> 2. Race Execution:
>   # dd if=/tmp/1.txt of=/media/bcache/data.txt oflag=direct \
>     bs=4096 count=1 conv=notrunc &
>   # sleep 1
>   # dd if=/media/bcache/data.txt iflag=direct bs=4096 count=1 \
>     status=none | hexdump > ./concurrent-read-result
>   # sleep 10
>   # dd if=/media/bcache/data.txt iflag=direct bs=4096 count=1 \
>     status=none | hexdump > ./second-read-result
> 
> 3. Results (Without Patch):
>   # cat second-read-result
>   0000000 0000 0000 0000 0000 0000 0000 0000 0000  # <--- STALE READ
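
The manual comparison of the hexdump output could also be scripted. A
small sketch, assuming the raw 4096-byte block is saved to a file instead
of being piped through hexdump; classify_block and the file name in the
usage comment are hypothetical.

```shell
# classify_block FILE
# Print "correct" if FILE is 4096 bytes of 0x01 (the pattern written during
# the race), "stale" if it is 4096 bytes of 0x00 (the pattern written during
# data preparation), and "unexpected" otherwise.
classify_block() {
    size=$(($(wc -c < "$1")))
    not_ones=$(($(tr -d '\001' < "$1" | wc -c)))    # bytes other than 0x01
    not_zeros=$(($(tr -d '\000' < "$1" | wc -c)))   # bytes other than 0x00
    if [ "$size" -ne 4096 ]; then
        echo unexpected
    elif [ "$not_ones" -eq 0 ]; then
        echo correct
    elif [ "$not_zeros" -eq 0 ]; then
        echo stale
    else
        echo unexpected
    fi
}

# Hypothetical usage after the race has been run:
#   dd if=/media/bcache/data.txt iflag=direct bs=4096 count=1 \
#       status=none > ./second-read.bin
#   classify_block ./second-read.bin
```
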
> 
> Proposed Fix
> ------------
> For a bypass write, the fix enforces a strict sequential order: btree
> invalidation runs only after the write to the backing device has finished:
> 
> OLD FLOW                                          NEW FLOW
> -------------------------------       --------------------------------
>         [ Write Start ]                       [ Write Start ]
>                |                                     |
>        +-------+-------+                             |
>        |               |                             v
>        v               v                    [     Write to   ]
>  [    Btree     ] [   Write to    ]         [ backing-device ]
>  [ Invalidation ] [ backing-device]                  |
>        |               |                             v
>        +-------+-------+                    [      Btree     ]
>                |                            [  Invalidation  ]
>                v                                     |
>          [ Write End ]                               v
>                                                [ Write End ]
> 
> Enforcing this sequential execution ensures that either:
> 1. A read that cached stale data is followed by the deferred write
>    invalidation, which removes the stale key.
> 2. The write invalidation executes first, so the read path's subsequent
>    key replacement catches the collision.
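
The ordering argument can be checked with a toy model. This is only a
sketch under simplifying assumptions: one block, a single cache slot, and
a key replacement that succeeds only while the placeholder is still
present. The function simulate and its step names are hypothetical and do
not correspond to bcache functions.

```shell
# simulate STEP...: apply the given steps in order, print the final state.
#   backing: data on the backing device (0 = old, 1 = new)
#   cache:   none | placeholder | <cached data>
simulate() {
    backing=0
    cache=none
    read_buf=
    for step in "$@"; do
        case $step in
            invalidate)    cache=none ;;          # write path: drop any key
            write_backing) backing=1 ;;           # write path: data hits device
            miss)          cache=placeholder ;;   # read path: placeholder key
            fetch)         read_buf=$backing ;;   # read path: read from device
            replace)
                # read path: placeholder -> data, only if it survived
                if [ "$cache" = placeholder ]; then
                    cache=$read_buf
                fi ;;
        esac
    done
    echo "backing=$backing cache=$cache"
}

# Old flow, interleaved as in the race diagram: stale data stays cached.
simulate invalidate miss fetch replace write_backing   # backing=1 cache=0

# New flow, case 1: the read caches stale data, invalidation then removes it.
simulate miss fetch write_backing replace invalidate   # backing=1 cache=none

# New flow, case 2: invalidation lands first, replacement finds no placeholder.
simulate miss fetch write_backing invalidate replace   # backing=1 cache=none
```

In this model no interleaving that keeps write_backing before invalidate
ends with old data in the cache while the backing device holds new data.
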
> 
> Failure Handling
> ----------------
> This patch keeps existing error-handling behavior intact. Although
> execution is now sequential, btree invalidation is still triggered
> regardless of whether the write to the backing device succeeds
> or fails.
> 
> Verification and Performance
> ----------------------------
> Manual Results (With Patch):
>   # cat second-read-result
>   0000000 0101 0101 0101 0101 0101 0101 0101 0101  # <--- CORRECT DATA
> 
> Stress Verification:
> FIO was executed under a write-only workload (128 KB Write, libaio,
> iodepth=64, direct=1). Without the patch, FIO reported CRC errors
> due to stale read corruptions; with the patch, zero CRC errors or
> corruptions were reported.
> 
> Write-Only Workload (FIO averages):
>
> Metric                    With Fix   Without Fix   Delta
> Write IOPS                1630       1630          0.00%
> Write Bandwidth (MiB/s)   204        204           0.00%
> Write Avg Latency (us)    39219.95   39219.58      0.00%
> 
> Test Environment
> ----------------
> - CPU: 1 vCPU, Intel Haswell x86_64 (n1-standard-1 instance)
> - Memory: 3.75 GB RAM
> - OS: Linux 6.12.68 (Google COS)
> - Storage: Google Cloud SSD PD + Local SSD
> 
> Ankit Kapoor (1):
>   bcache: fix stale data race between read cache miss and bypass write
> 
>  drivers/md/bcache/request.c | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
> 
> -- 
> 2.54.0.669.g59709faab0-goog

