Date: Wed, 17 Jun 2026 10:33:55 +0000
X-Mailing-List: linux-bcache@vger.kernel.org
Mime-Version: 1.0
X-Mailer: git-send-email 2.54.0.1136.gdb2ca164c4-goog
Message-ID: <20260617103356.3287775-1-ankitkap@google.com>
Subject: [PATCH v2 0/1]
 bcache: track active bypass writes to prevent stale cache reads
From: Ankit Kapoor
To: linux-bcache@vger.kernel.org
Cc: colyli@fygo.io, kent.overstreet@linux.dev, linux-kernel@vger.kernel.org, Ankit Kapoor
Content-Type: text/plain; charset="UTF-8"

Hi Coly,

This is v2 of the patch to fix a race condition between read cache
misses and bypass writes.

Changes in v2:

Instead of deferring key invalidation, we now explicitly track active
bypass writes using dynamically allocated pages (modeled closely after
md-bitmap.c, as suggested by Coly). We use this tracking information
during cache miss reads to determine whether the read must also bypass
the cache.

Active bypass writes are tracked by dividing the backing device space
into 32 MB chunks and maintaining a refcount of concurrent writes for
each chunk. The memory overhead is minimal: in the chosen approach, a
single 4 KB page covers 64 GB of backing device space.

Implementation Approaches Evaluated:

When designing this, we evaluated three synchronization approaches:

1. Global Spinlock
2. Page-level Spinlock (Chosen Approach)
3. Atomic Counters (Lockless)

Memory Consumption:

Idle Memory Usage (No active bypass writes)

Backing Disk Size | Global Lock | Page Lock | Atomic Counters
1 TB              | 256 B       | 256 B     | 1 KB
10 TB             | 2.5 KB      | 2.5 KB    | 10 KB
100 TB            | 25 KB       | 25 KB     | 100 KB

Peak Memory Usage (Tracking pages fully allocated)

Backing Disk Size | Global Lock | Page Lock | Atomic Counters
1 TB              | 64.25 KB    | 64.25 KB  | 257 KB
10 TB             | 642.5 KB    | 642.5 KB  | 2.51 MB
100 TB            | 6.27 MB     | 6.27 MB   | 25.10 MB

Performance Benchmarks (FIO):

Setup:
- CPU: 32 vCPU, Intel Cascade Lake x86_64 (n2-standard-32 GCP VM)
- Memory: 128 GB RAM
- OS: Linux 6.12.68 (Google COS)
- Storage: Google Cloud Extreme PD (1000 GB) + Local SSD (375 GB)

FIO config:
- rw=randrw, bs=(R) 4096B-4096B, (W) 128KiB-128KiB, (T) 128KiB-128KiB,
  ioengine=libaio, iodepth=32
- 16 jobs

10 GB file workload (1 tracking page active)

Metric        | Unpatched | Global Lock | Page Lock | Atomic   | Analysis
Read IOPS     | 20,977    | 20,897      | 20,900    | 20,886   | Almost flat
Write IOPS    | 8,993     | 8,956       | 8,956     | 8,951    | Almost flat
Total IOPS    | 29,971    | 29,853      | 29,856    | 29,836   | Almost flat
Avg R Lat     | 16.31 ms  | 16.66 ms    | 16.66 ms  | 16.68 ms | Almost flat
Avg W Lat     | 18.94 ms  | 18.33 ms    | 18.34 ms  | 18.34 ms | Almost flat
Kernel CPU %  | 63.02     | 77.94       | 72.88     | 75.41    | Overhead
Avg Ker CPU % | 3.94      | 4.87        | 4.56      | 4.71     | Overhead
NVMe Util %   | 66.89     | 0.62        | 0.62      | 0.63     | Offload success

320 GB file workload (5 tracking pages active)

Metric        | Unpatched | Global Lock | Page Lock | Atomic   | Analysis
Read IOPS     | 20,974    | 20,898      | 20,897    | 20,898   | Almost flat
Write IOPS    | 8,988     | 8,957       | 8,956     | 8,956    | Almost flat
Total IOPS    | 29,963    | 29,855      | 29,853    | 29,854   | Almost flat
Avg R Lat     | 16.54 ms  | 16.76 ms    | 16.78 ms  | 16.76 ms | Almost flat
Avg W Lat     | 18.41 ms  | 18.11 ms    | 18.07 ms  | 18.10 ms | Almost flat
Kernel CPU %  | 70.39     | 70.70       | 60.85     | 62.20    | Scaled eff.
Avg Ker CPU % | 4.40      | 4.42        | 3.80      | 3.89     | Scaled eff.
NVMe Util %   | 67.74     | 0.66        | 0.61      | 0.62     | Offload success

Rationale & Analysis:

1. Tracking Overhead: Implementing tracking inherently adds CPU
   overhead for smaller workloads (10 GB: kernel CPU 63.02% unpatched
   vs. 72.88% to 77.94% patched). The chosen page-level lock has the
   smallest penalty of the three approaches.

2. Cache Device Offloading: The massive drop in NVMe cache utilization
   (67% to 0.6%) occurs because the reads safely bypass the cache
   instead of fetching stale data and attempting to populate the cache
   with it.

3. Superior Scaling Under Load: For larger files (320 GB), the added
   overhead of page tracking is more than compensated for by the cache
   update operations saved during reads (kernel CPU 70.39% unpatched
   vs. 60.85% with the page-level lock).

Questions for Coly:

1. If `kzalloc` fails on the fast path in `bch_bypass_write_start()`,
   we currently skip incrementing the counter, leaving the write
   untracked. This opens a tiny edge case (a "stolen increment"): if a
   concurrent bio successfully allocates the tracking page slightly
   later, the initially untracked bio will blindly decrement that
   counter in `bch_bypass_write_end()`. Because of the `> 0` check in
   `bch_bypass_write_end()`, this doesn't cause a true integer
   underflow, but it does prematurely drop the count to 0 (effectively
   stealing the other bio's increment). `md-bitmap.c` solves similar
   issues via a hijacking fallback. Given that memory corruption is
   prevented by the `> 0` check, do you think a similar fallback
   mechanism is necessary here, or are the rate-limited warning and
   sysfs tracking sufficient given how rare `kzalloc` failures are for
   4 KB pages?

2. We currently use `u16` for the tracking counters. Since standard
   block layer queue depths rarely exceed a few thousand, a counter
   overflow (65,535 concurrent writes to the exact same 32 MB chunk)
   seems practically impossible. However, if you'd prefer to be
   absolutely defensive against overflows without doubling the memory
   footprint with `u32`, I could clamp the counter at `U16_MAX` during
   increments. This would safely prevent wrap-around memory leaks while
   keeping our current memory efficiency. Would you prefer leaving it
   as is, clamping it, or bumping to `u32`?
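
For reference, the chunk arithmetic and the two counter guards discussed
above can be sketched in plain userspace C. This is only an illustration
under stated assumptions, not code from the patch: every name below
(`chunk_of_sector()`, `counter_inc_clamped()`, and so on) is hypothetical,
and only the 32 MB chunk size, the 4 KB page size, the 16-bit counter
width, and the `> 0` check come from this cover letter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Sizes are from the cover letter; all identifiers are hypothetical. */
#define SKETCH_SECTOR_SHIFT   9      /* 512-byte sectors */
#define SKETCH_CHUNK_SHIFT    25     /* 32 MB chunk = 2^25 bytes */
#define SKETCH_PAGE_SIZE      4096u  /* one tracking page */
#define SKETCH_CTRS_PER_PAGE  (SKETCH_PAGE_SIZE / sizeof(uint16_t))

/* Map a 512-byte sector number to the 32 MB chunk that contains it. */
static inline uint64_t chunk_of_sector(uint64_t sector)
{
	return sector >> (SKETCH_CHUNK_SHIFT - SKETCH_SECTOR_SHIFT);
}

/* Index of the tracking page that holds the counter for a chunk. */
static inline uint64_t page_of_chunk(uint64_t chunk)
{
	return chunk / SKETCH_CTRS_PER_PAGE;
}

/* Index of the counter inside its tracking page. */
static inline unsigned int slot_of_chunk(uint64_t chunk)
{
	return (unsigned int)(chunk % SKETCH_CTRS_PER_PAGE);
}

/* Bytes of backing device space covered by one tracking page. */
static inline uint64_t bytes_per_page(void)
{
	return (uint64_t)SKETCH_CTRS_PER_PAGE << SKETCH_CHUNK_SHIFT;
}

/*
 * Increment that clamps at UINT16_MAX instead of wrapping to 0.
 * Returns false when the counter was already saturated.
 */
static inline bool counter_inc_clamped(uint16_t *ctr)
{
	assert(ctr != NULL);
	if (*ctr == UINT16_MAX)
		return false;
	(*ctr)++;
	return true;
}

/*
 * Decrement guarded by a "> 0" check, so an untracked writer cannot
 * underflow the counter. Returns false when nothing was decremented.
 */
static inline bool counter_dec_guarded(uint16_t *ctr)
{
	assert(ctr != NULL);
	if (*ctr == 0)
		return false;
	(*ctr)--;
	return true;
}
```

With these constants, one page holds 2048 counters and therefore covers
2048 * 32 MB = 64 GB. In terms of this sketch, the "stolen increment" in
question 1 is a `counter_dec_guarded()` call made by a writer whose start
was never counted: the guard prevents underflow, but the call can still
consume an increment that belongs to another in-flight write. How
decrements should behave once a clamped counter has saturated is left
open here.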

Looking forward to your feedback on the changes.

Ankit Kapoor (1):
  bcache: track active bypass writes to prevent stale cache reads

 Documentation/admin-guide/bcache.rst |   8 ++
 drivers/md/bcache/bcache.h           |  35 +++++++
 drivers/md/bcache/request.c          | 132 +++++++++++++++++++++++++++
 drivers/md/bcache/stats.c            |  14 +++
 drivers/md/bcache/stats.h            |   4 +
 drivers/md/bcache/super.c            |  30 ++++++
 drivers/md/bcache/sysfs.c            |   5 +
 include/trace/events/bcache.h        |   5 +
 8 files changed, 233 insertions(+)

-- 
2.54.0.1136.gdb2ca164c4-goog