From: Alison Schofield <alison.schofield@intel.com>
To: Dan Williams <djbw@kernel.org>,
Vishal Verma <vishal.l.verma@intel.com>,
Dave Jiang <dave.jiang@intel.com>, Ira Weiny <iweiny@kernel.org>,
Aboorva Devarajan <aboorvad@linux.ibm.com>
Cc: Alison Schofield <alison.schofield@intel.com>, nvdimm@lists.linux.dev
Subject: [PATCH v3] nvdimm/btt: Handle preemption in BTT lane acquisition
Date: Wed, 13 May 2026 17:23:12 -0700
Message-ID: <20260514002314.65024-1-alison.schofield@intel.com>
BTT lanes serialize access to per-lane metadata and workspace state
during BTT I/O. The btt-check unit test reports data mismatches during
BTT writes due to a race in lane acquisition that can lead to silent
data corruption.
The existing lane model uses a spinlock together with a per-CPU
recursion count. That recursion model stopped being valid after BTT
lanes became preemptible: another task can run on the same CPU,
observe a non-zero recursion count, bypass locking, and use the same
lane concurrently.
BTT lanes are also held across metadata and data updates that can
reach nvdimm_flush(). Some provider flush callbacks can sleep, making
a spinlock the wrong primitive for the lane lifetime. That issue
predates this fix, but becomes more visible now that BTT lanes are
preemptible.
Replace the spinlock with a per-lane mutex, remove the per-CPU
recursion fast path, and take the lane lock unconditionally.
Add might_sleep() to catch any future atomic-context caller.
Found with the ndctl unit test btt-check.sh.
Fixes: 36c75ce3bd29 ("nd_btt: Make BTT lanes preemptible")
Assisted-by: Claude Sonnet 4.5
Signed-off-by: Alison Schofield <alison.schofield@intel.com>
---
Aboorva: I appreciate your Tested-by tag, but given the churn in this
version I did not carry it forward. Please re-test with this version.
Changes in v3:
- Replace spinlock with a per-lane mutex (Aboorva)*
- Rebase onto 7.1-rc1
- Update commit log
*Aboorva pointed out that BTT write-side lane ownership can reach
provider flush callbacks that may sleep, making the existing
spinlock-based lane lifetime invalid. My initial thought was to
create a small series where the first patch converts the per-lane
lock to a mutex so the lane critical section can safely sleep.
That left an intermediate bad state, so the changes are kept
together in this single patch.
Changes in v2:
- Use spin_(un)lock_bh() (Sashiko AI)
- Update commit log per softirq re-entry and spinlock change
A new unit test to stress this is under review here:
https://lore.kernel.org/nvdimm/20260424233633.3762217-1-alison.schofield@intel.com/
drivers/nvdimm/nd.h | 3 +--
drivers/nvdimm/region_devs.c | 50 ++++++++++--------------------------
2 files changed, 14 insertions(+), 39 deletions(-)
diff --git a/drivers/nvdimm/nd.h b/drivers/nvdimm/nd.h
index b199eea3260e..3fbeaddb5b5c 100644
--- a/drivers/nvdimm/nd.h
+++ b/drivers/nvdimm/nd.h
@@ -366,8 +366,7 @@ unsigned sizeof_namespace_label(struct nvdimm_drvdata *ndd);
res; res = next, next = next ? next->sibling : NULL)
struct nd_percpu_lane {
- int count;
- spinlock_t lock;
+ struct mutex lock; /* serialize lane access */
};
enum nd_label_flags {
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index e35c2e18518f..d01b16f6a463 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -904,52 +904,33 @@ void nd_region_advance_seeds(struct nd_region *nd_region, struct device *dev)
* nd_region_acquire_lane - allocate and lock a lane
* @nd_region: region id and number of lanes possible
*
- * A lane correlates to a BLK-data-window and/or a log slot in the BTT.
- * We optimize for the common case where there are 256 lanes, one
- * per-cpu. For larger systems we need to lock to share lanes. For now
- * this implementation assumes the cost of maintaining an allocator for
- * free lanes is on the order of the lock hold time, so it implements a
- * static lane = cpu % num_lanes mapping.
+ * A lane correlates to a log slot in the BTT. Lanes are shared across
+ * CPUs using a static lane = cpu % num_lanes mapping, with a per-lane
+ * mutex to serialize access.
*
- * In the case of a BTT instance on top of a BLK namespace a lane may be
- * acquired recursively. We lock on the first instance.
- *
- * In the case of a BTT instance on top of PMEM, we only acquire a lane
- * for the BTT metadata updates.
+ * Callers must be in sleepable context. The only in-tree caller is
+ * BTT's ->submit_bio handler (btt_read_pg / btt_write_pg).
*/
unsigned int nd_region_acquire_lane(struct nd_region *nd_region)
{
unsigned int cpu, lane;
- migrate_disable();
- cpu = smp_processor_id();
- if (nd_region->num_lanes < nr_cpu_ids) {
- struct nd_percpu_lane *ndl_lock, *ndl_count;
+ might_sleep();
+ cpu = raw_smp_processor_id();
+ if (nd_region->num_lanes < nr_cpu_ids)
lane = cpu % nd_region->num_lanes;
- ndl_count = per_cpu_ptr(nd_region->lane, cpu);
- ndl_lock = per_cpu_ptr(nd_region->lane, lane);
- if (ndl_count->count++ == 0)
- spin_lock(&ndl_lock->lock);
- } else
+ else
lane = cpu;
+ mutex_lock(&per_cpu_ptr(nd_region->lane, lane)->lock);
return lane;
}
EXPORT_SYMBOL(nd_region_acquire_lane);
void nd_region_release_lane(struct nd_region *nd_region, unsigned int lane)
{
- if (nd_region->num_lanes < nr_cpu_ids) {
- unsigned int cpu = smp_processor_id();
- struct nd_percpu_lane *ndl_lock, *ndl_count;
-
- ndl_count = per_cpu_ptr(nd_region->lane, cpu);
- ndl_lock = per_cpu_ptr(nd_region->lane, lane);
- if (--ndl_count->count == 0)
- spin_unlock(&ndl_lock->lock);
- }
- migrate_enable();
+ mutex_unlock(&per_cpu_ptr(nd_region->lane, lane)->lock);
}
EXPORT_SYMBOL(nd_region_release_lane);
@@ -1023,13 +1004,8 @@ static struct nd_region *nd_region_create(struct nvdimm_bus *nvdimm_bus,
if (!nd_region->lane)
goto err_percpu;
- for (i = 0; i < nr_cpu_ids; i++) {
- struct nd_percpu_lane *ndl;
-
- ndl = per_cpu_ptr(nd_region->lane, i);
- spin_lock_init(&ndl->lock);
- ndl->count = 0;
- }
+ for (i = 0; i < nr_cpu_ids; i++)
+ mutex_init(&per_cpu_ptr(nd_region->lane, i)->lock);
for (i = 0; i < ndr_desc->num_mappings; i++) {
struct nd_mapping_desc *mapping = &ndr_desc->mapping[i];
base-commit: 254f49634ee16a731174d2ae34bc50bd5f45e731
--
2.37.3
Thread overview: 2+ messages
2026-05-14  0:23 Alison Schofield [this message]
2026-05-14 20:12 ` [PATCH v3] nvdimm/btt: Handle preemption in BTT lane acquisition Alison Schofield