* [PATCH V2 0/7] Fixes to the previously-merged drivers/dax/fsdev series
[not found] <20260522191804.79088-1-john@jagalactic.com>
@ 2026-05-22 19:18 ` John Groves
2026-05-22 19:18 ` [PATCH V2 1/7] dax: fix misleading comment about share/index union in dax_folio_reset_order() John Groves
` (6 more replies)
0 siblings, 7 replies; 22+ messages in thread
From: John Groves @ 2026-05-22 19:18 UTC (permalink / raw)
To: John Groves, Dan Williams
Cc: John Groves, Vishal Verma, Dave Jiang, Matthew Wilcox, Jan Kara,
Alexander Viro, Christian Brauner, Miklos Szeredi,
Alison Schofield, Ira Weiny, Jonathan Cameron,
nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
John Groves
From: John Groves <john@groves.net>
This series applies bug fixes (mostly found via sashiko) to the dax/fsdev
series. This has been soaking in the famfs CI pipeline for 2+ weeks and
1) won't affect anything that doesn't use drivers/dax/fsdev.c, and 2)
doesn't affect any known workloads - although the bugs would have
manifested when multi-range DCD dax devices are a thing (soon-ish).
Changes since v1:
* Dropped modes from patch 6 to fs/fuse/famfs.c and
fs/famfs/famfs_inode.c, which are not upstream so it broke
attempts to apply the series. Oops...
* Added patch 7, which addresses a previously-missed review comment
from Jonathan - minor cleanup
John Groves (7):
dax: fix misleading comment about share/index union in
dax_folio_reset_order()
dax/fsdev: fix multi-range offset, vmemmap_shift leak, and probe error
cleanup
dax/fsdev: fix kaddr for multi-range and fail probe on invalid pgmap
offset
dax/fsdev: clamp direct_access return to current physical range
dax: fix holder_ops race in fs_put_dax()
dax: replace exported dax_dev_get() with non-allocating dax_dev_find()
dax: fsdev.c minor formatting cleanup
drivers/dax/dax-private.h | 2 -
drivers/dax/fsdev.c | 104 ++++++++++++++++++++++++++------------
drivers/dax/super.c | 51 +++++++++++++++++--
fs/dax.c | 12 ++---
include/linux/dax.h | 6 ++-
5 files changed, 129 insertions(+), 46 deletions(-)
--
2.53.0
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH V2 1/7] dax: fix misleading comment about share/index union in dax_folio_reset_order()
2026-05-22 19:18 ` [PATCH V2 0/7] Fixes to the previously-merged drivers/dax/fsdev series John Groves
@ 2026-05-22 19:18 ` John Groves
2026-05-26 23:07 ` Dave Jiang
2026-05-22 19:18 ` [PATCH V2 2/7] dax/fsdev: fix multi-range offset, vmemmap_shift leak, and probe error cleanup John Groves
` (5 subsequent siblings)
6 siblings, 1 reply; 22+ messages in thread
From: John Groves @ 2026-05-22 19:18 UTC (permalink / raw)
To: John Groves, Dan Williams
Cc: John Groves, Vishal Verma, Dave Jiang, Matthew Wilcox, Jan Kara,
Alexander Viro, Christian Brauner, Miklos Szeredi,
Alison Schofield, Ira Weiny, Jonathan Cameron,
nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
John Groves
From: John Groves <John@Groves.net>
The comment in dax_folio_reset_order() claims that DAX maintains an
invariant where folio->share != 0 only when folio->mapping == NULL,
implying folio->share is zero whenever mapping is non-NULL. This is
misleading because folio->share and folio->index are a union -- for
non-shared folios with mapping != NULL, reading folio->share returns
the file page offset (folio->index), which is typically non-zero.
Reword the comment to accurately describe the union aliasing: the
assignment clears whichever interpretation of the union word is active
(index for non-shared folios, share for shared folios), which is correct
because the folio is being released in either case.
No functional change -- the code was already correct, only the
justification was wrong.
Fixes: 59eb73b98ae0b ("dax: Factor out dax_folio_reset_order() helper")
Reviewed-by: Jonathan Cameron <jic23@kernel.org>
Signed-off-by: John Groves <john@groves.net>
---
fs/dax.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/fs/dax.c b/fs/dax.c
index 6d175cd47a99b..df19c9317d10e 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -392,12 +392,12 @@ int dax_folio_reset_order(struct folio *folio)
int order = folio_order(folio);
/*
- * DAX maintains the invariant that folio->share != 0 only when
- * folio->mapping == NULL (enforced by dax_folio_make_shared()).
- * Equivalently: folio->mapping != NULL implies folio->share == 0.
- * Callers ensure share has been decremented to zero before
- * calling here, so unconditionally clearing both fields is
- * correct.
+ * Clear the mapping and the index/share union word. folio->share
+ * and folio->index occupy the same union in struct folio. For
+ * non-shared folios (mapping != NULL), the union holds folio->index
+ * (file page offset); for shared folios (mapping == NULL), it holds
+ * folio->share (reference count). Either way, we are releasing the
+ * folio and both fields should be zeroed.
*/
folio->mapping = NULL;
folio->share = 0;
--
2.53.0
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [PATCH V2 2/7] dax/fsdev: fix multi-range offset, vmemmap_shift leak, and probe error cleanup
2026-05-22 19:18 ` [PATCH V2 0/7] Fixes to the previously-merged drivers/dax/fsdev series John Groves
2026-05-22 19:18 ` [PATCH V2 1/7] dax: fix misleading comment about share/index union in dax_folio_reset_order() John Groves
@ 2026-05-22 19:18 ` John Groves
2026-05-26 23:22 ` Dave Jiang
2026-05-22 19:19 ` [PATCH V2 3/7] dax/fsdev: fix kaddr for multi-range and fail probe on invalid pgmap offset John Groves
` (4 subsequent siblings)
6 siblings, 1 reply; 22+ messages in thread
From: John Groves @ 2026-05-22 19:18 UTC (permalink / raw)
To: John Groves, Dan Williams
Cc: John Groves, Vishal Verma, Dave Jiang, Matthew Wilcox, Jan Kara,
Alexander Viro, Christian Brauner, Miklos Szeredi,
Alison Schofield, Ira Weiny, Jonathan Cameron,
nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
John Groves
From: John Groves <John@Groves.net>
Three fixes for fsdev.c:
1. Fix memory_failure offset calculation for multi-range devices.
The old code subtracted ranges[0].range.start from the faulting PFN's
physical address, which produces an incorrect (inflated) logical offset
when the PFN falls in ranges[1] or beyond due to physical gaps between
ranges. Add fsdev_pfn_to_offset() to walk the range list and compute
the correct device-linear byte offset.
2. Clear pgmap->vmemmap_shift for static DAX devices. When rebinding a
static device from device_dax (which may set vmemmap_shift based on
alignment) to fsdev_dax, the stale vmemmap_shift persists on the
shared pgmap. Explicitly zero it before devm_memremap_pages() so the
vmemmap is built for order-0 folios as fsdev requires.
3. Clear dev_dax->pgmap on probe failure for dynamic devices. After the
dynamic path sets dev_dax->pgmap, if a later probe step fails, devres
frees the devm_kzalloc'd pgmap but leaves dev_dax->pgmap dangling.
Subsequent probe attempts would hit the "dynamic-dax with pre-populated
page map" check and fail permanently. Use a goto cleanup to NULL
dev_dax->pgmap on error.
Fixes: d5406bd458b0a ("dax: add fsdev.c driver for fs-dax on character dax")
Signed-off-by: John Groves <john@groves.net>
---
drivers/dax/fsdev.c | 50 ++++++++++++++++++++++++++++++++++++---------
1 file changed, 40 insertions(+), 10 deletions(-)
diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
index 188b2526bee45..42aac7e952516 100644
--- a/drivers/dax/fsdev.c
+++ b/drivers/dax/fsdev.c
@@ -135,11 +135,26 @@ static void fsdev_clear_ops(void *data)
* The core mm code in free_zone_device_folio() handles the wake_up_var()
* directly for this memory type.
*/
+static u64 fsdev_pfn_to_offset(struct dev_dax *dev_dax, unsigned long pfn)
+{
+ phys_addr_t phys = PFN_PHYS(pfn);
+ u64 offset = 0;
+
+ for (int i = 0; i < dev_dax->nr_range; i++) {
+ struct range *range = &dev_dax->ranges[i].range;
+
+ if (phys >= range->start && phys <= range->end)
+ return offset + (phys - range->start);
+ offset += range_len(range);
+ }
+ return -1ULL;
+}
+
static int fsdev_pagemap_memory_failure(struct dev_pagemap *pgmap,
unsigned long pfn, unsigned long nr_pages, int mf_flags)
{
struct dev_dax *dev_dax = pgmap->owner;
- u64 offset = PFN_PHYS(pfn) - dev_dax->ranges[0].range.start;
+ u64 offset = fsdev_pfn_to_offset(dev_dax, pfn);
u64 len = nr_pages << PAGE_SHIFT;
return dax_holder_notify_failure(dev_dax->dax_dev, offset,
@@ -208,6 +223,7 @@ static int fsdev_dax_probe(struct dev_dax *dev_dax)
{
struct dax_device *dax_dev = dev_dax->dax_dev;
struct device *dev = &dev_dax->dev;
+ bool pgmap_allocated = false;
struct dev_pagemap *pgmap;
struct inode *inode;
u64 data_offset = 0;
@@ -222,6 +238,7 @@ static int fsdev_dax_probe(struct dev_dax *dev_dax)
}
pgmap = dev_dax->pgmap;
+ pgmap->vmemmap_shift = 0;
} else {
size_t pgmap_size;
@@ -237,6 +254,7 @@ static int fsdev_dax_probe(struct dev_dax *dev_dax)
pgmap->nr_range = dev_dax->nr_range;
dev_dax->pgmap = pgmap;
+ pgmap_allocated = true;
for (i = 0; i < dev_dax->nr_range; i++) {
struct range *range = &dev_dax->ranges[i].range;
@@ -252,7 +270,8 @@ static int fsdev_dax_probe(struct dev_dax *dev_dax)
range_len(range), dev_name(dev))) {
dev_warn(dev, "mapping%d: %#llx-%#llx could not reserve range\n",
i, range->start, range->end);
- return -EBUSY;
+ rc = -EBUSY;
+ goto err_pgmap;
}
}
@@ -272,8 +291,10 @@ static int fsdev_dax_probe(struct dev_dax *dev_dax)
pgmap->owner = dev_dax;
addr = devm_memremap_pages(dev, pgmap);
- if (IS_ERR(addr))
- return PTR_ERR(addr);
+ if (IS_ERR(addr)) {
+ rc = PTR_ERR(addr);
+ goto err_pgmap;
+ }
/*
* Clear any stale compound folio state left over from a previous
@@ -285,7 +306,7 @@ static int fsdev_dax_probe(struct dev_dax *dev_dax)
rc = devm_add_action_or_reset(dev, fsdev_clear_folio_state_action,
dev_dax);
if (rc)
- return rc;
+ goto err_pgmap;
/* Detect whether the data is at a non-zero offset into the memory */
if (pgmap->range.start != dev_dax->ranges[0].range.start) {
@@ -307,23 +328,32 @@ static int fsdev_dax_probe(struct dev_dax *dev_dax)
cdev_set_parent(cdev, &dev->kobj);
rc = cdev_add(cdev, dev->devt, 1);
if (rc)
- return rc;
+ goto err_pgmap;
rc = devm_add_action_or_reset(dev, fsdev_cdev_del, cdev);
if (rc)
- return rc;
+ goto err_pgmap;
/* Set the dax operations for fs-dax access path */
rc = dax_set_ops(dax_dev, &dev_dax_ops);
if (rc)
- return rc;
+ goto err_pgmap;
rc = devm_add_action_or_reset(dev, fsdev_clear_ops, dev_dax);
if (rc)
- return rc;
+ goto err_pgmap;
run_dax(dax_dev);
- return devm_add_action_or_reset(dev, fsdev_kill, dev_dax);
+ rc = devm_add_action_or_reset(dev, fsdev_kill, dev_dax);
+ if (rc)
+ goto err_pgmap;
+
+ return 0;
+
+err_pgmap:
+ if (pgmap_allocated)
+ dev_dax->pgmap = NULL;
+ return rc;
}
static struct dax_device_driver fsdev_dax_driver = {
--
2.53.0
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [PATCH V2 3/7] dax/fsdev: fix kaddr for multi-range and fail probe on invalid pgmap offset
2026-05-22 19:18 ` [PATCH V2 0/7] Fixes to the previously-merged drivers/dax/fsdev series John Groves
2026-05-22 19:18 ` [PATCH V2 1/7] dax: fix misleading comment about share/index union in dax_folio_reset_order() John Groves
2026-05-22 19:18 ` [PATCH V2 2/7] dax/fsdev: fix multi-range offset, vmemmap_shift leak, and probe error cleanup John Groves
@ 2026-05-22 19:19 ` John Groves
2026-05-26 23:31 ` Dave Jiang
2026-05-22 19:19 ` [PATCH V2 4/7] dax/fsdev: clamp direct_access return to current physical range John Groves
` (3 subsequent siblings)
6 siblings, 1 reply; 22+ messages in thread
From: John Groves @ 2026-05-22 19:19 UTC (permalink / raw)
To: John Groves, Dan Williams
Cc: John Groves, Vishal Verma, Dave Jiang, Matthew Wilcox, Jan Kara,
Alexander Viro, Christian Brauner, Miklos Szeredi,
Alison Schofield, Ira Weiny, Jonathan Cameron,
nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
John Groves
From: John Groves <John@Groves.net>
Two fixes for virtual address handling in fsdev:
1. Use __va(phys) instead of virt_addr + linear_offset for the kaddr
return in __fsdev_dax_direct_access(). The previous code added a
device-linear byte offset to virt_addr (which is __va of ranges[0]),
but for multi-range devices with physical gaps between ranges, this
linear arithmetic crosses the gap and produces a wrong kernel virtual
address. Using __va(phys) where phys comes from dax_pgoff_to_phys()
is correct for any range layout because the direct map translates
each physical address independently.
2. Convert the WARN_ON to a fatal error when pgmap_phys > phys. This
condition means the remapped region starts after the device's data
region, which is an impossible state. Previously the probe continued
with data_offset=0, leaving virt_addr silently misaligned. Now probe
returns -EINVAL with a diagnostic message.
Fixes: 759455848df0b ("dax: Save the kva from memremap")
Signed-off-by: John Groves <john@groves.net>
---
drivers/dax/fsdev.c | 16 +++++++++-------
1 file changed, 9 insertions(+), 7 deletions(-)
diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
index 42aac7e952516..aac0130ab2833 100644
--- a/drivers/dax/fsdev.c
+++ b/drivers/dax/fsdev.c
@@ -51,9 +51,7 @@ static long __fsdev_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
struct dev_dax *dev_dax = dax_get_private(dax_dev);
size_t size = nr_pages << PAGE_SHIFT;
size_t offset = pgoff << PAGE_SHIFT;
- void *virt_addr = dev_dax->virt_addr + offset;
phys_addr_t phys;
- unsigned long local_pfn;
phys = dax_pgoff_to_phys(dev_dax, pgoff, size);
if (phys == -1) {
@@ -63,11 +61,10 @@ static long __fsdev_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
}
if (kaddr)
- *kaddr = virt_addr;
+ *kaddr = __va(phys);
- local_pfn = PHYS_PFN(phys);
if (pfn)
- *pfn = local_pfn;
+ *pfn = PHYS_PFN(phys);
/*
* Use cached_size which was computed at probe time. The size cannot
@@ -313,8 +310,13 @@ static int fsdev_dax_probe(struct dev_dax *dev_dax)
u64 phys = dev_dax->ranges[0].range.start;
u64 pgmap_phys = dev_dax->pgmap[0].range.start;
- if (!WARN_ON(pgmap_phys > phys))
- data_offset = phys - pgmap_phys;
+ if (pgmap_phys > phys) {
+ dev_err(dev, "pgmap start %#llx exceeds data start %#llx\n",
+ pgmap_phys, phys);
+ rc = -EINVAL;
+ goto err_pgmap;
+ }
+ data_offset = phys - pgmap_phys;
pr_debug("%s: offset detected phys=%llx pgmap_phys=%llx offset=%llx\n",
__func__, phys, pgmap_phys, data_offset);
--
2.53.0
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [PATCH V2 4/7] dax/fsdev: clamp direct_access return to current physical range
2026-05-22 19:18 ` [PATCH V2 0/7] Fixes to the previously-merged drivers/dax/fsdev series John Groves
` (2 preceding siblings ...)
2026-05-22 19:19 ` [PATCH V2 3/7] dax/fsdev: fix kaddr for multi-range and fail probe on invalid pgmap offset John Groves
@ 2026-05-22 19:19 ` John Groves
2026-05-27 0:00 ` Dave Jiang
2026-05-22 19:19 ` [PATCH V2 5/7] dax: fix holder_ops race in fs_put_dax() John Groves
` (2 subsequent siblings)
6 siblings, 1 reply; 22+ messages in thread
From: John Groves @ 2026-05-22 19:19 UTC (permalink / raw)
To: John Groves, Dan Williams
Cc: John Groves, Vishal Verma, Dave Jiang, Matthew Wilcox, Jan Kara,
Alexander Viro, Christian Brauner, Miklos Szeredi,
Alison Schofield, Ira Weiny, Jonathan Cameron,
nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
John Groves
From: John Groves <John@Groves.net>
__fsdev_dax_direct_access() returned the number of available pages based
on cached_size (the total size across all ranges). For multi-range
devices with physical gaps between ranges, this over-reports the number
of physically contiguous pages available from the returned kaddr/pfn.
Callers trust this return value to mean contiguous pages, so accessing
beyond the current range boundary would hit unmapped or unrelated memory.
Fix by finding the range that contains the translated physical address
and clamping the return to the remaining pages within that range.
Also remove the now-unused cached_size field from struct dev_dax, since
it was only consumed by the old return calculation.
Fixes: 099c81a1f0ab3 ("dax: Add dax_operations for use by fs-dax on fsdev dax")
Signed-off-by: John Groves <john@groves.net>
---
drivers/dax/dax-private.h | 2 --
drivers/dax/fsdev.c | 23 ++++++++++++++---------
2 files changed, 14 insertions(+), 11 deletions(-)
diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
index 81e4af49e39c1..7a3727d76a68a 100644
--- a/drivers/dax/dax-private.h
+++ b/drivers/dax/dax-private.h
@@ -70,7 +70,6 @@ struct dev_dax_range {
* @region: parent region
* @dax_dev: core dax functionality
* @virt_addr: kva from memremap; used by fsdev_dax
- * @cached_size: size of daxdev cached by fsdev_dax
* @align: alignment of this instance
* @target_node: effective numa node if dev_dax memory range is onlined
* @dyn_id: is this a dynamic or statically created instance
@@ -86,7 +85,6 @@ struct dev_dax {
struct dax_region *region;
struct dax_device *dax_dev;
void *virt_addr;
- u64 cached_size;
unsigned int align;
int target_node;
bool dyn_id;
diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
index aac0130ab2833..f74fd1bb7f4c5 100644
--- a/drivers/dax/fsdev.c
+++ b/drivers/dax/fsdev.c
@@ -50,8 +50,8 @@ static long __fsdev_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
{
struct dev_dax *dev_dax = dax_get_private(dax_dev);
size_t size = nr_pages << PAGE_SHIFT;
- size_t offset = pgoff << PAGE_SHIFT;
phys_addr_t phys;
+ int i;
phys = dax_pgoff_to_phys(dev_dax, pgoff, size);
if (phys == -1) {
@@ -67,10 +67,20 @@ static long __fsdev_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
*pfn = PHYS_PFN(phys);
/*
- * Use cached_size which was computed at probe time. The size cannot
- * change while the driver is bound (resize returns -EBUSY).
+ * Return the number of physically contiguous pages available from
+ * phys, clamped to the current range. For multi-range devices the
+ * ranges may not be physically contiguous, so we cannot report
+ * pages beyond the end of the range that contains phys.
*/
- return PHYS_PFN(min(size, dev_dax->cached_size - offset));
+ for (i = 0; i < dev_dax->nr_range; i++) {
+ struct range *range = &dev_dax->ranges[i].range;
+
+ if (phys >= range->start && phys <= range->end)
+ return PHYS_PFN(min(size,
+ (size_t)(range->end - phys + 1)));
+ }
+
+ return -EFAULT;
}
static int fsdev_dax_zero_page_range(struct dax_device *dax_dev,
@@ -272,11 +282,6 @@ static int fsdev_dax_probe(struct dev_dax *dev_dax)
}
}
- /* Cache size now; it cannot change while driver is bound */
- dev_dax->cached_size = 0;
- for (i = 0; i < dev_dax->nr_range; i++)
- dev_dax->cached_size += range_len(&dev_dax->ranges[i].range);
-
/*
* Use MEMORY_DEVICE_FS_DAX without setting vmemmap_shift, leaving
* folios at order-0. Unlike device.c (MEMORY_DEVICE_GENERIC), this
--
2.53.0
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [PATCH V2 5/7] dax: fix holder_ops race in fs_put_dax()
2026-05-22 19:18 ` [PATCH V2 0/7] Fixes to the previously-merged drivers/dax/fsdev series John Groves
` (3 preceding siblings ...)
2026-05-22 19:19 ` [PATCH V2 4/7] dax/fsdev: clamp direct_access return to current physical range John Groves
@ 2026-05-22 19:19 ` John Groves
2026-05-27 0:16 ` Dave Jiang
2026-05-22 19:19 ` [PATCH V2 6/7] dax: replace exported dax_dev_get() with non-allocating dax_dev_find() John Groves
2026-05-22 19:19 ` [PATCH V2 7/7] dax: fsdev.c minor formatting cleanup John Groves
6 siblings, 1 reply; 22+ messages in thread
From: John Groves @ 2026-05-22 19:19 UTC (permalink / raw)
To: John Groves, Dan Williams
Cc: John Groves, Vishal Verma, Dave Jiang, Matthew Wilcox, Jan Kara,
Alexander Viro, Christian Brauner, Miklos Szeredi,
Alison Schofield, Ira Weiny, Jonathan Cameron,
nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
John Groves
From: John Groves <John@Groves.net>
Clear holder_ops before holder_data so that a concurrent fs_dax_get()
cannot have its newly installed holder_ops overwritten. Also add a
kerneldoc comment documenting that fs_put_dax() must only be called
by the current holder.
Fixes: eec38f5d86d27 ("dax: add fs_dax_get() for devdax")
Signed-off-by: John Groves <john@groves.net>
---
drivers/dax/super.c | 24 ++++++++++++++++++++++--
1 file changed, 22 insertions(+), 2 deletions(-)
diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 25cf99dd9360b..fa1d2a6eb2408 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -116,11 +116,31 @@ EXPORT_SYMBOL_GPL(fs_dax_get_by_bdev);
#if IS_ENABLED(CONFIG_FS_DAX)
+/**
+ * fs_put_dax() - release holder ownership of a dax_device
+ * @dax_dev: dax device to release (may be NULL)
+ * @holder: the holder pointer previously passed to fs_dax_get() or
+ * fs_dax_get_by_bdev(); must match exactly, as it is used
+ * in a cmpxchg to atomically release ownership
+ *
+ * Must only be called by the current holder. Clears holder_ops before
+ * holder_data to avoid a race where a concurrent fs_dax_get() could have
+ * its newly installed holder_ops overwritten.
+ */
void fs_put_dax(struct dax_device *dax_dev, void *holder)
{
- if (dax_dev && holder &&
- cmpxchg(&dax_dev->holder_data, holder, NULL) == holder)
+ if (dax_dev && holder) {
+ /*
+ * Clear holder_ops before holder_data so that a concurrent
+ * fs_dax_get() cannot have its newly installed holder_ops
+ * overwritten. holder_ops is only consulted when holder_data
+ * is non-NULL, so clearing ops first is safe — any in-flight
+ * holder_notify_failure() will see the old holder_data with
+ * NULL ops (a no-op) rather than new ops with wrong context.
+ */
dax_dev->holder_ops = NULL;
+ cmpxchg(&dax_dev->holder_data, holder, NULL);
+ }
put_dax(dax_dev);
}
EXPORT_SYMBOL_GPL(fs_put_dax);
--
2.53.0
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [PATCH V2 6/7] dax: replace exported dax_dev_get() with non-allocating dax_dev_find()
2026-05-22 19:18 ` [PATCH V2 0/7] Fixes to the previously-merged drivers/dax/fsdev series John Groves
` (4 preceding siblings ...)
2026-05-22 19:19 ` [PATCH V2 5/7] dax: fix holder_ops race in fs_put_dax() John Groves
@ 2026-05-22 19:19 ` John Groves
2026-05-27 0:28 ` Dave Jiang
2026-05-22 19:19 ` [PATCH V2 7/7] dax: fsdev.c minor formatting cleanup John Groves
6 siblings, 1 reply; 22+ messages in thread
From: John Groves @ 2026-05-22 19:19 UTC (permalink / raw)
To: John Groves, Dan Williams
Cc: John Groves, Vishal Verma, Dave Jiang, Matthew Wilcox, Jan Kara,
Alexander Viro, Christian Brauner, Miklos Szeredi,
Alison Schofield, Ira Weiny, Jonathan Cameron,
nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
John Groves
From: John Groves <John@Groves.net>
This fix is in response to a Sashiko review, and some subsequent
analysis.
dax_dev_get() uses iget5_locked() which creates a new inode if no
matching one exists. This is correct for the internal caller
(alloc_dax), but dangerous for external callers that look up devices
from user-supplied or metadata-supplied dev_t values:
1. A new inode is created with DAXDEV_ALIVE set but no backing driver,
no ops, and no IDA-allocated minor number.
2. On teardown, dax_destroy_inode() warns because kill_dax() was never
called, and dax_free_inode() calls ida_free() for a minor that was
never ida_alloc'd — potentially freeing the minor of a real device.
Add dax_dev_find() which uses ilookup5() for lookup-only semantics:
it returns an existing dax_device with an elevated inode reference, or
NULL if no device with the given dev_t exists. It never creates inodes.
Make dax_dev_get() static again (internal to super.c for alloc_dax),
export dax_dev_find() instead, and update the two external callers
(famfs_inode.c, famfs.c). Also add the missing CONFIG_DAX=n stub.
Fixes: 2ae624d5a555d ("dax: export dax_dev_get()")
Signed-off-by: John Groves <john@groves.net>
---
drivers/dax/super.c | 27 +++++++++++++++++++++++++--
include/linux/dax.h | 6 +++++-
2 files changed, 30 insertions(+), 3 deletions(-)
diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index fa1d2a6eb2408..79e5823d1010d 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -541,7 +541,7 @@ static int dax_set(struct inode *inode, void *data)
return 0;
}
-struct dax_device *dax_dev_get(dev_t devt)
+static struct dax_device *dax_dev_get(dev_t devt)
{
struct dax_device *dax_dev;
struct inode *inode;
@@ -564,7 +564,30 @@ struct dax_device *dax_dev_get(dev_t devt)
return dax_dev;
}
-EXPORT_SYMBOL_GPL(dax_dev_get);
+
+/**
+ * dax_dev_find - look up an existing dax_device by dev_t
+ * @devt: the device number to find
+ *
+ * Returns a dax_device with an elevated inode reference, or NULL if no
+ * device with the given dev_t exists. Unlike dax_dev_get(), this never
+ * allocates a new inode — it is safe for external callers that are looking
+ * up devices from user-supplied or metadata-supplied dev_t values.
+ *
+ * Caller must put_dax() the returned device when done.
+ */
+struct dax_device *dax_dev_find(dev_t devt)
+{
+ struct inode *inode;
+
+ inode = ilookup5(dax_superblock, hash_32(devt + DAXFS_MAGIC, 31),
+ dax_test, &devt);
+ if (!inode)
+ return NULL;
+
+ return to_dax_dev(inode);
+}
+EXPORT_SYMBOL_GPL(dax_dev_find);
struct dax_device *alloc_dax(void *private, const struct dax_operations *ops)
{
diff --git a/include/linux/dax.h b/include/linux/dax.h
index fe6c3ded1b50f..29113eb95e72d 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -54,7 +54,7 @@ struct dax_device *alloc_dax(void *private, const struct dax_operations *ops);
void *dax_holder(struct dax_device *dax_dev);
void put_dax(struct dax_device *dax_dev);
void kill_dax(struct dax_device *dax_dev);
-struct dax_device *dax_dev_get(dev_t devt);
+struct dax_device *dax_dev_find(dev_t devt);
void dax_write_cache(struct dax_device *dax_dev, bool wc);
bool dax_write_cache_enabled(struct dax_device *dax_dev);
bool dax_synchronous(struct dax_device *dax_dev);
@@ -92,6 +92,10 @@ static inline void put_dax(struct dax_device *dax_dev)
static inline void kill_dax(struct dax_device *dax_dev)
{
}
+static inline struct dax_device *dax_dev_find(dev_t devt)
+{
+ return NULL;
+}
static inline void dax_write_cache(struct dax_device *dax_dev, bool wc)
{
}
--
2.53.0
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [PATCH V2 7/7] dax: fsdev.c minor formatting cleanup
2026-05-22 19:18 ` [PATCH V2 0/7] Fixes to the previously-merged drivers/dax/fsdev series John Groves
` (5 preceding siblings ...)
2026-05-22 19:19 ` [PATCH V2 6/7] dax: replace exported dax_dev_get() with non-allocating dax_dev_find() John Groves
@ 2026-05-22 19:19 ` John Groves
2026-05-27 0:31 ` Dave Jiang
6 siblings, 1 reply; 22+ messages in thread
From: John Groves @ 2026-05-22 19:19 UTC (permalink / raw)
To: John Groves, Dan Williams
Cc: John Groves, Vishal Verma, Dave Jiang, Matthew Wilcox, Jan Kara,
Alexander Viro, Christian Brauner, Miklos Szeredi,
Alison Schofield, Ira Weiny, Jonathan Cameron,
nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
John Groves
From: John Groves <John@Groves.net>
Address some comments from Jonathan that were missed in the merged
series. Fix line wrapping in fsdev_dax_recovery_write() and
fsdev_dax_zero_page_range() signatures.
Fixes: 099c81a1f0ab ("dax: Add dax_operations for use by fs-dax on fsdev dax")
Signed-off-by: John Groves <john@groves.net>
---
drivers/dax/fsdev.c | 15 ++++++++-------
1 file changed, 8 insertions(+), 7 deletions(-)
diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
index f74fd1bb7f4c5..dafa24761e8ff 100644
--- a/drivers/dax/fsdev.c
+++ b/drivers/dax/fsdev.c
@@ -45,8 +45,8 @@ static void fsdev_write_dax(void *addr, struct page *page,
}
static long __fsdev_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
- long nr_pages, enum dax_access_mode mode, void **kaddr,
- unsigned long *pfn)
+ long nr_pages, enum dax_access_mode mode, void **kaddr,
+ unsigned long *pfn)
{
struct dev_dax *dev_dax = dax_get_private(dax_dev);
size_t size = nr_pages << PAGE_SHIFT;
@@ -90,7 +90,8 @@ static int fsdev_dax_zero_page_range(struct dax_device *dax_dev,
long rc;
WARN_ONCE(nr_pages > 1, "%s: nr_pages > 1\n", __func__);
- rc = __fsdev_dax_direct_access(dax_dev, pgoff, 1, DAX_ACCESS, &kaddr, NULL);
+ rc = __fsdev_dax_direct_access(dax_dev, pgoff, 1, DAX_ACCESS,
+ &kaddr, NULL);
if (rc < 0)
return rc;
fsdev_write_dax(kaddr, ZERO_PAGE(0), 0, PAGE_SIZE);
@@ -98,15 +99,15 @@ static int fsdev_dax_zero_page_range(struct dax_device *dax_dev,
}
static long fsdev_dax_direct_access(struct dax_device *dax_dev,
- pgoff_t pgoff, long nr_pages, enum dax_access_mode mode,
- void **kaddr, unsigned long *pfn)
+ pgoff_t pgoff, long nr_pages, enum dax_access_mode mode,
+ void **kaddr, unsigned long *pfn)
{
return __fsdev_dax_direct_access(dax_dev, pgoff, nr_pages, mode,
kaddr, pfn);
}
-static size_t fsdev_dax_recovery_write(struct dax_device *dax_dev, pgoff_t pgoff,
- void *addr, size_t bytes, struct iov_iter *i)
+static size_t fsdev_dax_recovery_write(struct dax_device *dax_dev,
+ pgoff_t pgoff, void *addr, size_t bytes, struct iov_iter *i)
{
return _copy_from_iter_flushcache(addr, bytes, i);
}
--
2.53.0
^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [PATCH V2 1/7] dax: fix misleading comment about share/index union in dax_folio_reset_order()
2026-05-22 19:18 ` [PATCH V2 1/7] dax: fix misleading comment about share/index union in dax_folio_reset_order() John Groves
@ 2026-05-26 23:07 ` Dave Jiang
2026-05-29 23:41 ` John Groves
0 siblings, 1 reply; 22+ messages in thread
From: Dave Jiang @ 2026-05-26 23:07 UTC (permalink / raw)
To: John Groves, John Groves, Dan Williams
Cc: John Groves, Vishal Verma, Matthew Wilcox, Jan Kara,
Alexander Viro, Christian Brauner, Miklos Szeredi,
Alison Schofield, Ira Weiny, Jonathan Cameron,
nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org
On 5/22/26 12:18 PM, John Groves wrote:
> From: John Groves <John@Groves.net>
>
> The comment in dax_folio_reset_order() claims that DAX maintains an
> invariant where folio->share != 0 only when folio->mapping == NULL,
> implying folio->share is zero whenever mapping is non-NULL. This is
> misleading because folio->share and folio->index are a union -- for
> non-shared folios with mapping != NULL, reading folio->share returns
> the file page offset (folio->index), which is typically non-zero.
>
> Reword the comment to accurately describe the union aliasing: the
> assignment clears whichever interpretation of the union word is active
> (index for non-shared folios, share for shared folios), which is correct
> because the folio is being released in either case.
>
> No functional change -- the code was already correct, only the
> justification was wrong.
>
> Fixes: 59eb73b98ae0b ("dax: Factor out dax_folio_reset_order() helper")
>
> Reviewed-by: Jonathan Cameron <jic23@kernel.org>
> Signed-off-by: John Groves <john@groves.net>
> ---
> fs/dax.c | 12 ++++++------
> 1 file changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index 6d175cd47a99b..df19c9317d10e 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -392,12 +392,12 @@ int dax_folio_reset_order(struct folio *folio)
> int order = folio_order(folio);
>
> /*
> - * DAX maintains the invariant that folio->share != 0 only when
> - * folio->mapping == NULL (enforced by dax_folio_make_shared()).
> - * Equivalently: folio->mapping != NULL implies folio->share == 0.
> - * Callers ensure share has been decremented to zero before
> - * calling here, so unconditionally clearing both fields is
> - * correct.
> + * Clear the mapping and the index/share union word. folio->share
> + * and folio->index occupy the same union in struct folio. For
> + * non-shared folios (mapping != NULL), the union holds folio->index
> + * (file page offset); for shared folios (mapping == NULL), it holds
> + * folio->share (reference count). Either way, we are releasing the
> + * folio and both fields should be zeroed.
In the old comments, there is the pre-condition that "callers ensure share has been decremented to zero before calling here." Is this precondition remain true? Maybe should leave that comment in if that is the case?
> */
> folio->mapping = NULL;
> folio->share = 0;
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH V2 2/7] dax/fsdev: fix multi-range offset, vmemmap_shift leak, and probe error cleanup
2026-05-22 19:18 ` [PATCH V2 2/7] dax/fsdev: fix multi-range offset, vmemmap_shift leak, and probe error cleanup John Groves
@ 2026-05-26 23:22 ` Dave Jiang
2026-05-29 23:59 ` John Groves
0 siblings, 1 reply; 22+ messages in thread
From: Dave Jiang @ 2026-05-26 23:22 UTC (permalink / raw)
To: John Groves, John Groves, Dan Williams
Cc: John Groves, Vishal Verma, Matthew Wilcox, Jan Kara,
Alexander Viro, Christian Brauner, Miklos Szeredi,
Alison Schofield, Ira Weiny, Jonathan Cameron,
nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org
On 5/22/26 12:18 PM, John Groves wrote:
> From: John Groves <John@Groves.net>
>
> Three fixes for fsdev.c:
>
> 1. Fix memory_failure offset calculation for multi-range devices.
> The old code subtracted ranges[0].range.start from the faulting PFN's
> physical address, which produces an incorrect (inflated) logical offset
> when the PFN falls in ranges[1] or beyond due to physical gaps between
> ranges. Add fsdev_pfn_to_offset() to walk the range list and compute
> the correct device-linear byte offset.
>
> 2. Clear pgmap->vmemmap_shift for static DAX devices. When rebinding a
> static device from device_dax (which may set vmemmap_shift based on
> alignment) to fsdev_dax, the stale vmemmap_shift persists on the
> shared pgmap. Explicitly zero it before devm_memremap_pages() so the
> vmemmap is built for order-0 folios as fsdev requires.
>
> 3. Clear dev_dax->pgmap on probe failure for dynamic devices. After the
> dynamic path sets dev_dax->pgmap, if a later probe step fails, devres
> frees the devm_kzalloc'd pgmap but leaves dev_dax->pgmap dangling.
> Subsequent probe attempts would hit the "dynamic-dax with pre-populated
> page map" check and fail permanently. Use a goto cleanup to NULL
> dev_dax->pgmap on error.
3 fixes, 3 separate patches?
DJ
>
> Fixes: d5406bd458b0a ("dax: add fsdev.c driver for fs-dax on character dax")
> Signed-off-by: John Groves <john@groves.net>
> ---
> drivers/dax/fsdev.c | 50 ++++++++++++++++++++++++++++++++++++---------
> 1 file changed, 40 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
> index 188b2526bee45..42aac7e952516 100644
> --- a/drivers/dax/fsdev.c
> +++ b/drivers/dax/fsdev.c
> @@ -135,11 +135,26 @@ static void fsdev_clear_ops(void *data)
> * The core mm code in free_zone_device_folio() handles the wake_up_var()
> * directly for this memory type.
> */
> +static u64 fsdev_pfn_to_offset(struct dev_dax *dev_dax, unsigned long pfn)
> +{
> + phys_addr_t phys = PFN_PHYS(pfn);
> + u64 offset = 0;
> +
> + for (int i = 0; i < dev_dax->nr_range; i++) {
> + struct range *range = &dev_dax->ranges[i].range;
> +
> + if (phys >= range->start && phys <= range->end)
> + return offset + (phys - range->start);
> + offset += range_len(range);
> + }
> + return -1ULL;
> +}
> +
> static int fsdev_pagemap_memory_failure(struct dev_pagemap *pgmap,
> unsigned long pfn, unsigned long nr_pages, int mf_flags)
> {
> struct dev_dax *dev_dax = pgmap->owner;
> - u64 offset = PFN_PHYS(pfn) - dev_dax->ranges[0].range.start;
> + u64 offset = fsdev_pfn_to_offset(dev_dax, pfn);
> u64 len = nr_pages << PAGE_SHIFT;
>
> return dax_holder_notify_failure(dev_dax->dax_dev, offset,
> @@ -208,6 +223,7 @@ static int fsdev_dax_probe(struct dev_dax *dev_dax)
> {
> struct dax_device *dax_dev = dev_dax->dax_dev;
> struct device *dev = &dev_dax->dev;
> + bool pgmap_allocated = false;
> struct dev_pagemap *pgmap;
> struct inode *inode;
> u64 data_offset = 0;
> @@ -222,6 +238,7 @@ static int fsdev_dax_probe(struct dev_dax *dev_dax)
> }
>
> pgmap = dev_dax->pgmap;
> + pgmap->vmemmap_shift = 0;
> } else {
> size_t pgmap_size;
>
> @@ -237,6 +254,7 @@ static int fsdev_dax_probe(struct dev_dax *dev_dax)
>
> pgmap->nr_range = dev_dax->nr_range;
> dev_dax->pgmap = pgmap;
> + pgmap_allocated = true;
>
> for (i = 0; i < dev_dax->nr_range; i++) {
> struct range *range = &dev_dax->ranges[i].range;
> @@ -252,7 +270,8 @@ static int fsdev_dax_probe(struct dev_dax *dev_dax)
> range_len(range), dev_name(dev))) {
> dev_warn(dev, "mapping%d: %#llx-%#llx could not reserve range\n",
> i, range->start, range->end);
> - return -EBUSY;
> + rc = -EBUSY;
> + goto err_pgmap;
> }
> }
>
> @@ -272,8 +291,10 @@ static int fsdev_dax_probe(struct dev_dax *dev_dax)
> pgmap->owner = dev_dax;
>
> addr = devm_memremap_pages(dev, pgmap);
> - if (IS_ERR(addr))
> - return PTR_ERR(addr);
> + if (IS_ERR(addr)) {
> + rc = PTR_ERR(addr);
> + goto err_pgmap;
> + }
>
> /*
> * Clear any stale compound folio state left over from a previous
> @@ -285,7 +306,7 @@ static int fsdev_dax_probe(struct dev_dax *dev_dax)
> rc = devm_add_action_or_reset(dev, fsdev_clear_folio_state_action,
> dev_dax);
> if (rc)
> - return rc;
> + goto err_pgmap;
>
> /* Detect whether the data is at a non-zero offset into the memory */
> if (pgmap->range.start != dev_dax->ranges[0].range.start) {
> @@ -307,23 +328,32 @@ static int fsdev_dax_probe(struct dev_dax *dev_dax)
> cdev_set_parent(cdev, &dev->kobj);
> rc = cdev_add(cdev, dev->devt, 1);
> if (rc)
> - return rc;
> + goto err_pgmap;
>
> rc = devm_add_action_or_reset(dev, fsdev_cdev_del, cdev);
> if (rc)
> - return rc;
> + goto err_pgmap;
>
> /* Set the dax operations for fs-dax access path */
> rc = dax_set_ops(dax_dev, &dev_dax_ops);
> if (rc)
> - return rc;
> + goto err_pgmap;
>
> rc = devm_add_action_or_reset(dev, fsdev_clear_ops, dev_dax);
> if (rc)
> - return rc;
> + goto err_pgmap;
>
> run_dax(dax_dev);
> - return devm_add_action_or_reset(dev, fsdev_kill, dev_dax);
> + rc = devm_add_action_or_reset(dev, fsdev_kill, dev_dax);
> + if (rc)
> + goto err_pgmap;
> +
> + return 0;
> +
> +err_pgmap:
> + if (pgmap_allocated)
> + dev_dax->pgmap = NULL;
> + return rc;
> }
>
> static struct dax_device_driver fsdev_dax_driver = {
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH V2 3/7] dax/fsdev: fix kaddr for multi-range and fail probe on invalid pgmap offset
2026-05-22 19:19 ` [PATCH V2 3/7] dax/fsdev: fix kaddr for multi-range and fail probe on invalid pgmap offset John Groves
@ 2026-05-26 23:31 ` Dave Jiang
2026-05-30 0:04 ` John Groves
0 siblings, 1 reply; 22+ messages in thread
From: Dave Jiang @ 2026-05-26 23:31 UTC (permalink / raw)
To: John Groves, John Groves, Dan Williams
Cc: John Groves, Vishal Verma, Matthew Wilcox, Jan Kara,
Alexander Viro, Christian Brauner, Miklos Szeredi,
Alison Schofield, Ira Weiny, Jonathan Cameron,
nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org
On 5/22/26 12:19 PM, John Groves wrote:
> From: John Groves <John@Groves.net>
>
> Two fixes for virtual address handling in fsdev:
>
> 1. Use __va(phys) instead of virt_addr + linear_offset for the kaddr
> return in __fsdev_dax_direct_access(). The previous code added a
> device-linear byte offset to virt_addr (which is __va of ranges[0]),
> but for multi-range devices with physical gaps between ranges, this
> linear arithmetic crosses the gap and produces a wrong kernel virtual
> address. Using __va(phys) where phys comes from dax_pgoff_to_phys()
> is correct for any range layout because the direct map translates
> each physical address independently.
>
> 2. Convert the WARN_ON to a fatal error when pgmap_phys > phys. This
> condition means the remapped region starts after the device's data
> region, which is an impossible state. Previously the probe continued
> with data_offset=0, leaving virt_addr silently misaligned. Now probe
> returns -EINVAL with a diagnostic message.
Split to 2 different patches I'd say.
DJ
>
> Fixes: 759455848df0b ("dax: Save the kva from memremap")
> Signed-off-by: John Groves <john@groves.net>
> ---
> drivers/dax/fsdev.c | 16 +++++++++-------
> 1 file changed, 9 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
> index 42aac7e952516..aac0130ab2833 100644
> --- a/drivers/dax/fsdev.c
> +++ b/drivers/dax/fsdev.c
> @@ -51,9 +51,7 @@ static long __fsdev_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
> struct dev_dax *dev_dax = dax_get_private(dax_dev);
> size_t size = nr_pages << PAGE_SHIFT;
> size_t offset = pgoff << PAGE_SHIFT;
> - void *virt_addr = dev_dax->virt_addr + offset;
> phys_addr_t phys;
> - unsigned long local_pfn;
>
> phys = dax_pgoff_to_phys(dev_dax, pgoff, size);
> if (phys == -1) {
> @@ -63,11 +61,10 @@ static long __fsdev_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
> }
>
> if (kaddr)
> - *kaddr = virt_addr;
> + *kaddr = __va(phys);
>
> - local_pfn = PHYS_PFN(phys);
> if (pfn)
> - *pfn = local_pfn;
> + *pfn = PHYS_PFN(phys);
>
> /*
> * Use cached_size which was computed at probe time. The size cannot
> @@ -313,8 +310,13 @@ static int fsdev_dax_probe(struct dev_dax *dev_dax)
> u64 phys = dev_dax->ranges[0].range.start;
> u64 pgmap_phys = dev_dax->pgmap[0].range.start;
>
> - if (!WARN_ON(pgmap_phys > phys))
> - data_offset = phys - pgmap_phys;
> + if (pgmap_phys > phys) {
> + dev_err(dev, "pgmap start %#llx exceeds data start %#llx\n",
> + pgmap_phys, phys);
> + rc = -EINVAL;
> + goto err_pgmap;
> + }
> + data_offset = phys - pgmap_phys;
>
> pr_debug("%s: offset detected phys=%llx pgmap_phys=%llx offset=%llx\n",
> __func__, phys, pgmap_phys, data_offset);
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH V2 4/7] dax/fsdev: clamp direct_access return to current physical range
2026-05-22 19:19 ` [PATCH V2 4/7] dax/fsdev: clamp direct_access return to current physical range John Groves
@ 2026-05-27 0:00 ` Dave Jiang
2026-05-30 13:06 ` John Groves
0 siblings, 1 reply; 22+ messages in thread
From: Dave Jiang @ 2026-05-27 0:00 UTC (permalink / raw)
To: John Groves, John Groves, Dan Williams
Cc: John Groves, Vishal Verma, Matthew Wilcox, Jan Kara,
Alexander Viro, Christian Brauner, Miklos Szeredi,
Alison Schofield, Ira Weiny, Jonathan Cameron,
nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org
On 5/22/26 12:19 PM, John Groves wrote:
> From: John Groves <John@Groves.net>
>
> __fsdev_dax_direct_access() returned the number of available pages based
> on cached_size (the total size across all ranges). For multi-range
> devices with physical gaps between ranges, this over-reports the number
> of physically contiguous pages available from the returned kaddr/pfn.
> Callers trust this return value to mean contiguous pages, so accessing
> beyond the current range boundary would hit unmapped or unrelated memory.
>
> Fix by finding the range that contains the translated physical address
> and clamping the return to the remaining pages within that range.
>
> Also remove the now-unused cached_size field from struct dev_dax, since
> it was only consumed by the old return calculation.
>
> Fixes: 099c81a1f0ab3 ("dax: Add dax_operations for use by fs-dax on fsdev dax")
> Signed-off-by: John Groves <john@groves.net>
I ran this through Claude and this is what it came back with and it looks reasonable to me:
The claimed bug does not manifest in the current tree. This is a
refactor + dead-field removal dressed as a bugfix. Either the justification is
wrong or there's a missing companion change.
The claim doesn't hold up
Commit message:
▎ __fsdev_dax_direct_access() returned the number of available pages based on
▎ cached_size (the total size across all ranges). For multi-range devices with
▎ physical gaps between ranges, this over-reports the number of physically
▎ contiguous pages available from the returned kaddr/pfn.
Walk through it. __fsdev_dax_direct_access does:
phys = dax_pgoff_to_phys(dev_dax, pgoff, size);
if (phys == -1)
return -EFAULT;
...
return PHYS_PFN(min(size, dev_dax->cached_size - offset));
Now look at what dax_pgoff_to_phys (drivers/dax/bus.c:1433) enforces:
phys = PFN_PHYS(pgoff - dax_range->pgoff) + range->start;
if (phys + size - 1 <= range->end)
return phys;
break; /* falls through to return -1 */
It returns valid phys only when the entire requested size fits inside a single
range. If the request straddles a gap, it returns -1 and the caller bails
with -EFAULT before ever reaching the clamp.
So in the only path where the clamp runs, we already know size <= range->end -
phys + 1, which means cached_size - offset >= size, which means min(size,
cached_size - offset) == size. Pre-patch returns nr_pages. Post-patch returns
nr_pages. Same answer.
I worked it through with concrete numbers on a 2-range device with a physical
gap (range[0]=4 pages at 0x1000_0000, range[1]=4 pages at 0x2000_0000):
- pgoff=2, nr_pages=2: both return 2. ✓
- pgoff=3, nr_pages=2 (straddles gap): both return -EFAULT via the early bail.
✓
- pgoff=4, nr_pages=2: both return 2. ✓
I cannot construct a case where pre-patch over-reports.
What's actually changed
1. A no-op clamp loop replaces a constant calculation. Same return value in
every reachable case.
2. The trailing return -EFAULT; is dead code. If dax_pgoff_to_phys returned
valid phys, then by construction phys lies inside some range in
dev_dax->ranges[], so the loop will match.
3. O(N) walk replaces O(1) arithmetic. Trivial for small nr_range, but it's a
regression in the wrong direction for a hot path.
4. cached_size field is removed. Good cleanup — I grepped, no other consumers.
DJ
> ---
> drivers/dax/dax-private.h | 2 --
> drivers/dax/fsdev.c | 23 ++++++++++++++---------
> 2 files changed, 14 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
> index 81e4af49e39c1..7a3727d76a68a 100644
> --- a/drivers/dax/dax-private.h
> +++ b/drivers/dax/dax-private.h
> @@ -70,7 +70,6 @@ struct dev_dax_range {
> * @region: parent region
> * @dax_dev: core dax functionality
> * @virt_addr: kva from memremap; used by fsdev_dax
> - * @cached_size: size of daxdev cached by fsdev_dax
> * @align: alignment of this instance
> * @target_node: effective numa node if dev_dax memory range is onlined
> * @dyn_id: is this a dynamic or statically created instance
> @@ -86,7 +85,6 @@ struct dev_dax {
> struct dax_region *region;
> struct dax_device *dax_dev;
> void *virt_addr;
> - u64 cached_size;
> unsigned int align;
> int target_node;
> bool dyn_id;
> diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
> index aac0130ab2833..f74fd1bb7f4c5 100644
> --- a/drivers/dax/fsdev.c
> +++ b/drivers/dax/fsdev.c
> @@ -50,8 +50,8 @@ static long __fsdev_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
> {
> struct dev_dax *dev_dax = dax_get_private(dax_dev);
> size_t size = nr_pages << PAGE_SHIFT;
> - size_t offset = pgoff << PAGE_SHIFT;
> phys_addr_t phys;
> + int i;
>
> phys = dax_pgoff_to_phys(dev_dax, pgoff, size);
> if (phys == -1) {
> @@ -67,10 +67,20 @@ static long __fsdev_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
> *pfn = PHYS_PFN(phys);
>
> /*
> - * Use cached_size which was computed at probe time. The size cannot
> - * change while the driver is bound (resize returns -EBUSY).
> + * Return the number of physically contiguous pages available from
> + * phys, clamped to the current range. For multi-range devices the
> + * ranges may not be physically contiguous, so we cannot report
> + * pages beyond the end of the range that contains phys.
> */
> - return PHYS_PFN(min(size, dev_dax->cached_size - offset));
> + for (i = 0; i < dev_dax->nr_range; i++) {
> + struct range *range = &dev_dax->ranges[i].range;
> +
> + if (phys >= range->start && phys <= range->end)
> + return PHYS_PFN(min(size,
> + (size_t)(range->end - phys + 1)));
> + }
> +
> + return -EFAULT;
> }
>
> static int fsdev_dax_zero_page_range(struct dax_device *dax_dev,
> @@ -272,11 +282,6 @@ static int fsdev_dax_probe(struct dev_dax *dev_dax)
> }
> }
>
> - /* Cache size now; it cannot change while driver is bound */
> - dev_dax->cached_size = 0;
> - for (i = 0; i < dev_dax->nr_range; i++)
> - dev_dax->cached_size += range_len(&dev_dax->ranges[i].range);
> -
> /*
> * Use MEMORY_DEVICE_FS_DAX without setting vmemmap_shift, leaving
> * folios at order-0. Unlike device.c (MEMORY_DEVICE_GENERIC), this
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH V2 5/7] dax: fix holder_ops race in fs_put_dax()
2026-05-22 19:19 ` [PATCH V2 5/7] dax: fix holder_ops race in fs_put_dax() John Groves
@ 2026-05-27 0:16 ` Dave Jiang
2026-05-30 14:02 ` John Groves
0 siblings, 1 reply; 22+ messages in thread
From: Dave Jiang @ 2026-05-27 0:16 UTC (permalink / raw)
To: John Groves, John Groves, Dan Williams
Cc: John Groves, Vishal Verma, Matthew Wilcox, Jan Kara,
Alexander Viro, Christian Brauner, Miklos Szeredi,
Alison Schofield, Ira Weiny, Jonathan Cameron,
nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org
On 5/22/26 12:19 PM, John Groves wrote:
> From: John Groves <John@Groves.net>
>
> Clear holder_ops before holder_data so that a concurrent fs_dax_get()
> cannot have its newly installed holder_ops overwritten. Also add a
> kerneldoc comment documenting that fs_put_dax() must only be called
> by the current holder.
>
> Fixes: eec38f5d86d27 ("dax: add fs_dax_get() for devdax")
> Signed-off-by: John Groves <john@groves.net>
Couple things from Claude that may be worth taking a look at:
1. Memory ordering is now load-bearing and missing
The whole correctness argument depends on the reader observing holder_ops =
NULL before observing holder_data = NULL. The patch uses a plain store
followed by cmpxchg. On x86 plain stores are ordered, but on arm64/ppc they
are not — the reader can observe cmpxchg's release of holder_data while still
seeing the old holder_ops. That puts us back in the dangerous (holder_data ==
NULL, holder_ops == old) state on weakly-ordered arches.
Required:
smp_store_release(&dax_dev->holder_ops, NULL); /* publish ops=NULL first */
cmpxchg(&dax_dev->holder_data, holder, NULL); /* then release holder_data
*/
And the reader in dax_holder_notify_failure should use
smp_load_acquire/READ_ONCE because today it reads dax_dev->holder_ops twice
(line 334 and line 339), allowing tearing or stale-cache reads. Pre-existing
weakness, but this patch is what makes the ordering matter.
kill_dax (line 461-462) has the same naked-store pattern — it should be made
consistent.
2. Unconditional holder_ops = NULL is a behavior regression
Pre-patch was defensive: if a caller passed the wrong holder, the cmpxchg
failed and nothing got cleared.
Post-patch clears holder_ops unconditionally whenever dax_dev && holder is
truthy. A wrong-holder fs_put_dax() now actively damages the legitimate
holder's state — sets holder_ops to NULL while holder_data retains the
legitimate holder's pointer. From that point, all dax_holder_notify_failure()
calls return -EOPNOTSUPP, silently breaking the legitimate holder's
poison-recovery path.
DJ
> ---
> drivers/dax/super.c | 24 ++++++++++++++++++++++--
> 1 file changed, 22 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index 25cf99dd9360b..fa1d2a6eb2408 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -116,11 +116,31 @@ EXPORT_SYMBOL_GPL(fs_dax_get_by_bdev);
>
> #if IS_ENABLED(CONFIG_FS_DAX)
>
> +/**
> + * fs_put_dax() - release holder ownership of a dax_device
> + * @dax_dev: dax device to release (may be NULL)
> + * @holder: the holder pointer previously passed to fs_dax_get() or
> + * fs_dax_get_by_bdev(); must match exactly, as it is used
> + * in a cmpxchg to atomically release ownership
> + *
> + * Must only be called by the current holder. Clears holder_ops before
> + * holder_data to avoid a race where a concurrent fs_dax_get() could have
> + * its newly installed holder_ops overwritten.
> + */
> void fs_put_dax(struct dax_device *dax_dev, void *holder)
> {
> - if (dax_dev && holder &&
> - cmpxchg(&dax_dev->holder_data, holder, NULL) == holder)
> + if (dax_dev && holder) {
> + /*
> + * Clear holder_ops before holder_data so that a concurrent
> + * fs_dax_get() cannot have its newly installed holder_ops
> + * overwritten. holder_ops is only consulted when holder_data
> + * is non-NULL, so clearing ops first is safe — any in-flight
> + * holder_notify_failure() will see the old holder_data with
> + * NULL ops (a no-op) rather than new ops with wrong context.
> + */
> dax_dev->holder_ops = NULL;
> + cmpxchg(&dax_dev->holder_data, holder, NULL);
> + }
> put_dax(dax_dev);
> }
> EXPORT_SYMBOL_GPL(fs_put_dax);
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH V2 6/7] dax: replace exported dax_dev_get() with non-allocating dax_dev_find()
2026-05-22 19:19 ` [PATCH V2 6/7] dax: replace exported dax_dev_get() with non-allocating dax_dev_find() John Groves
@ 2026-05-27 0:28 ` Dave Jiang
2026-05-30 14:19 ` John Groves
0 siblings, 1 reply; 22+ messages in thread
From: Dave Jiang @ 2026-05-27 0:28 UTC (permalink / raw)
To: John Groves, John Groves, Dan Williams
Cc: John Groves, Vishal Verma, Matthew Wilcox, Jan Kara,
Alexander Viro, Christian Brauner, Miklos Szeredi,
Alison Schofield, Ira Weiny, Jonathan Cameron,
nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org
On 5/22/26 12:19 PM, John Groves wrote:
> From: John Groves <John@Groves.net>
>
> This fix is in response to a Sashiko review, and some subsequent
> analysis.
>
> dax_dev_get() uses iget5_locked() which creates a new inode if no
> matching one exists. This is correct for the internal caller
> (alloc_dax), but dangerous for external callers that look up devices
> from user-supplied or metadata-supplied dev_t values:
>
> 1. A new inode is created with DAXDEV_ALIVE set but no backing driver,
> no ops, and no IDA-allocated minor number.
>
> 2. On teardown, dax_destroy_inode() warns because kill_dax() was never
> called, and dax_free_inode() calls ida_free() for a minor that was
> never ida_alloc'd — potentially freeing the minor of a real device.
>
> Add dax_dev_find() which uses ilookup5() for lookup-only semantics:
> it returns an existing dax_device with an elevated inode reference, or
> NULL if no device with the given dev_t exists. It never creates inodes.
>
> Make dax_dev_get() static again (internal to super.c for alloc_dax),
> export dax_dev_find() instead, and update the two external callers
> (famfs_inode.c, famfs.c). Also add the missing CONFIG_DAX=n stub.
>
> Fixes: 2ae624d5a555d ("dax: export dax_dev_get()")
> Signed-off-by: John Groves <john@groves.net>
> ---
> drivers/dax/super.c | 27 +++++++++++++++++++++++++--
> include/linux/dax.h | 6 +++++-
> 2 files changed, 30 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index fa1d2a6eb2408..79e5823d1010d 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -541,7 +541,7 @@ static int dax_set(struct inode *inode, void *data)
> return 0;
> }
>
> -struct dax_device *dax_dev_get(dev_t devt)
> +static struct dax_device *dax_dev_get(dev_t devt)
> {
> struct dax_device *dax_dev;
> struct inode *inode;
> @@ -564,7 +564,30 @@ struct dax_device *dax_dev_get(dev_t devt)
>
> return dax_dev;
> }
> -EXPORT_SYMBOL_GPL(dax_dev_get);
> +
> +/**
> + * dax_dev_find - look up an existing dax_device by dev_t
> + * @devt: the device number to find
> + *
> + * Returns a dax_device with an elevated inode reference, or NULL if no
> + * device with the given dev_t exists. Unlike dax_dev_get(), this never
> + * allocates a new inode — it is safe for external callers that are looking
> + * up devices from user-supplied or metadata-supplied dev_t values.
> + *
> + * Caller must put_dax() the returned device when done.
> + */
> +struct dax_device *dax_dev_find(dev_t devt)
> +{
> + struct inode *inode;
> +
> + inode = ilookup5(dax_superblock, hash_32(devt + DAXFS_MAGIC, 31),
> + dax_test, &devt);
> + if (!inode)
> + return NULL;
> +
Claude mentions that dax_alive() check may be a good idea after grabbing the inode ref. Otherwise famfs may get a dax_dev that may be racing with a teardown. Do something similar that fs_dax_get_by_dev() or fs_dax_get() do WRT dax_alive() check perhaps.
DJ
> + return to_dax_dev(inode);
> +}
> +EXPORT_SYMBOL_GPL(dax_dev_find);
>
> struct dax_device *alloc_dax(void *private, const struct dax_operations *ops)
> {
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index fe6c3ded1b50f..29113eb95e72d 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -54,7 +54,7 @@ struct dax_device *alloc_dax(void *private, const struct dax_operations *ops);
> void *dax_holder(struct dax_device *dax_dev);
> void put_dax(struct dax_device *dax_dev);
> void kill_dax(struct dax_device *dax_dev);
> -struct dax_device *dax_dev_get(dev_t devt);
> +struct dax_device *dax_dev_find(dev_t devt);
> void dax_write_cache(struct dax_device *dax_dev, bool wc);
> bool dax_write_cache_enabled(struct dax_device *dax_dev);
> bool dax_synchronous(struct dax_device *dax_dev);
> @@ -92,6 +92,10 @@ static inline void put_dax(struct dax_device *dax_dev)
> static inline void kill_dax(struct dax_device *dax_dev)
> {
> }
> +static inline struct dax_device *dax_dev_find(dev_t devt)
> +{
> + return NULL;
> +}
> static inline void dax_write_cache(struct dax_device *dax_dev, bool wc)
> {
> }
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH V2 7/7] dax: fsdev.c minor formatting cleanup
2026-05-22 19:19 ` [PATCH V2 7/7] dax: fsdev.c minor formatting cleanup John Groves
@ 2026-05-27 0:31 ` Dave Jiang
0 siblings, 0 replies; 22+ messages in thread
From: Dave Jiang @ 2026-05-27 0:31 UTC (permalink / raw)
To: John Groves, John Groves, Dan Williams
Cc: John Groves, Vishal Verma, Matthew Wilcox, Jan Kara,
Alexander Viro, Christian Brauner, Miklos Szeredi,
Alison Schofield, Ira Weiny, Jonathan Cameron,
nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org
On 5/22/26 12:19 PM, John Groves wrote:
> From: John Groves <John@Groves.net>
>
> Address some comments from Jonathan that were missed in the merged
> series. Fix line wrapping in fsdev_dax_recovery_write() and
> fsdev_dax_zero_page_range() signatures.
>
> Fixes: 099c81a1f0ab ("dax: Add dax_operations for use by fs-dax on fsdev dax")
No need for fixes tag when addressing white space and text formatting issues. Not a bug fix.
> Signed-off-by: John Groves <john@groves.net>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> ---
> drivers/dax/fsdev.c | 15 ++++++++-------
> 1 file changed, 8 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
> index f74fd1bb7f4c5..dafa24761e8ff 100644
> --- a/drivers/dax/fsdev.c
> +++ b/drivers/dax/fsdev.c
> @@ -45,8 +45,8 @@ static void fsdev_write_dax(void *addr, struct page *page,
> }
>
> static long __fsdev_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
> - long nr_pages, enum dax_access_mode mode, void **kaddr,
> - unsigned long *pfn)
> + long nr_pages, enum dax_access_mode mode, void **kaddr,
> + unsigned long *pfn)
> {
> struct dev_dax *dev_dax = dax_get_private(dax_dev);
> size_t size = nr_pages << PAGE_SHIFT;
> @@ -90,7 +90,8 @@ static int fsdev_dax_zero_page_range(struct dax_device *dax_dev,
> long rc;
>
> WARN_ONCE(nr_pages > 1, "%s: nr_pages > 1\n", __func__);
> - rc = __fsdev_dax_direct_access(dax_dev, pgoff, 1, DAX_ACCESS, &kaddr, NULL);
> + rc = __fsdev_dax_direct_access(dax_dev, pgoff, 1, DAX_ACCESS,
> + &kaddr, NULL);
> if (rc < 0)
> return rc;
> fsdev_write_dax(kaddr, ZERO_PAGE(0), 0, PAGE_SIZE);
> @@ -98,15 +99,15 @@ static int fsdev_dax_zero_page_range(struct dax_device *dax_dev,
> }
>
> static long fsdev_dax_direct_access(struct dax_device *dax_dev,
> - pgoff_t pgoff, long nr_pages, enum dax_access_mode mode,
> - void **kaddr, unsigned long *pfn)
> + pgoff_t pgoff, long nr_pages, enum dax_access_mode mode,
> + void **kaddr, unsigned long *pfn)
> {
> return __fsdev_dax_direct_access(dax_dev, pgoff, nr_pages, mode,
> kaddr, pfn);
> }
>
> -static size_t fsdev_dax_recovery_write(struct dax_device *dax_dev, pgoff_t pgoff,
> - void *addr, size_t bytes, struct iov_iter *i)
> +static size_t fsdev_dax_recovery_write(struct dax_device *dax_dev,
> + pgoff_t pgoff, void *addr, size_t bytes, struct iov_iter *i)
> {
> return _copy_from_iter_flushcache(addr, bytes, i);
> }
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH V2 1/7] dax: fix misleading comment about share/index union in dax_folio_reset_order()
2026-05-26 23:07 ` Dave Jiang
@ 2026-05-29 23:41 ` John Groves
0 siblings, 0 replies; 22+ messages in thread
From: John Groves @ 2026-05-29 23:41 UTC (permalink / raw)
To: Dave Jiang
Cc: John Groves, Dan Williams, John Groves, Vishal Verma,
Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
Miklos Szeredi, Alison Schofield, Ira Weiny, Jonathan Cameron,
nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org
On 26/05/26 04:07PM, Dave Jiang wrote:
>
>
> On 5/22/26 12:18 PM, John Groves wrote:
> > From: John Groves <John@Groves.net>
> >
> > The comment in dax_folio_reset_order() claims that DAX maintains an
> > invariant where folio->share != 0 only when folio->mapping == NULL,
> > implying folio->share is zero whenever mapping is non-NULL. This is
> > misleading because folio->share and folio->index are a union -- for
> > non-shared folios with mapping != NULL, reading folio->share returns
> > the file page offset (folio->index), which is typically non-zero.
> >
> > Reword the comment to accurately describe the union aliasing: the
> > assignment clears whichever interpretation of the union word is active
> > (index for non-shared folios, share for shared folios), which is correct
> > because the folio is being released in either case.
> >
> > No functional change -- the code was already correct, only the
> > justification was wrong.
> >
> > Fixes: 59eb73b98ae0b ("dax: Factor out dax_folio_reset_order() helper")
> >
> > Reviewed-by: Jonathan Cameron <jic23@kernel.org>
> > Signed-off-by: John Groves <john@groves.net>
> > ---
> > fs/dax.c | 12 ++++++------
> > 1 file changed, 6 insertions(+), 6 deletions(-)
> >
> > diff --git a/fs/dax.c b/fs/dax.c
> > index 6d175cd47a99b..df19c9317d10e 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -392,12 +392,12 @@ int dax_folio_reset_order(struct folio *folio)
> > int order = folio_order(folio);
> >
> > /*
> > - * DAX maintains the invariant that folio->share != 0 only when
> > - * folio->mapping == NULL (enforced by dax_folio_make_shared()).
> > - * Equivalently: folio->mapping != NULL implies folio->share == 0.
> > - * Callers ensure share has been decremented to zero before
> > - * calling here, so unconditionally clearing both fields is
> > - * correct.
> > + * Clear the mapping and the index/share union word. folio->share
> > + * and folio->index occupy the same union in struct folio. For
> > + * non-shared folios (mapping != NULL), the union holds folio->index
> > + * (file page offset); for shared folios (mapping == NULL), it holds
> > + * folio->share (reference count). Either way, we are releasing the
> > + * folio and both fields should be zeroed.
>
> In the old comments, there is the pre-condition that "callers ensure share has been decremented to zero before calling here." Is this precondition remain true? Maybe should leave that comment in if that is the case?
That precondition is no longer universally true.
fsdev_clear_folio_state() calls this at probe time on folios that may be in
arbitrary state. The new wording justifies the zeroing correctly for all
callers: we are releasing the folio so both the mapping and the union word
(whether it holds an index or a share count) must be cleared.
So I think this is right...
Thanks for the review Dave!
John
<snip>
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH V2 2/7] dax/fsdev: fix multi-range offset, vmemmap_shift leak, and probe error cleanup
2026-05-26 23:22 ` Dave Jiang
@ 2026-05-29 23:59 ` John Groves
0 siblings, 0 replies; 22+ messages in thread
From: John Groves @ 2026-05-29 23:59 UTC (permalink / raw)
To: Dave Jiang
Cc: John Groves, Dan Williams, John Groves, Vishal Verma,
Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
Miklos Szeredi, Alison Schofield, Ira Weiny, Jonathan Cameron,
nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org
On 26/05/26 04:22PM, Dave Jiang wrote:
>
>
> On 5/22/26 12:18 PM, John Groves wrote:
> > From: John Groves <John@Groves.net>
> >
> > Three fixes for fsdev.c:
> >
> > 1. Fix memory_failure offset calculation for multi-range devices.
> > The old code subtracted ranges[0].range.start from the faulting PFN's
> > physical address, which produces an incorrect (inflated) logical offset
> > when the PFN falls in ranges[1] or beyond due to physical gaps between
> > ranges. Add fsdev_pfn_to_offset() to walk the range list and compute
> > the correct device-linear byte offset.
> >
> > 2. Clear pgmap->vmemmap_shift for static DAX devices. When rebinding a
> > static device from device_dax (which may set vmemmap_shift based on
> > alignment) to fsdev_dax, the stale vmemmap_shift persists on the
> > shared pgmap. Explicitly zero it before devm_memremap_pages() so the
> > vmemmap is built for order-0 folios as fsdev requires.
> >
> > 3. Clear dev_dax->pgmap on probe failure for dynamic devices. After the
> > dynamic path sets dev_dax->pgmap, if a later probe step fails, devres
> > frees the devm_kzalloc'd pgmap but leaves dev_dax->pgmap dangling.
> > Subsequent probe attempts would hit the "dynamic-dax with pre-populated
> > page map" check and fail permanently. Use a goto cleanup to NULL
> > dev_dax->pgmap on error.
>
> 3 fixes, 3 separate patches?
>
> DJ
Yes, thanks. Done.
John
<snip>
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH V2 3/7] dax/fsdev: fix kaddr for multi-range and fail probe on invalid pgmap offset
2026-05-26 23:31 ` Dave Jiang
@ 2026-05-30 0:04 ` John Groves
0 siblings, 0 replies; 22+ messages in thread
From: John Groves @ 2026-05-30 0:04 UTC (permalink / raw)
To: Dave Jiang
Cc: John Groves, Dan Williams, John Groves, Vishal Verma,
Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
Miklos Szeredi, Alison Schofield, Ira Weiny, Jonathan Cameron,
nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org
On 26/05/26 04:31PM, Dave Jiang wrote:
>
>
> On 5/22/26 12:19 PM, John Groves wrote:
> > From: John Groves <John@Groves.net>
> >
> > Two fixes for virtual address handling in fsdev:
> >
> > 1. Use __va(phys) instead of virt_addr + linear_offset for the kaddr
> > return in __fsdev_dax_direct_access(). The previous code added a
> > device-linear byte offset to virt_addr (which is __va of ranges[0]),
> > but for multi-range devices with physical gaps between ranges, this
> > linear arithmetic crosses the gap and produces a wrong kernel virtual
> > address. Using __va(phys) where phys comes from dax_pgoff_to_phys()
> > is correct for any range layout because the direct map translates
> > each physical address independently.
> >
> > 2. Convert the WARN_ON to a fatal error when pgmap_phys > phys. This
> > condition means the remapped region starts after the device's data
> > region, which is an impossible state. Previously the probe continued
> > with data_offset=0, leaving virt_addr silently misaligned. Now probe
> > returns -EINVAL with a diagnostic message.
>
> Split to 2 different patches I'd say.
>
> DJ
Agree - done. Thanks!
John
<snip>
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH V2 4/7] dax/fsdev: clamp direct_access return to current physical range
2026-05-27 0:00 ` Dave Jiang
@ 2026-05-30 13:06 ` John Groves
0 siblings, 0 replies; 22+ messages in thread
From: John Groves @ 2026-05-30 13:06 UTC (permalink / raw)
To: Dave Jiang
Cc: John Groves, Dan Williams, John Groves, Vishal Verma,
Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
Miklos Szeredi, Alison Schofield, Ira Weiny, Jonathan Cameron,
nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org
On 26/05/26 05:00PM, Dave Jiang wrote:
>
>
> On 5/22/26 12:19 PM, John Groves wrote:
> > From: John Groves <John@Groves.net>
> >
> > __fsdev_dax_direct_access() returned the number of available pages based
> > on cached_size (the total size across all ranges). For multi-range
> > devices with physical gaps between ranges, this over-reports the number
> > of physically contiguous pages available from the returned kaddr/pfn.
> > Callers trust this return value to mean contiguous pages, so accessing
> > beyond the current range boundary would hit unmapped or unrelated memory.
> >
> > Fix by finding the range that contains the translated physical address
> > and clamping the return to the remaining pages within that range.
> >
> > Also remove the now-unused cached_size field from struct dev_dax, since
> > it was only consumed by the old return calculation.
> >
> > Fixes: 099c81a1f0ab3 ("dax: Add dax_operations for use by fs-dax on fsdev dax")
> > Signed-off-by: John Groves <john@groves.net>
>
> I ran this through Claude and this is what it came back with and it looks reasonable to me:
>
> The claimed bug does not manifest in the current tree. This is a
> refactor + dead-field removal dressed as a bugfix. Either the justification is
> wrong or there's a missing companion change.
>
> The claim doesn't hold up
>
> Commit message:
>
> ▎ __fsdev_dax_direct_access() returned the number of available pages based on
> ▎ cached_size (the total size across all ranges). For multi-range devices with
> ▎ physical gaps between ranges, this over-reports the number of physically
> ▎ contiguous pages available from the returned kaddr/pfn.
>
> Walk through it. __fsdev_dax_direct_access does:
>
> phys = dax_pgoff_to_phys(dev_dax, pgoff, size);
> if (phys == -1)
> return -EFAULT;
> ...
> return PHYS_PFN(min(size, dev_dax->cached_size - offset));
>
> Now look at what dax_pgoff_to_phys (drivers/dax/bus.c:1433) enforces:
>
> phys = PFN_PHYS(pgoff - dax_range->pgoff) + range->start;
> if (phys + size - 1 <= range->end)
> return phys;
> break; /* falls through to return -1 */
>
> It returns valid phys only when the entire requested size fits inside a single
> range. If the request straddles a gap, it returns -1 and the caller bails
> with -EFAULT before ever reaching the clamp.
>
> So in the only path where the clamp runs, we already know size <= range->end -
> phys + 1, which means cached_size - offset >= size, which means min(size,
> cached_size - offset) == size. Pre-patch returns nr_pages. Post-patch returns
> nr_pages. Same answer.
>
> I worked it through with concrete numbers on a 2-range device with a physical
> gap (range[0]=4 pages at 0x1000_0000, range[1]=4 pages at 0x2000_0000):
> - pgoff=2, nr_pages=2: both return 2. ✓
> - pgoff=3, nr_pages=2 (straddles gap): both return -EFAULT via the early bail.
> ✓
> - pgoff=4, nr_pages=2: both return 2. ✓
>
> I cannot construct a case where pre-patch over-reports.
>
> What's actually changed
>
> 1. A no-op clamp loop replaces a constant calculation. Same return value in
> every reachable case.
> 2. The trailing return -EFAULT; is dead code. If dax_pgoff_to_phys returned
> valid phys, then by construction phys lies inside some range in
> dev_dax->ranges[], so the loop will match.
> 3. O(N) walk replaces O(1) arithmetic. Trivial for small nr_range, but it's a
> regression in the wrong direction for a hot path.
> 4. cached_size field is removed. Good cleanup — I grepped, no other consumers.
>
>
> DJ
Thanks Dave. Sashiko reviews can be great, but sometimes they make us (me)
dumber.
Dropping this patch from the series.
John
<snip>
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH V2 5/7] dax: fix holder_ops race in fs_put_dax()
2026-05-27 0:16 ` Dave Jiang
@ 2026-05-30 14:02 ` John Groves
2026-05-30 14:32 ` John Groves
0 siblings, 1 reply; 22+ messages in thread
From: John Groves @ 2026-05-30 14:02 UTC (permalink / raw)
To: Dave Jiang
Cc: John Groves, Dan Williams, John Groves, Vishal Verma,
Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
Miklos Szeredi, Alison Schofield, Ira Weiny, Jonathan Cameron,
nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org
On 26/05/26 05:16PM, Dave Jiang wrote:
>
>
> On 5/22/26 12:19 PM, John Groves wrote:
> > From: John Groves <John@Groves.net>
> >
> > Clear holder_ops before holder_data so that a concurrent fs_dax_get()
> > cannot have its newly installed holder_ops overwritten. Also add a
> > kerneldoc comment documenting that fs_put_dax() must only be called
> > by the current holder.
> >
> > Fixes: eec38f5d86d27 ("dax: add fs_dax_get() for devdax")
> > Signed-off-by: John Groves <john@groves.net>
>
> Couple things from Claude that may be worth taking a look at:
>
> 1. Memory ordering is now load-bearing and missing
>
> The whole correctness argument depends on the reader observing holder_ops =
> NULL before observing holder_data = NULL. The patch uses a plain store
> followed by cmpxchg. On x86 plain stores are ordered, but on arm64/ppc they
> are not — the reader can observe cmpxchg's release of holder_data while still
> seeing the old holder_ops. That puts us back in the dangerous (holder_data ==
> NULL, holder_ops == old) state on weakly-ordered arches.
>
> Required:
>
> smp_store_release(&dax_dev->holder_ops, NULL); /* publish ops=NULL first */
> cmpxchg(&dax_dev->holder_data, holder, NULL); /* then release holder_data
> */
Updating to WRITE_ONCE(), which I think is the right choice
>
> And the reader in dax_holder_notify_failure should use
> smp_load_acquire/READ_ONCE because today it reads dax_dev->holder_ops twice
> (line 334 and line 339), allowing tearing or stale-cache reads. Pre-existing
> weakness, but this patch is what makes the ordering matter.
>
> kill_dax (line 461-462) has the same naked-store pattern — it should be made
> consistent.
Will study this and post a separate patch for kill_dax if I think it's
warranted
>
> 2. Unconditional holder_ops = NULL is a behavior regression
>
> Pre-patch was defensive: if a caller passed the wrong holder, the cmpxchg
> failed and nothing got cleared.
>
> Post-patch clears holder_ops unconditionally whenever dax_dev && holder is
> truthy. A wrong-holder fs_put_dax() now actively damages the legitimate
> holder's state — sets holder_ops to NULL while holder_data retains the
> legitimate holder's pointer. From that point, all dax_holder_notify_failure()
> calls return -EOPNOTSUPP, silently breaking the legitimate holder's
> poison-recovery path.
This is a bit of a sticky wicket. The API contract is that the caller
of fs_dax_put() is the holder. To get the ordering right AND guard against
non-holder callers would require a lock.
Instead, I think the right answer is:
WRITE_ONCE(dax_dev->holder_ops, NULL);
WARN_ON(cmpxchg(&dax_dev->holder_data, holder, NULL) != holder);
If a non-holder calls this function, that's a bug and we'll get the
WARN_ON(). If a holder calls this function twice, we'll get the WARN_ON()
(the second time).
And when the API contract is honored, we have correct ordering.
>
> DJ
Thanks Dave!
John
<snip>
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH V2 6/7] dax: replace exported dax_dev_get() with non-allocating dax_dev_find()
2026-05-27 0:28 ` Dave Jiang
@ 2026-05-30 14:19 ` John Groves
0 siblings, 0 replies; 22+ messages in thread
From: John Groves @ 2026-05-30 14:19 UTC (permalink / raw)
To: Dave Jiang
Cc: John Groves, Dan Williams, John Groves, Vishal Verma,
Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
Miklos Szeredi, Alison Schofield, Ira Weiny, Jonathan Cameron,
nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org
On 26/05/26 05:28PM, Dave Jiang wrote:
>
>
> On 5/22/26 12:19 PM, John Groves wrote:
> > From: John Groves <John@Groves.net>
> >
> > This fix is in response to a Sashiko review, and some subsequent
> > analysis.
> >
> > dax_dev_get() uses iget5_locked() which creates a new inode if no
> > matching one exists. This is correct for the internal caller
> > (alloc_dax), but dangerous for external callers that look up devices
> > from user-supplied or metadata-supplied dev_t values:
> >
> > 1. A new inode is created with DAXDEV_ALIVE set but no backing driver,
> > no ops, and no IDA-allocated minor number.
> >
> > 2. On teardown, dax_destroy_inode() warns because kill_dax() was never
> > called, and dax_free_inode() calls ida_free() for a minor that was
> > never ida_alloc'd — potentially freeing the minor of a real device.
> >
> > Add dax_dev_find() which uses ilookup5() for lookup-only semantics:
> > it returns an existing dax_device with an elevated inode reference, or
> > NULL if no device with the given dev_t exists. It never creates inodes.
> >
> > Make dax_dev_get() static again (internal to super.c for alloc_dax),
> > export dax_dev_find() instead, and update the two external callers
> > (famfs_inode.c, famfs.c). Also add the missing CONFIG_DAX=n stub.
> >
> > Fixes: 2ae624d5a555d ("dax: export dax_dev_get()")
> > Signed-off-by: John Groves <john@groves.net>
> > ---
> > drivers/dax/super.c | 27 +++++++++++++++++++++++++--
> > include/linux/dax.h | 6 +++++-
> > 2 files changed, 30 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> > index fa1d2a6eb2408..79e5823d1010d 100644
> > --- a/drivers/dax/super.c
> > +++ b/drivers/dax/super.c
> > @@ -541,7 +541,7 @@ static int dax_set(struct inode *inode, void *data)
> > return 0;
> > }
> >
> > -struct dax_device *dax_dev_get(dev_t devt)
> > +static struct dax_device *dax_dev_get(dev_t devt)
> > {
> > struct dax_device *dax_dev;
> > struct inode *inode;
> > @@ -564,7 +564,30 @@ struct dax_device *dax_dev_get(dev_t devt)
> >
> > return dax_dev;
> > }
> > -EXPORT_SYMBOL_GPL(dax_dev_get);
> > +
> > +/**
> > + * dax_dev_find - look up an existing dax_device by dev_t
> > + * @devt: the device number to find
> > + *
> > + * Returns a dax_device with an elevated inode reference, or NULL if no
> > + * device with the given dev_t exists. Unlike dax_dev_get(), this never
> > + * allocates a new inode — it is safe for external callers that are looking
> > + * up devices from user-supplied or metadata-supplied dev_t values.
> > + *
> > + * Caller must put_dax() the returned device when done.
> > + */
> > +struct dax_device *dax_dev_find(dev_t devt)
> > +{
> > + struct inode *inode;
> > +
> > + inode = ilookup5(dax_superblock, hash_32(devt + DAXFS_MAGIC, 31),
> > + dax_test, &devt);
> > + if (!inode)
> > + return NULL;
> > +
>
> Claude mentions that dax_alive() check may be a good idea after grabbing the inode ref. Otherwise famfs may get a dax_dev that may be racing with a teardown. Do something similar that fs_dax_get_by_dev() or fs_dax_get() do WRT dax_alive() check perhaps.
>
> DJ
I think that's right. Calling dax_alive() requires a dax_read_lock();
Adding that too.
Thanks!
John
<snip>
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH V2 5/7] dax: fix holder_ops race in fs_put_dax()
2026-05-30 14:02 ` John Groves
@ 2026-05-30 14:32 ` John Groves
0 siblings, 0 replies; 22+ messages in thread
From: John Groves @ 2026-05-30 14:32 UTC (permalink / raw)
To: Dave Jiang
Cc: John Groves, Dan Williams, John Groves (jgroves), Vishal Verma,
Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
Miklos Szeredi, Alison Schofield, Ira Weiny, Jonathan Cameron,
nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org
On Sat, May 30, 2026, at 9:02 AM, John Groves wrote:
> On 26/05/26 05:16PM, Dave Jiang wrote:
> >
> >
> > On 5/22/26 12:19 PM, John Groves wrote:
> > > From: John Groves <John@Groves.net>
> > >
> > > Clear holder_ops before holder_data so that a concurrent fs_dax_get()
> > > cannot have its newly installed holder_ops overwritten. Also add a
> > > kerneldoc comment documenting that fs_put_dax() must only be called
> > > by the current holder.
> > >
> > > Fixes: eec38f5d86d27 ("dax: add fs_dax_get() for devdax")
> > > Signed-off-by: John Groves <john@groves.net>
> >
> > Couple things from Claude that may be worth taking a look at:
> >
> > 1. Memory ordering is now load-bearing and missing
> >
> > The whole correctness argument depends on the reader observing holder_ops =
> > NULL before observing holder_data = NULL. The patch uses a plain store
> > followed by cmpxchg. On x86 plain stores are ordered, but on arm64/ppc they
> > are not — the reader can observe cmpxchg's release of holder_data while still
> > seeing the old holder_ops. That puts us back in the dangerous (holder_data ==
> > NULL, holder_ops == old) state on weakly-ordered arches.
> >
> > Required:
> >
> > smp_store_release(&dax_dev->holder_ops, NULL); /* publish ops=NULL first */
> > cmpxchg(&dax_dev->holder_data, holder, NULL); /* then release holder_data
> > */
>
> Updating to WRITE_ONCE(), which I think is the right choice
>
> >
> > And the reader in dax_holder_notify_failure should use
> > smp_load_acquire/READ_ONCE because today it reads dax_dev->holder_ops twice
> > (line 334 and line 339), allowing tearing or stale-cache reads. Pre-existing
> > weakness, but this patch is what makes the ordering matter.
> >
> > kill_dax (line 461-462) has the same naked-store pattern — it should be made
> > consistent.
>
> Will study this and post a separate patch for kill_dax if I think it's
> warranted
>
Fixing kill_dax() isn't necessary because it does a synchronize_srcu() after
clear_bit(DAXDEV_ALIVE). So it can't race with fs_dax_get()...
Thanks,
John
^ permalink raw reply [flat|nested] 22+ messages in thread
end of thread, other threads:[~2026-05-30 14:33 UTC | newest]
Thread overview: 22+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <20260522191804.79088-1-john@jagalactic.com>
2026-05-22 19:18 ` [PATCH V2 0/7] Fixes to the previously-merged drivers/dax/fsdev series John Groves
2026-05-22 19:18 ` [PATCH V2 1/7] dax: fix misleading comment about share/index union in dax_folio_reset_order() John Groves
2026-05-26 23:07 ` Dave Jiang
2026-05-29 23:41 ` John Groves
2026-05-22 19:18 ` [PATCH V2 2/7] dax/fsdev: fix multi-range offset, vmemmap_shift leak, and probe error cleanup John Groves
2026-05-26 23:22 ` Dave Jiang
2026-05-29 23:59 ` John Groves
2026-05-22 19:19 ` [PATCH V2 3/7] dax/fsdev: fix kaddr for multi-range and fail probe on invalid pgmap offset John Groves
2026-05-26 23:31 ` Dave Jiang
2026-05-30 0:04 ` John Groves
2026-05-22 19:19 ` [PATCH V2 4/7] dax/fsdev: clamp direct_access return to current physical range John Groves
2026-05-27 0:00 ` Dave Jiang
2026-05-30 13:06 ` John Groves
2026-05-22 19:19 ` [PATCH V2 5/7] dax: fix holder_ops race in fs_put_dax() John Groves
2026-05-27 0:16 ` Dave Jiang
2026-05-30 14:02 ` John Groves
2026-05-30 14:32 ` John Groves
2026-05-22 19:19 ` [PATCH V2 6/7] dax: replace exported dax_dev_get() with non-allocating dax_dev_find() John Groves
2026-05-27 0:28 ` Dave Jiang
2026-05-30 14:19 ` John Groves
2026-05-22 19:19 ` [PATCH V2 7/7] dax: fsdev.c minor formatting cleanup John Groves
2026-05-27 0:31 ` Dave Jiang
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.