* [PATCH V5 0/9] Fixes to the previously-merged drivers/dax/fsdev series
[not found] <20260611173057.65868-1-john@jagalactic.com>
@ 2026-06-11 17:31 ` John Groves
2026-06-11 17:31 ` [PATCH V5 1/9] dax: fix misleading comment about share/index union in dax_folio_reset_order() John Groves
` (8 more replies)
0 siblings, 9 replies; 18+ messages in thread
From: John Groves @ 2026-06-11 17:31 UTC (permalink / raw)
To: John Groves, Dan Williams
Cc: John Groves, Vishal Verma, Dave Jiang, Matthew Wilcox, Jan Kara,
Alexander Viro, Christian Brauner, Miklos Szeredi,
Alison Schofield, Ira Weiny, Jonathan Cameron,
nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
John Groves
From: John Groves <john@groves.net>
This series applies bug fixes (mostly found via sashiko) to the dax/fsdev
series. It has been soaking in the famfs CI pipeline and 1) won't affect
anything that doesn't use drivers/dax/fsdev.c, and 2) doesn't affect any
known workloads -- although the bugs would have manifested when multi-range
DCD dax devices are a thing (soon-ish).
Most of the series is confined to drivers/dax/fsdev.c. Two patches touch
shared DAX core in drivers/dax/super.c: patch 7 reads holder_ops once in
dax_holder_notify_failure() to close a double-fetch NULL dereference, and
patch 8 reorders fs_put_dax() and adds a WARN_ON(). fs_put_dax() is used by
ext2/ext4/erofs/xfs, but only holder-passing callers (like XFS in-tree) will
see a behavior change, and only a new warning if they misuse it.
Changes since V4:
- New patch 7 (dax: read holder_ops once in dax_holder_notify_failure()):
split the reader-side READ_ONCE() fix out of the fs_put_dax() patch and
placed it first, so the fs_put_dax() patch's "a concurrent
dax_holder_notify_failure() that sees NULL ops returns -EOPNOTSUPP
cleanly" reasoning actually holds when it lands. dax_holder_notify_failure()
read holder_ops twice without READ_ONCE(); a concurrent clear could make
the NULL check pass while the indirect call dereferenced NULL. Carries
Fixes: 8012b86608552 ("dax: introduce holder for dax_device"), the commit
that introduced the unmarked double fetch. Suggested by Richard Cheng (and
the Sashiko bot).
- Patch 2 (multi-range memory_failure offset): the ->memory_failure callback
now walks the pagemap's own immutable range array (pgmap->ranges[]) rather
than dev_dax->ranges[], which a concurrent sysfs mapping_store() can
krealloc() under dax_region_rwsem while this callback holds no such lock.
For dynamic devices the two arrays are identical, so the reported offset is
unchanged for the multi-range case this patch targets. Suggested by Richard
Cheng (and the Sashiko bot).
- Dropped the dax_dev_get()/dax_dev_find() patch (V4 patch 8) from this
revision. There is no in-tree caller yet; it will be sent together with the
famfs filesystem series that introduces the caller. (Per Richard Cheng /
Sashiko.)
- Patch 8 (holder_ops race in fs_put_dax()): unchanged from V4 (renumbered
from 7 to 8).
- Collected Reviewed-by from Dave Jiang on patches 4 and 6.
Changes since V3:
- Patch 4: Adopted Dave's suggested refactor -- factor out
fsdev_acquire_pgmap() and defer the dev_dax->pgmap assignment until
probe can no longer fail, replacing the goto-based cleanup. Did not
carry Alison's V3 Reviewed-by due to the rewrite.
- Patch 5: Also remove the now write-only dev_dax->virt_addr field,
per Dave's review.
- Patch 7: Fixed the WARN_ON() to tolerate holder_data == NULL, which
legitimately occurs when kill_dax() clears it during device removal
under a live holder (per Dave's review). Wrong-holder calls still
warn.
- Patch 8: Kept the Fixes tag -- the exported symbol itself is the
hazard; stable kernels carrying the export should want this fix.
Changes since V2:
* Patch 1 (comment fix): No change. Responded to Dave's question about
the dropped precondition -- the new comment correctly covers both
callers; fsdev_clear_folio_state() does not guarantee share==0 before
calling, so the old precondition was no longer universally true.
* V2 patch 2 (three fixes): Split into three separate patches (patches
2-4) per Dave's review.
* V2 patch 3 (two fixes): Split into two separate patches (patches 5-6)
per Dave's review.
* V2 patch 4 (clamp direct_access / remove cached_size): Dropped.
Dave's analysis correctly showed the claimed bug does not exist --
dax_pgoff_to_phys() already enforces that the full requested size fits
within a single range before returning, making the clamp a no-op in
every reachable path.
* V2 patch 5 (holder_ops race): Use WRITE_ONCE() for the holder_ops
store; add WARN_ON() on the cmpxchg result to catch wrong-holder and
double-put API contract violations; fix the inline comment, which
incorrectly claimed dax_holder_notify_failure() consults holder_ops
only when holder_data is non-NULL.
* V2 patch 6 (dax_dev_find): Add dax_alive() check under dax_read_lock()
after ilookup5() to prevent returning a device that is concurrently
being torn down by kill_dax().
* V2 patch 7 (formatting cleanup): Drop incorrect Fixes: tag; add
Dave's Reviewed-by.
* The series grows from 7 to 9 patches.
Changes since v1:
* Dropped modes from patch 6 to fs/fuse/famfs.c and
fs/famfs/famfs_inode.c, which are not upstream so it broke
attempts to apply the series. Oops...
* Added patch 7, which addresses a previously-missed review comment
from Jonathan - minor cleanup
John Groves (9):
dax: fix misleading comment about share/index union in
dax_folio_reset_order()
dax/fsdev: fix multi-range offset in memory_failure handler
dax/fsdev: clear vmemmap_shift when binding static pgmap
dax/fsdev: don't leave a dangling dev_dax->pgmap on probe failure
dax/fsdev: use __va(phys) for kaddr in direct_access
dax/fsdev: fail probe on invalid pgmap offset
dax: read holder_ops once in dax_holder_notify_failure()
dax: fix holder_ops race in fs_put_dax()
dax: fsdev.c minor formatting cleanup
drivers/dax/dax-private.h | 2 -
drivers/dax/fsdev.c | 126 +++++++++++++++++++++++++-------------
drivers/dax/super.c | 53 ++++++++++++++--
fs/dax.c | 12 ++--
4 files changed, 136 insertions(+), 57 deletions(-)
base-commit: 4549871118cf616eecdd2d939f78e3b9e1dddc48
--
2.53.0
^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH V5 1/9] dax: fix misleading comment about share/index union in dax_folio_reset_order()
2026-06-11 17:31 ` [PATCH V5 0/9] Fixes to the previously-merged drivers/dax/fsdev series John Groves
@ 2026-06-11 17:31 ` John Groves
2026-06-11 17:31 ` [PATCH V5 2/9] dax/fsdev: fix multi-range offset in memory_failure handler John Groves
` (7 subsequent siblings)
8 siblings, 0 replies; 18+ messages in thread
From: John Groves @ 2026-06-11 17:31 UTC (permalink / raw)
To: John Groves, Dan Williams
Cc: John Groves, Vishal Verma, Dave Jiang, Matthew Wilcox, Jan Kara,
Alexander Viro, Christian Brauner, Miklos Szeredi,
Alison Schofield, Ira Weiny, Jonathan Cameron,
nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
John Groves
From: John Groves <John@Groves.net>
The comment in dax_folio_reset_order() claims that DAX maintains an
invariant where folio->share != 0 only when folio->mapping == NULL,
implying folio->share is zero whenever mapping is non-NULL. This is
misleading because folio->share and folio->index are a union -- for
non-shared folios with mapping != NULL, reading folio->share returns
the file page offset (folio->index), which is typically non-zero.
Reword the comment to accurately describe the union aliasing: the
assignment clears whichever interpretation of the union word is active
(index for non-shared folios, share for shared folios), which is correct
because the folio is being released in either case.
No functional change -- the code was already correct, only the
justification was wrong.
Fixes: 59eb73b98ae0b ("dax: Factor out dax_folio_reset_order() helper")
Reviewed-by: Jonathan Cameron <jic23@kernel.org>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Signed-off-by: John Groves <john@groves.net>
---
fs/dax.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/fs/dax.c b/fs/dax.c
index 6d175cd47a99b..df19c9317d10e 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -392,12 +392,12 @@ int dax_folio_reset_order(struct folio *folio)
int order = folio_order(folio);
/*
- * DAX maintains the invariant that folio->share != 0 only when
- * folio->mapping == NULL (enforced by dax_folio_make_shared()).
- * Equivalently: folio->mapping != NULL implies folio->share == 0.
- * Callers ensure share has been decremented to zero before
- * calling here, so unconditionally clearing both fields is
- * correct.
+ * Clear the mapping and the index/share union word. folio->share
+ * and folio->index occupy the same union in struct folio. For
+ * non-shared folios (mapping != NULL), the union holds folio->index
+ * (file page offset); for shared folios (mapping == NULL), it holds
+ * folio->share (reference count). Either way, we are releasing the
+ * folio and both fields should be zeroed.
*/
folio->mapping = NULL;
folio->share = 0;
--
2.53.0
^ permalink raw reply related [flat|nested] 18+ messages in thread
* [PATCH V5 2/9] dax/fsdev: fix multi-range offset in memory_failure handler
2026-06-11 17:31 ` [PATCH V5 0/9] Fixes to the previously-merged drivers/dax/fsdev series John Groves
2026-06-11 17:31 ` [PATCH V5 1/9] dax: fix misleading comment about share/index union in dax_folio_reset_order() John Groves
@ 2026-06-11 17:31 ` John Groves
2026-06-12 3:08 ` Richard Cheng
2026-06-11 17:32 ` [PATCH V5 3/9] dax/fsdev: clear vmemmap_shift when binding static pgmap John Groves
` (6 subsequent siblings)
8 siblings, 1 reply; 18+ messages in thread
From: John Groves @ 2026-06-11 17:31 UTC (permalink / raw)
To: John Groves, Dan Williams
Cc: John Groves, Vishal Verma, Dave Jiang, Matthew Wilcox, Jan Kara,
Alexander Viro, Christian Brauner, Miklos Szeredi,
Alison Schofield, Ira Weiny, Jonathan Cameron,
nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
Richard Cheng, John Groves
From: John Groves <John@Groves.net>
Fix memory_failure offset calculation for multi-range devices. The old code
subtracted ranges[0].range.start from the faulting PFN's physical address,
which produces an incorrect (inflated) logical offset when the PFN falls in
ranges[1] or beyond due to physical gaps between ranges. Add
fsdev_pfn_to_offset() to walk the range list and compute the correct
device-linear byte offset.
Walk the pagemap's own range array (pgmap->ranges[]) rather than
dev_dax->ranges[]. The pgmap copy is the immutable snapshot populated at
probe and is never mutated afterwards, whereas dev_dax->ranges[] can be
krealloc()'d by a concurrent sysfs mapping_store() (under dax_region_rwsem,
which this ->memory_failure callback does not hold). For dynamic devices the
two arrays are identical, so the reported offset is unchanged for the
multi-range case this targets.
Fixes: d5406bd458b0a ("dax: add fsdev.c driver for fs-dax on character dax")
Suggested-by: Richard Cheng <icheng@nvidia.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Signed-off-by: John Groves <john@groves.net>
---
drivers/dax/fsdev.c | 17 ++++++++++++++++-
1 file changed, 16 insertions(+), 1 deletion(-)
diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
index 188b2526bee45..2c5de3d80a618 100644
--- a/drivers/dax/fsdev.c
+++ b/drivers/dax/fsdev.c
@@ -135,11 +135,26 @@ static void fsdev_clear_ops(void *data)
* The core mm code in free_zone_device_folio() handles the wake_up_var()
* directly for this memory type.
*/
+static u64 fsdev_pfn_to_offset(struct dev_pagemap *pgmap, unsigned long pfn)
+{
+ phys_addr_t phys = PFN_PHYS(pfn);
+ u64 offset = 0;
+
+ for (int i = 0; i < pgmap->nr_range; i++) {
+ struct range *range = &pgmap->ranges[i];
+
+ if (phys >= range->start && phys <= range->end)
+ return offset + (phys - range->start);
+ offset += range_len(range);
+ }
+ return -1ULL;
+}
+
static int fsdev_pagemap_memory_failure(struct dev_pagemap *pgmap,
unsigned long pfn, unsigned long nr_pages, int mf_flags)
{
struct dev_dax *dev_dax = pgmap->owner;
- u64 offset = PFN_PHYS(pfn) - dev_dax->ranges[0].range.start;
+ u64 offset = fsdev_pfn_to_offset(pgmap, pfn);
u64 len = nr_pages << PAGE_SHIFT;
return dax_holder_notify_failure(dev_dax->dax_dev, offset,
--
2.53.0
^ permalink raw reply related [flat|nested] 18+ messages in thread
* [PATCH V5 3/9] dax/fsdev: clear vmemmap_shift when binding static pgmap
2026-06-11 17:31 ` [PATCH V5 0/9] Fixes to the previously-merged drivers/dax/fsdev series John Groves
2026-06-11 17:31 ` [PATCH V5 1/9] dax: fix misleading comment about share/index union in dax_folio_reset_order() John Groves
2026-06-11 17:31 ` [PATCH V5 2/9] dax/fsdev: fix multi-range offset in memory_failure handler John Groves
@ 2026-06-11 17:32 ` John Groves
2026-06-12 2:56 ` Richard Cheng
2026-06-11 17:32 ` [PATCH V5 4/9] dax/fsdev: don't leave a dangling dev_dax->pgmap on probe failure John Groves
` (5 subsequent siblings)
8 siblings, 1 reply; 18+ messages in thread
From: John Groves @ 2026-06-11 17:32 UTC (permalink / raw)
To: John Groves, Dan Williams
Cc: John Groves, Vishal Verma, Dave Jiang, Matthew Wilcox, Jan Kara,
Alexander Viro, Christian Brauner, Miklos Szeredi,
Alison Schofield, Ira Weiny, Jonathan Cameron,
nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
John Groves
From: John Groves <John@Groves.net>
Clear pgmap->vmemmap_shift for static DAX devices. When rebinding a static
device from device_dax (which may set vmemmap_shift based on alignment) to
fsdev_dax, the stale vmemmap_shift persists on the shared pgmap. Explicitly
zero it before devm_memremap_pages() so the vmemmap is built for order-0
folios as fsdev requires.
Fixes: d5406bd458b0a ("dax: add fsdev.c driver for fs-dax on character dax")
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Signed-off-by: John Groves <john@groves.net>
---
drivers/dax/fsdev.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
index 2c5de3d80a618..52f46b3e245ea 100644
--- a/drivers/dax/fsdev.c
+++ b/drivers/dax/fsdev.c
@@ -237,6 +237,7 @@ static int fsdev_dax_probe(struct dev_dax *dev_dax)
}
pgmap = dev_dax->pgmap;
+ pgmap->vmemmap_shift = 0;
} else {
size_t pgmap_size;
--
2.53.0
^ permalink raw reply related [flat|nested] 18+ messages in thread
* [PATCH V5 4/9] dax/fsdev: don't leave a dangling dev_dax->pgmap on probe failure
2026-06-11 17:31 ` [PATCH V5 0/9] Fixes to the previously-merged drivers/dax/fsdev series John Groves
` (2 preceding siblings ...)
2026-06-11 17:32 ` [PATCH V5 3/9] dax/fsdev: clear vmemmap_shift when binding static pgmap John Groves
@ 2026-06-11 17:32 ` John Groves
2026-06-11 17:32 ` [PATCH V5 5/9] dax/fsdev: use __va(phys) for kaddr in direct_access John Groves
` (4 subsequent siblings)
8 siblings, 0 replies; 18+ messages in thread
From: John Groves @ 2026-06-11 17:32 UTC (permalink / raw)
To: John Groves, Dan Williams
Cc: John Groves, Vishal Verma, Dave Jiang, Matthew Wilcox, Jan Kara,
Alexander Viro, Christian Brauner, Miklos Szeredi,
Alison Schofield, Ira Weiny, Jonathan Cameron,
nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
John Groves
From: John Groves <John@Groves.net>
After the dynamic path set dev_dax->pgmap, any later probe failure left
dev_dax->pgmap dangling: devres frees the devm_kzalloc'd pgmap on probe
failure, and subsequent probe attempts would hit the "dynamic-dax with
pre-populated page map" check and fail permanently.
Factor pgmap acquisition out into fsdev_acquire_pgmap(), and defer the
dev_dax->pgmap assignment until probe can no longer fail. A failed probe
now never publishes the pointer at all, so there is nothing to unwind.
This also matches kill_dev_dax(), which already clears the dynamic pgmap
pointer on unbind: dev_dax->pgmap is now non-NULL only while the pgmap
is actually valid.
Refactor suggested by Dave Jiang.
Fixes: d5406bd458b0a ("dax: add fsdev.c driver for fs-dax on character dax")
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: John Groves <john@groves.net>
---
drivers/dax/fsdev.c | 77 ++++++++++++++++++++++++++++-----------------
1 file changed, 49 insertions(+), 28 deletions(-)
diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
index 52f46b3e245ea..cc097167ad2c7 100644
--- a/drivers/dax/fsdev.c
+++ b/drivers/dax/fsdev.c
@@ -219,47 +219,62 @@ static const struct file_operations fsdev_fops = {
.release = fsdev_release,
};
-static int fsdev_dax_probe(struct dev_dax *dev_dax)
+/*
+ * Acquire the dev_pagemap for probe: the static (pre-populated) one if
+ * present, or a devm-allocated one for the dynamic case. Note that
+ * dev_dax->pgmap is not set here; fsdev_dax_probe() sets it only once
+ * probe succeeds, so a failed probe never leaves a dangling pointer
+ * to a devres-freed pgmap.
+ */
+static struct dev_pagemap *fsdev_acquire_pgmap(struct dev_dax *dev_dax)
{
- struct dax_device *dax_dev = dev_dax->dax_dev;
struct device *dev = &dev_dax->dev;
struct dev_pagemap *pgmap;
- struct inode *inode;
- u64 data_offset = 0;
- struct cdev *cdev;
- void *addr;
- int rc, i;
+ size_t pgmap_size;
if (static_dev_dax(dev_dax)) {
if (dev_dax->nr_range > 1) {
- dev_warn(dev, "static pgmap / multi-range device conflict\n");
- return -EINVAL;
+ dev_warn(dev,
+ "static pgmap / multi-range device conflict\n");
+ return ERR_PTR(-EINVAL);
}
pgmap = dev_dax->pgmap;
pgmap->vmemmap_shift = 0;
- } else {
- size_t pgmap_size;
+ return pgmap;
+ }
- if (dev_dax->pgmap) {
- dev_warn(dev, "dynamic-dax with pre-populated page map\n");
- return -EINVAL;
- }
+ if (dev_dax->pgmap) {
+ dev_warn(dev, "dynamic-dax with pre-populated page map\n");
+ return ERR_PTR(-EINVAL);
+ }
- pgmap_size = struct_size(pgmap, ranges, dev_dax->nr_range - 1);
- pgmap = devm_kzalloc(dev, pgmap_size, GFP_KERNEL);
- if (!pgmap)
- return -ENOMEM;
+ pgmap_size = struct_size(pgmap, ranges, dev_dax->nr_range - 1);
+ pgmap = devm_kzalloc(dev, pgmap_size, GFP_KERNEL);
+ if (!pgmap)
+ return ERR_PTR(-ENOMEM);
- pgmap->nr_range = dev_dax->nr_range;
- dev_dax->pgmap = pgmap;
+ pgmap->nr_range = dev_dax->nr_range;
+ for (int i = 0; i < dev_dax->nr_range; i++)
+ pgmap->ranges[i] = dev_dax->ranges[i].range;
- for (i = 0; i < dev_dax->nr_range; i++) {
- struct range *range = &dev_dax->ranges[i].range;
+ return pgmap;
+}
- pgmap->ranges[i] = *range;
- }
- }
+static int fsdev_dax_probe(struct dev_dax *dev_dax)
+{
+ struct dax_device *dax_dev = dev_dax->dax_dev;
+ struct device *dev = &dev_dax->dev;
+ struct dev_pagemap *pgmap;
+ struct inode *inode;
+ u64 data_offset = 0;
+ struct cdev *cdev;
+ void *addr;
+ int rc, i;
+
+ pgmap = fsdev_acquire_pgmap(dev_dax);
+ if (IS_ERR(pgmap))
+ return PTR_ERR(pgmap);
for (i = 0; i < dev_dax->nr_range; i++) {
struct range *range = &dev_dax->ranges[i].range;
@@ -306,7 +321,7 @@ static int fsdev_dax_probe(struct dev_dax *dev_dax)
/* Detect whether the data is at a non-zero offset into the memory */
if (pgmap->range.start != dev_dax->ranges[0].range.start) {
u64 phys = dev_dax->ranges[0].range.start;
- u64 pgmap_phys = dev_dax->pgmap[0].range.start;
+ u64 pgmap_phys = pgmap[0].range.start;
if (!WARN_ON(pgmap_phys > phys))
data_offset = phys - pgmap_phys;
@@ -339,7 +354,13 @@ static int fsdev_dax_probe(struct dev_dax *dev_dax)
return rc;
run_dax(dax_dev);
- return devm_add_action_or_reset(dev, fsdev_kill, dev_dax);
+ rc = devm_add_action_or_reset(dev, fsdev_kill, dev_dax);
+ if (rc)
+ return rc;
+
+ /* Probe can no longer fail; expose the pgmap via dev_dax */
+ dev_dax->pgmap = pgmap;
+ return 0;
}
static struct dax_device_driver fsdev_dax_driver = {
--
2.53.0
^ permalink raw reply related [flat|nested] 18+ messages in thread
* [PATCH V5 5/9] dax/fsdev: use __va(phys) for kaddr in direct_access
2026-06-11 17:31 ` [PATCH V5 0/9] Fixes to the previously-merged drivers/dax/fsdev series John Groves
` (3 preceding siblings ...)
2026-06-11 17:32 ` [PATCH V5 4/9] dax/fsdev: don't leave a dangling dev_dax->pgmap on probe failure John Groves
@ 2026-06-11 17:32 ` John Groves
2026-06-11 17:32 ` [PATCH V5 6/9] dax/fsdev: fail probe on invalid pgmap offset John Groves
` (3 subsequent siblings)
8 siblings, 0 replies; 18+ messages in thread
From: John Groves @ 2026-06-11 17:32 UTC (permalink / raw)
To: John Groves, Dan Williams
Cc: John Groves, Vishal Verma, Dave Jiang, Matthew Wilcox, Jan Kara,
Alexander Viro, Christian Brauner, Miklos Szeredi,
Alison Schofield, Ira Weiny, Jonathan Cameron,
nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
John Groves
From: John Groves <John@Groves.net>
Use __va(phys) instead of virt_addr + linear_offset for the kaddr
return in __fsdev_dax_direct_access(). The previous code added a
device-linear byte offset to virt_addr (which is __va of ranges[0]),
but for multi-range devices with physical gaps between ranges, this
linear arithmetic crosses the gap and produces a wrong kernel virtual
address. Using __va(phys) where phys comes from dax_pgoff_to_phys()
is correct for any range layout because the direct map translates
each physical address independently.
This leaves dev_dax->virt_addr write-only, so remove the field
(suggested by Dave Jiang).
Fixes: 759455848df0b ("dax: Save the kva from memremap")
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Signed-off-by: John Groves <john@groves.net>
---
drivers/dax/dax-private.h | 2 --
drivers/dax/fsdev.c | 8 ++------
2 files changed, 2 insertions(+), 8 deletions(-)
diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
index 81e4af49e39c1..607a53a91f58b 100644
--- a/drivers/dax/dax-private.h
+++ b/drivers/dax/dax-private.h
@@ -69,7 +69,6 @@ struct dev_dax_range {
* data while the device is activated in the driver.
* @region: parent region
* @dax_dev: core dax functionality
- * @virt_addr: kva from memremap; used by fsdev_dax
* @cached_size: size of daxdev cached by fsdev_dax
* @align: alignment of this instance
* @target_node: effective numa node if dev_dax memory range is onlined
@@ -85,7 +84,6 @@ struct dev_dax_range {
struct dev_dax {
struct dax_region *region;
struct dax_device *dax_dev;
- void *virt_addr;
u64 cached_size;
unsigned int align;
int target_node;
diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
index cc097167ad2c7..07de6bfbbf673 100644
--- a/drivers/dax/fsdev.c
+++ b/drivers/dax/fsdev.c
@@ -51,9 +51,7 @@ static long __fsdev_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
struct dev_dax *dev_dax = dax_get_private(dax_dev);
size_t size = nr_pages << PAGE_SHIFT;
size_t offset = pgoff << PAGE_SHIFT;
- void *virt_addr = dev_dax->virt_addr + offset;
phys_addr_t phys;
- unsigned long local_pfn;
phys = dax_pgoff_to_phys(dev_dax, pgoff, size);
if (phys == -1) {
@@ -63,11 +61,10 @@ static long __fsdev_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
}
if (kaddr)
- *kaddr = virt_addr;
+ *kaddr = __va(phys);
- local_pfn = PHYS_PFN(phys);
if (pfn)
- *pfn = local_pfn;
+ *pfn = PHYS_PFN(phys);
/*
* Use cached_size which was computed at probe time. The size cannot
@@ -329,7 +326,6 @@ static int fsdev_dax_probe(struct dev_dax *dev_dax)
pr_debug("%s: offset detected phys=%llx pgmap_phys=%llx offset=%llx\n",
__func__, phys, pgmap_phys, data_offset);
}
- dev_dax->virt_addr = addr + data_offset;
inode = dax_inode(dax_dev);
cdev = inode->i_cdev;
--
2.53.0
^ permalink raw reply related [flat|nested] 18+ messages in thread
* [PATCH V5 6/9] dax/fsdev: fail probe on invalid pgmap offset
2026-06-11 17:31 ` [PATCH V5 0/9] Fixes to the previously-merged drivers/dax/fsdev series John Groves
` (4 preceding siblings ...)
2026-06-11 17:32 ` [PATCH V5 5/9] dax/fsdev: use __va(phys) for kaddr in direct_access John Groves
@ 2026-06-11 17:32 ` John Groves
2026-06-11 18:09 ` Gupta, Pankaj
2026-06-11 17:32 ` [PATCH V5 7/9] dax: read holder_ops once in dax_holder_notify_failure() John Groves
` (2 subsequent siblings)
8 siblings, 1 reply; 18+ messages in thread
From: John Groves @ 2026-06-11 17:32 UTC (permalink / raw)
To: John Groves, Dan Williams
Cc: John Groves, Vishal Verma, Dave Jiang, Matthew Wilcox, Jan Kara,
Alexander Viro, Christian Brauner, Miklos Szeredi,
Alison Schofield, Ira Weiny, Jonathan Cameron,
nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
John Groves
From: John Groves <John@Groves.net>
Convert the WARN_ON to a fatal error when pgmap_phys > phys. This
condition means the remapped region starts after the device's data
region, which is an impossible state. Previously the probe continued
with data_offset=0, leaving virt_addr silently misaligned. Now probe
returns -EINVAL with a diagnostic message.
Fixes: 759455848df0b ("dax: Save the kva from memremap")
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Signed-off-by: John Groves <john@groves.net>
---
drivers/dax/fsdev.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
index 07de6bfbbf673..71d2bee1e2805 100644
--- a/drivers/dax/fsdev.c
+++ b/drivers/dax/fsdev.c
@@ -320,8 +320,12 @@ static int fsdev_dax_probe(struct dev_dax *dev_dax)
u64 phys = dev_dax->ranges[0].range.start;
u64 pgmap_phys = pgmap[0].range.start;
- if (!WARN_ON(pgmap_phys > phys))
- data_offset = phys - pgmap_phys;
+ if (pgmap_phys > phys) {
+ dev_err(dev, "pgmap start %#llx exceeds data start %#llx\n",
+ pgmap_phys, phys);
+ return -EINVAL;
+ }
+ data_offset = phys - pgmap_phys;
pr_debug("%s: offset detected phys=%llx pgmap_phys=%llx offset=%llx\n",
__func__, phys, pgmap_phys, data_offset);
--
2.53.0
^ permalink raw reply related [flat|nested] 18+ messages in thread
* [PATCH V5 7/9] dax: read holder_ops once in dax_holder_notify_failure()
2026-06-11 17:31 ` [PATCH V5 0/9] Fixes to the previously-merged drivers/dax/fsdev series John Groves
` (5 preceding siblings ...)
2026-06-11 17:32 ` [PATCH V5 6/9] dax/fsdev: fail probe on invalid pgmap offset John Groves
@ 2026-06-11 17:32 ` John Groves
2026-06-12 3:02 ` Richard Cheng
2026-06-11 17:32 ` [PATCH V5 8/9] dax: fix holder_ops race in fs_put_dax() John Groves
2026-06-11 17:33 ` [PATCH V5 9/9] dax: fsdev.c minor formatting cleanup John Groves
8 siblings, 1 reply; 18+ messages in thread
From: John Groves @ 2026-06-11 17:32 UTC (permalink / raw)
To: John Groves, Dan Williams
Cc: John Groves, Vishal Verma, Dave Jiang, Matthew Wilcox, Jan Kara,
Alexander Viro, Christian Brauner, Miklos Szeredi,
Alison Schofield, Ira Weiny, Jonathan Cameron,
nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
Richard Cheng, John Groves
From: John Groves <John@Groves.net>
dax_holder_notify_failure() reads dax_dev->holder_ops twice without
READ_ONCE() -- once for the NULL check and once for the indirect
notify_failure() call. A concurrent fs_put_dax() or kill_dax() can clear
holder_ops between the two reads, so the check can observe a non-NULL
pointer while the call dereferences NULL.
Fetch holder_ops once into a local with READ_ONCE() so the NULL check and
the indirect call observe the same value.
Fixes: 8012b86608552 ("dax: introduce holder for dax_device")
Suggested-by: Richard Cheng <icheng@nvidia.com>
Signed-off-by: John Groves <john@groves.net>
---
drivers/dax/super.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)
diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 25cf99dd9360b..6b5ee6589e39b 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -303,6 +303,7 @@ EXPORT_SYMBOL_GPL(dax_recovery_write);
int dax_holder_notify_failure(struct dax_device *dax_dev, u64 off,
u64 len, int mf_flags)
{
+ const struct dax_holder_operations *ops;
int rc, id;
id = dax_read_lock();
@@ -311,12 +312,18 @@ int dax_holder_notify_failure(struct dax_device *dax_dev, u64 off,
goto out;
}
- if (!dax_dev->holder_ops) {
+ /*
+ * Read holder_ops once: a concurrent fs_put_dax() or kill_dax() can
+ * clear it. Without the single fetch the compiler could reload
+ * between the NULL check and the call and dereference a NULL ops.
+ */
+ ops = READ_ONCE(dax_dev->holder_ops);
+ if (!ops) {
rc = -EOPNOTSUPP;
goto out;
}
- rc = dax_dev->holder_ops->notify_failure(dax_dev, off, len, mf_flags);
+ rc = ops->notify_failure(dax_dev, off, len, mf_flags);
out:
dax_read_unlock(id);
return rc;
--
2.53.0
^ permalink raw reply related [flat|nested] 18+ messages in thread
* [PATCH V5 8/9] dax: fix holder_ops race in fs_put_dax()
2026-06-11 17:31 ` [PATCH V5 0/9] Fixes to the previously-merged drivers/dax/fsdev series John Groves
` (6 preceding siblings ...)
2026-06-11 17:32 ` [PATCH V5 7/9] dax: read holder_ops once in dax_holder_notify_failure() John Groves
@ 2026-06-11 17:32 ` John Groves
2026-06-11 17:33 ` [PATCH V5 9/9] dax: fsdev.c minor formatting cleanup John Groves
8 siblings, 0 replies; 18+ messages in thread
From: John Groves @ 2026-06-11 17:32 UTC (permalink / raw)
To: John Groves, Dan Williams
Cc: John Groves, Vishal Verma, Dave Jiang, Matthew Wilcox, Jan Kara,
Alexander Viro, Christian Brauner, Miklos Szeredi,
Alison Schofield, Ira Weiny, Jonathan Cameron,
nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
John Groves
From: John Groves <John@Groves.net>
Clear holder_ops before holder_data so that a concurrent fs_dax_get()
cannot have its newly installed holder_ops overwritten. cmpxchg()
provides release ordering on weakly-ordered architectures, ensuring the
WRITE_ONCE(holder_ops, NULL) store is visible to any CPU that observes
the holder_data release.
Add a WARN_ON() that fires only when the cmpxchg observes a non-NULL
value that is not @holder, i.e. fs_put_dax() called by something that
is not the current holder. That is an API contract violation; the
WARN_ON() does not prevent the damage but makes the bug visible.
A NULL cmpxchg result is deliberately tolerated: kill_dax() clears
holder_data while a holder is still attached when a device is removed
out from under a mounted filesystem (after delivering MF_MEM_PRE_REMOVE).
The holder's subsequent fs_put_dax() - e.g. xfs_free_buftarg() after a
forced shutdown - then legitimately finds holder_data already NULL, so
warning on that case would turn supported device removal into a splat
(or a panic with panic_on_warn).
Also add a kerneldoc comment documenting that fs_put_dax() must only
be called by the current holder.
Fixes: eec38f5d86d27 ("dax: Add fs_dax_get() func to prepare dax for fs-dax usage")
Signed-off-by: John Groves <john@groves.net>
---
drivers/dax/super.c | 42 +++++++++++++++++++++++++++++++++++++++---
1 file changed, 39 insertions(+), 3 deletions(-)
diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 6b5ee6589e39b..af5c1e14f7e39 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -116,11 +116,47 @@ EXPORT_SYMBOL_GPL(fs_dax_get_by_bdev);
#if IS_ENABLED(CONFIG_FS_DAX)
+/**
+ * fs_put_dax() - release holder ownership of a dax_device
+ * @dax_dev: dax device to release (may be NULL)
+ * @holder: the holder pointer previously passed to fs_dax_get() or
+ * fs_dax_get_by_bdev(); must match exactly, as it is used
+ * in a cmpxchg to atomically release ownership
+ *
+ * Must only be called by the current holder. Clears holder_ops before
+ * holder_data to avoid a race where a concurrent fs_dax_get() could have
+ * its newly installed holder_ops overwritten.
+ */
void fs_put_dax(struct dax_device *dax_dev, void *holder)
{
- if (dax_dev && holder &&
- cmpxchg(&dax_dev->holder_data, holder, NULL) == holder)
- dax_dev->holder_ops = NULL;
+ if (dax_dev && holder) {
+ void *prev;
+
+ /*
+ * Clear holder_ops before releasing holder_data. A concurrent
+ * dax_holder_notify_failure() that sees NULL ops returns
+ * -EOPNOTSUPP cleanly. A concurrent fs_dax_get() that acquires
+ * holder_data after the cmpxchg below is guaranteed to observe
+ * holder_ops=NULL first (cmpxchg provides release ordering), so
+ * its subsequent store of new ops will not be overwritten.
+ */
+ WRITE_ONCE(dax_dev->holder_ops, NULL);
+ prev = cmpxchg(&dax_dev->holder_data, holder, NULL);
+
+ /*
+ * prev == holder: normal release.
+ * prev == NULL: already released by kill_dax() when the
+ * device was removed under a live holder;
+ * not a bug.
+ * prev != holder (non-NULL): fs_put_dax() called by something
+ * that is not the current holder; an API
+ * contract violation. A lock would be needed
+ * to guard against this, but we WARN_ON()
+ * instead since violating the contract is
+ * a bug.
+ */
+ WARN_ON(prev && prev != holder);
+ }
put_dax(dax_dev);
}
EXPORT_SYMBOL_GPL(fs_put_dax);
--
2.53.0
^ permalink raw reply related [flat|nested] 18+ messages in thread
* [PATCH V5 9/9] dax: fsdev.c minor formatting cleanup
2026-06-11 17:31 ` [PATCH V5 0/9] Fixes to the previously-merged drivers/dax/fsdev series John Groves
` (7 preceding siblings ...)
2026-06-11 17:32 ` [PATCH V5 8/9] dax: fix holder_ops race in fs_put_dax() John Groves
@ 2026-06-11 17:33 ` John Groves
8 siblings, 0 replies; 18+ messages in thread
From: John Groves @ 2026-06-11 17:33 UTC (permalink / raw)
To: John Groves, Dan Williams
Cc: John Groves, Vishal Verma, Dave Jiang, Matthew Wilcox, Jan Kara,
Alexander Viro, Christian Brauner, Miklos Szeredi,
Alison Schofield, Ira Weiny, Jonathan Cameron,
nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
John Groves
From: John Groves <John@Groves.net>
Address some comments from Jonathan that were missed in the merged
series. Fix line wrapping in fsdev_dax_recovery_write() and
fsdev_dax_zero_page_range() signatures.
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Signed-off-by: John Groves <john@groves.net>
---
drivers/dax/fsdev.c | 15 ++++++++-------
1 file changed, 8 insertions(+), 7 deletions(-)
diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
index 71d2bee1e2805..7df75728ada89 100644
--- a/drivers/dax/fsdev.c
+++ b/drivers/dax/fsdev.c
@@ -45,8 +45,8 @@ static void fsdev_write_dax(void *addr, struct page *page,
}
static long __fsdev_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
- long nr_pages, enum dax_access_mode mode, void **kaddr,
- unsigned long *pfn)
+ long nr_pages, enum dax_access_mode mode, void **kaddr,
+ unsigned long *pfn)
{
struct dev_dax *dev_dax = dax_get_private(dax_dev);
size_t size = nr_pages << PAGE_SHIFT;
@@ -80,7 +80,8 @@ static int fsdev_dax_zero_page_range(struct dax_device *dax_dev,
long rc;
WARN_ONCE(nr_pages > 1, "%s: nr_pages > 1\n", __func__);
- rc = __fsdev_dax_direct_access(dax_dev, pgoff, 1, DAX_ACCESS, &kaddr, NULL);
+ rc = __fsdev_dax_direct_access(dax_dev, pgoff, 1, DAX_ACCESS,
+ &kaddr, NULL);
if (rc < 0)
return rc;
fsdev_write_dax(kaddr, ZERO_PAGE(0), 0, PAGE_SIZE);
@@ -88,15 +89,15 @@ static int fsdev_dax_zero_page_range(struct dax_device *dax_dev,
}
static long fsdev_dax_direct_access(struct dax_device *dax_dev,
- pgoff_t pgoff, long nr_pages, enum dax_access_mode mode,
- void **kaddr, unsigned long *pfn)
+ pgoff_t pgoff, long nr_pages, enum dax_access_mode mode,
+ void **kaddr, unsigned long *pfn)
{
return __fsdev_dax_direct_access(dax_dev, pgoff, nr_pages, mode,
kaddr, pfn);
}
-static size_t fsdev_dax_recovery_write(struct dax_device *dax_dev, pgoff_t pgoff,
- void *addr, size_t bytes, struct iov_iter *i)
+static size_t fsdev_dax_recovery_write(struct dax_device *dax_dev,
+ pgoff_t pgoff, void *addr, size_t bytes, struct iov_iter *i)
{
return _copy_from_iter_flushcache(addr, bytes, i);
}
--
2.53.0
^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: [PATCH V5 6/9] dax/fsdev: fail probe on invalid pgmap offset
2026-06-11 17:32 ` [PATCH V5 6/9] dax/fsdev: fail probe on invalid pgmap offset John Groves
@ 2026-06-11 18:09 ` Gupta, Pankaj
2026-06-15 13:23 ` John Groves
0 siblings, 1 reply; 18+ messages in thread
From: Gupta, Pankaj @ 2026-06-11 18:09 UTC (permalink / raw)
To: John Groves, John Groves, Dan Williams
Cc: John Groves, Vishal Verma, Dave Jiang, Matthew Wilcox, Jan Kara,
Alexander Viro, Christian Brauner, Miklos Szeredi,
Alison Schofield, Ira Weiny, Jonathan Cameron,
nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org
> From: John Groves <John@Groves.net>
>
> Convert the WARN_ON to a fatal error when pgmap_phys > phys. This
> condition means the remapped region starts after the device's data
> region, which is an impossible state. Previously the probe continued
> with data_offset=0, leaving virt_addr silently misaligned. Now probe
> returns -EINVAL with a diagnostic message.
>
> Fixes: 759455848df0b ("dax: Save the kva from memremap")
>
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> Reviewed-by: Alison Schofield <alison.schofield@intel.com>
> Signed-off-by: John Groves <john@groves.net>
Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
> ---
> drivers/dax/fsdev.c | 8 ++++++--
> 1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
> index 07de6bfbbf673..71d2bee1e2805 100644
> --- a/drivers/dax/fsdev.c
> +++ b/drivers/dax/fsdev.c
> @@ -320,8 +320,12 @@ static int fsdev_dax_probe(struct dev_dax *dev_dax)
> u64 phys = dev_dax->ranges[0].range.start;
> u64 pgmap_phys = pgmap[0].range.start;
>
> - if (!WARN_ON(pgmap_phys > phys))
> - data_offset = phys - pgmap_phys;
> + if (pgmap_phys > phys) {
> + dev_err(dev, "pgmap start %#llx exceeds data start %#llx\n",
> + pgmap_phys, phys);
> + return -EINVAL;
> + }
> + data_offset = phys - pgmap_phys;
>
> pr_debug("%s: offset detected phys=%llx pgmap_phys=%llx offset=%llx\n",
> __func__, phys, pgmap_phys, data_offset);
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH V5 3/9] dax/fsdev: clear vmemmap_shift when binding static pgmap
2026-06-11 17:32 ` [PATCH V5 3/9] dax/fsdev: clear vmemmap_shift when binding static pgmap John Groves
@ 2026-06-12 2:56 ` Richard Cheng
2026-06-15 13:16 ` John Groves
0 siblings, 1 reply; 18+ messages in thread
From: Richard Cheng @ 2026-06-12 2:56 UTC (permalink / raw)
To: John Groves
Cc: John Groves, Dan Williams, John Groves, Vishal Verma, Dave Jiang,
Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
Miklos Szeredi, Alison Schofield, Ira Weiny, Jonathan Cameron,
nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org
On Thu, Jun 11, 2026 at 05:32:07PM +0800, John Groves wrote:
> From: John Groves <John@Groves.net>
>
> Clear pgmap->vmemmap_shift for static DAX devices. When rebinding a static
> device from device_dax (which may set vmemmap_shift based on alignment) to
> fsdev_dax, the stale vmemmap_shift persists on the shared pgmap. Explicitly
> zero it before devm_memremap_pages() so the vmemmap is built for order-0
> folios as fsdev requires.
>
> Fixes: d5406bd458b0a ("dax: add fsdev.c driver for fs-dax on character dax")
>
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> Reviewed-by: Alison Schofield <alison.schofield@intel.com>
> Signed-off-by: John Groves <john@groves.net>
> ---
> drivers/dax/fsdev.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
> index 2c5de3d80a618..52f46b3e245ea 100644
> --- a/drivers/dax/fsdev.c
> +++ b/drivers/dax/fsdev.c
> @@ -237,6 +237,7 @@ static int fsdev_dax_probe(struct dev_dax *dev_dax)
> }
>
> pgmap = dev_dax->pgmap;
> + pgmap->vmemmap_shift = 0;
Hello John,
I would suggest also clearing pgmap->ops and pgmap->owner on teardown.
fsdev also writes them but never clears them. memuunmap_pages() leaves the
descriptor intact and kill_dev_dax() only NULLs dev_dax->pgmap for !static case.
After fsdev unbind the stale ops survive.
If we do rmmod later a HW failure dispatch pgmap->ops->memory_failure.
--Richard
> } else {
> size_t pgmap_size;
>
> --
> 2.53.0
>
>
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH V5 7/9] dax: read holder_ops once in dax_holder_notify_failure()
2026-06-11 17:32 ` [PATCH V5 7/9] dax: read holder_ops once in dax_holder_notify_failure() John Groves
@ 2026-06-12 3:02 ` Richard Cheng
2026-06-15 13:22 ` John Groves
0 siblings, 1 reply; 18+ messages in thread
From: Richard Cheng @ 2026-06-12 3:02 UTC (permalink / raw)
To: John Groves
Cc: John Groves, Dan Williams, John Groves, Vishal Verma, Dave Jiang,
Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
Miklos Szeredi, Alison Schofield, Ira Weiny, Jonathan Cameron,
nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org
On Thu, Jun 11, 2026 at 05:32:45PM +0800, John Groves wrote:
> From: John Groves <John@Groves.net>
>
> dax_holder_notify_failure() reads dax_dev->holder_ops twice without
> READ_ONCE() -- once for the NULL check and once for the indirect
> notify_failure() call. A concurrent fs_put_dax() or kill_dax() can clear
> holder_ops between the two reads, so the check can observe a non-NULL
> pointer while the call dereferences NULL.
>
Hello John,
Thanks for this.
Reviewed-by: Richard Cheng <icheng@nvidia.com>
Small message nit, kill_dax() isn't a racing clearer.
Plus I think this only fix holder_ops double-fetch, the fs_put_dax()
race issue is separate and pre-existing.
Best regards,
Richard Cheng.
> Fetch holder_ops once into a local with READ_ONCE() so the NULL check and
> the indirect call observe the same value.
>
> Fixes: 8012b86608552 ("dax: introduce holder for dax_device")
> Suggested-by: Richard Cheng <icheng@nvidia.com>
> Signed-off-by: John Groves <john@groves.net>
> ---
> drivers/dax/super.c | 11 +++++++++--
> 1 file changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index 25cf99dd9360b..6b5ee6589e39b 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -303,6 +303,7 @@ EXPORT_SYMBOL_GPL(dax_recovery_write);
> int dax_holder_notify_failure(struct dax_device *dax_dev, u64 off,
> u64 len, int mf_flags)
> {
> + const struct dax_holder_operations *ops;
> int rc, id;
>
> id = dax_read_lock();
> @@ -311,12 +312,18 @@ int dax_holder_notify_failure(struct dax_device *dax_dev, u64 off,
> goto out;
> }
>
> - if (!dax_dev->holder_ops) {
> + /*
> + * Read holder_ops once: a concurrent fs_put_dax() or kill_dax() can
> + * clear it. Without the single fetch the compiler could reload
> + * between the NULL check and the call and dereference a NULL ops.
> + */
> + ops = READ_ONCE(dax_dev->holder_ops);
> + if (!ops) {
> rc = -EOPNOTSUPP;
> goto out;
> }
>
> - rc = dax_dev->holder_ops->notify_failure(dax_dev, off, len, mf_flags);
> + rc = ops->notify_failure(dax_dev, off, len, mf_flags);
> out:
> dax_read_unlock(id);
> return rc;
> --
> 2.53.0
>
>
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH V5 2/9] dax/fsdev: fix multi-range offset in memory_failure handler
2026-06-11 17:31 ` [PATCH V5 2/9] dax/fsdev: fix multi-range offset in memory_failure handler John Groves
@ 2026-06-12 3:08 ` Richard Cheng
2026-06-15 13:13 ` John Groves
0 siblings, 1 reply; 18+ messages in thread
From: Richard Cheng @ 2026-06-12 3:08 UTC (permalink / raw)
To: John Groves
Cc: John Groves, Dan Williams, John Groves, Vishal Verma, Dave Jiang,
Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
Miklos Szeredi, Alison Schofield, Ira Weiny, Jonathan Cameron,
nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org
On Thu, Jun 11, 2026 at 05:31:59PM +0800, John Groves wrote:
> From: John Groves <John@Groves.net>
>
> Fix memory_failure offset calculation for multi-range devices. The old code
> subtracted ranges[0].range.start from the faulting PFN's physical address,
> which produces an incorrect (inflated) logical offset when the PFN falls in
> ranges[1] or beyond due to physical gaps between ranges. Add
> fsdev_pfn_to_offset() to walk the range list and compute the correct
> device-linear byte offset.
>
> Walk the pagemap's own range array (pgmap->ranges[]) rather than
> dev_dax->ranges[]. The pgmap copy is the immutable snapshot populated at
> probe and is never mutated afterwards, whereas dev_dax->ranges[] can be
> krealloc()'d by a concurrent sysfs mapping_store() (under dax_region_rwsem,
> which this ->memory_failure callback does not hold). For dynamic devices the
> two arrays are identical, so the reported offset is unchanged for the
> multi-range case this targets.
>
> Fixes: d5406bd458b0a ("dax: add fsdev.c driver for fs-dax on character dax")
>
> Suggested-by: Richard Cheng <icheng@nvidia.com>
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> Reviewed-by: Alison Schofield <alison.schofield@intel.com>
> Signed-off-by: John Groves <john@groves.net>
> ---
> drivers/dax/fsdev.c | 17 ++++++++++++++++-
> 1 file changed, 16 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
> index 188b2526bee45..2c5de3d80a618 100644
> --- a/drivers/dax/fsdev.c
> +++ b/drivers/dax/fsdev.c
> @@ -135,11 +135,26 @@ static void fsdev_clear_ops(void *data)
> * The core mm code in free_zone_device_folio() handles the wake_up_var()
> * directly for this memory type.
> */
> +static u64 fsdev_pfn_to_offset(struct dev_pagemap *pgmap, unsigned long pfn)
> +{
> + phys_addr_t phys = PFN_PHYS(pfn);
> + u64 offset = 0;
> +
> + for (int i = 0; i < pgmap->nr_range; i++) {
> + struct range *range = &pgmap->ranges[i];
> +
> + if (phys >= range->start && phys <= range->end)
> + return offset + (phys - range->start);
> + offset += range_len(range);
> + }
> + return -1ULL;
> +}
> +
> static int fsdev_pagemap_memory_failure(struct dev_pagemap *pgmap,
> unsigned long pfn, unsigned long nr_pages, int mf_flags)
> {
> struct dev_dax *dev_dax = pgmap->owner;
> - u64 offset = PFN_PHYS(pfn) - dev_dax->ranges[0].range.start;
> + u64 offset = fsdev_pfn_to_offset(pgmap, pfn);
Hi John,
I think this regresses static devices. pgmap->ranges[0].start can sit
data_offset below it on a static device, so the new offset = old + data_offset,
and XFS poisons the wrong blocks.
The gap walk only helps dynamic devices where data_offset ==0 . Maybe walking pgmap->ranges and
substract the probe's data_offset.
--Richard
> u64 len = nr_pages << PAGE_SHIFT;
>
> return dax_holder_notify_failure(dev_dax->dax_dev, offset,
> --
> 2.53.0
>
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH V5 2/9] dax/fsdev: fix multi-range offset in memory_failure handler
2026-06-12 3:08 ` Richard Cheng
@ 2026-06-15 13:13 ` John Groves
0 siblings, 0 replies; 18+ messages in thread
From: John Groves @ 2026-06-15 13:13 UTC (permalink / raw)
To: Richard Cheng
Cc: John Groves, Dan Williams, John Groves, Vishal Verma, Dave Jiang,
Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
Miklos Szeredi, Alison Schofield, Ira Weiny, Jonathan Cameron,
nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org
On 26/06/12 11:08AM, Richard Cheng wrote:
> On Thu, Jun 11, 2026 at 05:31:59PM +0800, John Groves wrote:
> > From: John Groves <John@Groves.net>
> >
> > Fix memory_failure offset calculation for multi-range devices. The old code
> > subtracted ranges[0].range.start from the faulting PFN's physical address,
> > which produces an incorrect (inflated) logical offset when the PFN falls in
> > ranges[1] or beyond due to physical gaps between ranges. Add
> > fsdev_pfn_to_offset() to walk the range list and compute the correct
> > device-linear byte offset.
> >
> > Walk the pagemap's own range array (pgmap->ranges[]) rather than
> > dev_dax->ranges[]. The pgmap copy is the immutable snapshot populated at
> > probe and is never mutated afterwards, whereas dev_dax->ranges[] can be
> > krealloc()'d by a concurrent sysfs mapping_store() (under dax_region_rwsem,
> > which this ->memory_failure callback does not hold). For dynamic devices the
> > two arrays are identical, so the reported offset is unchanged for the
> > multi-range case this targets.
> >
> > Fixes: d5406bd458b0a ("dax: add fsdev.c driver for fs-dax on character dax")
> >
> > Suggested-by: Richard Cheng <icheng@nvidia.com>
> > Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> > Reviewed-by: Alison Schofield <alison.schofield@intel.com>
> > Signed-off-by: John Groves <john@groves.net>
> > ---
> > drivers/dax/fsdev.c | 17 ++++++++++++++++-
> > 1 file changed, 16 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
> > index 188b2526bee45..2c5de3d80a618 100644
> > --- a/drivers/dax/fsdev.c
> > +++ b/drivers/dax/fsdev.c
> > @@ -135,11 +135,26 @@ static void fsdev_clear_ops(void *data)
> > * The core mm code in free_zone_device_folio() handles the wake_up_var()
> > * directly for this memory type.
> > */
> > +static u64 fsdev_pfn_to_offset(struct dev_pagemap *pgmap, unsigned long pfn)
> > +{
> > + phys_addr_t phys = PFN_PHYS(pfn);
> > + u64 offset = 0;
> > +
> > + for (int i = 0; i < pgmap->nr_range; i++) {
> > + struct range *range = &pgmap->ranges[i];
> > +
> > + if (phys >= range->start && phys <= range->end)
> > + return offset + (phys - range->start);
> > + offset += range_len(range);
> > + }
> > + return -1ULL;
> > +}
> > +
> > static int fsdev_pagemap_memory_failure(struct dev_pagemap *pgmap,
> > unsigned long pfn, unsigned long nr_pages, int mf_flags)
> > {
> > struct dev_dax *dev_dax = pgmap->owner;
> > - u64 offset = PFN_PHYS(pfn) - dev_dax->ranges[0].range.start;
> > + u64 offset = fsdev_pfn_to_offset(pgmap, pfn);
>
> Hi John,
>
> I think this regresses static devices. pgmap->ranges[0].start can sit
> data_offset below it on a static device, so the new offset = old + data_offset,
> and XFS poisons the wrong blocks.
>
> The gap walk only helps dynamic devices where data_offset ==0 . Maybe walking pgmap->ranges and
> substract the probe's data_offset.
>
> --Richard
Ugh, right.
Subtracting the data_offset would require newly stashing it somewhere the
->memory_failure callback could reach.
So I'm reverting to walking dev_dax->ranges[] -- the maybe-race there is the
same one the pre-existing single-range code already had.
I'd like to land this series before going too much farther down the suspected
pre-existing issues rabbit hole :D
Note: the current version of this patch (switching to pgmap->ranges) might
have been a bit much for keeping Dave and Alison's RB tags - but I'm
reverting back to what they reviewed for V6.
Thanks,
John
<snip>
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH V5 3/9] dax/fsdev: clear vmemmap_shift when binding static pgmap
2026-06-12 2:56 ` Richard Cheng
@ 2026-06-15 13:16 ` John Groves
0 siblings, 0 replies; 18+ messages in thread
From: John Groves @ 2026-06-15 13:16 UTC (permalink / raw)
To: Richard Cheng
Cc: John Groves, Dan Williams, John Groves, Vishal Verma, Dave Jiang,
Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
Miklos Szeredi, Alison Schofield, Ira Weiny, Jonathan Cameron,
nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org
On 26/06/12 10:56AM, Richard Cheng wrote:
> On Thu, Jun 11, 2026 at 05:32:07PM +0800, John Groves wrote:
> > From: John Groves <John@Groves.net>
> >
> > Clear pgmap->vmemmap_shift for static DAX devices. When rebinding a static
> > device from device_dax (which may set vmemmap_shift based on alignment) to
> > fsdev_dax, the stale vmemmap_shift persists on the shared pgmap. Explicitly
> > zero it before devm_memremap_pages() so the vmemmap is built for order-0
> > folios as fsdev requires.
> >
> > Fixes: d5406bd458b0a ("dax: add fsdev.c driver for fs-dax on character dax")
> >
> > Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> > Reviewed-by: Alison Schofield <alison.schofield@intel.com>
> > Signed-off-by: John Groves <john@groves.net>
> > ---
> > drivers/dax/fsdev.c | 1 +
> > 1 file changed, 1 insertion(+)
> >
> > diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
> > index 2c5de3d80a618..52f46b3e245ea 100644
> > --- a/drivers/dax/fsdev.c
> > +++ b/drivers/dax/fsdev.c
> > @@ -237,6 +237,7 @@ static int fsdev_dax_probe(struct dev_dax *dev_dax)
> > }
> >
> > pgmap = dev_dax->pgmap;
> > + pgmap->vmemmap_shift = 0;
>
>
> Hello John,
>
> I would suggest also clearing pgmap->ops and pgmap->owner on teardown.
> fsdev also writes them but never clears them. memuunmap_pages() leaves the
> descriptor intact and kill_dev_dax() only NULLs dev_dax->pgmap for !static case.
> After fsdev unbind the stale ops survive.
> If we do rmmod later a HW failure dispatch pgmap->ops->memory_failure.
>
> --Richard
Good catch, thanks.
Adding a patch for V6 ("dax/fsdev: clear pgmap ops and owner on unbind")
-- a devm action that NULLs both on unbind, symmetric with setting them
at probe. It matters for the static case where the pgmap is shared and
long-lived; otherwise a later rebind or rmmod could dispatch
memory_failure through the stale handler.
John
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH V5 7/9] dax: read holder_ops once in dax_holder_notify_failure()
2026-06-12 3:02 ` Richard Cheng
@ 2026-06-15 13:22 ` John Groves
0 siblings, 0 replies; 18+ messages in thread
From: John Groves @ 2026-06-15 13:22 UTC (permalink / raw)
To: Richard Cheng
Cc: John Groves, Dan Williams, John Groves, Vishal Verma, Dave Jiang,
Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
Miklos Szeredi, Alison Schofield, Ira Weiny, Jonathan Cameron,
nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org
On 26/06/12 11:02AM, Richard Cheng wrote:
> On Thu, Jun 11, 2026 at 05:32:45PM +0800, John Groves wrote:
> > From: John Groves <John@Groves.net>
> >
> > dax_holder_notify_failure() reads dax_dev->holder_ops twice without
> > READ_ONCE() -- once for the NULL check and once for the indirect
> > notify_failure() call. A concurrent fs_put_dax() or kill_dax() can clear
> > holder_ops between the two reads, so the check can observe a non-NULL
> > pointer while the call dereferences NULL.
> >
>
> Hello John,
>
> Thanks for this.
>
> Reviewed-by: Richard Cheng <icheng@nvidia.com>
>
> Small message nit, kill_dax() isn't a racing clearer.
Right -- kill_dax() clears holder_ops only after synchronize_srcu(), so it
can't race a reader under dax_read_lock(). Reworded the commit message and
the code comment for V6 to name only fs_put_dax().
> Plus I think this only fix holder_ops double-fetch, the fs_put_dax()
> race issue is separate and pre-existing.
Yes, intentionally -- this patch is just the reader-side double-fetch. The
fs_put_dax() race is the separate, pre-existing one handled by the next
patch in the series ("dax: fix holder_ops race in fs_put_dax()").
>
> Best regards,
> Richard Cheng.
Thanks,
John
<snip>
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH V5 6/9] dax/fsdev: fail probe on invalid pgmap offset
2026-06-11 18:09 ` Gupta, Pankaj
@ 2026-06-15 13:23 ` John Groves
0 siblings, 0 replies; 18+ messages in thread
From: John Groves @ 2026-06-15 13:23 UTC (permalink / raw)
To: Gupta, Pankaj
Cc: John Groves, Dan Williams, John Groves, Vishal Verma, Dave Jiang,
Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
Miklos Szeredi, Alison Schofield, Ira Weiny, Jonathan Cameron,
nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org
On 26/06/11 08:09PM, Gupta, Pankaj wrote:
>
> > From: John Groves <John@Groves.net>
> >
> > Convert the WARN_ON to a fatal error when pgmap_phys > phys. This
> > condition means the remapped region starts after the device's data
> > region, which is an impossible state. Previously the probe continued
> > with data_offset=0, leaving virt_addr silently misaligned. Now probe
> > returns -EINVAL with a diagnostic message.
> >
> > Fixes: 759455848df0b ("dax: Save the kva from memremap")
> >
> > Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> > Reviewed-by: Alison Schofield <alison.schofield@intel.com>
> > Signed-off-by: John Groves <john@groves.net>
>
> Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
Thank you!
John
<snip>
^ permalink raw reply [flat|nested] 18+ messages in thread
end of thread, other threads:[~2026-06-15 13:30 UTC | newest]
Thread overview: 18+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <20260611173057.65868-1-john@jagalactic.com>
2026-06-11 17:31 ` [PATCH V5 0/9] Fixes to the previously-merged drivers/dax/fsdev series John Groves
2026-06-11 17:31 ` [PATCH V5 1/9] dax: fix misleading comment about share/index union in dax_folio_reset_order() John Groves
2026-06-11 17:31 ` [PATCH V5 2/9] dax/fsdev: fix multi-range offset in memory_failure handler John Groves
2026-06-12 3:08 ` Richard Cheng
2026-06-15 13:13 ` John Groves
2026-06-11 17:32 ` [PATCH V5 3/9] dax/fsdev: clear vmemmap_shift when binding static pgmap John Groves
2026-06-12 2:56 ` Richard Cheng
2026-06-15 13:16 ` John Groves
2026-06-11 17:32 ` [PATCH V5 4/9] dax/fsdev: don't leave a dangling dev_dax->pgmap on probe failure John Groves
2026-06-11 17:32 ` [PATCH V5 5/9] dax/fsdev: use __va(phys) for kaddr in direct_access John Groves
2026-06-11 17:32 ` [PATCH V5 6/9] dax/fsdev: fail probe on invalid pgmap offset John Groves
2026-06-11 18:09 ` Gupta, Pankaj
2026-06-15 13:23 ` John Groves
2026-06-11 17:32 ` [PATCH V5 7/9] dax: read holder_ops once in dax_holder_notify_failure() John Groves
2026-06-12 3:02 ` Richard Cheng
2026-06-15 13:22 ` John Groves
2026-06-11 17:32 ` [PATCH V5 8/9] dax: fix holder_ops race in fs_put_dax() John Groves
2026-06-11 17:33 ` [PATCH V5 9/9] dax: fsdev.c minor formatting cleanup John Groves
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox