The Linux Kernel Mailing List
 help / color / mirror / Atom feed
* [PATCH V5 0/9] Fixes to the previously-merged drivers/dax/fsdev series
       [not found] <20260611173057.65868-1-john@jagalactic.com>
@ 2026-06-11 17:31 ` John Groves
  2026-06-11 17:31   ` [PATCH V5 1/9] dax: fix misleading comment about share/index union in dax_folio_reset_order() John Groves
                     ` (8 more replies)
  0 siblings, 9 replies; 18+ messages in thread
From: John Groves @ 2026-06-11 17:31 UTC (permalink / raw)
  To: John Groves, Dan Williams
  Cc: John Groves, Vishal Verma, Dave Jiang, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Miklos Szeredi,
	Alison Schofield, Ira Weiny, Jonathan Cameron,
	nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	John Groves

From: John Groves <john@groves.net>

This series applies bug fixes (mostly found via sashiko) to the dax/fsdev
series. It has been soaking in the famfs CI pipeline and 1) won't affect
anything that doesn't use drivers/dax/fsdev.c, and 2) doesn't affect any
known workloads -- although the bugs would have manifested when multi-range
DCD dax devices are a thing (soon-ish).

Most of the series is confined to drivers/dax/fsdev.c. Two patches touch
shared DAX core in drivers/dax/super.c: patch 7 reads holder_ops once in
dax_holder_notify_failure() to close a double-fetch NULL dereference, and
patch 8 reorders fs_put_dax() and adds a WARN_ON(). fs_put_dax() is used by
ext2/ext4/erofs/xfs, but only holder-passing callers (like XFS in-tree) will
see a behavior change, and only a new warning if they misuse it.

Changes since V4:

- New patch 7 (dax: read holder_ops once in dax_holder_notify_failure()):
  split the reader-side READ_ONCE() fix out of the fs_put_dax() patch and
  placed it first, so the fs_put_dax() patch's "a concurrent
  dax_holder_notify_failure() that sees NULL ops returns -EOPNOTSUPP
  cleanly" reasoning actually holds when it lands. dax_holder_notify_failure()
  read holder_ops twice without READ_ONCE(); a concurrent clear could make
  the NULL check pass while the indirect call dereferenced NULL. Carries
  Fixes: 8012b86608552 ("dax: introduce holder for dax_device"), the commit
  that introduced the unmarked double fetch. Suggested by Richard Cheng (and
  the Sashiko bot).
- Patch 2 (multi-range memory_failure offset): the ->memory_failure callback
  now walks the pagemap's own immutable range array (pgmap->ranges[]) rather
  than dev_dax->ranges[], which a concurrent sysfs mapping_store() can
  krealloc() under dax_region_rwsem while this callback holds no such lock.
  For dynamic devices the two arrays are identical, so the reported offset is
  unchanged for the multi-range case this patch targets. Suggested by Richard
  Cheng (and the Sashiko bot).
- Dropped the dax_dev_get()/dax_dev_find() patch (V4 patch 8) from this
  revision. There is no in-tree caller yet; it will be sent together with the
  famfs filesystem series that introduces the caller. (Per Richard Cheng /
  Sashiko.)
- Patch 8 (holder_ops race in fs_put_dax()): unchanged from V4 (renumbered
  from 7 to 8).
- Collected Reviewed-by from Dave Jiang on patches 4 and 6.

Changes since V3:

- Patch 4: Adopted Dave's suggested refactor -- factor out
  fsdev_acquire_pgmap() and defer the dev_dax->pgmap assignment until
  probe can no longer fail, replacing the goto-based cleanup. Did not
  carry Alison's V3 Reviewed-by due to the rewrite.
- Patch 5: Also remove the now write-only dev_dax->virt_addr field,
  per Dave's review.
- Patch 7: Fixed the WARN_ON() to tolerate holder_data == NULL, which
  legitimately occurs when kill_dax() clears it during device removal
  under a live holder (per Dave's review). Wrong-holder calls still
  warn.
- Patch 8: Kept the Fixes tag -- the exported symbol itself is the
  hazard; stable kernels carrying the export should want this fix.

Changes since V2:

* Patch 1 (comment fix): No change. Responded to Dave's question about
  the dropped precondition -- the new comment correctly covers both
  callers; fsdev_clear_folio_state() does not guarantee share==0 before
  calling, so the old precondition was no longer universally true.
* V2 patch 2 (three fixes): Split into three separate patches (patches
  2-4) per Dave's review.
* V2 patch 3 (two fixes): Split into two separate patches (patches 5-6)
  per Dave's review.
* V2 patch 4 (clamp direct_access / remove cached_size): Dropped.
  Dave's analysis correctly showed the claimed bug does not exist --
  dax_pgoff_to_phys() already enforces that the full requested size fits
  within a single range before returning, making the clamp a no-op in
  every reachable path.
* V2 patch 5 (holder_ops race): Use WRITE_ONCE() for the holder_ops
  store; add WARN_ON() on the cmpxchg result to catch wrong-holder and
  double-put API contract violations; fix the inline comment, which
  incorrectly claimed dax_holder_notify_failure() consults holder_ops
  only when holder_data is non-NULL.
* V2 patch 6 (dax_dev_find): Add dax_alive() check under dax_read_lock()
  after ilookup5() to prevent returning a device that is concurrently
  being torn down by kill_dax().
* V2 patch 7 (formatting cleanup): Drop incorrect Fixes: tag; add
  Dave's Reviewed-by.
* The series grows from 7 to 9 patches.

Changes since v1:
* Dropped modes from patch 6 to fs/fuse/famfs.c and 
  fs/famfs/famfs_inode.c, which are not upstream so it broke
  attempts to apply the series. Oops...
* Added patch 7, which addresses a previously-missed review comment
  from Jonathan - minor cleanup



John Groves (9):
  dax: fix misleading comment about share/index union in
    dax_folio_reset_order()
  dax/fsdev: fix multi-range offset in memory_failure handler
  dax/fsdev: clear vmemmap_shift when binding static pgmap
  dax/fsdev: don't leave a dangling dev_dax->pgmap on probe failure
  dax/fsdev: use __va(phys) for kaddr in direct_access
  dax/fsdev: fail probe on invalid pgmap offset
  dax: read holder_ops once in dax_holder_notify_failure()
  dax: fix holder_ops race in fs_put_dax()
  dax: fsdev.c minor formatting cleanup

 drivers/dax/dax-private.h |   2 -
 drivers/dax/fsdev.c       | 126 +++++++++++++++++++++++++-------------
 drivers/dax/super.c       |  53 ++++++++++++++--
 fs/dax.c                  |  12 ++--
 4 files changed, 136 insertions(+), 57 deletions(-)


base-commit: 4549871118cf616eecdd2d939f78e3b9e1dddc48
-- 
2.53.0



^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH V5 1/9] dax: fix misleading comment about share/index union in dax_folio_reset_order()
  2026-06-11 17:31 ` [PATCH V5 0/9] Fixes to the previously-merged drivers/dax/fsdev series John Groves
@ 2026-06-11 17:31   ` John Groves
  2026-06-11 17:31   ` [PATCH V5 2/9] dax/fsdev: fix multi-range offset in memory_failure handler John Groves
                     ` (7 subsequent siblings)
  8 siblings, 0 replies; 18+ messages in thread
From: John Groves @ 2026-06-11 17:31 UTC (permalink / raw)
  To: John Groves, Dan Williams
  Cc: John Groves, Vishal Verma, Dave Jiang, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Miklos Szeredi,
	Alison Schofield, Ira Weiny, Jonathan Cameron,
	nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	John Groves

From: John Groves <John@Groves.net>

The comment in dax_folio_reset_order() claims that DAX maintains an
invariant where folio->share != 0 only when folio->mapping == NULL,
implying folio->share is zero whenever mapping is non-NULL. This is
misleading because folio->share and folio->index are a union -- for
non-shared folios with mapping != NULL, reading folio->share returns
the file page offset (folio->index), which is typically non-zero.

Reword the comment to accurately describe the union aliasing: the
assignment clears whichever interpretation of the union word is active
(index for non-shared folios, share for shared folios), which is correct
because the folio is being released in either case.

No functional change -- the code was already correct, only the
justification was wrong.

Fixes: 59eb73b98ae0b ("dax: Factor out dax_folio_reset_order() helper")

Reviewed-by: Jonathan Cameron <jic23@kernel.org>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Signed-off-by: John Groves <john@groves.net>
---
 fs/dax.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 6d175cd47a99b..df19c9317d10e 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -392,12 +392,12 @@ int dax_folio_reset_order(struct folio *folio)
 	int order = folio_order(folio);
 
 	/*
-	 * DAX maintains the invariant that folio->share != 0 only when
-	 * folio->mapping == NULL (enforced by dax_folio_make_shared()).
-	 * Equivalently: folio->mapping != NULL implies folio->share == 0.
-	 * Callers ensure share has been decremented to zero before
-	 * calling here, so unconditionally clearing both fields is
-	 * correct.
+	 * Clear the mapping and the index/share union word. folio->share
+	 * and folio->index occupy the same union in struct folio. For
+	 * non-shared folios (mapping != NULL), the union holds folio->index
+	 * (file page offset); for shared folios (mapping == NULL), it holds
+	 * folio->share (reference count). Either way, we are releasing the
+	 * folio and both fields should be zeroed.
 	 */
 	folio->mapping = NULL;
 	folio->share = 0;
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH V5 2/9] dax/fsdev: fix multi-range offset in memory_failure handler
  2026-06-11 17:31 ` [PATCH V5 0/9] Fixes to the previously-merged drivers/dax/fsdev series John Groves
  2026-06-11 17:31   ` [PATCH V5 1/9] dax: fix misleading comment about share/index union in dax_folio_reset_order() John Groves
@ 2026-06-11 17:31   ` John Groves
  2026-06-12  3:08     ` Richard Cheng
  2026-06-11 17:32   ` [PATCH V5 3/9] dax/fsdev: clear vmemmap_shift when binding static pgmap John Groves
                     ` (6 subsequent siblings)
  8 siblings, 1 reply; 18+ messages in thread
From: John Groves @ 2026-06-11 17:31 UTC (permalink / raw)
  To: John Groves, Dan Williams
  Cc: John Groves, Vishal Verma, Dave Jiang, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Miklos Szeredi,
	Alison Schofield, Ira Weiny, Jonathan Cameron,
	nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	Richard Cheng, John Groves

From: John Groves <John@Groves.net>

Fix memory_failure offset calculation for multi-range devices. The old code
subtracted ranges[0].range.start from the faulting PFN's physical address,
which produces an incorrect (inflated) logical offset when the PFN falls in
ranges[1] or beyond due to physical gaps between ranges. Add
fsdev_pfn_to_offset() to walk the range list and compute the correct
device-linear byte offset.

Walk the pagemap's own range array (pgmap->ranges[]) rather than
dev_dax->ranges[]. The pgmap copy is the immutable snapshot populated at
probe and is never mutated afterwards, whereas dev_dax->ranges[] can be
krealloc()'d by a concurrent sysfs mapping_store() (under dax_region_rwsem,
which this ->memory_failure callback does not hold). For dynamic devices the
two arrays are identical, so the reported offset is unchanged for the
multi-range case this targets.

Fixes: d5406bd458b0a ("dax: add fsdev.c driver for fs-dax on character dax")

Suggested-by: Richard Cheng <icheng@nvidia.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Signed-off-by: John Groves <john@groves.net>
---
 drivers/dax/fsdev.c | 17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
index 188b2526bee45..2c5de3d80a618 100644
--- a/drivers/dax/fsdev.c
+++ b/drivers/dax/fsdev.c
@@ -135,11 +135,26 @@ static void fsdev_clear_ops(void *data)
  * The core mm code in free_zone_device_folio() handles the wake_up_var()
  * directly for this memory type.
  */
+static u64 fsdev_pfn_to_offset(struct dev_pagemap *pgmap, unsigned long pfn)
+{
+	phys_addr_t phys = PFN_PHYS(pfn);
+	u64 offset = 0;
+
+	for (int i = 0; i < pgmap->nr_range; i++) {
+		struct range *range = &pgmap->ranges[i];
+
+		if (phys >= range->start && phys <= range->end)
+			return offset + (phys - range->start);
+		offset += range_len(range);
+	}
+	return -1ULL;
+}
+
 static int fsdev_pagemap_memory_failure(struct dev_pagemap *pgmap,
 		unsigned long pfn, unsigned long nr_pages, int mf_flags)
 {
 	struct dev_dax *dev_dax = pgmap->owner;
-	u64 offset = PFN_PHYS(pfn) - dev_dax->ranges[0].range.start;
+	u64 offset = fsdev_pfn_to_offset(pgmap, pfn);
 	u64 len = nr_pages << PAGE_SHIFT;
 
 	return dax_holder_notify_failure(dev_dax->dax_dev, offset,
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH V5 3/9] dax/fsdev: clear vmemmap_shift when binding static pgmap
  2026-06-11 17:31 ` [PATCH V5 0/9] Fixes to the previously-merged drivers/dax/fsdev series John Groves
  2026-06-11 17:31   ` [PATCH V5 1/9] dax: fix misleading comment about share/index union in dax_folio_reset_order() John Groves
  2026-06-11 17:31   ` [PATCH V5 2/9] dax/fsdev: fix multi-range offset in memory_failure handler John Groves
@ 2026-06-11 17:32   ` John Groves
  2026-06-12  2:56     ` Richard Cheng
  2026-06-11 17:32   ` [PATCH V5 4/9] dax/fsdev: don't leave a dangling dev_dax->pgmap on probe failure John Groves
                     ` (5 subsequent siblings)
  8 siblings, 1 reply; 18+ messages in thread
From: John Groves @ 2026-06-11 17:32 UTC (permalink / raw)
  To: John Groves, Dan Williams
  Cc: John Groves, Vishal Verma, Dave Jiang, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Miklos Szeredi,
	Alison Schofield, Ira Weiny, Jonathan Cameron,
	nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	John Groves

From: John Groves <John@Groves.net>

Clear pgmap->vmemmap_shift for static DAX devices. When rebinding a static
device from device_dax (which may set vmemmap_shift based on alignment) to
fsdev_dax, the stale vmemmap_shift persists on the shared pgmap. Explicitly
zero it before devm_memremap_pages() so the vmemmap is built for order-0
folios as fsdev requires.

Fixes: d5406bd458b0a ("dax: add fsdev.c driver for fs-dax on character dax")

Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Signed-off-by: John Groves <john@groves.net>
---
 drivers/dax/fsdev.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
index 2c5de3d80a618..52f46b3e245ea 100644
--- a/drivers/dax/fsdev.c
+++ b/drivers/dax/fsdev.c
@@ -237,6 +237,7 @@ static int fsdev_dax_probe(struct dev_dax *dev_dax)
 		}
 
 		pgmap = dev_dax->pgmap;
+		pgmap->vmemmap_shift = 0;
 	} else {
 		size_t pgmap_size;
 
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH V5 4/9] dax/fsdev: don't leave a dangling dev_dax->pgmap on probe failure
  2026-06-11 17:31 ` [PATCH V5 0/9] Fixes to the previously-merged drivers/dax/fsdev series John Groves
                     ` (2 preceding siblings ...)
  2026-06-11 17:32   ` [PATCH V5 3/9] dax/fsdev: clear vmemmap_shift when binding static pgmap John Groves
@ 2026-06-11 17:32   ` John Groves
  2026-06-11 17:32   ` [PATCH V5 5/9] dax/fsdev: use __va(phys) for kaddr in direct_access John Groves
                     ` (4 subsequent siblings)
  8 siblings, 0 replies; 18+ messages in thread
From: John Groves @ 2026-06-11 17:32 UTC (permalink / raw)
  To: John Groves, Dan Williams
  Cc: John Groves, Vishal Verma, Dave Jiang, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Miklos Szeredi,
	Alison Schofield, Ira Weiny, Jonathan Cameron,
	nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	John Groves

From: John Groves <John@Groves.net>

After the dynamic path set dev_dax->pgmap, any later probe failure left
dev_dax->pgmap dangling: devres frees the devm_kzalloc'd pgmap on probe
failure, and subsequent probe attempts would hit the "dynamic-dax with
pre-populated page map" check and fail permanently.

Factor pgmap acquisition out into fsdev_acquire_pgmap(), and defer the
dev_dax->pgmap assignment until probe can no longer fail. A failed probe
now never publishes the pointer at all, so there is nothing to unwind.
This also matches kill_dev_dax(), which already clears the dynamic pgmap
pointer on unbind: dev_dax->pgmap is now non-NULL only while the pgmap
is actually valid.

Refactor suggested by Dave Jiang.

Fixes: d5406bd458b0a ("dax: add fsdev.c driver for fs-dax on character dax")

Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: John Groves <john@groves.net>
---
 drivers/dax/fsdev.c | 77 ++++++++++++++++++++++++++++-----------------
 1 file changed, 49 insertions(+), 28 deletions(-)

diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
index 52f46b3e245ea..cc097167ad2c7 100644
--- a/drivers/dax/fsdev.c
+++ b/drivers/dax/fsdev.c
@@ -219,47 +219,62 @@ static const struct file_operations fsdev_fops = {
 	.release = fsdev_release,
 };
 
-static int fsdev_dax_probe(struct dev_dax *dev_dax)
+/*
+ * Acquire the dev_pagemap for probe: the static (pre-populated) one if
+ * present, or a devm-allocated one for the dynamic case. Note that
+ * dev_dax->pgmap is not set here; fsdev_dax_probe() sets it only once
+ * probe succeeds, so a failed probe never leaves a dangling pointer
+ * to a devres-freed pgmap.
+ */
+static struct dev_pagemap *fsdev_acquire_pgmap(struct dev_dax *dev_dax)
 {
-	struct dax_device *dax_dev = dev_dax->dax_dev;
 	struct device *dev = &dev_dax->dev;
 	struct dev_pagemap *pgmap;
-	struct inode *inode;
-	u64 data_offset = 0;
-	struct cdev *cdev;
-	void *addr;
-	int rc, i;
+	size_t pgmap_size;
 
 	if (static_dev_dax(dev_dax)) {
 		if (dev_dax->nr_range > 1) {
-			dev_warn(dev, "static pgmap / multi-range device conflict\n");
-			return -EINVAL;
+			dev_warn(dev,
+				 "static pgmap / multi-range device conflict\n");
+			return ERR_PTR(-EINVAL);
 		}
 
 		pgmap = dev_dax->pgmap;
 		pgmap->vmemmap_shift = 0;
-	} else {
-		size_t pgmap_size;
+		return pgmap;
+	}
 
-		if (dev_dax->pgmap) {
-			dev_warn(dev, "dynamic-dax with pre-populated page map\n");
-			return -EINVAL;
-		}
+	if (dev_dax->pgmap) {
+		dev_warn(dev, "dynamic-dax with pre-populated page map\n");
+		return ERR_PTR(-EINVAL);
+	}
 
-		pgmap_size = struct_size(pgmap, ranges, dev_dax->nr_range - 1);
-		pgmap = devm_kzalloc(dev, pgmap_size, GFP_KERNEL);
-		if (!pgmap)
-			return -ENOMEM;
+	pgmap_size = struct_size(pgmap, ranges, dev_dax->nr_range - 1);
+	pgmap = devm_kzalloc(dev, pgmap_size, GFP_KERNEL);
+	if (!pgmap)
+		return ERR_PTR(-ENOMEM);
 
-		pgmap->nr_range = dev_dax->nr_range;
-		dev_dax->pgmap = pgmap;
+	pgmap->nr_range = dev_dax->nr_range;
+	for (int i = 0; i < dev_dax->nr_range; i++)
+		pgmap->ranges[i] = dev_dax->ranges[i].range;
 
-		for (i = 0; i < dev_dax->nr_range; i++) {
-			struct range *range = &dev_dax->ranges[i].range;
+	return pgmap;
+}
 
-			pgmap->ranges[i] = *range;
-		}
-	}
+static int fsdev_dax_probe(struct dev_dax *dev_dax)
+{
+	struct dax_device *dax_dev = dev_dax->dax_dev;
+	struct device *dev = &dev_dax->dev;
+	struct dev_pagemap *pgmap;
+	struct inode *inode;
+	u64 data_offset = 0;
+	struct cdev *cdev;
+	void *addr;
+	int rc, i;
+
+	pgmap = fsdev_acquire_pgmap(dev_dax);
+	if (IS_ERR(pgmap))
+		return PTR_ERR(pgmap);
 
 	for (i = 0; i < dev_dax->nr_range; i++) {
 		struct range *range = &dev_dax->ranges[i].range;
@@ -306,7 +321,7 @@ static int fsdev_dax_probe(struct dev_dax *dev_dax)
 	/* Detect whether the data is at a non-zero offset into the memory */
 	if (pgmap->range.start != dev_dax->ranges[0].range.start) {
 		u64 phys = dev_dax->ranges[0].range.start;
-		u64 pgmap_phys = dev_dax->pgmap[0].range.start;
+		u64 pgmap_phys = pgmap[0].range.start;
 
 		if (!WARN_ON(pgmap_phys > phys))
 			data_offset = phys - pgmap_phys;
@@ -339,7 +354,13 @@ static int fsdev_dax_probe(struct dev_dax *dev_dax)
 		return rc;
 
 	run_dax(dax_dev);
-	return devm_add_action_or_reset(dev, fsdev_kill, dev_dax);
+	rc = devm_add_action_or_reset(dev, fsdev_kill, dev_dax);
+	if (rc)
+		return rc;
+
+	/* Probe can no longer fail; expose the pgmap via dev_dax */
+	dev_dax->pgmap = pgmap;
+	return 0;
 }
 
 static struct dax_device_driver fsdev_dax_driver = {
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH V5 5/9] dax/fsdev: use __va(phys) for kaddr in direct_access
  2026-06-11 17:31 ` [PATCH V5 0/9] Fixes to the previously-merged drivers/dax/fsdev series John Groves
                     ` (3 preceding siblings ...)
  2026-06-11 17:32   ` [PATCH V5 4/9] dax/fsdev: don't leave a dangling dev_dax->pgmap on probe failure John Groves
@ 2026-06-11 17:32   ` John Groves
  2026-06-11 17:32   ` [PATCH V5 6/9] dax/fsdev: fail probe on invalid pgmap offset John Groves
                     ` (3 subsequent siblings)
  8 siblings, 0 replies; 18+ messages in thread
From: John Groves @ 2026-06-11 17:32 UTC (permalink / raw)
  To: John Groves, Dan Williams
  Cc: John Groves, Vishal Verma, Dave Jiang, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Miklos Szeredi,
	Alison Schofield, Ira Weiny, Jonathan Cameron,
	nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	John Groves

From: John Groves <John@Groves.net>

Use __va(phys) instead of virt_addr + linear_offset for the kaddr
return in __fsdev_dax_direct_access(). The previous code added a
device-linear byte offset to virt_addr (which is __va of ranges[0]),
but for multi-range devices with physical gaps between ranges, this
linear arithmetic crosses the gap and produces a wrong kernel virtual
address. Using __va(phys) where phys comes from dax_pgoff_to_phys()
is correct for any range layout because the direct map translates
each physical address independently.

This leaves dev_dax->virt_addr write-only, so remove the field
(suggested by Dave Jiang).

Fixes: 759455848df0b ("dax: Save the kva from memremap")

Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Signed-off-by: John Groves <john@groves.net>
---
 drivers/dax/dax-private.h | 2 --
 drivers/dax/fsdev.c       | 8 ++------
 2 files changed, 2 insertions(+), 8 deletions(-)

diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
index 81e4af49e39c1..607a53a91f58b 100644
--- a/drivers/dax/dax-private.h
+++ b/drivers/dax/dax-private.h
@@ -69,7 +69,6 @@ struct dev_dax_range {
  * data while the device is activated in the driver.
  * @region: parent region
  * @dax_dev: core dax functionality
- * @virt_addr: kva from memremap; used by fsdev_dax
  * @cached_size: size of daxdev cached by fsdev_dax
  * @align: alignment of this instance
  * @target_node: effective numa node if dev_dax memory range is onlined
@@ -85,7 +84,6 @@ struct dev_dax_range {
 struct dev_dax {
 	struct dax_region *region;
 	struct dax_device *dax_dev;
-	void *virt_addr;
 	u64 cached_size;
 	unsigned int align;
 	int target_node;
diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
index cc097167ad2c7..07de6bfbbf673 100644
--- a/drivers/dax/fsdev.c
+++ b/drivers/dax/fsdev.c
@@ -51,9 +51,7 @@ static long __fsdev_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
 	struct dev_dax *dev_dax = dax_get_private(dax_dev);
 	size_t size = nr_pages << PAGE_SHIFT;
 	size_t offset = pgoff << PAGE_SHIFT;
-	void *virt_addr = dev_dax->virt_addr + offset;
 	phys_addr_t phys;
-	unsigned long local_pfn;
 
 	phys = dax_pgoff_to_phys(dev_dax, pgoff, size);
 	if (phys == -1) {
@@ -63,11 +61,10 @@ static long __fsdev_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
 	}
 
 	if (kaddr)
-		*kaddr = virt_addr;
+		*kaddr = __va(phys);
 
-	local_pfn = PHYS_PFN(phys);
 	if (pfn)
-		*pfn = local_pfn;
+		*pfn = PHYS_PFN(phys);
 
 	/*
 	 * Use cached_size which was computed at probe time. The size cannot
@@ -329,7 +326,6 @@ static int fsdev_dax_probe(struct dev_dax *dev_dax)
 		pr_debug("%s: offset detected phys=%llx pgmap_phys=%llx offset=%llx\n",
 		       __func__, phys, pgmap_phys, data_offset);
 	}
-	dev_dax->virt_addr = addr + data_offset;
 
 	inode = dax_inode(dax_dev);
 	cdev = inode->i_cdev;
-- 
2.53.0



^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH V5 6/9] dax/fsdev: fail probe on invalid pgmap offset
  2026-06-11 17:31 ` [PATCH V5 0/9] Fixes to the previously-merged drivers/dax/fsdev series John Groves
                     ` (4 preceding siblings ...)
  2026-06-11 17:32   ` [PATCH V5 5/9] dax/fsdev: use __va(phys) for kaddr in direct_access John Groves
@ 2026-06-11 17:32   ` John Groves
  2026-06-11 18:09     ` Gupta, Pankaj
  2026-06-11 17:32   ` [PATCH V5 7/9] dax: read holder_ops once in dax_holder_notify_failure() John Groves
                     ` (2 subsequent siblings)
  8 siblings, 1 reply; 18+ messages in thread
From: John Groves @ 2026-06-11 17:32 UTC (permalink / raw)
  To: John Groves, Dan Williams
  Cc: John Groves, Vishal Verma, Dave Jiang, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Miklos Szeredi,
	Alison Schofield, Ira Weiny, Jonathan Cameron,
	nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	John Groves

From: John Groves <John@Groves.net>

Convert the WARN_ON to a fatal error when pgmap_phys > phys. This
condition means the remapped region starts after the device's data
region, which is an impossible state. Previously the probe continued
with data_offset=0, leaving virt_addr silently misaligned. Now probe
returns -EINVAL with a diagnostic message.

Fixes: 759455848df0b ("dax: Save the kva from memremap")

Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Signed-off-by: John Groves <john@groves.net>
---
 drivers/dax/fsdev.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
index 07de6bfbbf673..71d2bee1e2805 100644
--- a/drivers/dax/fsdev.c
+++ b/drivers/dax/fsdev.c
@@ -320,8 +320,12 @@ static int fsdev_dax_probe(struct dev_dax *dev_dax)
 		u64 phys = dev_dax->ranges[0].range.start;
 		u64 pgmap_phys = pgmap[0].range.start;
 
-		if (!WARN_ON(pgmap_phys > phys))
-			data_offset = phys - pgmap_phys;
+		if (pgmap_phys > phys) {
+			dev_err(dev, "pgmap start %#llx exceeds data start %#llx\n",
+				pgmap_phys, phys);
+			return -EINVAL;
+		}
+		data_offset = phys - pgmap_phys;
 
 		pr_debug("%s: offset detected phys=%llx pgmap_phys=%llx offset=%llx\n",
 		       __func__, phys, pgmap_phys, data_offset);
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH V5 7/9] dax: read holder_ops once in dax_holder_notify_failure()
  2026-06-11 17:31 ` [PATCH V5 0/9] Fixes to the previously-merged drivers/dax/fsdev series John Groves
                     ` (5 preceding siblings ...)
  2026-06-11 17:32   ` [PATCH V5 6/9] dax/fsdev: fail probe on invalid pgmap offset John Groves
@ 2026-06-11 17:32   ` John Groves
  2026-06-12  3:02     ` Richard Cheng
  2026-06-11 17:32   ` [PATCH V5 8/9] dax: fix holder_ops race in fs_put_dax() John Groves
  2026-06-11 17:33   ` [PATCH V5 9/9] dax: fsdev.c minor formatting cleanup John Groves
  8 siblings, 1 reply; 18+ messages in thread
From: John Groves @ 2026-06-11 17:32 UTC (permalink / raw)
  To: John Groves, Dan Williams
  Cc: John Groves, Vishal Verma, Dave Jiang, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Miklos Szeredi,
	Alison Schofield, Ira Weiny, Jonathan Cameron,
	nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	Richard Cheng, John Groves

From: John Groves <John@Groves.net>

dax_holder_notify_failure() reads dax_dev->holder_ops twice without
READ_ONCE() -- once for the NULL check and once for the indirect
notify_failure() call. A concurrent fs_put_dax() or kill_dax() can clear
holder_ops between the two reads, so the check can observe a non-NULL
pointer while the call dereferences NULL.

Fetch holder_ops once into a local with READ_ONCE() so the NULL check and
the indirect call observe the same value.

Fixes: 8012b86608552 ("dax: introduce holder for dax_device")
Suggested-by: Richard Cheng <icheng@nvidia.com>
Signed-off-by: John Groves <john@groves.net>
---
 drivers/dax/super.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 25cf99dd9360b..6b5ee6589e39b 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -303,6 +303,7 @@ EXPORT_SYMBOL_GPL(dax_recovery_write);
 int dax_holder_notify_failure(struct dax_device *dax_dev, u64 off,
 			      u64 len, int mf_flags)
 {
+	const struct dax_holder_operations *ops;
 	int rc, id;
 
 	id = dax_read_lock();
@@ -311,12 +312,18 @@ int dax_holder_notify_failure(struct dax_device *dax_dev, u64 off,
 		goto out;
 	}
 
-	if (!dax_dev->holder_ops) {
+	/*
+	 * Read holder_ops once: a concurrent fs_put_dax() or kill_dax() can
+	 * clear it. Without the single fetch the compiler could reload
+	 * between the NULL check and the call and dereference a NULL ops.
+	 */
+	ops = READ_ONCE(dax_dev->holder_ops);
+	if (!ops) {
 		rc = -EOPNOTSUPP;
 		goto out;
 	}
 
-	rc = dax_dev->holder_ops->notify_failure(dax_dev, off, len, mf_flags);
+	rc = ops->notify_failure(dax_dev, off, len, mf_flags);
 out:
 	dax_read_unlock(id);
 	return rc;
-- 
2.53.0



^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH V5 8/9] dax: fix holder_ops race in fs_put_dax()
  2026-06-11 17:31 ` [PATCH V5 0/9] Fixes to the previously-merged drivers/dax/fsdev series John Groves
                     ` (6 preceding siblings ...)
  2026-06-11 17:32   ` [PATCH V5 7/9] dax: read holder_ops once in dax_holder_notify_failure() John Groves
@ 2026-06-11 17:32   ` John Groves
  2026-06-11 17:33   ` [PATCH V5 9/9] dax: fsdev.c minor formatting cleanup John Groves
  8 siblings, 0 replies; 18+ messages in thread
From: John Groves @ 2026-06-11 17:32 UTC (permalink / raw)
  To: John Groves, Dan Williams
  Cc: John Groves, Vishal Verma, Dave Jiang, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Miklos Szeredi,
	Alison Schofield, Ira Weiny, Jonathan Cameron,
	nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	John Groves

From: John Groves <John@Groves.net>

Clear holder_ops before holder_data so that a concurrent fs_dax_get()
cannot have its newly installed holder_ops overwritten. cmpxchg()
provides release ordering on weakly-ordered architectures, ensuring the
WRITE_ONCE(holder_ops, NULL) store is visible to any CPU that observes
the holder_data release.

Add a WARN_ON() that fires only when the cmpxchg observes a non-NULL
value that is not @holder, i.e. fs_put_dax() called by something that
is not the current holder. That is an API contract violation; the
WARN_ON() does not prevent the damage but makes the bug visible.

A NULL cmpxchg result is deliberately tolerated: kill_dax() clears
holder_data while a holder is still attached when a device is removed
out from under a mounted filesystem (after delivering MF_MEM_PRE_REMOVE).
The holder's subsequent fs_put_dax() - e.g. xfs_free_buftarg() after a
forced shutdown - then legitimately finds holder_data already NULL, so
warning on that case would turn supported device removal into a splat
(or a panic with panic_on_warn).

Also add a kerneldoc comment documenting that fs_put_dax() must only
be called by the current holder.

Fixes: eec38f5d86d27 ("dax: Add fs_dax_get() func to prepare dax for fs-dax usage")
Signed-off-by: John Groves <john@groves.net>
---
 drivers/dax/super.c | 42 +++++++++++++++++++++++++++++++++++++++---
 1 file changed, 39 insertions(+), 3 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 6b5ee6589e39b..af5c1e14f7e39 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -116,11 +116,47 @@ EXPORT_SYMBOL_GPL(fs_dax_get_by_bdev);
 
 #if IS_ENABLED(CONFIG_FS_DAX)
 
+/**
+ * fs_put_dax() - release holder ownership of a dax_device
+ * @dax_dev: dax device to release (may be NULL)
+ * @holder: the holder pointer previously passed to fs_dax_get() or
+ *          fs_dax_get_by_bdev(); must match exactly, as it is used
+ *          in a cmpxchg to atomically release ownership
+ *
+ * Must only be called by the current holder. Clears holder_ops before
+ * holder_data to avoid a race where a concurrent fs_dax_get() could have
+ * its newly installed holder_ops overwritten.
+ */
 void fs_put_dax(struct dax_device *dax_dev, void *holder)
 {
-	if (dax_dev && holder &&
-	    cmpxchg(&dax_dev->holder_data, holder, NULL) == holder)
-		dax_dev->holder_ops = NULL;
+	if (dax_dev && holder) {
+		void *prev;
+
+		/*
+		 * Clear holder_ops before releasing holder_data. A concurrent
+		 * dax_holder_notify_failure() that sees NULL ops returns
+		 * -EOPNOTSUPP cleanly. A concurrent fs_dax_get() that acquires
+		 * holder_data after the cmpxchg below is guaranteed to observe
+		 * holder_ops=NULL first (cmpxchg provides release ordering), so
+		 * its subsequent store of new ops will not be overwritten.
+		 */
+		WRITE_ONCE(dax_dev->holder_ops, NULL);
+		prev = cmpxchg(&dax_dev->holder_data, holder, NULL);
+
+		/*
+		 * prev == holder: normal release.
+		 * prev == NULL:   already released by kill_dax() when the
+		 *                 device was removed under a live holder;
+		 *                 not a bug.
+		 * prev != holder (non-NULL): fs_put_dax() called by something
+		 *                 that is not the current holder; an API
+		 *                 contract violation. A lock would be needed
+		 *                 to guard against this, but we WARN_ON()
+		 *                 instead since violating the contract is
+		 *                 a bug.
+		 */
+		WARN_ON(prev && prev != holder);
+	}
 	put_dax(dax_dev);
 }
 EXPORT_SYMBOL_GPL(fs_put_dax);
-- 
2.53.0



^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH V5 9/9] dax: fsdev.c minor formatting cleanup
  2026-06-11 17:31 ` [PATCH V5 0/9] Fixes to the previously-merged drivers/dax/fsdev series John Groves
                     ` (7 preceding siblings ...)
  2026-06-11 17:32   ` [PATCH V5 8/9] dax: fix holder_ops race in fs_put_dax() John Groves
@ 2026-06-11 17:33   ` John Groves
  8 siblings, 0 replies; 18+ messages in thread
From: John Groves @ 2026-06-11 17:33 UTC (permalink / raw)
  To: John Groves, Dan Williams
  Cc: John Groves, Vishal Verma, Dave Jiang, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Miklos Szeredi,
	Alison Schofield, Ira Weiny, Jonathan Cameron,
	nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	John Groves

From: John Groves <John@Groves.net>

Address some comments from Jonathan that were missed in the merged
series. Fix line wrapping in fsdev_dax_recovery_write() and
fsdev_dax_zero_page_range() signatures.

Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Signed-off-by: John Groves <john@groves.net>
---
 drivers/dax/fsdev.c | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
index 71d2bee1e2805..7df75728ada89 100644
--- a/drivers/dax/fsdev.c
+++ b/drivers/dax/fsdev.c
@@ -45,8 +45,8 @@ static void fsdev_write_dax(void *addr, struct page *page,
 }
 
 static long __fsdev_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
-			long nr_pages, enum dax_access_mode mode, void **kaddr,
-			unsigned long *pfn)
+		long nr_pages, enum dax_access_mode mode, void **kaddr,
+		unsigned long *pfn)
 {
 	struct dev_dax *dev_dax = dax_get_private(dax_dev);
 	size_t size = nr_pages << PAGE_SHIFT;
@@ -80,7 +80,8 @@ static int fsdev_dax_zero_page_range(struct dax_device *dax_dev,
 	long rc;
 
 	WARN_ONCE(nr_pages > 1, "%s: nr_pages > 1\n", __func__);
-	rc = __fsdev_dax_direct_access(dax_dev, pgoff, 1, DAX_ACCESS, &kaddr, NULL);
+	rc = __fsdev_dax_direct_access(dax_dev, pgoff, 1, DAX_ACCESS,
+				       &kaddr, NULL);
 	if (rc < 0)
 		return rc;
 	fsdev_write_dax(kaddr, ZERO_PAGE(0), 0, PAGE_SIZE);
@@ -88,15 +89,15 @@ static int fsdev_dax_zero_page_range(struct dax_device *dax_dev,
 }
 
 static long fsdev_dax_direct_access(struct dax_device *dax_dev,
-		  pgoff_t pgoff, long nr_pages, enum dax_access_mode mode,
-		  void **kaddr, unsigned long *pfn)
+		pgoff_t pgoff, long nr_pages, enum dax_access_mode mode,
+		void **kaddr, unsigned long *pfn)
 {
 	return __fsdev_dax_direct_access(dax_dev, pgoff, nr_pages, mode,
 					 kaddr, pfn);
 }
 
-static size_t fsdev_dax_recovery_write(struct dax_device *dax_dev, pgoff_t pgoff,
-		void *addr, size_t bytes, struct iov_iter *i)
+static size_t fsdev_dax_recovery_write(struct dax_device *dax_dev,
+		pgoff_t pgoff, void *addr, size_t bytes, struct iov_iter *i)
 {
 	return _copy_from_iter_flushcache(addr, bytes, i);
 }
-- 
2.53.0



^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH V5 6/9] dax/fsdev: fail probe on invalid pgmap offset
  2026-06-11 17:32   ` [PATCH V5 6/9] dax/fsdev: fail probe on invalid pgmap offset John Groves
@ 2026-06-11 18:09     ` Gupta, Pankaj
  2026-06-15 13:23       ` John Groves
  0 siblings, 1 reply; 18+ messages in thread
From: Gupta, Pankaj @ 2026-06-11 18:09 UTC (permalink / raw)
  To: John Groves, John Groves, Dan Williams
  Cc: John Groves, Vishal Verma, Dave Jiang, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Miklos Szeredi,
	Alison Schofield, Ira Weiny, Jonathan Cameron,
	nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org


> From: John Groves <John@Groves.net>
>
> Convert the WARN_ON to a fatal error when pgmap_phys > phys. This
> condition means the remapped region starts after the device's data
> region, which is an impossible state. Previously the probe continued
> with data_offset=0, leaving virt_addr silently misaligned. Now probe
> returns -EINVAL with a diagnostic message.
>
> Fixes: 759455848df0b ("dax: Save the kva from memremap")
>
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> Reviewed-by: Alison Schofield <alison.schofield@intel.com>
> Signed-off-by: John Groves <john@groves.net>

Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>

> ---
>   drivers/dax/fsdev.c | 8 ++++++--
>   1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
> index 07de6bfbbf673..71d2bee1e2805 100644
> --- a/drivers/dax/fsdev.c
> +++ b/drivers/dax/fsdev.c
> @@ -320,8 +320,12 @@ static int fsdev_dax_probe(struct dev_dax *dev_dax)
>   		u64 phys = dev_dax->ranges[0].range.start;
>   		u64 pgmap_phys = pgmap[0].range.start;
>   
> -		if (!WARN_ON(pgmap_phys > phys))
> -			data_offset = phys - pgmap_phys;
> +		if (pgmap_phys > phys) {
> +			dev_err(dev, "pgmap start %#llx exceeds data start %#llx\n",
> +				pgmap_phys, phys);
> +			return -EINVAL;
> +		}
> +		data_offset = phys - pgmap_phys;
>   
>   		pr_debug("%s: offset detected phys=%llx pgmap_phys=%llx offset=%llx\n",
>   		       __func__, phys, pgmap_phys, data_offset);

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH V5 3/9] dax/fsdev: clear vmemmap_shift when binding static pgmap
  2026-06-11 17:32   ` [PATCH V5 3/9] dax/fsdev: clear vmemmap_shift when binding static pgmap John Groves
@ 2026-06-12  2:56     ` Richard Cheng
  2026-06-15 13:16       ` John Groves
  0 siblings, 1 reply; 18+ messages in thread
From: Richard Cheng @ 2026-06-12  2:56 UTC (permalink / raw)
  To: John Groves
  Cc: John Groves, Dan Williams, John Groves, Vishal Verma, Dave Jiang,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Miklos Szeredi, Alison Schofield, Ira Weiny, Jonathan Cameron,
	nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org

On Thu, Jun 11, 2026 at 05:32:07PM +0800, John Groves wrote:
> From: John Groves <John@Groves.net>
> 
> Clear pgmap->vmemmap_shift for static DAX devices. When rebinding a static
> device from device_dax (which may set vmemmap_shift based on alignment) to
> fsdev_dax, the stale vmemmap_shift persists on the shared pgmap. Explicitly
> zero it before devm_memremap_pages() so the vmemmap is built for order-0
> folios as fsdev requires.
> 
> Fixes: d5406bd458b0a ("dax: add fsdev.c driver for fs-dax on character dax")
> 
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> Reviewed-by: Alison Schofield <alison.schofield@intel.com>
> Signed-off-by: John Groves <john@groves.net>
> ---
>  drivers/dax/fsdev.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
> index 2c5de3d80a618..52f46b3e245ea 100644
> --- a/drivers/dax/fsdev.c
> +++ b/drivers/dax/fsdev.c
> @@ -237,6 +237,7 @@ static int fsdev_dax_probe(struct dev_dax *dev_dax)
>  		}
>  
>  		pgmap = dev_dax->pgmap;
> +		pgmap->vmemmap_shift = 0;


Hello John,

I would suggest also clearing pgmap->ops and pgmap->owner on teardown.
fsdev also writes them but never clears them. memuunmap_pages() leaves the
descriptor intact and kill_dev_dax() only NULLs dev_dax->pgmap for !static case.
After fsdev unbind the stale ops survive.
If we do rmmod later a HW failure dispatch pgmap->ops->memory_failure.

--Richard

>  	} else {
>  		size_t pgmap_size;
>  
> -- 
> 2.53.0
> 
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH V5 7/9] dax: read holder_ops once in dax_holder_notify_failure()
  2026-06-11 17:32   ` [PATCH V5 7/9] dax: read holder_ops once in dax_holder_notify_failure() John Groves
@ 2026-06-12  3:02     ` Richard Cheng
  2026-06-15 13:22       ` John Groves
  0 siblings, 1 reply; 18+ messages in thread
From: Richard Cheng @ 2026-06-12  3:02 UTC (permalink / raw)
  To: John Groves
  Cc: John Groves, Dan Williams, John Groves, Vishal Verma, Dave Jiang,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Miklos Szeredi, Alison Schofield, Ira Weiny, Jonathan Cameron,
	nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org

On Thu, Jun 11, 2026 at 05:32:45PM +0800, John Groves wrote:
> From: John Groves <John@Groves.net>
> 
> dax_holder_notify_failure() reads dax_dev->holder_ops twice without
> READ_ONCE() -- once for the NULL check and once for the indirect
> notify_failure() call. A concurrent fs_put_dax() or kill_dax() can clear
> holder_ops between the two reads, so the check can observe a non-NULL
> pointer while the call dereferences NULL.
> 

Hello John,

Thanks for this.

Reviewed-by: Richard Cheng <icheng@nvidia.com>

Small message nit, kill_dax() isn't a racing clearer.
Plus I think this only fix holder_ops double-fetch, the fs_put_dax()
race issue is separate and pre-existing.

Best regards,
Richard Cheng.

> Fetch holder_ops once into a local with READ_ONCE() so the NULL check and
> the indirect call observe the same value.
> 
> Fixes: 8012b86608552 ("dax: introduce holder for dax_device")
> Suggested-by: Richard Cheng <icheng@nvidia.com>
> Signed-off-by: John Groves <john@groves.net>
> ---
>  drivers/dax/super.c | 11 +++++++++--
>  1 file changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index 25cf99dd9360b..6b5ee6589e39b 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -303,6 +303,7 @@ EXPORT_SYMBOL_GPL(dax_recovery_write);
>  int dax_holder_notify_failure(struct dax_device *dax_dev, u64 off,
>  			      u64 len, int mf_flags)
>  {
> +	const struct dax_holder_operations *ops;
>  	int rc, id;
>  
>  	id = dax_read_lock();
> @@ -311,12 +312,18 @@ int dax_holder_notify_failure(struct dax_device *dax_dev, u64 off,
>  		goto out;
>  	}
>  
> -	if (!dax_dev->holder_ops) {
> +	/*
> +	 * Read holder_ops once: a concurrent fs_put_dax() or kill_dax() can
> +	 * clear it. Without the single fetch the compiler could reload
> +	 * between the NULL check and the call and dereference a NULL ops.
> +	 */
> +	ops = READ_ONCE(dax_dev->holder_ops);
> +	if (!ops) {
>  		rc = -EOPNOTSUPP;
>  		goto out;
>  	}
>  
> -	rc = dax_dev->holder_ops->notify_failure(dax_dev, off, len, mf_flags);
> +	rc = ops->notify_failure(dax_dev, off, len, mf_flags);
>  out:
>  	dax_read_unlock(id);
>  	return rc;
> -- 
> 2.53.0
> 
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH V5 2/9] dax/fsdev: fix multi-range offset in memory_failure handler
  2026-06-11 17:31   ` [PATCH V5 2/9] dax/fsdev: fix multi-range offset in memory_failure handler John Groves
@ 2026-06-12  3:08     ` Richard Cheng
  2026-06-15 13:13       ` John Groves
  0 siblings, 1 reply; 18+ messages in thread
From: Richard Cheng @ 2026-06-12  3:08 UTC (permalink / raw)
  To: John Groves
  Cc: John Groves, Dan Williams, John Groves, Vishal Verma, Dave Jiang,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Miklos Szeredi, Alison Schofield, Ira Weiny, Jonathan Cameron,
	nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org

On Thu, Jun 11, 2026 at 05:31:59PM +0800, John Groves wrote:
> From: John Groves <John@Groves.net>
> 
> Fix memory_failure offset calculation for multi-range devices. The old code
> subtracted ranges[0].range.start from the faulting PFN's physical address,
> which produces an incorrect (inflated) logical offset when the PFN falls in
> ranges[1] or beyond due to physical gaps between ranges. Add
> fsdev_pfn_to_offset() to walk the range list and compute the correct
> device-linear byte offset.
> 
> Walk the pagemap's own range array (pgmap->ranges[]) rather than
> dev_dax->ranges[]. The pgmap copy is the immutable snapshot populated at
> probe and is never mutated afterwards, whereas dev_dax->ranges[] can be
> krealloc()'d by a concurrent sysfs mapping_store() (under dax_region_rwsem,
> which this ->memory_failure callback does not hold). For dynamic devices the
> two arrays are identical, so the reported offset is unchanged for the
> multi-range case this targets.
> 
> Fixes: d5406bd458b0a ("dax: add fsdev.c driver for fs-dax on character dax")
> 
> Suggested-by: Richard Cheng <icheng@nvidia.com>
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> Reviewed-by: Alison Schofield <alison.schofield@intel.com>
> Signed-off-by: John Groves <john@groves.net>
> ---
>  drivers/dax/fsdev.c | 17 ++++++++++++++++-
>  1 file changed, 16 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
> index 188b2526bee45..2c5de3d80a618 100644
> --- a/drivers/dax/fsdev.c
> +++ b/drivers/dax/fsdev.c
> @@ -135,11 +135,26 @@ static void fsdev_clear_ops(void *data)
>   * The core mm code in free_zone_device_folio() handles the wake_up_var()
>   * directly for this memory type.
>   */
> +static u64 fsdev_pfn_to_offset(struct dev_pagemap *pgmap, unsigned long pfn)
> +{
> +	phys_addr_t phys = PFN_PHYS(pfn);
> +	u64 offset = 0;
> +
> +	for (int i = 0; i < pgmap->nr_range; i++) {
> +		struct range *range = &pgmap->ranges[i];
> +
> +		if (phys >= range->start && phys <= range->end)
> +			return offset + (phys - range->start);
> +		offset += range_len(range);
> +	}
> +	return -1ULL;
> +}
> +
>  static int fsdev_pagemap_memory_failure(struct dev_pagemap *pgmap,
>  		unsigned long pfn, unsigned long nr_pages, int mf_flags)
>  {
>  	struct dev_dax *dev_dax = pgmap->owner;
> -	u64 offset = PFN_PHYS(pfn) - dev_dax->ranges[0].range.start;
> +	u64 offset = fsdev_pfn_to_offset(pgmap, pfn);

Hi John,

I think this regresses static devices. pgmap->ranges[0].start can sit
data_offset below it on a static device, so the new offset = old + data_offset,
and XFS poisons the wrong blocks.

The gap walk only helps dynamic devices where data_offset ==0 . Maybe walking pgmap->ranges and
substract the probe's data_offset.

--Richard

>  	u64 len = nr_pages << PAGE_SHIFT;
>  
>  	return dax_holder_notify_failure(dev_dax->dax_dev, offset,
> -- 
> 2.53.0
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH V5 2/9] dax/fsdev: fix multi-range offset in memory_failure handler
  2026-06-12  3:08     ` Richard Cheng
@ 2026-06-15 13:13       ` John Groves
  0 siblings, 0 replies; 18+ messages in thread
From: John Groves @ 2026-06-15 13:13 UTC (permalink / raw)
  To: Richard Cheng
  Cc: John Groves, Dan Williams, John Groves, Vishal Verma, Dave Jiang,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Miklos Szeredi, Alison Schofield, Ira Weiny, Jonathan Cameron,
	nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org

On 26/06/12 11:08AM, Richard Cheng wrote:
> On Thu, Jun 11, 2026 at 05:31:59PM +0800, John Groves wrote:
> > From: John Groves <John@Groves.net>
> > 
> > Fix memory_failure offset calculation for multi-range devices. The old code
> > subtracted ranges[0].range.start from the faulting PFN's physical address,
> > which produces an incorrect (inflated) logical offset when the PFN falls in
> > ranges[1] or beyond due to physical gaps between ranges. Add
> > fsdev_pfn_to_offset() to walk the range list and compute the correct
> > device-linear byte offset.
> > 
> > Walk the pagemap's own range array (pgmap->ranges[]) rather than
> > dev_dax->ranges[]. The pgmap copy is the immutable snapshot populated at
> > probe and is never mutated afterwards, whereas dev_dax->ranges[] can be
> > krealloc()'d by a concurrent sysfs mapping_store() (under dax_region_rwsem,
> > which this ->memory_failure callback does not hold). For dynamic devices the
> > two arrays are identical, so the reported offset is unchanged for the
> > multi-range case this targets.
> > 
> > Fixes: d5406bd458b0a ("dax: add fsdev.c driver for fs-dax on character dax")
> > 
> > Suggested-by: Richard Cheng <icheng@nvidia.com>
> > Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> > Reviewed-by: Alison Schofield <alison.schofield@intel.com>
> > Signed-off-by: John Groves <john@groves.net>
> > ---
> >  drivers/dax/fsdev.c | 17 ++++++++++++++++-
> >  1 file changed, 16 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
> > index 188b2526bee45..2c5de3d80a618 100644
> > --- a/drivers/dax/fsdev.c
> > +++ b/drivers/dax/fsdev.c
> > @@ -135,11 +135,26 @@ static void fsdev_clear_ops(void *data)
> >   * The core mm code in free_zone_device_folio() handles the wake_up_var()
> >   * directly for this memory type.
> >   */
> > +static u64 fsdev_pfn_to_offset(struct dev_pagemap *pgmap, unsigned long pfn)
> > +{
> > +	phys_addr_t phys = PFN_PHYS(pfn);
> > +	u64 offset = 0;
> > +
> > +	for (int i = 0; i < pgmap->nr_range; i++) {
> > +		struct range *range = &pgmap->ranges[i];
> > +
> > +		if (phys >= range->start && phys <= range->end)
> > +			return offset + (phys - range->start);
> > +		offset += range_len(range);
> > +	}
> > +	return -1ULL;
> > +}
> > +
> >  static int fsdev_pagemap_memory_failure(struct dev_pagemap *pgmap,
> >  		unsigned long pfn, unsigned long nr_pages, int mf_flags)
> >  {
> >  	struct dev_dax *dev_dax = pgmap->owner;
> > -	u64 offset = PFN_PHYS(pfn) - dev_dax->ranges[0].range.start;
> > +	u64 offset = fsdev_pfn_to_offset(pgmap, pfn);
> 
> Hi John,
> 
> I think this regresses static devices. pgmap->ranges[0].start can sit
> data_offset below it on a static device, so the new offset = old + data_offset,
> and XFS poisons the wrong blocks.
> 
> The gap walk only helps dynamic devices where data_offset ==0 . Maybe walking pgmap->ranges and
> substract the probe's data_offset.
> 
> --Richard

Ugh, right.

Subtracting the data_offset would require newly stashing it somewhere the
->memory_failure callback could reach.

So I'm reverting to walking dev_dax->ranges[] -- the maybe-race there is the
same one the pre-existing single-range code already had.

I'd like to land this series before going too much farther down the suspected
pre-existing issues rabbit hole :D

Note: the current version of this patch (switching to pgmap->ranges) might 
have been a bit much for keeping Dave and Alison's RB tags - but I'm 
reverting back to what they reviewed for V6.

Thanks,
John

<snip>


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH V5 3/9] dax/fsdev: clear vmemmap_shift when binding static pgmap
  2026-06-12  2:56     ` Richard Cheng
@ 2026-06-15 13:16       ` John Groves
  0 siblings, 0 replies; 18+ messages in thread
From: John Groves @ 2026-06-15 13:16 UTC (permalink / raw)
  To: Richard Cheng
  Cc: John Groves, Dan Williams, John Groves, Vishal Verma, Dave Jiang,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Miklos Szeredi, Alison Schofield, Ira Weiny, Jonathan Cameron,
	nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org

On 26/06/12 10:56AM, Richard Cheng wrote:
> On Thu, Jun 11, 2026 at 05:32:07PM +0800, John Groves wrote:
> > From: John Groves <John@Groves.net>
> > 
> > Clear pgmap->vmemmap_shift for static DAX devices. When rebinding a static
> > device from device_dax (which may set vmemmap_shift based on alignment) to
> > fsdev_dax, the stale vmemmap_shift persists on the shared pgmap. Explicitly
> > zero it before devm_memremap_pages() so the vmemmap is built for order-0
> > folios as fsdev requires.
> > 
> > Fixes: d5406bd458b0a ("dax: add fsdev.c driver for fs-dax on character dax")
> > 
> > Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> > Reviewed-by: Alison Schofield <alison.schofield@intel.com>
> > Signed-off-by: John Groves <john@groves.net>
> > ---
> >  drivers/dax/fsdev.c | 1 +
> >  1 file changed, 1 insertion(+)
> > 
> > diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
> > index 2c5de3d80a618..52f46b3e245ea 100644
> > --- a/drivers/dax/fsdev.c
> > +++ b/drivers/dax/fsdev.c
> > @@ -237,6 +237,7 @@ static int fsdev_dax_probe(struct dev_dax *dev_dax)
> >  		}
> >  
> >  		pgmap = dev_dax->pgmap;
> > +		pgmap->vmemmap_shift = 0;
> 
> 
> Hello John,
> 
> I would suggest also clearing pgmap->ops and pgmap->owner on teardown.
> fsdev also writes them but never clears them. memuunmap_pages() leaves the
> descriptor intact and kill_dev_dax() only NULLs dev_dax->pgmap for !static case.
> After fsdev unbind the stale ops survive.
> If we do rmmod later a HW failure dispatch pgmap->ops->memory_failure.
> 
> --Richard

Good catch, thanks.

Adding a patch for V6 ("dax/fsdev: clear pgmap ops and owner on unbind")
-- a devm action that NULLs both on unbind, symmetric with setting them
at probe. It matters for the static case where the pgmap is shared and
long-lived; otherwise a later rebind or rmmod could dispatch
memory_failure through the stale handler.

John


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH V5 7/9] dax: read holder_ops once in dax_holder_notify_failure()
  2026-06-12  3:02     ` Richard Cheng
@ 2026-06-15 13:22       ` John Groves
  0 siblings, 0 replies; 18+ messages in thread
From: John Groves @ 2026-06-15 13:22 UTC (permalink / raw)
  To: Richard Cheng
  Cc: John Groves, Dan Williams, John Groves, Vishal Verma, Dave Jiang,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Miklos Szeredi, Alison Schofield, Ira Weiny, Jonathan Cameron,
	nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org

On 26/06/12 11:02AM, Richard Cheng wrote:
> On Thu, Jun 11, 2026 at 05:32:45PM +0800, John Groves wrote:
> > From: John Groves <John@Groves.net>
> > 
> > dax_holder_notify_failure() reads dax_dev->holder_ops twice without
> > READ_ONCE() -- once for the NULL check and once for the indirect
> > notify_failure() call. A concurrent fs_put_dax() or kill_dax() can clear
> > holder_ops between the two reads, so the check can observe a non-NULL
> > pointer while the call dereferences NULL.
> > 
> 
> Hello John,
> 
> Thanks for this.
> 
> Reviewed-by: Richard Cheng <icheng@nvidia.com>
> 
> Small message nit, kill_dax() isn't a racing clearer.

Right -- kill_dax() clears holder_ops only after synchronize_srcu(), so it
can't race a reader under dax_read_lock(). Reworded the commit message and
the code comment for V6 to name only fs_put_dax().

> Plus I think this only fix holder_ops double-fetch, the fs_put_dax()
> race issue is separate and pre-existing.

Yes, intentionally -- this patch is just the reader-side double-fetch. The
fs_put_dax() race is the separate, pre-existing one handled by the next
patch in the series ("dax: fix holder_ops race in fs_put_dax()").

> 
> Best regards,
> Richard Cheng.

Thanks,
John

<snip>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH V5 6/9] dax/fsdev: fail probe on invalid pgmap offset
  2026-06-11 18:09     ` Gupta, Pankaj
@ 2026-06-15 13:23       ` John Groves
  0 siblings, 0 replies; 18+ messages in thread
From: John Groves @ 2026-06-15 13:23 UTC (permalink / raw)
  To: Gupta, Pankaj
  Cc: John Groves, Dan Williams, John Groves, Vishal Verma, Dave Jiang,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Miklos Szeredi, Alison Schofield, Ira Weiny, Jonathan Cameron,
	nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org

On 26/06/11 08:09PM, Gupta, Pankaj wrote:
> 
> > From: John Groves <John@Groves.net>
> > 
> > Convert the WARN_ON to a fatal error when pgmap_phys > phys. This
> > condition means the remapped region starts after the device's data
> > region, which is an impossible state. Previously the probe continued
> > with data_offset=0, leaving virt_addr silently misaligned. Now probe
> > returns -EINVAL with a diagnostic message.
> > 
> > Fixes: 759455848df0b ("dax: Save the kva from memremap")
> > 
> > Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> > Reviewed-by: Alison Schofield <alison.schofield@intel.com>
> > Signed-off-by: John Groves <john@groves.net>
> 
> Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>

Thank you!
John

<snip>


^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2026-06-15 13:30 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20260611173057.65868-1-john@jagalactic.com>
2026-06-11 17:31 ` [PATCH V5 0/9] Fixes to the previously-merged drivers/dax/fsdev series John Groves
2026-06-11 17:31   ` [PATCH V5 1/9] dax: fix misleading comment about share/index union in dax_folio_reset_order() John Groves
2026-06-11 17:31   ` [PATCH V5 2/9] dax/fsdev: fix multi-range offset in memory_failure handler John Groves
2026-06-12  3:08     ` Richard Cheng
2026-06-15 13:13       ` John Groves
2026-06-11 17:32   ` [PATCH V5 3/9] dax/fsdev: clear vmemmap_shift when binding static pgmap John Groves
2026-06-12  2:56     ` Richard Cheng
2026-06-15 13:16       ` John Groves
2026-06-11 17:32   ` [PATCH V5 4/9] dax/fsdev: don't leave a dangling dev_dax->pgmap on probe failure John Groves
2026-06-11 17:32   ` [PATCH V5 5/9] dax/fsdev: use __va(phys) for kaddr in direct_access John Groves
2026-06-11 17:32   ` [PATCH V5 6/9] dax/fsdev: fail probe on invalid pgmap offset John Groves
2026-06-11 18:09     ` Gupta, Pankaj
2026-06-15 13:23       ` John Groves
2026-06-11 17:32   ` [PATCH V5 7/9] dax: read holder_ops once in dax_holder_notify_failure() John Groves
2026-06-12  3:02     ` Richard Cheng
2026-06-15 13:22       ` John Groves
2026-06-11 17:32   ` [PATCH V5 8/9] dax: fix holder_ops race in fs_put_dax() John Groves
2026-06-11 17:33   ` [PATCH V5 9/9] dax: fsdev.c minor formatting cleanup John Groves

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox