All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH V5 0/9] Fixes to the previously-merged drivers/dax/fsdev series
       [not found] <20260611173057.65868-1-john@jagalactic.com>
@ 2026-06-11 17:31 ` John Groves
  2026-06-11 17:31   ` [PATCH V5 1/9] dax: fix misleading comment about share/index union in dax_folio_reset_order() John Groves
                     ` (8 more replies)
  0 siblings, 9 replies; 20+ messages in thread
From: John Groves @ 2026-06-11 17:31 UTC (permalink / raw)
  To: John Groves, Dan Williams
  Cc: John Groves, Vishal Verma, Dave Jiang, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Miklos Szeredi,
	Alison Schofield, Ira Weiny, Jonathan Cameron,
	nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	John Groves

From: John Groves <john@groves.net>

This series applies bug fixes (mostly found via sashiko) to the dax/fsdev
series. It has been soaking in the famfs CI pipeline and 1) won't affect
anything that doesn't use drivers/dax/fsdev.c, and 2) doesn't affect any
known workloads -- although the bugs would have manifested when multi-range
DCD dax devices are a thing (soon-ish).

Most of the series is confined to drivers/dax/fsdev.c. Two patches touch
shared DAX core in drivers/dax/super.c: patch 7 reads holder_ops once in
dax_holder_notify_failure() to close a double-fetch NULL dereference, and
patch 8 reorders fs_put_dax() and adds a WARN_ON(). fs_put_dax() is used by
ext2/ext4/erofs/xfs, but only holder-passing callers (like XFS in-tree) will
see a behavior change, and only a new warning if they misuse it.

Changes since V4:

- New patch 7 (dax: read holder_ops once in dax_holder_notify_failure()):
  split the reader-side READ_ONCE() fix out of the fs_put_dax() patch and
  placed it first, so the fs_put_dax() patch's "a concurrent
  dax_holder_notify_failure() that sees NULL ops returns -EOPNOTSUPP
  cleanly" reasoning actually holds when it lands. dax_holder_notify_failure()
  read holder_ops twice without READ_ONCE(); a concurrent clear could make
  the NULL check pass while the indirect call dereferenced NULL. Carries
  Fixes: 8012b86608552 ("dax: introduce holder for dax_device"), the commit
  that introduced the unmarked double fetch. Suggested by Richard Cheng (and
  the Sashiko bot).
- Patch 2 (multi-range memory_failure offset): the ->memory_failure callback
  now walks the pagemap's own immutable range array (pgmap->ranges[]) rather
  than dev_dax->ranges[], which a concurrent sysfs mapping_store() can
  krealloc() under dax_region_rwsem while this callback holds no such lock.
  For dynamic devices the two arrays are identical, so the reported offset is
  unchanged for the multi-range case this patch targets. Suggested by Richard
  Cheng (and the Sashiko bot).
- Dropped the dax_dev_get()/dax_dev_find() patch (V4 patch 8) from this
  revision. There is no in-tree caller yet; it will be sent together with the
  famfs filesystem series that introduces the caller. (Per Richard Cheng /
  Sashiko.)
- Patch 8 (holder_ops race in fs_put_dax()): unchanged from V4 (renumbered
  from 7 to 8).
- Collected Reviewed-by from Dave Jiang on patches 4 and 6.

Changes since V3:

- Patch 4: Adopted Dave's suggested refactor -- factor out
  fsdev_acquire_pgmap() and defer the dev_dax->pgmap assignment until
  probe can no longer fail, replacing the goto-based cleanup. Did not
  carry Alison's V3 Reviewed-by due to the rewrite.
- Patch 5: Also remove the now write-only dev_dax->virt_addr field,
  per Dave's review.
- Patch 7: Fixed the WARN_ON() to tolerate holder_data == NULL, which
  legitimately occurs when kill_dax() clears it during device removal
  under a live holder (per Dave's review). Wrong-holder calls still
  warn.
- Patch 8: Kept the Fixes tag -- the exported symbol itself is the
  hazard; stable kernels carrying the export should want this fix.

Changes since V2:

* Patch 1 (comment fix): No change. Responded to Dave's question about
  the dropped precondition -- the new comment correctly covers both
  callers; fsdev_clear_folio_state() does not guarantee share==0 before
  calling, so the old precondition was no longer universally true.
* V2 patch 2 (three fixes): Split into three separate patches (patches
  2-4) per Dave's review.
* V2 patch 3 (two fixes): Split into two separate patches (patches 5-6)
  per Dave's review.
* V2 patch 4 (clamp direct_access / remove cached_size): Dropped.
  Dave's analysis correctly showed the claimed bug does not exist --
  dax_pgoff_to_phys() already enforces that the full requested size fits
  within a single range before returning, making the clamp a no-op in
  every reachable path.
* V2 patch 5 (holder_ops race): Use WRITE_ONCE() for the holder_ops
  store; add WARN_ON() on the cmpxchg result to catch wrong-holder and
  double-put API contract violations; fix the inline comment, which
  incorrectly claimed dax_holder_notify_failure() consults holder_ops
  only when holder_data is non-NULL.
* V2 patch 6 (dax_dev_find): Add dax_alive() check under dax_read_lock()
  after ilookup5() to prevent returning a device that is concurrently
  being torn down by kill_dax().
* V2 patch 7 (formatting cleanup): Drop incorrect Fixes: tag; add
  Dave's Reviewed-by.
* The series grows from 7 to 9 patches.

Changes since v1:
* Dropped modes from patch 6 to fs/fuse/famfs.c and 
  fs/famfs/famfs_inode.c, which are not upstream so it broke
  attempts to apply the series. Oops...
* Added patch 7, which addresses a previously-missed review comment
  from Jonathan - minor cleanup



John Groves (9):
  dax: fix misleading comment about share/index union in
    dax_folio_reset_order()
  dax/fsdev: fix multi-range offset in memory_failure handler
  dax/fsdev: clear vmemmap_shift when binding static pgmap
  dax/fsdev: don't leave a dangling dev_dax->pgmap on probe failure
  dax/fsdev: use __va(phys) for kaddr in direct_access
  dax/fsdev: fail probe on invalid pgmap offset
  dax: read holder_ops once in dax_holder_notify_failure()
  dax: fix holder_ops race in fs_put_dax()
  dax: fsdev.c minor formatting cleanup

 drivers/dax/dax-private.h |   2 -
 drivers/dax/fsdev.c       | 126 +++++++++++++++++++++++++-------------
 drivers/dax/super.c       |  53 ++++++++++++++--
 fs/dax.c                  |  12 ++--
 4 files changed, 136 insertions(+), 57 deletions(-)


base-commit: 4549871118cf616eecdd2d939f78e3b9e1dddc48
-- 
2.53.0



^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH V5 1/9] dax: fix misleading comment about share/index union in dax_folio_reset_order()
  2026-06-11 17:31 ` [PATCH V5 0/9] Fixes to the previously-merged drivers/dax/fsdev series John Groves
@ 2026-06-11 17:31   ` John Groves
  2026-06-11 17:31   ` [PATCH V5 2/9] dax/fsdev: fix multi-range offset in memory_failure handler John Groves
                     ` (7 subsequent siblings)
  8 siblings, 0 replies; 20+ messages in thread
From: John Groves @ 2026-06-11 17:31 UTC (permalink / raw)
  To: John Groves, Dan Williams
  Cc: John Groves, Vishal Verma, Dave Jiang, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Miklos Szeredi,
	Alison Schofield, Ira Weiny, Jonathan Cameron,
	nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	John Groves

From: John Groves <John@Groves.net>

The comment in dax_folio_reset_order() claims that DAX maintains an
invariant where folio->share != 0 only when folio->mapping == NULL,
implying folio->share is zero whenever mapping is non-NULL. This is
misleading because folio->share and folio->index are a union -- for
non-shared folios with mapping != NULL, reading folio->share returns
the file page offset (folio->index), which is typically non-zero.

Reword the comment to accurately describe the union aliasing: the
assignment clears whichever interpretation of the union word is active
(index for non-shared folios, share for shared folios), which is correct
because the folio is being released in either case.

No functional change -- the code was already correct, only the
justification was wrong.

Fixes: 59eb73b98ae0b ("dax: Factor out dax_folio_reset_order() helper")

Reviewed-by: Jonathan Cameron <jic23@kernel.org>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Signed-off-by: John Groves <john@groves.net>
---
 fs/dax.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 6d175cd47a99b..df19c9317d10e 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -392,12 +392,12 @@ int dax_folio_reset_order(struct folio *folio)
 	int order = folio_order(folio);
 
 	/*
-	 * DAX maintains the invariant that folio->share != 0 only when
-	 * folio->mapping == NULL (enforced by dax_folio_make_shared()).
-	 * Equivalently: folio->mapping != NULL implies folio->share == 0.
-	 * Callers ensure share has been decremented to zero before
-	 * calling here, so unconditionally clearing both fields is
-	 * correct.
+	 * Clear the mapping and the index/share union word. folio->share
+	 * and folio->index occupy the same union in struct folio. For
+	 * non-shared folios (mapping != NULL), the union holds folio->index
+	 * (file page offset); for shared folios (mapping == NULL), it holds
+	 * folio->share (reference count). Either way, we are releasing the
+	 * folio and both fields should be zeroed.
 	 */
 	folio->mapping = NULL;
 	folio->share = 0;
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH V5 2/9] dax/fsdev: fix multi-range offset in memory_failure handler
  2026-06-11 17:31 ` [PATCH V5 0/9] Fixes to the previously-merged drivers/dax/fsdev series John Groves
  2026-06-11 17:31   ` [PATCH V5 1/9] dax: fix misleading comment about share/index union in dax_folio_reset_order() John Groves
@ 2026-06-11 17:31   ` John Groves
  2026-06-11 17:51     ` sashiko-bot
  2026-06-12  3:08     ` Richard Cheng
  2026-06-11 17:32   ` [PATCH V5 3/9] dax/fsdev: clear vmemmap_shift when binding static pgmap John Groves
                     ` (6 subsequent siblings)
  8 siblings, 2 replies; 20+ messages in thread
From: John Groves @ 2026-06-11 17:31 UTC (permalink / raw)
  To: John Groves, Dan Williams
  Cc: John Groves, Vishal Verma, Dave Jiang, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Miklos Szeredi,
	Alison Schofield, Ira Weiny, Jonathan Cameron,
	nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	Richard Cheng, John Groves

From: John Groves <John@Groves.net>

Fix memory_failure offset calculation for multi-range devices. The old code
subtracted ranges[0].range.start from the faulting PFN's physical address,
which produces an incorrect (inflated) logical offset when the PFN falls in
ranges[1] or beyond due to physical gaps between ranges. Add
fsdev_pfn_to_offset() to walk the range list and compute the correct
device-linear byte offset.

Walk the pagemap's own range array (pgmap->ranges[]) rather than
dev_dax->ranges[]. The pgmap copy is the immutable snapshot populated at
probe and is never mutated afterwards, whereas dev_dax->ranges[] can be
krealloc()'d by a concurrent sysfs mapping_store() (under dax_region_rwsem,
which this ->memory_failure callback does not hold). For dynamic devices the
two arrays are identical, so the reported offset is unchanged for the
multi-range case this targets.

Fixes: d5406bd458b0a ("dax: add fsdev.c driver for fs-dax on character dax")

Suggested-by: Richard Cheng <icheng@nvidia.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Signed-off-by: John Groves <john@groves.net>
---
 drivers/dax/fsdev.c | 17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
index 188b2526bee45..2c5de3d80a618 100644
--- a/drivers/dax/fsdev.c
+++ b/drivers/dax/fsdev.c
@@ -135,11 +135,26 @@ static void fsdev_clear_ops(void *data)
  * The core mm code in free_zone_device_folio() handles the wake_up_var()
  * directly for this memory type.
  */
+static u64 fsdev_pfn_to_offset(struct dev_pagemap *pgmap, unsigned long pfn)
+{
+	phys_addr_t phys = PFN_PHYS(pfn);
+	u64 offset = 0;
+
+	for (int i = 0; i < pgmap->nr_range; i++) {
+		struct range *range = &pgmap->ranges[i];
+
+		if (phys >= range->start && phys <= range->end)
+			return offset + (phys - range->start);
+		offset += range_len(range);
+	}
+	return -1ULL;
+}
+
 static int fsdev_pagemap_memory_failure(struct dev_pagemap *pgmap,
 		unsigned long pfn, unsigned long nr_pages, int mf_flags)
 {
 	struct dev_dax *dev_dax = pgmap->owner;
-	u64 offset = PFN_PHYS(pfn) - dev_dax->ranges[0].range.start;
+	u64 offset = fsdev_pfn_to_offset(pgmap, pfn);
 	u64 len = nr_pages << PAGE_SHIFT;
 
 	return dax_holder_notify_failure(dev_dax->dax_dev, offset,
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH V5 3/9] dax/fsdev: clear vmemmap_shift when binding static pgmap
  2026-06-11 17:31 ` [PATCH V5 0/9] Fixes to the previously-merged drivers/dax/fsdev series John Groves
  2026-06-11 17:31   ` [PATCH V5 1/9] dax: fix misleading comment about share/index union in dax_folio_reset_order() John Groves
  2026-06-11 17:31   ` [PATCH V5 2/9] dax/fsdev: fix multi-range offset in memory_failure handler John Groves
@ 2026-06-11 17:32   ` John Groves
  2026-06-11 17:55     ` sashiko-bot
  2026-06-12  2:56     ` Richard Cheng
  2026-06-11 17:32   ` [PATCH V5 4/9] dax/fsdev: don't leave a dangling dev_dax->pgmap on probe failure John Groves
                     ` (5 subsequent siblings)
  8 siblings, 2 replies; 20+ messages in thread
From: John Groves @ 2026-06-11 17:32 UTC (permalink / raw)
  To: John Groves, Dan Williams
  Cc: John Groves, Vishal Verma, Dave Jiang, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Miklos Szeredi,
	Alison Schofield, Ira Weiny, Jonathan Cameron,
	nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	John Groves

From: John Groves <John@Groves.net>

Clear pgmap->vmemmap_shift for static DAX devices. When rebinding a static
device from device_dax (which may set vmemmap_shift based on alignment) to
fsdev_dax, the stale vmemmap_shift persists on the shared pgmap. Explicitly
zero it before devm_memremap_pages() so the vmemmap is built for order-0
folios as fsdev requires.

Fixes: d5406bd458b0a ("dax: add fsdev.c driver for fs-dax on character dax")

Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Signed-off-by: John Groves <john@groves.net>
---
 drivers/dax/fsdev.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
index 2c5de3d80a618..52f46b3e245ea 100644
--- a/drivers/dax/fsdev.c
+++ b/drivers/dax/fsdev.c
@@ -237,6 +237,7 @@ static int fsdev_dax_probe(struct dev_dax *dev_dax)
 		}
 
 		pgmap = dev_dax->pgmap;
+		pgmap->vmemmap_shift = 0;
 	} else {
 		size_t pgmap_size;
 
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH V5 4/9] dax/fsdev: don't leave a dangling dev_dax->pgmap on probe failure
  2026-06-11 17:31 ` [PATCH V5 0/9] Fixes to the previously-merged drivers/dax/fsdev series John Groves
                     ` (2 preceding siblings ...)
  2026-06-11 17:32   ` [PATCH V5 3/9] dax/fsdev: clear vmemmap_shift when binding static pgmap John Groves
@ 2026-06-11 17:32   ` John Groves
  2026-06-11 18:04     ` sashiko-bot
  2026-06-11 17:32   ` [PATCH V5 5/9] dax/fsdev: use __va(phys) for kaddr in direct_access John Groves
                     ` (4 subsequent siblings)
  8 siblings, 1 reply; 20+ messages in thread
From: John Groves @ 2026-06-11 17:32 UTC (permalink / raw)
  To: John Groves, Dan Williams
  Cc: John Groves, Vishal Verma, Dave Jiang, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Miklos Szeredi,
	Alison Schofield, Ira Weiny, Jonathan Cameron,
	nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	John Groves

From: John Groves <John@Groves.net>

After the dynamic path set dev_dax->pgmap, any later probe failure left
dev_dax->pgmap dangling: devres frees the devm_kzalloc'd pgmap on probe
failure, and subsequent probe attempts would hit the "dynamic-dax with
pre-populated page map" check and fail permanently.

Factor pgmap acquisition out into fsdev_acquire_pgmap(), and defer the
dev_dax->pgmap assignment until probe can no longer fail. A failed probe
now never publishes the pointer at all, so there is nothing to unwind.
This also matches kill_dev_dax(), which already clears the dynamic pgmap
pointer on unbind: dev_dax->pgmap is now non-NULL only while the pgmap
is actually valid.

Refactor suggested by Dave Jiang.

Fixes: d5406bd458b0a ("dax: add fsdev.c driver for fs-dax on character dax")

Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: John Groves <john@groves.net>
---
 drivers/dax/fsdev.c | 77 ++++++++++++++++++++++++++++-----------------
 1 file changed, 49 insertions(+), 28 deletions(-)

diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
index 52f46b3e245ea..cc097167ad2c7 100644
--- a/drivers/dax/fsdev.c
+++ b/drivers/dax/fsdev.c
@@ -219,47 +219,62 @@ static const struct file_operations fsdev_fops = {
 	.release = fsdev_release,
 };
 
-static int fsdev_dax_probe(struct dev_dax *dev_dax)
+/*
+ * Acquire the dev_pagemap for probe: the static (pre-populated) one if
+ * present, or a devm-allocated one for the dynamic case. Note that
+ * dev_dax->pgmap is not set here; fsdev_dax_probe() sets it only once
+ * probe succeeds, so a failed probe never leaves a dangling pointer
+ * to a devres-freed pgmap.
+ */
+static struct dev_pagemap *fsdev_acquire_pgmap(struct dev_dax *dev_dax)
 {
-	struct dax_device *dax_dev = dev_dax->dax_dev;
 	struct device *dev = &dev_dax->dev;
 	struct dev_pagemap *pgmap;
-	struct inode *inode;
-	u64 data_offset = 0;
-	struct cdev *cdev;
-	void *addr;
-	int rc, i;
+	size_t pgmap_size;
 
 	if (static_dev_dax(dev_dax)) {
 		if (dev_dax->nr_range > 1) {
-			dev_warn(dev, "static pgmap / multi-range device conflict\n");
-			return -EINVAL;
+			dev_warn(dev,
+				 "static pgmap / multi-range device conflict\n");
+			return ERR_PTR(-EINVAL);
 		}
 
 		pgmap = dev_dax->pgmap;
 		pgmap->vmemmap_shift = 0;
-	} else {
-		size_t pgmap_size;
+		return pgmap;
+	}
 
-		if (dev_dax->pgmap) {
-			dev_warn(dev, "dynamic-dax with pre-populated page map\n");
-			return -EINVAL;
-		}
+	if (dev_dax->pgmap) {
+		dev_warn(dev, "dynamic-dax with pre-populated page map\n");
+		return ERR_PTR(-EINVAL);
+	}
 
-		pgmap_size = struct_size(pgmap, ranges, dev_dax->nr_range - 1);
-		pgmap = devm_kzalloc(dev, pgmap_size, GFP_KERNEL);
-		if (!pgmap)
-			return -ENOMEM;
+	pgmap_size = struct_size(pgmap, ranges, dev_dax->nr_range - 1);
+	pgmap = devm_kzalloc(dev, pgmap_size, GFP_KERNEL);
+	if (!pgmap)
+		return ERR_PTR(-ENOMEM);
 
-		pgmap->nr_range = dev_dax->nr_range;
-		dev_dax->pgmap = pgmap;
+	pgmap->nr_range = dev_dax->nr_range;
+	for (int i = 0; i < dev_dax->nr_range; i++)
+		pgmap->ranges[i] = dev_dax->ranges[i].range;
 
-		for (i = 0; i < dev_dax->nr_range; i++) {
-			struct range *range = &dev_dax->ranges[i].range;
+	return pgmap;
+}
 
-			pgmap->ranges[i] = *range;
-		}
-	}
+static int fsdev_dax_probe(struct dev_dax *dev_dax)
+{
+	struct dax_device *dax_dev = dev_dax->dax_dev;
+	struct device *dev = &dev_dax->dev;
+	struct dev_pagemap *pgmap;
+	struct inode *inode;
+	u64 data_offset = 0;
+	struct cdev *cdev;
+	void *addr;
+	int rc, i;
+
+	pgmap = fsdev_acquire_pgmap(dev_dax);
+	if (IS_ERR(pgmap))
+		return PTR_ERR(pgmap);
 
 	for (i = 0; i < dev_dax->nr_range; i++) {
 		struct range *range = &dev_dax->ranges[i].range;
@@ -306,7 +321,7 @@ static int fsdev_dax_probe(struct dev_dax *dev_dax)
 	/* Detect whether the data is at a non-zero offset into the memory */
 	if (pgmap->range.start != dev_dax->ranges[0].range.start) {
 		u64 phys = dev_dax->ranges[0].range.start;
-		u64 pgmap_phys = dev_dax->pgmap[0].range.start;
+		u64 pgmap_phys = pgmap[0].range.start;
 
 		if (!WARN_ON(pgmap_phys > phys))
 			data_offset = phys - pgmap_phys;
@@ -339,7 +354,13 @@ static int fsdev_dax_probe(struct dev_dax *dev_dax)
 		return rc;
 
 	run_dax(dax_dev);
-	return devm_add_action_or_reset(dev, fsdev_kill, dev_dax);
+	rc = devm_add_action_or_reset(dev, fsdev_kill, dev_dax);
+	if (rc)
+		return rc;
+
+	/* Probe can no longer fail; expose the pgmap via dev_dax */
+	dev_dax->pgmap = pgmap;
+	return 0;
 }
 
 static struct dax_device_driver fsdev_dax_driver = {
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH V5 5/9] dax/fsdev: use __va(phys) for kaddr in direct_access
  2026-06-11 17:31 ` [PATCH V5 0/9] Fixes to the previously-merged drivers/dax/fsdev series John Groves
                     ` (3 preceding siblings ...)
  2026-06-11 17:32   ` [PATCH V5 4/9] dax/fsdev: don't leave a dangling dev_dax->pgmap on probe failure John Groves
@ 2026-06-11 17:32   ` John Groves
  2026-06-11 17:32   ` [PATCH V5 6/9] dax/fsdev: fail probe on invalid pgmap offset John Groves
                     ` (3 subsequent siblings)
  8 siblings, 0 replies; 20+ messages in thread
From: John Groves @ 2026-06-11 17:32 UTC (permalink / raw)
  To: John Groves, Dan Williams
  Cc: John Groves, Vishal Verma, Dave Jiang, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Miklos Szeredi,
	Alison Schofield, Ira Weiny, Jonathan Cameron,
	nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	John Groves

From: John Groves <John@Groves.net>

Use __va(phys) instead of virt_addr + linear_offset for the kaddr
return in __fsdev_dax_direct_access(). The previous code added a
device-linear byte offset to virt_addr (which is __va of ranges[0]),
but for multi-range devices with physical gaps between ranges, this
linear arithmetic crosses the gap and produces a wrong kernel virtual
address. Using __va(phys) where phys comes from dax_pgoff_to_phys()
is correct for any range layout because the direct map translates
each physical address independently.

This leaves dev_dax->virt_addr write-only, so remove the field
(suggested by Dave Jiang).

Fixes: 759455848df0b ("dax: Save the kva from memremap")

Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Signed-off-by: John Groves <john@groves.net>
---
 drivers/dax/dax-private.h | 2 --
 drivers/dax/fsdev.c       | 8 ++------
 2 files changed, 2 insertions(+), 8 deletions(-)

diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
index 81e4af49e39c1..607a53a91f58b 100644
--- a/drivers/dax/dax-private.h
+++ b/drivers/dax/dax-private.h
@@ -69,7 +69,6 @@ struct dev_dax_range {
  * data while the device is activated in the driver.
  * @region: parent region
  * @dax_dev: core dax functionality
- * @virt_addr: kva from memremap; used by fsdev_dax
  * @cached_size: size of daxdev cached by fsdev_dax
  * @align: alignment of this instance
  * @target_node: effective numa node if dev_dax memory range is onlined
@@ -85,7 +84,6 @@ struct dev_dax_range {
 struct dev_dax {
 	struct dax_region *region;
 	struct dax_device *dax_dev;
-	void *virt_addr;
 	u64 cached_size;
 	unsigned int align;
 	int target_node;
diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
index cc097167ad2c7..07de6bfbbf673 100644
--- a/drivers/dax/fsdev.c
+++ b/drivers/dax/fsdev.c
@@ -51,9 +51,7 @@ static long __fsdev_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
 	struct dev_dax *dev_dax = dax_get_private(dax_dev);
 	size_t size = nr_pages << PAGE_SHIFT;
 	size_t offset = pgoff << PAGE_SHIFT;
-	void *virt_addr = dev_dax->virt_addr + offset;
 	phys_addr_t phys;
-	unsigned long local_pfn;
 
 	phys = dax_pgoff_to_phys(dev_dax, pgoff, size);
 	if (phys == -1) {
@@ -63,11 +61,10 @@ static long __fsdev_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
 	}
 
 	if (kaddr)
-		*kaddr = virt_addr;
+		*kaddr = __va(phys);
 
-	local_pfn = PHYS_PFN(phys);
 	if (pfn)
-		*pfn = local_pfn;
+		*pfn = PHYS_PFN(phys);
 
 	/*
 	 * Use cached_size which was computed at probe time. The size cannot
@@ -329,7 +326,6 @@ static int fsdev_dax_probe(struct dev_dax *dev_dax)
 		pr_debug("%s: offset detected phys=%llx pgmap_phys=%llx offset=%llx\n",
 		       __func__, phys, pgmap_phys, data_offset);
 	}
-	dev_dax->virt_addr = addr + data_offset;
 
 	inode = dax_inode(dax_dev);
 	cdev = inode->i_cdev;
-- 
2.53.0



^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH V5 6/9] dax/fsdev: fail probe on invalid pgmap offset
  2026-06-11 17:31 ` [PATCH V5 0/9] Fixes to the previously-merged drivers/dax/fsdev series John Groves
                     ` (4 preceding siblings ...)
  2026-06-11 17:32   ` [PATCH V5 5/9] dax/fsdev: use __va(phys) for kaddr in direct_access John Groves
@ 2026-06-11 17:32   ` John Groves
  2026-06-11 18:09     ` Gupta, Pankaj
  2026-06-11 18:13     ` sashiko-bot
  2026-06-11 17:32   ` [PATCH V5 7/9] dax: read holder_ops once in dax_holder_notify_failure() John Groves
                     ` (2 subsequent siblings)
  8 siblings, 2 replies; 20+ messages in thread
From: John Groves @ 2026-06-11 17:32 UTC (permalink / raw)
  To: John Groves, Dan Williams
  Cc: John Groves, Vishal Verma, Dave Jiang, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Miklos Szeredi,
	Alison Schofield, Ira Weiny, Jonathan Cameron,
	nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	John Groves

From: John Groves <John@Groves.net>

Convert the WARN_ON to a fatal error when pgmap_phys > phys. This
condition means the remapped region starts after the device's data
region, which is an impossible state. Previously the probe continued
with data_offset=0, leaving virt_addr silently misaligned. Now probe
returns -EINVAL with a diagnostic message.

Fixes: 759455848df0b ("dax: Save the kva from memremap")

Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Signed-off-by: John Groves <john@groves.net>
---
 drivers/dax/fsdev.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
index 07de6bfbbf673..71d2bee1e2805 100644
--- a/drivers/dax/fsdev.c
+++ b/drivers/dax/fsdev.c
@@ -320,8 +320,12 @@ static int fsdev_dax_probe(struct dev_dax *dev_dax)
 		u64 phys = dev_dax->ranges[0].range.start;
 		u64 pgmap_phys = pgmap[0].range.start;
 
-		if (!WARN_ON(pgmap_phys > phys))
-			data_offset = phys - pgmap_phys;
+		if (pgmap_phys > phys) {
+			dev_err(dev, "pgmap start %#llx exceeds data start %#llx\n",
+				pgmap_phys, phys);
+			return -EINVAL;
+		}
+		data_offset = phys - pgmap_phys;
 
 		pr_debug("%s: offset detected phys=%llx pgmap_phys=%llx offset=%llx\n",
 		       __func__, phys, pgmap_phys, data_offset);
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH V5 7/9] dax: read holder_ops once in dax_holder_notify_failure()
  2026-06-11 17:31 ` [PATCH V5 0/9] Fixes to the previously-merged drivers/dax/fsdev series John Groves
                     ` (5 preceding siblings ...)
  2026-06-11 17:32   ` [PATCH V5 6/9] dax/fsdev: fail probe on invalid pgmap offset John Groves
@ 2026-06-11 17:32   ` John Groves
  2026-06-11 18:13     ` sashiko-bot
  2026-06-12  3:02     ` Richard Cheng
  2026-06-11 17:32   ` [PATCH V5 8/9] dax: fix holder_ops race in fs_put_dax() John Groves
  2026-06-11 17:33   ` [PATCH V5 9/9] dax: fsdev.c minor formatting cleanup John Groves
  8 siblings, 2 replies; 20+ messages in thread
From: John Groves @ 2026-06-11 17:32 UTC (permalink / raw)
  To: John Groves, Dan Williams
  Cc: John Groves, Vishal Verma, Dave Jiang, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Miklos Szeredi,
	Alison Schofield, Ira Weiny, Jonathan Cameron,
	nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	Richard Cheng, John Groves

From: John Groves <John@Groves.net>

dax_holder_notify_failure() reads dax_dev->holder_ops twice without
READ_ONCE() -- once for the NULL check and once for the indirect
notify_failure() call. A concurrent fs_put_dax() or kill_dax() can clear
holder_ops between the two reads, so the check can observe a non-NULL
pointer while the call dereferences NULL.

Fetch holder_ops once into a local with READ_ONCE() so the NULL check and
the indirect call observe the same value.

Fixes: 8012b86608552 ("dax: introduce holder for dax_device")
Suggested-by: Richard Cheng <icheng@nvidia.com>
Signed-off-by: John Groves <john@groves.net>
---
 drivers/dax/super.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 25cf99dd9360b..6b5ee6589e39b 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -303,6 +303,7 @@ EXPORT_SYMBOL_GPL(dax_recovery_write);
 int dax_holder_notify_failure(struct dax_device *dax_dev, u64 off,
 			      u64 len, int mf_flags)
 {
+	const struct dax_holder_operations *ops;
 	int rc, id;
 
 	id = dax_read_lock();
@@ -311,12 +312,18 @@ int dax_holder_notify_failure(struct dax_device *dax_dev, u64 off,
 		goto out;
 	}
 
-	if (!dax_dev->holder_ops) {
+	/*
+	 * Read holder_ops once: a concurrent fs_put_dax() or kill_dax() can
+	 * clear it. Without the single fetch the compiler could reload
+	 * between the NULL check and the call and dereference a NULL ops.
+	 */
+	ops = READ_ONCE(dax_dev->holder_ops);
+	if (!ops) {
 		rc = -EOPNOTSUPP;
 		goto out;
 	}
 
-	rc = dax_dev->holder_ops->notify_failure(dax_dev, off, len, mf_flags);
+	rc = ops->notify_failure(dax_dev, off, len, mf_flags);
 out:
 	dax_read_unlock(id);
 	return rc;
-- 
2.53.0



^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH V5 8/9] dax: fix holder_ops race in fs_put_dax()
  2026-06-11 17:31 ` [PATCH V5 0/9] Fixes to the previously-merged drivers/dax/fsdev series John Groves
                     ` (6 preceding siblings ...)
  2026-06-11 17:32   ` [PATCH V5 7/9] dax: read holder_ops once in dax_holder_notify_failure() John Groves
@ 2026-06-11 17:32   ` John Groves
  2026-06-11 18:28     ` sashiko-bot
  2026-06-11 17:33   ` [PATCH V5 9/9] dax: fsdev.c minor formatting cleanup John Groves
  8 siblings, 1 reply; 20+ messages in thread
From: John Groves @ 2026-06-11 17:32 UTC (permalink / raw)
  To: John Groves, Dan Williams
  Cc: John Groves, Vishal Verma, Dave Jiang, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Miklos Szeredi,
	Alison Schofield, Ira Weiny, Jonathan Cameron,
	nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	John Groves

From: John Groves <John@Groves.net>

Clear holder_ops before holder_data so that a concurrent fs_dax_get()
cannot have its newly installed holder_ops overwritten. cmpxchg()
provides release ordering on weakly-ordered architectures, ensuring the
WRITE_ONCE(holder_ops, NULL) store is visible to any CPU that observes
the holder_data release.

Add a WARN_ON() that fires only when the cmpxchg observes a non-NULL
value that is not @holder, i.e. fs_put_dax() called by something that
is not the current holder. That is an API contract violation; the
WARN_ON() does not prevent the damage but makes the bug visible.

A NULL cmpxchg result is deliberately tolerated: kill_dax() clears
holder_data while a holder is still attached when a device is removed
out from under a mounted filesystem (after delivering MF_MEM_PRE_REMOVE).
The holder's subsequent fs_put_dax() - e.g. xfs_free_buftarg() after a
forced shutdown - then legitimately finds holder_data already NULL, so
warning on that case would turn supported device removal into a splat
(or a panic with panic_on_warn).

Also add a kerneldoc comment documenting that fs_put_dax() must only
be called by the current holder.

Fixes: eec38f5d86d27 ("dax: Add fs_dax_get() func to prepare dax for fs-dax usage")
Signed-off-by: John Groves <john@groves.net>
---
 drivers/dax/super.c | 42 +++++++++++++++++++++++++++++++++++++++---
 1 file changed, 39 insertions(+), 3 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 6b5ee6589e39b..af5c1e14f7e39 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -116,11 +116,47 @@ EXPORT_SYMBOL_GPL(fs_dax_get_by_bdev);
 
 #if IS_ENABLED(CONFIG_FS_DAX)
 
+/**
+ * fs_put_dax() - release holder ownership of a dax_device
+ * @dax_dev: dax device to release (may be NULL)
+ * @holder: the holder pointer previously passed to fs_dax_get() or
+ *          fs_dax_get_by_bdev(); must match exactly, as it is used
+ *          in a cmpxchg to atomically release ownership
+ *
+ * Must only be called by the current holder. Clears holder_ops before
+ * holder_data to avoid a race where a concurrent fs_dax_get() could have
+ * its newly installed holder_ops overwritten.
+ */
 void fs_put_dax(struct dax_device *dax_dev, void *holder)
 {
-	if (dax_dev && holder &&
-	    cmpxchg(&dax_dev->holder_data, holder, NULL) == holder)
-		dax_dev->holder_ops = NULL;
+	if (dax_dev && holder) {
+		void *prev;
+
+		/*
+		 * Clear holder_ops before releasing holder_data. A concurrent
+		 * dax_holder_notify_failure() that sees NULL ops returns
+		 * -EOPNOTSUPP cleanly. A concurrent fs_dax_get() that acquires
+		 * holder_data after the cmpxchg below is guaranteed to observe
+		 * holder_ops=NULL first (cmpxchg provides release ordering), so
+		 * its subsequent store of new ops will not be overwritten.
+		 */
+		WRITE_ONCE(dax_dev->holder_ops, NULL);
+		prev = cmpxchg(&dax_dev->holder_data, holder, NULL);
+
+		/*
+		 * prev == holder: normal release.
+		 * prev == NULL:   already released by kill_dax() when the
+		 *                 device was removed under a live holder;
+		 *                 not a bug.
+		 * prev != holder (non-NULL): fs_put_dax() called by something
+		 *                 that is not the current holder; an API
+		 *                 contract violation. A lock would be needed
+		 *                 to guard against this, but we WARN_ON()
+		 *                 instead since violating the contract is
+		 *                 a bug.
+		 */
+		WARN_ON(prev && prev != holder);
+	}
 	put_dax(dax_dev);
 }
 EXPORT_SYMBOL_GPL(fs_put_dax);
-- 
2.53.0



^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH V5 9/9] dax: fsdev.c minor formatting cleanup
  2026-06-11 17:31 ` [PATCH V5 0/9] Fixes to the previously-merged drivers/dax/fsdev series John Groves
                     ` (7 preceding siblings ...)
  2026-06-11 17:32   ` [PATCH V5 8/9] dax: fix holder_ops race in fs_put_dax() John Groves
@ 2026-06-11 17:33   ` John Groves
  8 siblings, 0 replies; 20+ messages in thread
From: John Groves @ 2026-06-11 17:33 UTC (permalink / raw)
  To: John Groves, Dan Williams
  Cc: John Groves, Vishal Verma, Dave Jiang, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Miklos Szeredi,
	Alison Schofield, Ira Weiny, Jonathan Cameron,
	nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	John Groves

From: John Groves <John@Groves.net>

Address some comments from Jonathan that were missed in the merged
series. Fix line wrapping in fsdev_dax_recovery_write() and
fsdev_dax_zero_page_range() signatures.

Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Signed-off-by: John Groves <john@groves.net>
---
 drivers/dax/fsdev.c | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
index 71d2bee1e2805..7df75728ada89 100644
--- a/drivers/dax/fsdev.c
+++ b/drivers/dax/fsdev.c
@@ -45,8 +45,8 @@ static void fsdev_write_dax(void *addr, struct page *page,
 }
 
 static long __fsdev_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
-			long nr_pages, enum dax_access_mode mode, void **kaddr,
-			unsigned long *pfn)
+		long nr_pages, enum dax_access_mode mode, void **kaddr,
+		unsigned long *pfn)
 {
 	struct dev_dax *dev_dax = dax_get_private(dax_dev);
 	size_t size = nr_pages << PAGE_SHIFT;
@@ -80,7 +80,8 @@ static int fsdev_dax_zero_page_range(struct dax_device *dax_dev,
 	long rc;
 
 	WARN_ONCE(nr_pages > 1, "%s: nr_pages > 1\n", __func__);
-	rc = __fsdev_dax_direct_access(dax_dev, pgoff, 1, DAX_ACCESS, &kaddr, NULL);
+	rc = __fsdev_dax_direct_access(dax_dev, pgoff, 1, DAX_ACCESS,
+				       &kaddr, NULL);
 	if (rc < 0)
 		return rc;
 	fsdev_write_dax(kaddr, ZERO_PAGE(0), 0, PAGE_SIZE);
@@ -88,15 +89,15 @@ static int fsdev_dax_zero_page_range(struct dax_device *dax_dev,
 }
 
 static long fsdev_dax_direct_access(struct dax_device *dax_dev,
-		  pgoff_t pgoff, long nr_pages, enum dax_access_mode mode,
-		  void **kaddr, unsigned long *pfn)
+		pgoff_t pgoff, long nr_pages, enum dax_access_mode mode,
+		void **kaddr, unsigned long *pfn)
 {
 	return __fsdev_dax_direct_access(dax_dev, pgoff, nr_pages, mode,
 					 kaddr, pfn);
 }
 
-static size_t fsdev_dax_recovery_write(struct dax_device *dax_dev, pgoff_t pgoff,
-		void *addr, size_t bytes, struct iov_iter *i)
+static size_t fsdev_dax_recovery_write(struct dax_device *dax_dev,
+		pgoff_t pgoff, void *addr, size_t bytes, struct iov_iter *i)
 {
 	return _copy_from_iter_flushcache(addr, bytes, i);
 }
-- 
2.53.0



^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH V5 2/9] dax/fsdev: fix multi-range offset in memory_failure handler
  2026-06-11 17:31   ` [PATCH V5 2/9] dax/fsdev: fix multi-range offset in memory_failure handler John Groves
@ 2026-06-11 17:51     ` sashiko-bot
  2026-06-12  3:08     ` Richard Cheng
  1 sibling, 0 replies; 20+ messages in thread
From: sashiko-bot @ 2026-06-11 17:51 UTC (permalink / raw)
  To: John Groves; +Cc: linux-cxl

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:

New issues:
- [Critical] Incorrect memory failure offset calculation for static DAX devices, causing the wrong filesystem blocks to be marked corrupted.

Pre-existing issues:
- [High] Use-After-Free in dax_pgoff_to_phys() due to concurrent reallocation of dev_dax->ranges array via sysfs.
--

commit 410f6bffae27c2c1b51ab4214947b135248b8594
Author: John Groves <John@Groves.net>

dax/fsdev: fix multi-range offset in memory_failure handler

This commit updates the offset calculation in the memory_failure handler for
multi-range DAX devices. It changes the logic to iterate over the device
pagemap's immutable ranges array instead of the potentially dynamically
reallocated dev_dax ranges array.

>     Walk the pagemap's own range array (pgmap->ranges[]) rather than
>     dev_dax->ranges[]. The pgmap copy is the immutable snapshot populated at
>     probe and is never mutated afterwards, whereas dev_dax->ranges[] can be
>     krealloc()'d by a concurrent sysfs mapping_store() (under dax_region_rwsem,
>     which this ->memory_failure callback does not hold). For dynamic devices the
>     two arrays are identical, so the reported offset is unchanged for the
>     multi-range case this targets.

[Severity: High]
This isn't a bug introduced by this patch, but since the commit message
mentions that dev_dax->ranges can be krealloc()'d by a concurrent sysfs
mapping_store(), does this expose a use-after-free risk in
dax_pgoff_to_phys()?

Unlike size_store() which checks if the device is bound to a driver,
mapping_store() appears to lack this check and can invoke
alloc_dev_dax_range() to reallocate the array while the device is
actively in use:

drivers/dax/bus.c:mapping_store() {
    ...
    rc = alloc_dev_dax_range(dev_dax, r.start, to_alloc);
    ...
}

Meanwhile, filesystem I/O operations can locklessly iterate over the ranges
array:

drivers/dax/bus.c:dax_pgoff_to_phys() {
    ...
    for (i = 0; i < dev_dax->nr_range; i++) {
        struct dev_dax_range *dax_range = &dev_dax->ranges[i];
    ...
}

Could a concurrent sysfs write to the mapping attribute cause a
use-after-free if dax_pgoff_to_phys() is iterating the old ranges
array that was freed by krealloc()?

> diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
> index 188b2526bee45..2c5de3d80a618 100644
> --- a/drivers/dax/fsdev.c
> +++ b/drivers/dax/fsdev.c

[ ... ]

> @@ -135,11 +135,26 @@ static void fsdev_clear_ops(void *data)
>   * The core mm code in free_zone_device_folio() handles the wake_up_var()
>   * directly for this memory type.
>   */
> +static u64 fsdev_pfn_to_offset(struct dev_pagemap *pgmap, unsigned long pfn)
> +{
> +	phys_addr_t phys = PFN_PHYS(pfn);
> +	u64 offset = 0;
> +
> +	for (int i = 0; i < pgmap->nr_range; i++) {
> +		struct range *range = &pgmap->ranges[i];
> +
> +		if (phys >= range->start && phys <= range->end)
> +			return offset + (phys - range->start);

[Severity: Critical]
Does this calculation incorrectly include the data_offset padding for static
DAX devices?

For static DAX devices, the device's physical range starts with a metadata
padding area (data_offset). The old code calculated the offset based on
dev_dax->ranges[0].range.start, which represents the start of the actual
user data.

By using pgmap->ranges[0].start instead, which includes this metadata
padding, the logical offset might be inflated by data_offset. Could this
cause the filesystem to receive an incorrect offset in
dax_holder_notify_failure(), potentially resulting in poisoning the wrong
filesystem blocks while leaving the actual corrupted blocks unmarked?

> +		offset += range_len(range);
> +	}
> +	return -1ULL;
> +}
> +

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/0100019eb7bcda4b-3f8edae9-d7a4-4bfa-aaea-fcef77fdbbc3-000000@email.amazonses.com?part=2

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH V5 3/9] dax/fsdev: clear vmemmap_shift when binding static pgmap
  2026-06-11 17:32   ` [PATCH V5 3/9] dax/fsdev: clear vmemmap_shift when binding static pgmap John Groves
@ 2026-06-11 17:55     ` sashiko-bot
  2026-06-12  2:56     ` Richard Cheng
  1 sibling, 0 replies; 20+ messages in thread
From: sashiko-bot @ 2026-06-11 17:55 UTC (permalink / raw)
  To: John Groves; +Cc: linux-cxl

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:

New issues:
- [High] Rebinding a static DAX device from fsdev_dax to device_dax (or unloading the fsdev_dax module) leaves a dangling `pgmap->ops` pointer on the persistent static `pgmap` structure, risking a use-after-free on hardware memory failure.

Pre-existing issues:
- [High] Probe functions in fsdev_dax re-initialize an embedded cdev via `cdev_init()`, clobbering its kobject state if there are open file descriptors from a previous bind.
--

commit cfa8354a7e331532bf1a69e2ea2e854ab1b90742
Author: John Groves <John@Groves.net>

dax/fsdev: clear vmemmap_shift when binding static pgmap

Clear pgmap->vmemmap_shift for static DAX devices. When rebinding a
static device from device_dax to fsdev_dax, the stale vmemmap_shift
persists on the shared pgmap. Explicitly zero it before
devm_memremap_pages() so the vmemmap is built for order-0 folios.

> diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
> index 2c5de3d80a618..52f46b3e245ea 100644
> --- a/drivers/dax/fsdev.c
> +++ b/drivers/dax/fsdev.c
> @@ -237,6 +237,7 @@ static int fsdev_dax_probe(struct dev_dax *dev_dax)
>  		}
>  
>  		pgmap = dev_dax->pgmap;
> +		pgmap->vmemmap_shift = 0;

[Severity: High]
Since we are fixing the stale vmemmap_shift on the shared pgmap here, do we
also need to handle the pgmap->ops pointer?

If fsdev_dax unbinds or the module is unloaded, it doesn't appear that we
clear pgmap->ops, which is set to &fsdev_pagemap_ops later in
fsdev_dax_probe(). Additionally, if device_dax is bound to the same device
instead, it appears to update pgmap->type but leave pgmap->ops unchanged.

If a hardware memory error occurs on the region after unbind, will
memory_failure_dev_pagemap() attempt to call the dangling
pgmap->ops->memory_failure pointer:

mm/memory-failure.c:memory_failure_dev_pagemap()
    if (pgmap->ops && pgmap->ops->memory_failure) {
        rc = pgmap->ops->memory_failure(pgmap, pfn, 1, flags);

and cause a use-after-free or panic?

[Severity: High]
This isn't a bug introduced by this patch, but does the embedded cdev
initialization in this probe function clobber the kobject state?

If there are open file descriptors from a previous bind, and
fsdev_dax_probe() re-initializes the embedded cdev later in the function:

drivers/dax/fsdev.c:fsdev_dax_probe()
    cdev_init(cdev, &fsdev_fops);

will cdev_init() destroy the active kobject's refcount and zero out the
cdev->list? Could this cause list corruption when the stale file descriptors
are later closed, or mismatched kobject refcounting that frees the cdev
while the new driver instance is using it?

>  	} else {
>  		size_t pgmap_size;
>

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/0100019eb7bcda4b-3f8edae9-d7a4-4bfa-aaea-fcef77fdbbc3-000000@email.amazonses.com?part=3

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH V5 4/9] dax/fsdev: don't leave a dangling dev_dax->pgmap on probe failure
  2026-06-11 17:32   ` [PATCH V5 4/9] dax/fsdev: don't leave a dangling dev_dax->pgmap on probe failure John Groves
@ 2026-06-11 18:04     ` sashiko-bot
  0 siblings, 0 replies; 20+ messages in thread
From: sashiko-bot @ 2026-06-11 18:04 UTC (permalink / raw)
  To: John Groves; +Cc: linux-cxl

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:

Pre-existing issues:
- [High] Unpatched dangling pointer in device.c leading to permanent probe failure and potential UAF
- [High] Concurrent sysfs mapping_store() execution causes TOCTOU heap buffer overflow and Use-After-Free in probe pathways
--

commit f0cb55dd27105cfc1116641d11d0187ba63297c0
Author: John Groves <John@Groves.net>

dax/fsdev: don't leave a dangling dev_dax->pgmap on probe failure

This patch factors out pgmap acquisition into fsdev_acquire_pgmap() and
defers the dev_dax->pgmap assignment until the probe can no longer fail.
This ensures a failed probe does not leave a dangling pointer.

> diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
> index 52f46b3e245ea..cc097167ad2c7 100644
> --- a/drivers/dax/fsdev.c
> +++ b/drivers/dax/fsdev.c
> @@ -219,47 +219,62 @@ static const struct file_operations fsdev_fops = {
>  	.release = fsdev_release,
>  };
>  
> -static int fsdev_dax_probe(struct dev_dax *dev_dax)
> +/*
> + * Acquire the dev_pagemap for probe: the static (pre-populated) one if
> + * present, or a devm-allocated one for the dynamic case. Note that
> + * dev_dax->pgmap is not set here; fsdev_dax_probe() sets it only once
> + * probe succeeds, so a failed probe never leaves a dangling pointer
> + * to a devres-freed pgmap.
> + */
> +static struct dev_pagemap *fsdev_acquire_pgmap(struct dev_dax *dev_dax)
>  {

[ ... ]

> +	if (dev_dax->pgmap) {
> +		dev_warn(dev, "dynamic-dax with pre-populated page map\n");
> +		return ERR_PTR(-EINVAL);
> +	}
>  
> -		pgmap_size = struct_size(pgmap, ranges, dev_dax->nr_range - 1);
> -		pgmap = devm_kzalloc(dev, pgmap_size, GFP_KERNEL);
> -		if (!pgmap)
> -			return -ENOMEM;
> +	pgmap_size = struct_size(pgmap, ranges, dev_dax->nr_range - 1);
> +	pgmap = devm_kzalloc(dev, pgmap_size, GFP_KERNEL);
> +	if (!pgmap)
> +		return ERR_PTR(-ENOMEM);

[Severity: High]
This is a pre-existing issue, but does this allocation race with a concurrent
sysfs mapping_store() execution?

Since mapping_store() in drivers/dax/bus.c dynamically adds memory ranges
without checking if dev->driver is bound, it can run concurrently while the
driver is probing.

If dev_dax->nr_range increments after pgmap_size is calculated but before
the loop below executes, couldn't this cause a heap buffer overflow when
writing to pgmap->ranges[i]?

> -		pgmap->nr_range = dev_dax->nr_range;
> -		dev_dax->pgmap = pgmap;
> +	pgmap->nr_range = dev_dax->nr_range;
> +	for (int i = 0; i < dev_dax->nr_range; i++)
> +		pgmap->ranges[i] = dev_dax->ranges[i].range;
>  
> -		for (i = 0; i < dev_dax->nr_range; i++) {
> -			struct range *range = &dev_dax->ranges[i].range;
> +	return pgmap;
> +}
>  
> -			pgmap->ranges[i] = *range;
> -		}
> -	}
> +static int fsdev_dax_probe(struct dev_dax *dev_dax)
> +{
> +	struct dax_device *dax_dev = dev_dax->dax_dev;
> +	struct device *dev = &dev_dax->dev;
> +	struct dev_pagemap *pgmap;
> +	struct inode *inode;
> +	u64 data_offset = 0;
> +	struct cdev *cdev;
> +	void *addr;
> +	int rc, i;
> +
> +	pgmap = fsdev_acquire_pgmap(dev_dax);

[Severity: High]
This isn't a bug introduced by this patch, but since you are fixing the
dangling dev_dax->pgmap pointer vulnerability here in fsdev_dax_probe(),
should the exact same fix be applied to dev_dax_probe() in drivers/dax/device.c?

It appears device.c still assigns dev_dax->pgmap immediately after
devm_kzalloc(), which would leave it dangling if a subsequent operation like
devm_request_mem_region() fails, causing subsequent probe attempts to
permanently fail.

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/0100019eb7bcda4b-3f8edae9-d7a4-4bfa-aaea-fcef77fdbbc3-000000@email.amazonses.com?part=4

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH V5 6/9] dax/fsdev: fail probe on invalid pgmap offset
  2026-06-11 17:32   ` [PATCH V5 6/9] dax/fsdev: fail probe on invalid pgmap offset John Groves
@ 2026-06-11 18:09     ` Gupta, Pankaj
  2026-06-11 18:13     ` sashiko-bot
  1 sibling, 0 replies; 20+ messages in thread
From: Gupta, Pankaj @ 2026-06-11 18:09 UTC (permalink / raw)
  To: John Groves, John Groves, Dan Williams
  Cc: John Groves, Vishal Verma, Dave Jiang, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Miklos Szeredi,
	Alison Schofield, Ira Weiny, Jonathan Cameron,
	nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org


> From: John Groves <John@Groves.net>
>
> Convert the WARN_ON to a fatal error when pgmap_phys > phys. This
> condition means the remapped region starts after the device's data
> region, which is an impossible state. Previously the probe continued
> with data_offset=0, leaving virt_addr silently misaligned. Now probe
> returns -EINVAL with a diagnostic message.
>
> Fixes: 759455848df0b ("dax: Save the kva from memremap")
>
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> Reviewed-by: Alison Schofield <alison.schofield@intel.com>
> Signed-off-by: John Groves <john@groves.net>

Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>

> ---
>   drivers/dax/fsdev.c | 8 ++++++--
>   1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
> index 07de6bfbbf673..71d2bee1e2805 100644
> --- a/drivers/dax/fsdev.c
> +++ b/drivers/dax/fsdev.c
> @@ -320,8 +320,12 @@ static int fsdev_dax_probe(struct dev_dax *dev_dax)
>   		u64 phys = dev_dax->ranges[0].range.start;
>   		u64 pgmap_phys = pgmap[0].range.start;
>   
> -		if (!WARN_ON(pgmap_phys > phys))
> -			data_offset = phys - pgmap_phys;
> +		if (pgmap_phys > phys) {
> +			dev_err(dev, "pgmap start %#llx exceeds data start %#llx\n",
> +				pgmap_phys, phys);
> +			return -EINVAL;
> +		}
> +		data_offset = phys - pgmap_phys;
>   
>   		pr_debug("%s: offset detected phys=%llx pgmap_phys=%llx offset=%llx\n",
>   		       __func__, phys, pgmap_phys, data_offset);

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH V5 6/9] dax/fsdev: fail probe on invalid pgmap offset
  2026-06-11 17:32   ` [PATCH V5 6/9] dax/fsdev: fail probe on invalid pgmap offset John Groves
  2026-06-11 18:09     ` Gupta, Pankaj
@ 2026-06-11 18:13     ` sashiko-bot
  1 sibling, 0 replies; 20+ messages in thread
From: sashiko-bot @ 2026-06-11 18:13 UTC (permalink / raw)
  To: John Groves; +Cc: linux-cxl

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:

Pre-existing issues:
- [High] Validation check for 'pgmap_phys > phys' executes too late, allowing 'fsdev_clear_folio_state()' to process out-of-bounds PFNs and crash the kernel.
--

commit e78e60f7b38932158f6316aad6c7a8308838e5b2
Author: John Groves <John@Groves.net>

dax/fsdev: fail probe on invalid pgmap offset

This commit converts a WARN_ON to a fatal error when pgmap_phys exceeds phys,
preventing probe completion with an impossible state. It now returns -EINVAL 
with an error message instead of proceeding with an invalid offset.

> diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
> --- a/drivers/dax/fsdev.c
> +++ b/drivers/dax/fsdev.c
> @@ -320,8 +320,12 @@ static int fsdev_dax_probe(struct dev_dax *dev_dax)
>  		u64 phys = dev_dax->ranges[0].range.start;
>  		u64 pgmap_phys = pgmap[0].range.start;
>  
> -		if (!WARN_ON(pgmap_phys > phys))
> -			data_offset = phys - pgmap_phys;
> +		if (pgmap_phys > phys) {
> +			dev_err(dev, "pgmap start %#llx exceeds data start %#llx\n",
> +				pgmap_phys, phys);
> +			return -EINVAL;
> +		}

[Severity: High]
This is a pre-existing issue, but does this validation check execute too late
to prevent a crash?

Looking at fsdev_dax_probe(), the call to fsdev_clear_folio_state() happens
just before this check is reached. fsdev_clear_folio_state() iterates over
the physical PFNs defined in dev_dax->ranges, starting at phys. 

Because devm_memremap_pages() only populates vmemmap starting from
pgmap_phys, any PFNs between phys and pgmap_phys lack vmemmap backing. This
causes an out-of-bounds access when attempting to read the folio state:

fsdev_dax_probe()
  fsdev_clear_folio_state()
    struct folio *folio = pfn_folio(pfn);
    int order = dax_folio_reset_order(folio);

Could this trigger a fatal kernel page fault on the unmapped PFNs before the
probe has a chance to fail gracefully?

Should the validation block be moved above the call to
fsdev_clear_folio_state() to reject the invalid configuration before memory
is accessed?

> +		data_offset = phys - pgmap_phys;
>  
>  		pr_debug("%s: offset detected phys=%llx pgmap_phys=%llx offset=%llx\n",
>  		       __func__, phys, pgmap_phys, data_offset);

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/0100019eb7bcda4b-3f8edae9-d7a4-4bfa-aaea-fcef77fdbbc3-000000@email.amazonses.com?part=6

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH V5 7/9] dax: read holder_ops once in dax_holder_notify_failure()
  2026-06-11 17:32   ` [PATCH V5 7/9] dax: read holder_ops once in dax_holder_notify_failure() John Groves
@ 2026-06-11 18:13     ` sashiko-bot
  2026-06-12  3:02     ` Richard Cheng
  1 sibling, 0 replies; 20+ messages in thread
From: sashiko-bot @ 2026-06-11 18:13 UTC (permalink / raw)
  To: John Groves; +Cc: linux-cxl

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:

Pre-existing issues:
- [High] The patch fixes a TOCTOU race on `holder_ops` in `dax_holder_notify_failure()` but fails to address a missing SRCU synchronization in `fs_put_dax()`. Concurrently, `fs_put_dax()` clears `holder_ops` and `holder_data` but returns immediately without waiting for `dax_srcu` readers (unlike `kill_dax()`). This enables a filesystem callback to dereference a NULL `holder_data` pointer, or execute unloaded module memory if the module is unloaded before the preempted callback resumes.
--

commit fb9deacd32044f5bf6a6a9877ee95b7eb436b68b
Author: John Groves <John@Groves.net>
dax: read holder_ops once in dax_holder_notify_failure()

This commit updates dax_holder_notify_failure() to read dax_dev->holder_ops
once using READ_ONCE(). This prevents a concurrent clear of holder_ops from
causing a NULL pointer dereference during the indirect function call.

> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c

[ ... ]

> @@ -311,12 +312,18 @@ int dax_holder_notify_failure(struct dax_device *dax_dev, u64 off,
>  		goto out;
>  	}
>  
> -	if (!dax_dev->holder_ops) {
> +	/*
> +	 * Read holder_ops once: a concurrent fs_put_dax() or kill_dax() can
> +	 * clear it. Without the single fetch the compiler could reload
> +	 * between the NULL check and the call and dereference a NULL ops.
> +	 */
> +	ops = READ_ONCE(dax_dev->holder_ops);
> +	if (!ops) {
>  		rc = -EOPNOTSUPP;
>  		goto out;
>  	}
>  
> -	rc = dax_dev->holder_ops->notify_failure(dax_dev, off, len, mf_flags);
> +	rc = ops->notify_failure(dax_dev, off, len, mf_flags);

[Severity: High]
This is a pre-existing issue, but does this fully address the race with
holder_data being cleared?

If fs_put_dax() clears both holder_ops and holder_data concurrently, it
returns without calling synchronize_srcu(&dax_srcu) to wait for dax_srcu
readers.

If a thread is preempted right before calling ops->notify_failure() here,
and fs_put_dax() executes, the callback will proceed with a valid ops pointer
but a NULL holder_data.

For example, xfs_dax_notify_failure() calls dax_holder() which returns NULL,
and then unconditionally dereferences the mount pointer:

fs/xfs/xfs_notify_failure.c:xfs_dax_notify_failure() {
	struct xfs_mount *mp = dax_holder(dax_dev);

	if (!(mp->m_super->s_flags & SB_BORN)) {
	...
}

Could this result in a NULL pointer dereference, or an execution-after-free
if the filesystem module is subsequently unloaded?

>  out:
>  	dax_read_unlock(id);
>  	return rc;

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/0100019eb7bcda4b-3f8edae9-d7a4-4bfa-aaea-fcef77fdbbc3-000000@email.amazonses.com?part=7

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH V5 8/9] dax: fix holder_ops race in fs_put_dax()
  2026-06-11 17:32   ` [PATCH V5 8/9] dax: fix holder_ops race in fs_put_dax() John Groves
@ 2026-06-11 18:28     ` sashiko-bot
  0 siblings, 0 replies; 20+ messages in thread
From: sashiko-bot @ 2026-06-11 18:28 UTC (permalink / raw)
  To: John Groves; +Cc: linux-cxl

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:

New issues:
- [High] Unconditional `WRITE_ONCE()` in `fs_put_dax()` corrupts `holder_ops` of concurrent mounts due to a pre-existing race in `fs_dax_get()`.

Pre-existing issues:
- [High] Lack of SRCU synchronization in `fs_put_dax()` leads to NULL pointer dereference or use-after-free in memory failure handlers.
--

commit 4554a08ddf0ff95891c1710cb432fdf796290f3d
Author: John Groves <John@Groves.net>

dax: fix holder_ops race in fs_put_dax()

This patch clears holder_ops before holder_data in fs_put_dax() to
prevent a concurrent fs_dax_get() from having its newly installed
operations overwritten. It adds a WARN_ON for API contract violations.

> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index 6b5ee6589e39b..af5c1e14f7e39 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -116,11 +116,47 @@ EXPORT_SYMBOL_GPL(fs_dax_get_by_bdev);
>  
>  #if IS_ENABLED(CONFIG_FS_DAX)
>  
[ ... ]
>  void fs_put_dax(struct dax_device *dax_dev, void *holder)
>  {
> -	if (dax_dev && holder &&
> -	    cmpxchg(&dax_dev->holder_data, holder, NULL) == holder)
> -		dax_dev->holder_ops = NULL;
> +	if (dax_dev && holder) {
> +		void *prev;
> +
> +		/*
> +		 * Clear holder_ops before releasing holder_data. A concurrent
[ ... ]
> +		 * its subsequent store of new ops will not be overwritten.
> +		 */
> +		WRITE_ONCE(dax_dev->holder_ops, NULL);

[Severity: High]
Does this unconditional write corrupt concurrent mounts?

Because the SRCU lock is dropped in fs_dax_get() before claiming
holder_data, kill_dax() can run concurrently and clear the device state.
This allows fs_dax_get() to mount a new filesystem on a dead device.

If the original filesystem performs a forced unmount and calls
fs_put_dax(), this WRITE_ONCE() unconditionally destroys the new
filesystem's holder_ops before verifying ownership via cmpxchg().

> +		prev = cmpxchg(&dax_dev->holder_data, holder, NULL);

[Severity: High]
This is a pre-existing issue, but does the lack of SRCU synchronization
here lead to use-after-free or NULL pointer dereferences in memory failure
handlers?

dax_holder_notify_failure() executes under an SRCU read lock and reads
holder_ops before calling the filesystem's callback. If fs_put_dax()
clears holder_data and completes the unmount without waiting for an SRCU
grace period, it can cause problems if a hardware memory failure runs
concurrently.

For instance, looking at xfs_dax_notify_failure():

    struct xfs_mount *mp = dax_holder(dax_dev);
    if (!(mp->m_super->s_flags & SB_BORN)) {
        ...

If holder_data is cleared right after holder_ops is checked, can the
callback read a cleared holder_data, resulting in a NULL mp dereference
when accessing mp->m_super, or access a fully freed mp if the filesystem
has finished unmounting?

> +
> +		/*
> +		 * prev == holder: normal release.
[ ... ]
> +		 *                 instead since violating the contract is
> +		 *                 a bug.
> +		 */
> +		WARN_ON(prev && prev != holder);
> +	}
>  	put_dax(dax_dev);
>  }
>  EXPORT_SYMBOL_GPL(fs_put_dax);

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/0100019eb7bcda4b-3f8edae9-d7a4-4bfa-aaea-fcef77fdbbc3-000000@email.amazonses.com?part=8

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH V5 3/9] dax/fsdev: clear vmemmap_shift when binding static pgmap
  2026-06-11 17:32   ` [PATCH V5 3/9] dax/fsdev: clear vmemmap_shift when binding static pgmap John Groves
  2026-06-11 17:55     ` sashiko-bot
@ 2026-06-12  2:56     ` Richard Cheng
  1 sibling, 0 replies; 20+ messages in thread
From: Richard Cheng @ 2026-06-12  2:56 UTC (permalink / raw)
  To: John Groves
  Cc: John Groves, Dan Williams, John Groves, Vishal Verma, Dave Jiang,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Miklos Szeredi, Alison Schofield, Ira Weiny, Jonathan Cameron,
	nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org

On Thu, Jun 11, 2026 at 05:32:07PM +0800, John Groves wrote:
> From: John Groves <John@Groves.net>
> 
> Clear pgmap->vmemmap_shift for static DAX devices. When rebinding a static
> device from device_dax (which may set vmemmap_shift based on alignment) to
> fsdev_dax, the stale vmemmap_shift persists on the shared pgmap. Explicitly
> zero it before devm_memremap_pages() so the vmemmap is built for order-0
> folios as fsdev requires.
> 
> Fixes: d5406bd458b0a ("dax: add fsdev.c driver for fs-dax on character dax")
> 
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> Reviewed-by: Alison Schofield <alison.schofield@intel.com>
> Signed-off-by: John Groves <john@groves.net>
> ---
>  drivers/dax/fsdev.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
> index 2c5de3d80a618..52f46b3e245ea 100644
> --- a/drivers/dax/fsdev.c
> +++ b/drivers/dax/fsdev.c
> @@ -237,6 +237,7 @@ static int fsdev_dax_probe(struct dev_dax *dev_dax)
>  		}
>  
>  		pgmap = dev_dax->pgmap;
> +		pgmap->vmemmap_shift = 0;


Hello John,

I would suggest also clearing pgmap->ops and pgmap->owner on teardown.
fsdev also writes them but never clears them. memuunmap_pages() leaves the
descriptor intact and kill_dev_dax() only NULLs dev_dax->pgmap for !static case.
After fsdev unbind the stale ops survive.
If we do rmmod later a HW failure dispatch pgmap->ops->memory_failure.

--Richard

>  	} else {
>  		size_t pgmap_size;
>  
> -- 
> 2.53.0
> 
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH V5 7/9] dax: read holder_ops once in dax_holder_notify_failure()
  2026-06-11 17:32   ` [PATCH V5 7/9] dax: read holder_ops once in dax_holder_notify_failure() John Groves
  2026-06-11 18:13     ` sashiko-bot
@ 2026-06-12  3:02     ` Richard Cheng
  1 sibling, 0 replies; 20+ messages in thread
From: Richard Cheng @ 2026-06-12  3:02 UTC (permalink / raw)
  To: John Groves
  Cc: John Groves, Dan Williams, John Groves, Vishal Verma, Dave Jiang,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Miklos Szeredi, Alison Schofield, Ira Weiny, Jonathan Cameron,
	nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org

On Thu, Jun 11, 2026 at 05:32:45PM +0800, John Groves wrote:
> From: John Groves <John@Groves.net>
> 
> dax_holder_notify_failure() reads dax_dev->holder_ops twice without
> READ_ONCE() -- once for the NULL check and once for the indirect
> notify_failure() call. A concurrent fs_put_dax() or kill_dax() can clear
> holder_ops between the two reads, so the check can observe a non-NULL
> pointer while the call dereferences NULL.
> 

Hello John,

Thanks for this.

Reviewed-by: Richard Cheng <icheng@nvidia.com>

Small message nit, kill_dax() isn't a racing clearer.
Plus I think this only fix holder_ops double-fetch, the fs_put_dax()
race issue is separate and pre-existing.

Best regards,
Richard Cheng.

> Fetch holder_ops once into a local with READ_ONCE() so the NULL check and
> the indirect call observe the same value.
> 
> Fixes: 8012b86608552 ("dax: introduce holder for dax_device")
> Suggested-by: Richard Cheng <icheng@nvidia.com>
> Signed-off-by: John Groves <john@groves.net>
> ---
>  drivers/dax/super.c | 11 +++++++++--
>  1 file changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index 25cf99dd9360b..6b5ee6589e39b 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -303,6 +303,7 @@ EXPORT_SYMBOL_GPL(dax_recovery_write);
>  int dax_holder_notify_failure(struct dax_device *dax_dev, u64 off,
>  			      u64 len, int mf_flags)
>  {
> +	const struct dax_holder_operations *ops;
>  	int rc, id;
>  
>  	id = dax_read_lock();
> @@ -311,12 +312,18 @@ int dax_holder_notify_failure(struct dax_device *dax_dev, u64 off,
>  		goto out;
>  	}
>  
> -	if (!dax_dev->holder_ops) {
> +	/*
> +	 * Read holder_ops once: a concurrent fs_put_dax() or kill_dax() can
> +	 * clear it. Without the single fetch the compiler could reload
> +	 * between the NULL check and the call and dereference a NULL ops.
> +	 */
> +	ops = READ_ONCE(dax_dev->holder_ops);
> +	if (!ops) {
>  		rc = -EOPNOTSUPP;
>  		goto out;
>  	}
>  
> -	rc = dax_dev->holder_ops->notify_failure(dax_dev, off, len, mf_flags);
> +	rc = ops->notify_failure(dax_dev, off, len, mf_flags);
>  out:
>  	dax_read_unlock(id);
>  	return rc;
> -- 
> 2.53.0
> 
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH V5 2/9] dax/fsdev: fix multi-range offset in memory_failure handler
  2026-06-11 17:31   ` [PATCH V5 2/9] dax/fsdev: fix multi-range offset in memory_failure handler John Groves
  2026-06-11 17:51     ` sashiko-bot
@ 2026-06-12  3:08     ` Richard Cheng
  1 sibling, 0 replies; 20+ messages in thread
From: Richard Cheng @ 2026-06-12  3:08 UTC (permalink / raw)
  To: John Groves
  Cc: John Groves, Dan Williams, John Groves, Vishal Verma, Dave Jiang,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Miklos Szeredi, Alison Schofield, Ira Weiny, Jonathan Cameron,
	nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org

On Thu, Jun 11, 2026 at 05:31:59PM +0800, John Groves wrote:
> From: John Groves <John@Groves.net>
> 
> Fix memory_failure offset calculation for multi-range devices. The old code
> subtracted ranges[0].range.start from the faulting PFN's physical address,
> which produces an incorrect (inflated) logical offset when the PFN falls in
> ranges[1] or beyond due to physical gaps between ranges. Add
> fsdev_pfn_to_offset() to walk the range list and compute the correct
> device-linear byte offset.
> 
> Walk the pagemap's own range array (pgmap->ranges[]) rather than
> dev_dax->ranges[]. The pgmap copy is the immutable snapshot populated at
> probe and is never mutated afterwards, whereas dev_dax->ranges[] can be
> krealloc()'d by a concurrent sysfs mapping_store() (under dax_region_rwsem,
> which this ->memory_failure callback does not hold). For dynamic devices the
> two arrays are identical, so the reported offset is unchanged for the
> multi-range case this targets.
> 
> Fixes: d5406bd458b0a ("dax: add fsdev.c driver for fs-dax on character dax")
> 
> Suggested-by: Richard Cheng <icheng@nvidia.com>
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> Reviewed-by: Alison Schofield <alison.schofield@intel.com>
> Signed-off-by: John Groves <john@groves.net>
> ---
>  drivers/dax/fsdev.c | 17 ++++++++++++++++-
>  1 file changed, 16 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
> index 188b2526bee45..2c5de3d80a618 100644
> --- a/drivers/dax/fsdev.c
> +++ b/drivers/dax/fsdev.c
> @@ -135,11 +135,26 @@ static void fsdev_clear_ops(void *data)
>   * The core mm code in free_zone_device_folio() handles the wake_up_var()
>   * directly for this memory type.
>   */
> +static u64 fsdev_pfn_to_offset(struct dev_pagemap *pgmap, unsigned long pfn)
> +{
> +	phys_addr_t phys = PFN_PHYS(pfn);
> +	u64 offset = 0;
> +
> +	for (int i = 0; i < pgmap->nr_range; i++) {
> +		struct range *range = &pgmap->ranges[i];
> +
> +		if (phys >= range->start && phys <= range->end)
> +			return offset + (phys - range->start);
> +		offset += range_len(range);
> +	}
> +	return -1ULL;
> +}
> +
>  static int fsdev_pagemap_memory_failure(struct dev_pagemap *pgmap,
>  		unsigned long pfn, unsigned long nr_pages, int mf_flags)
>  {
>  	struct dev_dax *dev_dax = pgmap->owner;
> -	u64 offset = PFN_PHYS(pfn) - dev_dax->ranges[0].range.start;
> +	u64 offset = fsdev_pfn_to_offset(pgmap, pfn);

Hi John,

I think this regresses static devices. pgmap->ranges[0].start can sit
data_offset below it on a static device, so the new offset = old + data_offset,
and XFS poisons the wrong blocks.

The gap walk only helps dynamic devices where data_offset ==0 . Maybe walking pgmap->ranges and
substract the probe's data_offset.

--Richard

>  	u64 len = nr_pages << PAGE_SHIFT;
>  
>  	return dax_holder_notify_failure(dev_dax->dax_dev, offset,
> -- 
> 2.53.0
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2026-06-12  3:09 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20260611173057.65868-1-john@jagalactic.com>
2026-06-11 17:31 ` [PATCH V5 0/9] Fixes to the previously-merged drivers/dax/fsdev series John Groves
2026-06-11 17:31   ` [PATCH V5 1/9] dax: fix misleading comment about share/index union in dax_folio_reset_order() John Groves
2026-06-11 17:31   ` [PATCH V5 2/9] dax/fsdev: fix multi-range offset in memory_failure handler John Groves
2026-06-11 17:51     ` sashiko-bot
2026-06-12  3:08     ` Richard Cheng
2026-06-11 17:32   ` [PATCH V5 3/9] dax/fsdev: clear vmemmap_shift when binding static pgmap John Groves
2026-06-11 17:55     ` sashiko-bot
2026-06-12  2:56     ` Richard Cheng
2026-06-11 17:32   ` [PATCH V5 4/9] dax/fsdev: don't leave a dangling dev_dax->pgmap on probe failure John Groves
2026-06-11 18:04     ` sashiko-bot
2026-06-11 17:32   ` [PATCH V5 5/9] dax/fsdev: use __va(phys) for kaddr in direct_access John Groves
2026-06-11 17:32   ` [PATCH V5 6/9] dax/fsdev: fail probe on invalid pgmap offset John Groves
2026-06-11 18:09     ` Gupta, Pankaj
2026-06-11 18:13     ` sashiko-bot
2026-06-11 17:32   ` [PATCH V5 7/9] dax: read holder_ops once in dax_holder_notify_failure() John Groves
2026-06-11 18:13     ` sashiko-bot
2026-06-12  3:02     ` Richard Cheng
2026-06-11 17:32   ` [PATCH V5 8/9] dax: fix holder_ops race in fs_put_dax() John Groves
2026-06-11 18:28     ` sashiko-bot
2026-06-11 17:33   ` [PATCH V5 9/9] dax: fsdev.c minor formatting cleanup John Groves

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.