[PATCH V4 0/9] Fixes to the previously-merged drivers/dax/fsdev series

Linux CXL
 help / color / mirror / Atom feed

* [PATCH V4 0/9] Fixes to the previously-merged drivers/dax/fsdev series
       [not found] <20260607193224.94244-1-john@jagalactic.com>
@ 2026-06-07 19:32 ` John Groves
  2026-06-07 19:33   ` [PATCH V4 1/9] dax: fix misleading comment about share/index union in dax_folio_reset_order() John Groves
                     ` (8 more replies)
  0 siblings, 9 replies; 25+ messages in thread
From: John Groves @ 2026-06-07 19:32 UTC (permalink / raw)
  To: John Groves, Dan Williams
  Cc: John Groves, Vishal Verma, Dave Jiang, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Miklos Szeredi,
	Alison Schofield, Ira Weiny, Jonathan Cameron,
	nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	John Groves

From: John Groves <john@groves.net>

This series applies bug fixes (mostly found via sashiko) to the dax/fsdev 
series. This has been soaking in the famfs CI pipeline for 2+ weeks and
1) won't affect anything that doesn't use drivers/dax/fsdev.c, and 2)
doesn't affect any known workloads - although the bugs would have 
manifested when multi-range DCD dax devices are a thing (soon-ish).

Most of the series is confined to drivers/dax/fsdev.c. Two patches
touch shared DAX core: patch 7 changes fs_put_dax() in 
drivers/dax/super.c (used by ext2/ext4/erofs/xfs, though only
holder-passing callers, like XFS in-tree, will see a new warning if
they misuse fs_put_dax(). 

Although patch 8 adjusts the dax_dev lookup API in super.c / dax.h, this 
does not affect the above file systems; they use fs_dax_get_by_bdev() 
instead.

Changes since V3:

- Patch 4: Adopted Dave's suggested refactor -- factor out
  fsdev_acquire_pgmap() and defer the dev_dax->pgmap assignment until
  probe can no longer fail, replacing the goto-based cleanup. Did not
  carry Alison's V3 Reviewed-by due to the rewrite.
- Patch 5: Also remove the now write-only dev_dax->virt_addr field,
  per Dave's review.
- Patch 7: Fixed the WARN_ON() to tolerate holder_data == NULL, which
  legitimately occurs when kill_dax() clears it during device removal
  under a live holder (per Dave's review). Wrong-holder calls still
  warn.
- Patch 8: Kept the Fixes tag -- the exported symbol itself is the
  hazard; stable kernels carrying the export should want this fix.

Changes since V2:

* Patch 1 (comment fix): No change. Responded to Dave's question about
  the dropped precondition -- the new comment correctly covers both
  callers; fsdev_clear_folio_state() does not guarantee share==0 before
  calling, so the old precondition was no longer universally true.
* V2 patch 2 (three fixes): Split into three separate patches (patches
  2-4) per Dave's review.
* V2 patch 3 (two fixes): Split into two separate patches (patches 5-6)
  per Dave's review.
* V2 patch 4 (clamp direct_access / remove cached_size): Dropped.
  Dave's analysis correctly showed the claimed bug does not exist --
  dax_pgoff_to_phys() already enforces that the full requested size fits
  within a single range before returning, making the clamp a no-op in
  every reachable path.
* V2 patch 5 (holder_ops race): Use WRITE_ONCE() for the holder_ops
  store; add WARN_ON() on the cmpxchg result to catch wrong-holder and
  double-put API contract violations; fix the inline comment, which
  incorrectly claimed dax_holder_notify_failure() consults holder_ops
  only when holder_data is non-NULL.
* V2 patch 6 (dax_dev_find): Add dax_alive() check under dax_read_lock()
  after ilookup5() to prevent returning a device that is concurrently
  being torn down by kill_dax().
* V2 patch 7 (formatting cleanup): Drop incorrect Fixes: tag; add
  Dave's Reviewed-by.
* The series grows from 7 to 9 patches.

Changes since v1:
* Dropped modes from patch 6 to fs/fuse/famfs.c and 
  fs/famfs/famfs_inode.c, which are not upstream so it broke
  attempts to apply the series. Oops...
* Added patch 7, which addresses a previously-missed review comment
  from Jonathan - minor cleanup

John Groves (9):
  dax: fix misleading comment about share/index union in
    dax_folio_reset_order()
  dax/fsdev: fix multi-range offset in memory_failure handler
  dax/fsdev: clear vmemmap_shift when binding static pgmap
  dax/fsdev: don't leave a dangling dev_dax->pgmap on probe failure
  dax/fsdev: use __va(phys) for kaddr in direct_access
  dax/fsdev: fail probe on invalid pgmap offset
  dax: fix holder_ops race in fs_put_dax()
  dax: replace exported dax_dev_get() with non-allocating dax_dev_find()
  dax: fsdev.c minor formatting cleanup

 drivers/dax/dax-private.h |   2 -
 drivers/dax/fsdev.c       | 126 +++++++++++++++++++++++++-------------
 drivers/dax/super.c       |  80 ++++++++++++++++++++++--
 fs/dax.c                  |  12 ++--
 include/linux/dax.h       |   6 +-
 5 files changed, 168 insertions(+), 58 deletions(-)

base-commit: e43ffb69e0438cddd72aaa30898b4dc446f664f8
-- 
2.53.0

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH V4 1/9] dax: fix misleading comment about share/index union in dax_folio_reset_order()
  2026-06-07 19:32 ` [PATCH V4 0/9] Fixes to the previously-merged drivers/dax/fsdev series John Groves
@ 2026-06-07 19:33   ` John Groves
  2026-06-07 19:45     ` sashiko-bot
  2026-06-07 19:33   ` [PATCH V4 2/9] dax/fsdev: fix multi-range offset in memory_failure handler John Groves
                     ` (7 subsequent siblings)
  8 siblings, 1 reply; 25+ messages in thread
From: John Groves @ 2026-06-07 19:33 UTC (permalink / raw)
  To: John Groves, Dan Williams
  Cc: John Groves, Vishal Verma, Dave Jiang, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Miklos Szeredi,
	Alison Schofield, Ira Weiny, Jonathan Cameron,
	nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	John Groves

From: John Groves <John@Groves.net>

The comment in dax_folio_reset_order() claims that DAX maintains an
invariant where folio->share != 0 only when folio->mapping == NULL,
implying folio->share is zero whenever mapping is non-NULL. This is
misleading because folio->share and folio->index are a union -- for
non-shared folios with mapping != NULL, reading folio->share returns
the file page offset (folio->index), which is typically non-zero.

Reword the comment to accurately describe the union aliasing: the
assignment clears whichever interpretation of the union word is active
(index for non-shared folios, share for shared folios), which is correct
because the folio is being released in either case.

No functional change -- the code was already correct, only the
justification was wrong.

Fixes: 59eb73b98ae0b ("dax: Factor out dax_folio_reset_order() helper")

Reviewed-by: Jonathan Cameron <jic23@kernel.org>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Signed-off-by: John Groves <john@groves.net>
---
 fs/dax.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 6d175cd47a99b..df19c9317d10e 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -392,12 +392,12 @@ int dax_folio_reset_order(struct folio *folio)
 	int order = folio_order(folio);
 
 	/*
-	 * DAX maintains the invariant that folio->share != 0 only when
-	 * folio->mapping == NULL (enforced by dax_folio_make_shared()).
-	 * Equivalently: folio->mapping != NULL implies folio->share == 0.
-	 * Callers ensure share has been decremented to zero before
-	 * calling here, so unconditionally clearing both fields is
-	 * correct.
+	 * Clear the mapping and the index/share union word. folio->share
+	 * and folio->index occupy the same union in struct folio. For
+	 * non-shared folios (mapping != NULL), the union holds folio->index
+	 * (file page offset); for shared folios (mapping == NULL), it holds
+	 * folio->share (reference count). Either way, we are releasing the
+	 * folio and both fields should be zeroed.
 	 */
 	folio->mapping = NULL;
 	folio->share = 0;
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH V4 1/9] dax: fix misleading comment about share/index union in dax_folio_reset_order()
  2026-06-07 19:33   ` [PATCH V4 1/9] dax: fix misleading comment about share/index union in dax_folio_reset_order() John Groves
@ 2026-06-07 19:45     ` sashiko-bot
  0 siblings, 0 replies; 25+ messages in thread
From: sashiko-bot @ 2026-06-07 19:45 UTC (permalink / raw)
  To: John Groves; +Cc: linux-cxl

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:

Pre-existing issues:
- [High] Incorrect PFN advancement in `fsdev_clear_folio_state()` when starting from a tail page of a stale compound folio.
--

commit eb3e25245bdf433dc92d71266558b877ef6655c1
Author: John Groves <John@Groves.net>
dax: fix misleading comment about share/index union in dax_folio_reset_order()

This commit updates a misleading comment in dax_folio_reset_order() to
accurately describe the aliasing of folio->share and folio->index in the
union. It clarifies that the assignment safely clears whichever interpretation
is active, as both fields should be zeroed when the folio is released.

> diff --git a/fs/dax.c b/fs/dax.c
> index 6d175cd47a99b..df19c9317d10e 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -392,12 +392,12 @@ int dax_folio_reset_order(struct folio *folio)
>  	int order = folio_order(folio);
>  
>  	/*
> -	 * DAX maintains the invariant that folio->share != 0 only when
> -	 * folio->mapping == NULL (enforced by dax_folio_make_shared()).
> -	 * Equivalently: folio->mapping != NULL implies folio->share == 0.
> -	 * Callers ensure share has been decremented to zero before
> -	 * calling here, so unconditionally clearing both fields is
> -	 * correct.
> +	 * Clear the mapping and the index/share union word. folio->share
> +	 * and folio->index occupy the same union in struct folio. For
> +	 * non-shared folios (mapping != NULL), the union holds folio->index
> +	 * (file page offset); for shared folios (mapping == NULL), it holds
> +	 * folio->share (reference count). Either way, we are releasing the
> +	 * folio and both fields should be zeroed.
>  	 */
>  	folio->mapping = NULL;

[Severity: High]
This isn't a bug introduced by this patch, but looking at how callers use
dax_folio_reset_order():

drivers/dax/fsdev.c:fsdev_clear_folio_state() {
...
		while (pfn < end_pfn) {
			struct folio *folio = pfn_folio(pfn);
			int order = dax_folio_reset_order(folio);

			pfn += 1UL << order;
		}
...
}

Does this loop incorrectly advance the pfn if the range start falls within a
previously initialized compound page (a tail page)?

Since pfn_folio() returns the head folio, advancing by 1UL << order from the
current pfn might skip over the beginning of the next contiguous memory
region. This could leave stale compound metadata on those skipped pages.

Should the pfn advancement account for the page's index within the folio?

>  	folio->share = 0;

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/0100019ea3929225-a0f8e6f7-30ae-4f8e-ae6f-19129666c4c3-000000@email.amazonses.com?part=1

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH V4 2/9] dax/fsdev: fix multi-range offset in memory_failure handler
  2026-06-07 19:32 ` [PATCH V4 0/9] Fixes to the previously-merged drivers/dax/fsdev series John Groves
  2026-06-07 19:33   ` [PATCH V4 1/9] dax: fix misleading comment about share/index union in dax_folio_reset_order() John Groves
@ 2026-06-07 19:33   ` John Groves
  2026-06-07 19:49     ` sashiko-bot
  2026-06-08 10:56     ` Richard Cheng
  2026-06-07 19:33   ` [PATCH V4 3/9] dax/fsdev: clear vmemmap_shift when binding static pgmap John Groves
                     ` (6 subsequent siblings)
  8 siblings, 2 replies; 25+ messages in thread
From: John Groves @ 2026-06-07 19:33 UTC (permalink / raw)
  To: John Groves, Dan Williams
  Cc: John Groves, Vishal Verma, Dave Jiang, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Miklos Szeredi,
	Alison Schofield, Ira Weiny, Jonathan Cameron,
	nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	John Groves

From: John Groves <John@Groves.net>

Fix memory_failure offset calculation for multi-range devices. The old code
subtracted ranges[0].range.start from the faulting PFN's physical address,
which produces an incorrect (inflated) logical offset when the PFN falls in
ranges[1] or beyond due to physical gaps between ranges. Add
fsdev_pfn_to_offset() to walk the range list and compute the correct
device-linear byte offset.

Fixes: d5406bd458b0a ("dax: add fsdev.c driver for fs-dax on character dax")

Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Signed-off-by: John Groves <john@groves.net>
---
 drivers/dax/fsdev.c | 17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
index 188b2526bee45..f315533b299e9 100644
--- a/drivers/dax/fsdev.c
+++ b/drivers/dax/fsdev.c
@@ -135,11 +135,26 @@ static void fsdev_clear_ops(void *data)
  * The core mm code in free_zone_device_folio() handles the wake_up_var()
  * directly for this memory type.
  */
+static u64 fsdev_pfn_to_offset(struct dev_dax *dev_dax, unsigned long pfn)
+{
+	phys_addr_t phys = PFN_PHYS(pfn);
+	u64 offset = 0;
+
+	for (int i = 0; i < dev_dax->nr_range; i++) {
+		struct range *range = &dev_dax->ranges[i].range;
+
+		if (phys >= range->start && phys <= range->end)
+			return offset + (phys - range->start);
+		offset += range_len(range);
+	}
+	return -1ULL;
+}
+
 static int fsdev_pagemap_memory_failure(struct dev_pagemap *pgmap,
 		unsigned long pfn, unsigned long nr_pages, int mf_flags)
 {
 	struct dev_dax *dev_dax = pgmap->owner;
-	u64 offset = PFN_PHYS(pfn) - dev_dax->ranges[0].range.start;
+	u64 offset = fsdev_pfn_to_offset(dev_dax, pfn);
 	u64 len = nr_pages << PAGE_SHIFT;
 
 	return dax_holder_notify_failure(dev_dax->dax_dev, offset,
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH V4 2/9] dax/fsdev: fix multi-range offset in memory_failure handler
  2026-06-07 19:33   ` [PATCH V4 2/9] dax/fsdev: fix multi-range offset in memory_failure handler John Groves
@ 2026-06-07 19:49     ` sashiko-bot
  2026-06-08 10:56     ` Richard Cheng
  1 sibling, 0 replies; 25+ messages in thread
From: sashiko-bot @ 2026-06-07 19:49 UTC (permalink / raw)
  To: John Groves; +Cc: linux-cxl

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:

Pre-existing issues:
- [High] Lockless iteration over the dynamically reallocated `dev_dax->ranges` array can cause a Use-After-Free or out-of-bounds access.
--

commit f367744bc128642b975efbceb84c1bfc7d8fa44a
Author: John Groves <John@Groves.net>

dax/fsdev: fix multi-range offset in memory_failure handler

This commit fixes the memory failure offset calculation for multi-range
devices. It introduces fsdev_pfn_to_offset to walk the range list and
compute the correct device-linear byte offset instead of relying on the
start of the first range.

> diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
> index 188b2526bee45..f315533b299e9 100644
> --- a/drivers/dax/fsdev.c
> +++ b/drivers/dax/fsdev.c
[ ... ]
> @@ -135,11 +135,26 @@ static void fsdev_clear_ops(void *data)
>   * The core mm code in free_zone_device_folio() handles the wake_up_var()
>   * directly for this memory type.
>   */
> +static u64 fsdev_pfn_to_offset(struct dev_dax *dev_dax, unsigned long pfn)
> +{
> +	phys_addr_t phys = PFN_PHYS(pfn);
> +	u64 offset = 0;
> +
> +	for (int i = 0; i < dev_dax->nr_range; i++) {
> +		struct range *range = &dev_dax->ranges[i].range;

[Severity: High]
This is a pre-existing issue, but does lockless iteration over
dev_dax->ranges expose us to a use-after-free here?

If a privileged user writes to the sysfs mapping attribute concurrently,
it can trigger an array reallocation:

mapping_store()
  alloc_dev_dax_range()
    krealloc()

Since this iteration in fsdev_pfn_to_offset() does not hold a synchronization
lock like dax_dev_rwsem, a concurrent krealloc() could free the array while
it is being walked during a memory failure event. The previous code
locklessly accessed the first index, but the new loop extends the window
where this could occur.

Would it be safer to iterate over the immutable snapshot stored in
pgmap->ranges instead, since that is populated during probe and passed into
the memory failure handler?

> +
> +		if (phys >= range->start && phys <= range->end)
> +			return offset + (phys - range->start);
> +		offset += range_len(range);
> +	}
> +	return -1ULL;
> +}

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/0100019ea3929225-a0f8e6f7-30ae-4f8e-ae6f-19129666c4c3-000000@email.amazonses.com?part=2

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH V4 2/9] dax/fsdev: fix multi-range offset in memory_failure handler
  2026-06-07 19:33   ` [PATCH V4 2/9] dax/fsdev: fix multi-range offset in memory_failure handler John Groves
  2026-06-07 19:49     ` sashiko-bot
@ 2026-06-08 10:56     ` Richard Cheng
  2026-06-11 16:59       ` John Groves
  1 sibling, 1 reply; 25+ messages in thread
From: Richard Cheng @ 2026-06-08 10:56 UTC (permalink / raw)
  To: John Groves
  Cc: John Groves, Dan Williams, John Groves, Vishal Verma, Dave Jiang,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Miklos Szeredi, Alison Schofield, Ira Weiny, Jonathan Cameron,
	nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org

On Sun, Jun 07, 2026 at 07:33:19PM +0800, John Groves wrote:
> From: John Groves <John@Groves.net>
> 
> Fix memory_failure offset calculation for multi-range devices. The old code
> subtracted ranges[0].range.start from the faulting PFN's physical address,
> which produces an incorrect (inflated) logical offset when the PFN falls in
> ranges[1] or beyond due to physical gaps between ranges. Add
> fsdev_pfn_to_offset() to walk the range list and compute the correct
> device-linear byte offset.
> 
> Fixes: d5406bd458b0a ("dax: add fsdev.c driver for fs-dax on character dax")
> 
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> Reviewed-by: Alison Schofield <alison.schofield@intel.com>
> Signed-off-by: John Groves <john@groves.net>
> ---
>  drivers/dax/fsdev.c | 17 ++++++++++++++++-
>  1 file changed, 16 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
> index 188b2526bee45..f315533b299e9 100644
> --- a/drivers/dax/fsdev.c
> +++ b/drivers/dax/fsdev.c
> @@ -135,11 +135,26 @@ static void fsdev_clear_ops(void *data)
>   * The core mm code in free_zone_device_folio() handles the wake_up_var()
>   * directly for this memory type.
>   */
> +static u64 fsdev_pfn_to_offset(struct dev_dax *dev_dax, unsigned long pfn)
> +{
> +	phys_addr_t phys = PFN_PHYS(pfn);
> +	u64 offset = 0;
> +
> +	for (int i = 0; i < dev_dax->nr_range; i++) {
> +		struct range *range = &dev_dax->ranges[i].range;
> +

IMHO, this walks dev_dax->ranges[] locklessly from the memory_failure callback.
mapping_store() can krealloc() that array via alloc_dev_dax_range() without
checking dev->driver, so a sysfs mapping write concurrent with a HW poison event
can move/free it under this walk.

We have pgmap->ranges[], the imuutable copy populated at probe and never mutated
afterwards, right here in the callback, and its accumulated range_len() is exactly
the device-linear offset.
Maybe walking that instread closes the race.

What do you think?

--Richard

> +		if (phys >= range->start && phys <= range->end)
> +			return offset + (phys - range->start);
> +		offset += range_len(range);
> +	}
> +	return -1ULL;
> +}
> +
>  static int fsdev_pagemap_memory_failure(struct dev_pagemap *pgmap,
>  		unsigned long pfn, unsigned long nr_pages, int mf_flags)
>  {
>  	struct dev_dax *dev_dax = pgmap->owner;
> -	u64 offset = PFN_PHYS(pfn) - dev_dax->ranges[0].range.start;
> +	u64 offset = fsdev_pfn_to_offset(dev_dax, pfn);
>  	u64 len = nr_pages << PAGE_SHIFT;
>  
>  	return dax_holder_notify_failure(dev_dax->dax_dev, offset,
> -- 
> 2.53.0
> 
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH V4 2/9] dax/fsdev: fix multi-range offset in memory_failure handler
  2026-06-08 10:56     ` Richard Cheng
@ 2026-06-11 16:59       ` John Groves
  0 siblings, 0 replies; 25+ messages in thread
From: John Groves @ 2026-06-11 16:59 UTC (permalink / raw)
  To: Richard Cheng
  Cc: John Groves, Dan Williams, John Groves, Vishal Verma, Dave Jiang,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Miklos Szeredi, Alison Schofield, Ira Weiny, Jonathan Cameron,
	nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org

On 26/06/08 06:56PM, Richard Cheng wrote:
> On Sun, Jun 07, 2026 at 07:33:19PM +0800, John Groves wrote:
> > From: John Groves <John@Groves.net>
> > 
> > Fix memory_failure offset calculation for multi-range devices. The old code
> > subtracted ranges[0].range.start from the faulting PFN's physical address,
> > which produces an incorrect (inflated) logical offset when the PFN falls in
> > ranges[1] or beyond due to physical gaps between ranges. Add
> > fsdev_pfn_to_offset() to walk the range list and compute the correct
> > device-linear byte offset.
> > 
> > Fixes: d5406bd458b0a ("dax: add fsdev.c driver for fs-dax on character dax")
> > 
> > Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> > Reviewed-by: Alison Schofield <alison.schofield@intel.com>
> > Signed-off-by: John Groves <john@groves.net>
> > ---
> >  drivers/dax/fsdev.c | 17 ++++++++++++++++-
> >  1 file changed, 16 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
> > index 188b2526bee45..f315533b299e9 100644
> > --- a/drivers/dax/fsdev.c
> > +++ b/drivers/dax/fsdev.c
> > @@ -135,11 +135,26 @@ static void fsdev_clear_ops(void *data)
> >   * The core mm code in free_zone_device_folio() handles the wake_up_var()
> >   * directly for this memory type.
> >   */
> > +static u64 fsdev_pfn_to_offset(struct dev_dax *dev_dax, unsigned long pfn)
> > +{
> > +	phys_addr_t phys = PFN_PHYS(pfn);
> > +	u64 offset = 0;
> > +
> > +	for (int i = 0; i < dev_dax->nr_range; i++) {
> > +		struct range *range = &dev_dax->ranges[i].range;
> > +
> 
> IMHO, this walks dev_dax->ranges[] locklessly from the memory_failure callback.
> mapping_store() can krealloc() that array via alloc_dev_dax_range() without
> checking dev->driver, so a sysfs mapping write concurrent with a HW poison event
> can move/free it under this walk.
> 
> We have pgmap->ranges[], the imuutable copy populated at probe and never mutated
> afterwards, right here in the callback, and its accumulated range_len() is exactly
> the device-linear offset.
> Maybe walking that instread closes the race.
> 
> What do you think?

Good idea. Revising to use the pgmap->ranges instead. Thanks Richard!

Regards,
John

<snip>


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH V4 3/9] dax/fsdev: clear vmemmap_shift when binding static pgmap
  2026-06-07 19:32 ` [PATCH V4 0/9] Fixes to the previously-merged drivers/dax/fsdev series John Groves
  2026-06-07 19:33   ` [PATCH V4 1/9] dax: fix misleading comment about share/index union in dax_folio_reset_order() John Groves
  2026-06-07 19:33   ` [PATCH V4 2/9] dax/fsdev: fix multi-range offset in memory_failure handler John Groves
@ 2026-06-07 19:33   ` John Groves
  2026-06-07 19:49     ` sashiko-bot
  2026-06-07 19:33   ` [PATCH V4 4/9] dax/fsdev: don't leave a dangling dev_dax->pgmap on probe failure John Groves
                     ` (5 subsequent siblings)
  8 siblings, 1 reply; 25+ messages in thread
From: John Groves @ 2026-06-07 19:33 UTC (permalink / raw)
  To: John Groves, Dan Williams
  Cc: John Groves, Vishal Verma, Dave Jiang, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Miklos Szeredi,
	Alison Schofield, Ira Weiny, Jonathan Cameron,
	nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	John Groves

From: John Groves <John@Groves.net>

Clear pgmap->vmemmap_shift for static DAX devices. When rebinding a static
device from device_dax (which may set vmemmap_shift based on alignment) to
fsdev_dax, the stale vmemmap_shift persists on the shared pgmap. Explicitly
zero it before devm_memremap_pages() so the vmemmap is built for order-0
folios as fsdev requires.

Fixes: d5406bd458b0a ("dax: add fsdev.c driver for fs-dax on character dax")

Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Signed-off-by: John Groves <john@groves.net>
---
 drivers/dax/fsdev.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
index f315533b299e9..dbd722ed7ab05 100644
--- a/drivers/dax/fsdev.c
+++ b/drivers/dax/fsdev.c
@@ -237,6 +237,7 @@ static int fsdev_dax_probe(struct dev_dax *dev_dax)
 		}
 
 		pgmap = dev_dax->pgmap;
+		pgmap->vmemmap_shift = 0;
 	} else {
 		size_t pgmap_size;
 
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH V4 3/9] dax/fsdev: clear vmemmap_shift when binding static pgmap
  2026-06-07 19:33   ` [PATCH V4 3/9] dax/fsdev: clear vmemmap_shift when binding static pgmap John Groves
@ 2026-06-07 19:49     ` sashiko-bot
  0 siblings, 0 replies; 25+ messages in thread
From: sashiko-bot @ 2026-06-07 19:49 UTC (permalink / raw)
  To: John Groves; +Cc: linux-cxl

Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider:

Pre-existing issues:
- [Critical] Dangling `pgmap->ops` and `pgmap->owner` function pointers cause Use-After-Free/Panic upon memory failure.
- [High] Stale `vmemmap_shift` in `device_dax` when rebinding static DAX devices with reduced alignment.
- [High] Re-initializing active `cdev` structure in probe leads to Use-After-Free and list corruption.
--

commit 98c8cbe94424d5d0c2736d9ed4d3c34493e26f6a
Author: John Groves <John@Groves.net>

dax/fsdev: clear vmemmap_shift when binding static pgmap

This commit explicitly zeros pgmap->vmemmap_shift for static dax devices
before calling devm_memremap_pages() in fsdev_dax. This ensures the
vmemmap is correctly built for order-0 folios, clearing any stale shift
value left behind when rebinding from device_dax.

> diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
> index f315533b299e9..dbd722ed7ab05 100644
> --- a/drivers/dax/fsdev.c
> +++ b/drivers/dax/fsdev.c
> @@ -237,6 +237,7 @@ static int fsdev_dax_probe(struct dev_dax *dev_dax)
>  		}
>  
>  		pgmap = dev_dax->pgmap;
> +		pgmap->vmemmap_shift = 0;

[Severity: High]
This isn't a bug introduced by this patch, but does drivers/dax/device.c
also need this fix?

In device.c:dev_dax_probe(), vmemmap_shift is updated conditionally:

	if (dev_dax->align > PAGE_SIZE)
		pgmap->vmemmap_shift =
			order_base_2(dev_dax->align >> PAGE_SHIFT);

If a static dax device is bound with an alignment greater than PAGE_SIZE,
unbound, has its alignment reduced to PAGE_SIZE via sysfs, and then rebound
to device_dax, it seems vmemmap_shift is not cleared and retains its stale
value.

Could this cause devm_memremap_pages() to erroneously build compound folios
for an order-0 dax device?

[Severity: Critical]
This is a pre-existing issue, but does rebinding between fsdev_dax and
device_dax leave dangling pointers in the shared pgmap?

When fsdev_dax binds to the pgmap, it populates its module-specific
callbacks:

drivers/dax/fsdev.c:fsdev_dax_probe() {
    ...
	pgmap->ops = &fsdev_pagemap_ops;
	pgmap->owner = dev_dax;
    ...
}

If the device is later rebound to device_dax, dev_dax_probe() only updates
pgmap->type and does not appear to clear pgmap->ops or pgmap->owner:

drivers/dax/device.c:dev_dax_probe() {
    ...
	pgmap->type = MEMORY_DEVICE_GENERIC;
    ...
}

If fsdev_dax is then unloaded and a memory failure occurs, could
memory_failure() invoke the dangling pgmap->ops->memory_failure() pointer,
resulting in a use-after-free and kernel panic?

[Severity: High]
This is a pre-existing issue, but does calling cdev_init() during probe
risk list corruption and use-after-free if the vfs still holds references?

Both fsdev_dax_probe() and dev_dax_probe() call cdev_init() on the cdev
embedded within the long-lived dax_device structure:

drivers/dax/fsdev.c:fsdev_dax_probe() {
    ...
	cdev = inode->i_cdev;
	cdev_init(cdev, &fsdev_fops);
    ...
}

If the driver is unbound and rebound while userspace still holds open file
descriptors for the device, the vfs maintains references to the cdev.

Doesn't cdev_init() zero-initialize the cdev and re-initialize its kobject,
overwriting the active reference count and clearing cdev->list?

>  	} else {
>  		size_t pgmap_size;
>

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/0100019ea3929225-a0f8e6f7-30ae-4f8e-ae6f-19129666c4c3-000000@email.amazonses.com?part=3

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH V4 4/9] dax/fsdev: don't leave a dangling dev_dax->pgmap on probe failure
  2026-06-07 19:32 ` [PATCH V4 0/9] Fixes to the previously-merged drivers/dax/fsdev series John Groves
                     ` (2 preceding siblings ...)
  2026-06-07 19:33   ` [PATCH V4 3/9] dax/fsdev: clear vmemmap_shift when binding static pgmap John Groves
@ 2026-06-07 19:33   ` John Groves
  2026-06-07 19:44     ` sashiko-bot
  2026-06-08 21:30     ` Dave Jiang
  2026-06-07 19:33   ` [PATCH V4 5/9] dax/fsdev: use __va(phys) for kaddr in direct_access John Groves
                     ` (4 subsequent siblings)
  8 siblings, 2 replies; 25+ messages in thread
From: John Groves @ 2026-06-07 19:33 UTC (permalink / raw)
  To: John Groves, Dan Williams
  Cc: John Groves, Vishal Verma, Dave Jiang, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Miklos Szeredi,
	Alison Schofield, Ira Weiny, Jonathan Cameron,
	nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	John Groves

From: John Groves <John@Groves.net>

After the dynamic path set dev_dax->pgmap, any later probe failure left
dev_dax->pgmap dangling: devres frees the devm_kzalloc'd pgmap on probe
failure, and subsequent probe attempts would hit the "dynamic-dax with
pre-populated page map" check and fail permanently.

Factor pgmap acquisition out into fsdev_acquire_pgmap(), and defer the
dev_dax->pgmap assignment until probe can no longer fail. A failed probe
now never publishes the pointer at all, so there is nothing to unwind.
This also matches kill_dev_dax(), which already clears the dynamic pgmap
pointer on unbind: dev_dax->pgmap is now non-NULL only while the pgmap
is actually valid.

Refactor suggested by Dave Jiang.

Fixes: d5406bd458b0a ("dax: add fsdev.c driver for fs-dax on character dax")
Signed-off-by: John Groves <john@groves.net>
---
 drivers/dax/fsdev.c | 77 ++++++++++++++++++++++++++++-----------------
 1 file changed, 49 insertions(+), 28 deletions(-)

diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
index dbd722ed7ab05..0fd5e1293d725 100644
--- a/drivers/dax/fsdev.c
+++ b/drivers/dax/fsdev.c
@@ -219,47 +219,62 @@ static const struct file_operations fsdev_fops = {
 	.release = fsdev_release,
 };
 
-static int fsdev_dax_probe(struct dev_dax *dev_dax)
+/*
+ * Acquire the dev_pagemap for probe: the static (pre-populated) one if
+ * present, or a devm-allocated one for the dynamic case. Note that
+ * dev_dax->pgmap is not set here; fsdev_dax_probe() sets it only once
+ * probe succeeds, so a failed probe never leaves a dangling pointer
+ * to a devres-freed pgmap.
+ */
+static struct dev_pagemap *fsdev_acquire_pgmap(struct dev_dax *dev_dax)
 {
-	struct dax_device *dax_dev = dev_dax->dax_dev;
 	struct device *dev = &dev_dax->dev;
 	struct dev_pagemap *pgmap;
-	struct inode *inode;
-	u64 data_offset = 0;
-	struct cdev *cdev;
-	void *addr;
-	int rc, i;
+	size_t pgmap_size;
 
 	if (static_dev_dax(dev_dax)) {
 		if (dev_dax->nr_range > 1) {
-			dev_warn(dev, "static pgmap / multi-range device conflict\n");
-			return -EINVAL;
+			dev_warn(dev,
+				 "static pgmap / multi-range device conflict\n");
+			return ERR_PTR(-EINVAL);
 		}
 
 		pgmap = dev_dax->pgmap;
 		pgmap->vmemmap_shift = 0;
-	} else {
-		size_t pgmap_size;
+		return pgmap;
+	}
 
-		if (dev_dax->pgmap) {
-			dev_warn(dev, "dynamic-dax with pre-populated page map\n");
-			return -EINVAL;
-		}
+	if (dev_dax->pgmap) {
+		dev_warn(dev, "dynamic-dax with pre-populated page map\n");
+		return ERR_PTR(-EINVAL);
+	}
 
-		pgmap_size = struct_size(pgmap, ranges, dev_dax->nr_range - 1);
-		pgmap = devm_kzalloc(dev, pgmap_size, GFP_KERNEL);
-		if (!pgmap)
-			return -ENOMEM;
+	pgmap_size = struct_size(pgmap, ranges, dev_dax->nr_range - 1);
+	pgmap = devm_kzalloc(dev, pgmap_size, GFP_KERNEL);
+	if (!pgmap)
+		return ERR_PTR(-ENOMEM);
 
-		pgmap->nr_range = dev_dax->nr_range;
-		dev_dax->pgmap = pgmap;
+	pgmap->nr_range = dev_dax->nr_range;
+	for (int i = 0; i < dev_dax->nr_range; i++)
+		pgmap->ranges[i] = dev_dax->ranges[i].range;
 
-		for (i = 0; i < dev_dax->nr_range; i++) {
-			struct range *range = &dev_dax->ranges[i].range;
+	return pgmap;
+}
 
-			pgmap->ranges[i] = *range;
-		}
-	}
+static int fsdev_dax_probe(struct dev_dax *dev_dax)
+{
+	struct dax_device *dax_dev = dev_dax->dax_dev;
+	struct device *dev = &dev_dax->dev;
+	struct dev_pagemap *pgmap;
+	struct inode *inode;
+	u64 data_offset = 0;
+	struct cdev *cdev;
+	void *addr;
+	int rc, i;
+
+	pgmap = fsdev_acquire_pgmap(dev_dax);
+	if (IS_ERR(pgmap))
+		return PTR_ERR(pgmap);
 
 	for (i = 0; i < dev_dax->nr_range; i++) {
 		struct range *range = &dev_dax->ranges[i].range;
@@ -306,7 +321,7 @@ static int fsdev_dax_probe(struct dev_dax *dev_dax)
 	/* Detect whether the data is at a non-zero offset into the memory */
 	if (pgmap->range.start != dev_dax->ranges[0].range.start) {
 		u64 phys = dev_dax->ranges[0].range.start;
-		u64 pgmap_phys = dev_dax->pgmap[0].range.start;
+		u64 pgmap_phys = pgmap[0].range.start;
 
 		if (!WARN_ON(pgmap_phys > phys))
 			data_offset = phys - pgmap_phys;
@@ -339,7 +354,13 @@ static int fsdev_dax_probe(struct dev_dax *dev_dax)
 		return rc;
 
 	run_dax(dax_dev);
-	return devm_add_action_or_reset(dev, fsdev_kill, dev_dax);
+	rc = devm_add_action_or_reset(dev, fsdev_kill, dev_dax);
+	if (rc)
+		return rc;
+
+	/* Probe can no longer fail; expose the pgmap via dev_dax */
+	dev_dax->pgmap = pgmap;
+	return 0;
 }
 
 static struct dax_device_driver fsdev_dax_driver = {
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH V4 4/9] dax/fsdev: don't leave a dangling dev_dax->pgmap on probe failure
  2026-06-07 19:33   ` [PATCH V4 4/9] dax/fsdev: don't leave a dangling dev_dax->pgmap on probe failure John Groves
@ 2026-06-07 19:44     ` sashiko-bot
  2026-06-08 21:30     ` Dave Jiang
  1 sibling, 0 replies; 25+ messages in thread
From: sashiko-bot @ 2026-06-07 19:44 UTC (permalink / raw)
  To: John Groves; +Cc: linux-cxl

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:

Pre-existing issues:
- [High] The identical dangling pointer bug that causes permanent probe failure and was fixed in `fsdev.c` still exists in `dev_dax_probe` within `drivers/dax/device.c`.
--

commit 4bfae7ab5cd902255b5838eb3c590fffc93926ef
Author: John Groves <John@Groves.net>

dax/fsdev: don't leave a dangling dev_dax->pgmap on probe failure

This commit fixes a bug in fsdev.c where dev_dax->pgmap was left dangling
if a probe failed after allocation. It defers assigning dev_dax->pgmap
until the probe can no longer fail.

> diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
> index dbd722ed7ab05..0fd5e1293d725 100644
> --- a/drivers/dax/fsdev.c
> +++ b/drivers/dax/fsdev.c
[ ... ]
> @@ -339,7 +354,13 @@ static int fsdev_dax_probe(struct dev_dax *dev_dax)
>  		return rc;
>  
>  	run_dax(dax_dev);
> -	return devm_add_action_or_reset(dev, fsdev_kill, dev_dax);
> +	rc = devm_add_action_or_reset(dev, fsdev_kill, dev_dax);
> +	if (rc)
> +		return rc;
> +
> +	/* Probe can no longer fail; expose the pgmap via dev_dax */
> +	dev_dax->pgmap = pgmap;
> +	return 0;
>  }

[Severity: High]
This is a pre-existing issue, but does the same dangling pointer bug still
exist in dev_dax_probe() in drivers/dax/device.c?

The identical pattern of assigning dev_dax->pgmap immediately after the
devm_kzalloc() allocation appears to be present there:

drivers/dax/device.c:dev_dax_probe() {
        ...
        pgmap = devm_kzalloc(dev, ...);
        if (!pgmap)
                return -ENOMEM;

        dev_dax->pgmap = pgmap;
        ...
}

If a later step in the probe fails (such as devm_request_mem_region() or
devm_memremap_pages()), the devres framework automatically frees the
allocated pgmap. Since the cleanup action is only registered at the end of
a successful probe, the error paths leave the pointer dangling.

Would subsequent attempts to probe the driver then hit the check for
dev_dax->pgmap and erroneously fail with -EINVAL, permanently preventing
successful device binding? Does device.c need a similar fix?

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/0100019ea3929225-a0f8e6f7-30ae-4f8e-ae6f-19129666c4c3-000000@email.amazonses.com?part=4

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH V4 4/9] dax/fsdev: don't leave a dangling dev_dax->pgmap on probe failure
  2026-06-07 19:33   ` [PATCH V4 4/9] dax/fsdev: don't leave a dangling dev_dax->pgmap on probe failure John Groves
  2026-06-07 19:44     ` sashiko-bot
@ 2026-06-08 21:30     ` Dave Jiang
  1 sibling, 0 replies; 25+ messages in thread
From: Dave Jiang @ 2026-06-08 21:30 UTC (permalink / raw)
  To: John Groves, John Groves, Dan Williams
  Cc: John Groves, Vishal Verma, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Miklos Szeredi,
	Alison Schofield, Ira Weiny, Jonathan Cameron,
	nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org



On 6/7/26 12:33 PM, John Groves wrote:
> From: John Groves <John@Groves.net>
> 
> After the dynamic path set dev_dax->pgmap, any later probe failure left
> dev_dax->pgmap dangling: devres frees the devm_kzalloc'd pgmap on probe
> failure, and subsequent probe attempts would hit the "dynamic-dax with
> pre-populated page map" check and fail permanently.
> 
> Factor pgmap acquisition out into fsdev_acquire_pgmap(), and defer the
> dev_dax->pgmap assignment until probe can no longer fail. A failed probe
> now never publishes the pointer at all, so there is nothing to unwind.
> This also matches kill_dev_dax(), which already clears the dynamic pgmap
> pointer on unbind: dev_dax->pgmap is now non-NULL only while the pgmap
> is actually valid.
> 
> Refactor suggested by Dave Jiang.
> 
> Fixes: d5406bd458b0a ("dax: add fsdev.c driver for fs-dax on character dax")
> Signed-off-by: John Groves <john@groves.net>

Reviewed-by: Dave Jiang <dave.jiang@intel.com>


> ---
>  drivers/dax/fsdev.c | 77 ++++++++++++++++++++++++++++-----------------
>  1 file changed, 49 insertions(+), 28 deletions(-)
> 
> diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
> index dbd722ed7ab05..0fd5e1293d725 100644
> --- a/drivers/dax/fsdev.c
> +++ b/drivers/dax/fsdev.c
> @@ -219,47 +219,62 @@ static const struct file_operations fsdev_fops = {
>  	.release = fsdev_release,
>  };
>  
> -static int fsdev_dax_probe(struct dev_dax *dev_dax)
> +/*
> + * Acquire the dev_pagemap for probe: the static (pre-populated) one if
> + * present, or a devm-allocated one for the dynamic case. Note that
> + * dev_dax->pgmap is not set here; fsdev_dax_probe() sets it only once
> + * probe succeeds, so a failed probe never leaves a dangling pointer
> + * to a devres-freed pgmap.
> + */
> +static struct dev_pagemap *fsdev_acquire_pgmap(struct dev_dax *dev_dax)
>  {
> -	struct dax_device *dax_dev = dev_dax->dax_dev;
>  	struct device *dev = &dev_dax->dev;
>  	struct dev_pagemap *pgmap;
> -	struct inode *inode;
> -	u64 data_offset = 0;
> -	struct cdev *cdev;
> -	void *addr;
> -	int rc, i;
> +	size_t pgmap_size;
>  
>  	if (static_dev_dax(dev_dax)) {
>  		if (dev_dax->nr_range > 1) {
> -			dev_warn(dev, "static pgmap / multi-range device conflict\n");
> -			return -EINVAL;
> +			dev_warn(dev,
> +				 "static pgmap / multi-range device conflict\n");
> +			return ERR_PTR(-EINVAL);
>  		}
>  
>  		pgmap = dev_dax->pgmap;
>  		pgmap->vmemmap_shift = 0;
> -	} else {
> -		size_t pgmap_size;
> +		return pgmap;
> +	}
>  
> -		if (dev_dax->pgmap) {
> -			dev_warn(dev, "dynamic-dax with pre-populated page map\n");
> -			return -EINVAL;
> -		}
> +	if (dev_dax->pgmap) {
> +		dev_warn(dev, "dynamic-dax with pre-populated page map\n");
> +		return ERR_PTR(-EINVAL);
> +	}
>  
> -		pgmap_size = struct_size(pgmap, ranges, dev_dax->nr_range - 1);
> -		pgmap = devm_kzalloc(dev, pgmap_size, GFP_KERNEL);
> -		if (!pgmap)
> -			return -ENOMEM;
> +	pgmap_size = struct_size(pgmap, ranges, dev_dax->nr_range - 1);
> +	pgmap = devm_kzalloc(dev, pgmap_size, GFP_KERNEL);
> +	if (!pgmap)
> +		return ERR_PTR(-ENOMEM);
>  
> -		pgmap->nr_range = dev_dax->nr_range;
> -		dev_dax->pgmap = pgmap;
> +	pgmap->nr_range = dev_dax->nr_range;
> +	for (int i = 0; i < dev_dax->nr_range; i++)
> +		pgmap->ranges[i] = dev_dax->ranges[i].range;
>  
> -		for (i = 0; i < dev_dax->nr_range; i++) {
> -			struct range *range = &dev_dax->ranges[i].range;
> +	return pgmap;
> +}
>  
> -			pgmap->ranges[i] = *range;
> -		}
> -	}
> +static int fsdev_dax_probe(struct dev_dax *dev_dax)
> +{
> +	struct dax_device *dax_dev = dev_dax->dax_dev;
> +	struct device *dev = &dev_dax->dev;
> +	struct dev_pagemap *pgmap;
> +	struct inode *inode;
> +	u64 data_offset = 0;
> +	struct cdev *cdev;
> +	void *addr;
> +	int rc, i;
> +
> +	pgmap = fsdev_acquire_pgmap(dev_dax);
> +	if (IS_ERR(pgmap))
> +		return PTR_ERR(pgmap);
>  
>  	for (i = 0; i < dev_dax->nr_range; i++) {
>  		struct range *range = &dev_dax->ranges[i].range;
> @@ -306,7 +321,7 @@ static int fsdev_dax_probe(struct dev_dax *dev_dax)
>  	/* Detect whether the data is at a non-zero offset into the memory */
>  	if (pgmap->range.start != dev_dax->ranges[0].range.start) {
>  		u64 phys = dev_dax->ranges[0].range.start;
> -		u64 pgmap_phys = dev_dax->pgmap[0].range.start;
> +		u64 pgmap_phys = pgmap[0].range.start;
>  
>  		if (!WARN_ON(pgmap_phys > phys))
>  			data_offset = phys - pgmap_phys;
> @@ -339,7 +354,13 @@ static int fsdev_dax_probe(struct dev_dax *dev_dax)
>  		return rc;
>  
>  	run_dax(dax_dev);
> -	return devm_add_action_or_reset(dev, fsdev_kill, dev_dax);
> +	rc = devm_add_action_or_reset(dev, fsdev_kill, dev_dax);
> +	if (rc)
> +		return rc;
> +
> +	/* Probe can no longer fail; expose the pgmap via dev_dax */
> +	dev_dax->pgmap = pgmap;
> +	return 0;
>  }
>  
>  static struct dax_device_driver fsdev_dax_driver = {


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH V4 5/9] dax/fsdev: use __va(phys) for kaddr in direct_access
  2026-06-07 19:32 ` [PATCH V4 0/9] Fixes to the previously-merged drivers/dax/fsdev series John Groves
                     ` (3 preceding siblings ...)
  2026-06-07 19:33   ` [PATCH V4 4/9] dax/fsdev: don't leave a dangling dev_dax->pgmap on probe failure John Groves
@ 2026-06-07 19:33   ` John Groves
  2026-06-07 19:44     ` sashiko-bot
  2026-06-07 19:34   ` [PATCH V4 6/9] dax/fsdev: fail probe on invalid pgmap offset John Groves
                     ` (3 subsequent siblings)
  8 siblings, 1 reply; 25+ messages in thread
From: John Groves @ 2026-06-07 19:33 UTC (permalink / raw)
  To: John Groves, Dan Williams
  Cc: John Groves, Vishal Verma, Dave Jiang, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Miklos Szeredi,
	Alison Schofield, Ira Weiny, Jonathan Cameron,
	nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	John Groves

From: John Groves <John@Groves.net>

Use __va(phys) instead of virt_addr + linear_offset for the kaddr
return in __fsdev_dax_direct_access(). The previous code added a
device-linear byte offset to virt_addr (which is __va of ranges[0]),
but for multi-range devices with physical gaps between ranges, this
linear arithmetic crosses the gap and produces a wrong kernel virtual
address. Using __va(phys) where phys comes from dax_pgoff_to_phys()
is correct for any range layout because the direct map translates
each physical address independently.

This leaves dev_dax->virt_addr write-only, so remove the field
(suggested by Dave Jiang).

Fixes: 759455848df0b ("dax: Save the kva from memremap")

Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Signed-off-by: John Groves <john@groves.net>
---
 drivers/dax/dax-private.h | 2 --
 drivers/dax/fsdev.c       | 8 ++------
 2 files changed, 2 insertions(+), 8 deletions(-)

diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
index 81e4af49e39c1..607a53a91f58b 100644
--- a/drivers/dax/dax-private.h
+++ b/drivers/dax/dax-private.h
@@ -69,7 +69,6 @@ struct dev_dax_range {
  * data while the device is activated in the driver.
  * @region: parent region
  * @dax_dev: core dax functionality
- * @virt_addr: kva from memremap; used by fsdev_dax
  * @cached_size: size of daxdev cached by fsdev_dax
  * @align: alignment of this instance
  * @target_node: effective numa node if dev_dax memory range is onlined
@@ -85,7 +84,6 @@ struct dev_dax_range {
 struct dev_dax {
 	struct dax_region *region;
 	struct dax_device *dax_dev;
-	void *virt_addr;
 	u64 cached_size;
 	unsigned int align;
 	int target_node;
diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
index 0fd5e1293d725..af9ef80c05c6d 100644
--- a/drivers/dax/fsdev.c
+++ b/drivers/dax/fsdev.c
@@ -51,9 +51,7 @@ static long __fsdev_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
 	struct dev_dax *dev_dax = dax_get_private(dax_dev);
 	size_t size = nr_pages << PAGE_SHIFT;
 	size_t offset = pgoff << PAGE_SHIFT;
-	void *virt_addr = dev_dax->virt_addr + offset;
 	phys_addr_t phys;
-	unsigned long local_pfn;
 
 	phys = dax_pgoff_to_phys(dev_dax, pgoff, size);
 	if (phys == -1) {
@@ -63,11 +61,10 @@ static long __fsdev_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
 	}
 
 	if (kaddr)
-		*kaddr = virt_addr;
+		*kaddr = __va(phys);
 
-	local_pfn = PHYS_PFN(phys);
 	if (pfn)
-		*pfn = local_pfn;
+		*pfn = PHYS_PFN(phys);
 
 	/*
 	 * Use cached_size which was computed at probe time. The size cannot
@@ -329,7 +326,6 @@ static int fsdev_dax_probe(struct dev_dax *dev_dax)
 		pr_debug("%s: offset detected phys=%llx pgmap_phys=%llx offset=%llx\n",
 		       __func__, phys, pgmap_phys, data_offset);
 	}
-	dev_dax->virt_addr = addr + data_offset;
 
 	inode = dax_inode(dax_dev);
 	cdev = inode->i_cdev;
-- 
2.53.0



^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH V4 5/9] dax/fsdev: use __va(phys) for kaddr in direct_access
  2026-06-07 19:33   ` [PATCH V4 5/9] dax/fsdev: use __va(phys) for kaddr in direct_access John Groves
@ 2026-06-07 19:44     ` sashiko-bot
  0 siblings, 0 replies; 25+ messages in thread
From: sashiko-bot @ 2026-06-07 19:44 UTC (permalink / raw)
  To: John Groves; +Cc: linux-cxl

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:

Pre-existing issues:
- [High] `__fsdev_dax_direct_access()` returns `-EFAULT` instead of truncating the returned size when a request crosses a multi-range physical boundary.
--

commit 25e7e6c7d5f413efd9eb57096b8df2c8803d235a
Author: John Groves <John@Groves.net>

dax/fsdev: use __va(phys) for kaddr in direct_access

This commit updates __fsdev_dax_direct_access() to use __va(phys) for the
kaddr return, fixing an issue where multi-range devices with physical gaps
resulted in an incorrect kernel virtual address. It also removes the now
unused virt_addr field from struct dev_dax.

> diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
> index 0fd5e1293d725..af9ef80c05c6d 100644
> --- a/drivers/dax/fsdev.c
> +++ b/drivers/dax/fsdev.c
> @@ -51,9 +51,7 @@ static long __fsdev_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
>  	struct dev_dax *dev_dax = dax_get_private(dax_dev);
>  	size_t size = nr_pages << PAGE_SHIFT;
>  	size_t offset = pgoff << PAGE_SHIFT;
> -	void *virt_addr = dev_dax->virt_addr + offset;
>  	phys_addr_t phys;
> -	unsigned long local_pfn;
>  
>  	phys = dax_pgoff_to_phys(dev_dax, pgoff, size);
>  	if (phys == -1) {

[Severity: High]
This isn't a bug introduced by this patch, but it appears to be a pre-existing
issue. If a direct_access request spans across a physical gap into the next
DAX range, dax_pgoff_to_phys() returns -1 because it checks if the entire
requested size fits within a single contiguous physical range.

This causes __fsdev_dax_direct_access() to fail the entire operation with
-EFAULT. Should it instead truncate the returned size to the boundary of the
available contiguous length, allowing the caller to safely issue subsequent
requests for the remainder of the pages? Failing completely with -EFAULT
might spuriously break valid I/O accesses on multi-range DAX devices.

[ ... ]

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/0100019ea3929225-a0f8e6f7-30ae-4f8e-ae6f-19129666c4c3-000000@email.amazonses.com?part=5

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH V4 6/9] dax/fsdev: fail probe on invalid pgmap offset
  2026-06-07 19:32 ` [PATCH V4 0/9] Fixes to the previously-merged drivers/dax/fsdev series John Groves
                     ` (4 preceding siblings ...)
  2026-06-07 19:33   ` [PATCH V4 5/9] dax/fsdev: use __va(phys) for kaddr in direct_access John Groves
@ 2026-06-07 19:34   ` John Groves
  2026-06-07 19:43     ` sashiko-bot
  2026-06-08 21:39     ` Dave Jiang
  2026-06-07 19:34   ` [PATCH V4 7/9] dax: fix holder_ops race in fs_put_dax() John Groves
                     ` (2 subsequent siblings)
  8 siblings, 2 replies; 25+ messages in thread
From: John Groves @ 2026-06-07 19:34 UTC (permalink / raw)
  To: John Groves, Dan Williams
  Cc: John Groves, Vishal Verma, Dave Jiang, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Miklos Szeredi,
	Alison Schofield, Ira Weiny, Jonathan Cameron,
	nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	John Groves

From: John Groves <John@Groves.net>

Convert the WARN_ON to a fatal error when pgmap_phys > phys. This
condition means the remapped region starts after the device's data
region, which is an impossible state. Previously the probe continued
with data_offset=0, leaving virt_addr silently misaligned. Now probe
returns -EINVAL with a diagnostic message.

Fixes: 759455848df0b ("dax: Save the kva from memremap")

Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Signed-off-by: John Groves <john@groves.net>
---
 drivers/dax/fsdev.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
index af9ef80c05c6d..dcb512625ce65 100644
--- a/drivers/dax/fsdev.c
+++ b/drivers/dax/fsdev.c
@@ -320,8 +320,12 @@ static int fsdev_dax_probe(struct dev_dax *dev_dax)
 		u64 phys = dev_dax->ranges[0].range.start;
 		u64 pgmap_phys = pgmap[0].range.start;
 
-		if (!WARN_ON(pgmap_phys > phys))
-			data_offset = phys - pgmap_phys;
+		if (pgmap_phys > phys) {
+			dev_err(dev, "pgmap start %#llx exceeds data start %#llx\n",
+				pgmap_phys, phys);
+			return -EINVAL;
+		}
+		data_offset = phys - pgmap_phys;
 
 		pr_debug("%s: offset detected phys=%llx pgmap_phys=%llx offset=%llx\n",
 		       __func__, phys, pgmap_phys, data_offset);
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH V4 6/9] dax/fsdev: fail probe on invalid pgmap offset
  2026-06-07 19:34   ` [PATCH V4 6/9] dax/fsdev: fail probe on invalid pgmap offset John Groves
@ 2026-06-07 19:43     ` sashiko-bot
  2026-06-08 21:39     ` Dave Jiang
  1 sibling, 0 replies; 25+ messages in thread
From: sashiko-bot @ 2026-06-07 19:43 UTC (permalink / raw)
  To: John Groves; +Cc: linux-cxl

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:

Pre-existing issues:
- [High] The validation check for `pgmap_phys > phys` is unreachable in the exact failure case it is meant to handle, resulting in a kernel panic due to out-of-bounds `vmemmap` access before the bounds check is evaluated.
--

commit 18b2ee922105fd88cbce7e6ed1c6000666fc3c6c
Author: John Groves <John@Groves.net>
Subject: dax/fsdev: fail probe on invalid pgmap offset

This commit converts a warning into a fatal error during probe when the
remapped region starts after the device's data region. It aims to prevent
virt_addr misalignments by returning an error instead of continuing with a
zero offset.

> diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
> --- a/drivers/dax/fsdev.c
> +++ b/drivers/dax/fsdev.c
[ ... ]
> @@ -320,8 +320,12 @@ static int fsdev_dax_probe(struct dev_dax *dev_dax)
>  		u64 phys = dev_dax->ranges[0].range.start;
>  		u64 pgmap_phys = pgmap[0].range.start;
>  
> -		if (!WARN_ON(pgmap_phys > phys))
> -			data_offset = phys - pgmap_phys;
> +		if (pgmap_phys > phys) {
> +			dev_err(dev, "pgmap start %#llx exceeds data start %#llx\n",
> +				pgmap_phys, phys);
> +			return -EINVAL;
> +		}

[Severity: High]
This is a pre-existing issue, but does this new validation check actually
execute before a crash occurs?

Looking at fsdev_dax_probe() in drivers/dax/fsdev.c, the validation block
handling the pgmap_phys > phys state is placed after
fsdev_clear_folio_state() is called.

fsdev_clear_folio_state() iterates unconditionally starting at phys:

drivers/dax/fsdev.c:fsdev_clear_folio_state() {
    ...
    while (pfn < end_pfn) {
        struct folio *folio = pfn_folio(pfn);
        ...
}

If phys < pgmap_phys, pfn_folio(pfn) dereferences pfns that fall outside
the remapped pgmap region, because vmemmap is only populated starting at
pgmap_phys via the earlier call to devm_memremap_pages().

Accessing these unpopulated vmemmap pages triggers an out-of-bounds page
fault.

Should the validation check and error return be relocated to execute before
fsdev_clear_folio_state() is invoked so that this graceful failure path is
reachable?

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/0100019ea3929225-a0f8e6f7-30ae-4f8e-ae6f-19129666c4c3-000000@email.amazonses.com?part=6

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH V4 6/9] dax/fsdev: fail probe on invalid pgmap offset
  2026-06-07 19:34   ` [PATCH V4 6/9] dax/fsdev: fail probe on invalid pgmap offset John Groves
  2026-06-07 19:43     ` sashiko-bot
@ 2026-06-08 21:39     ` Dave Jiang
  1 sibling, 0 replies; 25+ messages in thread
From: Dave Jiang @ 2026-06-08 21:39 UTC (permalink / raw)
  To: John Groves, John Groves, Dan Williams
  Cc: John Groves, Vishal Verma, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Miklos Szeredi,
	Alison Schofield, Ira Weiny, Jonathan Cameron,
	nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org



On 6/7/26 12:34 PM, John Groves wrote:
> From: John Groves <John@Groves.net>
> 
> Convert the WARN_ON to a fatal error when pgmap_phys > phys. This
> condition means the remapped region starts after the device's data
> region, which is an impossible state. Previously the probe continued
> with data_offset=0, leaving virt_addr silently misaligned. Now probe
> returns -EINVAL with a diagnostic message.
> 
> Fixes: 759455848df0b ("dax: Save the kva from memremap")
> 
> Reviewed-by: Alison Schofield <alison.schofield@intel.com>
> Signed-off-by: John Groves <john@groves.net>

Reviewed-by: Dave Jiang <dave.jiang@intel.com>

> ---
>  drivers/dax/fsdev.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
> index af9ef80c05c6d..dcb512625ce65 100644
> --- a/drivers/dax/fsdev.c
> +++ b/drivers/dax/fsdev.c
> @@ -320,8 +320,12 @@ static int fsdev_dax_probe(struct dev_dax *dev_dax)
>  		u64 phys = dev_dax->ranges[0].range.start;
>  		u64 pgmap_phys = pgmap[0].range.start;
>  
> -		if (!WARN_ON(pgmap_phys > phys))
> -			data_offset = phys - pgmap_phys;
> +		if (pgmap_phys > phys) {
> +			dev_err(dev, "pgmap start %#llx exceeds data start %#llx\n",
> +				pgmap_phys, phys);
> +			return -EINVAL;
> +		}
> +		data_offset = phys - pgmap_phys;
>  
>  		pr_debug("%s: offset detected phys=%llx pgmap_phys=%llx offset=%llx\n",
>  		       __func__, phys, pgmap_phys, data_offset);


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH V4 7/9] dax: fix holder_ops race in fs_put_dax()
  2026-06-07 19:32 ` [PATCH V4 0/9] Fixes to the previously-merged drivers/dax/fsdev series John Groves
                     ` (5 preceding siblings ...)
  2026-06-07 19:34   ` [PATCH V4 6/9] dax/fsdev: fail probe on invalid pgmap offset John Groves
@ 2026-06-07 19:34   ` John Groves
  2026-06-07 19:49     ` sashiko-bot
  2026-06-08 10:52     ` Richard Cheng
  2026-06-07 19:34   ` [PATCH V4 8/9] dax: replace exported dax_dev_get() with non-allocating dax_dev_find() John Groves
  2026-06-07 19:34   ` [PATCH V4 9/9] dax: fsdev.c minor formatting cleanup John Groves
  8 siblings, 2 replies; 25+ messages in thread
From: John Groves @ 2026-06-07 19:34 UTC (permalink / raw)
  To: John Groves, Dan Williams
  Cc: John Groves, Vishal Verma, Dave Jiang, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Miklos Szeredi,
	Alison Schofield, Ira Weiny, Jonathan Cameron,
	nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	John Groves

From: John Groves <John@Groves.net>

Clear holder_ops before holder_data so that a concurrent fs_dax_get()
cannot have its newly installed holder_ops overwritten. cmpxchg()
provides release ordering on weakly-ordered architectures, ensuring the
WRITE_ONCE(holder_ops, NULL) store is visible to any CPU that observes
the holder_data release.

Add a WARN_ON() that fires only when the cmpxchg observes a non-NULL
value that is not @holder, i.e. fs_put_dax() called by something that
is not the current holder. That is an API contract violation; the
WARN_ON() does not prevent the damage but makes the bug visible.

A NULL cmpxchg result is deliberately tolerated: kill_dax() clears
holder_data while a holder is still attached when a device is removed
out from under a mounted filesystem (after delivering MF_MEM_PRE_REMOVE).
The holder's subsequent fs_put_dax() - e.g. xfs_free_buftarg() after a
forced shutdown - then legitimately finds holder_data already NULL, so
warning on that case would turn supported device removal into a splat
(or a panic with panic_on_warn).

Also add a kerneldoc comment documenting that fs_put_dax() must only
be called by the current holder.

Fixes: eec38f5d86d27 ("dax: Add fs_dax_get() func to prepare dax for fs-dax usage")
Signed-off-by: John Groves <john@groves.net>
---
 drivers/dax/super.c | 42 +++++++++++++++++++++++++++++++++++++++---
 1 file changed, 39 insertions(+), 3 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 25cf99dd9360b..96f778dcde50b 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -116,11 +116,47 @@ EXPORT_SYMBOL_GPL(fs_dax_get_by_bdev);
 
 #if IS_ENABLED(CONFIG_FS_DAX)
 
+/**
+ * fs_put_dax() - release holder ownership of a dax_device
+ * @dax_dev: dax device to release (may be NULL)
+ * @holder: the holder pointer previously passed to fs_dax_get() or
+ *          fs_dax_get_by_bdev(); must match exactly, as it is used
+ *          in a cmpxchg to atomically release ownership
+ *
+ * Must only be called by the current holder. Clears holder_ops before
+ * holder_data to avoid a race where a concurrent fs_dax_get() could have
+ * its newly installed holder_ops overwritten.
+ */
 void fs_put_dax(struct dax_device *dax_dev, void *holder)
 {
-	if (dax_dev && holder &&
-	    cmpxchg(&dax_dev->holder_data, holder, NULL) == holder)
-		dax_dev->holder_ops = NULL;
+	if (dax_dev && holder) {
+		void *prev;
+
+		/*
+		 * Clear holder_ops before releasing holder_data. A concurrent
+		 * dax_holder_notify_failure() that sees NULL ops returns
+		 * -EOPNOTSUPP cleanly. A concurrent fs_dax_get() that acquires
+		 * holder_data after the cmpxchg below is guaranteed to observe
+		 * holder_ops=NULL first (cmpxchg provides release ordering), so
+		 * its subsequent store of new ops will not be overwritten.
+		 */
+		WRITE_ONCE(dax_dev->holder_ops, NULL);
+		prev = cmpxchg(&dax_dev->holder_data, holder, NULL);
+
+		/*
+		 * prev == holder: normal release.
+		 * prev == NULL:   already released by kill_dax() when the
+		 *                 device was removed under a live holder;
+		 *                 not a bug.
+		 * prev != holder (non-NULL): fs_put_dax() called by something
+		 *                 that is not the current holder; an API
+		 *                 contract violation. A lock would be needed
+		 *                 to guard against this, but we WARN_ON()
+		 *                 instead since violating the contract is
+		 *                 a bug.
+		 */
+		WARN_ON(prev && prev != holder);
+	}
 	put_dax(dax_dev);
 }
 EXPORT_SYMBOL_GPL(fs_put_dax);
-- 
2.53.0



^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH V4 7/9] dax: fix holder_ops race in fs_put_dax()
  2026-06-07 19:34   ` [PATCH V4 7/9] dax: fix holder_ops race in fs_put_dax() John Groves
@ 2026-06-07 19:49     ` sashiko-bot
  2026-06-08 10:52     ` Richard Cheng
  1 sibling, 0 replies; 25+ messages in thread
From: sashiko-bot @ 2026-06-07 19:49 UTC (permalink / raw)
  To: John Groves; +Cc: linux-cxl

Thank you for your contribution! Sashiko AI review found 5 potential issue(s) to consider:

New issues:
- [High] fs_put_dax() unconditionally clears dax_dev->holder_ops before verifying ownership, corrupting the operations of a legitimate holder if called incorrectly.

Pre-existing issues:
- [High] dax_holder_notify_failure() reads dax_dev->holder_ops without READ_ONCE(), allowing compiler double-fetches and a TOCTOU NULL pointer dereference.
- [High] fs_dax_get() drops the SRCU read lock before modifying device state, allowing it to race with kill_dax() and acquire a dead device.
- [High] fs_put_dax() clears holder information but fails to wait for pre-existing SRCU readers, leading to a UAF or NULL pointer dereference in notify callbacks.
- [High] kill_dax() checks holder_data before clearing DAXDEV_ALIVE, creating a TOCTOU race where a newly attached holder is wiped without notification.
--

commit c385d20d67e08b6d8549b98dcf43d6d8a692ac06
Author: John Groves <John@Groves.net>

dax: fix holder_ops race in fs_put_dax()

Clear holder_ops before holder_data so that a concurrent fs_dax_get()
cannot have its newly installed holder_ops overwritten. cmpxchg()
provides release ordering on weakly-ordered architectures, ensuring the
WRITE_ONCE(holder_ops, NULL) store is visible to any CPU that observes
the holder_data release.

> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index 25cf99dd9360b..96f778dcde50b 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -116,11 +116,47 @@ EXPORT_SYMBOL_GPL(fs_dax_get_by_bdev);

[ ... ]

> +		/*
> +		 * Clear holder_ops before releasing holder_data. A concurrent
> +		 * dax_holder_notify_failure() that sees NULL ops returns
> +		 * -EOPNOTSUPP cleanly. A concurrent fs_dax_get() that acquires

[Severity: High]
This isn't a bug introduced by this patch, but this comment relies on a flawed
behavior in the failure notification path.

Looking at dax_holder_notify_failure():

drivers/dax/super.c:dax_holder_notify_failure() {
    if (!dax_dev->holder_ops) {
        rc = -EOPNOTSUPP;
        goto out;
    }

    rc = dax_dev->holder_ops->notify_failure(dax_dev, off, len, mf_flags);
}

Since holder_ops is read without the READ_ONCE macro, can the compiler emit
two separate memory loads? If fs_put_dax() sets holder_ops to NULL between
the check and the indirect function call, it seems the second load would read
NULL, resulting in a NULL pointer dereference.

> +		 * holder_data after the cmpxchg below is guaranteed to observe
> +		 * holder_ops=NULL first (cmpxchg provides release ordering), so
> +		 * its subsequent store of new ops will not be overwritten.
> +		 */
> +		WRITE_ONCE(dax_dev->holder_ops, NULL);
> +		prev = cmpxchg(&dax_dev->holder_data, holder, NULL);

[Severity: High]
Does this unconditionally clear the holder operations before verifying 
ownership?

If an errant caller invokes fs_put_dax() with a non-matching holder, the 
cmpxchg will correctly fail and preserve holder_data. However, the legitimate 
holder's operations have already been overwritten with NULL here. Could this 
corrupt the rightful holder's state and silently break future memory failure 
notifications?

[ ... ]

> +		WARN_ON(prev && prev != holder);
> +	}
>  	put_dax(dax_dev);
>  }

[Severity: High]
This is a pre-existing issue, but does fs_put_dax() need to synchronize with
SRCU readers before returning?

When a filesystem unmounts, it calls fs_put_dax() to clear holder_data and
holder_ops, but returns without calling synchronize_srcu. 

If dax_holder_notify_failure() is executing asynchronously, it retrieves the
mount pointer:

fs/xfs/xfs_notify_failure.c:xfs_dax_notify_failure() {
    struct xfs_mount *mp = dax_holder(dax_dev);

    if (!(mp->m_super->s_flags & SB_BORN)) {
        ...
    }
}

If fs_put_dax() has just cleared holder_data, dax_holder() returns NULL, and
this immediately dereferences the NULL pointer. Could this cause a crash on
unmount?

[Severity: High]
This is a pre-existing issue, but looking at fs_dax_get() which is related to
these ownership changes:

drivers/dax/super.c:fs_dax_get() {
    id = dax_read_lock();
    if (!dax_dev || !dax_alive(dax_dev) || !igrab(&dax_dev->inode)) {
        dax_read_unlock(id);
        return -ENODEV;
    }
    dax_read_unlock(id);
    ... 
    if (cmpxchg(&dax_dev->holder_data, NULL, holder))
}

Does dropping the SRCU read lock before modifying the device state allow a
race with kill_dax()?

If kill_dax() executes concurrently, clears the alive flag, and wipes 
holder_data, could the cmpxchg in fs_dax_get() blindly succeed and set 
operations on a dead device?

[Severity: High]
This is a pre-existing issue, but could there be a race during device removal
in kill_dax()?

drivers/dax/super.c:kill_dax() {
    if (dax_dev->holder_data != NULL)
        dax_holder_notify_failure(dax_dev, 0, U64_MAX,
                MF_MEM_PRE_REMOVE);

    clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
}

Because holder_data is checked before clearing the alive flag, could a 
new filesystem execute fs_dax_get_by_bdev() concurrently, see dax_alive() is
true, and successfully register its holder_data? 

If kill_dax() then unconditionally wipes holder_data and holder_ops after
this, the new filesystem would miss the pre-remove notification entirely.

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/0100019ea3929225-a0f8e6f7-30ae-4f8e-ae6f-19129666c4c3-000000@email.amazonses.com?part=7

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH V4 7/9] dax: fix holder_ops race in fs_put_dax()
  2026-06-07 19:34   ` [PATCH V4 7/9] dax: fix holder_ops race in fs_put_dax() John Groves
  2026-06-07 19:49     ` sashiko-bot
@ 2026-06-08 10:52     ` Richard Cheng
  2026-06-11 17:01       ` John Groves
  1 sibling, 1 reply; 25+ messages in thread
From: Richard Cheng @ 2026-06-08 10:52 UTC (permalink / raw)
  To: John Groves
  Cc: John Groves, Dan Williams, John Groves, Vishal Verma, Dave Jiang,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Miklos Szeredi, Alison Schofield, Ira Weiny, Jonathan Cameron,
	nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org

On Sun, Jun 07, 2026 at 07:34:10PM +0800, John Groves wrote:
> From: John Groves <John@Groves.net>
> 
> Clear holder_ops before holder_data so that a concurrent fs_dax_get()
> cannot have its newly installed holder_ops overwritten. cmpxchg()
> provides release ordering on weakly-ordered architectures, ensuring the
> WRITE_ONCE(holder_ops, NULL) store is visible to any CPU that observes
> the holder_data release.
> 
> Add a WARN_ON() that fires only when the cmpxchg observes a non-NULL
> value that is not @holder, i.e. fs_put_dax() called by something that
> is not the current holder. That is an API contract violation; the
> WARN_ON() does not prevent the damage but makes the bug visible.
> 
> A NULL cmpxchg result is deliberately tolerated: kill_dax() clears
> holder_data while a holder is still attached when a device is removed
> out from under a mounted filesystem (after delivering MF_MEM_PRE_REMOVE).
> The holder's subsequent fs_put_dax() - e.g. xfs_free_buftarg() after a
> forced shutdown - then legitimately finds holder_data already NULL, so
> warning on that case would turn supported device removal into a splat
> (or a panic with panic_on_warn).
> 
> Also add a kerneldoc comment documenting that fs_put_dax() must only
> be called by the current holder.
> 
> Fixes: eec38f5d86d27 ("dax: Add fs_dax_get() func to prepare dax for fs-dax usage")
> Signed-off-by: John Groves <john@groves.net>
> ---
>  drivers/dax/super.c | 42 +++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 39 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index 25cf99dd9360b..96f778dcde50b 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -116,11 +116,47 @@ EXPORT_SYMBOL_GPL(fs_dax_get_by_bdev);
>  
>  #if IS_ENABLED(CONFIG_FS_DAX)
>  
> +/**
> + * fs_put_dax() - release holder ownership of a dax_device
> + * @dax_dev: dax device to release (may be NULL)
> + * @holder: the holder pointer previously passed to fs_dax_get() or
> + *          fs_dax_get_by_bdev(); must match exactly, as it is used
> + *          in a cmpxchg to atomically release ownership
> + *
> + * Must only be called by the current holder. Clears holder_ops before
> + * holder_data to avoid a race where a concurrent fs_dax_get() could have
> + * its newly installed holder_ops overwritten.
> + */
>  void fs_put_dax(struct dax_device *dax_dev, void *holder)
>  {
> -	if (dax_dev && holder &&
> -	    cmpxchg(&dax_dev->holder_data, holder, NULL) == holder)
> -		dax_dev->holder_ops = NULL;
> +	if (dax_dev && holder) {
> +		void *prev;
> +
> +		/*
> +		 * Clear holder_ops before releasing holder_data. A concurrent
> +		 * dax_holder_notify_failure() that sees NULL ops returns
> +		 * -EOPNOTSUPP cleanly. A concurrent fs_dax_get() that acquires
> +		 * holder_data after the cmpxchg below is guaranteed to observe
> +		 * holder_ops=NULL first (cmpxchg provides release ordering), so
> +		 * its subsequent store of new ops will not be overwritten.
> +		 */

This isn't guaranteed today. dax-holder_notify_failure() reads
dax_dev->holder_ops twice without READ_ONCE(). With your WRITE_ONCE()
racing in between, the second read "dax_dev->holder_ops->notify_failure()" can
return NULL and result in NULL deref, so the "see NULL cleanly" property the comment relies
on doesn't hold.

Or reading it once into a local would make it tru
"""
const struct dax_holder_operations *ops = READ_ONCE(dax_dev->holder_ops);

if (!ops)
	return -EOPNOTSUPP;
rc = ops->notify_failure(dax_dev, off, len, mf_flags);
"""

What do you think ?

--Richard


> +		WRITE_ONCE(dax_dev->holder_ops, NULL);
> +		prev = cmpxchg(&dax_dev->holder_data, holder, NULL);
> +
> +		/*
> +		 * prev == holder: normal release.
> +		 * prev == NULL:   already released by kill_dax() when the
> +		 *                 device was removed under a live holder;
> +		 *                 not a bug.
> +		 * prev != holder (non-NULL): fs_put_dax() called by something
> +		 *                 that is not the current holder; an API
> +		 *                 contract violation. A lock would be needed
> +		 *                 to guard against this, but we WARN_ON()
> +		 *                 instead since violating the contract is
> +		 *                 a bug.
> +		 */
> +		WARN_ON(prev && prev != holder);
> +	}
>  	put_dax(dax_dev);
>  }
>  EXPORT_SYMBOL_GPL(fs_put_dax);
> -- 
> 2.53.0
> 
> 
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH V4 7/9] dax: fix holder_ops race in fs_put_dax()
  2026-06-08 10:52     ` Richard Cheng
@ 2026-06-11 17:01       ` John Groves
  0 siblings, 0 replies; 25+ messages in thread
From: John Groves @ 2026-06-11 17:01 UTC (permalink / raw)
  To: Richard Cheng
  Cc: John Groves, Dan Williams, John Groves, Vishal Verma, Dave Jiang,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Miklos Szeredi, Alison Schofield, Ira Weiny, Jonathan Cameron,
	nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org

On 26/06/08 06:52PM, Richard Cheng wrote:
> On Sun, Jun 07, 2026 at 07:34:10PM +0800, John Groves wrote:
> > From: John Groves <John@Groves.net>
> > 
> > Clear holder_ops before holder_data so that a concurrent fs_dax_get()
> > cannot have its newly installed holder_ops overwritten. cmpxchg()
> > provides release ordering on weakly-ordered architectures, ensuring the
> > WRITE_ONCE(holder_ops, NULL) store is visible to any CPU that observes
> > the holder_data release.
> > 
> > Add a WARN_ON() that fires only when the cmpxchg observes a non-NULL
> > value that is not @holder, i.e. fs_put_dax() called by something that
> > is not the current holder. That is an API contract violation; the
> > WARN_ON() does not prevent the damage but makes the bug visible.
> > 
> > A NULL cmpxchg result is deliberately tolerated: kill_dax() clears
> > holder_data while a holder is still attached when a device is removed
> > out from under a mounted filesystem (after delivering MF_MEM_PRE_REMOVE).
> > The holder's subsequent fs_put_dax() - e.g. xfs_free_buftarg() after a
> > forced shutdown - then legitimately finds holder_data already NULL, so
> > warning on that case would turn supported device removal into a splat
> > (or a panic with panic_on_warn).
> > 
> > Also add a kerneldoc comment documenting that fs_put_dax() must only
> > be called by the current holder.
> > 
> > Fixes: eec38f5d86d27 ("dax: Add fs_dax_get() func to prepare dax for fs-dax usage")
> > Signed-off-by: John Groves <john@groves.net>
> > ---
> >  drivers/dax/super.c | 42 +++++++++++++++++++++++++++++++++++++++---
> >  1 file changed, 39 insertions(+), 3 deletions(-)
> > 
> > diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> > index 25cf99dd9360b..96f778dcde50b 100644
> > --- a/drivers/dax/super.c
> > +++ b/drivers/dax/super.c
> > @@ -116,11 +116,47 @@ EXPORT_SYMBOL_GPL(fs_dax_get_by_bdev);
> >  
> >  #if IS_ENABLED(CONFIG_FS_DAX)
> >  
> > +/**
> > + * fs_put_dax() - release holder ownership of a dax_device
> > + * @dax_dev: dax device to release (may be NULL)
> > + * @holder: the holder pointer previously passed to fs_dax_get() or
> > + *          fs_dax_get_by_bdev(); must match exactly, as it is used
> > + *          in a cmpxchg to atomically release ownership
> > + *
> > + * Must only be called by the current holder. Clears holder_ops before
> > + * holder_data to avoid a race where a concurrent fs_dax_get() could have
> > + * its newly installed holder_ops overwritten.
> > + */
> >  void fs_put_dax(struct dax_device *dax_dev, void *holder)
> >  {
> > -	if (dax_dev && holder &&
> > -	    cmpxchg(&dax_dev->holder_data, holder, NULL) == holder)
> > -		dax_dev->holder_ops = NULL;
> > +	if (dax_dev && holder) {
> > +		void *prev;
> > +
> > +		/*
> > +		 * Clear holder_ops before releasing holder_data. A concurrent
> > +		 * dax_holder_notify_failure() that sees NULL ops returns
> > +		 * -EOPNOTSUPP cleanly. A concurrent fs_dax_get() that acquires
> > +		 * holder_data after the cmpxchg below is guaranteed to observe
> > +		 * holder_ops=NULL first (cmpxchg provides release ordering), so
> > +		 * its subsequent store of new ops will not be overwritten.
> > +		 */
> 
> This isn't guaranteed today. dax-holder_notify_failure() reads
> dax_dev->holder_ops twice without READ_ONCE(). With your WRITE_ONCE()
> racing in between, the second read "dax_dev->holder_ops->notify_failure()" can
> return NULL and result in NULL deref, so the "see NULL cleanly" property the comment relies
> on doesn't hold.
> 
> Or reading it once into a local would make it tru
> """
> const struct dax_holder_operations *ops = READ_ONCE(dax_dev->holder_ops);
> 
> if (!ops)
> 	return -EOPNOTSUPP;
> rc = ops->notify_failure(dax_dev, off, len, mf_flags);
> """
> 
> What do you think ?

Another good catch. Adding a fix to dax_holder_notify_failure(), to get
the ops via READ_ONCE().

Thanks,
John

<snip>


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH V4 8/9] dax: replace exported dax_dev_get() with non-allocating dax_dev_find()
  2026-06-07 19:32 ` [PATCH V4 0/9] Fixes to the previously-merged drivers/dax/fsdev series John Groves
                     ` (6 preceding siblings ...)
  2026-06-07 19:34   ` [PATCH V4 7/9] dax: fix holder_ops race in fs_put_dax() John Groves
@ 2026-06-07 19:34   ` John Groves
  2026-06-07 19:49     ` sashiko-bot
  2026-06-08 10:48     ` Richard Cheng
  2026-06-07 19:34   ` [PATCH V4 9/9] dax: fsdev.c minor formatting cleanup John Groves
  8 siblings, 2 replies; 25+ messages in thread
From: John Groves @ 2026-06-07 19:34 UTC (permalink / raw)
  To: John Groves, Dan Williams
  Cc: John Groves, Vishal Verma, Dave Jiang, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Miklos Szeredi,
	Alison Schofield, Ira Weiny, Jonathan Cameron,
	nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	John Groves

From: John Groves <John@Groves.net>

This fix is in response to a Sashiko review, and some subsequent
analysis.

dax_dev_get() uses iget5_locked() which creates a new inode if no
matching one exists. This is correct for the internal caller
(alloc_dax), but dangerous for external callers that look up devices
from user-supplied or metadata-supplied dev_t values:

1. A new inode is created with DAXDEV_ALIVE set but no backing driver,
   no ops, and no IDA-allocated minor number.

2. On teardown, dax_destroy_inode() warns because kill_dax() was never
   called, and dax_free_inode() calls ida_free() for a minor that was
   never ida_alloc'd -- potentially freeing the minor of a real device.

Add dax_dev_find() which uses ilookup5() for lookup-only semantics:
it returns an existing dax_device with an elevated inode reference, or
NULL if no device with the given dev_t exists. It never creates inodes.
A dax_alive() check under dax_read_lock() guards against returning a
device that is concurrently being torn down by kill_dax().

Make dax_dev_get() static again (internal to super.c for alloc_dax),
export dax_dev_find() instead, and update the two external callers
(famfs_inode.c, famfs.c). Also add the missing CONFIG_DAX=n stub.

About the 'fixes' tag: this removes the export of dax_dev_get(),
which was flawed, and replaces is with dax_dev_find(). It feels like
the fixes tag makes sense for correcting an ABI error.

Fixes: 2ae624d5a555d ("dax: export dax_dev_get()")

Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Signed-off-by: John Groves <john@groves.net>
---
 drivers/dax/super.c | 38 ++++++++++++++++++++++++++++++++++++--
 include/linux/dax.h |  6 +++++-
 2 files changed, 41 insertions(+), 3 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 96f778dcde50b..b37ae79c084bb 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -557,7 +557,7 @@ static int dax_set(struct inode *inode, void *data)
 	return 0;
 }
 
-struct dax_device *dax_dev_get(dev_t devt)
+static struct dax_device *dax_dev_get(dev_t devt)
 {
 	struct dax_device *dax_dev;
 	struct inode *inode;
@@ -580,7 +580,41 @@ struct dax_device *dax_dev_get(dev_t devt)
 
 	return dax_dev;
 }
-EXPORT_SYMBOL_GPL(dax_dev_get);
+
+/**
+ * dax_dev_find - look up an existing dax_device by dev_t
+ * @devt: the device number to find
+ *
+ * Returns a dax_device with an elevated inode reference, or NULL if no
+ * device with the given dev_t exists. Unlike dax_dev_get(), this never
+ * allocates a new inode — it is safe for external callers that are looking
+ * up devices from user-supplied or metadata-supplied dev_t values.
+ *
+ * Caller must put_dax() the returned device when done.
+ */
+struct dax_device *dax_dev_find(dev_t devt)
+{
+	struct dax_device *dax_dev;
+	struct inode *inode;
+	int id;
+
+	inode = ilookup5(dax_superblock, hash_32(devt + DAXFS_MAGIC, 31),
+			 dax_test, &devt);
+	if (!inode)
+		return NULL;
+
+	dax_dev = to_dax_dev(inode);
+	id = dax_read_lock();
+	if (!dax_alive(dax_dev)) {
+		dax_read_unlock(id);
+		iput(inode);
+		return NULL;
+	}
+	dax_read_unlock(id);
+
+	return dax_dev;
+}
+EXPORT_SYMBOL_GPL(dax_dev_find);
 
 struct dax_device *alloc_dax(void *private, const struct dax_operations *ops)
 {
diff --git a/include/linux/dax.h b/include/linux/dax.h
index fe6c3ded1b50f..29113eb95e72d 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -54,7 +54,7 @@ struct dax_device *alloc_dax(void *private, const struct dax_operations *ops);
 void *dax_holder(struct dax_device *dax_dev);
 void put_dax(struct dax_device *dax_dev);
 void kill_dax(struct dax_device *dax_dev);
-struct dax_device *dax_dev_get(dev_t devt);
+struct dax_device *dax_dev_find(dev_t devt);
 void dax_write_cache(struct dax_device *dax_dev, bool wc);
 bool dax_write_cache_enabled(struct dax_device *dax_dev);
 bool dax_synchronous(struct dax_device *dax_dev);
@@ -92,6 +92,10 @@ static inline void put_dax(struct dax_device *dax_dev)
 static inline void kill_dax(struct dax_device *dax_dev)
 {
 }
+static inline struct dax_device *dax_dev_find(dev_t devt)
+{
+	return NULL;
+}
 static inline void dax_write_cache(struct dax_device *dax_dev, bool wc)
 {
 }
-- 
2.53.0



^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH V4 8/9] dax: replace exported dax_dev_get() with non-allocating dax_dev_find()
  2026-06-07 19:34   ` [PATCH V4 8/9] dax: replace exported dax_dev_get() with non-allocating dax_dev_find() John Groves
@ 2026-06-07 19:49     ` sashiko-bot
  2026-06-08 10:48     ` Richard Cheng
  1 sibling, 0 replies; 25+ messages in thread
From: sashiko-bot @ 2026-06-07 19:49 UTC (permalink / raw)
  To: John Groves; +Cc: linux-cxl

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:

New issues:
- [Medium] The patch introduces an exported symbol `dax_dev_find` but does not include the claimed caller updates, resulting in zero in-tree users.

Pre-existing issues:
- [High] A race condition allows `dax_dev_find()` to return a partially initialized `dax_device` with stale or uninitialized `ops` and `private` pointers, leading to a Use-After-Free or NULL pointer dereference.
--

commit 4db396b1477c7ac48536ccf90d47d3c4adb9e2d3
Author: John Groves <John@Groves.net>

dax: replace exported dax_dev_get() with non-allocating dax_dev_find()

This commit replaces the exported dax_dev_get with a new dax_dev_find
function that performs lookup-only semantics. It also makes dax_dev_get
static again to prevent external callers from accidentally creating new
unbacked DAX inodes.

> Make dax_dev_get() static again (internal to super.c for alloc_dax),
> export dax_dev_find() instead, and update the two external callers
> (famfs_inode.c, famfs.c). Also add the missing CONFIG_DAX=n stub.

[Severity: Medium]
The commit message mentions updating external callers in famfs_inode.c
and famfs.c, but these files are not included in the patch and do not
exist in the upstream tree.

Is it possible the updates for these files were left out of the patch,
or are they out-of-tree? Exporting a symbol without any in-tree users
typically goes against upstream guidelines.

> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index 96f778dcde50b..b37ae79c084bb 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
[ ... ]
> +struct dax_device *dax_dev_find(dev_t devt)
> +{
> +	struct dax_device *dax_dev;
> +	struct inode *inode;
> +	int id;
> +
> +	inode = ilookup5(dax_superblock, hash_32(devt + DAXFS_MAGIC, 31),
> +			 dax_test, &devt);
> +	if (!inode)
> +		return NULL;
> +
> +	dax_dev = to_dax_dev(inode);
> +	id = dax_read_lock();
> +	if (!dax_alive(dax_dev)) {

[Severity: High]
This is a pre-existing issue, but does adding dax_dev_find expose a race
condition where a partially initialized device can be returned to external
callers?

When a new DAX device is created, alloc_dax calls dax_dev_get. Inside
dax_dev_get, the new inode is unlocked and DAXDEV_ALIVE is set:

    if (inode_state_read_once(inode) & I_NEW) {
        set_bit(DAXDEV_ALIVE, &dax_dev->flags);
        ...
        unlock_new_inode(inode);
    }

However, alloc_dax initializes the ops and private pointers after
dax_dev_get returns:

    dax_dev = dax_dev_get(devt);
    if (!dax_dev)
        goto err_dev;

    dax_dev->ops = ops;
    dax_dev->private = private;

Since dax_device is allocated via alloc_inode_sb with GFP_KERNEL (no
__GFP_ZERO), could a concurrent call to dax_dev_find successfully look up
the inode via ilookup5 and see DAXDEV_ALIVE set before alloc_dax sets
the ops and private pointers?

If an external caller invokes DAX operations on the returned device, could
it dereference dangling ops or private pointers from a previous slab object
lifecycle? Is it possible to defer unlock_new_inode or setting
DAXDEV_ALIVE until after ops and private are fully initialized?

> +		dax_read_unlock(id);
> +		iput(inode);
> +		return NULL;
> +	}
> +	dax_read_unlock(id);
> +
> +	return dax_dev;
> +}
> +EXPORT_SYMBOL_GPL(dax_dev_find);

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/0100019ea3929225-a0f8e6f7-30ae-4f8e-ae6f-19129666c4c3-000000@email.amazonses.com?part=8

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH V4 8/9] dax: replace exported dax_dev_get() with non-allocating dax_dev_find()
  2026-06-07 19:34   ` [PATCH V4 8/9] dax: replace exported dax_dev_get() with non-allocating dax_dev_find() John Groves
  2026-06-07 19:49     ` sashiko-bot
@ 2026-06-08 10:48     ` Richard Cheng
  1 sibling, 0 replies; 25+ messages in thread
From: Richard Cheng @ 2026-06-08 10:48 UTC (permalink / raw)
  To: John Groves
  Cc: John Groves, Dan Williams, John Groves, Vishal Verma, Dave Jiang,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Miklos Szeredi, Alison Schofield, Ira Weiny, Jonathan Cameron,
	nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org

On Sun, Jun 07, 2026 at 07:34:21PM +0800, John Groves wrote:
> From: John Groves <John@Groves.net>
> 
> This fix is in response to a Sashiko review, and some subsequent
> analysis.
> 
> dax_dev_get() uses iget5_locked() which creates a new inode if no
> matching one exists. This is correct for the internal caller
> (alloc_dax), but dangerous for external callers that look up devices
> from user-supplied or metadata-supplied dev_t values:
> 
> 1. A new inode is created with DAXDEV_ALIVE set but no backing driver,
>    no ops, and no IDA-allocated minor number.
> 
> 2. On teardown, dax_destroy_inode() warns because kill_dax() was never
>    called, and dax_free_inode() calls ida_free() for a minor that was
>    never ida_alloc'd -- potentially freeing the minor of a real device.
> 
> Add dax_dev_find() which uses ilookup5() for lookup-only semantics:
> it returns an existing dax_device with an elevated inode reference, or
> NULL if no device with the given dev_t exists. It never creates inodes.
> A dax_alive() check under dax_read_lock() guards against returning a
> device that is concurrently being torn down by kill_dax().
> 
> Make dax_dev_get() static again (internal to super.c for alloc_dax),
> export dax_dev_find() instead, and update the two external callers
> (famfs_inode.c, famfs.c). Also add the missing CONFIG_DAX=n stub.
> 
> About the 'fixes' tag: this removes the export of dax_dev_get(),
> which was flawed, and replaces is with dax_dev_find(). It feels like
> the fixes tag makes sense for correcting an ABI error.
> 
> Fixes: 2ae624d5a555d ("dax: export dax_dev_get()")
> 
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> Reviewed-by: Alison Schofield <alison.schofield@intel.com>
> Signed-off-by: John Groves <john@groves.net>
> ---
>  drivers/dax/super.c | 38 ++++++++++++++++++++++++++++++++++++--
>  include/linux/dax.h |  6 +++++-
>  2 files changed, 41 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index 96f778dcde50b..b37ae79c084bb 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -557,7 +557,7 @@ static int dax_set(struct inode *inode, void *data)
>  	return 0;
>  }
>  
> -struct dax_device *dax_dev_get(dev_t devt)
> +static struct dax_device *dax_dev_get(dev_t devt)
>  {
>  	struct dax_device *dax_dev;
>  	struct inode *inode;
> @@ -580,7 +580,41 @@ struct dax_device *dax_dev_get(dev_t devt)
>  
>  	return dax_dev;
>  }
> -EXPORT_SYMBOL_GPL(dax_dev_get);
> +
> +/**
> + * dax_dev_find - look up an existing dax_device by dev_t
> + * @devt: the device number to find
> + *
> + * Returns a dax_device with an elevated inode reference, or NULL if no
> + * device with the given dev_t exists. Unlike dax_dev_get(), this never
> + * allocates a new inode — it is safe for external callers that are looking
> + * up devices from user-supplied or metadata-supplied dev_t values.
> + *
> + * Caller must put_dax() the returned device when done.
> + */
> +struct dax_device *dax_dev_find(dev_t devt)
> +{
> +	struct dax_device *dax_dev;
> +	struct inode *inode;
> +	int id;
> +
> +	inode = ilookup5(dax_superblock, hash_32(devt + DAXFS_MAGIC, 31),
> +			 dax_test, &devt);
> +	if (!inode)
> +		return NULL;
> +
> +	dax_dev = to_dax_dev(inode);
> +	id = dax_read_lock();
> +	if (!dax_alive(dax_dev)) {
> +		dax_read_unlock(id);
> +		iput(inode);
> +		return NULL;
> +	}
> +	dax_read_unlock(id);
> +
> +	return dax_dev;
> +}
> +EXPORT_SYMBOL_GPL(dax_dev_find);
>

For now I see no in-tree caller of this function, the famfs users you mention doesn't
exists yet. Cloud the dax_dev_find() addition+export go with the famfs series that
actually uses it, and this patch just do the dax_dev_get() de-export ?

--Richard
  
>  struct dax_device *alloc_dax(void *private, const struct dax_operations *ops)
>  {
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index fe6c3ded1b50f..29113eb95e72d 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -54,7 +54,7 @@ struct dax_device *alloc_dax(void *private, const struct dax_operations *ops);
>  void *dax_holder(struct dax_device *dax_dev);
>  void put_dax(struct dax_device *dax_dev);
>  void kill_dax(struct dax_device *dax_dev);
> -struct dax_device *dax_dev_get(dev_t devt);
> +struct dax_device *dax_dev_find(dev_t devt);
>  void dax_write_cache(struct dax_device *dax_dev, bool wc);
>  bool dax_write_cache_enabled(struct dax_device *dax_dev);
>  bool dax_synchronous(struct dax_device *dax_dev);
> @@ -92,6 +92,10 @@ static inline void put_dax(struct dax_device *dax_dev)
>  static inline void kill_dax(struct dax_device *dax_dev)
>  {
>  }
> +static inline struct dax_device *dax_dev_find(dev_t devt)
> +{
> +	return NULL;
> +}
>  static inline void dax_write_cache(struct dax_device *dax_dev, bool wc)
>  {
>  }
> -- 
> 2.53.0
> 
> 
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH V4 9/9] dax: fsdev.c minor formatting cleanup
  2026-06-07 19:32 ` [PATCH V4 0/9] Fixes to the previously-merged drivers/dax/fsdev series John Groves
                     ` (7 preceding siblings ...)
  2026-06-07 19:34   ` [PATCH V4 8/9] dax: replace exported dax_dev_get() with non-allocating dax_dev_find() John Groves
@ 2026-06-07 19:34   ` John Groves
  8 siblings, 0 replies; 25+ messages in thread
From: John Groves @ 2026-06-07 19:34 UTC (permalink / raw)
  To: John Groves, Dan Williams
  Cc: John Groves, Vishal Verma, Dave Jiang, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Miklos Szeredi,
	Alison Schofield, Ira Weiny, Jonathan Cameron,
	nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	John Groves

From: John Groves <John@Groves.net>

Address some comments from Jonathan that were missed in the merged
series. Fix line wrapping in fsdev_dax_recovery_write() and
fsdev_dax_zero_page_range() signatures.

Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Signed-off-by: John Groves <john@groves.net>
---
 drivers/dax/fsdev.c | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
index dcb512625ce65..565d93926ec40 100644
--- a/drivers/dax/fsdev.c
+++ b/drivers/dax/fsdev.c
@@ -45,8 +45,8 @@ static void fsdev_write_dax(void *addr, struct page *page,
 }
 
 static long __fsdev_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
-			long nr_pages, enum dax_access_mode mode, void **kaddr,
-			unsigned long *pfn)
+		long nr_pages, enum dax_access_mode mode, void **kaddr,
+		unsigned long *pfn)
 {
 	struct dev_dax *dev_dax = dax_get_private(dax_dev);
 	size_t size = nr_pages << PAGE_SHIFT;
@@ -80,7 +80,8 @@ static int fsdev_dax_zero_page_range(struct dax_device *dax_dev,
 	long rc;
 
 	WARN_ONCE(nr_pages > 1, "%s: nr_pages > 1\n", __func__);
-	rc = __fsdev_dax_direct_access(dax_dev, pgoff, 1, DAX_ACCESS, &kaddr, NULL);
+	rc = __fsdev_dax_direct_access(dax_dev, pgoff, 1, DAX_ACCESS,
+				       &kaddr, NULL);
 	if (rc < 0)
 		return rc;
 	fsdev_write_dax(kaddr, ZERO_PAGE(0), 0, PAGE_SIZE);
@@ -88,15 +89,15 @@ static int fsdev_dax_zero_page_range(struct dax_device *dax_dev,
 }
 
 static long fsdev_dax_direct_access(struct dax_device *dax_dev,
-		  pgoff_t pgoff, long nr_pages, enum dax_access_mode mode,
-		  void **kaddr, unsigned long *pfn)
+		pgoff_t pgoff, long nr_pages, enum dax_access_mode mode,
+		void **kaddr, unsigned long *pfn)
 {
 	return __fsdev_dax_direct_access(dax_dev, pgoff, nr_pages, mode,
 					 kaddr, pfn);
 }
 
-static size_t fsdev_dax_recovery_write(struct dax_device *dax_dev, pgoff_t pgoff,
-		void *addr, size_t bytes, struct iov_iter *i)
+static size_t fsdev_dax_recovery_write(struct dax_device *dax_dev,
+		pgoff_t pgoff, void *addr, size_t bytes, struct iov_iter *i)
 {
 	return _copy_from_iter_flushcache(addr, bytes, i);
 }
-- 
2.53.0



^ permalink raw reply related	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2026-06-11 17:04 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20260607193224.94244-1-john@jagalactic.com>
2026-06-07 19:32 ` [PATCH V4 0/9] Fixes to the previously-merged drivers/dax/fsdev series John Groves
2026-06-07 19:33   ` [PATCH V4 1/9] dax: fix misleading comment about share/index union in dax_folio_reset_order() John Groves
2026-06-07 19:45     ` sashiko-bot
2026-06-07 19:33   ` [PATCH V4 2/9] dax/fsdev: fix multi-range offset in memory_failure handler John Groves
2026-06-07 19:49     ` sashiko-bot
2026-06-08 10:56     ` Richard Cheng
2026-06-11 16:59       ` John Groves
2026-06-07 19:33   ` [PATCH V4 3/9] dax/fsdev: clear vmemmap_shift when binding static pgmap John Groves
2026-06-07 19:49     ` sashiko-bot
2026-06-07 19:33   ` [PATCH V4 4/9] dax/fsdev: don't leave a dangling dev_dax->pgmap on probe failure John Groves
2026-06-07 19:44     ` sashiko-bot
2026-06-08 21:30     ` Dave Jiang
2026-06-07 19:33   ` [PATCH V4 5/9] dax/fsdev: use __va(phys) for kaddr in direct_access John Groves
2026-06-07 19:44     ` sashiko-bot
2026-06-07 19:34   ` [PATCH V4 6/9] dax/fsdev: fail probe on invalid pgmap offset John Groves
2026-06-07 19:43     ` sashiko-bot
2026-06-08 21:39     ` Dave Jiang
2026-06-07 19:34   ` [PATCH V4 7/9] dax: fix holder_ops race in fs_put_dax() John Groves
2026-06-07 19:49     ` sashiko-bot
2026-06-08 10:52     ` Richard Cheng
2026-06-11 17:01       ` John Groves
2026-06-07 19:34   ` [PATCH V4 8/9] dax: replace exported dax_dev_get() with non-allocating dax_dev_find() John Groves
2026-06-07 19:49     ` sashiko-bot
2026-06-08 10:48     ` Richard Cheng
2026-06-07 19:34   ` [PATCH V4 9/9] dax: fsdev.c minor formatting cleanup John Groves

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox