* [PATCH V5 1/9] dax: fix misleading comment about share/index union in dax_folio_reset_order()
2026-06-11 17:31 ` [PATCH V5 0/9] Fixes to the previously-merged drivers/dax/fsdev series John Groves
@ 2026-06-11 17:31 ` John Groves
2026-06-11 17:31 ` [PATCH V5 2/9] dax/fsdev: fix multi-range offset in memory_failure handler John Groves
` (7 subsequent siblings)
8 siblings, 0 replies; 20+ messages in thread
From: John Groves @ 2026-06-11 17:31 UTC (permalink / raw)
To: John Groves, Dan Williams
Cc: John Groves, Vishal Verma, Dave Jiang, Matthew Wilcox, Jan Kara,
Alexander Viro, Christian Brauner, Miklos Szeredi,
Alison Schofield, Ira Weiny, Jonathan Cameron,
nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
John Groves
From: John Groves <John@Groves.net>
The comment in dax_folio_reset_order() claims that DAX maintains an
invariant where folio->share != 0 only when folio->mapping == NULL,
implying folio->share is zero whenever mapping is non-NULL. This is
misleading because folio->share and folio->index are a union -- for
non-shared folios with mapping != NULL, reading folio->share returns
the file page offset (folio->index), which is typically non-zero.
Reword the comment to accurately describe the union aliasing: the
assignment clears whichever interpretation of the union word is active
(index for non-shared folios, share for shared folios), which is correct
because the folio is being released in either case.
No functional change -- the code was already correct, only the
justification was wrong.
Fixes: 59eb73b98ae0b ("dax: Factor out dax_folio_reset_order() helper")
Reviewed-by: Jonathan Cameron <jic23@kernel.org>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Signed-off-by: John Groves <john@groves.net>
---
fs/dax.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/fs/dax.c b/fs/dax.c
index 6d175cd47a99b..df19c9317d10e 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -392,12 +392,12 @@ int dax_folio_reset_order(struct folio *folio)
int order = folio_order(folio);
/*
- * DAX maintains the invariant that folio->share != 0 only when
- * folio->mapping == NULL (enforced by dax_folio_make_shared()).
- * Equivalently: folio->mapping != NULL implies folio->share == 0.
- * Callers ensure share has been decremented to zero before
- * calling here, so unconditionally clearing both fields is
- * correct.
+ * Clear the mapping and the index/share union word. folio->share
+ * and folio->index occupy the same union in struct folio. For
+ * non-shared folios (mapping != NULL), the union holds folio->index
+ * (file page offset); for shared folios (mapping == NULL), it holds
+ * folio->share (reference count). Either way, we are releasing the
+ * folio and both fields should be zeroed.
*/
folio->mapping = NULL;
folio->share = 0;
--
2.53.0
^ permalink raw reply related [flat|nested] 20+ messages in thread* [PATCH V5 2/9] dax/fsdev: fix multi-range offset in memory_failure handler
2026-06-11 17:31 ` [PATCH V5 0/9] Fixes to the previously-merged drivers/dax/fsdev series John Groves
2026-06-11 17:31 ` [PATCH V5 1/9] dax: fix misleading comment about share/index union in dax_folio_reset_order() John Groves
@ 2026-06-11 17:31 ` John Groves
2026-06-11 17:51 ` sashiko-bot
2026-06-12 3:08 ` Richard Cheng
2026-06-11 17:32 ` [PATCH V5 3/9] dax/fsdev: clear vmemmap_shift when binding static pgmap John Groves
` (6 subsequent siblings)
8 siblings, 2 replies; 20+ messages in thread
From: John Groves @ 2026-06-11 17:31 UTC (permalink / raw)
To: John Groves, Dan Williams
Cc: John Groves, Vishal Verma, Dave Jiang, Matthew Wilcox, Jan Kara,
Alexander Viro, Christian Brauner, Miklos Szeredi,
Alison Schofield, Ira Weiny, Jonathan Cameron,
nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
Richard Cheng, John Groves
From: John Groves <John@Groves.net>
Fix memory_failure offset calculation for multi-range devices. The old code
subtracted ranges[0].range.start from the faulting PFN's physical address,
which produces an incorrect (inflated) logical offset when the PFN falls in
ranges[1] or beyond due to physical gaps between ranges. Add
fsdev_pfn_to_offset() to walk the range list and compute the correct
device-linear byte offset.
Walk the pagemap's own range array (pgmap->ranges[]) rather than
dev_dax->ranges[]. The pgmap copy is the immutable snapshot populated at
probe and is never mutated afterwards, whereas dev_dax->ranges[] can be
krealloc()'d by a concurrent sysfs mapping_store() (under dax_region_rwsem,
which this ->memory_failure callback does not hold). For dynamic devices the
two arrays are identical, so the reported offset is unchanged for the
multi-range case this targets.
Fixes: d5406bd458b0a ("dax: add fsdev.c driver for fs-dax on character dax")
Suggested-by: Richard Cheng <icheng@nvidia.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Signed-off-by: John Groves <john@groves.net>
---
drivers/dax/fsdev.c | 17 ++++++++++++++++-
1 file changed, 16 insertions(+), 1 deletion(-)
diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
index 188b2526bee45..2c5de3d80a618 100644
--- a/drivers/dax/fsdev.c
+++ b/drivers/dax/fsdev.c
@@ -135,11 +135,26 @@ static void fsdev_clear_ops(void *data)
* The core mm code in free_zone_device_folio() handles the wake_up_var()
* directly for this memory type.
*/
+static u64 fsdev_pfn_to_offset(struct dev_pagemap *pgmap, unsigned long pfn)
+{
+ phys_addr_t phys = PFN_PHYS(pfn);
+ u64 offset = 0;
+
+ for (int i = 0; i < pgmap->nr_range; i++) {
+ struct range *range = &pgmap->ranges[i];
+
+ if (phys >= range->start && phys <= range->end)
+ return offset + (phys - range->start);
+ offset += range_len(range);
+ }
+ return -1ULL;
+}
+
static int fsdev_pagemap_memory_failure(struct dev_pagemap *pgmap,
unsigned long pfn, unsigned long nr_pages, int mf_flags)
{
struct dev_dax *dev_dax = pgmap->owner;
- u64 offset = PFN_PHYS(pfn) - dev_dax->ranges[0].range.start;
+ u64 offset = fsdev_pfn_to_offset(pgmap, pfn);
u64 len = nr_pages << PAGE_SHIFT;
return dax_holder_notify_failure(dev_dax->dax_dev, offset,
--
2.53.0
^ permalink raw reply related [flat|nested] 20+ messages in thread* Re: [PATCH V5 2/9] dax/fsdev: fix multi-range offset in memory_failure handler
2026-06-11 17:31 ` [PATCH V5 2/9] dax/fsdev: fix multi-range offset in memory_failure handler John Groves
@ 2026-06-11 17:51 ` sashiko-bot
2026-06-12 3:08 ` Richard Cheng
1 sibling, 0 replies; 20+ messages in thread
From: sashiko-bot @ 2026-06-11 17:51 UTC (permalink / raw)
To: John Groves; +Cc: linux-cxl
Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
New issues:
- [Critical] Incorrect memory failure offset calculation for static DAX devices, causing the wrong filesystem blocks to be marked corrupted.
Pre-existing issues:
- [High] Use-After-Free in dax_pgoff_to_phys() due to concurrent reallocation of dev_dax->ranges array via sysfs.
--
commit 410f6bffae27c2c1b51ab4214947b135248b8594
Author: John Groves <John@Groves.net>
dax/fsdev: fix multi-range offset in memory_failure handler
This commit updates the offset calculation in the memory_failure handler for
multi-range DAX devices. It changes the logic to iterate over the device
pagemap's immutable ranges array instead of the potentially dynamically
reallocated dev_dax ranges array.
> Walk the pagemap's own range array (pgmap->ranges[]) rather than
> dev_dax->ranges[]. The pgmap copy is the immutable snapshot populated at
> probe and is never mutated afterwards, whereas dev_dax->ranges[] can be
> krealloc()'d by a concurrent sysfs mapping_store() (under dax_region_rwsem,
> which this ->memory_failure callback does not hold). For dynamic devices the
> two arrays are identical, so the reported offset is unchanged for the
> multi-range case this targets.
[Severity: High]
This isn't a bug introduced by this patch, but since the commit message
mentions that dev_dax->ranges can be krealloc()'d by a concurrent sysfs
mapping_store(), does this expose a use-after-free risk in
dax_pgoff_to_phys()?
Unlike size_store() which checks if the device is bound to a driver,
mapping_store() appears to lack this check and can invoke
alloc_dev_dax_range() to reallocate the array while the device is
actively in use:
drivers/dax/bus.c:mapping_store() {
...
rc = alloc_dev_dax_range(dev_dax, r.start, to_alloc);
...
}
Meanwhile, filesystem I/O operations can locklessly iterate over the ranges
array:
drivers/dax/bus.c:dax_pgoff_to_phys() {
...
for (i = 0; i < dev_dax->nr_range; i++) {
struct dev_dax_range *dax_range = &dev_dax->ranges[i];
...
}
Could a concurrent sysfs write to the mapping attribute cause a
use-after-free if dax_pgoff_to_phys() is iterating the old ranges
array that was freed by krealloc()?
> diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
> index 188b2526bee45..2c5de3d80a618 100644
> --- a/drivers/dax/fsdev.c
> +++ b/drivers/dax/fsdev.c
[ ... ]
> @@ -135,11 +135,26 @@ static void fsdev_clear_ops(void *data)
> * The core mm code in free_zone_device_folio() handles the wake_up_var()
> * directly for this memory type.
> */
> +static u64 fsdev_pfn_to_offset(struct dev_pagemap *pgmap, unsigned long pfn)
> +{
> + phys_addr_t phys = PFN_PHYS(pfn);
> + u64 offset = 0;
> +
> + for (int i = 0; i < pgmap->nr_range; i++) {
> + struct range *range = &pgmap->ranges[i];
> +
> + if (phys >= range->start && phys <= range->end)
> + return offset + (phys - range->start);
[Severity: Critical]
Does this calculation incorrectly include the data_offset padding for static
DAX devices?
For static DAX devices, the device's physical range starts with a metadata
padding area (data_offset). The old code calculated the offset based on
dev_dax->ranges[0].range.start, which represents the start of the actual
user data.
By using pgmap->ranges[0].start instead, which includes this metadata
padding, the logical offset might be inflated by data_offset. Could this
cause the filesystem to receive an incorrect offset in
dax_holder_notify_failure(), potentially resulting in poisoning the wrong
filesystem blocks while leaving the actual corrupted blocks unmarked?
> + offset += range_len(range);
> + }
> + return -1ULL;
> +}
> +
--
Sashiko AI review · https://sashiko.dev/#/patchset/0100019eb7bcda4b-3f8edae9-d7a4-4bfa-aaea-fcef77fdbbc3-000000@email.amazonses.com?part=2
^ permalink raw reply [flat|nested] 20+ messages in thread* Re: [PATCH V5 2/9] dax/fsdev: fix multi-range offset in memory_failure handler
2026-06-11 17:31 ` [PATCH V5 2/9] dax/fsdev: fix multi-range offset in memory_failure handler John Groves
2026-06-11 17:51 ` sashiko-bot
@ 2026-06-12 3:08 ` Richard Cheng
1 sibling, 0 replies; 20+ messages in thread
From: Richard Cheng @ 2026-06-12 3:08 UTC (permalink / raw)
To: John Groves
Cc: John Groves, Dan Williams, John Groves, Vishal Verma, Dave Jiang,
Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
Miklos Szeredi, Alison Schofield, Ira Weiny, Jonathan Cameron,
nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org
On Thu, Jun 11, 2026 at 05:31:59PM +0800, John Groves wrote:
> From: John Groves <John@Groves.net>
>
> Fix memory_failure offset calculation for multi-range devices. The old code
> subtracted ranges[0].range.start from the faulting PFN's physical address,
> which produces an incorrect (inflated) logical offset when the PFN falls in
> ranges[1] or beyond due to physical gaps between ranges. Add
> fsdev_pfn_to_offset() to walk the range list and compute the correct
> device-linear byte offset.
>
> Walk the pagemap's own range array (pgmap->ranges[]) rather than
> dev_dax->ranges[]. The pgmap copy is the immutable snapshot populated at
> probe and is never mutated afterwards, whereas dev_dax->ranges[] can be
> krealloc()'d by a concurrent sysfs mapping_store() (under dax_region_rwsem,
> which this ->memory_failure callback does not hold). For dynamic devices the
> two arrays are identical, so the reported offset is unchanged for the
> multi-range case this targets.
>
> Fixes: d5406bd458b0a ("dax: add fsdev.c driver for fs-dax on character dax")
>
> Suggested-by: Richard Cheng <icheng@nvidia.com>
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> Reviewed-by: Alison Schofield <alison.schofield@intel.com>
> Signed-off-by: John Groves <john@groves.net>
> ---
> drivers/dax/fsdev.c | 17 ++++++++++++++++-
> 1 file changed, 16 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
> index 188b2526bee45..2c5de3d80a618 100644
> --- a/drivers/dax/fsdev.c
> +++ b/drivers/dax/fsdev.c
> @@ -135,11 +135,26 @@ static void fsdev_clear_ops(void *data)
> * The core mm code in free_zone_device_folio() handles the wake_up_var()
> * directly for this memory type.
> */
> +static u64 fsdev_pfn_to_offset(struct dev_pagemap *pgmap, unsigned long pfn)
> +{
> + phys_addr_t phys = PFN_PHYS(pfn);
> + u64 offset = 0;
> +
> + for (int i = 0; i < pgmap->nr_range; i++) {
> + struct range *range = &pgmap->ranges[i];
> +
> + if (phys >= range->start && phys <= range->end)
> + return offset + (phys - range->start);
> + offset += range_len(range);
> + }
> + return -1ULL;
> +}
> +
> static int fsdev_pagemap_memory_failure(struct dev_pagemap *pgmap,
> unsigned long pfn, unsigned long nr_pages, int mf_flags)
> {
> struct dev_dax *dev_dax = pgmap->owner;
> - u64 offset = PFN_PHYS(pfn) - dev_dax->ranges[0].range.start;
> + u64 offset = fsdev_pfn_to_offset(pgmap, pfn);
Hi John,
I think this regresses static devices. pgmap->ranges[0].start can sit
data_offset below it on a static device, so the new offset = old + data_offset,
and XFS poisons the wrong blocks.
The gap walk only helps dynamic devices where data_offset ==0 . Maybe walking pgmap->ranges and
substract the probe's data_offset.
--Richard
> u64 len = nr_pages << PAGE_SHIFT;
>
> return dax_holder_notify_failure(dev_dax->dax_dev, offset,
> --
> 2.53.0
>
^ permalink raw reply [flat|nested] 20+ messages in thread
* [PATCH V5 3/9] dax/fsdev: clear vmemmap_shift when binding static pgmap
2026-06-11 17:31 ` [PATCH V5 0/9] Fixes to the previously-merged drivers/dax/fsdev series John Groves
2026-06-11 17:31 ` [PATCH V5 1/9] dax: fix misleading comment about share/index union in dax_folio_reset_order() John Groves
2026-06-11 17:31 ` [PATCH V5 2/9] dax/fsdev: fix multi-range offset in memory_failure handler John Groves
@ 2026-06-11 17:32 ` John Groves
2026-06-11 17:55 ` sashiko-bot
2026-06-12 2:56 ` Richard Cheng
2026-06-11 17:32 ` [PATCH V5 4/9] dax/fsdev: don't leave a dangling dev_dax->pgmap on probe failure John Groves
` (5 subsequent siblings)
8 siblings, 2 replies; 20+ messages in thread
From: John Groves @ 2026-06-11 17:32 UTC (permalink / raw)
To: John Groves, Dan Williams
Cc: John Groves, Vishal Verma, Dave Jiang, Matthew Wilcox, Jan Kara,
Alexander Viro, Christian Brauner, Miklos Szeredi,
Alison Schofield, Ira Weiny, Jonathan Cameron,
nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
John Groves
From: John Groves <John@Groves.net>
Clear pgmap->vmemmap_shift for static DAX devices. When rebinding a static
device from device_dax (which may set vmemmap_shift based on alignment) to
fsdev_dax, the stale vmemmap_shift persists on the shared pgmap. Explicitly
zero it before devm_memremap_pages() so the vmemmap is built for order-0
folios as fsdev requires.
Fixes: d5406bd458b0a ("dax: add fsdev.c driver for fs-dax on character dax")
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Signed-off-by: John Groves <john@groves.net>
---
drivers/dax/fsdev.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
index 2c5de3d80a618..52f46b3e245ea 100644
--- a/drivers/dax/fsdev.c
+++ b/drivers/dax/fsdev.c
@@ -237,6 +237,7 @@ static int fsdev_dax_probe(struct dev_dax *dev_dax)
}
pgmap = dev_dax->pgmap;
+ pgmap->vmemmap_shift = 0;
} else {
size_t pgmap_size;
--
2.53.0
^ permalink raw reply related [flat|nested] 20+ messages in thread* Re: [PATCH V5 3/9] dax/fsdev: clear vmemmap_shift when binding static pgmap
2026-06-11 17:32 ` [PATCH V5 3/9] dax/fsdev: clear vmemmap_shift when binding static pgmap John Groves
@ 2026-06-11 17:55 ` sashiko-bot
2026-06-12 2:56 ` Richard Cheng
1 sibling, 0 replies; 20+ messages in thread
From: sashiko-bot @ 2026-06-11 17:55 UTC (permalink / raw)
To: John Groves; +Cc: linux-cxl
Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
New issues:
- [High] Rebinding a static DAX device from fsdev_dax to device_dax (or unloading the fsdev_dax module) leaves a dangling `pgmap->ops` pointer on the persistent static `pgmap` structure, risking a use-after-free on hardware memory failure.
Pre-existing issues:
- [High] Probe functions in fsdev_dax re-initialize an embedded cdev via `cdev_init()`, clobbering its kobject state if there are open file descriptors from a previous bind.
--
commit cfa8354a7e331532bf1a69e2ea2e854ab1b90742
Author: John Groves <John@Groves.net>
dax/fsdev: clear vmemmap_shift when binding static pgmap
Clear pgmap->vmemmap_shift for static DAX devices. When rebinding a
static device from device_dax to fsdev_dax, the stale vmemmap_shift
persists on the shared pgmap. Explicitly zero it before
devm_memremap_pages() so the vmemmap is built for order-0 folios.
> diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
> index 2c5de3d80a618..52f46b3e245ea 100644
> --- a/drivers/dax/fsdev.c
> +++ b/drivers/dax/fsdev.c
> @@ -237,6 +237,7 @@ static int fsdev_dax_probe(struct dev_dax *dev_dax)
> }
>
> pgmap = dev_dax->pgmap;
> + pgmap->vmemmap_shift = 0;
[Severity: High]
Since we are fixing the stale vmemmap_shift on the shared pgmap here, do we
also need to handle the pgmap->ops pointer?
If fsdev_dax unbinds or the module is unloaded, it doesn't appear that we
clear pgmap->ops, which is set to &fsdev_pagemap_ops later in
fsdev_dax_probe(). Additionally, if device_dax is bound to the same device
instead, it appears to update pgmap->type but leave pgmap->ops unchanged.
If a hardware memory error occurs on the region after unbind, will
memory_failure_dev_pagemap() attempt to call the dangling
pgmap->ops->memory_failure pointer:
mm/memory-failure.c:memory_failure_dev_pagemap()
if (pgmap->ops && pgmap->ops->memory_failure) {
rc = pgmap->ops->memory_failure(pgmap, pfn, 1, flags);
and cause a use-after-free or panic?
[Severity: High]
This isn't a bug introduced by this patch, but does the embedded cdev
initialization in this probe function clobber the kobject state?
If there are open file descriptors from a previous bind, and
fsdev_dax_probe() re-initializes the embedded cdev later in the function:
drivers/dax/fsdev.c:fsdev_dax_probe()
cdev_init(cdev, &fsdev_fops);
will cdev_init() destroy the active kobject's refcount and zero out the
cdev->list? Could this cause list corruption when the stale file descriptors
are later closed, or mismatched kobject refcounting that frees the cdev
while the new driver instance is using it?
> } else {
> size_t pgmap_size;
>
--
Sashiko AI review · https://sashiko.dev/#/patchset/0100019eb7bcda4b-3f8edae9-d7a4-4bfa-aaea-fcef77fdbbc3-000000@email.amazonses.com?part=3
^ permalink raw reply [flat|nested] 20+ messages in thread* Re: [PATCH V5 3/9] dax/fsdev: clear vmemmap_shift when binding static pgmap
2026-06-11 17:32 ` [PATCH V5 3/9] dax/fsdev: clear vmemmap_shift when binding static pgmap John Groves
2026-06-11 17:55 ` sashiko-bot
@ 2026-06-12 2:56 ` Richard Cheng
1 sibling, 0 replies; 20+ messages in thread
From: Richard Cheng @ 2026-06-12 2:56 UTC (permalink / raw)
To: John Groves
Cc: John Groves, Dan Williams, John Groves, Vishal Verma, Dave Jiang,
Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
Miklos Szeredi, Alison Schofield, Ira Weiny, Jonathan Cameron,
nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org
On Thu, Jun 11, 2026 at 05:32:07PM +0800, John Groves wrote:
> From: John Groves <John@Groves.net>
>
> Clear pgmap->vmemmap_shift for static DAX devices. When rebinding a static
> device from device_dax (which may set vmemmap_shift based on alignment) to
> fsdev_dax, the stale vmemmap_shift persists on the shared pgmap. Explicitly
> zero it before devm_memremap_pages() so the vmemmap is built for order-0
> folios as fsdev requires.
>
> Fixes: d5406bd458b0a ("dax: add fsdev.c driver for fs-dax on character dax")
>
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> Reviewed-by: Alison Schofield <alison.schofield@intel.com>
> Signed-off-by: John Groves <john@groves.net>
> ---
> drivers/dax/fsdev.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
> index 2c5de3d80a618..52f46b3e245ea 100644
> --- a/drivers/dax/fsdev.c
> +++ b/drivers/dax/fsdev.c
> @@ -237,6 +237,7 @@ static int fsdev_dax_probe(struct dev_dax *dev_dax)
> }
>
> pgmap = dev_dax->pgmap;
> + pgmap->vmemmap_shift = 0;
Hello John,
I would suggest also clearing pgmap->ops and pgmap->owner on teardown.
fsdev also writes them but never clears them. memuunmap_pages() leaves the
descriptor intact and kill_dev_dax() only NULLs dev_dax->pgmap for !static case.
After fsdev unbind the stale ops survive.
If we do rmmod later a HW failure dispatch pgmap->ops->memory_failure.
--Richard
> } else {
> size_t pgmap_size;
>
> --
> 2.53.0
>
>
^ permalink raw reply [flat|nested] 20+ messages in thread
* [PATCH V5 4/9] dax/fsdev: don't leave a dangling dev_dax->pgmap on probe failure
2026-06-11 17:31 ` [PATCH V5 0/9] Fixes to the previously-merged drivers/dax/fsdev series John Groves
` (2 preceding siblings ...)
2026-06-11 17:32 ` [PATCH V5 3/9] dax/fsdev: clear vmemmap_shift when binding static pgmap John Groves
@ 2026-06-11 17:32 ` John Groves
2026-06-11 18:04 ` sashiko-bot
2026-06-11 17:32 ` [PATCH V5 5/9] dax/fsdev: use __va(phys) for kaddr in direct_access John Groves
` (4 subsequent siblings)
8 siblings, 1 reply; 20+ messages in thread
From: John Groves @ 2026-06-11 17:32 UTC (permalink / raw)
To: John Groves, Dan Williams
Cc: John Groves, Vishal Verma, Dave Jiang, Matthew Wilcox, Jan Kara,
Alexander Viro, Christian Brauner, Miklos Szeredi,
Alison Schofield, Ira Weiny, Jonathan Cameron,
nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
John Groves
From: John Groves <John@Groves.net>
After the dynamic path set dev_dax->pgmap, any later probe failure left
dev_dax->pgmap dangling: devres frees the devm_kzalloc'd pgmap on probe
failure, and subsequent probe attempts would hit the "dynamic-dax with
pre-populated page map" check and fail permanently.
Factor pgmap acquisition out into fsdev_acquire_pgmap(), and defer the
dev_dax->pgmap assignment until probe can no longer fail. A failed probe
now never publishes the pointer at all, so there is nothing to unwind.
This also matches kill_dev_dax(), which already clears the dynamic pgmap
pointer on unbind: dev_dax->pgmap is now non-NULL only while the pgmap
is actually valid.
Refactor suggested by Dave Jiang.
Fixes: d5406bd458b0a ("dax: add fsdev.c driver for fs-dax on character dax")
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: John Groves <john@groves.net>
---
drivers/dax/fsdev.c | 77 ++++++++++++++++++++++++++++-----------------
1 file changed, 49 insertions(+), 28 deletions(-)
diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
index 52f46b3e245ea..cc097167ad2c7 100644
--- a/drivers/dax/fsdev.c
+++ b/drivers/dax/fsdev.c
@@ -219,47 +219,62 @@ static const struct file_operations fsdev_fops = {
.release = fsdev_release,
};
-static int fsdev_dax_probe(struct dev_dax *dev_dax)
+/*
+ * Acquire the dev_pagemap for probe: the static (pre-populated) one if
+ * present, or a devm-allocated one for the dynamic case. Note that
+ * dev_dax->pgmap is not set here; fsdev_dax_probe() sets it only once
+ * probe succeeds, so a failed probe never leaves a dangling pointer
+ * to a devres-freed pgmap.
+ */
+static struct dev_pagemap *fsdev_acquire_pgmap(struct dev_dax *dev_dax)
{
- struct dax_device *dax_dev = dev_dax->dax_dev;
struct device *dev = &dev_dax->dev;
struct dev_pagemap *pgmap;
- struct inode *inode;
- u64 data_offset = 0;
- struct cdev *cdev;
- void *addr;
- int rc, i;
+ size_t pgmap_size;
if (static_dev_dax(dev_dax)) {
if (dev_dax->nr_range > 1) {
- dev_warn(dev, "static pgmap / multi-range device conflict\n");
- return -EINVAL;
+ dev_warn(dev,
+ "static pgmap / multi-range device conflict\n");
+ return ERR_PTR(-EINVAL);
}
pgmap = dev_dax->pgmap;
pgmap->vmemmap_shift = 0;
- } else {
- size_t pgmap_size;
+ return pgmap;
+ }
- if (dev_dax->pgmap) {
- dev_warn(dev, "dynamic-dax with pre-populated page map\n");
- return -EINVAL;
- }
+ if (dev_dax->pgmap) {
+ dev_warn(dev, "dynamic-dax with pre-populated page map\n");
+ return ERR_PTR(-EINVAL);
+ }
- pgmap_size = struct_size(pgmap, ranges, dev_dax->nr_range - 1);
- pgmap = devm_kzalloc(dev, pgmap_size, GFP_KERNEL);
- if (!pgmap)
- return -ENOMEM;
+ pgmap_size = struct_size(pgmap, ranges, dev_dax->nr_range - 1);
+ pgmap = devm_kzalloc(dev, pgmap_size, GFP_KERNEL);
+ if (!pgmap)
+ return ERR_PTR(-ENOMEM);
- pgmap->nr_range = dev_dax->nr_range;
- dev_dax->pgmap = pgmap;
+ pgmap->nr_range = dev_dax->nr_range;
+ for (int i = 0; i < dev_dax->nr_range; i++)
+ pgmap->ranges[i] = dev_dax->ranges[i].range;
- for (i = 0; i < dev_dax->nr_range; i++) {
- struct range *range = &dev_dax->ranges[i].range;
+ return pgmap;
+}
- pgmap->ranges[i] = *range;
- }
- }
+static int fsdev_dax_probe(struct dev_dax *dev_dax)
+{
+ struct dax_device *dax_dev = dev_dax->dax_dev;
+ struct device *dev = &dev_dax->dev;
+ struct dev_pagemap *pgmap;
+ struct inode *inode;
+ u64 data_offset = 0;
+ struct cdev *cdev;
+ void *addr;
+ int rc, i;
+
+ pgmap = fsdev_acquire_pgmap(dev_dax);
+ if (IS_ERR(pgmap))
+ return PTR_ERR(pgmap);
for (i = 0; i < dev_dax->nr_range; i++) {
struct range *range = &dev_dax->ranges[i].range;
@@ -306,7 +321,7 @@ static int fsdev_dax_probe(struct dev_dax *dev_dax)
/* Detect whether the data is at a non-zero offset into the memory */
if (pgmap->range.start != dev_dax->ranges[0].range.start) {
u64 phys = dev_dax->ranges[0].range.start;
- u64 pgmap_phys = dev_dax->pgmap[0].range.start;
+ u64 pgmap_phys = pgmap[0].range.start;
if (!WARN_ON(pgmap_phys > phys))
data_offset = phys - pgmap_phys;
@@ -339,7 +354,13 @@ static int fsdev_dax_probe(struct dev_dax *dev_dax)
return rc;
run_dax(dax_dev);
- return devm_add_action_or_reset(dev, fsdev_kill, dev_dax);
+ rc = devm_add_action_or_reset(dev, fsdev_kill, dev_dax);
+ if (rc)
+ return rc;
+
+ /* Probe can no longer fail; expose the pgmap via dev_dax */
+ dev_dax->pgmap = pgmap;
+ return 0;
}
static struct dax_device_driver fsdev_dax_driver = {
--
2.53.0
^ permalink raw reply related [flat|nested] 20+ messages in thread* Re: [PATCH V5 4/9] dax/fsdev: don't leave a dangling dev_dax->pgmap on probe failure
2026-06-11 17:32 ` [PATCH V5 4/9] dax/fsdev: don't leave a dangling dev_dax->pgmap on probe failure John Groves
@ 2026-06-11 18:04 ` sashiko-bot
0 siblings, 0 replies; 20+ messages in thread
From: sashiko-bot @ 2026-06-11 18:04 UTC (permalink / raw)
To: John Groves; +Cc: linux-cxl
Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
Pre-existing issues:
- [High] Unpatched dangling pointer in device.c leading to permanent probe failure and potential UAF
- [High] Concurrent sysfs mapping_store() execution causes TOCTOU heap buffer overflow and Use-After-Free in probe pathways
--
commit f0cb55dd27105cfc1116641d11d0187ba63297c0
Author: John Groves <John@Groves.net>
dax/fsdev: don't leave a dangling dev_dax->pgmap on probe failure
This patch factors out pgmap acquisition into fsdev_acquire_pgmap() and
defers the dev_dax->pgmap assignment until the probe can no longer fail.
This ensures a failed probe does not leave a dangling pointer.
> diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
> index 52f46b3e245ea..cc097167ad2c7 100644
> --- a/drivers/dax/fsdev.c
> +++ b/drivers/dax/fsdev.c
> @@ -219,47 +219,62 @@ static const struct file_operations fsdev_fops = {
> .release = fsdev_release,
> };
>
> -static int fsdev_dax_probe(struct dev_dax *dev_dax)
> +/*
> + * Acquire the dev_pagemap for probe: the static (pre-populated) one if
> + * present, or a devm-allocated one for the dynamic case. Note that
> + * dev_dax->pgmap is not set here; fsdev_dax_probe() sets it only once
> + * probe succeeds, so a failed probe never leaves a dangling pointer
> + * to a devres-freed pgmap.
> + */
> +static struct dev_pagemap *fsdev_acquire_pgmap(struct dev_dax *dev_dax)
> {
[ ... ]
> + if (dev_dax->pgmap) {
> + dev_warn(dev, "dynamic-dax with pre-populated page map\n");
> + return ERR_PTR(-EINVAL);
> + }
>
> - pgmap_size = struct_size(pgmap, ranges, dev_dax->nr_range - 1);
> - pgmap = devm_kzalloc(dev, pgmap_size, GFP_KERNEL);
> - if (!pgmap)
> - return -ENOMEM;
> + pgmap_size = struct_size(pgmap, ranges, dev_dax->nr_range - 1);
> + pgmap = devm_kzalloc(dev, pgmap_size, GFP_KERNEL);
> + if (!pgmap)
> + return ERR_PTR(-ENOMEM);
[Severity: High]
This is a pre-existing issue, but does this allocation race with a concurrent
sysfs mapping_store() execution?
Since mapping_store() in drivers/dax/bus.c dynamically adds memory ranges
without checking if dev->driver is bound, it can run concurrently while the
driver is probing.
If dev_dax->nr_range increments after pgmap_size is calculated but before
the loop below executes, couldn't this cause a heap buffer overflow when
writing to pgmap->ranges[i]?
> - pgmap->nr_range = dev_dax->nr_range;
> - dev_dax->pgmap = pgmap;
> + pgmap->nr_range = dev_dax->nr_range;
> + for (int i = 0; i < dev_dax->nr_range; i++)
> + pgmap->ranges[i] = dev_dax->ranges[i].range;
>
> - for (i = 0; i < dev_dax->nr_range; i++) {
> - struct range *range = &dev_dax->ranges[i].range;
> + return pgmap;
> +}
>
> - pgmap->ranges[i] = *range;
> - }
> - }
> +static int fsdev_dax_probe(struct dev_dax *dev_dax)
> +{
> + struct dax_device *dax_dev = dev_dax->dax_dev;
> + struct device *dev = &dev_dax->dev;
> + struct dev_pagemap *pgmap;
> + struct inode *inode;
> + u64 data_offset = 0;
> + struct cdev *cdev;
> + void *addr;
> + int rc, i;
> +
> + pgmap = fsdev_acquire_pgmap(dev_dax);
[Severity: High]
This isn't a bug introduced by this patch, but since you are fixing the
dangling dev_dax->pgmap pointer vulnerability here in fsdev_dax_probe(),
should the exact same fix be applied to dev_dax_probe() in drivers/dax/device.c?
It appears device.c still assigns dev_dax->pgmap immediately after
devm_kzalloc(), which would leave it dangling if a subsequent operation like
devm_request_mem_region() fails, causing subsequent probe attempts to
permanently fail.
--
Sashiko AI review · https://sashiko.dev/#/patchset/0100019eb7bcda4b-3f8edae9-d7a4-4bfa-aaea-fcef77fdbbc3-000000@email.amazonses.com?part=4
^ permalink raw reply [flat|nested] 20+ messages in thread
* [PATCH V5 5/9] dax/fsdev: use __va(phys) for kaddr in direct_access
2026-06-11 17:31 ` [PATCH V5 0/9] Fixes to the previously-merged drivers/dax/fsdev series John Groves
` (3 preceding siblings ...)
2026-06-11 17:32 ` [PATCH V5 4/9] dax/fsdev: don't leave a dangling dev_dax->pgmap on probe failure John Groves
@ 2026-06-11 17:32 ` John Groves
2026-06-11 17:32 ` [PATCH V5 6/9] dax/fsdev: fail probe on invalid pgmap offset John Groves
` (3 subsequent siblings)
8 siblings, 0 replies; 20+ messages in thread
From: John Groves @ 2026-06-11 17:32 UTC (permalink / raw)
To: John Groves, Dan Williams
Cc: John Groves, Vishal Verma, Dave Jiang, Matthew Wilcox, Jan Kara,
Alexander Viro, Christian Brauner, Miklos Szeredi,
Alison Schofield, Ira Weiny, Jonathan Cameron,
nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
John Groves
From: John Groves <John@Groves.net>
Use __va(phys) instead of virt_addr + linear_offset for the kaddr
return in __fsdev_dax_direct_access(). The previous code added a
device-linear byte offset to virt_addr (which is __va of ranges[0]),
but for multi-range devices with physical gaps between ranges, this
linear arithmetic crosses the gap and produces a wrong kernel virtual
address. Using __va(phys) where phys comes from dax_pgoff_to_phys()
is correct for any range layout because the direct map translates
each physical address independently.
This leaves dev_dax->virt_addr write-only, so remove the field
(suggested by Dave Jiang).
Fixes: 759455848df0b ("dax: Save the kva from memremap")
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Signed-off-by: John Groves <john@groves.net>
---
drivers/dax/dax-private.h | 2 --
drivers/dax/fsdev.c | 8 ++------
2 files changed, 2 insertions(+), 8 deletions(-)
diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
index 81e4af49e39c1..607a53a91f58b 100644
--- a/drivers/dax/dax-private.h
+++ b/drivers/dax/dax-private.h
@@ -69,7 +69,6 @@ struct dev_dax_range {
* data while the device is activated in the driver.
* @region: parent region
* @dax_dev: core dax functionality
- * @virt_addr: kva from memremap; used by fsdev_dax
* @cached_size: size of daxdev cached by fsdev_dax
* @align: alignment of this instance
* @target_node: effective numa node if dev_dax memory range is onlined
@@ -85,7 +84,6 @@ struct dev_dax_range {
struct dev_dax {
struct dax_region *region;
struct dax_device *dax_dev;
- void *virt_addr;
u64 cached_size;
unsigned int align;
int target_node;
diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
index cc097167ad2c7..07de6bfbbf673 100644
--- a/drivers/dax/fsdev.c
+++ b/drivers/dax/fsdev.c
@@ -51,9 +51,7 @@ static long __fsdev_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
struct dev_dax *dev_dax = dax_get_private(dax_dev);
size_t size = nr_pages << PAGE_SHIFT;
size_t offset = pgoff << PAGE_SHIFT;
- void *virt_addr = dev_dax->virt_addr + offset;
phys_addr_t phys;
- unsigned long local_pfn;
phys = dax_pgoff_to_phys(dev_dax, pgoff, size);
if (phys == -1) {
@@ -63,11 +61,10 @@ static long __fsdev_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
}
if (kaddr)
- *kaddr = virt_addr;
+ *kaddr = __va(phys);
- local_pfn = PHYS_PFN(phys);
if (pfn)
- *pfn = local_pfn;
+ *pfn = PHYS_PFN(phys);
/*
* Use cached_size which was computed at probe time. The size cannot
@@ -329,7 +326,6 @@ static int fsdev_dax_probe(struct dev_dax *dev_dax)
pr_debug("%s: offset detected phys=%llx pgmap_phys=%llx offset=%llx\n",
__func__, phys, pgmap_phys, data_offset);
}
- dev_dax->virt_addr = addr + data_offset;
inode = dax_inode(dax_dev);
cdev = inode->i_cdev;
--
2.53.0
^ permalink raw reply related [flat|nested] 20+ messages in thread* [PATCH V5 6/9] dax/fsdev: fail probe on invalid pgmap offset
2026-06-11 17:31 ` [PATCH V5 0/9] Fixes to the previously-merged drivers/dax/fsdev series John Groves
` (4 preceding siblings ...)
2026-06-11 17:32 ` [PATCH V5 5/9] dax/fsdev: use __va(phys) for kaddr in direct_access John Groves
@ 2026-06-11 17:32 ` John Groves
2026-06-11 18:09 ` Gupta, Pankaj
2026-06-11 18:13 ` sashiko-bot
2026-06-11 17:32 ` [PATCH V5 7/9] dax: read holder_ops once in dax_holder_notify_failure() John Groves
` (2 subsequent siblings)
8 siblings, 2 replies; 20+ messages in thread
From: John Groves @ 2026-06-11 17:32 UTC (permalink / raw)
To: John Groves, Dan Williams
Cc: John Groves, Vishal Verma, Dave Jiang, Matthew Wilcox, Jan Kara,
Alexander Viro, Christian Brauner, Miklos Szeredi,
Alison Schofield, Ira Weiny, Jonathan Cameron,
nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
John Groves
From: John Groves <John@Groves.net>
Convert the WARN_ON to a fatal error when pgmap_phys > phys. This
condition means the remapped region starts after the device's data
region, which is an impossible state. Previously the probe continued
with data_offset=0, leaving virt_addr silently misaligned. Now probe
returns -EINVAL with a diagnostic message.
Fixes: 759455848df0b ("dax: Save the kva from memremap")
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Signed-off-by: John Groves <john@groves.net>
---
drivers/dax/fsdev.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
index 07de6bfbbf673..71d2bee1e2805 100644
--- a/drivers/dax/fsdev.c
+++ b/drivers/dax/fsdev.c
@@ -320,8 +320,12 @@ static int fsdev_dax_probe(struct dev_dax *dev_dax)
u64 phys = dev_dax->ranges[0].range.start;
u64 pgmap_phys = pgmap[0].range.start;
- if (!WARN_ON(pgmap_phys > phys))
- data_offset = phys - pgmap_phys;
+ if (pgmap_phys > phys) {
+ dev_err(dev, "pgmap start %#llx exceeds data start %#llx\n",
+ pgmap_phys, phys);
+ return -EINVAL;
+ }
+ data_offset = phys - pgmap_phys;
pr_debug("%s: offset detected phys=%llx pgmap_phys=%llx offset=%llx\n",
__func__, phys, pgmap_phys, data_offset);
--
2.53.0
^ permalink raw reply related [flat|nested] 20+ messages in thread* Re: [PATCH V5 6/9] dax/fsdev: fail probe on invalid pgmap offset
2026-06-11 17:32 ` [PATCH V5 6/9] dax/fsdev: fail probe on invalid pgmap offset John Groves
@ 2026-06-11 18:09 ` Gupta, Pankaj
2026-06-11 18:13 ` sashiko-bot
1 sibling, 0 replies; 20+ messages in thread
From: Gupta, Pankaj @ 2026-06-11 18:09 UTC (permalink / raw)
To: John Groves, John Groves, Dan Williams
Cc: John Groves, Vishal Verma, Dave Jiang, Matthew Wilcox, Jan Kara,
Alexander Viro, Christian Brauner, Miklos Szeredi,
Alison Schofield, Ira Weiny, Jonathan Cameron,
nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org
> From: John Groves <John@Groves.net>
>
> Convert the WARN_ON to a fatal error when pgmap_phys > phys. This
> condition means the remapped region starts after the device's data
> region, which is an impossible state. Previously the probe continued
> with data_offset=0, leaving virt_addr silently misaligned. Now probe
> returns -EINVAL with a diagnostic message.
>
> Fixes: 759455848df0b ("dax: Save the kva from memremap")
>
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> Reviewed-by: Alison Schofield <alison.schofield@intel.com>
> Signed-off-by: John Groves <john@groves.net>
Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
> ---
> drivers/dax/fsdev.c | 8 ++++++--
> 1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
> index 07de6bfbbf673..71d2bee1e2805 100644
> --- a/drivers/dax/fsdev.c
> +++ b/drivers/dax/fsdev.c
> @@ -320,8 +320,12 @@ static int fsdev_dax_probe(struct dev_dax *dev_dax)
> u64 phys = dev_dax->ranges[0].range.start;
> u64 pgmap_phys = pgmap[0].range.start;
>
> - if (!WARN_ON(pgmap_phys > phys))
> - data_offset = phys - pgmap_phys;
> + if (pgmap_phys > phys) {
> + dev_err(dev, "pgmap start %#llx exceeds data start %#llx\n",
> + pgmap_phys, phys);
> + return -EINVAL;
> + }
> + data_offset = phys - pgmap_phys;
>
> pr_debug("%s: offset detected phys=%llx pgmap_phys=%llx offset=%llx\n",
> __func__, phys, pgmap_phys, data_offset);
^ permalink raw reply [flat|nested] 20+ messages in thread* Re: [PATCH V5 6/9] dax/fsdev: fail probe on invalid pgmap offset
2026-06-11 17:32 ` [PATCH V5 6/9] dax/fsdev: fail probe on invalid pgmap offset John Groves
2026-06-11 18:09 ` Gupta, Pankaj
@ 2026-06-11 18:13 ` sashiko-bot
1 sibling, 0 replies; 20+ messages in thread
From: sashiko-bot @ 2026-06-11 18:13 UTC (permalink / raw)
To: John Groves; +Cc: linux-cxl
Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
Pre-existing issues:
- [High] Validation check for 'pgmap_phys > phys' executes too late, allowing 'fsdev_clear_folio_state()' to process out-of-bounds PFNs and crash the kernel.
--
commit e78e60f7b38932158f6316aad6c7a8308838e5b2
Author: John Groves <John@Groves.net>
dax/fsdev: fail probe on invalid pgmap offset
This commit converts a WARN_ON to a fatal error when pgmap_phys exceeds phys,
preventing probe completion with an impossible state. It now returns -EINVAL
with an error message instead of proceeding with an invalid offset.
> diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
> --- a/drivers/dax/fsdev.c
> +++ b/drivers/dax/fsdev.c
> @@ -320,8 +320,12 @@ static int fsdev_dax_probe(struct dev_dax *dev_dax)
> u64 phys = dev_dax->ranges[0].range.start;
> u64 pgmap_phys = pgmap[0].range.start;
>
> - if (!WARN_ON(pgmap_phys > phys))
> - data_offset = phys - pgmap_phys;
> + if (pgmap_phys > phys) {
> + dev_err(dev, "pgmap start %#llx exceeds data start %#llx\n",
> + pgmap_phys, phys);
> + return -EINVAL;
> + }
[Severity: High]
This is a pre-existing issue, but does this validation check execute too late
to prevent a crash?
Looking at fsdev_dax_probe(), the call to fsdev_clear_folio_state() happens
just before this check is reached. fsdev_clear_folio_state() iterates over
the physical PFNs defined in dev_dax->ranges, starting at phys.
Because devm_memremap_pages() only populates vmemmap starting from
pgmap_phys, any PFNs between phys and pgmap_phys lack vmemmap backing. This
causes an out-of-bounds access when attempting to read the folio state:
fsdev_dax_probe()
fsdev_clear_folio_state()
struct folio *folio = pfn_folio(pfn);
int order = dax_folio_reset_order(folio);
Could this trigger a fatal kernel page fault on the unmapped PFNs before the
probe has a chance to fail gracefully?
Should the validation block be moved above the call to
fsdev_clear_folio_state() to reject the invalid configuration before memory
is accessed?
> + data_offset = phys - pgmap_phys;
>
> pr_debug("%s: offset detected phys=%llx pgmap_phys=%llx offset=%llx\n",
> __func__, phys, pgmap_phys, data_offset);
--
Sashiko AI review · https://sashiko.dev/#/patchset/0100019eb7bcda4b-3f8edae9-d7a4-4bfa-aaea-fcef77fdbbc3-000000@email.amazonses.com?part=6
^ permalink raw reply [flat|nested] 20+ messages in thread
* [PATCH V5 7/9] dax: read holder_ops once in dax_holder_notify_failure()
2026-06-11 17:31 ` [PATCH V5 0/9] Fixes to the previously-merged drivers/dax/fsdev series John Groves
` (5 preceding siblings ...)
2026-06-11 17:32 ` [PATCH V5 6/9] dax/fsdev: fail probe on invalid pgmap offset John Groves
@ 2026-06-11 17:32 ` John Groves
2026-06-11 18:13 ` sashiko-bot
2026-06-12 3:02 ` Richard Cheng
2026-06-11 17:32 ` [PATCH V5 8/9] dax: fix holder_ops race in fs_put_dax() John Groves
2026-06-11 17:33 ` [PATCH V5 9/9] dax: fsdev.c minor formatting cleanup John Groves
8 siblings, 2 replies; 20+ messages in thread
From: John Groves @ 2026-06-11 17:32 UTC (permalink / raw)
To: John Groves, Dan Williams
Cc: John Groves, Vishal Verma, Dave Jiang, Matthew Wilcox, Jan Kara,
Alexander Viro, Christian Brauner, Miklos Szeredi,
Alison Schofield, Ira Weiny, Jonathan Cameron,
nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
Richard Cheng, John Groves
From: John Groves <John@Groves.net>
dax_holder_notify_failure() reads dax_dev->holder_ops twice without
READ_ONCE() -- once for the NULL check and once for the indirect
notify_failure() call. A concurrent fs_put_dax() or kill_dax() can clear
holder_ops between the two reads, so the check can observe a non-NULL
pointer while the call dereferences NULL.
Fetch holder_ops once into a local with READ_ONCE() so the NULL check and
the indirect call observe the same value.
Fixes: 8012b86608552 ("dax: introduce holder for dax_device")
Suggested-by: Richard Cheng <icheng@nvidia.com>
Signed-off-by: John Groves <john@groves.net>
---
drivers/dax/super.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)
diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 25cf99dd9360b..6b5ee6589e39b 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -303,6 +303,7 @@ EXPORT_SYMBOL_GPL(dax_recovery_write);
int dax_holder_notify_failure(struct dax_device *dax_dev, u64 off,
u64 len, int mf_flags)
{
+ const struct dax_holder_operations *ops;
int rc, id;
id = dax_read_lock();
@@ -311,12 +312,18 @@ int dax_holder_notify_failure(struct dax_device *dax_dev, u64 off,
goto out;
}
- if (!dax_dev->holder_ops) {
+ /*
+ * Read holder_ops once: a concurrent fs_put_dax() or kill_dax() can
+ * clear it. Without the single fetch the compiler could reload
+ * between the NULL check and the call and dereference a NULL ops.
+ */
+ ops = READ_ONCE(dax_dev->holder_ops);
+ if (!ops) {
rc = -EOPNOTSUPP;
goto out;
}
- rc = dax_dev->holder_ops->notify_failure(dax_dev, off, len, mf_flags);
+ rc = ops->notify_failure(dax_dev, off, len, mf_flags);
out:
dax_read_unlock(id);
return rc;
--
2.53.0
^ permalink raw reply related [flat|nested] 20+ messages in thread* Re: [PATCH V5 7/9] dax: read holder_ops once in dax_holder_notify_failure()
2026-06-11 17:32 ` [PATCH V5 7/9] dax: read holder_ops once in dax_holder_notify_failure() John Groves
@ 2026-06-11 18:13 ` sashiko-bot
2026-06-12 3:02 ` Richard Cheng
1 sibling, 0 replies; 20+ messages in thread
From: sashiko-bot @ 2026-06-11 18:13 UTC (permalink / raw)
To: John Groves; +Cc: linux-cxl
Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
Pre-existing issues:
- [High] The patch fixes a TOCTOU race on `holder_ops` in `dax_holder_notify_failure()` but fails to address a missing SRCU synchronization in `fs_put_dax()`. Concurrently, `fs_put_dax()` clears `holder_ops` and `holder_data` but returns immediately without waiting for `dax_srcu` readers (unlike `kill_dax()`). This enables a filesystem callback to dereference a NULL `holder_data` pointer, or execute unloaded module memory if the module is unloaded before the preempted callback resumes.
--
commit fb9deacd32044f5bf6a6a9877ee95b7eb436b68b
Author: John Groves <John@Groves.net>
dax: read holder_ops once in dax_holder_notify_failure()
This commit updates dax_holder_notify_failure() to read dax_dev->holder_ops
once using READ_ONCE(). This prevents a concurrent clear of holder_ops from
causing a NULL pointer dereference during the indirect function call.
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
[ ... ]
> @@ -311,12 +312,18 @@ int dax_holder_notify_failure(struct dax_device *dax_dev, u64 off,
> goto out;
> }
>
> - if (!dax_dev->holder_ops) {
> + /*
> + * Read holder_ops once: a concurrent fs_put_dax() or kill_dax() can
> + * clear it. Without the single fetch the compiler could reload
> + * between the NULL check and the call and dereference a NULL ops.
> + */
> + ops = READ_ONCE(dax_dev->holder_ops);
> + if (!ops) {
> rc = -EOPNOTSUPP;
> goto out;
> }
>
> - rc = dax_dev->holder_ops->notify_failure(dax_dev, off, len, mf_flags);
> + rc = ops->notify_failure(dax_dev, off, len, mf_flags);
[Severity: High]
This is a pre-existing issue, but does this fully address the race with
holder_data being cleared?
If fs_put_dax() clears both holder_ops and holder_data concurrently, it
returns without calling synchronize_srcu(&dax_srcu) to wait for dax_srcu
readers.
If a thread is preempted right before calling ops->notify_failure() here,
and fs_put_dax() executes, the callback will proceed with a valid ops pointer
but a NULL holder_data.
For example, xfs_dax_notify_failure() calls dax_holder() which returns NULL,
and then unconditionally dereferences the mount pointer:
fs/xfs/xfs_notify_failure.c:xfs_dax_notify_failure() {
struct xfs_mount *mp = dax_holder(dax_dev);
if (!(mp->m_super->s_flags & SB_BORN)) {
...
}
Could this result in a NULL pointer dereference, or an execution-after-free
if the filesystem module is subsequently unloaded?
> out:
> dax_read_unlock(id);
> return rc;
--
Sashiko AI review · https://sashiko.dev/#/patchset/0100019eb7bcda4b-3f8edae9-d7a4-4bfa-aaea-fcef77fdbbc3-000000@email.amazonses.com?part=7
^ permalink raw reply [flat|nested] 20+ messages in thread* Re: [PATCH V5 7/9] dax: read holder_ops once in dax_holder_notify_failure()
2026-06-11 17:32 ` [PATCH V5 7/9] dax: read holder_ops once in dax_holder_notify_failure() John Groves
2026-06-11 18:13 ` sashiko-bot
@ 2026-06-12 3:02 ` Richard Cheng
1 sibling, 0 replies; 20+ messages in thread
From: Richard Cheng @ 2026-06-12 3:02 UTC (permalink / raw)
To: John Groves
Cc: John Groves, Dan Williams, John Groves, Vishal Verma, Dave Jiang,
Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
Miklos Szeredi, Alison Schofield, Ira Weiny, Jonathan Cameron,
nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org
On Thu, Jun 11, 2026 at 05:32:45PM +0800, John Groves wrote:
> From: John Groves <John@Groves.net>
>
> dax_holder_notify_failure() reads dax_dev->holder_ops twice without
> READ_ONCE() -- once for the NULL check and once for the indirect
> notify_failure() call. A concurrent fs_put_dax() or kill_dax() can clear
> holder_ops between the two reads, so the check can observe a non-NULL
> pointer while the call dereferences NULL.
>
Hello John,
Thanks for this.
Reviewed-by: Richard Cheng <icheng@nvidia.com>
Small message nit, kill_dax() isn't a racing clearer.
Plus I think this only fix holder_ops double-fetch, the fs_put_dax()
race issue is separate and pre-existing.
Best regards,
Richard Cheng.
> Fetch holder_ops once into a local with READ_ONCE() so the NULL check and
> the indirect call observe the same value.
>
> Fixes: 8012b86608552 ("dax: introduce holder for dax_device")
> Suggested-by: Richard Cheng <icheng@nvidia.com>
> Signed-off-by: John Groves <john@groves.net>
> ---
> drivers/dax/super.c | 11 +++++++++--
> 1 file changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index 25cf99dd9360b..6b5ee6589e39b 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -303,6 +303,7 @@ EXPORT_SYMBOL_GPL(dax_recovery_write);
> int dax_holder_notify_failure(struct dax_device *dax_dev, u64 off,
> u64 len, int mf_flags)
> {
> + const struct dax_holder_operations *ops;
> int rc, id;
>
> id = dax_read_lock();
> @@ -311,12 +312,18 @@ int dax_holder_notify_failure(struct dax_device *dax_dev, u64 off,
> goto out;
> }
>
> - if (!dax_dev->holder_ops) {
> + /*
> + * Read holder_ops once: a concurrent fs_put_dax() or kill_dax() can
> + * clear it. Without the single fetch the compiler could reload
> + * between the NULL check and the call and dereference a NULL ops.
> + */
> + ops = READ_ONCE(dax_dev->holder_ops);
> + if (!ops) {
> rc = -EOPNOTSUPP;
> goto out;
> }
>
> - rc = dax_dev->holder_ops->notify_failure(dax_dev, off, len, mf_flags);
> + rc = ops->notify_failure(dax_dev, off, len, mf_flags);
> out:
> dax_read_unlock(id);
> return rc;
> --
> 2.53.0
>
>
^ permalink raw reply [flat|nested] 20+ messages in thread
* [PATCH V5 8/9] dax: fix holder_ops race in fs_put_dax()
2026-06-11 17:31 ` [PATCH V5 0/9] Fixes to the previously-merged drivers/dax/fsdev series John Groves
` (6 preceding siblings ...)
2026-06-11 17:32 ` [PATCH V5 7/9] dax: read holder_ops once in dax_holder_notify_failure() John Groves
@ 2026-06-11 17:32 ` John Groves
2026-06-11 18:28 ` sashiko-bot
2026-06-11 17:33 ` [PATCH V5 9/9] dax: fsdev.c minor formatting cleanup John Groves
8 siblings, 1 reply; 20+ messages in thread
From: John Groves @ 2026-06-11 17:32 UTC (permalink / raw)
To: John Groves, Dan Williams
Cc: John Groves, Vishal Verma, Dave Jiang, Matthew Wilcox, Jan Kara,
Alexander Viro, Christian Brauner, Miklos Szeredi,
Alison Schofield, Ira Weiny, Jonathan Cameron,
nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
John Groves
From: John Groves <John@Groves.net>
Clear holder_ops before holder_data so that a concurrent fs_dax_get()
cannot have its newly installed holder_ops overwritten. cmpxchg()
provides release ordering on weakly-ordered architectures, ensuring the
WRITE_ONCE(holder_ops, NULL) store is visible to any CPU that observes
the holder_data release.
Add a WARN_ON() that fires only when the cmpxchg observes a non-NULL
value that is not @holder, i.e. fs_put_dax() called by something that
is not the current holder. That is an API contract violation; the
WARN_ON() does not prevent the damage but makes the bug visible.
A NULL cmpxchg result is deliberately tolerated: kill_dax() clears
holder_data while a holder is still attached when a device is removed
out from under a mounted filesystem (after delivering MF_MEM_PRE_REMOVE).
The holder's subsequent fs_put_dax() - e.g. xfs_free_buftarg() after a
forced shutdown - then legitimately finds holder_data already NULL, so
warning on that case would turn supported device removal into a splat
(or a panic with panic_on_warn).
Also add a kerneldoc comment documenting that fs_put_dax() must only
be called by the current holder.
Fixes: eec38f5d86d27 ("dax: Add fs_dax_get() func to prepare dax for fs-dax usage")
Signed-off-by: John Groves <john@groves.net>
---
drivers/dax/super.c | 42 +++++++++++++++++++++++++++++++++++++++---
1 file changed, 39 insertions(+), 3 deletions(-)
diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 6b5ee6589e39b..af5c1e14f7e39 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -116,11 +116,47 @@ EXPORT_SYMBOL_GPL(fs_dax_get_by_bdev);
#if IS_ENABLED(CONFIG_FS_DAX)
+/**
+ * fs_put_dax() - release holder ownership of a dax_device
+ * @dax_dev: dax device to release (may be NULL)
+ * @holder: the holder pointer previously passed to fs_dax_get() or
+ * fs_dax_get_by_bdev(); must match exactly, as it is used
+ * in a cmpxchg to atomically release ownership
+ *
+ * Must only be called by the current holder. Clears holder_ops before
+ * holder_data to avoid a race where a concurrent fs_dax_get() could have
+ * its newly installed holder_ops overwritten.
+ */
void fs_put_dax(struct dax_device *dax_dev, void *holder)
{
- if (dax_dev && holder &&
- cmpxchg(&dax_dev->holder_data, holder, NULL) == holder)
- dax_dev->holder_ops = NULL;
+ if (dax_dev && holder) {
+ void *prev;
+
+ /*
+ * Clear holder_ops before releasing holder_data. A concurrent
+ * dax_holder_notify_failure() that sees NULL ops returns
+ * -EOPNOTSUPP cleanly. A concurrent fs_dax_get() that acquires
+ * holder_data after the cmpxchg below is guaranteed to observe
+ * holder_ops=NULL first (cmpxchg provides release ordering), so
+ * its subsequent store of new ops will not be overwritten.
+ */
+ WRITE_ONCE(dax_dev->holder_ops, NULL);
+ prev = cmpxchg(&dax_dev->holder_data, holder, NULL);
+
+ /*
+ * prev == holder: normal release.
+ * prev == NULL: already released by kill_dax() when the
+ * device was removed under a live holder;
+ * not a bug.
+ * prev != holder (non-NULL): fs_put_dax() called by something
+ * that is not the current holder; an API
+ * contract violation. A lock would be needed
+ * to guard against this, but we WARN_ON()
+ * instead since violating the contract is
+ * a bug.
+ */
+ WARN_ON(prev && prev != holder);
+ }
put_dax(dax_dev);
}
EXPORT_SYMBOL_GPL(fs_put_dax);
--
2.53.0
^ permalink raw reply related [flat|nested] 20+ messages in thread* Re: [PATCH V5 8/9] dax: fix holder_ops race in fs_put_dax()
2026-06-11 17:32 ` [PATCH V5 8/9] dax: fix holder_ops race in fs_put_dax() John Groves
@ 2026-06-11 18:28 ` sashiko-bot
0 siblings, 0 replies; 20+ messages in thread
From: sashiko-bot @ 2026-06-11 18:28 UTC (permalink / raw)
To: John Groves; +Cc: linux-cxl
Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
New issues:
- [High] Unconditional `WRITE_ONCE()` in `fs_put_dax()` corrupts `holder_ops` of concurrent mounts due to a pre-existing race in `fs_dax_get()`.
Pre-existing issues:
- [High] Lack of SRCU synchronization in `fs_put_dax()` leads to NULL pointer dereference or use-after-free in memory failure handlers.
--
commit 4554a08ddf0ff95891c1710cb432fdf796290f3d
Author: John Groves <John@Groves.net>
dax: fix holder_ops race in fs_put_dax()
This patch clears holder_ops before holder_data in fs_put_dax() to
prevent a concurrent fs_dax_get() from having its newly installed
operations overwritten. It adds a WARN_ON for API contract violations.
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index 6b5ee6589e39b..af5c1e14f7e39 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -116,11 +116,47 @@ EXPORT_SYMBOL_GPL(fs_dax_get_by_bdev);
>
> #if IS_ENABLED(CONFIG_FS_DAX)
>
[ ... ]
> void fs_put_dax(struct dax_device *dax_dev, void *holder)
> {
> - if (dax_dev && holder &&
> - cmpxchg(&dax_dev->holder_data, holder, NULL) == holder)
> - dax_dev->holder_ops = NULL;
> + if (dax_dev && holder) {
> + void *prev;
> +
> + /*
> + * Clear holder_ops before releasing holder_data. A concurrent
[ ... ]
> + * its subsequent store of new ops will not be overwritten.
> + */
> + WRITE_ONCE(dax_dev->holder_ops, NULL);
[Severity: High]
Does this unconditional write corrupt concurrent mounts?
Because the SRCU lock is dropped in fs_dax_get() before claiming
holder_data, kill_dax() can run concurrently and clear the device state.
This allows fs_dax_get() to mount a new filesystem on a dead device.
If the original filesystem performs a forced unmount and calls
fs_put_dax(), this WRITE_ONCE() unconditionally destroys the new
filesystem's holder_ops before verifying ownership via cmpxchg().
> + prev = cmpxchg(&dax_dev->holder_data, holder, NULL);
[Severity: High]
This is a pre-existing issue, but does the lack of SRCU synchronization
here lead to use-after-free or NULL pointer dereferences in memory failure
handlers?
dax_holder_notify_failure() executes under an SRCU read lock and reads
holder_ops before calling the filesystem's callback. If fs_put_dax()
clears holder_data and completes the unmount without waiting for an SRCU
grace period, it can cause problems if a hardware memory failure runs
concurrently.
For instance, looking at xfs_dax_notify_failure():
struct xfs_mount *mp = dax_holder(dax_dev);
if (!(mp->m_super->s_flags & SB_BORN)) {
...
If holder_data is cleared right after holder_ops is checked, can the
callback read a cleared holder_data, resulting in a NULL mp dereference
when accessing mp->m_super, or access a fully freed mp if the filesystem
has finished unmounting?
> +
> + /*
> + * prev == holder: normal release.
[ ... ]
> + * instead since violating the contract is
> + * a bug.
> + */
> + WARN_ON(prev && prev != holder);
> + }
> put_dax(dax_dev);
> }
> EXPORT_SYMBOL_GPL(fs_put_dax);
--
Sashiko AI review · https://sashiko.dev/#/patchset/0100019eb7bcda4b-3f8edae9-d7a4-4bfa-aaea-fcef77fdbbc3-000000@email.amazonses.com?part=8
^ permalink raw reply [flat|nested] 20+ messages in thread
* [PATCH V5 9/9] dax: fsdev.c minor formatting cleanup
2026-06-11 17:31 ` [PATCH V5 0/9] Fixes to the previously-merged drivers/dax/fsdev series John Groves
` (7 preceding siblings ...)
2026-06-11 17:32 ` [PATCH V5 8/9] dax: fix holder_ops race in fs_put_dax() John Groves
@ 2026-06-11 17:33 ` John Groves
8 siblings, 0 replies; 20+ messages in thread
From: John Groves @ 2026-06-11 17:33 UTC (permalink / raw)
To: John Groves, Dan Williams
Cc: John Groves, Vishal Verma, Dave Jiang, Matthew Wilcox, Jan Kara,
Alexander Viro, Christian Brauner, Miklos Szeredi,
Alison Schofield, Ira Weiny, Jonathan Cameron,
nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
John Groves
From: John Groves <John@Groves.net>
Address some comments from Jonathan that were missed in the merged
series. Fix line wrapping in fsdev_dax_recovery_write() and
fsdev_dax_zero_page_range() signatures.
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Signed-off-by: John Groves <john@groves.net>
---
drivers/dax/fsdev.c | 15 ++++++++-------
1 file changed, 8 insertions(+), 7 deletions(-)
diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
index 71d2bee1e2805..7df75728ada89 100644
--- a/drivers/dax/fsdev.c
+++ b/drivers/dax/fsdev.c
@@ -45,8 +45,8 @@ static void fsdev_write_dax(void *addr, struct page *page,
}
static long __fsdev_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
- long nr_pages, enum dax_access_mode mode, void **kaddr,
- unsigned long *pfn)
+ long nr_pages, enum dax_access_mode mode, void **kaddr,
+ unsigned long *pfn)
{
struct dev_dax *dev_dax = dax_get_private(dax_dev);
size_t size = nr_pages << PAGE_SHIFT;
@@ -80,7 +80,8 @@ static int fsdev_dax_zero_page_range(struct dax_device *dax_dev,
long rc;
WARN_ONCE(nr_pages > 1, "%s: nr_pages > 1\n", __func__);
- rc = __fsdev_dax_direct_access(dax_dev, pgoff, 1, DAX_ACCESS, &kaddr, NULL);
+ rc = __fsdev_dax_direct_access(dax_dev, pgoff, 1, DAX_ACCESS,
+ &kaddr, NULL);
if (rc < 0)
return rc;
fsdev_write_dax(kaddr, ZERO_PAGE(0), 0, PAGE_SIZE);
@@ -88,15 +89,15 @@ static int fsdev_dax_zero_page_range(struct dax_device *dax_dev,
}
static long fsdev_dax_direct_access(struct dax_device *dax_dev,
- pgoff_t pgoff, long nr_pages, enum dax_access_mode mode,
- void **kaddr, unsigned long *pfn)
+ pgoff_t pgoff, long nr_pages, enum dax_access_mode mode,
+ void **kaddr, unsigned long *pfn)
{
return __fsdev_dax_direct_access(dax_dev, pgoff, nr_pages, mode,
kaddr, pfn);
}
-static size_t fsdev_dax_recovery_write(struct dax_device *dax_dev, pgoff_t pgoff,
- void *addr, size_t bytes, struct iov_iter *i)
+static size_t fsdev_dax_recovery_write(struct dax_device *dax_dev,
+ pgoff_t pgoff, void *addr, size_t bytes, struct iov_iter *i)
{
return _copy_from_iter_flushcache(addr, bytes, i);
}
--
2.53.0
^ permalink raw reply related [flat|nested] 20+ messages in thread