* [RFC PATCH 01/12] dax: rate limit dev_dax_huge_fault() output
From: Dave Jiang @ 2026-04-23 17:02 UTC
To: linux-cxl, nvdimm
Cc: djbw, iweiny, pasha.tatashin, mclapinski, rppt, joao.m.martins,
jic23, gourry, john, rick.p.edgecombe
Use dev_dbg_ratelimited() to rate limit the output of
dev_dax_huge_fault() so that it does not flood dmesg.
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
---
drivers/dax/device.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index 22999a402e02..62206bcb63a6 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -244,9 +244,11 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf, unsigned int order)
int id;
struct dev_dax *dev_dax = filp->private_data;
- dev_dbg(&dev_dax->dev, "%s: op=%s addr=%#lx order=%d\n", current->comm,
- (vmf->flags & FAULT_FLAG_WRITE) ? "write" : "read",
- vmf->address & ~((1UL << (order + PAGE_SHIFT)) - 1), order);
+ dev_dbg_ratelimited(&dev_dax->dev, "%s: op=%s addr=%#lx order=%d\n",
+ current->comm,
+ (vmf->flags & FAULT_FLAG_WRITE) ? "write" : "read",
+ vmf->address & ~((1UL << (order + PAGE_SHIFT)) - 1),
+ order);
id = dax_read_lock();
if (order == 0)
--
2.53.0
* [RFC PATCH 02/12] dax: Save the kva from memremap
From: Dave Jiang @ 2026-04-23 17:02 UTC
To: linux-cxl, nvdimm
Cc: djbw, iweiny, pasha.tatashin, mclapinski, rppt, joao.m.martins,
jic23, gourry, john, rick.p.edgecombe
This patch is partially taken from John Groves' famfs dax series.
Save the kva returned by memremap because it is needed to support
direct_access().
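To illustrate (a sketch of the usage added later in patch 5), the saved
base lets direct_access() derive a kernel address for a given page
offset:

	/* in __dev_dax_direct_access(), patch 5 */
	offset = pgoff << PAGE_SHIFT;
	*kaddr = dev_dax->virt_addr + offset;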
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
---
drivers/dax/dax-private.h | 4 ++++
drivers/dax/device.c | 19 +++++++++++++++++++
2 files changed, 23 insertions(+)
diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
index c6ae27c982f4..425a515905e5 100644
--- a/drivers/dax/dax-private.h
+++ b/drivers/dax/dax-private.h
@@ -69,6 +69,8 @@ struct dev_dax_range {
* data while the device is activated in the driver.
* @region: parent region
* @dax_dev: core dax functionality
+ * @virt_addr: kva from memremap; used by fsdev_dax
+ * @cached_size: cached size of the memory range for quick access; used by fsdev_dax
* @align: alignment of this instance
* @target_node: effective numa node if dev_dax memory range is onlined
* @dyn_id: is this a dynamic or statically created instance
@@ -83,6 +85,8 @@ struct dev_dax_range {
struct dev_dax {
struct dax_region *region;
struct dax_device *dax_dev;
+ void *virt_addr;
+ u64 cached_size;
unsigned int align;
int target_node;
bool dyn_id;
diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index 62206bcb63a6..4ffa3ef60a57 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -370,6 +370,7 @@ static int dax_open(struct inode *inode, struct file *filp)
filp->f_sb_err = file_sample_sb_err(filp);
filp->private_data = dev_dax;
inode->i_flags = S_DAX;
+ inode->i_size = dev_dax->cached_size;
return 0;
}
@@ -408,7 +409,9 @@ static int dev_dax_probe(struct dev_dax *dev_dax)
struct device *dev = &dev_dax->dev;
struct dev_pagemap *pgmap;
struct inode *inode;
+ u64 data_offset = 0;
struct cdev *cdev;
+ u64 len = 0;
void *addr;
int rc, i;
@@ -451,6 +454,7 @@ static int dev_dax_probe(struct dev_dax *dev_dax)
i, range->start, range->end);
return -EBUSY;
}
+ len += range_len(range);
}
pgmap->type = MEMORY_DEVICE_GENERIC;
@@ -461,6 +465,21 @@ static int dev_dax_probe(struct dev_dax *dev_dax)
if (IS_ERR(addr))
return PTR_ERR(addr);
+ /* Detect whether the data is at a non-zero offset into the memory */
+ if (pgmap->range.start != dev_dax->ranges[0].range.start) {
+ u64 phys = dev_dax->ranges[0].range.start;
+ u64 pgmap_phys = dev_dax->pgmap[0].range.start;
+
+ if (!WARN_ON(pgmap_phys > phys))
+ data_offset = phys - pgmap_phys;
+
+ pr_debug("%s: offset detected phys=%llx pgmap_phys=%llx offset=%llx\n",
+ __func__, phys, pgmap_phys, data_offset);
+ }
+
+ dev_dax->virt_addr = addr + data_offset;
+ dev_dax->cached_size = len;
+
inode = dax_inode(dax_dev);
cdev = inode->i_cdev;
cdev_init(cdev, &dax_fops);
--
2.53.0
* [RFC PATCH 03/12] dax: Add fallocate support to device dax
From: Dave Jiang @ 2026-04-23 17:02 UTC
To: linux-cxl, nvdimm
Cc: djbw, iweiny, pasha.tatashin, mclapinski, rppt, joao.m.martins,
jic23, gourry, john, rick.p.edgecombe
fallocate() support is needed for the KVM guest_memfd selftest. Add a
version of fallocate() for device dax. This is a simplistic
implementation that just zeroes the specified range. It may need to
be revisited to implement map/unmap in order to support larger files.
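For reference, the selftest exercises this with calls along these lines
(illustrative; offset and len must be aligned to the device alignment
and the range must stay within the device size):

	/* zero the first page while keeping the size */
	ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
			0, page_size);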
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
---
drivers/dax/device.c | 30 +++++++++++++++++++++++++++++-
1 file changed, 29 insertions(+), 1 deletion(-)
diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index 4ffa3ef60a57..705c59f469c2 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -10,6 +10,8 @@
#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/mman.h>
+#include <linux/range.h>
+#include <linux/falloc.h>
#include "dax-private.h"
#include "bus.h"
@@ -383,7 +385,31 @@ static int dax_release(struct inode *inode, struct file *filp)
return 0;
}
-static const struct file_operations dax_fops = {
+static long dax_fallocate(struct file *file, int mode, loff_t offset,
+ loff_t len)
+{
+ struct dev_dax *dev_dax = file->private_data;
+
+ if (!IS_ALIGNED(offset, dev_dax->align) ||
+ !IS_ALIGNED(len, dev_dax->align))
+ return -EINVAL;
+
+ if (offset + len > dev_dax->cached_size)
+ return -ERANGE;
+
+ /* DAX device does not change size */
+ if (!(mode & FALLOC_FL_KEEP_SIZE))
+ return -EOPNOTSUPP;
+
+ if ((mode & ~FALLOC_FL_KEEP_SIZE) &&
+ ((mode & (FALLOC_FL_PUNCH_HOLE | FALLOC_FL_ZERO_RANGE)) == 0))
+ return -EOPNOTSUPP;
+
+ memset(dev_dax->virt_addr + offset, 0, len);
+ return 0;
+}
+
+const struct file_operations dax_fops = {
.llseek = noop_llseek,
.owner = THIS_MODULE,
.open = dax_open,
@@ -391,7 +417,9 @@ static const struct file_operations dax_fops = {
.get_unmapped_area = dax_get_unmapped_area,
.mmap_prepare = dax_mmap_prepare,
.fop_flags = FOP_MMAP_SYNC,
+ .fallocate = dax_fallocate,
};
+EXPORT_SYMBOL_GPL(dax_fops);
static void dev_dax_cdev_del(void *cdev)
{
--
2.53.0
* [RFC PATCH 04/12] dax: Move dax_pgoff_to_phys() to dax bus to be used by dev dax
From: Dave Jiang @ 2026-04-23 17:02 UTC
To: linux-cxl, nvdimm
Cc: djbw, iweiny, pasha.tatashin, mclapinski, rppt, joao.m.martins,
jic23, gourry, john, rick.p.edgecombe
Move dax_pgoff_to_phys() to the dax bus code and export the symbol so
that it can be used by device dax.
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
---
drivers/dax/bus.c | 24 ++++++++++++++++++++++++
drivers/dax/device.c | 23 -----------------------
2 files changed, 24 insertions(+), 23 deletions(-)
diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index fde29e0ad68b..92e79720befd 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -1417,6 +1417,30 @@ static const struct device_type dev_dax_type = {
.groups = dax_attribute_groups,
};
+/* see "strong" declaration in tools/testing/nvdimm/dax-dev.c */
+__weak phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff,
+ unsigned long size)
+{
+ int i;
+
+ for (i = 0; i < dev_dax->nr_range; i++) {
+ struct dev_dax_range *dax_range = &dev_dax->ranges[i];
+ struct range *range = &dax_range->range;
+ unsigned long long pgoff_end;
+ phys_addr_t phys;
+
+ pgoff_end = dax_range->pgoff + PHYS_PFN(range_len(range)) - 1;
+ if (pgoff < dax_range->pgoff || pgoff > pgoff_end)
+ continue;
+ phys = PFN_PHYS(pgoff - dax_range->pgoff) + range->start;
+ if (phys + size - 1 <= range->end)
+ return phys;
+ break;
+ }
+ return -1;
+}
+EXPORT_SYMBOL_GPL(dax_pgoff_to_phys);
+
static struct dev_dax *__devm_create_dev_dax(struct dev_dax_data *data)
{
struct dax_region *dax_region = data->dax_region;
diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index 705c59f469c2..e892fb4ec8e0 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -59,29 +59,6 @@ static int check_vma(struct dev_dax *dev_dax, struct vm_area_struct *vma,
vma->vm_file, func);
}
-/* see "strong" declaration in tools/testing/nvdimm/dax-dev.c */
-__weak phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff,
- unsigned long size)
-{
- int i;
-
- for (i = 0; i < dev_dax->nr_range; i++) {
- struct dev_dax_range *dax_range = &dev_dax->ranges[i];
- struct range *range = &dax_range->range;
- unsigned long long pgoff_end;
- phys_addr_t phys;
-
- pgoff_end = dax_range->pgoff + PHYS_PFN(range_len(range)) - 1;
- if (pgoff < dax_range->pgoff || pgoff > pgoff_end)
- continue;
- phys = PFN_PHYS(pgoff - dax_range->pgoff) + range->start;
- if (phys + size - 1 <= range->end)
- return phys;
- break;
- }
- return -1;
-}
-
static void dax_set_mapping(struct vm_fault *vmf, unsigned long pfn,
unsigned long fault_size)
{
--
2.53.0
* [RFC PATCH 05/12] dax: Add dax_operations and supporting functions to device dax
From: Dave Jiang @ 2026-04-23 17:02 UTC
To: linux-cxl, nvdimm
Cc: djbw, iweiny, pasha.tatashin, mclapinski, rppt, joao.m.martins,
jic23, gourry, john, rick.p.edgecombe
dax_direct_access() support is needed to provide the PFN when KVM
handles an EPT violation on a guest memory fault. Add dax_operations
and the supporting functions that implement dax_direct_access() for
device dax.
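Patch 10 then calls this under the dax read lock to translate a page
offset into a PFN, roughly:

	id = dax_read_lock();
	rc = dax_direct_access(dax_dev, pgoff, 1, DAX_ACCESS,
			       &kaddr, &pfn);
	dax_read_unlock(id);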
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
---
drivers/dax/bus.c | 100 +++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 99 insertions(+), 1 deletion(-)
diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 92e79720befd..1ef447747876 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -1,6 +1,7 @@
// SPDX-License-Identifier: GPL-2.0
/* Copyright(c) 2017-2018 Intel Corporation. All rights reserved. */
#include <linux/memremap.h>
+#include <linux/highmem.h>
#include <linux/device.h>
#include <linux/mutex.h>
#include <linux/list.h>
@@ -1441,6 +1442,103 @@ __weak phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff,
}
EXPORT_SYMBOL_GPL(dax_pgoff_to_phys);
+static void dev_dax_write_dax(void *addr, struct page *page,
+ unsigned int off, unsigned int len)
+{
+ while (len) {
+ void *mem = kmap_local_page(page);
+ unsigned int chunk = min_t(unsigned int, len, PAGE_SIZE - off);
+
+ memcpy_flushcache(addr, mem + off, chunk);
+ kunmap_local(mem);
+ len -= chunk;
+ off = 0;
+ page++;
+ addr += chunk;
+ }
+}
+
+static long __dev_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
+ long nr_pages, enum dax_access_mode mode,
+ void **kaddr, unsigned long *pfn)
+{
+ struct dev_dax *dev_dax = dax_get_private(dax_dev);
+ size_t size = nr_pages << PAGE_SHIFT;
+ size_t offset = pgoff << PAGE_SHIFT;
+ unsigned long local_pfn;
+ void *virt_addr;
+ phys_addr_t phys;
+
+ /* Only support DAX_ACCESS atm */
+ if (mode != DAX_ACCESS)
+ return -EINVAL;
+
+ if (!dev_dax || !dev_dax->virt_addr)
+ return -ENXIO;
+
+ if (nr_pages <= 0)
+ return -EINVAL;
+ if (offset >= dev_dax->cached_size)
+ return -ERANGE;
+ virt_addr = dev_dax->virt_addr + offset;
+
+ phys = dax_pgoff_to_phys(dev_dax, pgoff, size);
+ if (phys == -1) {
+ dev_dbg(&dev_dax->dev,
+ "invalid access: pgoff=%#lx, nr_pages=%ld\n",
+ pgoff, nr_pages);
+ return -ERANGE;
+ }
+
+ if (kaddr)
+ *kaddr = virt_addr;
+
+ local_pfn = PHYS_PFN(phys);
+ if (pfn)
+ *pfn = local_pfn;
+
+ /*
+ * Use cached_size which was computed at probe time. The size cannot
+ * change while the driver is bound (resize returns -EBUSY).
+ */
+ return PHYS_PFN(min(size, dev_dax->cached_size - offset));
+}
+
+static int dev_dax_zero_page_range(struct dax_device *dax_dev,
+ pgoff_t pgoff, size_t nr_pages)
+{
+ void *kaddr;
+ long rc;
+
+ WARN_ONCE(nr_pages > 1, "%s: nr_pages > 1\n", __func__);
+ rc = __dev_dax_direct_access(dax_dev, pgoff, nr_pages, DAX_ACCESS, &kaddr, NULL);
+ if (rc < 0)
+ return rc;
+ dev_dax_write_dax(kaddr, ZERO_PAGE(0), 0, PAGE_SIZE);
+ return 0;
+}
+
+static long dev_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
+ long nr_pages, enum dax_access_mode mode,
+ void **kaddr, unsigned long *pfn)
+{
+ return __dev_dax_direct_access(dax_dev, pgoff, nr_pages, mode, kaddr,
+ pfn);
+}
+
+static size_t dev_dax_recovery_write(struct dax_device *dax_dev, pgoff_t pgoff,
+ void *addr, size_t bytes,
+ struct iov_iter *i)
+{
+ return _copy_from_iter_flushcache(addr, bytes, i);
+}
+
+static const struct dax_operations dev_dax_ops = {
+ .direct_access = dev_dax_direct_access,
+ .zero_page_range = dev_dax_zero_page_range,
+ .recovery_write = dev_dax_recovery_write,
+};
+
static struct dev_dax *__devm_create_dev_dax(struct dev_dax_data *data)
{
struct dax_region *dax_region = data->dax_region;
@@ -1500,7 +1598,7 @@ static struct dev_dax *__devm_create_dev_dax(struct dev_dax_data *data)
* No dax_operations since there is no access to this device outside of
* mmap of the resulting character device.
*/
- dax_dev = alloc_dax(dev_dax, NULL);
+ dax_dev = alloc_dax(dev_dax, &dev_dax_ops);
if (IS_ERR(dax_dev)) {
rc = PTR_ERR(dax_dev);
goto err_alloc_dax;
--
2.53.0
* [RFC PATCH 06/12] dax: Add helper to determine if a 'struct file' supports dax
From: Dave Jiang @ 2026-04-23 17:02 UTC
To: linux-cxl, nvdimm
Cc: djbw, iweiny, pasha.tatashin, mclapinski, rppt, joao.m.martins,
jic23, gourry, john, rick.p.edgecombe
Add a helper function that checks whether a file's file_operations
belong to device dax.
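Patch 7 uses it in kvm_gmem_bind() along these lines:

	if (is_file_dax(file)) {
		r = kvm_daxfd_init(file, slot, kvm);
		if (r)
			goto err;
	}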
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
---
include/linux/dax.h | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 9d624f4d9df6..a5e1a3ca1a0d 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -311,4 +311,11 @@ static inline void hmem_register_resource(int target_nid, struct resource *r)
typedef int (*walk_hmem_fn)(struct device *dev, int target_nid,
const struct resource *res);
int walk_hmem_resources(struct device *dev, walk_hmem_fn fn);
+
+extern const struct file_operations dax_fops;
+static inline bool is_file_dax(struct file *file)
+{
+ return file->f_op == &dax_fops;
+}
+
#endif
--
2.53.0
* [RFC PATCH 07/12] KVM: guest_memfd: Add setup of daxfd when binding gmem
From: Dave Jiang @ 2026-04-23 17:02 UTC
To: linux-cxl, nvdimm
Cc: djbw, iweiny, pasha.tatashin, mclapinski, rppt, joao.m.martins,
jic23, gourry, john, rick.p.edgecombe
A DAX fd comes from a device dax char dev and is passed in from
userspace. Unlike a gmem fd, it is not created by the kernel.
kvm_gmem_bind() seems to be the right place, at this moment, to set up
the additional gmem context for the daxfd when it is passed in through
the ioctl to bind to gmem. Add a helper function to set up the
necessary bits once the fd is verified to be DAX.
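From userspace the daxfd is bound like a gmem fd. A sketch based on the
selftest in patch 12 (the KVM_MEM_GUEST_DAXFD memslot flag is added in
patch 11):

	struct kvm_userspace_memory_region2 region = {
		.slot = 1,
		.flags = KVM_MEM_GUEST_DAXFD,
		.guest_phys_addr = gpa,
		.memory_size = size,
		.guest_memfd = daxfd, /* fd from open("/dev/dax0.1") */
		.guest_memfd_offset = 0,
	};

	ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION2, &region);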
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
---
arch/x86/kvm/Kconfig | 1 +
drivers/dax/bus.c | 3 +++
drivers/dax/dax-private.h | 4 ++++
include/linux/kvm_host.h | 24 +++++++++++++++++++
virt/kvm/Kconfig | 4 ++++
virt/kvm/guest_memfd.c | 50 ++++++++++++++++++++++++++-------------
6 files changed, 70 insertions(+), 16 deletions(-)
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 278f08194ec8..bdcaff9c49e7 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -48,6 +48,7 @@ config KVM_X86
select KVM_GENERIC_PRE_FAULT_MEMORY
select KVM_WERROR if WERROR
select KVM_GUEST_MEMFD if X86_64
+ select KVM_GUEST_DAXFD if X86_64
config KVM
tristate "Kernel-based Virtual Machine (KVM) support"
diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 1ef447747876..759163722e4c 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -1623,6 +1623,9 @@ static struct dev_dax *__devm_create_dev_dax(struct dev_dax_data *data)
dev->parent = parent;
dev->type = &dev_dax_type;
+ xa_init(&dev_dax->gmem_file.bindings);
+ list_add(&dev_dax->gmem_file.entry, &inode->i_mapping->i_private_list);
+
rc = device_add(dev);
if (rc) {
kill_dev_dax(dev_dax);
diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
index 425a515905e5..2b3c44cb0dbe 100644
--- a/drivers/dax/dax-private.h
+++ b/drivers/dax/dax-private.h
@@ -8,6 +8,7 @@
#include <linux/device.h>
#include <linux/cdev.h>
#include <linux/idr.h>
+#include <linux/kvm_host.h>
/* private routines between core files */
struct dax_device;
@@ -67,6 +68,8 @@ struct dev_dax_range {
/**
* struct dev_dax - instance data for a subdivision of a dax region, and
* data while the device is activated in the driver.
+ *
+ * @gmem_file: guest mem file for this dev_dax. Must be first member
* @region: parent region
* @dax_dev: core dax functionality
* @virt_addr: kva from memremap; used by fsdev_dax
@@ -83,6 +86,7 @@ struct dev_dax_range {
* @ranges: range tuples of memory used
*/
struct dev_dax {
+ struct gmem_file gmem_file;
struct dax_region *region;
struct dax_device *dax_dev;
void *virt_addr;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index d93f75b05ae2..9afce6d02d9e 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -56,6 +56,7 @@
*/
#define KVM_MEMSLOT_INVALID (1UL << 16)
#define KVM_MEMSLOT_GMEM_ONLY (1UL << 17)
+#define KVM_MEMSLOT_DAX_ONLY (1UL << 18)
/*
* Bit 63 of the memslot generation number is an "update in-progress flag",
@@ -2515,6 +2516,14 @@ static inline bool kvm_memslot_is_gmem_only(const struct kvm_memory_slot *slot)
return slot->flags & KVM_MEMSLOT_GMEM_ONLY;
}
+static inline bool kvm_memslot_is_dax_only(const struct kvm_memory_slot *slot)
+{
+ if (!IS_ENABLED(CONFIG_KVM_GUEST_DAXFD))
+ return false;
+
+ return slot->flags & KVM_MEMSLOT_DAX_ONLY;
+}
+
#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
static inline unsigned long kvm_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
{
@@ -2604,4 +2613,19 @@ static inline int kvm_enable_virtualization(void) { return 0; }
static inline void kvm_disable_virtualization(void) { }
#endif
+/*
+ * A guest_memfd instance can be associated multiple VMs, each with its own
+ * "view" of the underlying physical memory.
+ *
+ * The gmem's inode is effectively the raw underlying physical storage, and is
+ * used to track properties of the physical memory, while each gmem file is
+ * effectively a single VM's view of that storage, and is used to track assets
+ * specific to its associated VM, e.g. memslots=>gmem bindings.
+ */
+struct gmem_file {
+ struct kvm *kvm;
+ struct xarray bindings;
+ struct list_head entry;
+};
+
#endif
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 267c7369c765..7f0598af868b 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -125,3 +125,7 @@ config HAVE_KVM_ARCH_GMEM_INVALIDATE
config HAVE_KVM_ARCH_GMEM_POPULATE
bool
depends on KVM_GUEST_MEMFD
+
+config KVM_GUEST_DAXFD
+ bool
+ depends on KVM_GUEST_MEMFD
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index fdaea3422c30..959f690c1d1d 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -7,26 +7,12 @@
#include <linux/mempolicy.h>
#include <linux/pseudo_fs.h>
#include <linux/pagemap.h>
+#include <linux/dax.h>
#include "kvm_mm.h"
static struct vfsmount *kvm_gmem_mnt;
-/*
- * A guest_memfd instance can be associated multiple VMs, each with its own
- * "view" of the underlying physical memory.
- *
- * The gmem's inode is effectively the raw underlying physical storage, and is
- * used to track properties of the physical memory, while each gmem file is
- * effectively a single VM's view of that storage, and is used to track assets
- * specific to its associated VM, e.g. memslots=>gmem bindings.
- */
-struct gmem_file {
- struct kvm *kvm;
- struct xarray bindings;
- struct list_head entry;
-};
-
struct gmem_inode {
struct shared_policy policy;
struct inode vfs_inode;
@@ -644,6 +630,32 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
return __kvm_gmem_create(kvm, size, flags);
}
+/*
+ * DAX fd files are not initialized with gmem bits since they are passed in
+ * from user space and not created by the kernel (at least right now). So
+ * when the daxfd is being bound during kvm_gmem_bind(), the gmem bits need
+ * to be initialized.
+ */
+static int kvm_daxfd_init(struct file *file, struct kvm_memory_slot *slot,
+ struct kvm *kvm)
+{
+ struct gmem_file *f;
+ struct inode *inode;
+
+ if (!is_file_dax(file))
+ return -EINVAL;
+
+ inode = file_inode(file);
+ GMEM_I(inode)->flags |= GUEST_MEMFD_FLAG_MMAP;
+ slot->flags |= KVM_MEMSLOT_DAX_ONLY;
+
+ kvm_get_kvm(kvm);
+ f = file->private_data;
+ f->kvm = kvm;
+
+ return 0;
+}
+
int kvm_gmem_bind(struct kvm *kvm, struct kvm_memory_slot *slot,
unsigned int fd, loff_t offset)
{
@@ -660,7 +672,13 @@ int kvm_gmem_bind(struct kvm *kvm, struct kvm_memory_slot *slot,
if (!file)
return -EBADF;
- if (file->f_op != &kvm_gmem_fops)
+ if (is_file_dax(file)) {
+ r = kvm_daxfd_init(file, slot, kvm);
+ if (r)
+ goto err;
+ }
+
+ if (file->f_op != &kvm_gmem_fops && !is_file_dax(file))
goto err;
f = file->private_data;
--
2.53.0
* [RFC PATCH 08/12] fs: allow char dev to go through fallocate
From: Dave Jiang @ 2026-04-23 17:02 UTC
To: linux-cxl, nvdimm
Cc: djbw, iweiny, pasha.tatashin, mclapinski, rppt, joao.m.martins,
jic23, gourry, john, rick.p.edgecombe
Allow a device DAX character device to go through fallocate().
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
---
fs/open.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/open.c b/fs/open.c
index f328622061c5..7f74604566ac 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -322,7 +322,8 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
if (S_ISDIR(inode->i_mode))
return -EISDIR;
- if (!S_ISREG(inode->i_mode) && !S_ISBLK(inode->i_mode))
+ if (!S_ISREG(inode->i_mode) && !S_ISBLK(inode->i_mode) &&
+ !S_ISCHR(inode->i_mode))
return -ENODEV;
/* Check for wraparound */
--
2.53.0
* [RFC PATCH 09/12] dax: Add dax_get_dev_dax() helper function
From: Dave Jiang @ 2026-04-23 17:02 UTC
To: linux-cxl, nvdimm
Cc: djbw, iweiny, pasha.tatashin, mclapinski, rppt, joao.m.martins,
jic23, gourry, john, rick.p.edgecombe
For callers that need to get the dax_device from a dev_dax struct in
order to call DAX APIs, add a dax_get_dev_dax() helper.
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
---
drivers/dax/bus.c | 6 ++++++
include/linux/dax.h | 7 +++++++
2 files changed, 13 insertions(+)
diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 759163722e4c..a99db3739e45 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -1518,6 +1518,12 @@ static int dev_dax_zero_page_range(struct dax_device *dax_dev,
return 0;
}
+struct dax_device *dax_get_dev_dax(struct dev_dax *dev_dax)
+{
+ return dev_dax->dax_dev;
+}
+EXPORT_SYMBOL_GPL(dax_get_dev_dax);
+
static long dev_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
long nr_pages, enum dax_access_mode mode,
void **kaddr, unsigned long *pfn)
diff --git a/include/linux/dax.h b/include/linux/dax.h
index a5e1a3ca1a0d..da1413c8a21f 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -62,6 +62,9 @@ void set_dax_nomc(struct dax_device *dax_dev);
void set_dax_synchronous(struct dax_device *dax_dev);
size_t dax_recovery_write(struct dax_device *dax_dev, pgoff_t pgoff,
void *addr, size_t bytes, struct iov_iter *i);
+struct dev_dax;
+struct dax_device *dax_get_dev_dax(struct dev_dax *dev_dax);
+
/*
* Check if given mapping is supported by the file / underlying device.
*/
@@ -122,6 +125,10 @@ static inline size_t dax_recovery_write(struct dax_device *dax_dev,
{
return 0;
}
+static inline struct dax_device *dax_get_dev_dax(struct dev_dax *dev_dax)
+{
+ return NULL;
+}
#endif
struct writeback_control;
--
2.53.0
* [RFC PATCH 10/12] kvm: Implement dax support for KVM faulting
From: Dave Jiang @ 2026-04-23 17:02 UTC
To: linux-cxl, nvdimm
Cc: djbw, iweiny, pasha.tatashin, mclapinski, rppt, joao.m.martins,
jic23, gourry, john, rick.p.edgecombe
Add support for KVM faulting of a daxfd by using dax_direct_access().
Implement kvm_dax_get_pfn() to complete the daxfd support for KVM
faulting. A reference is taken on the page. There is no need to call
put_dev_pagemap() when put_page() happens, as recent kernel changes
take care of that within the put_page() path.
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
---
arch/x86/kvm/mmu/mmu.c | 48 +++++++++++++++++++++++++++++++++++-----
drivers/dax/bus.c | 1 +
include/linux/dax.h | 1 +
include/linux/kvm_host.h | 8 +++++++
virt/kvm/guest_memfd.c | 42 +++++++++++++++++++++++++++++++++++
5 files changed, 94 insertions(+), 6 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 02c450686b4a..fe787f73b9a8 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4588,16 +4588,52 @@ static int kvm_mmu_faultin_pfn_gmem(struct kvm_vcpu *vcpu,
return RET_PF_CONTINUE;
}
+static pgoff_t kvm_gmem_get_index(struct kvm_memory_slot *slot, gfn_t gfn)
+{
+ return gfn - slot->base_gfn + slot->gmem.pgoff;
+}
+
+static kvm_pfn_t kvm_faultin_dax_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
+{
+ kvm_pfn_t pfn;
+ pgoff_t index;
+ int rc;
+
+ if (!kvm_memslot_is_dax_only(fault->slot))
+ return KVM_PFN_ERR_FAULT;
+
+ index = kvm_gmem_get_index(fault->slot, fault->gfn);
+ rc = kvm_dax_get_pfn(fault->slot, index, &pfn, &fault->refcounted_page);
+ if (rc)
+ return KVM_PFN_ERR_FAULT;
+
+ return pfn;
+}
+
static int __kvm_mmu_faultin_pfn(struct kvm_vcpu *vcpu,
struct kvm_page_fault *fault)
{
unsigned int foll = fault->write ? FOLL_WRITE : 0;
+ gfn_t gfn = fault->gfn;
- if (fault->is_private || kvm_memslot_is_gmem_only(fault->slot))
+ if (fault->is_private || (kvm_memslot_is_gmem_only(fault->slot) &&
+ !kvm_memslot_is_dax_only(fault->slot)))
return kvm_mmu_faultin_pfn_gmem(vcpu, fault);
+ if (kvm_memslot_is_dax_only(fault->slot)) {
+ gfn = kvm_gmem_get_index(fault->slot, fault->gfn);
+ fault->pfn = kvm_faultin_dax_pfn(vcpu, fault);
+ if (fault->pfn == KVM_PFN_ERR_FAULT) {
+ kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
+ return RET_PF_INVALID;
+ }
+ fault->map_writable = !(fault->slot->flags & KVM_MEM_READONLY);
+
+ return RET_PF_CONTINUE;
+ }
+
foll |= FOLL_NOWAIT;
- fault->pfn = __kvm_faultin_pfn(fault->slot, fault->gfn, foll,
+ fault->pfn = __kvm_faultin_pfn(fault->slot, gfn, foll,
&fault->map_writable, &fault->refcounted_page);
/*
@@ -4610,9 +4646,9 @@ static int __kvm_mmu_faultin_pfn(struct kvm_vcpu *vcpu,
return RET_PF_CONTINUE;
if (!fault->prefetch && kvm_can_do_async_pf(vcpu)) {
- trace_kvm_try_async_get_page(fault->addr, fault->gfn);
- if (kvm_find_async_pf_gfn(vcpu, fault->gfn)) {
- trace_kvm_async_pf_repeated_fault(fault->addr, fault->gfn);
+ trace_kvm_try_async_get_page(fault->addr, gfn);
+ if (kvm_find_async_pf_gfn(vcpu, gfn)) {
+ trace_kvm_async_pf_repeated_fault(fault->addr, gfn);
kvm_make_request(KVM_REQ_APF_HALT, vcpu);
return RET_PF_RETRY;
} else if (kvm_arch_setup_async_pf(vcpu, fault)) {
@@ -4627,7 +4663,7 @@ static int __kvm_mmu_faultin_pfn(struct kvm_vcpu *vcpu,
*/
foll |= FOLL_INTERRUPTIBLE;
foll &= ~FOLL_NOWAIT;
- fault->pfn = __kvm_faultin_pfn(fault->slot, fault->gfn, foll,
+ fault->pfn = __kvm_faultin_pfn(fault->slot, gfn, foll,
&fault->map_writable, &fault->refcounted_page);
return RET_PF_CONTINUE;
diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index a99db3739e45..2009f34614d8 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -2,6 +2,7 @@
/* Copyright(c) 2017-2018 Intel Corporation. All rights reserved. */
#include <linux/memremap.h>
#include <linux/highmem.h>
+#include <linux/kvm_host.h>
#include <linux/device.h>
#include <linux/mutex.h>
#include <linux/list.h>
diff --git a/include/linux/dax.h b/include/linux/dax.h
index da1413c8a21f..41214b6d7897 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -5,6 +5,7 @@
#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/radix-tree.h>
+#include <linux/kvm_host.h>
typedef unsigned long dax_entry_t;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 9afce6d02d9e..ffd0381ba079 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2552,6 +2552,8 @@ static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
gfn_t gfn, kvm_pfn_t *pfn, struct page **page,
int *max_order);
+int kvm_dax_get_pfn(struct kvm_memory_slot *slot, pgoff_t index, kvm_pfn_t *pfn,
+ struct page **refcounted_page);
#else
static inline int kvm_gmem_get_pfn(struct kvm *kvm,
struct kvm_memory_slot *slot, gfn_t gfn,
@@ -2561,6 +2563,12 @@ static inline int kvm_gmem_get_pfn(struct kvm *kvm,
KVM_BUG_ON(1, kvm);
return -EIO;
}
+static inline int kvm_dax_get_pfn(struct kvm_memory_slot *slot, pgoff_t index,
+ kvm_pfn_t *pfn, struct page **refcounted_page)
+{
+ WARN_ON_ONCE(1);
+ return -EIO;
+}
#endif /* CONFIG_KVM_GUEST_MEMFD */
#ifdef CONFIG_HAVE_KVM_ARCH_GMEM_PREPARE
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 959f690c1d1d..4e7141fdb2b8 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -840,6 +840,48 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
}
EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_gmem_get_pfn);
+int kvm_dax_get_pfn(struct kvm_memory_slot *slot, pgoff_t index, kvm_pfn_t *pfn,
+ struct page **refcounted_page)
+{
+ struct dev_pagemap *pgmap;
+ struct dev_dax *dev_dax;
+ struct page *page;
+ void *kaddr;
+ long rc;
+ int id;
+
+ CLASS(gmem_get_file, file)(slot);
+ if (!file)
+ return -EFAULT;
+
+ dev_dax = file->private_data;
+ if (!dev_dax)
+ return -ENODEV;
+
+ id = dax_read_lock();
+ rc = dax_direct_access(dax_get_dev_dax(dev_dax), index, 1, DAX_ACCESS,
+ &kaddr, (unsigned long *)pfn);
+ dax_read_unlock(id);
+ if (rc < 0)
+ return rc;
+
+ /* Verify that 'struct page' exists for this PFN */
+ pgmap = get_dev_pagemap(*pfn);
+ if (!pgmap)
+ return -ENODEV;
+
+ page = pfn_to_page(*pfn);
+ if (!try_get_page(page)) {
+ put_dev_pagemap(pgmap);
+ return -EFAULT;
+ }
+
+ *refcounted_page = page;
+
+ return 0;
+}
+EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_dax_get_pfn);
+
#ifdef CONFIG_HAVE_KVM_ARCH_GMEM_POPULATE
long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long npages,
kvm_gmem_populate_cb post_populate, void *opaque)
--
2.53.0
* [RFC PATCH 11/12] kvm: Add daxfd support for supported flags
From: Dave Jiang @ 2026-04-23 17:02 UTC
To: linux-cxl, nvdimm
Cc: djbw, iweiny, pasha.tatashin, mclapinski, rppt, joao.m.martins,
jic23, gourry, john, rick.p.edgecombe
Add supported flags for daxfd, similar to what guest memfd does.
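Userspace can then probe for daxfd support before binding one, e.g.
with the selftest helpers:

	TEST_REQUIRE(kvm_has_cap(KVM_CAP_GUEST_DAXFD));
	flags = vm_check_cap(vm, KVM_CAP_GUEST_DAXFD_FLAGS);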
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
---
include/linux/kvm_host.h | 7 +++++++
include/uapi/linux/kvm.h | 4 ++++
virt/kvm/kvm_main.c | 6 ++++++
3 files changed, 17 insertions(+)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index ffd0381ba079..1427ff41cfc9 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -743,6 +743,13 @@ static inline u64 kvm_gmem_get_supported_flags(struct kvm *kvm)
}
#endif
+#ifdef CONFIG_KVM_GUEST_DAXFD
+static inline u64 kvm_dax_get_supported_flags(struct kvm *kvm)
+{
+ return GUEST_DAXFD_FLAG_MMAP;
+}
+#endif
+
#ifndef kvm_arch_has_readonly_mem
static inline bool kvm_arch_has_readonly_mem(struct kvm *kvm)
{
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index dddb781b0507..2ae3e1cdcee5 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -51,6 +51,7 @@ struct kvm_userspace_memory_region2 {
#define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0)
#define KVM_MEM_READONLY (1UL << 1)
#define KVM_MEM_GUEST_MEMFD (1UL << 2)
+#define KVM_MEM_GUEST_DAXFD (1UL << 3)
/* for KVM_IRQ_LINE */
struct kvm_irq_level {
@@ -974,6 +975,8 @@ struct kvm_enable_cap {
#define KVM_CAP_GUEST_MEMFD_FLAGS 244
#define KVM_CAP_ARM_SEA_TO_USER 245
#define KVM_CAP_S390_USER_OPEREXEC 246
+#define KVM_CAP_GUEST_DAXFD 247
+#define KVM_CAP_GUEST_DAXFD_FLAGS 248
struct kvm_irq_routing_irqchip {
__u32 irqchip;
@@ -1612,6 +1615,7 @@ struct kvm_memory_attributes {
#define KVM_CREATE_GUEST_MEMFD _IOWR(KVMIO, 0xd4, struct kvm_create_guest_memfd)
#define GUEST_MEMFD_FLAG_MMAP (1ULL << 0)
#define GUEST_MEMFD_FLAG_INIT_SHARED (1ULL << 1)
+#define GUEST_DAXFD_FLAG_MMAP (1ULL << 2)
struct kvm_create_guest_memfd {
__u64 size;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 5b5b69c97665..82d9fb65e149 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4952,6 +4952,12 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
return 1;
case KVM_CAP_GUEST_MEMFD_FLAGS:
return kvm_gmem_get_supported_flags(kvm);
+#endif
+#ifdef CONFIG_KVM_GUEST_DAXFD
+ case KVM_CAP_GUEST_DAXFD:
+ return 1;
+ case KVM_CAP_GUEST_DAXFD_FLAGS:
+ return kvm_dax_get_supported_flags(kvm);
#endif
default:
break;
--
2.53.0
* [RFC PATCH 12/12] selftest/kvm: Add daxfd support for gmem selftest
From: Dave Jiang @ 2026-04-23 17:02 UTC
To: linux-cxl, nvdimm
Cc: djbw, iweiny, pasha.tatashin, mclapinski, rppt, joao.m.martins,
jic23, gourry, john, rick.p.edgecombe
The changes are very hacked up in order to appropriately support
taking a DAX fd and using it as the backing storage for the KVM gmem
selftest. Some liberties are taken due to the difference in mechanism
between memfd and daxfd. One big difference is that the daxfd is all
or nothing: the entire dax region is given out when the char dev is
used. The test assumes a device DAX instance at /dev/dax0.1 that is at
least 4G in size.
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
---
tools/testing/selftests/kvm/Makefile.kvm | 1 +
.../testing/selftests/kvm/guest_daxfd_test.c | 329 ++++++++++++++++++
2 files changed, 330 insertions(+)
create mode 100644 tools/testing/selftests/kvm/guest_daxfd_test.c
diff --git a/tools/testing/selftests/kvm/Makefile.kvm b/tools/testing/selftests/kvm/Makefile.kvm
index d45bf4ccb3bf..851484a407ce 100644
--- a/tools/testing/selftests/kvm/Makefile.kvm
+++ b/tools/testing/selftests/kvm/Makefile.kvm
@@ -143,6 +143,7 @@ TEST_GEN_PROGS_x86 += access_tracking_perf_test
TEST_GEN_PROGS_x86 += coalesced_io_test
TEST_GEN_PROGS_x86 += dirty_log_perf_test
TEST_GEN_PROGS_x86 += guest_memfd_test
+TEST_GEN_PROGS_x86 += guest_daxfd_test
TEST_GEN_PROGS_x86 += hardware_disable_test
TEST_GEN_PROGS_x86 += memslot_modification_stress_test
TEST_GEN_PROGS_x86 += memslot_perf_test
diff --git a/tools/testing/selftests/kvm/guest_daxfd_test.c b/tools/testing/selftests/kvm/guest_daxfd_test.c
new file mode 100644
index 000000000000..d86842f2b841
--- /dev/null
+++ b/tools/testing/selftests/kvm/guest_daxfd_test.c
@@ -0,0 +1,329 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright Intel Corporation, 2023
+ *
+ * Author: Chao Peng <chao.p.peng@linux.intel.com>
+ */
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <errno.h>
+#include <stdio.h>
+#include <fcntl.h>
+
+#include <linux/bitmap.h>
+#include <linux/falloc.h>
+#include <linux/sizes.h>
+#include <sys/mman.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+
+#include "kvm_util.h"
+#include "numaif.h"
+#include "test_util.h"
+#include "ucall_common.h"
+
+static const char dax_path[] = "/dev/dax0.1";
+static const size_t dax_size = SZ_4G;
+
+static size_t page_size;
+
+static int vm_create_guest_daxfd(void)
+{
+ int fd;
+
+ fd = open(dax_path, O_RDWR | O_LARGEFILE);
+ TEST_ASSERT(fd != -1, "Cannot open %s: %s", dax_path, strerror(errno));
+
+ return fd;
+}
+
+static void test_file_read_write(int fd, size_t total_size)
+{
+ char buf[64];
+
+ TEST_ASSERT(read(fd, buf, sizeof(buf)) < 0,
+ "read on a daxfd should fail");
+ TEST_ASSERT(write(fd, buf, sizeof(buf)) < 0,
+ "write on a daxfd should fail");
+ TEST_ASSERT(pread(fd, buf, sizeof(buf), 0) < 0,
+ "pread on a daxfd should fail");
+ TEST_ASSERT(pwrite(fd, buf, sizeof(buf), 0) < 0,
+ "pwrite on a daxfd should fail");
+}
+
+static void test_mmap_supported(int fd, size_t total_size)
+{
+ const char val = 0xaa;
+ char *mem;
+ size_t i;
+ int ret;
+
+ mem = kvm_mmap(total_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd);
+
+ memset(mem, val, total_size);
+ for (i = 0; i < total_size; i++)
+ TEST_ASSERT_EQ(READ_ONCE(mem[i]), val);
+
+ ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE, 0,
+ page_size);
+ TEST_ASSERT(!ret, "fallocate the first page should succeed.");
+
+ for (i = 0; i < page_size; i++)
+ TEST_ASSERT_EQ(READ_ONCE(mem[i]), 0x00);
+ for (; i < total_size; i++)
+ TEST_ASSERT_EQ(READ_ONCE(mem[i]), val);
+
+ memset(mem, val, page_size);
+ for (i = 0; i < total_size; i++)
+ TEST_ASSERT_EQ(READ_ONCE(mem[i]), val);
+
+ kvm_munmap(mem, total_size);
+}
+
+static void test_fault_sigbus(int fd, size_t accessible_size, size_t map_size)
+{
+ const char val = 0xaa;
+ char *mem;
+ size_t i;
+
+ mem = kvm_mmap(map_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd);
+
+ TEST_EXPECT_SIGBUS(memset(mem, val, map_size));
+ TEST_EXPECT_SIGBUS((void)READ_ONCE(mem[accessible_size]));
+
+ for (i = 0; i < accessible_size; i++)
+ TEST_ASSERT_EQ(READ_ONCE(mem[i]), val);
+
+ kvm_munmap(mem, map_size);
+}
+
+static void test_fault_overflow(int fd, size_t total_size)
+{
+ total_size = dax_size;
+
+ test_fault_sigbus(fd, total_size, total_size * 4);
+}
+
+static void test_file_size(int fd, size_t total_size)
+{
+ struct stat sb;
+ int ret;
+
+ ret = fstat(fd, &sb);
+ TEST_ASSERT(!ret, "fstat should succeed");
+ /* Can't test total size because with dax you get the whole device size */
+ //TEST_ASSERT_EQ(sb.st_size, total_size);
+ TEST_ASSERT_EQ(sb.st_blksize, page_size);
+}
+
+static void test_fallocate(int fd, size_t total_size)
+{
+ int ret;
+
+ total_size = dax_size;
+
+ ret = fallocate(fd, FALLOC_FL_KEEP_SIZE, 0, total_size);
+ TEST_ASSERT(!ret, "fallocate with aligned offset and size should succeed");
+
+ ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
+ page_size - 1, page_size);
+ TEST_ASSERT(ret, "fallocate with unaligned offset should fail");
+
+ ret = fallocate(fd, FALLOC_FL_KEEP_SIZE, total_size, page_size);
+ TEST_ASSERT(ret, "fallocate beginning at total_size should fail");
+
+ ret = fallocate(fd, FALLOC_FL_KEEP_SIZE, total_size + page_size, page_size);
+ TEST_ASSERT(ret, "fallocate beginning after total_size should fail");
+
+ /*
+ * The next 2 have opposite behavior of a file. DAX is finite and
+ * therefore those should fail.
+ */
+ ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
+ total_size, page_size);
+ TEST_ASSERT(ret, "fallocate(PUNCH_HOLE) at total_size should fail");
+
+ ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
+ total_size + page_size, page_size);
+ TEST_ASSERT(ret, "fallocate(PUNCH_HOLE) after total_size should fail");
+
+
+ ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
+ page_size, page_size - 1);
+ TEST_ASSERT(ret, "fallocate with unaligned size should fail");
+
+ ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
+ page_size, page_size);
+ TEST_ASSERT(!ret, "fallocate(PUNCH_HOLE) with aligned offset and size should succeed");
+
+ ret = fallocate(fd, FALLOC_FL_KEEP_SIZE, page_size, page_size);
+ TEST_ASSERT(!ret, "fallocate to restore punched hole should succeed");
+}
+
+static void test_invalid_punch_hole(int fd, size_t total_size)
+{
+ struct {
+ off_t offset;
+ off_t len;
+ } testcases[] = {
+ {0, 1},
+ {0, page_size - 1},
+ {0, page_size + 1},
+
+ {1, 1},
+ {1, page_size - 1},
+ {1, page_size},
+ {1, page_size + 1},
+
+ {page_size, 1},
+ {page_size, page_size - 1},
+ {page_size, page_size + 1},
+ };
+ int ret, i;
+
+ for (i = 0; i < ARRAY_SIZE(testcases); i++) {
+ ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
+ testcases[i].offset, testcases[i].len);
+ TEST_ASSERT(ret == -1 && errno == EINVAL,
+ "PUNCH_HOLE with !PAGE_SIZE offset (%lx) and/or length (%lx) should fail",
+ testcases[i].offset, testcases[i].len);
+ }
+}
+
+static void test_create_guest_daxfd(void)
+{
+ int fd1, ret;
+ struct stat st1;
+
+ fd1 = vm_create_guest_daxfd();
+ TEST_ASSERT(fd1 != -1, "daxfd open should succeed");
+
+ ret = fstat(fd1, &st1);
+ TEST_ASSERT(ret != -1, "daxfd fstat should succeed");
+
+ close(fd1);
+}
+
+#define gmem_test(__test, __vm, __flags) \
+do { \
+ int fd = vm_create_guest_daxfd(); \
+ \
+ test_##__test(fd, page_size * 4); \
+ close(fd); \
+} while (0)
+
+static void __test_guest_daxfd(struct kvm_vm *vm, uint64_t flags)
+{
+ pr_info("Testing guest_daxfd with flags 0x%lx\n", flags);
+ test_create_guest_daxfd();
+ pr_info("test create_guest_daxfd() passed\n");
+
+ gmem_test(file_read_write, vm, flags);
+ pr_info("test file_read_write passed\n");
+
+ gmem_test(mmap_supported, vm, flags);
+ pr_info("test mmap_supported passed\n");
+ gmem_test(fault_overflow, vm, flags);
+ pr_info("test fault overflow passed\n");
+
+ gmem_test(file_size, vm, flags);
+ pr_info("test file_size passed\n");
+ gmem_test(fallocate, vm, flags);
+ pr_info("test fallocate passed\n");
+ gmem_test(invalid_punch_hole, vm, flags);
+ pr_info("test invalid_punch_hole passed\n");
+}
+
+static void test_guest_daxfd(unsigned long vm_type)
+{
+ struct kvm_vm *vm = vm_create_barebones_type(vm_type);
+
+ __test_guest_daxfd(vm, 0);
+
+ kvm_vm_free(vm);
+}
+
+static void guest_code(uint8_t *mem, uint64_t size)
+{
+ size_t i;
+
+ for (i = 0; i < size; i++)
+ __GUEST_ASSERT(mem[i] == 0xaa,
+ "Guest expected 0xaa at offset %lu, got 0x%x", i, mem[i]);
+
+ memset(mem, 0xff, size);
+ GUEST_DONE();
+}
+
+static void test_guest_daxfd_guest(void)
+{
+ /*
+ * Skip the first 4gb and slot0. slot0 maps <1gb and is used to back
+ * the guest's code, stack, and page tables, and low memory contains
+ * the PCI hole and other MMIO regions that need to be avoided.
+ */
+ const uint64_t gpa = SZ_4G;
+ const uint64_t test_size = SZ_4G;
+ const int slot = 1;
+
+ struct kvm_vcpu *vcpu;
+ struct kvm_vm *vm;
+ uint8_t *mem;
+ size_t size;
+ int fd, i;
+
+ if (!kvm_check_cap(KVM_CAP_GUEST_DAXFD_FLAGS))
+ return;
+
+ vm = __vm_create_shape_with_one_vcpu(VM_SHAPE_DEFAULT, &vcpu, 1, guest_code);
+
+ TEST_ASSERT(vm_check_cap(vm, KVM_CAP_GUEST_DAXFD_FLAGS) & GUEST_DAXFD_FLAG_MMAP,
+ "Default VM type should support MMAP, supported flags = 0x%x",
+ vm_check_cap(vm, KVM_CAP_GUEST_DAXFD_FLAGS));
+
+ size = vm->page_size;
+ fd = vm_create_guest_daxfd();
+
+ /* Do we need to do this? It's necessary for gmem fd */
+ vm_set_user_memory_region2(vm, slot, KVM_MEM_GUEST_DAXFD, gpa, size, NULL, fd, 0);
+
+ mem = kvm_mmap(test_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd);
+ memset(mem, 0xaa, test_size);
+ kvm_munmap(mem, test_size);
+
+ virt_pg_map(vm, gpa, gpa);
+ vcpu_args_set(vcpu, 2, gpa, size);
+ vcpu_run(vcpu);
+
+ TEST_ASSERT_EQ(get_ucall(vcpu, NULL), UCALL_DONE);
+
+ mem = kvm_mmap(size, PROT_READ | PROT_WRITE, MAP_SHARED, fd);
+ for (i = 0; i < size; i++)
+ TEST_ASSERT_EQ(mem[i], 0xff);
+
+ close(fd);
+ kvm_vm_free(vm);
+}
+
+int main(int argc, char *argv[])
+{
+ unsigned long vm_types, vm_type;
+
+ TEST_REQUIRE(kvm_has_cap(KVM_CAP_GUEST_DAXFD));
+
+ page_size = getpagesize();
+
+ /*
+ * Not all architectures support KVM_CAP_VM_TYPES. However, those that
+ * support guest_memfd have that support for the default VM type.
+ */
+ vm_types = kvm_check_cap(KVM_CAP_VM_TYPES);
+ if (!vm_types)
+ vm_types = BIT(VM_TYPE_DEFAULT);
+
+ for_each_set_bit(vm_type, &vm_types, BITS_PER_TYPE(vm_types))
+ test_guest_daxfd(vm_type);
+
+ test_guest_daxfd_guest();
+}
--
2.53.0
* Re: [RFC PATCH 00/12] dax: Add DAX to guest memfd support for KVM
From: Pasha Tatashin @ 2026-04-23 17:27 UTC
To: Dave Jiang
Cc: linux-cxl, nvdimm, djbw, iweiny, pasha.tatashin, mclapinski, rppt,
joao.m.martins, jic23, gourry, john, rick.p.edgecombe
Hi Dave,
On 04-23 10:02, Dave Jiang wrote:
> This RFC series is created as a proof of concept to connect device DAX to guest
> memory by riding on top of guest memfd in order to prove out that device DAX
> can be used as guest memory. The series seeks to jump start a discussion on
> if there are interests in creating a DAX bridge to utilize CXL memory for guest
> memory until the N_PRIVATE implementation by Gregory [1] is available upstream
> and DAX users are ready to move to the new scheme. Once there's an established
> consensus of interest, we can move the discussion to the best way to implement
> the DAX bridge and the future of device DAX as guest.
I cannot speak to the CXL/DAX use case, but I can provide perspective
from a persistence point of view. Currently, as a temporary workaround,
we are using emulated pmem in DevDax mode for live update purposes.
However, going forward, our plan is to switch to regular memory and use
LUO + memfd/guestmemfd backed by regular RAM to preserve resources.
We are working on a patch series that we plan to send out in the coming
weeks to preserve guestmemfd via LUO.
By design, all resources that participate and need to be preserved
across reboots for live update purposes must have FD handlers.
Does your series allow DAX memory with 1G alignment (i.e. 1G pages) to
back guest_memfd? That is also an interesting use case; while HugeTLB
support for guest_memfd is in progress, it has not yet landed.
> I did the bare minimal to get the PoC to pass a modified version of KVM gmem
> selftest (guest_memfd_test) in order to prove out that DAX can go in the gmem
> path. A DAX char dev is created and the fd is passed in user space with
> vm_set_user_memory_region2(). The DAX region is passed in as a whole when used
> unlike memfd where any size can be passed in to be allocated.
>
> The folks on the cc line are people that Dan Williams has mentioned that may be
> of interest to this.
>
> [1]: https://lore.kernel.org/linux-cxl/aeWV1CvP9ImZ3eEG@gourry-fedora-PF4VCD3F/T/#t
>
>
> Dave Jiang (12):
> dax: rate limit dev_dax_huge_fault() output
> dax: Save the kva from memremap
> dax: Add fallocate support to device dax
> dax: Move dax_pgoff_to_phys() to dax bus to be used by dev dax
> dax: Add dax_operations and supporting functions to device dax
> dax: Add helper to determine if a 'struct file' supports dax
> KVM: guest_memfd: Add setup of daxfd when binding gmem
> fs: allow char dev to go through fallocate
> dax: Add dax_get_dev_dax() helper function
> kvm: Implement dax support for KVM faulting
> kvm: Add daxfd support for supported flags
> selftest/kvm: Add daxfd support for gmem selftest
>
> arch/x86/kvm/Kconfig | 1 +
> arch/x86/kvm/mmu/mmu.c | 48 ++-
> drivers/dax/bus.c | 132 ++++++-
> drivers/dax/dax-private.h | 8 +
> drivers/dax/device.c | 80 +++--
> fs/open.c | 3 +-
> include/linux/dax.h | 15 +
> include/linux/kvm_host.h | 39 +++
> include/uapi/linux/kvm.h | 4 +
> tools/testing/selftests/kvm/Makefile.kvm | 1 +
> .../testing/selftests/kvm/guest_daxfd_test.c | 329 ++++++++++++++++++
> virt/kvm/Kconfig | 4 +
> virt/kvm/guest_memfd.c | 92 ++++-
> virt/kvm/kvm_main.c | 6 +
> 14 files changed, 711 insertions(+), 51 deletions(-)
> create mode 100644 tools/testing/selftests/kvm/guest_daxfd_test.c
>
>
> base-commit: 05f7e89ab9731565d8a62e3b5d1ec206485eeb0b
> --
> 2.53.0
>
* Re: [RFC PATCH 00/12] dax: Add DAX to guest memfd support for KVM
From: Dave Jiang @ 2026-04-23 18:08 UTC
To: Pasha Tatashin
Cc: linux-cxl, nvdimm, djbw, iweiny, mclapinski, rppt, joao.m.martins,
jic23, gourry, john, rick.p.edgecombe
On 4/23/26 10:27 AM, Pasha Tatashin wrote:
> Hi Dave,
>
Hi Pasha,
> On 04-23 10:02, Dave Jiang wrote:
>> This RFC series is created as a proof of concept to connect device DAX to guest
>> memory by riding on top of guest memfd in order to prove out that device DAX
>> can be used as guest memory. The series seeks to jump start a discussion on
>> if there are interests in creating a DAX bridge to utilize CXL memory for guest
>> memory until the N_PRIVATE implementation by Gregory [1] is available upstream
>> and DAX users are ready to move to the new scheme. Once there's an established
>> consensus of interest, we can move the discussion to the best way to implement
>> the DAX bridge and the future of device DAX as guest.
>
> I cannot speak to the CXL/DAX use case, but I can provide perspective
> from a persistence point of view. Currently, as a temporary workaround,
> we are using emulated pmem in DevDax mode for live update purposes.
> However, going forward, our plan is to switch to regular memory and use
> LUO + memfd/guestmemfd backed by regular RAM to preserve resources.
>
> We are working on a patch series that we plan to send out in the coming
> weeks to preserve guestmemfd via LUO.
>
> By design, all resources that participate and need to be preserved
> across reboots for live update purposes must have FD handlers.
>
> Does your series allow DAX memory with 1G alignment (i.e. 1G pages) to
> back guest_memfd? That is also an interesting use case, while HugeTLB
> support for guest_memfd is in progress, it still has not yet landed.
I just did a quick and dirty hack to get 4k working. It may need some
additional plumbing to enable 1G alignment, but I don't see any reason
why that can't be done. Do you see any technical barriers? While it
rides some of the paths that set up the gmem fd, the faulting path is
through DAX and does not go through gmem. Maybe if we actually
implement this, it can split out via clean dedicated DAX paths and not
ride on top of the gmem fd? If we determine there is interest in doing
this enabling, then at that time we can go to the next step and discuss
what we want the interface to look like.
DJ
>
>> I did the bare minimal to get the PoC to pass a modified version of KVM gmem
>> selftest (guest_memfd_test) in order to prove out that DAX can go in the gmem
>> path. A DAX char dev is created and the fd is passed in user space with
>> vm_set_user_memory_region2(). The DAX region is passed in as a whole when used
>> unlike memfd where any size can be passed in to be allocated.
>>
>> The folks on the cc line are people that Dan Williams has mentioned that may be
>> of interest to this.
>>
>> [1]: https://lore.kernel.org/linux-cxl/aeWV1CvP9ImZ3eEG@gourry-fedora-PF4VCD3F/T/#t
>>
>>
>> Dave Jiang (12):
>> dax: rate limit dev_dax_huge_fault() output
>> dax: Save the kva from memremap
>> dax: Add fallocate support to device dax
>> dax: Move dax_pgoff_to_phys() to dax bus to be used by dev dax
>> dax: Add dax_operations and supporting functions to device dax
>> dax: Add helper to determine if a 'struct file' supports dax
>> KVM: guest_memfd: Add setup of daxfd when binding gmem
>> fs: allow char dev to go through fallocate
>> dax: Add dax_get_dev_dax() helper function
>> kvm: Implement dax support for KVM faulting
>> kvm: Add daxfd support for supported flags
>> selftest/kvm: Add daxfd support for gmem selftest
>>
>> arch/x86/kvm/Kconfig | 1 +
>> arch/x86/kvm/mmu/mmu.c | 48 ++-
>> drivers/dax/bus.c | 132 ++++++-
>> drivers/dax/dax-private.h | 8 +
>> drivers/dax/device.c | 80 +++--
>> fs/open.c | 3 +-
>> include/linux/dax.h | 15 +
>> include/linux/kvm_host.h | 39 +++
>> include/uapi/linux/kvm.h | 4 +
>> tools/testing/selftests/kvm/Makefile.kvm | 1 +
>> .../testing/selftests/kvm/guest_daxfd_test.c | 329 ++++++++++++++++++
>> virt/kvm/Kconfig | 4 +
>> virt/kvm/guest_memfd.c | 92 ++++-
>> virt/kvm/kvm_main.c | 6 +
>> 14 files changed, 711 insertions(+), 51 deletions(-)
>> create mode 100644 tools/testing/selftests/kvm/guest_daxfd_test.c
>>
>>
>> base-commit: 05f7e89ab9731565d8a62e3b5d1ec206485eeb0b
>> --
>> 2.53.0
>>
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC PATCH 00/12] dax: Add DAX to guest memfd support for KVM
2026-04-23 18:08 ` Dave Jiang
@ 2026-04-23 18:21 ` Dave Jiang
0 siblings, 0 replies; 28+ messages in thread
From: Dave Jiang @ 2026-04-23 18:21 UTC (permalink / raw)
To: Pasha Tatashin
Cc: linux-cxl, nvdimm, djbw, iweiny, mclapinski, rppt, joao.m.martins,
jic23, gourry, john, rick.p.edgecombe
On 4/23/26 11:08 AM, Dave Jiang wrote:
>
>
> On 4/23/26 10:27 AM, Pasha Tatashin wrote:
>> Hi Dave,
>>
> Hi Pasha,
>
>> On 04-23 10:02, Dave Jiang wrote:
>>> This RFC series is created as a proof of concept to connect device DAX to guest
>>> memory by riding on top of guest memfd in order to prove out that device DAX
>>> can be used as guest memory. The series seeks to jump-start a discussion on
>>> whether there is interest in creating a DAX bridge to utilize CXL memory for guest
>>> memory until the N_PRIVATE implementation by Gregory [1] is available upstream
>>> and DAX users are ready to move to the new scheme. Once there's an established
>>> consensus of interest, we can move the discussion to the best way to implement
>>> the DAX bridge and the future of device DAX as guest memory.
>>
>> I cannot speak to the CXL/DAX use case, but I can provide perspective
>> from a persistence point of view. Currently, as a temporary workaround,
>> we are using emulated pmem in DevDax mode for live update purposes.
>> However, going forward, our plan is to switch to regular memory and use
>> LUO + memfd/guestmemfd backed by regular RAM to preserve resources.
I also want to mention that while I'm motivated by CXL as the backing memory, the enabling covers DAX backed by any backend.
DJ
>>
>> We are working on a patch series that we plan to send out in the coming
>> weeks to preserve guestmemfd via LUO.
>>
>> By design, all resources that participate and need to be preserved
>> across reboots for live update purposes must have FD handlers.
>>
>> Does your series allow DAX memory with 1G alignment (i.e. 1G pages) to
>> back guest_memfd? That is also an interesting use case; while HugeTLB
>> support for guest_memfd is in progress, it has not yet landed.
>
> I just did a quick and dirty hack to get 4k working. It may need some
> additional plumbing to enable 1G alignment, but I don't see any reason why
> that can't be done. Do you see any technical barriers? While it rides some
> of the setup paths for gmem_fd, the faulting path goes through DAX and does
> not go through gmem. If we actually end up implementing this, it may be
> split out into clean, dedicated DAX paths rather than ride on top of
> gmem_fd? If we determine there is interest in this enabling, then at that
> point we can go to the next step and discuss how we want the interface to look.
>
> DJ
>
>>
>>> I did the bare minimum to get the PoC to pass a modified version of the KVM gmem
>>> selftest (guest_memfd_test) in order to prove out that DAX can go through the gmem
>>> path. A DAX char dev is created and the fd is passed in from user space with
>>> vm_set_user_memory_region2(). The DAX region is passed in as a whole when used,
>>> unlike memfd where any size can be passed in to be allocated.
>>>
>>> The folks on the cc line are people that Dan Williams mentioned may be
>>> interested in this.
>>>
>>> [1]: https://lore.kernel.org/linux-cxl/aeWV1CvP9ImZ3eEG@gourry-fedora-PF4VCD3F/T/#t
>>>
>>>
>>> Dave Jiang (12):
>>> dax: rate limit dev_dax_huge_fault() output
>>> dax: Save the kva from memremap
>>> dax: Add fallocate support to device dax
>>> dax: Move dax_pgoff_to_phys() to dax bus to be used by dev dax
>>> dax: Add dax_operations and supporting functions to device dax
>>> dax: Add helper to determine if a 'struct file' supports dax
>>> KVM: guest_memfd: Add setup of daxfd when binding gmem
>>> fs: allow char dev to go through fallocate
>>> dax: Add dax_get_dev_dax() helper function
>>> kvm: Implement dax support for KVM faulting
>>> kvm: Add daxfd support for supported flags
>>> selftest/kvm: Add daxfd support for gmem selftest
>>>
>>> arch/x86/kvm/Kconfig | 1 +
>>> arch/x86/kvm/mmu/mmu.c | 48 ++-
>>> drivers/dax/bus.c | 132 ++++++-
>>> drivers/dax/dax-private.h | 8 +
>>> drivers/dax/device.c | 80 +++--
>>> fs/open.c | 3 +-
>>> include/linux/dax.h | 15 +
>>> include/linux/kvm_host.h | 39 +++
>>> include/uapi/linux/kvm.h | 4 +
>>> tools/testing/selftests/kvm/Makefile.kvm | 1 +
>>> .../testing/selftests/kvm/guest_daxfd_test.c | 329 ++++++++++++++++++
>>> virt/kvm/Kconfig | 4 +
>>> virt/kvm/guest_memfd.c | 92 ++++-
>>> virt/kvm/kvm_main.c | 6 +
>>> 14 files changed, 711 insertions(+), 51 deletions(-)
>>> create mode 100644 tools/testing/selftests/kvm/guest_daxfd_test.c
>>>
>>>
>>> base-commit: 05f7e89ab9731565d8a62e3b5d1ec206485eeb0b
>>> --
>>> 2.53.0
>>>
>
>
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC PATCH 00/12] dax: Add DAX to guest memfd support for KVM
2026-04-23 17:02 [RFC PATCH 00/12] dax: Add DAX to guest memfd support for KVM Dave Jiang
` (12 preceding siblings ...)
2026-04-23 17:27 ` [RFC PATCH 00/12] dax: Add DAX to guest memfd support for KVM Pasha Tatashin
@ 2026-04-24 3:43 ` Gregory Price
2026-04-24 17:38 ` Frank van der Linden
2026-04-29 13:21 ` Ira Weiny
2026-04-24 17:13 ` Frank van der Linden
14 siblings, 2 replies; 28+ messages in thread
From: Gregory Price @ 2026-04-24 3:43 UTC (permalink / raw)
To: Dave Jiang
Cc: linux-cxl, nvdimm, djbw, iweiny, pasha.tatashin, mclapinski, rppt,
joao.m.martins, jic23, john, rick.p.edgecombe
On Thu, Apr 23, 2026 at 10:02:07AM -0700, Dave Jiang wrote:
> This RFC series is created as a proof of concept to connect device DAX to guest
> memory by riding on top of guest memfd in order to prove out that device DAX
> can be used as guest memory. The series seeks to jump-start a discussion on
> whether there is interest in creating a DAX bridge to utilize CXL memory for guest
> memory until the N_PRIVATE implementation by Gregory [1] is available upstream
> and DAX users are ready to move to the new scheme. Once there's an established
> consensus of interest, we can move the discussion to the best way to implement
> the DAX bridge and the future of device DAX as guest memory.
>
> I did the bare minimum to get the PoC to pass a modified version of the KVM gmem
> selftest (guest_memfd_test) in order to prove out that DAX can go through the gmem
> path. A DAX char dev is created and the fd is passed in from user space with
> vm_set_user_memory_region2(). The DAX region is passed in as a whole when used,
> unlike memfd where any size can be passed in to be allocated.
>
> The folks on the cc line are people that Dan Williams mentioned may be
> interested in this.
I see these as *mildly* orthogonal, but I think maybe you should propose
a discussion at LSF to talk about this.
guest_memfd in particular wants the host to never map the memory - and
guests *generally* want 1GB huge page support (TLB go brrrrr).
There's a real argument for just handing a physical memory region over
to guest_memfd and making it manage the region manually, rather than
doing a bunch of nonsense just so you can call alloc_pages_node()
So I see an extension like this as genuinely useful regardless of
whether private nodes actually end up merged. It's a matter of
flexibility and use cases.
With this plumbing, you get less flexible use of the memory (you're tied
to dax abstractions), whereas with private nodes you can build slightly
more flexible general-system support.
IN THEORY you could add something like an NP_OP_NOMAP to private nodes
to make the buddy manage pages that don't have a direct map - BUT - in
practice that's likely to be more of a bodge than a good design.
So I will say - to the detriment of private nodes ;] - I like this idea.
The question is ultimately how much flexibility you need to shuffle this
capacity from one guest to another.
~Gregory
^ permalink raw reply [flat|nested] 28+ messages in thread* Re: [RFC PATCH 00/12] dax: Add DAX to guest memfd support for KVM
2026-04-24 3:43 ` Gregory Price
@ 2026-04-24 17:38 ` Frank van der Linden
2026-04-29 13:21 ` Ira Weiny
1 sibling, 0 replies; 28+ messages in thread
From: Frank van der Linden @ 2026-04-24 17:38 UTC (permalink / raw)
To: Gregory Price
Cc: Dave Jiang, linux-cxl, nvdimm, djbw, iweiny, pasha.tatashin,
mclapinski, rppt, joao.m.martins, jic23, john, rick.p.edgecombe
On Thu, Apr 23, 2026 at 8:43 PM Gregory Price <gourry@gourry.net> wrote:
>
> On Thu, Apr 23, 2026 at 10:02:07AM -0700, Dave Jiang wrote:
> > This RFC series is created as a proof of concept to connect device DAX to guest
> > memory by riding on top of guest memfd in order to prove out that device DAX
> > can be used as guest memory. The series seeks to jump-start a discussion on
> > whether there is interest in creating a DAX bridge to utilize CXL memory for guest
> > memory until the N_PRIVATE implementation by Gregory [1] is available upstream
> > and DAX users are ready to move to the new scheme. Once there's an established
> > consensus of interest, we can move the discussion to the best way to implement
> > the DAX bridge and the future of device DAX as guest memory.
> >
> > I did the bare minimum to get the PoC to pass a modified version of the KVM gmem
> > selftest (guest_memfd_test) in order to prove out that DAX can go through the gmem
> > path. A DAX char dev is created and the fd is passed in from user space with
> > vm_set_user_memory_region2(). The DAX region is passed in as a whole when used,
> > unlike memfd where any size can be passed in to be allocated.
> >
> > The folks on the cc line are people that Dan Williams mentioned may be
> > interested in this.
>
> I see these as *mildly* orthogonal, but I think maybe you should propose
> a discussion at LSF to talk about this.
>
> guest_memfd in particular wants the host to never map the memory - and
> guests *generally* want 1GB huge page support (TLB go brrrrr).
>
> There's a real argument for just handing a physical memory region over
> to guest_memfd and making it manage the region manually, rather than
> doing a bunch of nonsense just so you can call alloc_pages_node()
>
Yes, I agree. I originally planned to do that as a project - I had a
prototype of a physical memory pool allocator that just sets memory
aside at boot. But I quickly concluded that you still need folios;
they are the currency of memory management and hard to avoid (and
maybe they shouldn't be avoided).
So, I ended up hotplugging the memory in as ZONE_DEVICE memory, which
can then be used as a new backend allocator to guest_memfd.
I like the private node idea a lot, but for guest memory, you often
don't want the buddy allocator involved, like you mentioned.
I think there's room for both. As I mentioned in my previous
message, if guest_memfd moves to a model where there can be registered
backend allocators, both could be used. E.g. a ZONE_DEVICE allocator
which hands out 1G, HVO-ed pages, and a private node allocator with
buddy and LRU functionality (which you may need if you're running
memory-overcommitted VMs).
- Frank
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC PATCH 00/12] dax: Add DAX to guest memfd support for KVM
2026-04-24 3:43 ` Gregory Price
2026-04-24 17:38 ` Frank van der Linden
@ 2026-04-29 13:21 ` Ira Weiny
2026-04-29 23:58 ` Gregory Price
1 sibling, 1 reply; 28+ messages in thread
From: Ira Weiny @ 2026-04-29 13:21 UTC (permalink / raw)
To: Gregory Price, Dave Jiang
Cc: linux-cxl, nvdimm, djbw, iweiny, pasha.tatashin, mclapinski, rppt,
joao.m.martins, jic23, john, rick.p.edgecombe
Gregory Price wrote:
> On Thu, Apr 23, 2026 at 10:02:07AM -0700, Dave Jiang wrote:
> > This RFC series is created as a proof of concept to connect device DAX to guest
> > memory by riding on top of guest memfd in order to prove out that device DAX
> > can be used as guest memory. The series seeks to jump-start a discussion on
> > whether there is interest in creating a DAX bridge to utilize CXL memory for guest
> > memory until the N_PRIVATE implementation by Gregory [1] is available upstream
> > and DAX users are ready to move to the new scheme. Once there's an established
> > consensus of interest, we can move the discussion to the best way to implement
> > the DAX bridge and the future of device DAX as guest memory.
> >
> > I did the bare minimum to get the PoC to pass a modified version of the KVM gmem
> > selftest (guest_memfd_test) in order to prove out that DAX can go through the gmem
> > path. A DAX char dev is created and the fd is passed in from user space with
> > vm_set_user_memory_region2(). The DAX region is passed in as a whole when used,
> > unlike memfd where any size can be passed in to be allocated.
> >
> > The folks on the cc line are people that Dan Williams mentioned may be
> > interested in this.
>
> I see these as *mildly* orthogonal, but I think maybe you should propose
> a discussion at LSF to talk about this.
Sorry, I was a bit delayed on this thread due to some email issues.
Yes, this should be talked about at LSF if possible. But based on the
responses we have seen here, I also think this is still a ways off.
>
> guest_memfd in particular wants the host to never map the memory - and
> guests *generally* want 1GB huge page support (TLB go brrrrr).
That is _not_ going to be true forever. There is ongoing work to create
shared gmem for various reasons. For secure guests this is at least
useful for initial population of memory before handing it to a guest.
>
> There's a real argument for just handing a physical memory region over
> to guest_memfd and making it manage the region manually, rather than
> doing a bunch of nonsense just so you can call alloc_pages_node()
Agreed.
>
> So I see an extension like this as genuinely useful regardless of
> whether private nodes actually end up merged. It's a matter of
> flexibility and use cases.
Yep, the initial talks we had with Dan were about trying to get DAX FDs to
be more mainstream. Given some of the other work, it may be better to
deprecate DAX FDs. But deprecations can take a long time, so what Dave
came up with here tries to help modernize those fds to be more useful
for guest computing.
Also, depending on existing use cases, this may be easier for folks to
adopt? But it may have more rough edges than it is worth?
>
> With this plumbing, you get less flexible use of the memory (you're tied
> to dax abstractions), whereas with private nodes you can build slightly
> more flexible general-system support.
>
> IN THEORY you could add something like an NP_OP_NOMAP to private nodes
> to make the buddy manage pages that don't have a direct map - BUT - in
> practice that's likely to be more of a bodge than a good design.
>
> So I will say - to the detriment of private nodes ;] - I like this idea.
I've investigated using private nodes as a mechanism for guest_memfd to
draw from. I think this is along the lines of what Frank mentioned
elsewhere in this thread.
>
> The question is ultimately how much flexibility you need to shuffle this
> capacity from one guest to another.
Yep. And how much control one needs over which exact CXL/DAX devices the
memory comes from. As you know from our community calls, that is one thing
I'm not sure the private node idea is great at. But it could be that this
is not really required, or is best handled as a carve-out.
Ira
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC PATCH 00/12] dax: Add DAX to guest memfd support for KVM
2026-04-29 13:21 ` Ira Weiny
@ 2026-04-29 23:58 ` Gregory Price
0 siblings, 0 replies; 28+ messages in thread
From: Gregory Price @ 2026-04-29 23:58 UTC (permalink / raw)
To: Ira Weiny
Cc: Dave Jiang, linux-cxl, nvdimm, djbw, iweiny, pasha.tatashin,
mclapinski, rppt, joao.m.martins, jic23, john, rick.p.edgecombe
On Wed, Apr 29, 2026 at 08:21:00AM -0500, Ira Weiny wrote:
> Gregory Price wrote:
I think we're largely in agreement here, so trimming a bunch of this.
> >
> > The question is ultimately how much flexibility you need to shuffle this
> > capacity from one guest to another.
>
> Yep. And how much control one needs over which exact CXL/DAX devices the
> memory comes from. As you know from our community calls, that is one thing
> I'm not sure the private node idea is great at. But it could be that this
> is not really required, or is best handled as a carve-out.
>
If you can do Device<->Node mappings, then this is trivial.
If you need more specific handling, then private nodes are not the way.
The intent of private nodes is to make mmap()/malloc() etc. functional in
a heterogeneous memory world, which is explicitly different from "give me
a specific chunk of physical capacity".
Which, tl;dr: "There's a real argument for just handing guest_memfd a
chunk of unmapped memory and making it deal with the problem".
(shared gmem is... odd... given the original intent was not to do this :P)
~Gregory
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC PATCH 00/12] dax: Add DAX to guest memfd support for KVM
2026-04-23 17:02 [RFC PATCH 00/12] dax: Add DAX to guest memfd support for KVM Dave Jiang
` (13 preceding siblings ...)
2026-04-24 3:43 ` Gregory Price
@ 2026-04-24 17:13 ` Frank van der Linden
2026-04-24 18:23 ` Dave Jiang
14 siblings, 1 reply; 28+ messages in thread
From: Frank van der Linden @ 2026-04-24 17:13 UTC (permalink / raw)
To: Dave Jiang
Cc: linux-cxl, nvdimm, djbw, iweiny, pasha.tatashin, mclapinski, rppt,
joao.m.martins, jic23, gourry, john, rick.p.edgecombe
Dave Jiang <dave.jiang@intel.com> wrote:
> This RFC series is created as a proof of concept to connect device DAX to guest
> memory by riding on top of guest memfd in order to prove out that device DAX
> can be used as guest memory. The series seeks to jump-start a discussion on
> whether there is interest in creating a DAX bridge to utilize CXL memory for guest
> memory until the N_PRIVATE implementation by Gregory [1] is available upstream
> and DAX users are ready to move to the new scheme. Once there's an established
> consensus of interest, we can move the discussion to the best way to implement
> the DAX bridge and the future of device DAX as guest memory.
>
> I did the bare minimum to get the PoC to pass a modified version of the KVM gmem
> selftest (guest_memfd_test) in order to prove out that DAX can go through the gmem
> path. A DAX char dev is created and the fd is passed in from user space with
> vm_set_user_memory_region2(). The DAX region is passed in as a whole when used,
> unlike memfd where any size can be passed in to be allocated.
>
> The folks on the cc line are people that Dan Williams mentioned may be
> interested in this.
>
> [1]: https://lore.kernel.org/linux-cxl/aeWV1CvP9ImZ3eEG@gourry-fedora-PF4VCD3F/T/#t
One of the main ideas behind guest_memfd is that the memory is managed
by the kernel only, so it knows what it has and that it can trust
the memory. This RFC passes an fd in via the ioctl(), which I think
breaks that model.
Since there is interest in several different allocation backends
(default, hugetlb, zone_device), it might be better to use a model
where guest_memfd has the option for backend allocators to register
themselves in the kernel. The ioctl can then select one by its
id/name (could be just a string). They can be configured using
e.g. sysfs (like hugetlb already is).
This would also allow easy experimentation with new allocators,
having an allocator with BPF control, etc.
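A rough sketch of what that registration could look like (all of these
names are hypothetical; nothing like this exists upstream today):

        /* Hypothetical interface, invented names for illustration only */
        struct gmem_allocator_ops {
                struct folio *(*alloc_folio)(struct inode *inode, pgoff_t index);
                void (*free_folio)(struct folio *folio);
        };

        struct gmem_allocator {
                const char *name;       /* selected by id/name from the ioctl */
                const struct gmem_allocator_ops *ops;
                struct list_head node;
        };

        int gmem_register_allocator(struct gmem_allocator *alloc);

guest_memfd creation would then look the allocator up by name and route
folio allocation through ops->alloc_folio() instead of its current
default.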
- Frank
^ permalink raw reply [flat|nested] 28+ messages in thread* Re: [RFC PATCH 00/12] dax: Add DAX to guest memfd support for KVM
2026-04-24 17:13 ` Frank van der Linden
@ 2026-04-24 18:23 ` Dave Jiang
2026-04-24 20:01 ` Frank van der Linden
2026-05-06 20:23 ` Ackerley Tng
0 siblings, 2 replies; 28+ messages in thread
From: Dave Jiang @ 2026-04-24 18:23 UTC (permalink / raw)
To: Frank van der Linden
Cc: linux-cxl, nvdimm, djbw, iweiny, pasha.tatashin, mclapinski, rppt,
joao.m.martins, jic23, gourry, john, rick.p.edgecombe
On 4/24/26 10:13 AM, Frank van der Linden wrote:
> Dave Jiang <dave.jiang@intel.com> wrote:
>> This RFC series is created as a proof of concept to connect device DAX to guest
>> memory by riding on top of guest memfd in order to prove out that device DAX
>> can be used as guest memory. The series seeks to jump-start a discussion on
>> whether there is interest in creating a DAX bridge to utilize CXL memory for guest
>> memory until the N_PRIVATE implementation by Gregory [1] is available upstream
>> and DAX users are ready to move to the new scheme. Once there's an established
>> consensus of interest, we can move the discussion to the best way to implement
>> the DAX bridge and the future of device DAX as guest memory.
>>
>> I did the bare minimum to get the PoC to pass a modified version of the KVM gmem
>> selftest (guest_memfd_test) in order to prove out that DAX can go through the gmem
>> path. A DAX char dev is created and the fd is passed in from user space with
>> vm_set_user_memory_region2(). The DAX region is passed in as a whole when used,
>> unlike memfd where any size can be passed in to be allocated.
>>
>> The folks on the cc line are people that Dan Williams mentioned may be
>> interested in this.
>>
>> [1]: https://lore.kernel.org/linux-cxl/aeWV1CvP9ImZ3eEG@gourry-fedora-PF4VCD3F/T/#t
>
> One of the main ideas behind guest_memfd is that the memory is managed
> by the kernel only, so it knows what it has and that it can trust
> the memory. This RFC passes an fd in via the ioctl(), which I think
> breaks that model.
Don't we issue KVM_CREATE_GUEST_MEMFD ioctl to get a fd in userspace to be passed to KVM_SET_USER_MEMORY_REGION2 ioctl later? We are just passing in a DAX fd instead of a guest mem fd.
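For reference, the flow is roughly this (a sketch; error handling omitted,
needs <linux/kvm.h>, and mem_size is a placeholder; the RFC substitutes a
DAX char-dev fd for gmem_fd in the memslot setup below):

        struct kvm_create_guest_memfd gmem = {
                .size = mem_size,
        };
        int gmem_fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);

        struct kvm_userspace_memory_region2 region = {
                .slot = 0,
                .flags = KVM_MEM_GUEST_MEMFD,
                .guest_phys_addr = 0,
                .memory_size = mem_size,
                .guest_memfd = gmem_fd, /* the RFC passes an open /dev/daxX.Y fd here */
        };
        ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION2, &region);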
>
> Since there is interest in several different allocation backends
> (default, hugetlb, zone_device), it might be better to use a model
> where guest_memfd has the option for backend allocators to register
> themselves in the kernel. The ioctl can then select one by its
> id/name (could be just a string). They can be configured using
> e.g. sysfs (like hugetlb already is).
>
> This would also allow easy experimentation with new allocators,
> having an allocator with BPF control, etc.
Agreed. Although my main intent is to see if there's interest in providing something that eases the transition for the usages already on the DAX path until something like what's proposed above shows up. But if what I proposed turns out to be a security issue, then maybe not.
>
> - Frank
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC PATCH 00/12] dax: Add DAX to guest memfd support for KVM
2026-04-24 18:23 ` Dave Jiang
@ 2026-04-24 20:01 ` Frank van der Linden
2026-04-24 20:59 ` Dave Jiang
2026-05-06 20:23 ` Ackerley Tng
1 sibling, 1 reply; 28+ messages in thread
From: Frank van der Linden @ 2026-04-24 20:01 UTC (permalink / raw)
To: Dave Jiang
Cc: linux-cxl, nvdimm, djbw, iweiny, pasha.tatashin, mclapinski, rppt,
joao.m.martins, jic23, gourry, john, rick.p.edgecombe
On Fri, Apr 24, 2026 at 11:23 AM Dave Jiang <dave.jiang@intel.com> wrote:
>
>
>
> On 4/24/26 10:13 AM, Frank van der Linden wrote:
> > Dave Jiang <dave.jiang@intel.com> wrote:
> >> This RFC series is created as a proof of concept to connect device DAX to guest
> >> memory by riding on top of guest memfd in order to prove out that device DAX
> >> can be used as guest memory. The series seeks to jump-start a discussion on
> >> whether there is interest in creating a DAX bridge to utilize CXL memory for guest
> >> memory until the N_PRIVATE implementation by Gregory [1] is available upstream
> >> and DAX users are ready to move to the new scheme. Once there's an established
> >> consensus of interest, we can move the discussion to the best way to implement
> >> the DAX bridge and the future of device DAX as guest memory.
> >>
> >> I did the bare minimum to get the PoC to pass a modified version of the KVM gmem
> >> selftest (guest_memfd_test) in order to prove out that DAX can go through the gmem
> >> path. A DAX char dev is created and the fd is passed in from user space with
> >> vm_set_user_memory_region2(). The DAX region is passed in as a whole when used,
> >> unlike memfd where any size can be passed in to be allocated.
> >>
> >> The folks on the cc line are people that Dan Williams mentioned may be
> >> interested in this.
> >>
> >> [1]: https://lore.kernel.org/linux-cxl/aeWV1CvP9ImZ3eEG@gourry-fedora-PF4VCD3F/T/#t
> >
> > One of the main ideas behind guest_memfd is that the memory is managed
> > by the kernel only, so it knows what it has and that it can trust
> > the memory. This RFC passes an fd in via the ioctl(), which I think
> > breaks that model.
>
> Don't we issue KVM_CREATE_GUEST_MEMFD ioctl to get a fd in userspace to be passed to KVM_SET_USER_MEMORY_REGION2 ioctl later? We are just passing in a DAX fd instead of a guest mem fd.
Sorry, yes, I should have said "it passes in a *non-guest_memfd* file
descriptor" via the ioctl. I think the intent of the guest_memfd code
is that it can only bind to a guest_memfd file descriptor (hence the
check in kvm_gmem_bind()); otherwise its trust model would break. Of
course, I'm not a guest_memfd expert; the maintainers can give you the
definitive answer on this.
>
> >
> > Since there is interest in several different allocation backends
> > (default, hugetlb, zone_device), it might be better to use a model
> > where guest_memfd has the option for backend allocators to register
> > themselves in the kernel. The ioctl can then select one by its
> > id/name (could be just a string). They can be configured using
> > e.g. sysfs (like hugetlb already is).
> >
> > This would also allow easy experimentation with new allocators,
> > having an allocator with BPF control, etc.
>
> Agreed. Although my main intent is to see if there's interest with providing something to the usages already on the DAX path an ease of transition until something like what's proposed above shows up. But if what I proposed will be a security issue then maybe not.
Sure, yes, understood that you're looking for a transitional solution.
- Frank
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC PATCH 00/12] dax: Add DAX to guest memfd support for KVM
2026-04-24 20:01 ` Frank van der Linden
@ 2026-04-24 20:59 ` Dave Jiang
0 siblings, 0 replies; 28+ messages in thread
From: Dave Jiang @ 2026-04-24 20:59 UTC (permalink / raw)
To: Frank van der Linden
Cc: linux-cxl, nvdimm, djbw, iweiny, pasha.tatashin, mclapinski, rppt,
joao.m.martins, jic23, gourry, john, rick.p.edgecombe
On 4/24/26 1:01 PM, Frank van der Linden wrote:
> On Fri, Apr 24, 2026 at 11:23 AM Dave Jiang <dave.jiang@intel.com> wrote:
>>
>>
>>
>> On 4/24/26 10:13 AM, Frank van der Linden wrote:
>>> Dave Jiang <dave.jiang@intel.com> wrote:
>>>> This RFC series is created as a proof of concept to connect device DAX to guest
>>>> memory by riding on top of guest memfd in order to prove out that device DAX
>>>> can be used as guest memory. The series seeks to jump-start a discussion on
>>>> whether there is interest in creating a DAX bridge to utilize CXL memory for guest
>>>> memory until the N_PRIVATE implementation by Gregory [1] is available upstream
>>>> and DAX users are ready to move to the new scheme. Once there's an established
>>>> consensus of interest, we can move the discussion to the best way to implement
>>>> the DAX bridge and the future of device DAX as guest memory.
>>>>
>>>> I did the bare minimum to get the PoC to pass a modified version of the KVM gmem
>>>> selftest (guest_memfd_test) in order to prove out that DAX can go through the gmem
>>>> path. A DAX char dev is created and the fd is passed in from user space with
>>>> vm_set_user_memory_region2(). The DAX region is passed in as a whole when used,
>>>> unlike memfd where any size can be passed in to be allocated.
>>>>
>>>> The folks on the cc line are people that Dan Williams mentioned may be
>>>> interested in this.
>>>>
>>>> [1]: https://lore.kernel.org/linux-cxl/aeWV1CvP9ImZ3eEG@gourry-fedora-PF4VCD3F/T/#t
>>>
>>> One of the main ideas behind guest_memfd is that the memory is managed
>>> by the kernel only, so it knows what it has and that it can trust
>>> the memory. This RFC passes an fd in via the ioctl(), which I think
>>> breaks that model.
>>
>> Don't we issue KVM_CREATE_GUEST_MEMFD ioctl to get a fd in userspace to be passed to KVM_SET_USER_MEMORY_REGION2 ioctl later? We are just passing in a DAX fd instead of a guest mem fd.
>
> Sorry, yes, I should have said "it passes in a *non-guest_memfd* file
> descriptor" via the ioctl. I think the intent of the guest_memfd code
> is that it can only bind to a guest_memfd file descriptor (hence the
> check in kvm_gmem_bind()); otherwise its trust model would break. Of
> course, I'm not a guest_memfd expert; the maintainers can give you the
> definitive answer on this.
I basically hacked it on top of the memfd ioctls because it was the quickest way to see if it would work. I think IF we are going to implement something, it would have its own KVM ioctls and not ride on top of guest_memfd and muddy things up. The better question is: if we create such a daxfd interface, is it feasible, or are there major security concerns?
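Purely as a strawman for what a dedicated interface might look like (every
name below is invented; nothing like this exists):

        /* Hypothetical UAPI: bind a DAX device fd as guest memory */
        struct kvm_dax_memory_region {
                __u32 slot;
                __u32 daxfd;            /* fd from open("/dev/daxX.Y") */
                __u64 guest_phys_addr;
                __u64 memory_size;      /* whole region, per device DAX */
                __u64 daxfd_offset;
        };

with a corresponding KVM_SET_DAX_MEMORY_REGION vm ioctl, rather than
overloading the guest_memfd binding in KVM_SET_USER_MEMORY_REGION2.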
DJ
>
>>
>>>
>>> Since there is interest in several different allocation backends
>>> (default, hugetlb, zone_device), it might be better to use a model
>>> where guest_memfd has the option for backend allocators to register
>>> themselves in the kernel. The ioctl can then select one by its
>>> id/name (could be just a string). They can be configured using
>>> e.g. sysfs (like hugetlb already is).
>>>
>>> This would also allow easy experimentation with new allocators,
>>> having an allocator with BPF control, etc.
>>
>> Agreed. Although my main intent is to see if there's interest in providing something that eases the transition for the usages already on the DAX path until something like what's proposed above shows up. But if what I proposed turns out to be a security issue, then maybe not.
>
> Sure, yes, understood that you're looking for a transitional solution.
>
> - Frank
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC PATCH 00/12] dax: Add DAX to guest memfd support for KVM
2026-04-24 18:23 ` Dave Jiang
2026-04-24 20:01 ` Frank van der Linden
@ 2026-05-06 20:23 ` Ackerley Tng
2026-05-06 20:37 ` Dave Jiang
2026-05-08 1:09 ` Ira Weiny
1 sibling, 2 replies; 28+ messages in thread
From: Ackerley Tng @ 2026-05-06 20:23 UTC (permalink / raw)
To: Dave Jiang
Cc: fvdl, linux-cxl, nvdimm, djbw, iweiny, pasha.tatashin, mclapinski,
rppt, joao.m.martins, jic23, gourry, john, rick.p.edgecombe
Dave Jiang <dave.jiang@intel.com> writes:
> On 4/24/26 10:13 AM, Frank van der Linden wrote:
>> Dave Jiang <dave.jiang@intel.com> wrote:
>>> This RFC series is created as a proof of concept to connect device DAX to guest
>>> memory by riding on top of guest memfd in order to prove out that device DAX
>>> can be used as guest memory. The series seeks to jump-start a discussion on
>>> whether there is interest in creating a DAX bridge to utilize CXL memory for guest
>>> memory until the N_PRIVATE implementation by Gregory [1] is available upstream
>>> and DAX users are ready to move to the new scheme. Once there's an established
>>> consensus of interest, we can move the discussion to the best way to implement
>>> the DAX bridge and the future of device DAX as guest memory.
>>>
>>> I did the bare minimum to get the PoC to pass a modified version of the KVM gmem
>>> selftest (guest_memfd_test) in order to prove out that DAX can go through the gmem
>>> path. A DAX char dev is created and the fd is passed in from user space with
>>> vm_set_user_memory_region2(). The DAX region is passed in as a whole when used,
>>> unlike memfd where any size can be passed in to be allocated.
>>>
>>> The folks on the cc line are people that Dan Williams mentioned may be
>>> interested in this.
>>>
Thanks for the PoC! I've been working on guest_memfd HugeTLB and I'm
glad there is interest in other "backends" for guest_memfd :)
>>> [1]: https://lore.kernel.org/linux-cxl/aeWV1CvP9ImZ3eEG@gourry-fedora-PF4VCD3F/T/#t
>>
>> One of the main ideas behind guest_memfd is that the memory is managed
>> by the kernel only, so it knows what it has and that it can trust
>> the memory. This RFC passes an fd in via the ioctl(), which I think
>> breaks that model.
Yup! One of guest_memfd's core purposes is to be able to block host
accesses to guest private (in the CoCo sense) memory.
>
> Don't we issue KVM_CREATE_GUEST_MEMFD ioctl to get a fd in userspace to be passed to KVM_SET_USER_MEMORY_REGION2 ioctl later? We are just passing in a DAX fd instead of a guest mem fd.
>
This RFC is passing a DAX fd instead of a guest_memfd when creating a
memslot, so it's not really using guest_memfd; it's just reusing the
functions that were first created for guest_memfd to support another
kind of fd.
What's the use case you're shooting for? Why not mmap() from the DAX
fd and then pass the userspace address to KVM when setting up a memslot?
Is there a requirement to have the DAX memory usable by CoCo guests as
well, and hence requiring guest_memfd-style protection from host
accesses for private DAX memory?
>>
>> Since there is interest in several different allocation backends
>> (default, hugetlb, zone_device), it might be better to use a model
>> where guest_memfd has the option for backend allocators to register
>> themselves in the kernel. The ioctl can then select one by its
>> id/name (could be just a string). They can be configured using
>> e.g. sysfs (like hugetlb already is).
>>
>> This would also allow easy experimentation with new allocators,
>> having an allocator with BPF control, etc.
>
> Agreed. Although my main intent is to see if there's interest with providing something to the usages already on the DAX path an ease of transition until something like what's proposed above shows up. But if what I proposed will be a security issue then maybe not.
>
>>
>> - Frank
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC PATCH 00/12] dax: Add DAX to guest memfd support for KVM
2026-05-06 20:23 ` Ackerley Tng
@ 2026-05-06 20:37 ` Dave Jiang
2026-05-08 1:09 ` Ira Weiny
1 sibling, 0 replies; 28+ messages in thread
From: Dave Jiang @ 2026-05-06 20:37 UTC (permalink / raw)
To: Ackerley Tng
Cc: fvdl, linux-cxl, nvdimm, djbw, iweiny, pasha.tatashin, mclapinski,
rppt, joao.m.martins, jic23, gourry, john, rick.p.edgecombe
On 5/6/26 1:23 PM, Ackerley Tng wrote:
> Dave Jiang <dave.jiang@intel.com> writes:
>
>> On 4/24/26 10:13 AM, Frank van der Linden wrote:
>>> Dave Jiang <dave.jiang@intel.com> wrote:
>>>> This RFC series is created as a proof of concept to connect device DAX to guest
>>>> memory by riding on top of guest memfd in order to prove out that device DAX
>>>> can be used as guest memory. The series seeks to jump-start a discussion on
>>>> whether there is interest in creating a DAX bridge to utilize CXL memory for guest
>>>> memory until the N_PRIVATE implementation by Gregory [1] is available upstream
>>>> and DAX users are ready to move to the new scheme. Once there's an established
>>>> consensus of interest, we can move the discussion to the best way to implement
>>>> the DAX bridge and the future of device DAX as guest memory.
>>>>
>>>> I did the bare minimum to get the PoC to pass a modified version of the KVM gmem
>>>> selftest (guest_memfd_test) in order to prove out that DAX can go through the gmem
>>>> path. A DAX char dev is created and the fd is passed in from user space with
>>>> vm_set_user_memory_region2(). The DAX region is passed in as a whole when used,
>>>> unlike memfd where any size can be passed in to be allocated.
>>>>
>>>> The folks on the cc line are people that Dan Williams mentioned may be
>>>> interested in this.
>>>>
>
> Thanks for the PoC! I've been working on guest_memfd HugeTLB and I'm
> glad there is interest in other "backends" for guest_memfd :)
>
>>>> [1]: https://lore.kernel.org/linux-cxl/aeWV1CvP9ImZ3eEG@gourry-fedora-PF4VCD3F/T/#t
>>>
>>> One of the main ideas behind guest_memfd is that the memory is managed
>>> by the kernel only, so it knows what it has and that it can trust
>>> the memory. This RFC passes an fd in via the ioctl(), which I think
>>> breaks that model.
>
> Yup! One of guest_memfd's core purposes is to be able to block host
> accesses to guest private (in the CoCo sense) memory.
>
>>
>> Don't we issue KVM_CREATE_GUEST_MEMFD ioctl to get a fd in userspace to be passed to KVM_SET_USER_MEMORY_REGION2 ioctl later? We are just passing in a DAX fd instead of a guest mem fd.
>>
>
> This RFC is passing a DAX fd instead of a guest_memfd when creating a
> memslot, so it's not really using guest_memfd; it's just reusing the
> functions that were first created for guest_memfd to support another
> kind of fd.
>
Right. It was the fastest way to see if something would work. It isn't meant to be the final design.
> What's the use case you're shooting for? Why not mmap() from the DAX
> fd and then pass the userspace address to KVM when setting up a memslot?
The main use case is to see whether the people currently using DAX via mmap() would utilize this as a bridge for other usages, vs something like the private node implementation Gregory is working on, which does things in a totally different way. So yes, what you suggested could be another way to do it. Mainly I want to see if there's even any interest at all. And if so, then we can talk about how we want it to be done; I'm wide open on that.
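For comparison, the mmap() route you describe is the classic memslot setup
(a sketch; device-DAX alignment constraints on the mmap() glossed over):

        int dax_fd = open("/dev/dax0.0", O_RDWR);
        void *va = mmap(NULL, size, PROT_READ | PROT_WRITE,
                        MAP_SHARED, dax_fd, 0);

        struct kvm_userspace_memory_region region = {
                .slot = 0,
                .guest_phys_addr = 0,
                .memory_size = size,
                .userspace_addr = (unsigned long)va,
        };
        ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);

That path works today, but it provides none of the fd-based lifecycle or
(eventual) CoCo protection that the gmem-style binding is after.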
>
> Is there a requirement to have the DAX memory usable by CoCo guests as
> well, and hence requiring guest_memfd-style protection from host
> accesses for private DAX memory?
If we are to implement this, then I think so at some point, yes.
DJ
>
>>>
>>> Since there is interest in several different allocation backends
>>> (default, hugetlb, zone_device), it might be better to use a model
>>> where guest_memfd has the option for backend allocators to register
>>> themselves in the kernel. The ioctl can then select one by its
>>> id/name (could be just a string). They can be configured using
>>> e.g. sysfs (like hugetlb already is).
>>>
>>> This would also allow easy experimentation with new allocators,
>>> having an allocator with BPF control, etc.
>>
>> Agreed. Although my main intent is to see if there's interest in providing something that eases the transition for the usages already on the DAX path until something like what's proposed above shows up. But if what I proposed turns out to be a security issue, then maybe not.
>>
>>>
>>> - Frank
>
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC PATCH 00/12] dax: Add DAX to guest memfd support for KVM
2026-05-06 20:23 ` Ackerley Tng
2026-05-06 20:37 ` Dave Jiang
@ 2026-05-08 1:09 ` Ira Weiny
2026-05-10 14:40 ` Gregory Price
1 sibling, 1 reply; 28+ messages in thread
From: Ira Weiny @ 2026-05-08 1:09 UTC (permalink / raw)
To: Ackerley Tng, Dave Jiang
Cc: fvdl, linux-cxl, nvdimm, djbw, iweiny, pasha.tatashin, mclapinski,
rppt, joao.m.martins, jic23, gourry, john, rick.p.edgecombe
Ackerley Tng wrote:
> Dave Jiang <dave.jiang@intel.com> writes:
>
> > On 4/24/26 10:13 AM, Frank van der Linden wrote:
> >> Dave Jiang <dave.jiang@intel.com> wrote:
[snip]
>
> >>> [1]: https://lore.kernel.org/linux-cxl/aeWV1CvP9ImZ3eEG@gourry-fedora-PF4VCD3F/T/#t
> >>
> >> One of the main ideas behind guest_memfd is that the memory is managed
> >> by the kernel only, so it knows what it has and that it can trust
> >> the memory. This RFC passes an fd in via the ioctl(), which I think
> >> breaks that model.
>
> Yup! One of guest_memfd's core purposes is to be able to block host
> accesses to guest private (in the CoCo sense) memory.
>
> >
> > Don't we issue KVM_CREATE_GUEST_MEMFD ioctl to get a fd in userspace to be passed to KVM_SET_USER_MEMORY_REGION2 ioctl later? We are just passing in a DAX fd instead of a guest mem fd.
> >
>
> This RFC is passing a DAX fd instead of a guest_memfd when creating a
> memslot, so it's not really using guest_memfd; it's just reusing the
> functions that were first created for guest_memfd to support another
> kind of fd.
>
> What's the use case you're shooting for? Why not mmap() from the DAX
> fd and then pass the userspace address to KVM when setting up a memslot?
>
> Is there a requirement to have the DAX memory usable by CoCo guests as
> well, and hence requiring guest_memfd-style protection from host
> accesses for private DAX memory?
>
I was thinking this would be an eventual use case for DAX/CXL memory, yes.
There are a couple of issues with mmap()ing DAX.
1) DAX is getting a bit long in the tooth. It may be that users are fine
with it and it should stick around, but some worry that it has deviated
too far from the memfd/gmemfd style of management.
2) What you propose above does not give the gmem 'protection' for CoCo
guests. So yeah, that is the bigger issue.
Allowing gmem to use DAX/CXL as a backend within the kernel is where I
think this is headed. But having the gmem fd be allocated from that
backend would require more knobs in gmem. Also, I believe there may
be use cases where a _specific_ CXL device is desired. That case makes
the required knobs more complicated.
What Dave has done here conveys the device information via the dax fd. It
is kind of clunky, but it works...
Ira
[snip]
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC PATCH 00/12] dax: Add DAX to guest memfd support for KVM
2026-05-08 1:09 ` Ira Weiny
@ 2026-05-10 14:40 ` Gregory Price
0 siblings, 0 replies; 28+ messages in thread
From: Gregory Price @ 2026-05-10 14:40 UTC (permalink / raw)
To: Ira Weiny
Cc: Ackerley Tng, Dave Jiang, fvdl, linux-cxl, nvdimm, djbw, iweiny,
pasha.tatashin, mclapinski, rppt, joao.m.martins, jic23, john,
rick.p.edgecombe
On Thu, May 07, 2026 at 08:09:25PM -0500, Ira Weiny wrote:
>
> 2) What you propose above does not give the gmem 'protection' for CoCo
> guests. So yeah, that is the bigger issue.
>
Realistically, what you actually want is to add:
private_dax.c
+
MEMORY_DEVICE_CONFIDENTIAL
And just make sure they work together to produce:
a) open() works -> produces an FD
b) no direct mappings; struct pages exist and can be accessed by KVM
c) all userland operations fault (memory is never in direct map)
d) unbind explicitly zeroes or calls a registered sanitize() func
But this adds a new dax mode and a new ZONE_DEVICE mode.
A private node with NP_OPT_NOMAP might be cleaner, but you still have to
do the hotplug/memremap dance either way.
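i.e. something along these lines on the pgmap side (a sketch:
MEMORY_DEVICE_CONFIDENTIAL is the hypothetical new type, the rest is the
existing devm_memremap_pages() flow):

        struct dev_pagemap *pgmap;
        void *addr;

        pgmap = devm_kzalloc(dev, sizeof(*pgmap), GFP_KERNEL);
        pgmap->type = MEMORY_DEVICE_CONFIDENTIAL;       /* hypothetical */
        pgmap->range = (struct range) {
                .start = res->start,
                .end = res->end,
        };
        pgmap->nr_range = 1;

        /*
         * struct pages exist so KVM can use the memory, but the range
         * would never enter the direct map, and userland faults would
         * SIGBUS per b) and c) above.
         */
        addr = devm_memremap_pages(dev, pgmap);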
~Gregory
^ permalink raw reply [flat|nested] 28+ messages in thread