* [PATCH BUNDLE] famfs: Fabric-Attached Memory File System
@ 2026-01-07 15:32 John Groves
2026-01-07 15:33 ` [PATCH V3 00/21] famfs: port into fuse John Groves
` (2 more replies)
0 siblings, 3 replies; 74+ messages in thread
From: John Groves @ 2026-01-07 15:32 UTC
To: John Groves, Miklos Szeredi, Dan Williams, Bernd Schubert,
Alison Schofield
Cc: John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
Matthew Wilcox, Jan Kara, Alexander Viro, David Hildenbrand,
Christian Brauner, Darrick J . Wong, Randy Dunlap, Jeff Layton,
Amir Goldstein, Jonathan Cameron, Stefan Hajnoczi, Joanne Koong,
Josef Bacik, Bagas Sanjaya, Chen Linxuan, James Morse, Fuad Tabba,
Sean Christopherson, Shivank Garg, Ackerley Tng, Gregory Price,
Aravind Ramesh, Ajay Joshi, venkataravis, linux-doc, linux-kernel,
nvdimm, linux-cxl, linux-fsdevel
This is a coordinated patch submission for famfs (Fabric-Attached Memory
File System) across three repositories:
1. Linux kernel (21 patches) - dax fsdev driver + fuse/famfs integration
2. libfuse (4 patches) - famfs protocol support for fuse servers
3. ndctl/daxctl (2 patches) - support for the new "famfs" devdax mode
Each series is posted as a reply to this cover message, with individual
patches replying to their respective series cover.
Overview
--------
Famfs exposes shared memory as a file system. It consumes shared memory
from dax devices and provides memory-mappable files that map directly to
the memory with no page cache involvement. Famfs differs from conventional
file systems in fs-dax mode in that it handles in-memory metadata in a
sharable way (which begins with never caching dirty shared metadata).
Famfs started as a standalone file system [1,2], but the consensus at
LSFMM 2024 and 2025 [3,4] was that it should be ported into fuse.
The key performance requirement is that famfs must resolve mapping faults
without upcalls. This is achieved by fully caching the file-to-devdax
metadata for all active files via two fuse client/server message/response
pairs: GET_FMAP and GET_DAXDEV.
Patch Series Summary
--------------------
Linux Kernel (V3, 21 patches):
- dax: New fsdev driver (drivers/dax/fsdev.c) providing a devdax mode
compatible with fs-dax. Devices can be switched among 'devdax', 'famfs'
and 'system-ram' modes via daxctl or sysfs.
- fuse: Famfs integration adding GET_FMAP and GET_DAXDEV messages for
caching file-to-dax mappings in the kernel.
libfuse (V2, 4 patches):
- Updates fuse_kernel.h to kernel 6.19 baseline
- Adds famfs DAX fmap protocol definitions
- Adds API for kernel mount options
- Implements famfs DAX fmap support for fuse servers
ndctl/daxctl (2 patches):
- Adds daxctl support for the new "famfs" mode of devdax
- Adds test/daxctl-famfs.sh for testing mode transitions
Changes Since V2 (kernel)
-------------------------
- Dax: Completely new fsdev driver replaces the dev_dax_iomap modifications.
Uses MEMORY_DEVICE_FS_DAX type with order-0 folios for fs-dax compatibility.
- Dax: The "poisoned page" problem is properly fixed via fsdev_clear_folio_state()
which clears stale mapping/compound state when fsdev binds.
- Dax: Added dax_set_ops() and driver unbind protection while filesystem mounted.
- Fuse: Famfs mounts require CAP_SYS_RAWIO (exposing raw memory devices).
- Fuse: Added DAX address_space_operations with noop_dirty_folio.
- Rebased to latest kernels, compatible with recent dax refactoring.
Testing
-------
The famfs user space [5] includes comprehensive smoke and unit tests that
exercise all three components together. The ndctl series includes a
dedicated test for famfs mode transitions.
References
----------
[1] https://lore.kernel.org/linux-cxl/cover.1708709155.git.john@groves.net/
[2] https://lore.kernel.org/linux-cxl/cover.1714409084.git.john@groves.net/
[3] https://lwn.net/Articles/983105/ (LSFMM 2024)
[4] https://lwn.net/Articles/1020170/ (LSFMM 2025)
[5] https://famfs.org (famfs user space)
[6] https://lore.kernel.org/linux-cxl/20250703185032.46568-1-john@groves.net/ (V2)
--
John Groves
* [PATCH V3 00/21] famfs: port into fuse
2026-01-07 15:32 [PATCH BUNDLE] famfs: Fabric-Attached Memory File System John Groves
@ 2026-01-07 15:33 ` John Groves
2026-01-07 15:33 ` [PATCH V3 01/21] dax: move dax_pgoff_to_phys from [drivers/dax/] device.c to bus.c John Groves
` (20 more replies)
2026-01-07 15:34 ` [PATCH V3 0/4] libfuse: add basic famfs support to libfuse John Groves
2026-01-07 15:34 ` [PATCH 0/2] ndctl: Add daxctl support for the new "famfs" mode of devdax John Groves
2 siblings, 21 replies; 74+ messages in thread
From: John Groves @ 2026-01-07 15:33 UTC
This patch series is available as a git tag at [0].
Description:
This patch series introduces famfs into the fuse file system framework.
This is really two patch series concatenated.
- The patches with the 'dax:' prefix introduce necessary dax
functionality.
- The patches with 'famfs_fuse:' introduce the famfs functionality into
fuse. The famfs_fuse patches depend on the dax patches.
In addition, there are related patch sets for libfuse and ndctl(daxctl).
Related patches and code
- Related patch to libfuse - posted under the same cover
- Related patch to ndctl/daxctl - posted under the same cover
- The famfs user space code can be found at [1]
Dax Overview:
This series introduces a new "famfs mode" of devdax, whose driver is
drivers/dax/fsdev.c. This driver supports dax_iomap_rw() and
dax_iomap_fault() calls against a character dax instance. A dax device
now can be converted among three modes: 'system-ram', 'devdax' and
'famfs' via daxctl or sysfs (e.g. unbind devdax and bind famfs instead).
In famfs mode, a dax device initializes its pages consistent with the
fsdax mode of pmem. Raw read/write/mmap are not supported in this mode,
but famfs is happy in this mode - using dax_iomap_rw() for read/write and
dax_iomap_fault() for mmap faults.
Fuse Overview:
Famfs started as a standalone file system, but this series is intended to
permanently supersede that implementation. At a high level, famfs adds
two new fuse server messages:
GET_FMAP - Retrieves a famfs fmap (the file-to-dax map for a famfs
file)
GET_DAXDEV - Retrieves the details of a particular daxdev that was
referenced by an fmap
Famfs Overview
Famfs exposes shared memory as a file system. Famfs consumes shared
memory from dax devices, and provides memory-mappable files that map
directly to the memory - no page cache involvement. Famfs differs from
conventional file systems in fs-dax mode, in that it handles in-memory
metadata in a sharable way (which begins with never caching dirty shared
metadata).
Famfs started as a standalone file system [2,3], but the consensus at
LSFMM was that it should be ported into fuse [4,5].
The key performance requirement is that famfs must resolve mapping faults
without upcalls. This is achieved by fully caching the file-to-devdax
metadata for all active files. This is done via two fuse client/server
message/response pairs: GET_FMAP and GET_DAXDEV.
Famfs remains the first fs-dax file system that is backed by devdax
rather than pmem in fs-dax mode (hence the need for the new dax mode).
Notes
- When a file is opened in a famfs mount, the OPEN is followed by a
GET_FMAP message and response. The "fmap" is the full file-to-dax
mapping, allowing the fuse/famfs kernel code to handle
read/write/fault without any upcalls.
- After each GET_FMAP, the fmap is checked for extents that reference
previously-unknown daxdevs. Each such occurrence is handled with a
GET_DAXDEV message and response.
- Daxdevs are stored in a table (which might become an xarray at some
point). When entries are added to the table, we acquire exclusive
access to the daxdev via the fs_dax_get() call (modeled after how
fs-dax handles this with pmem devices). Famfs provides
holder_operations to devdax, providing a notification path in the
event of memory errors or forced reconfiguration.
- If devdax notifies famfs of memory errors on a dax device, famfs
currently blocks all subsequent accesses to data on that device. The
recovery is to re-initialize the memory and file system. Famfs is
memory, not storage...
- Because famfs uses backing (devdax) devices, only privileged mounts are
supported (i.e. the fuse server requires CAP_SYS_RAWIO).
- The famfs kernel code never accesses the memory directly - it only
facilitates read, write and mmap on behalf of user processes, using
fmap metadata provided by its privileged fuse server. As such, the
RAS of the shared memory affects applications, but not the kernel.
- Famfs has backing device(s), but they are devdax (char) rather than
block. Right now there is no way to tell the vfs layer that famfs has a
char backing device (unless we say it's block, but it's not). Currently
we use the standard anonymous fuse fs_type - but I'm not sure that's
ultimately optimal (thoughts?)
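To illustrate the first two notes, here is a minimal user-space sketch of
how a cached fmap can resolve a file offset without an upcall, and how
newly referenced daxdevs might be detected after a GET_FMAP response. All
names (sketch_extent, sketch_fmap, sketch_fmap_resolve,
sketch_fmap_new_daxdevs) are hypothetical; they are not the definitions in
fs/fuse/famfs_kfmap.h, and the sketch assumes a simple list of extents laid
out in file order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins; not the real famfs structures. */
struct sketch_extent {
	uint32_t dev_index;   /* index into a daxdev table */
	uint64_t dev_offset;  /* byte offset of the extent on that daxdev */
	uint64_t len;         /* extent length in bytes */
};

struct sketch_fmap {
	size_t nextents;
	const struct sketch_extent *extents;  /* assumed to be in file order */
};

/*
 * Translate a file offset into (daxdev index, device offset) using only
 * the cached extents. Returns 0 on success, or -1 if the offset lies past
 * the end of the mapped extents.
 */
int sketch_fmap_resolve(const struct sketch_fmap *fmap, uint64_t file_off,
			uint32_t *dev_index, uint64_t *dev_off)
{
	uint64_t ext_start = 0;  /* file offset where the current extent begins */
	size_t i;

	assert(fmap && dev_index && dev_off);

	for (i = 0; i < fmap->nextents; i++) {
		const struct sketch_extent *ext = &fmap->extents[i];

		if (file_off - ext_start < ext->len) {
			*dev_index = ext->dev_index;
			*dev_off = ext->dev_offset + (file_off - ext_start);
			return 0;
		}
		ext_start += ext->len;
	}
	return -1;
}

/*
 * Scan an fmap for daxdev indices not yet marked in known[]. Each new index
 * is marked known and counted; in the real protocol each one would cost a
 * GET_DAXDEV round trip. Out-of-range indices are skipped here; a real
 * implementation would presumably reject such an fmap.
 */
size_t sketch_fmap_new_daxdevs(const struct sketch_fmap *fmap,
			       bool *known, size_t table_size)
{
	size_t i, needed = 0;

	assert(fmap && known);

	for (i = 0; i < fmap->nextents; i++) {
		uint32_t idx = fmap->extents[i].dev_index;

		if (idx >= table_size || known[idx])
			continue;
		known[idx] = true;
		needed++;
	}
	return needed;
}
```

With two extents of 0x1000 and 0x2000 bytes on daxdevs 0 and 1, file offset
0x1800 falls 0x800 bytes into the second extent, so it resolves to daxdev 1
at that extent's device offset plus 0x800. No message to the fuse server is
needed once the fmap is cached.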
Changes v2 [7] -> v3
- Dax: Completely new fsdev driver (drivers/dax/fsdev.c) replaces the
dev_dax_iomap modifications to bus.c/device.c. Devdax devices can now
be switched among 'devdax', 'famfs' and 'system-ram' modes via daxctl
or sysfs.
- Dax: fsdev uses MEMORY_DEVICE_FS_DAX type and leaves folios at order-0
(no vmemmap_shift), allowing fs-dax to manage folio lifecycles
dynamically like pmem does.
- Dax: The "poisoned page" problem is properly fixed via
fsdev_clear_folio_state(), which clears stale mapping/compound state
when fsdev binds. The temporary WARN_ON_ONCE workaround in fs/dax.c
has been removed.
- Dax: Added dax_set_ops() so fsdev can set dax_operations at bind time
(and clear them on unbind), since the dax_device is created before we
know which driver will bind.
- Dax: Added custom bind/unbind sysfs handlers; unbind returns -EBUSY if a
filesystem holds the device, preventing unbind while famfs is mounted.
- Fuse: Famfs mounts now require that the fuse server/daemon has
CAP_SYS_RAWIO because they expose raw memory devices.
- Fuse: Added DAX address_space_operations with noop_dirty_folio since
famfs is memory-backed with no writeback required.
- Rebased to latest kernels, fully compatible with Alistair Popple
et al.'s recent dax refactoring.
- Ran this series through Chris Mason's code review AI prompts to check
for issues - several subtle problems found and fixed.
- Dropped RFC status - this version is intended to be mergeable.
Changes v1 [8] -> v2:
- The GET_FMAP message/response has been moved from LOOKUP to OPEN, as
was the pretty much unanimous consensus.
- Made the response payload to GET_FMAP variable sized (patch 12)
- Dodgy kerneldoc comments cleaned up or removed.
- Fixed memory leak of fc->shadow in patch 11 (thanks Joanne)
- Dropped many pr_debug and pr_notice calls
References
[0] - https://github.com/jagalactic/linux/tree/famfs-v3 (this patch set)
[1] - https://famfs.org (famfs user space)
[2] - https://lore.kernel.org/linux-cxl/cover.1708709155.git.john@groves.net/
[3] - https://lore.kernel.org/linux-cxl/cover.1714409084.git.john@groves.net/
[4] - https://lwn.net/Articles/983105/ (lsfmm 2024)
[5] - https://lwn.net/Articles/1020170/ (lsfmm 2025)
[6] - https://lore.kernel.org/linux-cxl/cover.8068ad144a7eea4a813670301f4d2a86a8e68ec4.1740713401.git-series.apopple@nvidia.com/
[7] - https://lore.kernel.org/linux-fsdevel/20250703185032.46568-1-john@groves.net/ (famfs fuse v2)
[8] - https://lore.kernel.org/linux-fsdevel/20250421013346.32530-1-john@groves.net/ (famfs fuse v1)
John Groves (21):
dax: move dax_pgoff_to_phys from [drivers/dax/] device.c to bus.c
dax: add fsdev.c driver for fs-dax on character dax
dax: Save the kva from memremap
dax: Add dax_operations for use by fs-dax on fsdev dax
dax: Add dax_set_ops() for setting dax_operations at bind time
dax: Add fs_dax_get() func to prepare dax for fs-dax usage
dax: prevent driver unbind while filesystem holds device
dax: export dax_dev_get()
famfs_fuse: magic.h: Add famfs magic numbers
famfs_fuse: Kconfig
famfs_fuse: Update macro s/FUSE_IS_DAX/FUSE_IS_VIRTIO_DAX/
famfs_fuse: Basic fuse kernel ABI enablement for famfs
famfs_fuse: Famfs mount opt: -o shadow=<shadowpath>
famfs_fuse: Plumb the GET_FMAP message/response
famfs_fuse: Create files with famfs fmaps
famfs_fuse: GET_DAXDEV message and daxdev_table
famfs_fuse: Plumb dax iomap and fuse read/write/mmap
famfs_fuse: Add holder_operations for dax notify_failure()
famfs_fuse: Add DAX address_space_operations with noop_dirty_folio
famfs_fuse: Add famfs fmap metadata documentation
famfs_fuse: Add documentation
Documentation/filesystems/famfs.rst | 142 ++++
Documentation/filesystems/index.rst | 1 +
MAINTAINERS | 18 +
drivers/dax/Kconfig | 17 +
drivers/dax/Makefile | 2 +
drivers/dax/bus.c | 86 +-
drivers/dax/bus.h | 3 +
drivers/dax/dax-private.h | 5 +
drivers/dax/device.c | 23 -
drivers/dax/fsdev.c | 369 ++++++++
drivers/dax/super.c | 95 ++-
fs/fuse/Kconfig | 14 +
fs/fuse/Makefile | 1 +
fs/fuse/dir.c | 2 +-
fs/fuse/famfs.c | 1221 +++++++++++++++++++++++++++
fs/fuse/famfs_kfmap.h | 167 ++++
fs/fuse/file.c | 45 +-
fs/fuse/fuse_i.h | 126 ++-
fs/fuse/inode.c | 59 +-
fs/fuse/iomode.c | 2 +-
fs/namei.c | 1 +
include/linux/dax.h | 7 +
include/uapi/linux/fuse.h | 88 ++
include/uapi/linux/magic.h | 2 +
24 files changed, 2454 insertions(+), 42 deletions(-)
create mode 100644 Documentation/filesystems/famfs.rst
create mode 100644 drivers/dax/fsdev.c
create mode 100644 fs/fuse/famfs.c
create mode 100644 fs/fuse/famfs_kfmap.h
base-commit: 9ace4753a5202b02191d54e9fdf7f9e3d02b85eb
--
2.49.0
* [PATCH V3 01/21] dax: move dax_pgoff_to_phys from [drivers/dax/] device.c to bus.c
2026-01-07 15:33 ` [PATCH V3 00/21] famfs: port into fuse John Groves
@ 2026-01-07 15:33 ` John Groves
2026-01-08 10:43 ` Jonathan Cameron
2026-01-07 15:33 ` [PATCH V3 02/21] dax: add fsdev.c driver for fs-dax on character dax John Groves
` (19 subsequent siblings)
20 siblings, 1 reply; 74+ messages in thread
From: John Groves @ 2026-01-07 15:33 UTC
This function will be used by both device.c and fsdev.c, but both are
loadable modules. Moving to bus.c puts it in core and makes it available
to both.
No code changes - just relocated.
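For readers unfamiliar with the helper, the following user-space sketch
models the same page-offset-to-physical-address walk with plain integers.
The names sketch_range and sketch_pgoff_to_phys are hypothetical, 4 KiB
pages are assumed, and UINT64_MAX stands in for the kernel's -1 return.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12  /* assumption: 4 KiB pages */

/* Hypothetical user-space stand-in for struct dev_dax_range. */
struct sketch_range {
	uint64_t pgoff;  /* first device page offset covered by this range */
	uint64_t start;  /* first physical byte address */
	uint64_t end;    /* last physical byte address (inclusive) */
};

/*
 * Return the physical address backing page offset pgoff, provided that
 * size bytes starting there fit inside a single range. Returns UINT64_MAX
 * when no range covers pgoff or the access would run past the range end.
 */
uint64_t sketch_pgoff_to_phys(const struct sketch_range *ranges, int nr_range,
			      uint64_t pgoff, uint64_t size)
{
	int i;

	assert(ranges && nr_range >= 0);

	for (i = 0; i < nr_range; i++) {
		const struct sketch_range *r = &ranges[i];
		uint64_t npages = (r->end - r->start + 1) >> SKETCH_PAGE_SHIFT;
		uint64_t pgoff_end = r->pgoff + npages - 1;
		uint64_t phys;

		if (pgoff < r->pgoff || pgoff > pgoff_end)
			continue;
		phys = ((pgoff - r->pgoff) << SKETCH_PAGE_SHIFT) + r->start;
		if (phys + size - 1 <= r->end)
			return phys;
		break;  /* covered by this range, but the access does not fit */
	}
	return UINT64_MAX;
}
```

As a worked example, take two 2 MiB ranges (512 pages each) starting at
physical 0x100000000 and 0x200000000. Page offset 513 is the second page of
the second range, so a 4 KiB request resolves to 0x200001000. An 8 KiB
request at page offset 511 fails because it would cross the end of the
first range.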
Signed-off-by: John Groves <john@groves.net>
---
drivers/dax/bus.c | 27 +++++++++++++++++++++++++++
drivers/dax/device.c | 23 -----------------------
2 files changed, 27 insertions(+), 23 deletions(-)
diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index fde29e0ad68b..a2f9a3cc30a5 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -7,6 +7,9 @@
#include <linux/slab.h>
#include <linux/dax.h>
#include <linux/io.h>
+#include <linux/backing-dev.h>
+#include <linux/range.h>
+#include <linux/uio.h>
#include "dax-private.h"
#include "bus.h"
@@ -1417,6 +1420,30 @@ static const struct device_type dev_dax_type = {
.groups = dax_attribute_groups,
};
+/* see "strong" declaration in tools/testing/nvdimm/dax-dev.c */
+__weak phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff,
+ unsigned long size)
+{
+ int i;
+
+ for (i = 0; i < dev_dax->nr_range; i++) {
+ struct dev_dax_range *dax_range = &dev_dax->ranges[i];
+ struct range *range = &dax_range->range;
+ unsigned long long pgoff_end;
+ phys_addr_t phys;
+
+ pgoff_end = dax_range->pgoff + PHYS_PFN(range_len(range)) - 1;
+ if (pgoff < dax_range->pgoff || pgoff > pgoff_end)
+ continue;
+ phys = PFN_PHYS(pgoff - dax_range->pgoff) + range->start;
+ if (phys + size - 1 <= range->end)
+ return phys;
+ break;
+ }
+ return -1;
+}
+EXPORT_SYMBOL_GPL(dax_pgoff_to_phys);
+
static struct dev_dax *__devm_create_dev_dax(struct dev_dax_data *data)
{
struct dax_region *dax_region = data->dax_region;
diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index 22999a402e02..132c1d03fd07 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -57,29 +57,6 @@ static int check_vma(struct dev_dax *dev_dax, struct vm_area_struct *vma,
vma->vm_file, func);
}
-/* see "strong" declaration in tools/testing/nvdimm/dax-dev.c */
-__weak phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff,
- unsigned long size)
-{
- int i;
-
- for (i = 0; i < dev_dax->nr_range; i++) {
- struct dev_dax_range *dax_range = &dev_dax->ranges[i];
- struct range *range = &dax_range->range;
- unsigned long long pgoff_end;
- phys_addr_t phys;
-
- pgoff_end = dax_range->pgoff + PHYS_PFN(range_len(range)) - 1;
- if (pgoff < dax_range->pgoff || pgoff > pgoff_end)
- continue;
- phys = PFN_PHYS(pgoff - dax_range->pgoff) + range->start;
- if (phys + size - 1 <= range->end)
- return phys;
- break;
- }
- return -1;
-}
-
static void dax_set_mapping(struct vm_fault *vmf, unsigned long pfn,
unsigned long fault_size)
{
--
2.49.0
* [PATCH V3 02/21] dax: add fsdev.c driver for fs-dax on character dax
2026-01-07 15:33 ` [PATCH V3 00/21] famfs: port into fuse John Groves
2026-01-07 15:33 ` [PATCH V3 01/21] dax: move dax_pgoff_to_phys from [drivers/dax/] device.c to bus.c John Groves
@ 2026-01-07 15:33 ` John Groves
2026-01-08 11:31 ` Jonathan Cameron
2026-01-07 15:33 ` [PATCH V3 03/21] dax: Save the kva from memremap John Groves
` (18 subsequent siblings)
20 siblings, 1 reply; 74+ messages in thread
From: John Groves @ 2026-01-07 15:33 UTC
The new fsdev driver provides pages/folios initialized compatibly with
fsdax - normal rather than devdax-style refcounting, and starting out
with order-0 folios.
When fsdev binds to a daxdev, it is usually (always?) switching from the
devdax mode (device.c), which pre-initializes compound folios according
to its alignment. Fsdev uses fsdev_clear_folio_state() to switch the
folios into a fsdax-compatible state.
A side effect of this is that raw mmap doesn't (can't?) work on an fsdev
dax instance. Accordingly, the fsdev driver does not provide raw mmap -
devices must be put in 'devdax' mode (drivers/dax/device.c) to get raw
mmap capability.
This commit contains just the framework, which remaps pages/folios compatibly
with fsdax.
Enabling dax changes:
* bus.h: add DAXDRV_FSDEV_TYPE driver type
* bus.c: allow DAXDRV_FSDEV_TYPE drivers to bind to daxdevs
* dax.h: prototype inode_dax(), which fsdev needs
Suggested-by: Dan Williams <dan.j.williams@intel.com>
Suggested-by: Gregory Price <gourry@gourry.net>
Signed-off-by: John Groves <john@groves.net>
---
MAINTAINERS | 8 ++
drivers/dax/Kconfig | 17 +++
drivers/dax/Makefile | 2 +
drivers/dax/bus.c | 4 +
drivers/dax/bus.h | 1 +
drivers/dax/fsdev.c | 276 +++++++++++++++++++++++++++++++++++++++++++
include/linux/dax.h | 4 +
7 files changed, 312 insertions(+)
create mode 100644 drivers/dax/fsdev.c
diff --git a/MAINTAINERS b/MAINTAINERS
index 765ad2daa218..90429cb06090 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -7184,6 +7184,14 @@ L: linux-cxl@vger.kernel.org
S: Supported
F: drivers/dax/
+DEVICE DIRECT ACCESS (DAX) [fsdev_dax]
+M: John Groves <jgroves@micron.com>
+M: John Groves <John@Groves.net>
+L: nvdimm@lists.linux.dev
+L: linux-cxl@vger.kernel.org
+S: Supported
+F: drivers/dax/fsdev.c
+
DEVICE FREQUENCY (DEVFREQ)
M: MyungJoo Ham <myungjoo.ham@samsung.com>
M: Kyungmin Park <kyungmin.park@samsung.com>
diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
index d656e4c0eb84..491325d914a8 100644
--- a/drivers/dax/Kconfig
+++ b/drivers/dax/Kconfig
@@ -78,4 +78,21 @@ config DEV_DAX_KMEM
Say N if unsure.
+config DEV_DAX_FS
+ tristate "FSDEV DAX: fs-dax compatible device driver"
+ depends on DEV_DAX
+ default DEV_DAX
+ help
+ Support a device-dax driver mode that is compatible with fs-dax
+ filesystems. Unlike the standard device-dax driver which
+ pre-initializes compound folios based on device alignment, this
+ driver leaves folios uninitialized (similar to pmem) allowing
+ fs-dax to manage folio lifecycles dynamically.
+
+ This driver uses MEMORY_DEVICE_FS_DAX type and does not set
+ vmemmap_shift, making it compatible with filesystems like famfs
+ that use the iomap-based fs-dax infrastructure.
+
+ Say M if you plan to use fs-dax filesystems on /dev/dax devices.
+ Say N if you only need raw character device access to DAX memory.
endif
diff --git a/drivers/dax/Makefile b/drivers/dax/Makefile
index 5ed5c39857c8..77aa3df3285c 100644
--- a/drivers/dax/Makefile
+++ b/drivers/dax/Makefile
@@ -4,11 +4,13 @@ obj-$(CONFIG_DEV_DAX) += device_dax.o
obj-$(CONFIG_DEV_DAX_KMEM) += kmem.o
obj-$(CONFIG_DEV_DAX_PMEM) += dax_pmem.o
obj-$(CONFIG_DEV_DAX_CXL) += dax_cxl.o
+obj-$(CONFIG_DEV_DAX_FS) += fsdev_dax.o
dax-y := super.o
dax-y += bus.o
device_dax-y := device.o
dax_pmem-y := pmem.o
dax_cxl-y := cxl.o
+fsdev_dax-y := fsdev.o
obj-y += hmem/
diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index a2f9a3cc30a5..0d7228acb913 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -84,6 +84,10 @@ static int dax_match_type(const struct dax_device_driver *dax_drv, struct device
!IS_ENABLED(CONFIG_DEV_DAX_KMEM))
return 1;
+ /* fsdev driver can also bind to device-type dax devices */
+ if (dax_drv->type == DAXDRV_FSDEV_TYPE && type == DAXDRV_DEVICE_TYPE)
+ return 1;
+
return 0;
}
diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h
index cbbf64443098..880bdf7e72d7 100644
--- a/drivers/dax/bus.h
+++ b/drivers/dax/bus.h
@@ -31,6 +31,7 @@ struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data);
enum dax_driver_type {
DAXDRV_KMEM_TYPE,
DAXDRV_DEVICE_TYPE,
+ DAXDRV_FSDEV_TYPE,
};
struct dax_device_driver {
diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
new file mode 100644
index 000000000000..2a3249d1529c
--- /dev/null
+++ b/drivers/dax/fsdev.c
@@ -0,0 +1,276 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright(c) 2026 Micron Technology, Inc. */
+#include <linux/memremap.h>
+#include <linux/pagemap.h>
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/cdev.h>
+#include <linux/slab.h>
+#include <linux/dax.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include "dax-private.h"
+#include "bus.h"
+
+/*
+ * FS-DAX compatible devdax driver
+ *
+ * Unlike drivers/dax/device.c which pre-initializes compound folios based
+ * on device alignment (via vmemmap_shift), this driver leaves folios
+ * uninitialized similar to pmem. This allows fs-dax filesystems like famfs
+ * to work without needing special handling for pre-initialized folios.
+ *
+ * Key differences from device.c:
+ * - pgmap type is MEMORY_DEVICE_FS_DAX (not MEMORY_DEVICE_GENERIC)
+ * - vmemmap_shift is NOT set (folios remain order-0)
+ * - fs-dax can dynamically create compound folios as needed
+ * - No mmap support - all access is through fs-dax/iomap
+ */
+
+
+static void fsdev_cdev_del(void *cdev)
+{
+ cdev_del(cdev);
+}
+
+static void fsdev_kill(void *dev_dax)
+{
+ kill_dev_dax(dev_dax);
+}
+
+/*
+ * Page map operations for FS-DAX mode
+ * Similar to fsdax_pagemap_ops in drivers/nvdimm/pmem.c
+ *
+ * Note: folio_free callback is not needed for MEMORY_DEVICE_FS_DAX.
+ * The core mm code in free_zone_device_folio() handles the wake_up_var()
+ * directly for this memory type.
+ */
+static int fsdev_pagemap_memory_failure(struct dev_pagemap *pgmap,
+ unsigned long pfn, unsigned long nr_pages, int mf_flags)
+{
+ struct dev_dax *dev_dax = pgmap->owner;
+ u64 offset = PFN_PHYS(pfn) - dev_dax->ranges[0].range.start;
+ u64 len = nr_pages << PAGE_SHIFT;
+
+ return dax_holder_notify_failure(dev_dax->dax_dev, offset,
+ len, mf_flags);
+}
+
+static const struct dev_pagemap_ops fsdev_pagemap_ops = {
+ .memory_failure = fsdev_pagemap_memory_failure,
+};
+
+/*
+ * Clear any stale folio state from pages in the given range.
+ * This is necessary because device_dax pre-initializes compound folios
+ * based on vmemmap_shift, and that state may persist after driver unbind.
+ * Since fsdev_dax uses MEMORY_DEVICE_FS_DAX without vmemmap_shift, fs-dax
+ * expects to find clean order-0 folios that it can build into compound
+ * folios on demand.
+ *
+ * At probe time, no filesystem should be mounted yet, so all mappings
+ * are stale and must be cleared along with compound state.
+ */
+static void fsdev_clear_folio_state(struct dev_dax *dev_dax)
+{
+ int i;
+
+ for (i = 0; i < dev_dax->nr_range; i++) {
+ struct range *range = &dev_dax->ranges[i].range;
+ unsigned long pfn, end_pfn;
+
+ pfn = PHYS_PFN(range->start);
+ end_pfn = PHYS_PFN(range->end) + 1;
+
+ while (pfn < end_pfn) {
+ struct page *page = pfn_to_page(pfn);
+ struct folio *folio = (struct folio *)page;
+ struct dev_pagemap *pgmap = page_pgmap(page);
+ int order = folio_order(folio);
+
+ /*
+ * Clear any stale mapping pointer. At probe time,
+ * no filesystem is mounted, so any mapping is stale.
+ */
+ folio->mapping = NULL;
+ folio->share = 0;
+
+ if (order > 0) {
+ int j;
+
+ folio_reset_order(folio);
+ for (j = 0; j < (1UL << order); j++) {
+ struct page *p = page + j;
+
+ ClearPageHead(p);
+ clear_compound_head(p);
+ ((struct folio *)p)->mapping = NULL;
+ ((struct folio *)p)->share = 0;
+ ((struct folio *)p)->pgmap = pgmap;
+ }
+ pfn += (1UL << order);
+ } else {
+ folio->pgmap = pgmap;
+ pfn++;
+ }
+ }
+ }
+}
+
+static int fsdev_open(struct inode *inode, struct file *filp)
+{
+ struct dax_device *dax_dev = inode_dax(inode);
+ struct dev_dax *dev_dax = dax_get_private(dax_dev);
+
+ dev_dbg(&dev_dax->dev, "trace\n");
+ filp->private_data = dev_dax;
+
+ return 0;
+}
+
+static int fsdev_release(struct inode *inode, struct file *filp)
+{
+ struct dev_dax *dev_dax = filp->private_data;
+
+ dev_dbg(&dev_dax->dev, "trace\n");
+ return 0;
+}
+
+static const struct file_operations fsdev_fops = {
+ .llseek = noop_llseek,
+ .owner = THIS_MODULE,
+ .open = fsdev_open,
+ .release = fsdev_release,
+};
+
+static int fsdev_dax_probe(struct dev_dax *dev_dax)
+{
+ struct dax_device *dax_dev = dev_dax->dax_dev;
+ struct device *dev = &dev_dax->dev;
+ struct dev_pagemap *pgmap;
+ u64 data_offset = 0;
+ struct inode *inode;
+ struct cdev *cdev;
+ void *addr;
+ int rc, i;
+
+ if (static_dev_dax(dev_dax)) {
+ if (dev_dax->nr_range > 1) {
+ dev_warn(dev,
+ "static pgmap / multi-range device conflict\n");
+ return -EINVAL;
+ }
+
+ pgmap = dev_dax->pgmap;
+ } else {
+ if (dev_dax->pgmap) {
+ dev_warn(dev,
+ "dynamic-dax with pre-populated page map\n");
+ return -EINVAL;
+ }
+
+ pgmap = devm_kzalloc(dev,
+ struct_size(pgmap, ranges, dev_dax->nr_range - 1),
+ GFP_KERNEL);
+ if (!pgmap)
+ return -ENOMEM;
+
+ pgmap->nr_range = dev_dax->nr_range;
+ dev_dax->pgmap = pgmap;
+
+ for (i = 0; i < dev_dax->nr_range; i++) {
+ struct range *range = &dev_dax->ranges[i].range;
+
+ pgmap->ranges[i] = *range;
+ }
+ }
+
+ for (i = 0; i < dev_dax->nr_range; i++) {
+ struct range *range = &dev_dax->ranges[i].range;
+
+ if (!devm_request_mem_region(dev, range->start,
+ range_len(range), dev_name(dev))) {
+ dev_warn(dev, "mapping%d: %#llx-%#llx could not reserve range\n",
+ i, range->start, range->end);
+ return -EBUSY;
+ }
+ }
+
+ /*
+ * FS-DAX compatible mode: Use MEMORY_DEVICE_FS_DAX type and
+ * do NOT set vmemmap_shift. This leaves folios at order-0,
+ * allowing fs-dax to dynamically create compound folios as needed
+ * (similar to pmem behavior).
+ */
+ pgmap->type = MEMORY_DEVICE_FS_DAX;
+ pgmap->ops = &fsdev_pagemap_ops;
+ pgmap->owner = dev_dax;
+
+ /*
+ * CRITICAL DIFFERENCE from device.c:
+ * We do NOT set vmemmap_shift here, even if align > PAGE_SIZE.
+ * This ensures folios remain order-0 and are compatible with
+ * fs-dax's folio management.
+ */
+
+ addr = devm_memremap_pages(dev, pgmap);
+ if (IS_ERR(addr))
+ return PTR_ERR(addr);
+
+ /*
+ * Clear any stale compound folio state left over from a previous
+ * driver (e.g., device_dax with vmemmap_shift).
+ */
+ fsdev_clear_folio_state(dev_dax);
+
+ /* Detect whether the data is at a non-zero offset into the memory */
+ if (pgmap->range.start != dev_dax->ranges[0].range.start) {
+ u64 phys = dev_dax->ranges[0].range.start;
+ u64 pgmap_phys = dev_dax->pgmap[0].range.start;
+
+ if (!WARN_ON(pgmap_phys > phys))
+ data_offset = phys - pgmap_phys;
+
+ pr_debug("%s: offset detected phys=%llx pgmap_phys=%llx offset=%llx\n",
+ __func__, phys, pgmap_phys, data_offset);
+ }
+
+ inode = dax_inode(dax_dev);
+ cdev = inode->i_cdev;
+ cdev_init(cdev, &fsdev_fops);
+ cdev->owner = dev->driver->owner;
+ cdev_set_parent(cdev, &dev->kobj);
+ rc = cdev_add(cdev, dev->devt, 1);
+ if (rc)
+ return rc;
+
+ rc = devm_add_action_or_reset(dev, fsdev_cdev_del, cdev);
+ if (rc)
+ return rc;
+
+ run_dax(dax_dev);
+ return devm_add_action_or_reset(dev, fsdev_kill, dev_dax);
+}
+
+static struct dax_device_driver fsdev_dax_driver = {
+ .probe = fsdev_dax_probe,
+ .type = DAXDRV_FSDEV_TYPE,
+};
+
+static int __init dax_init(void)
+{
+ return dax_driver_register(&fsdev_dax_driver);
+}
+
+static void __exit dax_exit(void)
+{
+ dax_driver_unregister(&fsdev_dax_driver);
+}
+
+MODULE_AUTHOR("John Groves");
+MODULE_DESCRIPTION("FS-DAX Device: fs-dax compatible devdax driver");
+MODULE_LICENSE("GPL");
+module_init(dax_init);
+module_exit(dax_exit);
+MODULE_ALIAS_DAX_DEVICE(0);
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 9d624f4d9df6..74e098010016 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -51,6 +51,10 @@ struct dax_holder_operations {
#if IS_ENABLED(CONFIG_DAX)
struct dax_device *alloc_dax(void *private, const struct dax_operations *ops);
+
+#if IS_ENABLED(CONFIG_DEV_DAX_FS)
+struct dax_device *inode_dax(struct inode *inode);
+#endif
void *dax_holder(struct dax_device *dax_dev);
void put_dax(struct dax_device *dax_dev);
void kill_dax(struct dax_device *dax_dev);
--
2.49.0
* [PATCH V3 03/21] dax: Save the kva from memremap
2026-01-07 15:33 ` [PATCH V3 00/21] famfs: port into fuse John Groves
2026-01-07 15:33 ` [PATCH V3 01/21] dax: move dax_pgoff_to_phys from [drivers/dax/] device.c to bus.c John Groves
2026-01-07 15:33 ` [PATCH V3 02/21] dax: add fsdev.c driver for fs-dax on character dax John Groves
@ 2026-01-07 15:33 ` John Groves
2026-01-08 11:32 ` Jonathan Cameron
2026-01-07 15:33 ` [PATCH V3 04/21] dax: Add dax_operations for use by fs-dax on fsdev dax John Groves
` (17 subsequent siblings)
20 siblings, 1 reply; 74+ messages in thread
From: John Groves @ 2026-01-07 15:33 UTC
Save the kernel virtual address (kva) returned by memremap, because iomap
read/write support needs it. Prior to famfs, there were no iomap users of
/dev/dax, so the virtual address from memremap was not needed.
Also fill in missing kerneldoc comment fields for struct dev_dax.
Signed-off-by: John Groves <john@groves.net>
---
drivers/dax/dax-private.h | 4 ++++
drivers/dax/fsdev.c | 1 +
2 files changed, 5 insertions(+)
diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
index 0867115aeef2..1bb1631af485 100644
--- a/drivers/dax/dax-private.h
+++ b/drivers/dax/dax-private.h
@@ -69,18 +69,22 @@ struct dev_dax_range {
* data while the device is activated in the driver.
* @region - parent region
* @dax_dev - core dax functionality
+ * @virt_addr - kva from memremap; used by fsdev_dax
+ * @align - alignment of this instance
* @target_node: effective numa node if dev_dax memory range is onlined
* @dyn_id: is this a dynamic or statically created instance
* @id: ida allocated id when the dax_region is not static
* @ida: mapping id allocator
* @dev - device core
* @pgmap - pgmap for memmap setup / lifetime (driver owned)
+ * @memmap_on_memory - allow kmem to put the memmap in the memory
* @nr_range: size of @ranges
* @ranges: range tuples of memory used
*/
struct dev_dax {
struct dax_region *region;
struct dax_device *dax_dev;
+ void *virt_addr;
unsigned int align;
int target_node;
bool dyn_id;
diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
index 2a3249d1529c..c5c660b193e5 100644
--- a/drivers/dax/fsdev.c
+++ b/drivers/dax/fsdev.c
@@ -235,6 +235,7 @@ static int fsdev_dax_probe(struct dev_dax *dev_dax)
pr_debug("%s: offset detected phys=%llx pgmap_phys=%llx offset=%llx\n",
__func__, phys, pgmap_phys, data_offset);
}
+ dev_dax->virt_addr = addr + data_offset;
inode = dax_inode(dax_dev);
cdev = inode->i_cdev;
--
2.49.0
* [PATCH V3 04/21] dax: Add dax_operations for use by fs-dax on fsdev dax
2026-01-07 15:33 ` [PATCH V3 00/21] famfs: port into fuse John Groves
` (2 preceding siblings ...)
2026-01-07 15:33 ` [PATCH V3 03/21] dax: Save the kva from memremap John Groves
@ 2026-01-07 15:33 ` John Groves
2026-01-08 11:50 ` Jonathan Cameron
2026-01-07 15:33 ` [PATCH V3 05/21] dax: Add dax_set_ops() for setting dax_operations at bind time John Groves
` (16 subsequent siblings)
20 siblings, 1 reply; 74+ messages in thread
From: John Groves @ 2026-01-07 15:33 UTC (permalink / raw)
To: John Groves, Miklos Szeredi, Dan Williams, Bernd Schubert,
Alison Schofield
Cc: John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
Matthew Wilcox, Jan Kara, Alexander Viro, David Hildenbrand,
Christian Brauner, Darrick J . Wong, Randy Dunlap, Jeff Layton,
Amir Goldstein, Jonathan Cameron, Stefan Hajnoczi, Joanne Koong,
Josef Bacik, Bagas Sanjaya, Chen Linxuan, James Morse, Fuad Tabba,
Sean Christopherson, Shivank Garg, Ackerley Tng, Gregory Price,
Aravind Ramesh, Ajay Joshi, venkataravis, linux-doc, linux-kernel,
nvdimm, linux-cxl, linux-fsdevel, John Groves
From: John Groves <John@Groves.net>
* These methods are based on pmem_dax_ops from drivers/nvdimm/pmem.c
* fsdev_dax_direct_access() returns the hpa, pfn and kva. The kva was
newly stored as dev_dax->virt_addr by fsdev_dax_probe().
* The hpa/pfn are used for mmap (dax_iomap_fault()), and the kva is used
for read/write (dax_iomap_rw())
* fsdev_dax_recovery_write() and fsdev_dax_zero_page_range() have not been
tested yet. I'm looking for suggestions as to how to test those.
* dax-private.h: add dev_dax->cached_size, which fsdev needs to
remember. The dev_dax size cannot change while a driver is bound
(dev_dax_resize returns -EBUSY if dev->driver is set). Caching the size
at probe time allows fsdev's direct_access path to use it without
acquiring dax_dev_rwsem (which isn't exported anyway).
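For readers who want to check the size arithmetic outside the kernel, here
is a minimal userspace sketch of the two calculations described above:
summing the ranges into a cached size, and clamping the page count that
direct_access reports. All sketch_* names and the 4 KiB page size are
assumptions made for illustration; they are not kernel symbols. The
assert() on the offset is a sketch-only precondition and is not something
the patch itself checks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Hypothetical stand-in for struct range; 'end' is inclusive. */
struct sketch_range {
	uint64_t start;
	uint64_t end;
};

/* Sum the byte length of every range, as the probe-time loop does. */
static uint64_t sketch_cached_size(const struct sketch_range *r, size_t nr)
{
	uint64_t total = 0;

	for (size_t i = 0; i < nr; i++)
		total += r[i].end - r[i].start + 1;
	return total;
}

/*
 * Number of pages usable starting at pgoff, capped at nr_pages.
 * Mirrors: PHYS_PFN(min(size, cached_size - offset)).
 */
static long sketch_direct_access_pages(uint64_t cached_size,
				       uint64_t pgoff, long nr_pages)
{
	uint64_t size = (uint64_t)nr_pages << SKETCH_PAGE_SHIFT;
	uint64_t offset = pgoff << SKETCH_PAGE_SHIFT;
	uint64_t avail;

	/* Sketch-only guard: avoid unsigned underflow in the subtraction. */
	assert(offset <= cached_size);
	avail = cached_size - offset;

	return (long)((size < avail ? size : avail) >> SKETCH_PAGE_SHIFT);
}
```

With two 8-page ranges the cached size is 16 pages, so a 4-page request at
page offset 14 is clamped to the 2 pages that remain.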
Signed-off-by: John Groves <john@groves.net>
---
drivers/dax/dax-private.h | 1 +
drivers/dax/fsdev.c | 80 +++++++++++++++++++++++++++++++++++++++
2 files changed, 81 insertions(+)
diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
index 1bb1631af485..fbd8348cc71c 100644
--- a/drivers/dax/dax-private.h
+++ b/drivers/dax/dax-private.h
@@ -85,6 +85,7 @@ struct dev_dax {
struct dax_region *region;
struct dax_device *dax_dev;
void *virt_addr;
+ u64 cached_size;
unsigned int align;
int target_node;
bool dyn_id;
diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
index c5c660b193e5..9e2f83aa2584 100644
--- a/drivers/dax/fsdev.c
+++ b/drivers/dax/fsdev.c
@@ -27,6 +27,81 @@
* - No mmap support - all access is through fs-dax/iomap
*/
+static void fsdev_write_dax(void *pmem_addr, struct page *page,
+ unsigned int off, unsigned int len)
+{
+ while (len) {
+ void *mem = kmap_local_page(page);
+ unsigned int chunk = min_t(unsigned int, len, PAGE_SIZE - off);
+
+ memcpy_flushcache(pmem_addr, mem + off, chunk);
+ kunmap_local(mem);
+ len -= chunk;
+ off = 0;
+ page++;
+ pmem_addr += chunk;
+ }
+}
+
+static long __fsdev_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
+ long nr_pages, enum dax_access_mode mode, void **kaddr,
+ unsigned long *pfn)
+{
+ struct dev_dax *dev_dax = dax_get_private(dax_dev);
+ size_t size = nr_pages << PAGE_SHIFT;
+ size_t offset = pgoff << PAGE_SHIFT;
+ void *virt_addr = dev_dax->virt_addr + offset;
+ phys_addr_t phys;
+ unsigned long local_pfn;
+
+ WARN_ON(!dev_dax->virt_addr);
+
+ phys = dax_pgoff_to_phys(dev_dax, pgoff, nr_pages << PAGE_SHIFT);
+
+ if (kaddr)
+ *kaddr = virt_addr;
+
+ local_pfn = PHYS_PFN(phys);
+ if (pfn)
+ *pfn = local_pfn;
+
+ /*
+ * Use cached_size which was computed at probe time. The size cannot
+ * change while the driver is bound (resize returns -EBUSY).
+ */
+ return PHYS_PFN(min_t(size_t, size, dev_dax->cached_size - offset));
+}
+
+static int fsdev_dax_zero_page_range(struct dax_device *dax_dev,
+ pgoff_t pgoff, size_t nr_pages)
+{
+ void *kaddr;
+
+ WARN_ONCE(nr_pages > 1, "%s: nr_pages > 1\n", __func__);
+ __fsdev_dax_direct_access(dax_dev, pgoff, 1, DAX_ACCESS, &kaddr, NULL);
+ fsdev_write_dax(kaddr, ZERO_PAGE(0), 0, PAGE_SIZE);
+ return 0;
+}
+
+static long fsdev_dax_direct_access(struct dax_device *dax_dev,
+ pgoff_t pgoff, long nr_pages, enum dax_access_mode mode,
+ void **kaddr, unsigned long *pfn)
+{
+ return __fsdev_dax_direct_access(dax_dev, pgoff, nr_pages, mode,
+ kaddr, pfn);
+}
+
+static size_t fsdev_dax_recovery_write(struct dax_device *dax_dev, pgoff_t pgoff,
+ void *addr, size_t bytes, struct iov_iter *i)
+{
+ return _copy_from_iter_flushcache(addr, bytes, i);
+}
+
+static const struct dax_operations dev_dax_ops = {
+ .direct_access = fsdev_dax_direct_access,
+ .zero_page_range = fsdev_dax_zero_page_range,
+ .recovery_write = fsdev_dax_recovery_write,
+};
static void fsdev_cdev_del(void *cdev)
{
@@ -197,6 +272,11 @@ static int fsdev_dax_probe(struct dev_dax *dev_dax)
}
}
+ /* Cache size now; it cannot change while driver is bound */
+ dev_dax->cached_size = 0;
+ for (i = 0; i < dev_dax->nr_range; i++)
+ dev_dax->cached_size += range_len(&dev_dax->ranges[i].range);
+
/*
* FS-DAX compatible mode: Use MEMORY_DEVICE_FS_DAX type and
* do NOT set vmemmap_shift. This leaves folios at order-0,
--
2.49.0
* [PATCH V3 05/21] dax: Add dax_set_ops() for setting dax_operations at bind time
2026-01-07 15:33 ` [PATCH V3 00/21] famfs: port into fuse John Groves
` (3 preceding siblings ...)
2026-01-07 15:33 ` [PATCH V3 04/21] dax: Add dax_operations for use by fs-dax on fsdev dax John Groves
@ 2026-01-07 15:33 ` John Groves
2026-01-08 12:06 ` Jonathan Cameron
2026-01-07 15:33 ` [PATCH V3 06/21] dax: Add fs_dax_get() func to prepare dax for fs-dax usage John Groves
` (15 subsequent siblings)
20 siblings, 1 reply; 74+ messages in thread
From: John Groves @ 2026-01-07 15:33 UTC (permalink / raw)
To: John Groves, Miklos Szeredi, Dan Williams, Bernd Schubert,
Alison Schofield
Cc: John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
Matthew Wilcox, Jan Kara, Alexander Viro, David Hildenbrand,
Christian Brauner, Darrick J . Wong, Randy Dunlap, Jeff Layton,
Amir Goldstein, Jonathan Cameron, Stefan Hajnoczi, Joanne Koong,
Josef Bacik, Bagas Sanjaya, Chen Linxuan, James Morse, Fuad Tabba,
Sean Christopherson, Shivank Garg, Ackerley Tng, Gregory Price,
Aravind Ramesh, Ajay Joshi, venkataravis, linux-doc, linux-kernel,
nvdimm, linux-cxl, linux-fsdevel, John Groves
From: John Groves <John@Groves.net>
The dax_device is created (in the non-pmem case) at hmem probe time via
devm_create_dev_dax(), before we know which driver (device_dax,
fsdev_dax, or kmem) will bind. Because alloc_dax() is called with NULL
ops at that point, drivers that need specific dax_operations (e.g.
fsdev_dax) must set them later.
Add dax_set_ops() exported function so fsdev_dax can set its ops at
probe time and clear them on remove. device_dax doesn't need ops since
it uses the mmap fault path directly.
Use cmpxchg() to atomically set ops only if currently NULL, returning
-EBUSY if ops are already set. This prevents accidental double-binding.
Clearing ops (NULL) always succeeds.
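The set-once behavior described above can be tried in userspace. The
following is a rough sketch using C11 atomics in place of the kernel's
cmpxchg(); the sketch_* types and functions are hypothetical stand-ins and
are not part of this patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for struct dax_operations. */
struct sketch_ops {
	int unused;
};

/* Hypothetical stand-in for the ops field of struct dax_device. */
struct sketch_dax_device {
	_Atomic(const struct sketch_ops *) ops;
};

static void sketch_dev_init(struct sketch_dax_device *dev)
{
	atomic_init(&dev->ops, (const struct sketch_ops *)NULL);
}

/* Set ops only if none are installed; clearing (NULL) always succeeds. */
static int sketch_set_ops(struct sketch_dax_device *dev,
			  const struct sketch_ops *ops)
{
	const struct sketch_ops *expected = NULL;

	assert(dev != NULL);

	if (!ops) {
		atomic_store(&dev->ops, (const struct sketch_ops *)NULL);
		return 0;
	}

	/* Succeeds only when the current value is still NULL. */
	if (!atomic_compare_exchange_strong(&dev->ops, &expected, ops))
		return -EBUSY;

	return 0;
}
```

A second set attempt fails with -EBUSY until the first owner clears the
ops, which is the sequence a probe/remove pair is expected to follow.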
Signed-off-by: John Groves <john@groves.net>
---
drivers/dax/fsdev.c | 12 ++++++++++++
drivers/dax/super.c | 38 +++++++++++++++++++++++++++++++++++++-
include/linux/dax.h | 1 +
3 files changed, 50 insertions(+), 1 deletion(-)
diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
index 9e2f83aa2584..3f4f593896e3 100644
--- a/drivers/dax/fsdev.c
+++ b/drivers/dax/fsdev.c
@@ -330,12 +330,24 @@ static int fsdev_dax_probe(struct dev_dax *dev_dax)
if (rc)
return rc;
+ /* Set the dax operations for fs-dax access path */
+ rc = dax_set_ops(dax_dev, &dev_dax_ops);
+ if (rc)
+ return rc;
+
run_dax(dax_dev);
return devm_add_action_or_reset(dev, fsdev_kill, dev_dax);
}
+static void fsdev_dax_remove(struct dev_dax *dev_dax)
+{
+ /* Clear ops on unbind so they aren't used with a different driver */
+ dax_set_ops(dev_dax->dax_dev, NULL);
+}
+
static struct dax_device_driver fsdev_dax_driver = {
.probe = fsdev_dax_probe,
+ .remove = fsdev_dax_remove,
.type = DAXDRV_FSDEV_TYPE,
};
diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index c00b9dff4a06..ba0b4cd18a77 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -157,6 +157,9 @@ long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long nr_pages,
if (!dax_alive(dax_dev))
return -ENXIO;
+ if (!dax_dev->ops)
+ return -EOPNOTSUPP;
+
if (nr_pages < 0)
return -EINVAL;
@@ -207,6 +210,10 @@ int dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
if (!dax_alive(dax_dev))
return -ENXIO;
+
+ if (!dax_dev->ops)
+ return -EOPNOTSUPP;
+
/*
* There are no callers that want to zero more than one page as of now.
* Once users are there, this check can be removed after the
@@ -223,7 +230,7 @@ EXPORT_SYMBOL_GPL(dax_zero_page_range);
size_t dax_recovery_write(struct dax_device *dax_dev, pgoff_t pgoff,
void *addr, size_t bytes, struct iov_iter *iter)
{
- if (!dax_dev->ops->recovery_write)
+ if (!dax_dev->ops || !dax_dev->ops->recovery_write)
return 0;
return dax_dev->ops->recovery_write(dax_dev, pgoff, addr, bytes, iter);
}
@@ -307,6 +314,35 @@ void set_dax_nomc(struct dax_device *dax_dev)
}
EXPORT_SYMBOL_GPL(set_dax_nomc);
+/**
+ * dax_set_ops - set the dax_operations for a dax_device
+ * @dax_dev: the dax_device to configure
+ * @ops: the operations to set (may be NULL to clear)
+ *
+ * This allows drivers to set the dax_operations after the dax_device
+ * has been allocated. This is needed when the device is created before
+ * the driver that needs specific ops is bound (e.g., fsdev_dax binding
+ * to a dev_dax created by hmem).
+ *
+ * When setting non-NULL ops, fails if ops are already set (returns -EBUSY).
+ * When clearing ops (NULL), always succeeds.
+ *
+ * Return: 0 on success, -EBUSY if ops already set
+ */
+int dax_set_ops(struct dax_device *dax_dev, const struct dax_operations *ops)
+{
+ if (ops) {
+ /* Setting ops: fail if already set */
+ if (cmpxchg(&dax_dev->ops, NULL, ops) != NULL)
+ return -EBUSY;
+ } else {
+ /* Clearing ops: always allowed */
+ dax_dev->ops = NULL;
+ }
+ return 0;
+}
+EXPORT_SYMBOL_GPL(dax_set_ops);
+
bool dax_alive(struct dax_device *dax_dev)
{
lockdep_assert_held(&dax_srcu);
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 74e098010016..3fcd8562b72b 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -246,6 +246,7 @@ static inline void dax_break_layout_final(struct inode *inode)
bool dax_alive(struct dax_device *dax_dev);
void *dax_get_private(struct dax_device *dax_dev);
+int dax_set_ops(struct dax_device *dax_dev, const struct dax_operations *ops);
long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long nr_pages,
enum dax_access_mode mode, void **kaddr, unsigned long *pfn);
size_t dax_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
--
2.49.0
* [PATCH V3 06/21] dax: Add fs_dax_get() func to prepare dax for fs-dax usage
2026-01-07 15:33 ` [PATCH V3 00/21] famfs: port into fuse John Groves
` (4 preceding siblings ...)
2026-01-07 15:33 ` [PATCH V3 05/21] dax: Add dax_set_ops() for setting dax_operations at bind time John Groves
@ 2026-01-07 15:33 ` John Groves
2026-01-08 12:27 ` Jonathan Cameron
2026-01-07 15:33 ` [PATCH V3 07/21] dax: prevent driver unbind while filesystem holds device John Groves
` (14 subsequent siblings)
20 siblings, 1 reply; 74+ messages in thread
From: John Groves @ 2026-01-07 15:33 UTC (permalink / raw)
To: John Groves, Miklos Szeredi, Dan Williams, Bernd Schubert,
Alison Schofield
Cc: John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
Matthew Wilcox, Jan Kara, Alexander Viro, David Hildenbrand,
Christian Brauner, Darrick J . Wong, Randy Dunlap, Jeff Layton,
Amir Goldstein, Jonathan Cameron, Stefan Hajnoczi, Joanne Koong,
Josef Bacik, Bagas Sanjaya, Chen Linxuan, James Morse, Fuad Tabba,
Sean Christopherson, Shivank Garg, Ackerley Tng, Gregory Price,
Aravind Ramesh, Ajay Joshi, venkataravis, linux-doc, linux-kernel,
nvdimm, linux-cxl, linux-fsdevel, John Groves
The fs_dax_get() function should be called by fs-dax file systems after
opening an fsdev dax device. This adds holder_operations, which provide
a memory failure callback path and enforce exclusivity between callers
of fs_dax_get().
fs_dax_get() is specific to fsdev_dax, so it checks the driver type
(which required touching bus.[ch]). fs_dax_get() fails if fsdev_dax is
not bound to the memory.
This function serves the same role as fs_dax_get_by_bdev(), which dax
file systems call after opening the pmem block device.
fs_dax_get() can't be located in fsdev.c because struct dax_device is
opaque there.
This will be called by fs/fuse/famfs.c in a subsequent commit.
Signed-off-by: John Groves <john@groves.net>
---
drivers/dax/bus.c | 2 --
drivers/dax/bus.h | 2 ++
drivers/dax/super.c | 54 +++++++++++++++++++++++++++++++++++++++++++++
include/linux/dax.h | 1 +
4 files changed, 57 insertions(+), 2 deletions(-)
diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 0d7228acb913..6e0e28116edc 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -42,8 +42,6 @@ static int dax_bus_uevent(const struct device *dev, struct kobj_uevent_env *env)
return add_uevent_var(env, "MODALIAS=" DAX_DEVICE_MODALIAS_FMT, 0);
}
-#define to_dax_drv(__drv) container_of_const(__drv, struct dax_device_driver, drv)
-
static struct dax_id *__dax_match_id(const struct dax_device_driver *dax_drv,
const char *dev_name)
{
diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h
index 880bdf7e72d7..dc6f112ac4a4 100644
--- a/drivers/dax/bus.h
+++ b/drivers/dax/bus.h
@@ -42,6 +42,8 @@ struct dax_device_driver {
void (*remove)(struct dev_dax *dev);
};
+#define to_dax_drv(__drv) container_of_const(__drv, struct dax_device_driver, drv)
+
int __dax_driver_register(struct dax_device_driver *dax_drv,
struct module *module, const char *mod_name);
#define dax_driver_register(driver) \
diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index ba0b4cd18a77..68c45b918cff 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -14,6 +14,7 @@
#include <linux/fs.h>
#include <linux/cacheinfo.h>
#include "dax-private.h"
+#include "bus.h"
/**
* struct dax_device - anchor object for dax services
@@ -121,6 +122,59 @@ void fs_put_dax(struct dax_device *dax_dev, void *holder)
EXPORT_SYMBOL_GPL(fs_put_dax);
#endif /* CONFIG_BLOCK && CONFIG_FS_DAX */
+#if IS_ENABLED(CONFIG_DEV_DAX_FS)
+/**
+ * fs_dax_get() - get ownership of a devdax via holder/holder_ops
+ *
+ * fs-dax file systems call this function to prepare to use a devdax device for
+ * fsdax. This is like fs_dax_get_by_bdev(), but the caller already has struct
+ * dev_dax (and there is no bdev). The holder makes this exclusive.
+ *
+ * @dax_dev: dev to be prepared for fs-dax usage
+ * @holder: filesystem or mapped device inside the dax_device
+ * @hops: operations for the inner holder
+ *
+ * Returns: 0 on success, <0 on failure
+ */
+int fs_dax_get(struct dax_device *dax_dev, void *holder,
+ const struct dax_holder_operations *hops)
+{
+ struct dev_dax *dev_dax;
+ struct dax_device_driver *dax_drv;
+ int id;
+
+ id = dax_read_lock();
+ if (!dax_dev || !dax_alive(dax_dev) || !igrab(&dax_dev->inode)) {
+ dax_read_unlock(id);
+ return -ENODEV;
+ }
+ dax_read_unlock(id);
+
+ /* Verify the device is bound to fsdev_dax driver */
+ dev_dax = dax_get_private(dax_dev);
+ if (!dev_dax || !dev_dax->dev.driver) {
+ iput(&dax_dev->inode);
+ return -ENODEV;
+ }
+
+ dax_drv = to_dax_drv(dev_dax->dev.driver);
+ if (dax_drv->type != DAXDRV_FSDEV_TYPE) {
+ iput(&dax_dev->inode);
+ return -EOPNOTSUPP;
+ }
+
+ if (cmpxchg(&dax_dev->holder_data, NULL, holder)) {
+ iput(&dax_dev->inode);
+ return -EBUSY;
+ }
+
+ dax_dev->holder_ops = hops;
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(fs_dax_get);
+#endif /* DEV_DAX_FS */
+
enum dax_device_flags {
/* !alive + rcu grace period == no new operations / mappings */
DAXDEV_ALIVE,
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 3fcd8562b72b..76f2a75f3144 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -53,6 +53,7 @@ struct dax_holder_operations {
struct dax_device *alloc_dax(void *private, const struct dax_operations *ops);
#if IS_ENABLED(CONFIG_DEV_DAX_FS)
+int fs_dax_get(struct dax_device *dax_dev, void *holder, const struct dax_holder_operations *hops);
struct dax_device *inode_dax(struct inode *inode);
#endif
void *dax_holder(struct dax_device *dax_dev);
--
2.49.0
* [PATCH V3 07/21] dax: prevent driver unbind while filesystem holds device
2026-01-07 15:33 ` [PATCH V3 00/21] famfs: port into fuse John Groves
` (5 preceding siblings ...)
2026-01-07 15:33 ` [PATCH V3 06/21] dax: Add fs_dax_get() func to prepare dax for fs-dax usage John Groves
@ 2026-01-07 15:33 ` John Groves
2026-01-08 12:34 ` Jonathan Cameron
2026-01-12 18:55 ` John Groves
2026-01-07 15:33 ` [PATCH V3 08/21] dax: export dax_dev_get() John Groves
` (13 subsequent siblings)
20 siblings, 2 replies; 74+ messages in thread
From: John Groves @ 2026-01-07 15:33 UTC (permalink / raw)
To: John Groves, Miklos Szeredi, Dan Williams, Bernd Schubert,
Alison Schofield
Cc: John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
Matthew Wilcox, Jan Kara, Alexander Viro, David Hildenbrand,
Christian Brauner, Darrick J . Wong, Randy Dunlap, Jeff Layton,
Amir Goldstein, Jonathan Cameron, Stefan Hajnoczi, Joanne Koong,
Josef Bacik, Bagas Sanjaya, Chen Linxuan, James Morse, Fuad Tabba,
Sean Christopherson, Shivank Garg, Ackerley Tng, Gregory Price,
Aravind Ramesh, Ajay Joshi, venkataravis, linux-doc, linux-kernel,
nvdimm, linux-cxl, linux-fsdevel, John Groves
From: John Groves <John@Groves.net>
Add custom bind/unbind sysfs attributes for the dax bus that check
whether a filesystem has registered as a holder (via fs_dax_get())
before allowing driver unbind.
When a filesystem like famfs mounts on a dax device, it registers
itself as the holder via dax_holder_ops. Previously, there was no
mechanism to prevent driver unbind while the filesystem was mounted,
so the driver could be unbound out from under a mounted filesystem.
The new unbind_store() checks dax_holder() and returns -EBUSY if
a holder is registered, giving userspace proper feedback that the
device is in use.
To use our custom bind/unbind handlers instead of the default ones,
set suppress_bind_attrs=true on all dax drivers during registration.
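The interaction between the holder claim and the unbind check can be
modeled in a few lines of userspace C. This is only a sketch under stated
assumptions: the sketch_* names are hypothetical, C11 atomics stand in for
cmpxchg(), and the window between checking the holder and releasing the
driver is not modeled.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the holder and driver-bound state. */
struct sketch_dev {
	_Atomic(void *) holder;
	bool bound;
};

static void sketch_dev_init(struct sketch_dev *d)
{
	atomic_init(&d->holder, (void *)NULL);
	d->bound = true;
}

/* Claim the device for one holder; a second claim gets -EBUSY. */
static int sketch_claim(struct sketch_dev *d, void *holder)
{
	void *expected = NULL;

	assert(holder != NULL);
	if (!d->bound)
		return -ENODEV;
	if (!atomic_compare_exchange_strong(&d->holder, &expected, holder))
		return -EBUSY;
	return 0;
}

/* Drop the holder (hypothetical counterpart of the claim). */
static void sketch_release(struct sketch_dev *d)
{
	atomic_store(&d->holder, (void *)NULL);
}

/* Refuse to unbind while a holder is registered. */
static int sketch_unbind(struct sketch_dev *d)
{
	if (atomic_load(&d->holder))
		return -EBUSY;
	d->bound = false;
	return 0;
}
```

In this model an unbind attempt during a mount returns -EBUSY, and it only
succeeds after the holder has been released.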
Signed-off-by: John Groves <john@groves.net>
---
drivers/dax/bus.c | 53 +++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 53 insertions(+)
diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 6e0e28116edc..ed453442739d 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -151,9 +151,61 @@ static ssize_t remove_id_store(struct device_driver *drv, const char *buf,
}
static DRIVER_ATTR_WO(remove_id);
+static const struct bus_type dax_bus_type;
+
+/*
+ * Custom bind/unbind handlers for dax bus.
+ * The unbind handler checks if a filesystem holds the dax device and
+ * returns -EBUSY if so, preventing driver unbind while in use.
+ */
+static ssize_t unbind_store(struct device_driver *drv, const char *buf,
+ size_t count)
+{
+ struct device *dev;
+ int rc = -ENODEV;
+
+ dev = bus_find_device_by_name(&dax_bus_type, NULL, buf);
+ if (dev && dev->driver == drv) {
+ struct dev_dax *dev_dax = to_dev_dax(dev);
+
+ if (dax_holder(dev_dax->dax_dev)) {
+ dev_dbg(dev,
+ "%s: blocking unbind due to active holder\n",
+ __func__);
+ rc = -EBUSY;
+ goto out;
+ }
+ device_release_driver(dev);
+ rc = count;
+ }
+out:
+ put_device(dev);
+ return rc;
+}
+static DRIVER_ATTR_WO(unbind);
+
+static ssize_t bind_store(struct device_driver *drv, const char *buf,
+ size_t count)
+{
+ struct device *dev;
+ int rc = -ENODEV;
+
+ dev = bus_find_device_by_name(&dax_bus_type, NULL, buf);
+ if (dev) {
+ rc = device_driver_attach(drv, dev);
+ if (!rc)
+ rc = count;
+ }
+ put_device(dev);
+ return rc;
+}
+static DRIVER_ATTR_WO(bind);
+
static struct attribute *dax_drv_attrs[] = {
&driver_attr_new_id.attr,
&driver_attr_remove_id.attr,
+ &driver_attr_bind.attr,
+ &driver_attr_unbind.attr,
NULL,
};
ATTRIBUTE_GROUPS(dax_drv);
@@ -1591,6 +1643,7 @@ int __dax_driver_register(struct dax_device_driver *dax_drv,
drv->name = mod_name;
drv->mod_name = mod_name;
drv->bus = &dax_bus_type;
+ drv->suppress_bind_attrs = true;
return driver_register(drv);
}
--
2.49.0
* [PATCH V3 08/21] dax: export dax_dev_get()
2026-01-07 15:33 ` [PATCH V3 00/21] famfs: port into fuse John Groves
` (6 preceding siblings ...)
2026-01-07 15:33 ` [PATCH V3 07/21] dax: prevent driver unbind while filesystem holds device John Groves
@ 2026-01-07 15:33 ` John Groves
2026-01-07 15:33 ` [PATCH V3 09/21] famfs_fuse: magic.h: Add famfs magic numbers John Groves
` (12 subsequent siblings)
20 siblings, 0 replies; 74+ messages in thread
From: John Groves @ 2026-01-07 15:33 UTC (permalink / raw)
To: John Groves, Miklos Szeredi, Dan Williams, Bernd Schubert,
Alison Schofield
Cc: John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
Matthew Wilcox, Jan Kara, Alexander Viro, David Hildenbrand,
Christian Brauner, Darrick J . Wong, Randy Dunlap, Jeff Layton,
Amir Goldstein, Jonathan Cameron, Stefan Hajnoczi, Joanne Koong,
Josef Bacik, Bagas Sanjaya, Chen Linxuan, James Morse, Fuad Tabba,
Sean Christopherson, Shivank Garg, Ackerley Tng, Gregory Price,
Aravind Ramesh, Ajay Joshi, venkataravis, linux-doc, linux-kernel,
nvdimm, linux-cxl, linux-fsdevel, John Groves
famfs needs to look up a dax_device by dev_t when resolving fmap
entries that reference character dax devices.
Signed-off-by: John Groves <john@groves.net>
---
drivers/dax/super.c | 3 ++-
include/linux/dax.h | 1 +
2 files changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 68c45b918cff..c14b07be6a4e 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -511,7 +511,7 @@ static int dax_set(struct inode *inode, void *data)
return 0;
}
-static struct dax_device *dax_dev_get(dev_t devt)
+struct dax_device *dax_dev_get(dev_t devt)
{
struct dax_device *dax_dev;
struct inode *inode;
@@ -534,6 +534,7 @@ static struct dax_device *dax_dev_get(dev_t devt)
return dax_dev;
}
+EXPORT_SYMBOL_GPL(dax_dev_get);
struct dax_device *alloc_dax(void *private, const struct dax_operations *ops)
{
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 76f2a75f3144..2a04c3535806 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -56,6 +56,7 @@ struct dax_device *alloc_dax(void *private, const struct dax_operations *ops);
int fs_dax_get(struct dax_device *dax_dev, void *holder, const struct dax_holder_operations *hops);
struct dax_device *inode_dax(struct inode *inode);
#endif
+struct dax_device *dax_dev_get(dev_t devt);
void *dax_holder(struct dax_device *dax_dev);
void put_dax(struct dax_device *dax_dev);
void kill_dax(struct dax_device *dax_dev);
--
2.49.0
* [PATCH V3 09/21] famfs_fuse: magic.h: Add famfs magic numbers
2026-01-07 15:33 ` [PATCH V3 00/21] famfs: port into fuse John Groves
` (7 preceding siblings ...)
2026-01-07 15:33 ` [PATCH V3 08/21] dax: export dax_dev_get() John Groves
@ 2026-01-07 15:33 ` John Groves
2026-01-07 15:33 ` [PATCH V3 10/21] famfs_fuse: Kconfig John Groves
` (11 subsequent siblings)
20 siblings, 0 replies; 74+ messages in thread
From: John Groves @ 2026-01-07 15:33 UTC (permalink / raw)
To: John Groves, Miklos Szeredi, Dan Williams, Bernd Schubert,
Alison Schofield
Cc: John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
Matthew Wilcox, Jan Kara, Alexander Viro, David Hildenbrand,
Christian Brauner, Darrick J . Wong, Randy Dunlap, Jeff Layton,
Amir Goldstein, Jonathan Cameron, Stefan Hajnoczi, Joanne Koong,
Josef Bacik, Bagas Sanjaya, Chen Linxuan, James Morse, Fuad Tabba,
Sean Christopherson, Shivank Garg, Ackerley Tng, Gregory Price,
Aravind Ramesh, Ajay Joshi, venkataravis, linux-doc, linux-kernel,
nvdimm, linux-cxl, linux-fsdevel, John Groves
Famfs distinguishes between its on-media and in-memory superblocks. This
reserves the numbers, but they are only used by the user space
components of famfs.
Signed-off-by: John Groves <john@groves.net>
---
include/uapi/linux/magic.h | 2 ++
1 file changed, 2 insertions(+)
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index 638ca21b7a90..712b097bf2a5 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -38,6 +38,8 @@
#define OVERLAYFS_SUPER_MAGIC 0x794c7630
#define FUSE_SUPER_MAGIC 0x65735546
#define BCACHEFS_SUPER_MAGIC 0xca451a4e
+#define FAMFS_SUPER_MAGIC 0x87b282ff
+#define FAMFS_STATFS_MAGIC 0x87b282fd
#define MINIX_SUPER_MAGIC 0x137F /* minix v1 fs, 14 char names */
#define MINIX_SUPER_MAGIC2 0x138F /* minix v1 fs, 30 char names */
--
2.49.0
* [PATCH V3 10/21] famfs_fuse: Kconfig
2026-01-07 15:33 ` [PATCH V3 00/21] famfs: port into fuse John Groves
` (8 preceding siblings ...)
2026-01-07 15:33 ` [PATCH V3 09/21] famfs_fuse: magic.h: Add famfs magic numbers John Groves
@ 2026-01-07 15:33 ` John Groves
2026-01-08 12:36 ` Jonathan Cameron
2026-01-07 15:33 ` [PATCH V3 11/21] famfs_fuse: Update macro s/FUSE_IS_DAX/FUSE_IS_VIRTIO_DAX/ John Groves
` (10 subsequent siblings)
20 siblings, 1 reply; 74+ messages in thread
From: John Groves @ 2026-01-07 15:33 UTC (permalink / raw)
To: John Groves, Miklos Szeredi, Dan Williams, Bernd Schubert,
Alison Schofield
Cc: John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
Matthew Wilcox, Jan Kara, Alexander Viro, David Hildenbrand,
Christian Brauner, Darrick J . Wong, Randy Dunlap, Jeff Layton,
Amir Goldstein, Jonathan Cameron, Stefan Hajnoczi, Joanne Koong,
Josef Bacik, Bagas Sanjaya, Chen Linxuan, James Morse, Fuad Tabba,
Sean Christopherson, Shivank Garg, Ackerley Tng, Gregory Price,
Aravind Ramesh, Ajay Joshi, venkataravis, linux-doc, linux-kernel,
nvdimm, linux-cxl, linux-fsdevel, John Groves
Add the FUSE_FAMFS_DAX config option to control compilation of famfs
within fuse.
Signed-off-by: John Groves <john@groves.net>
---
fs/fuse/Kconfig | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/fs/fuse/Kconfig b/fs/fuse/Kconfig
index 3a4ae632c94a..3b6d3121fe40 100644
--- a/fs/fuse/Kconfig
+++ b/fs/fuse/Kconfig
@@ -76,3 +76,17 @@ config FUSE_IO_URING
If you want to allow fuse server/client communication through io-uring,
answer Y
+
+config FUSE_FAMFS_DAX
+ bool "FUSE support for fs-dax filesystems backed by devdax"
+ depends on FUSE_FS
+ depends on DEV_DAX
+ default FUSE_FS
+ select DEV_DAX_FS
+ help
+ This enables the fabric-attached memory file system (famfs),
+ which enables formatting devdax memory as a file system. Famfs
+ is primarily intended for scale-out shared access to
+ disaggregated memory.
+
+ To enable famfs or other fuse/fs-dax file systems, answer Y
--
2.49.0
* [PATCH V3 11/21] famfs_fuse: Update macro s/FUSE_IS_DAX/FUSE_IS_VIRTIO_DAX/
2026-01-07 15:33 ` [PATCH V3 00/21] famfs: port into fuse John Groves
` (9 preceding siblings ...)
2026-01-07 15:33 ` [PATCH V3 10/21] famfs_fuse: Kconfig John Groves
@ 2026-01-07 15:33 ` John Groves
2026-01-09 18:16 ` Joanne Koong
2026-01-07 15:33 ` [PATCH V3 12/21] famfs_fuse: Basic fuse kernel ABI enablement for famfs John Groves
` (9 subsequent siblings)
20 siblings, 1 reply; 74+ messages in thread
From: John Groves @ 2026-01-07 15:33 UTC (permalink / raw)
To: John Groves, Miklos Szeredi, Dan Williams, Bernd Schubert,
Alison Schofield
Cc: John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
Matthew Wilcox, Jan Kara, Alexander Viro, David Hildenbrand,
Christian Brauner, Darrick J . Wong, Randy Dunlap, Jeff Layton,
Amir Goldstein, Jonathan Cameron, Stefan Hajnoczi, Joanne Koong,
Josef Bacik, Bagas Sanjaya, Chen Linxuan, James Morse, Fuad Tabba,
Sean Christopherson, Shivank Garg, Ackerley Tng, Gregory Price,
Aravind Ramesh, Ajay Joshi, venkataravis, linux-doc, linux-kernel,
nvdimm, linux-cxl, linux-fsdevel, John Groves
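Virtio_fs now needs to determine whether an inode is DAX and not famfs,
so rename FUSE_IS_DAX() to FUSE_IS_VIRTIO_DAX() and have it take a
struct fuse_inode pointer.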
Signed-off-by: John Groves <john@groves.net>
---
fs/fuse/dir.c | 2 +-
fs/fuse/file.c | 13 ++++++++-----
fs/fuse/fuse_i.h | 6 +++++-
fs/fuse/inode.c | 4 ++--
fs/fuse/iomode.c | 2 +-
5 files changed, 17 insertions(+), 10 deletions(-)
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 4b6b3d2758ff..1400c9d733ba 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -2153,7 +2153,7 @@ int fuse_do_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
is_truncate = true;
}
- if (FUSE_IS_DAX(inode) && is_truncate) {
+ if (FUSE_IS_VIRTIO_DAX(fi) && is_truncate) {
filemap_invalidate_lock(mapping);
fault_blocked = true;
err = fuse_dax_break_layouts(inode, 0, -1);
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 01bc894e9c2b..093569033ed1 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -252,7 +252,7 @@ static int fuse_open(struct inode *inode, struct file *file)
int err;
bool is_truncate = (file->f_flags & O_TRUNC) && fc->atomic_o_trunc;
bool is_wb_truncate = is_truncate && fc->writeback_cache;
- bool dax_truncate = is_truncate && FUSE_IS_DAX(inode);
+ bool dax_truncate = is_truncate && FUSE_IS_VIRTIO_DAX(fi);
if (fuse_is_bad(inode))
return -EIO;
@@ -1812,11 +1812,12 @@ static ssize_t fuse_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
struct file *file = iocb->ki_filp;
struct fuse_file *ff = file->private_data;
struct inode *inode = file_inode(file);
+ struct fuse_inode *fi = get_fuse_inode(inode);
if (fuse_is_bad(inode))
return -EIO;
- if (FUSE_IS_DAX(inode))
+ if (FUSE_IS_VIRTIO_DAX(fi))
return fuse_dax_read_iter(iocb, to);
/* FOPEN_DIRECT_IO overrides FOPEN_PASSTHROUGH */
@@ -1833,11 +1834,12 @@ static ssize_t fuse_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
struct file *file = iocb->ki_filp;
struct fuse_file *ff = file->private_data;
struct inode *inode = file_inode(file);
+ struct fuse_inode *fi = get_fuse_inode(inode);
if (fuse_is_bad(inode))
return -EIO;
- if (FUSE_IS_DAX(inode))
+ if (FUSE_IS_VIRTIO_DAX(fi))
return fuse_dax_write_iter(iocb, from);
/* FOPEN_DIRECT_IO overrides FOPEN_PASSTHROUGH */
@@ -2370,10 +2372,11 @@ static int fuse_file_mmap(struct file *file, struct vm_area_struct *vma)
struct fuse_file *ff = file->private_data;
struct fuse_conn *fc = ff->fm->fc;
struct inode *inode = file_inode(file);
+ struct fuse_inode *fi = get_fuse_inode(inode);
int rc;
/* DAX mmap is superior to direct_io mmap */
- if (FUSE_IS_DAX(inode))
+ if (FUSE_IS_VIRTIO_DAX(fi))
return fuse_dax_mmap(file, vma);
/*
@@ -2934,7 +2937,7 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
.mode = mode
};
int err;
- bool block_faults = FUSE_IS_DAX(inode) &&
+ bool block_faults = FUSE_IS_VIRTIO_DAX(fi) &&
(!(mode & FALLOC_FL_KEEP_SIZE) ||
(mode & (FALLOC_FL_PUNCH_HOLE | FALLOC_FL_ZERO_RANGE)));
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 7f16049387d1..17736c0a6d2f 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -1508,7 +1508,11 @@ void fuse_free_conn(struct fuse_conn *fc);
/* dax.c */
-#define FUSE_IS_DAX(inode) (IS_ENABLED(CONFIG_FUSE_DAX) && IS_DAX(inode))
+/* This macro is used by virtio_fs, but now it also needs to filter for
+ * "not famfs"
+ */
+#define FUSE_IS_VIRTIO_DAX(fuse_inode) (IS_ENABLED(CONFIG_FUSE_DAX) \
+ && IS_DAX(&fuse_inode->inode))
ssize_t fuse_dax_read_iter(struct kiocb *iocb, struct iov_iter *to);
ssize_t fuse_dax_write_iter(struct kiocb *iocb, struct iov_iter *from);
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 819e50d66622..ed667920997f 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -162,7 +162,7 @@ static void fuse_evict_inode(struct inode *inode)
/* Will write inode on close/munmap and in all other dirtiers */
WARN_ON(inode_state_read_once(inode) & I_DIRTY_INODE);
- if (FUSE_IS_DAX(inode))
+ if (FUSE_IS_VIRTIO_DAX(fi))
dax_break_layout_final(inode);
truncate_inode_pages_final(&inode->i_data);
@@ -170,7 +170,7 @@ static void fuse_evict_inode(struct inode *inode)
if (inode->i_sb->s_flags & SB_ACTIVE) {
struct fuse_conn *fc = get_fuse_conn(inode);
- if (FUSE_IS_DAX(inode))
+ if (FUSE_IS_VIRTIO_DAX(fi))
fuse_dax_inode_cleanup(inode);
if (fi->nlookup) {
fuse_queue_forget(fc, fi->forget, fi->nodeid,
diff --git a/fs/fuse/iomode.c b/fs/fuse/iomode.c
index 3728933188f3..31ee7f3304c6 100644
--- a/fs/fuse/iomode.c
+++ b/fs/fuse/iomode.c
@@ -203,7 +203,7 @@ int fuse_file_io_open(struct file *file, struct inode *inode)
* io modes are not relevant with DAX and with server that does not
* implement open.
*/
- if (FUSE_IS_DAX(inode) || !ff->args)
+ if (FUSE_IS_VIRTIO_DAX(fi) || !ff->args)
return 0;
/*
--
2.49.0
* [PATCH V3 12/21] famfs_fuse: Basic fuse kernel ABI enablement for famfs
2026-01-07 15:33 ` [PATCH V3 00/21] famfs: port into fuse John Groves
` (10 preceding siblings ...)
2026-01-07 15:33 ` [PATCH V3 11/21] famfs_fuse: Update macro s/FUSE_IS_DAX/FUSE_IS_VIRTIO_DAX/ John Groves
@ 2026-01-07 15:33 ` John Groves
2026-01-09 18:29 ` Joanne Koong
2026-01-07 15:33 ` [PATCH V3 13/21] famfs_fuse: Famfs mount opt: -o shadow=<shadowpath> John Groves
` (8 subsequent siblings)
20 siblings, 1 reply; 74+ messages in thread
From: John Groves @ 2026-01-07 15:33 UTC (permalink / raw)
To: John Groves, Miklos Szeredi, Dan Williams, Bernd Schubert,
Alison Schofield
Cc: John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
Matthew Wilcox, Jan Kara, Alexander Viro, David Hildenbrand,
Christian Brauner, Darrick J . Wong, Randy Dunlap, Jeff Layton,
Amir Goldstein, Jonathan Cameron, Stefan Hajnoczi, Joanne Koong,
Josef Bacik, Bagas Sanjaya, Chen Linxuan, James Morse, Fuad Tabba,
Sean Christopherson, Shivank Garg, Ackerley Tng, Gregory Price,
Aravind Ramesh, Ajay Joshi, venkataravis, linux-doc, linux-kernel,
nvdimm, linux-cxl, linux-fsdevel, John Groves
* FUSE_DAX_FMAP flag in INIT request/reply
* fuse_conn->famfs_iomap (enable famfs-mapped files) to denote a
famfs-enabled connection
Signed-off-by: John Groves <john@groves.net>
---
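For readers less familiar with fuse INIT negotiation, here is a minimal
userspace sketch of the handshake this patch relies on. It assumes the
existing fuse convention that INIT flags above bit 31 travel in a second
32-bit word (flags2), valid only when FUSE_INIT_EXT is set. The struct
init_msg type and the helpers pack_flags(), unpack_flags() and
negotiate_famfs() are hypothetical names for illustration, and the server
is modelled as simply returning the intersection of the offered flags and
its own capabilities.

```c
#include <assert.h>
#include <stdint.h>

#define FUSE_INIT_EXT	(1ULL << 30)	/* existing flag: flags2 is valid */
#define FUSE_DAX_FMAP	(1ULL << 43)	/* value added by this patch */

/* Hypothetical stand-in for the flags/flags2 pair in an INIT message */
struct init_msg {
	uint32_t flags;		/* bits 0..31 */
	uint32_t flags2;	/* bits 32..63 */
};

/* Split a 64-bit flag set into the two 32-bit wire words */
static void pack_flags(struct init_msg *m, uint64_t flags)
{
	assert(m);
	m->flags = (uint32_t)flags;
	m->flags2 = (uint32_t)(flags >> 32);
}

/* Rebuild the 64-bit flag set; flags2 only counts with FUSE_INIT_EXT */
static uint64_t unpack_flags(const struct init_msg *m)
{
	uint64_t flags = m->flags;

	if (flags & FUSE_INIT_EXT)
		flags |= (uint64_t)m->flags2 << 32;
	return flags;
}

/*
 * Model of the negotiation: the kernel offers FUSE_DAX_FMAP only when
 * famfs support is built in, and enables famfs_iomap only when the
 * server's reply carries the flag as well.
 */
static int negotiate_famfs(int kernel_has_famfs, uint64_t server_caps)
{
	struct init_msg req, reply;
	uint64_t offered = FUSE_INIT_EXT;

	if (kernel_has_famfs)
		offered |= FUSE_DAX_FMAP;
	pack_flags(&req, offered);

	/* Assumption: the server replies with what both sides support */
	pack_flags(&reply, unpack_flags(&req) & server_caps);

	return kernel_has_famfs &&
	       (unpack_flags(&reply) & FUSE_DAX_FMAP) != 0;
}
```

In this model, negotiate_famfs(1, FUSE_INIT_EXT | FUSE_DAX_FMAP) enables
famfs mode, while a server that lacks the capability, or a kernel built
without CONFIG_FUSE_FAMFS_DAX, leaves the connection as ordinary fuse.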
fs/fuse/fuse_i.h | 3 +++
fs/fuse/inode.c | 6 ++++++
include/uapi/linux/fuse.h | 5 +++++
3 files changed, 14 insertions(+)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 17736c0a6d2f..ec2446099010 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -921,6 +921,9 @@ struct fuse_conn {
/* Is synchronous FUSE_INIT allowed? */
unsigned int sync_init:1;
+ /* dev_dax_iomap support for famfs */
+ unsigned int famfs_iomap:1;
+
/* Use io_uring for communication */
unsigned int io_uring;
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index ed667920997f..acabf92a11f8 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -1456,6 +1456,10 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
if (flags & FUSE_REQUEST_TIMEOUT)
timeout = arg->request_timeout;
+
+ if (IS_ENABLED(CONFIG_FUSE_FAMFS_DAX) &&
+ flags & FUSE_DAX_FMAP)
+ fc->famfs_iomap = 1;
} else {
ra_pages = fc->max_read / PAGE_SIZE;
fc->no_lock = 1;
@@ -1517,6 +1521,8 @@ static struct fuse_init_args *fuse_new_init(struct fuse_mount *fm)
flags |= FUSE_SUBMOUNTS;
if (IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
flags |= FUSE_PASSTHROUGH;
+ if (IS_ENABLED(CONFIG_FUSE_FAMFS_DAX))
+ flags |= FUSE_DAX_FMAP;
/*
* This is just an information flag for fuse server. No need to check
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index c13e1f9a2f12..5e2c93433823 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -240,6 +240,9 @@
* - add FUSE_COPY_FILE_RANGE_64
* - add struct fuse_copy_file_range_out
* - add FUSE_NOTIFY_PRUNE
+ *
+ * 7.46
+ * - Add FUSE_DAX_FMAP capability - ability to handle in-kernel fsdax maps
*/
#ifndef _LINUX_FUSE_H
@@ -448,6 +451,7 @@ struct fuse_file_lock {
* FUSE_OVER_IO_URING: Indicate that client supports io-uring
* FUSE_REQUEST_TIMEOUT: kernel supports timing out requests.
* init_out.request_timeout contains the timeout (in secs)
+ * FUSE_DAX_FMAP: kernel supports dev_dax_iomap (aka famfs) fmaps
*/
#define FUSE_ASYNC_READ (1 << 0)
#define FUSE_POSIX_LOCKS (1 << 1)
@@ -495,6 +499,7 @@ struct fuse_file_lock {
#define FUSE_ALLOW_IDMAP (1ULL << 40)
#define FUSE_OVER_IO_URING (1ULL << 41)
#define FUSE_REQUEST_TIMEOUT (1ULL << 42)
+#define FUSE_DAX_FMAP (1ULL << 43)
/**
* CUSE INIT request/reply flags
--
2.49.0
* [PATCH V3 13/21] famfs_fuse: Famfs mount opt: -o shadow=<shadowpath>
2026-01-07 15:33 ` [PATCH V3 00/21] famfs: port into fuse John Groves
` (11 preceding siblings ...)
2026-01-07 15:33 ` [PATCH V3 12/21] famfs_fuse: Basic fuse kernel ABI enablement for famfs John Groves
@ 2026-01-07 15:33 ` John Groves
2026-01-09 19:22 ` Joanne Koong
2026-01-07 15:33 ` [PATCH V3 14/21] famfs_fuse: Plumb the GET_FMAP message/response John Groves
` (7 subsequent siblings)
20 siblings, 1 reply; 74+ messages in thread
From: John Groves @ 2026-01-07 15:33 UTC (permalink / raw)
To: John Groves, Miklos Szeredi, Dan Williams, Bernd Schubert,
Alison Schofield
Cc: John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
Matthew Wilcox, Jan Kara, Alexander Viro, David Hildenbrand,
Christian Brauner, Darrick J . Wong, Randy Dunlap, Jeff Layton,
Amir Goldstein, Jonathan Cameron, Stefan Hajnoczi, Joanne Koong,
Josef Bacik, Bagas Sanjaya, Chen Linxuan, James Morse, Fuad Tabba,
Sean Christopherson, Shivank Garg, Ackerley Tng, Gregory Price,
Aravind Ramesh, Ajay Joshi, venkataravis, linux-doc, linux-kernel,
nvdimm, linux-cxl, linux-fsdevel, John Groves
The shadow path is a file system area (usually in tmpfs) that the famfs
user space uses to communicate with the famfs fuse server. The user space
tools must be able to resolve a mount point path to its shadow path.
Passing the 'shadow=<path>' argument at mount time exposes the shadow
path via /proc/mounts, which solves this problem. The shadow path is not
otherwise used in the kernel.
Signed-off-by: John Groves <john@groves.net>
---
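As an illustration of the user space side, a tool could recover the
shadow path from the options field of the matching /proc/mounts line
roughly as follows. This is only a sketch: famfs_shadow_from_opts() is a
hypothetical helper, not part of the famfs tools, and it assumes the
shadow path contains no commas, since the option string is
comma-separated.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Find "shadow=<path>" in a comma-separated mount option string (the
 * fourth field of a /proc/mounts line) and copy <path> into out.
 *
 * Returns 0 on success, -1 if the option is absent, empty, or too long
 * for the output buffer.
 */
static int famfs_shadow_from_opts(const char *opts, char *out, size_t outlen)
{
	static const char key[] = "shadow=";
	const size_t keylen = sizeof(key) - 1;
	const char *p = opts;

	assert(opts && out);

	while (*p) {
		/* Each option ends at the next comma or at end of string */
		const char *end = strchr(p, ',');
		size_t len = end ? (size_t)(end - p) : strlen(p);

		/* Match only at the start of an option, with a value */
		if (len > keylen && strncmp(p, key, keylen) == 0) {
			size_t vlen = len - keylen;

			if (vlen >= outlen)
				return -1;
			memcpy(out, p + keylen, vlen);
			out[vlen] = '\0';
			return 0;
		}
		if (!end)
			break;
		p = end + 1;
	}
	return -1;
}
```

Given an options string such as
"rw,nosuid,user_id=0,shadow=/tmp/famfs_shadow", the helper would return
"/tmp/famfs_shadow"; a mount without the option yields -1, which a tool
can treat as "not a famfs mount".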
fs/fuse/fuse_i.h | 25 ++++++++++++++++++++++++-
fs/fuse/inode.c | 28 +++++++++++++++++++++++++++-
2 files changed, 51 insertions(+), 2 deletions(-)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index ec2446099010..84d0ee2a501d 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -620,9 +620,11 @@ struct fuse_fs_context {
unsigned int blksize;
const char *subtype;
- /* DAX device, may be NULL */
+ /* DAX device for virtiofs, may be NULL */
struct dax_device *dax_dev;
+ const char *shadow; /* famfs - null if not famfs */
+
/* fuse_dev pointer to fill in, should contain NULL on entry */
void **fudptr;
};
@@ -998,6 +1000,18 @@ struct fuse_conn {
/* Request timeout (in jiffies). 0 = no timeout */
unsigned int req_timeout;
} timeout;
+
+ /*
+ * This is a workaround until fuse uses iomap for reads.
+ * For fuseblk servers, this represents the blocksize passed in at
+ * mount time and for regular fuse servers, this is equivalent to
+ * inode->i_blkbits.
+ */
+ u8 blkbits;
+
+#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
+ char *shadow;
+#endif
};
/*
@@ -1631,4 +1645,13 @@ extern void fuse_sysctl_unregister(void);
#define fuse_sysctl_unregister() do { } while (0)
#endif /* CONFIG_SYSCTL */
+/* famfs.c */
+
+static inline void famfs_teardown(struct fuse_conn *fc)
+{
+#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
+ kfree(fc->shadow);
+#endif
+}
+
#endif /* _FS_FUSE_I_H */
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index acabf92a11f8..2e0844aabbae 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -783,6 +783,9 @@ enum {
OPT_ALLOW_OTHER,
OPT_MAX_READ,
OPT_BLKSIZE,
+#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
+ OPT_SHADOW,
+#endif
OPT_ERR
};
@@ -797,6 +800,9 @@ static const struct fs_parameter_spec fuse_fs_parameters[] = {
fsparam_u32 ("max_read", OPT_MAX_READ),
fsparam_u32 ("blksize", OPT_BLKSIZE),
fsparam_string ("subtype", OPT_SUBTYPE),
+#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
+ fsparam_string("shadow", OPT_SHADOW),
+#endif
{}
};
@@ -892,6 +898,15 @@ static int fuse_parse_param(struct fs_context *fsc, struct fs_parameter *param)
ctx->blksize = result.uint_32;
break;
+#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
+ case OPT_SHADOW:
+ if (ctx->shadow)
+ return invalfc(fsc, "Multiple shadows specified");
+ ctx->shadow = param->string;
+ param->string = NULL;
+ break;
+#endif
+
default:
return -EINVAL;
}
@@ -905,6 +920,7 @@ static void fuse_free_fsc(struct fs_context *fsc)
if (ctx) {
kfree(ctx->subtype);
+ kfree(ctx->shadow);
kfree(ctx);
}
}
@@ -936,7 +952,10 @@ static int fuse_show_options(struct seq_file *m, struct dentry *root)
else if (fc->dax_mode == FUSE_DAX_INODE_USER)
seq_puts(m, ",dax=inode");
#endif
-
+#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
+ if (fc->shadow)
+ seq_printf(m, ",shadow=%s", fc->shadow);
+#endif
return 0;
}
@@ -1041,6 +1060,8 @@ void fuse_conn_put(struct fuse_conn *fc)
WARN_ON(atomic_read(&bucket->count) != 1);
kfree(bucket);
}
+ famfs_teardown(fc);
+
if (IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
fuse_backing_files_free(fc);
call_rcu(&fc->rcu, delayed_release);
@@ -1916,6 +1937,11 @@ int fuse_fill_super_common(struct super_block *sb, struct fuse_fs_context *ctx)
*ctx->fudptr = fud;
wake_up_all(&fuse_dev_waitq);
}
+
+#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
+ fc->shadow = kstrdup(ctx->shadow, GFP_KERNEL);
+#endif
+
mutex_unlock(&fuse_mutex);
return 0;
--
2.49.0
* [PATCH V3 14/21] famfs_fuse: Plumb the GET_FMAP message/response
2026-01-07 15:33 ` [PATCH V3 00/21] famfs: port into fuse John Groves
` (12 preceding siblings ...)
2026-01-07 15:33 ` [PATCH V3 13/21] famfs_fuse: Famfs mount opt: -o shadow=<shadowpath> John Groves
@ 2026-01-07 15:33 ` John Groves
2026-01-08 12:49 ` Jonathan Cameron
2026-01-07 15:33 ` [PATCH V3 15/21] famfs_fuse: Create files with famfs fmaps John Groves
` (6 subsequent siblings)
20 siblings, 1 reply; 74+ messages in thread
From: John Groves @ 2026-01-07 15:33 UTC (permalink / raw)
To: John Groves, Miklos Szeredi, Dan Williams, Bernd Schubert,
Alison Schofield
Cc: John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
Matthew Wilcox, Jan Kara, Alexander Viro, David Hildenbrand,
Christian Brauner, Darrick J . Wong, Randy Dunlap, Jeff Layton,
Amir Goldstein, Jonathan Cameron, Stefan Hajnoczi, Joanne Koong,
Josef Bacik, Bagas Sanjaya, Chen Linxuan, James Morse, Fuad Tabba,
Sean Christopherson, Shivank Garg, Ackerley Tng, Gregory Price,
Aravind Ramesh, Ajay Joshi, venkataravis, linux-doc, linux-kernel,
nvdimm, linux-cxl, linux-fsdevel, John Groves
Upon completion of an OPEN, if we're in famfs mode we do a GET_FMAP to
retrieve the file-to-dax map and cache it in the kernel. If this
succeeds, read/write/mmap are resolved direct-to-dax with no upcalls.
Signed-off-by: John Groves <john@groves.net>
---
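The open-time flow described above can be modelled in user space as
follows. This is a sketch under stated assumptions, not the kernel code:
struct model_inode, get_fmap_fn, server_ok(), server_fail(),
model_get_fmap() and model_open() are hypothetical names, the transport
callback stands in for fuse_simple_request(), and a copy of the raw
payload stands in for the parsed metadata that a later patch attaches to
the inode. The model releases the reply buffer on every exit path.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

#define FMAP_BUFSIZE 4096	/* stands in for PAGE_SIZE */

/* Hypothetical stand-in for the parts of fuse_inode used here */
struct model_inode {
	void *famfs_meta;	/* non-NULL once an fmap has been cached */
	int is_regular;		/* models S_ISREG(inode->i_mode) */
};

/* Hypothetical transport: fills buf, returns payload size or -errno */
typedef long (*get_fmap_fn)(void *buf, size_t bufsize);

static int upcalls;		/* round trips to the model server */

static long server_ok(void *buf, size_t bufsize)
{
	assert(bufsize >= 64);
	upcalls++;
	memset(buf, 0xab, 64);	/* pretend 64 bytes of fmap arrived */
	return 64;
}

static long server_fail(void *buf, size_t bufsize)
{
	(void)buf;
	(void)bufsize;
	upcalls++;
	return -EIO;
}

static int model_get_fmap(struct model_inode *inode, get_fmap_fn request)
{
	void *buf;
	long rc;

	assert(inode && request);

	/* Already cached: no upcall needed */
	if (inode->famfs_meta)
		return 0;

	buf = calloc(1, FMAP_BUFSIZE);
	if (!buf)
		return -ENOMEM;

	rc = request(buf, FMAP_BUFSIZE);
	if (rc < 0) {
		free(buf);
		return (int)rc;
	}

	/* Keep a copy of the payload as the cached "metadata" */
	inode->famfs_meta = malloc((size_t)rc);
	if (inode->famfs_meta)
		memcpy(inode->famfs_meta, buf, (size_t)rc);
	free(buf);
	return inode->famfs_meta ? 0 : -ENOMEM;
}

/* On a famfs connection, a failed GET_FMAP fails the open */
static int model_open(struct model_inode *inode, int famfs_conn,
		      get_fmap_fn request)
{
	if (famfs_conn && inode->is_regular)
		return model_get_fmap(inode, request);
	return 0;
}
```

In this model the first successful open of a regular file costs one
upcall; later opens of the same inode find the cached map and cost none,
which is the property the fault path depends on.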
MAINTAINERS | 8 +++++
fs/fuse/Makefile | 1 +
fs/fuse/famfs.c | 74 +++++++++++++++++++++++++++++++++++++++
fs/fuse/file.c | 14 +++++++-
fs/fuse/fuse_i.h | 47 ++++++++++++++++++++++++-
fs/fuse/inode.c | 8 ++++-
fs/fuse/iomode.c | 2 +-
include/uapi/linux/fuse.h | 7 ++++
8 files changed, 157 insertions(+), 4 deletions(-)
create mode 100644 fs/fuse/famfs.c
diff --git a/MAINTAINERS b/MAINTAINERS
index 90429cb06090..526309943026 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -10374,6 +10374,14 @@ F: fs/fuse/
F: include/uapi/linux/fuse.h
F: tools/testing/selftests/filesystems/fuse/
+FUSE [FAMFS Fabric-Attached Memory File System]
+M: John Groves <jgroves@micron.com>
+M: John Groves <John@Groves.net>
+L: linux-cxl@vger.kernel.org
+L: linux-fsdevel@vger.kernel.org
+S: Supported
+F: fs/fuse/famfs.c
+
FUTEX SUBSYSTEM
M: Thomas Gleixner <tglx@linutronix.de>
M: Ingo Molnar <mingo@redhat.com>
diff --git a/fs/fuse/Makefile b/fs/fuse/Makefile
index 22ad9538dfc4..3f8dcc8cbbd0 100644
--- a/fs/fuse/Makefile
+++ b/fs/fuse/Makefile
@@ -17,5 +17,6 @@ fuse-$(CONFIG_FUSE_DAX) += dax.o
fuse-$(CONFIG_FUSE_PASSTHROUGH) += passthrough.o backing.o
fuse-$(CONFIG_SYSCTL) += sysctl.o
fuse-$(CONFIG_FUSE_IO_URING) += dev_uring.o
+fuse-$(CONFIG_FUSE_FAMFS_DAX) += famfs.o
virtiofs-y := virtio_fs.o
diff --git a/fs/fuse/famfs.c b/fs/fuse/famfs.c
new file mode 100644
index 000000000000..0f7e3f00e1e7
--- /dev/null
+++ b/fs/fuse/famfs.c
@@ -0,0 +1,74 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * famfs - dax file system for shared fabric-attached memory
+ *
+ * Copyright 2023-2025 Micron Technology, Inc.
+ *
+ * This file system, originally based on ramfs plus the dax support from xfs,
+ * is intended to allow multiple host systems to mount a common file system
+ * view of dax files that map to shared memory.
+ */
+
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/dax.h>
+#include <linux/iomap.h>
+#include <linux/path.h>
+#include <linux/namei.h>
+#include <linux/string.h>
+
+#include "fuse_i.h"
+
+
+#define FMAP_BUFSIZE PAGE_SIZE
+
+int
+fuse_get_fmap(struct fuse_mount *fm, struct inode *inode)
+{
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ size_t fmap_bufsize = FMAP_BUFSIZE;
+ u64 nodeid = get_node_id(inode);
+ ssize_t fmap_size;
+ void *fmap_buf;
+ int rc;
+
+ FUSE_ARGS(args);
+
+ /* Don't retrieve if we already have the famfs metadata */
+ if (fi->famfs_meta)
+ return 0;
+
+ fmap_buf = kcalloc(1, FMAP_BUFSIZE, GFP_KERNEL);
+ if (!fmap_buf)
+ return -ENOMEM;
+
+ args.opcode = FUSE_GET_FMAP;
+ args.nodeid = nodeid;
+
+ /* Variable-sized output buffer
+ * this causes fuse_simple_request() to return the size of the
+ * output payload
+ */
+ args.out_argvar = true;
+ args.out_numargs = 1;
+ args.out_args[0].size = fmap_bufsize;
+ args.out_args[0].value = fmap_buf;
+
+ /* Send GET_FMAP command */
+ rc = fuse_simple_request(fm, &args);
+ if (rc < 0) {
+ pr_err("%s: err=%d from fuse_simple_request()\n",
+ __func__, rc);
+ return rc;
+ }
+ fmap_size = rc;
+
+ /* We retrieved the "fmap" (the file's map to memory), but
+ * we haven't used it yet. A call to famfs_file_init_dax() will be added
+ * here in a subsequent patch, when we add the ability to attach
+ * fmaps to files.
+ */
+
+ kfree(fmap_buf);
+ return 0;
+}
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 093569033ed1..1f64bf68b5ee 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -277,6 +277,16 @@ static int fuse_open(struct inode *inode, struct file *file)
err = fuse_do_open(fm, get_node_id(inode), file, false);
if (!err) {
ff = file->private_data;
+
+ if ((fm->fc->famfs_iomap) && (S_ISREG(inode->i_mode))) {
+ /* Get the famfs fmap - failure is fatal */
+ err = fuse_get_fmap(fm, inode);
+ if (err) {
+ fuse_sync_release(fi, ff, file->f_flags);
+ goto out_nowrite;
+ }
+ }
+
err = fuse_finish_open(inode, file);
if (err)
fuse_sync_release(fi, ff, file->f_flags);
@@ -284,12 +294,14 @@ static int fuse_open(struct inode *inode, struct file *file)
fuse_truncate_update_attr(inode, file);
}
+out_nowrite:
if (is_wb_truncate || dax_truncate)
fuse_release_nowrite(inode);
if (!err) {
if (is_truncate)
truncate_pagecache(inode, 0);
- else if (!(ff->open_flags & FOPEN_KEEP_CACHE))
+ else if (!(ff->open_flags & FOPEN_KEEP_CACHE) &&
+ !fuse_file_famfs(fi))
invalidate_inode_pages2(inode->i_mapping);
}
if (dax_truncate)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 84d0ee2a501d..691c7850cf4e 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -223,6 +223,14 @@ struct fuse_inode {
* so preserve the blocksize specified by the server.
*/
u8 cached_i_blkbits;
+
+#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
+ /* Pointer to the file's famfs metadata. Primary content is the
+ * in-memory version of the fmap - the map from file's offset range
+ * to DAX memory
+ */
+ void *famfs_meta;
+#endif
};
/** FUSE inode state bits */
@@ -1525,11 +1533,14 @@ void fuse_free_conn(struct fuse_conn *fc);
/* dax.c */
+static inline int fuse_file_famfs(struct fuse_inode *fi); /* forward */
+
/* This macro is used by virtio_fs, but now it also needs to filter for
* "not famfs"
*/
#define FUSE_IS_VIRTIO_DAX(fuse_inode) (IS_ENABLED(CONFIG_FUSE_DAX) \
- && IS_DAX(&fuse_inode->inode))
+ && IS_DAX(&fuse_inode->inode) \
+ && !fuse_file_famfs(fuse_inode))
ssize_t fuse_dax_read_iter(struct kiocb *iocb, struct iov_iter *to);
ssize_t fuse_dax_write_iter(struct kiocb *iocb, struct iov_iter *from);
@@ -1654,4 +1665,38 @@ static inline void famfs_teardown(struct fuse_conn *fc)
#endif
}
+static inline struct fuse_backing *famfs_meta_set(struct fuse_inode *fi,
+ void *meta)
+{
+#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
+ return xchg(&fi->famfs_meta, meta);
+#else
+ return NULL;
+#endif
+}
+
+static inline void famfs_meta_free(struct fuse_inode *fi)
+{
+ /* Stub will be connected in a subsequent commit */
+}
+
+static inline int fuse_file_famfs(struct fuse_inode *fi)
+{
+#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
+ return (READ_ONCE(fi->famfs_meta) != NULL);
+#else
+ return 0;
+#endif
+}
+
+#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
+int fuse_get_fmap(struct fuse_mount *fm, struct inode *inode);
+#else
+static inline int
+fuse_get_fmap(struct fuse_mount *fm, struct inode *inode)
+{
+ return 0;
+}
+#endif
+
#endif /* _FS_FUSE_I_H */
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 2e0844aabbae..9e121a1d63b7 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -120,6 +120,9 @@ static struct inode *fuse_alloc_inode(struct super_block *sb)
if (IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
fuse_inode_backing_set(fi, NULL);
+ if (IS_ENABLED(CONFIG_FUSE_FAMFS_DAX))
+ famfs_meta_set(fi, NULL);
+
return &fi->inode;
out_free_forget:
@@ -141,6 +144,9 @@ static void fuse_free_inode(struct inode *inode)
if (IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
fuse_backing_put(fuse_inode_backing(fi));
+ if (S_ISREG(inode->i_mode) && fuse_file_famfs(fi))
+ famfs_meta_free(fi);
+
kmem_cache_free(fuse_inode_cachep, fi);
}
@@ -162,7 +168,7 @@ static void fuse_evict_inode(struct inode *inode)
/* Will write inode on close/munmap and in all other dirtiers */
WARN_ON(inode_state_read_once(inode) & I_DIRTY_INODE);
- if (FUSE_IS_VIRTIO_DAX(fi))
+ if (FUSE_IS_VIRTIO_DAX(fi) || fuse_file_famfs(fi))
dax_break_layout_final(inode);
truncate_inode_pages_final(&inode->i_data);
diff --git a/fs/fuse/iomode.c b/fs/fuse/iomode.c
index 31ee7f3304c6..948148316ef0 100644
--- a/fs/fuse/iomode.c
+++ b/fs/fuse/iomode.c
@@ -203,7 +203,7 @@ int fuse_file_io_open(struct file *file, struct inode *inode)
* io modes are not relevant with DAX and with server that does not
* implement open.
*/
- if (FUSE_IS_VIRTIO_DAX(fi) || !ff->args)
+ if (FUSE_IS_VIRTIO_DAX(fi) || fuse_file_famfs(fi) || !ff->args)
return 0;
/*
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 5e2c93433823..bfb92a4aa8a9 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -669,6 +669,9 @@ enum fuse_opcode {
FUSE_STATX = 52,
FUSE_COPY_FILE_RANGE_64 = 53,
+ /* Famfs / devdax opcodes */
+ FUSE_GET_FMAP = 54,
+
/* CUSE specific operations */
CUSE_INIT = 4096,
@@ -1313,4 +1316,8 @@ struct fuse_uring_cmd_req {
uint8_t padding[6];
};
+/* Famfs fmap message components */
+
+#define FAMFS_FMAP_MAX 32768 /* Largest supported fmap message */
+
#endif /* _LINUX_FUSE_H */
--
2.49.0
* [PATCH V3 15/21] famfs_fuse: Create files with famfs fmaps
2026-01-07 15:33 ` [PATCH V3 00/21] famfs: port into fuse John Groves
` (13 preceding siblings ...)
2026-01-07 15:33 ` [PATCH V3 14/21] famfs_fuse: Plumb the GET_FMAP message/response John Groves
@ 2026-01-07 15:33 ` John Groves
2026-01-07 21:30 ` John Groves
2026-01-08 13:14 ` Jonathan Cameron
2026-01-07 15:33 ` [PATCH V3 16/21] famfs_fuse: GET_DAXDEV message and daxdev_table John Groves
` (5 subsequent siblings)
20 siblings, 2 replies; 74+ messages in thread
From: John Groves @ 2026-01-07 15:33 UTC (permalink / raw)
To: John Groves, Miklos Szeredi, Dan Williams, Bernd Schubert,
Alison Schofield
Cc: John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
Matthew Wilcox, Jan Kara, Alexander Viro, David Hildenbrand,
Christian Brauner, Darrick J . Wong, Randy Dunlap, Jeff Layton,
Amir Goldstein, Jonathan Cameron, Stefan Hajnoczi, Joanne Koong,
Josef Bacik, Bagas Sanjaya, Chen Linxuan, James Morse, Fuad Tabba,
Sean Christopherson, Shivank Garg, Ackerley Tng, Gregory Price,
Aravind Ramesh, Ajay Joshi, venkataravis, linux-doc, linux-kernel,
nvdimm, linux-cxl, linux-fsdevel, John Groves
On completion of the GET_FMAP message/response, set up the full famfs
metadata such that it's possible to handle read/write/mmap directly to
dax. Note that the devdax_iomap plumbing is not in place yet.
* Add famfs_kfmap.h: in-memory structures for resolving famfs file maps
(fmaps) to dax.
* famfs.c: allocate, initialize and free fmaps
* inode.c: only allow famfs mode if the fuse server has CAP_SYS_RAWIO
* Update MAINTAINERS for the new files.
Signed-off-by: John Groves <john@groves.net>
---
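The validation that famfs_fuse_meta_alloc() applies to a simple-extent
fmap can be sketched in user space as below. The wire layouts here
(struct wire_hdr, struct wire_ext), the MAX_EXTENTS limit and the
parse_simple_fmap() helper are hypothetical stand-ins chosen for
illustration, not the uapi structs added by this patch; the 2 MiB
alignment assumes PMD_SIZE on x86-64 with 4 KiB pages.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define WIRE_EXT_SIMPLE	0		/* hypothetical extent type */
#define MAX_EXTENTS	32		/* hypothetical extent limit */
#define EXT_ALIGN	(2ULL << 20)	/* assumed PMD_SIZE: 2 MiB */

/* Hypothetical message header */
struct wire_hdr {
	uint32_t ext_type;
	uint32_t nextents;
	uint64_t file_size;
};

/* Hypothetical simple extent */
struct wire_ext {
	uint32_t devindex;
	uint32_t reserved;
	uint64_t offset;
	uint64_t len;
};

/*
 * Validate a header followed by hdr.nextents simple extents.
 * On success, store the total mapped length in *mapped and return 0;
 * otherwise return -errno.
 */
static int parse_simple_fmap(const void *buf, size_t size, uint64_t *mapped)
{
	struct wire_hdr hdr;
	size_t off = sizeof(hdr);
	uint64_t total = 0;
	uint32_t i;

	assert(buf && mapped);

	/* The header must fit before any of its fields are trusted */
	if (off > size)
		return -EINVAL;
	memcpy(&hdr, buf, sizeof(hdr));

	if (hdr.ext_type != WIRE_EXT_SIMPLE || hdr.nextents < 1)
		return -EINVAL;
	if (hdr.nextents > MAX_EXTENTS)
		return -E2BIG;

	/* The whole extent list must fit in what the server sent */
	if (hdr.nextents * sizeof(struct wire_ext) > size - off)
		return -EINVAL;

	for (i = 0; i < hdr.nextents; i++) {
		struct wire_ext ext;

		/* memcpy avoids unaligned loads from the raw buffer */
		memcpy(&ext, (const char *)buf + off, sizeof(ext));
		off += sizeof(ext);

		if (ext.devindex != 0 ||
		    ext.offset % EXT_ALIGN || ext.len % EXT_ALIGN)
			return -EINVAL;
		total += ext.len;
	}

	/* The extents must cover at least the whole file */
	if (total < hdr.file_size)
		return -EINVAL;

	*mapped = total;
	return 0;
}
```

Checking the running offset against the received size before reading each
region is the important part: the reply comes from a user space server,
so no count or length in it can be trusted until it has been bounded.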
MAINTAINERS | 1 +
fs/fuse/famfs.c | 355 +++++++++++++++++++++++++++++++++++++-
fs/fuse/famfs_kfmap.h | 67 +++++++
fs/fuse/fuse_i.h | 22 ++-
fs/fuse/inode.c | 21 ++-
include/uapi/linux/fuse.h | 56 ++++++
6 files changed, 510 insertions(+), 12 deletions(-)
create mode 100644 fs/fuse/famfs_kfmap.h
diff --git a/MAINTAINERS b/MAINTAINERS
index 526309943026..16b0606a3b85 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -10381,6 +10381,7 @@ L: linux-cxl@vger.kernel.org
L: linux-fsdevel@vger.kernel.org
S: Supported
F: fs/fuse/famfs.c
+F: fs/fuse/famfs_kfmap.h
FUTEX SUBSYSTEM
M: Thomas Gleixner <tglx@linutronix.de>
diff --git a/fs/fuse/famfs.c b/fs/fuse/famfs.c
index 0f7e3f00e1e7..2aabd1d589fd 100644
--- a/fs/fuse/famfs.c
+++ b/fs/fuse/famfs.c
@@ -17,9 +17,355 @@
#include <linux/namei.h>
#include <linux/string.h>
+#include "famfs_kfmap.h"
#include "fuse_i.h"
+/***************************************************************************/
+
+void
+__famfs_meta_free(void *famfs_meta)
+{
+ struct famfs_file_meta *fmap = famfs_meta;
+
+ if (!fmap)
+ return;
+
+ if (fmap) {
+ switch (fmap->fm_extent_type) {
+ case SIMPLE_DAX_EXTENT:
+ kfree(fmap->se);
+ break;
+ case INTERLEAVED_EXTENT:
+ if (fmap->ie)
+ kfree(fmap->ie->ie_strips);
+
+ kfree(fmap->ie);
+ break;
+ default:
+ pr_err("%s: invalid fmap type\n", __func__);
+ break;
+ }
+ }
+ kfree(fmap);
+}
+
+static int
+famfs_check_ext_alignment(struct famfs_meta_simple_ext *se)
+{
+ int errs = 0;
+
+ if (se->dev_index != 0)
+ errs++;
+
+ /* TODO: pass in alignment so we can support the other page sizes */
+ if (!IS_ALIGNED(se->ext_offset, PMD_SIZE))
+ errs++;
+
+ if (!IS_ALIGNED(se->ext_len, PMD_SIZE))
+ errs++;
+
+ return errs;
+}
+
+/**
+ * famfs_fuse_meta_alloc() - Allocate famfs file metadata
+ * @fmap_buf: GET_FMAP response message
+ * @fmap_buf_size: Size of the response message in bytes
+ * @metap: Where to return the allocated famfs_file_meta pointer
+ *
+ * Returns: 0 on success, -errno on failure
+ */
+static int
+famfs_fuse_meta_alloc(
+ void *fmap_buf,
+ size_t fmap_buf_size,
+ struct famfs_file_meta **metap)
+{
+ struct famfs_file_meta *meta = NULL;
+ struct fuse_famfs_fmap_header *fmh;
+ size_t extent_total = 0;
+ size_t next_offset = 0;
+ int errs = 0;
+ int i, j;
+ int rc;
+
+ fmh = (struct fuse_famfs_fmap_header *)fmap_buf;
+
+ /* Move past fmh in fmap_buf */
+ next_offset += sizeof(*fmh);
+ if (next_offset > fmap_buf_size) {
+ pr_err("%s:%d: fmap_buf underflow offset/size %ld/%ld\n",
+ __func__, __LINE__, next_offset, fmap_buf_size);
+ return -EINVAL;
+ }
+
+ if (fmh->nextents < 1) {
+ pr_err("%s: nextents %d < 1\n", __func__, fmh->nextents);
+ return -EINVAL;
+ }
+
+ if (fmh->nextents > FUSE_FAMFS_MAX_EXTENTS) {
+ pr_err("%s: nextents %d > max (%d)\n",
+ __func__, fmh->nextents, FUSE_FAMFS_MAX_EXTENTS);
+ return -E2BIG;
+ }
+
+ meta = kzalloc(sizeof(*meta), GFP_KERNEL);
+ if (!meta)
+ return -ENOMEM;
+
+ meta->error = false;
+ meta->file_type = fmh->file_type;
+ meta->file_size = fmh->file_size;
+ meta->fm_extent_type = fmh->ext_type;
+
+ switch (fmh->ext_type) {
+ case FUSE_FAMFS_EXT_SIMPLE: {
+ struct fuse_famfs_simple_ext *se_in;
+
+ se_in = (struct fuse_famfs_simple_ext *)(fmap_buf + next_offset);
+
+ /* Move past simple extents */
+ next_offset += fmh->nextents * sizeof(*se_in);
+ if (next_offset > fmap_buf_size) {
+ pr_err("%s:%d: fmap_buf underflow offset/size %ld/%ld\n",
+ __func__, __LINE__, next_offset, fmap_buf_size);
+ rc = -EINVAL;
+ goto errout;
+ }
+
+ meta->fm_nextents = fmh->nextents;
+
+ meta->se = kcalloc(meta->fm_nextents, sizeof(*(meta->se)),
+ GFP_KERNEL);
+ if (!meta->se) {
+ rc = -ENOMEM;
+ goto errout;
+ }
+
+ if ((meta->fm_nextents > FUSE_FAMFS_MAX_EXTENTS) ||
+ (meta->fm_nextents < 1)) {
+ rc = -EINVAL;
+ goto errout;
+ }
+
+ for (i = 0; i < fmh->nextents; i++) {
+ meta->se[i].dev_index = se_in[i].se_devindex;
+ meta->se[i].ext_offset = se_in[i].se_offset;
+ meta->se[i].ext_len = se_in[i].se_len;
+
+ /* Record bitmap of referenced daxdev indices */
+ meta->dev_bitmap |= (1 << meta->se[i].dev_index);
+
+ errs += famfs_check_ext_alignment(&meta->se[i]);
+
+ extent_total += meta->se[i].ext_len;
+ }
+ break;
+ }
+
+ case FUSE_FAMFS_EXT_INTERLEAVE: {
+ s64 size_remainder = meta->file_size;
+ struct fuse_famfs_iext *ie_in;
+ int niext = fmh->nextents;
+
+ meta->fm_niext = niext;
+
+ /* Allocate interleaved extent */
+ meta->ie = kcalloc(niext, sizeof(*(meta->ie)), GFP_KERNEL);
+ if (!meta->ie) {
+ rc = -ENOMEM;
+ goto errout;
+ }
+
+ /*
+ * Each interleaved extent has a simple extent list of strips.
+ * Outer loop is over separate interleaved extents
+ */
+ for (i = 0; i < niext; i++) {
+ u64 nstrips;
+ struct fuse_famfs_simple_ext *sie_in;
+
+ /* ie_in = one interleaved extent in fmap_buf */
+ ie_in = (struct fuse_famfs_iext *)
+ (fmap_buf + next_offset);
+
+ /* Move past one interleaved extent header in fmap_buf */
+ next_offset += sizeof(*ie_in);
+ if (next_offset > fmap_buf_size) {
+ pr_err("%s:%d: fmap_buf underflow offset/size %ld/%ld\n",
+ __func__, __LINE__, next_offset,
+ fmap_buf_size);
+ rc = -EINVAL;
+ goto errout;
+ }
+
+ nstrips = ie_in->ie_nstrips;
+ meta->ie[i].fie_chunk_size = ie_in->ie_chunk_size;
+ meta->ie[i].fie_nstrips = ie_in->ie_nstrips;
+ meta->ie[i].fie_nbytes = ie_in->ie_nbytes;
+
+ if (!meta->ie[i].fie_nbytes) {
+ pr_err("%s: zero-length interleave!\n",
+ __func__);
+ rc = -EINVAL;
+ goto errout;
+ }
+
+ /* sie_in = the strip extents in fmap_buf */
+ sie_in = (struct fuse_famfs_simple_ext *)
+ (fmap_buf + next_offset);
+
+ /* Move past strip extents in fmap_buf */
+ next_offset += nstrips * sizeof(*sie_in);
+ if (next_offset > fmap_buf_size) {
+ pr_err("%s:%d: fmap_buf underflow offset/size %ld/%ld\n",
+ __func__, __LINE__, next_offset,
+ fmap_buf_size);
+ rc = -EINVAL;
+ goto errout;
+ }
+
+ if ((nstrips > FUSE_FAMFS_MAX_STRIPS) || (nstrips < 1)) {
+ pr_err("%s: invalid nstrips=%lld (max=%d)\n",
+ __func__, nstrips,
+ FUSE_FAMFS_MAX_STRIPS);
+ errs++;
+ }
+
+ /* Allocate strip extent array */
+ meta->ie[i].ie_strips = kcalloc(ie_in->ie_nstrips,
+ sizeof(meta->ie[i].ie_strips[0]),
+ GFP_KERNEL);
+ if (!meta->ie[i].ie_strips) {
+ rc = -ENOMEM;
+ goto errout;
+ }
+
+ /* Inner loop is over strips */
+ for (j = 0; j < nstrips; j++) {
+ struct famfs_meta_simple_ext *strips_out;
+ u64 devindex = sie_in[j].se_devindex;
+ u64 offset = sie_in[j].se_offset;
+ u64 len = sie_in[j].se_len;
+
+ strips_out = meta->ie[i].ie_strips;
+ strips_out[j].dev_index = devindex;
+ strips_out[j].ext_offset = offset;
+ strips_out[j].ext_len = len;
+
+ /* Record bitmap of referenced daxdev indices */
+ meta->dev_bitmap |= (1 << devindex);
+
+ extent_total += len;
+ errs += famfs_check_ext_alignment(&strips_out[j]);
+ size_remainder -= len;
+ }
+ }
+
+ if (size_remainder > 0) {
+ /* Sum of interleaved extent sizes is less than file size! */
+ pr_err("%s: size_remainder %lld (0x%llx)\n",
+ __func__, size_remainder, size_remainder);
+ rc = -EINVAL;
+ goto errout;
+ }
+ break;
+ }
+
+ default:
+ pr_err("%s: invalid ext_type %d\n", __func__, fmh->ext_type);
+ rc = -EINVAL;
+ goto errout;
+ }
+
+ if (errs > 0) {
+ pr_err("%s: %d alignment errors found\n", __func__, errs);
+ rc = -EINVAL;
+ goto errout;
+ }
+
+ /* More sanity checks */
+ if (extent_total < meta->file_size) {
+ pr_err("%s: file size %ld larger than map size %ld\n",
+ __func__, meta->file_size, extent_total);
+ rc = -EINVAL;
+ goto errout;
+ }
+
+ if (cmpxchg(metap, NULL, meta) != NULL) {
+ pr_debug("%s: fmap race detected\n", __func__);
+ rc = 0; /* fmap already installed */
+ goto errout;
+ }
+
+ return 0;
+errout:
+ __famfs_meta_free(meta);
+ return rc;
+}
+
+/**
+ * famfs_file_init_dax() - init famfs dax file metadata
+ *
+ * @fm: fuse_mount
+ * @inode: the inode
+ * @fmap_buf: fmap response message
+ * @fmap_size: Size of the fmap message
+ *
+ * Initialize famfs metadata for a file, based on the contents of the GET_FMAP
+ * response
+ *
+ * Return: 0=success
+ * -errno=failure
+ */
+int
+famfs_file_init_dax(
+ struct fuse_mount *fm,
+ struct inode *inode,
+ void *fmap_buf,
+ size_t fmap_size)
+{
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ struct famfs_file_meta *meta = NULL;
+ int rc = 0;
+
+ if (fi->famfs_meta) {
+ pr_notice("%s: i_no=%ld fmap_size=%ld ALREADY INITIALIZED\n",
+ __func__,
+ inode->i_ino, fmap_size);
+ return 0;
+ }
+
+ rc = famfs_fuse_meta_alloc(fmap_buf, fmap_size, &meta);
+ if (rc)
+ goto errout;
+
+ /* Publish the famfs metadata on fi->famfs_meta */
+ inode_lock(inode);
+ if (fi->famfs_meta) {
+ rc = -EEXIST; /* file already has famfs metadata */
+ } else {
+ if (famfs_meta_set(fi, meta) != NULL) {
+ pr_debug("%s: file already had metadata\n", __func__);
+ __famfs_meta_free(meta);
+ /* rc is 0 - the file is valid */
+ goto unlock_out;
+ }
+ i_size_write(inode, meta->file_size);
+ inode->i_flags |= S_DAX;
+ }
+ unlock_out:
+ inode_unlock(inode);
+
+errout:
+ if (rc)
+ __famfs_meta_free(meta);
+
+ return rc;
+}
+
#define FMAP_BUFSIZE PAGE_SIZE
int
@@ -63,12 +409,9 @@ fuse_get_fmap(struct fuse_mount *fm, struct inode *inode)
}
fmap_size = rc;
- /* We retrieved the "fmap" (the file's map to memory), but
- * we haven't used it yet. A call to famfs_file_init_dax() will be added
- * here in a subsequent patch, when we add the ability to attach
- * fmaps to files.
- */
+ /* Convert fmap into in-memory format and hang from inode */
+ rc = famfs_file_init_dax(fm, inode, fmap_buf, fmap_size);
kfree(fmap_buf);
- return 0;
+ return rc;
}
diff --git a/fs/fuse/famfs_kfmap.h b/fs/fuse/famfs_kfmap.h
new file mode 100644
index 000000000000..058645cb10a1
--- /dev/null
+++ b/fs/fuse/famfs_kfmap.h
@@ -0,0 +1,67 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * famfs - dax file system for shared fabric-attached memory
+ *
+ * Copyright 2023-2025 Micron Technology, Inc.
+ */
+#ifndef FAMFS_KFMAP_H
+#define FAMFS_KFMAP_H
+
+/*
+ * The structures below are the in-memory metadata format for famfs files.
+ * Metadata retrieved via the GET_FMAP response is converted to this format
+ * for use in resolving file mapping faults.
+ *
+ * The GET_FMAP response contains the same information, but in a more
+ * message-and-versioning-friendly format. Those structs can be found in the
+ * famfs section of include/uapi/linux/fuse.h (aka fuse_kernel.h in libfuse)
+ */
+
+enum famfs_file_type {
+ FAMFS_REG,
+ FAMFS_SUPERBLOCK,
+ FAMFS_LOG,
+};
+
+/* We anticipate the possibility of supporting additional types of extents */
+enum famfs_extent_type {
+ SIMPLE_DAX_EXTENT,
+ INTERLEAVED_EXTENT,
+ INVALID_EXTENT_TYPE,
+};
+
+struct famfs_meta_simple_ext {
+ u64 dev_index;
+ u64 ext_offset;
+ u64 ext_len;
+};
+
+struct famfs_meta_interleaved_ext {
+ u64 fie_nstrips;
+ u64 fie_chunk_size;
+ u64 fie_nbytes;
+ struct famfs_meta_simple_ext *ie_strips;
+};
+
+/*
+ * Each famfs dax file has this hanging from its fuse_inode->famfs_meta
+ */
+struct famfs_file_meta {
+ bool error;
+ enum famfs_file_type file_type;
+ size_t file_size;
+ enum famfs_extent_type fm_extent_type;
+ u64 dev_bitmap; /* bitmap of referenced daxdevs by index */
+ union { /* This will make code a bit more readable */
+ struct {
+ size_t fm_nextents;
+ struct famfs_meta_simple_ext *se;
+ };
+ struct {
+ size_t fm_niext;
+ struct famfs_meta_interleaved_ext *ie;
+ };
+ };
+};
+
+#endif /* FAMFS_KFMAP_H */
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 691c7850cf4e..f9e920e95baf 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -1658,6 +1658,12 @@ extern void fuse_sysctl_unregister(void);
/* famfs.c */
+#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
+int famfs_file_init_dax(struct fuse_mount *fm,
+ struct inode *inode, void *fmap_buf,
+ size_t fmap_size);
+void __famfs_meta_free(void *map);
+#endif
static inline void famfs_teardown(struct fuse_conn *fc)
{
#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
@@ -1665,11 +1671,18 @@ static inline void famfs_teardown(struct fuse_conn *fc)
#endif
}
+static inline void famfs_meta_init(struct fuse_inode *fi)
+{
+#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
+ fi->famfs_meta = NULL;
+#endif
+}
+
static inline struct fuse_backing *famfs_meta_set(struct fuse_inode *fi,
void *meta)
{
#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
- return xchg(&fi->famfs_meta, meta);
+ return cmpxchg(&fi->famfs_meta, NULL, meta);
#else
return NULL;
#endif
@@ -1677,7 +1690,12 @@ static inline struct fuse_backing *famfs_meta_set(struct fuse_inode *fi,
static inline void famfs_meta_free(struct fuse_inode *fi)
{
- /* Stub wil be connected in a subsequent commit */
+#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
+ if (fi->famfs_meta != NULL) {
+ __famfs_meta_free(fi->famfs_meta);
+ famfs_meta_set(fi, NULL);
+ }
+#endif
}
static inline int fuse_file_famfs(struct fuse_inode *fi)
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 9e121a1d63b7..391ead26bfa2 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -121,7 +121,7 @@ static struct inode *fuse_alloc_inode(struct super_block *sb)
fuse_inode_backing_set(fi, NULL);
if (IS_ENABLED(CONFIG_FUSE_FAMFS_DAX))
- famfs_meta_set(fi, NULL);
+ famfs_meta_init(fi);
return &fi->inode;
@@ -1485,8 +1485,21 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
timeout = arg->request_timeout;
if (IS_ENABLED(CONFIG_FUSE_FAMFS_DAX) &&
- flags & FUSE_DAX_FMAP)
- fc->famfs_iomap = 1;
+ flags & FUSE_DAX_FMAP) {
+ /* famfs_iomap is only allowed if the fuse
+ * server has CAP_SYS_RAWIO. This was checked
+ * in fuse_new_init(), and FUSE_DAX_FMAP was
+ * set in in_flags if so. Only allow enablement
+ * if we find it there. This function is
+ * normally not running in fuse server context,
+ * so we can't do the capability check here...
+ */
+ u64 in_flags = ((u64)ia->in.flags2 << 32)
+ | ia->in.flags;
+
+ if (in_flags & FUSE_DAX_FMAP)
+ fc->famfs_iomap = 1;
+ }
} else {
ra_pages = fc->max_read / PAGE_SIZE;
fc->no_lock = 1;
@@ -1548,7 +1561,7 @@ static struct fuse_init_args *fuse_new_init(struct fuse_mount *fm)
flags |= FUSE_SUBMOUNTS;
if (IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
flags |= FUSE_PASSTHROUGH;
- if (IS_ENABLED(CONFIG_FUSE_FAMFS_DAX))
+ if (IS_ENABLED(CONFIG_FUSE_FAMFS_DAX) && capable(CAP_SYS_RAWIO))
flags |= FUSE_DAX_FMAP;
/*
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index bfb92a4aa8a9..e6dd3c24bb11 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -243,6 +243,13 @@
*
* 7.46
* - Add FUSE_DAX_FMAP capability - ability to handle in-kernel fsdax maps
+ * - Add the following structures for the GET_FMAP message reply components:
+ * - struct fuse_famfs_simple_ext
+ * - struct fuse_famfs_iext
+ * - struct fuse_famfs_fmap_header
+ * - Add the following enumerated types
+ * - enum fuse_famfs_file_type
+ * - enum famfs_ext_type
*/
#ifndef _LINUX_FUSE_H
@@ -1318,6 +1325,55 @@ struct fuse_uring_cmd_req {
/* Famfs fmap message components */
+#define FAMFS_FMAP_VERSION 1
+
#define FAMFS_FMAP_MAX 32768 /* Largest supported fmap message */
+#define FUSE_FAMFS_MAX_EXTENTS 32
+#define FUSE_FAMFS_MAX_STRIPS 32
+
+enum fuse_famfs_file_type {
+ FUSE_FAMFS_FILE_REG,
+ FUSE_FAMFS_FILE_SUPERBLOCK,
+ FUSE_FAMFS_FILE_LOG,
+};
+
+enum famfs_ext_type {
+ FUSE_FAMFS_EXT_SIMPLE = 0,
+ FUSE_FAMFS_EXT_INTERLEAVE = 1,
+};
+
+struct fuse_famfs_simple_ext {
+ uint32_t se_devindex;
+ uint32_t reserved;
+ uint64_t se_offset;
+ uint64_t se_len;
+};
+
+struct fuse_famfs_iext { /* Interleaved extent */
+ uint32_t ie_nstrips;
+ uint32_t ie_chunk_size;
+ uint64_t ie_nbytes; /* Total bytes for this interleaved_ext;
+ * sum of strips may be more
+ */
+ uint64_t reserved;
+};
+
+struct fuse_famfs_fmap_header {
+ uint8_t file_type; /* enum fuse_famfs_file_type */
+ uint8_t reserved;
+ uint16_t fmap_version;
+ uint32_t ext_type; /* enum famfs_ext_type */
+ uint32_t nextents;
+ uint32_t reserved0;
+ uint64_t file_size;
+ uint64_t reserved1;
+};
+
+static inline int32_t fmap_msg_min_size(void)
+{
+ /* Smallest fmap message is a header plus one simple extent */
+ return (sizeof(struct fuse_famfs_fmap_header)
+ + sizeof(struct fuse_famfs_simple_ext));
+}
#endif /* _LINUX_FUSE_H */
--
2.49.0
* [PATCH V3 16/21] famfs_fuse: GET_DAXDEV message and daxdev_table
2026-01-07 15:33 ` [PATCH V3 00/21] famfs: port into fuse John Groves
` (14 preceding siblings ...)
2026-01-07 15:33 ` [PATCH V3 15/21] famfs_fuse: Create files with famfs fmaps John Groves
@ 2026-01-07 15:33 ` John Groves
2026-01-08 14:45 ` Jonathan Cameron
2026-01-07 15:33 ` [PATCH V3 17/21] famfs_fuse: Plumb dax iomap and fuse read/write/mmap John Groves
` (4 subsequent siblings)
20 siblings, 1 reply; 74+ messages in thread
From: John Groves @ 2026-01-07 15:33 UTC (permalink / raw)
To: John Groves, Miklos Szeredi, Dan Williams, Bernd Schubert,
Alison Schofield
Cc: John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
Matthew Wilcox, Jan Kara, Alexander Viro, David Hildenbrand,
Christian Brauner, Darrick J . Wong, Randy Dunlap, Jeff Layton,
Amir Goldstein, Jonathan Cameron, Stefan Hajnoczi, Joanne Koong,
Josef Bacik, Bagas Sanjaya, Chen Linxuan, James Morse, Fuad Tabba,
Sean Christopherson, Shivank Garg, Ackerley Tng, Gregory Price,
Aravind Ramesh, Ajay Joshi, venkataravis, linux-doc, linux-kernel,
nvdimm, linux-cxl, linux-fsdevel, John Groves
* The new GET_DAXDEV message/response is added
* The famfs.c:famfs_teardown() function is added as a primary teardown
function for famfs.
* The command is triggered by the famfs_update_daxdev_table() call, if there
are any daxdevs in the subject fmap that are not represented in the
daxdev_table yet.
* fs/namei.c: export may_open_dev()
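
As an illustration of the trigger condition described above, here is a
minimal userspace sketch (not the kernel code) of how a file's dev_bitmap
can be compared against the daxdev table to find indices that still need a
GET_DAXDEV request. The sketch_* names are hypothetical; the patch does
the equivalent scan inside famfs_update_daxdev_table() under
famfs_devlist_sem, which this sketch leaves out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_MAX_DAXDEVS 24 /* mirrors MAX_DAXDEVS in the patch */

/* Hypothetical stand-in for one slot of the kernel's daxdev table */
struct sketch_daxdev {
	bool valid;
};

/*
 * Return a bitmap of the device indices that are referenced in
 * dev_bitmap but not yet marked valid in the table. Each set bit
 * corresponds to one GET_DAXDEV request that would be needed.
 */
static uint64_t sketch_missing_daxdevs(uint64_t dev_bitmap,
				       const struct sketch_daxdev *table,
				       int nslots)
{
	uint64_t missing = 0;
	int i;

	assert(nslots >= 0 && nslots <= 64);

	for (i = 0; i < nslots; i++) {
		/* Skip indices this file does not reference */
		if (!(dev_bitmap & (1ULL << i)))
			continue;

		if (!table[i].valid)
			missing |= 1ULL << i;
	}
	return missing;
}
```

With device 0 already valid, a file whose dev_bitmap references devices 0
and 2 would need a single GET_DAXDEV request, for index 2.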
Signed-off-by: John Groves <john@groves.net>
---
fs/fuse/famfs.c | 236 ++++++++++++++++++++++++++++++++++++++
fs/fuse/famfs_kfmap.h | 26 +++++
fs/fuse/fuse_i.h | 13 ++-
fs/fuse/inode.c | 4 +-
fs/namei.c | 1 +
include/uapi/linux/fuse.h | 20 ++++
6 files changed, 298 insertions(+), 2 deletions(-)
diff --git a/fs/fuse/famfs.c b/fs/fuse/famfs.c
index 2aabd1d589fd..b5cd1b5c1d6c 100644
--- a/fs/fuse/famfs.c
+++ b/fs/fuse/famfs.c
@@ -20,6 +20,239 @@
#include "famfs_kfmap.h"
#include "fuse_i.h"
+/*
+ * famfs_teardown()
+ *
+ * Deallocate famfs metadata for a fuse_conn
+ */
+void
+famfs_teardown(struct fuse_conn *fc)
+{
+ struct famfs_dax_devlist *devlist = fc->dax_devlist;
+ int i;
+
+ kfree(fc->shadow);
+
+ fc->dax_devlist = NULL;
+
+ if (!devlist)
+ return;
+
+ if (!devlist->devlist)
+ goto out;
+
+ /* Close & release all the daxdevs in our table */
+ for (i = 0; i < devlist->nslots; i++) {
+ struct famfs_daxdev *dd = &devlist->devlist[i];
+
+ if (!dd->valid)
+ continue;
+
+ /* Release reference from dax_dev_get() */
+ if (dd->devp)
+ put_dax(dd->devp);
+
+ kfree(dd->name);
+ }
+ kfree(devlist->devlist);
+
+out:
+ kfree(devlist);
+}
+
+static int
+famfs_verify_daxdev(const char *pathname, dev_t *devno)
+{
+ struct inode *inode;
+ struct path path;
+ int err;
+
+ if (!pathname || !*pathname)
+ return -EINVAL;
+
+ err = kern_path(pathname, LOOKUP_FOLLOW, &path);
+ if (err)
+ return err;
+
+ inode = d_backing_inode(path.dentry);
+ if (!S_ISCHR(inode->i_mode)) {
+ err = -EINVAL;
+ goto out_path_put;
+ }
+
+ if (!may_open_dev(&path)) { /* had to export this */
+ err = -EACCES;
+ goto out_path_put;
+ }
+
+ *devno = inode->i_rdev;
+
+out_path_put:
+ path_put(&path);
+ return err;
+}
+
+/**
+ * famfs_fuse_get_daxdev() - Retrieve info for a DAX device from fuse server
+ *
+ * Send a GET_DAXDEV message to the fuse server to retrieve info on a
+ * dax device.
+ *
+ * @fm: fuse_mount
+ * @index: the index of the dax device; daxdevs are referred to by index
+ * in fmaps, and the server resolves the index to a particular daxdev
+ *
+ * Returns: 0=success
+ * -errno=failure
+ */
+static int
+famfs_fuse_get_daxdev(struct fuse_mount *fm, const u64 index)
+{
+ struct fuse_daxdev_out daxdev_out = { 0 };
+ struct fuse_conn *fc = fm->fc;
+ struct famfs_daxdev *daxdev;
+ int err = 0;
+
+ FUSE_ARGS(args);
+
+ /* Store the daxdev in our table */
+ if (index >= fc->dax_devlist->nslots) {
+ pr_err("%s: index(%llu) >= nslots(%d)\n",
+ __func__, index, fc->dax_devlist->nslots);
+ err = -EINVAL;
+ goto out;
+ }
+
+ args.opcode = FUSE_GET_DAXDEV;
+ args.nodeid = index;
+
+ args.in_numargs = 0;
+
+ args.out_numargs = 1;
+ args.out_args[0].size = sizeof(daxdev_out);
+ args.out_args[0].value = &daxdev_out;
+
+ /* Send GET_DAXDEV command */
+ err = fuse_simple_request(fm, &args);
+ if (err) {
+ pr_err("%s: err=%d from fuse_simple_request()\n",
+ __func__, err);
+ /*
+ * The GET_DAXDEV request failed, so there is no daxdev info
+ * to store; the table entry at this index stays invalid.
+ */
+ goto out;
+ }
+
+ down_write(&fc->famfs_devlist_sem);
+
+ daxdev = &fc->dax_devlist->devlist[index];
+
+ /* Abort if daxdev is now valid (race - another thread got it first) */
+ if (daxdev->valid) {
+ up_write(&fc->famfs_devlist_sem);
+ /* We already have a valid entry at this index */
+ pr_debug("%s: daxdev already known\n", __func__);
+ goto out;
+ }
+
+ /* Verify that the dev is valid and can be opened, and get the devno */
+ err = famfs_verify_daxdev(daxdev_out.name, &daxdev->devno);
+ if (err) {
+ up_write(&fc->famfs_devlist_sem);
+ pr_err("%s: err=%d from famfs_verify_daxdev()\n", __func__, err);
+ goto out;
+ }
+
+ /* This will fail if it's not a dax device */
+ daxdev->devp = dax_dev_get(daxdev->devno);
+ if (!daxdev->devp) {
+ up_write(&fc->famfs_devlist_sem);
+ pr_warn("%s: device %s not found or not dax\n",
+ __func__, daxdev_out.name);
+ err = -ENODEV;
+ goto out;
+ }
+
+ daxdev->name = kstrdup(daxdev_out.name, GFP_KERNEL);
+ wmb(); /* all daxdev fields must be visible before marking it valid */
+ daxdev->valid = 1;
+
+ up_write(&fc->famfs_devlist_sem);
+
+out:
+ return err;
+}
+
+/**
+ * famfs_update_daxdev_table() - Update the daxdev table
+ * @fm: fuse_mount
+ * @meta: famfs_file_meta, in-memory format, built from a GET_FMAP response
+ *
+ * This function is called for each new file fmap, to verify whether all
+ * referenced daxdevs are already known (i.e. in the table). Any daxdev
+ * indices referenced in @meta but not in the table will be retrieved via
+ * famfs_fuse_get_daxdev() and added to the table
+ *
+ * Return: 0=success
+ * -errno=failure
+ */
+static int
+famfs_update_daxdev_table(
+ struct fuse_mount *fm,
+ const struct famfs_file_meta *meta)
+{
+ struct famfs_dax_devlist *local_devlist;
+ struct fuse_conn *fc = fm->fc;
+ int err;
+ int i;
+
+ /* First time through we will need to allocate the dax_devlist */
+ if (unlikely(!fc->dax_devlist)) {
+ local_devlist = kcalloc(1, sizeof(*fc->dax_devlist), GFP_KERNEL);
+ if (!local_devlist)
+ return -ENOMEM;
+
+ local_devlist->nslots = MAX_DAXDEVS;
+
+ local_devlist->devlist = kcalloc(MAX_DAXDEVS,
+ sizeof(struct famfs_daxdev),
+ GFP_KERNEL);
+ if (!local_devlist->devlist) {
+ kfree(local_devlist);
+ return -ENOMEM;
+ }
+
+ /* We don't need famfs_devlist_sem here because we use cmpxchg */
+ if (cmpxchg(&fc->dax_devlist, NULL, local_devlist) != NULL) {
+ kfree(local_devlist->devlist);
+ kfree(local_devlist); /* another thread beat us to it */
+ }
+ }
+
+ down_read(&fc->famfs_devlist_sem);
+ for (i = 0; i < fc->dax_devlist->nslots; i++) {
+ if (!(meta->dev_bitmap & (1ULL << i)))
+ continue;
+
+ /* This file meta struct references devindex i;
+ * if devindex i isn't in the table, get it...
+ */
+ if (!(fc->dax_devlist->devlist[i].valid)) {
+ up_read(&fc->famfs_devlist_sem);
+
+ err = famfs_fuse_get_daxdev(fm, i);
+ if (err)
+ pr_err("%s: failed to get daxdev=%d\n",
+ __func__, i);
+
+ down_read(&fc->famfs_devlist_sem);
+ }
+ }
+ up_read(&fc->famfs_devlist_sem);
+
+ return 0;
+}
/***************************************************************************/
@@ -342,6 +575,9 @@ famfs_file_init_dax(
if (rc)
goto errout;
+ /* Make sure this fmap doesn't reference any unknown daxdevs */
+ famfs_update_daxdev_table(fm, meta);
+
/* Publish the famfs metadata on fi->famfs_meta */
inode_lock(inode);
if (fi->famfs_meta) {
diff --git a/fs/fuse/famfs_kfmap.h b/fs/fuse/famfs_kfmap.h
index 058645cb10a1..e76b9057a1e0 100644
--- a/fs/fuse/famfs_kfmap.h
+++ b/fs/fuse/famfs_kfmap.h
@@ -64,4 +64,30 @@ struct famfs_file_meta {
};
};
+/*
+ * famfs_daxdev - tracking struct for a daxdev within a famfs file system
+ *
+ * This is the in-memory daxdev metadata that is populated by parsing
+ * the responses to GET_DAXDEV messages
+ */
+struct famfs_daxdev {
+ /* Include dev uuid? */
+ bool valid;
+ bool error;
+ dev_t devno;
+ struct dax_device *devp;
+ char *name;
+};
+
+#define MAX_DAXDEVS 24
+
+/*
+ * famfs_dax_devlist - list of famfs_daxdev's
+ */
+struct famfs_dax_devlist {
+ int nslots;
+ int ndevs;
+ struct famfs_daxdev *devlist;
+};
+
#endif /* FAMFS_KFMAP_H */
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index f9e920e95baf..d308b74c83ec 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -1018,6 +1018,8 @@ struct fuse_conn {
u8 blkbits;
#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
+ struct rw_semaphore famfs_devlist_sem;
+ struct famfs_dax_devlist *dax_devlist;
char *shadow;
#endif
};
@@ -1663,13 +1665,15 @@ int famfs_file_init_dax(struct fuse_mount *fm,
struct inode *inode, void *fmap_buf,
size_t fmap_size);
void __famfs_meta_free(void *map);
-#endif
+void famfs_teardown(struct fuse_conn *fc);
+#else
static inline void famfs_teardown(struct fuse_conn *fc)
{
#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
kfree(fc->shadow);
#endif
}
+#endif
static inline void famfs_meta_init(struct fuse_inode *fi)
{
@@ -1688,6 +1692,13 @@ static inline struct fuse_backing *famfs_meta_set(struct fuse_inode *fi,
#endif
}
+static inline void famfs_init_devlist_sem(struct fuse_conn *fc)
+{
+#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
+ init_rwsem(&fc->famfs_devlist_sem);
+#endif
+}
+
static inline void famfs_meta_free(struct fuse_inode *fi)
{
#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 391ead26bfa2..78787efcfd07 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -1497,8 +1497,10 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
u64 in_flags = ((u64)ia->in.flags2 << 32)
| ia->in.flags;
- if (in_flags & FUSE_DAX_FMAP)
+ if (in_flags & FUSE_DAX_FMAP) {
+ famfs_init_devlist_sem(fc);
fc->famfs_iomap = 1;
+ }
}
} else {
ra_pages = fc->max_read / PAGE_SIZE;
diff --git a/fs/namei.c b/fs/namei.c
index bf0f66f0e9b9..b47511ac7337 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -4162,6 +4162,7 @@ bool may_open_dev(const struct path *path)
return !(path->mnt->mnt_flags & MNT_NODEV) &&
!(path->mnt->mnt_sb->s_iflags & SB_I_NODEV);
}
+EXPORT_SYMBOL(may_open_dev);
static int may_open(struct mnt_idmap *idmap, const struct path *path,
int acc_mode, int flag)
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index e6dd3c24bb11..2432ccc4f913 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -247,6 +247,9 @@
* - struct fuse_famfs_simple_ext
* - struct fuse_famfs_iext
* - struct fuse_famfs_fmap_header
+ * - Add the following structs for the GET_DAXDEV message and reply
+ * - struct fuse_get_daxdev_in
+ * - struct fuse_daxdev_out
* - Add the following enumerated types
* - enum fuse_famfs_file_type
* - enum famfs_ext_type
@@ -678,6 +681,7 @@ enum fuse_opcode {
/* Famfs / devdax opcodes */
FUSE_GET_FMAP = 54,
+ FUSE_GET_DAXDEV = 55,
/* CUSE specific operations */
CUSE_INIT = 4096,
@@ -1369,6 +1373,22 @@ struct fuse_famfs_fmap_header {
uint64_t reserved1;
};
+struct fuse_get_daxdev_in {
+ uint32_t daxdev_num;
+};
+
+#define DAXDEV_NAME_MAX 256
+
+/* fuse_daxdev_out has enough space for a uuid if we need it */
+struct fuse_daxdev_out {
+ uint16_t index;
+ uint16_t reserved;
+ uint32_t reserved2;
+ uint64_t reserved3;
+ uint64_t reserved4;
+ char name[DAXDEV_NAME_MAX];
+};
+
static inline int32_t fmap_msg_min_size(void)
{
/* Smallest fmap message is a header plus one simple extent */
--
2.49.0
* [PATCH V3 17/21] famfs_fuse: Plumb dax iomap and fuse read/write/mmap
2026-01-07 15:33 ` [PATCH V3 00/21] famfs: port into fuse John Groves
` (15 preceding siblings ...)
2026-01-07 15:33 ` [PATCH V3 16/21] famfs_fuse: GET_DAXDEV message and daxdev_table John Groves
@ 2026-01-07 15:33 ` John Groves
2026-01-08 15:13 ` Jonathan Cameron
2026-01-07 15:33 ` [PATCH V3 18/21] famfs_fuse: Add holder_operations for dax notify_failure() John Groves
` (3 subsequent siblings)
20 siblings, 1 reply; 74+ messages in thread
From: John Groves @ 2026-01-07 15:33 UTC (permalink / raw)
To: John Groves, Miklos Szeredi, Dan Williams, Bernd Schubert,
Alison Schofield
Cc: John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
Matthew Wilcox, Jan Kara, Alexander Viro, David Hildenbrand,
Christian Brauner, Darrick J . Wong, Randy Dunlap, Jeff Layton,
Amir Goldstein, Jonathan Cameron, Stefan Hajnoczi, Joanne Koong,
Josef Bacik, Bagas Sanjaya, Chen Linxuan, James Morse, Fuad Tabba,
Sean Christopherson, Shivank Garg, Ackerley Tng, Gregory Price,
Aravind Ramesh, Ajay Joshi, venkataravis, linux-doc, linux-kernel,
nvdimm, linux-cxl, linux-fsdevel, John Groves
This commit fills in read/write/mmap handling for famfs files. The
dev_dax_iomap interface is used - just like xfs in fs-dax mode.
* Read/write are handled by famfs_fuse_[read|write]_iter() via
dax_iomap_rw() to fsdev_dax.
* Mmap is handled by famfs_fuse_mmap()
* Faults are handled by famfs_filemap*fault(), using dax_iomap_fault()
to fsdev_dax.
* File offset to dax offset resolution is handled via
famfs_fuse_iomap_begin(), which uses famfs "fmaps" to resolve the
requested (file, offset) to an offset on a dax device (by way of
famfs_fileofs_to_daxofs() and famfs_interleave_fileofs_to_daxofs())
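
To make the interleaved case concrete, the following is a userspace sketch
of the arithmetic that famfs_interleave_fileofs_to_daxofs() performs for a
single interleaved extent. The sketch_* types and names are hypothetical
stand-ins for the kernel structures, and the sketch assumes the caller has
already established that file_offset falls inside the extent; daxdev table
validation is omitted.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for struct famfs_meta_simple_ext */
struct sketch_strip {
	uint64_t dev_index;
	uint64_t ext_offset;	/* start of this strip on its dax device */
};

/* Hypothetical stand-in for struct famfs_meta_interleaved_ext */
struct sketch_iext {
	uint64_t nstrips;
	uint64_t chunk_size;
	const struct sketch_strip *strips;
};

struct sketch_mapping {
	uint64_t dev_index;	/* which dax device holds the data */
	uint64_t dax_offset;	/* offset on that device */
	uint64_t length;	/* contiguous bytes valid from dax_offset */
};

/*
 * Resolve (file_offset, len) within one interleaved extent. Chunks are
 * dealt round-robin across the strips, so a mapping never extends past
 * the end of the chunk that contains file_offset.
 */
static struct sketch_mapping
sketch_interleave_resolve(const struct sketch_iext *ie,
			  uint64_t file_offset, uint64_t len)
{
	struct sketch_mapping m;
	uint64_t chunk_num, chunk_offset, stripe_num, strip_num;
	uint64_t chunk_remainder, strip_offset;

	assert(ie->chunk_size != 0 && ie->nstrips != 0);

	chunk_num = file_offset / ie->chunk_size;
	chunk_offset = file_offset % ie->chunk_size;
	stripe_num = chunk_num / ie->nstrips;	/* row of chunks */
	strip_num = chunk_num % ie->nstrips;	/* column (device strip) */
	chunk_remainder = ie->chunk_size - chunk_offset;
	strip_offset = chunk_offset + stripe_num * ie->chunk_size;

	m.dev_index = ie->strips[strip_num].dev_index;
	m.dax_offset = ie->strips[strip_num].ext_offset + strip_offset;
	m.length = len < chunk_remainder ? len : chunk_remainder;
	return m;
}
```

Because the returned length is capped at the remainder of the current
chunk, a request that spans several chunks is resolved one chunk at a time,
with the caller asking again for the next file offset.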
Signed-off-by: John Groves <john@groves.net>
---
fs/fuse/famfs.c | 458 +++++++++++++++++++++++++++++++++++++++++++++++
fs/fuse/file.c | 18 +-
fs/fuse/fuse_i.h | 18 ++
3 files changed, 492 insertions(+), 2 deletions(-)
diff --git a/fs/fuse/famfs.c b/fs/fuse/famfs.c
index b5cd1b5c1d6c..c02b14789c6e 100644
--- a/fs/fuse/famfs.c
+++ b/fs/fuse/famfs.c
@@ -602,6 +602,464 @@ famfs_file_init_dax(
return rc;
}
+/*********************************************************************
+ * iomap_operations
+ *
+ * This stuff uses the iomap (dax-related) helpers to resolve file offsets to
+ * offsets within a dax device.
+ */
+
+static ssize_t famfs_file_bad(struct inode *inode);
+
+static int
+famfs_interleave_fileofs_to_daxofs(struct inode *inode, struct iomap *iomap,
+ loff_t file_offset, off_t len, unsigned int flags)
+{
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ struct famfs_file_meta *meta = fi->famfs_meta;
+ struct fuse_conn *fc = get_fuse_conn(inode);
+ loff_t local_offset = file_offset;
+ int i;
+
+ /* This function is only for extent_type INTERLEAVED_EXTENT */
+ if (meta->fm_extent_type != INTERLEAVED_EXTENT) {
+ pr_err("%s: bad extent type\n", __func__);
+ goto err_out;
+ }
+
+ if (famfs_file_bad(inode))
+ goto err_out;
+
+ iomap->offset = file_offset;
+
+ for (i = 0; i < meta->fm_niext; i++) {
+ struct famfs_meta_interleaved_ext *fei = &meta->ie[i];
+ u64 chunk_size = fei->fie_chunk_size;
+ u64 nstrips = fei->fie_nstrips;
+ u64 ext_size = fei->fie_nbytes;
+
+ ext_size = min_t(u64, ext_size, meta->file_size);
+
+ if (ext_size == 0) {
+ pr_err("%s: ext_size=%llu file_size=%zu\n",
+ __func__, fei->fie_nbytes, meta->file_size);
+ goto err_out;
+ }
+
+ /* Is the data in this striped extent? */
+ if (local_offset < ext_size) {
+ u64 chunk_num = local_offset / chunk_size;
+ u64 chunk_offset = local_offset % chunk_size;
+ u64 stripe_num = chunk_num / nstrips;
+ u64 strip_num = chunk_num % nstrips;
+ u64 chunk_remainder = chunk_size - chunk_offset;
+ u64 strip_offset = chunk_offset + (stripe_num * chunk_size);
+ u64 strip_dax_ofs = fei->ie_strips[strip_num].ext_offset;
+ u64 strip_devidx = fei->ie_strips[strip_num].dev_index;
+
+ if (strip_devidx >= fc->dax_devlist->nslots) {
+ pr_err("%s: strip_devidx %llu >= nslots %d\n",
+ __func__, strip_devidx,
+ fc->dax_devlist->nslots);
+ goto err_out;
+ }
+
+ if (!fc->dax_devlist->devlist[strip_devidx].valid) {
+ pr_err("%s: daxdev=%llu invalid\n", __func__,
+ strip_devidx);
+ goto err_out;
+ }
+
+ iomap->addr = strip_dax_ofs + strip_offset;
+ iomap->offset = file_offset;
+ iomap->length = min_t(loff_t, len, chunk_remainder);
+
+ iomap->dax_dev = fc->dax_devlist->devlist[strip_devidx].devp;
+
+ iomap->type = IOMAP_MAPPED;
+ iomap->flags = flags;
+
+ return 0;
+ }
+ local_offset -= ext_size; /* offset is beyond this striped extent */
+ }
+
+ err_out:
+ pr_err("%s: err_out\n", __func__);
+
+ /* We fell out the end of the extent list.
+ * Set iomap to zero length in this case, and return 0
+ * This just means that the r/w is past EOF
+ */
+ iomap->addr = 0; /* there is no valid dax device offset */
+ iomap->offset = file_offset; /* file offset */
+ iomap->length = 0; /* this had better result in no access to dax mem */
+ iomap->dax_dev = NULL;
+ iomap->type = IOMAP_MAPPED;
+ iomap->flags = flags;
+
+ return 0;
+}
+
+/**
+ * famfs_fileofs_to_daxofs() - Resolve (file, offset, len) to (daxdev, offset, len)
+ *
+ * This function is called by famfs_fuse_iomap_begin() to resolve an offset in a
+ * file to an offset in a dax device. This is upcalled from dax via calls to
+ * both dax_iomap_fault() and dax_iomap_rw(). Dax finishes the job resolving
+ * a fault to a specific physical page (the fault case) or doing a memcpy
+ * variant (the rw case)
+ *
+ * Pages can be PTE (4KiB), PMD (2MiB) or (theoretically) PUD (1GiB)
+ * (these sizes are for x86; they may vary on other CPU architectures)
+ *
+ * @inode: The file where the fault occurred
+ * @iomap: To be filled in to indicate where to find the right memory,
+ * relative to a dax device.
+ * @file_offset: Within the file where the fault occurred (will be page boundary)
+ * @len: The length of the faulted mapping (will be a page multiple)
+ * (will be trimmed in *iomap if it's disjoint in the extent list)
+ * @flags:
+ *
+ * Return values: 0. (info is returned in a modified @iomap struct)
+ */
+static int
+famfs_fileofs_to_daxofs(struct inode *inode, struct iomap *iomap,
+ loff_t file_offset, off_t len, unsigned int flags)
+{
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ struct famfs_file_meta *meta = fi->famfs_meta;
+ struct fuse_conn *fc = get_fuse_conn(inode);
+ loff_t local_offset = file_offset;
+ int i;
+
+ if (!fc->dax_devlist) {
+ pr_err("%s: null dax_devlist\n", __func__);
+ goto err_out;
+ }
+
+ if (famfs_file_bad(inode))
+ goto err_out;
+
+ if (meta->fm_extent_type == INTERLEAVED_EXTENT)
+ return famfs_interleave_fileofs_to_daxofs(inode, iomap,
+ file_offset,
+ len, flags);
+
+ iomap->offset = file_offset;
+
+ for (i = 0; i < meta->fm_nextents; i++) {
+ /* TODO: check devindex too */
+ loff_t dax_ext_offset = meta->se[i].ext_offset;
+ loff_t dax_ext_len = meta->se[i].ext_len;
+ u64 daxdev_idx = meta->se[i].dev_index;
+
+
+ /* TODO: test that superblock and log offsets only happen
+ * with superblock and log files. Requires instrumentation
+ * from user space...
+ */
+
+ /* local_offset is the offset minus the size of extents skipped
+ * so far; If local_offset < dax_ext_len, the data of interest
+ * starts in this extent
+ */
+ if (local_offset < dax_ext_len) {
+ loff_t ext_len_remainder = dax_ext_len - local_offset;
+ struct famfs_daxdev *dd;
+
+ if (daxdev_idx >= fc->dax_devlist->nslots) {
+ pr_err("%s: daxdev_idx %llu >= nslots %d\n",
+ __func__, daxdev_idx,
+ fc->dax_devlist->nslots);
+ goto err_out;
+ }
+
+ dd = &fc->dax_devlist->devlist[daxdev_idx];
+
+ if (!dd->valid || dd->error) {
+ pr_err("%s: daxdev=%llu %s\n", __func__,
+ daxdev_idx,
+ dd->valid ? "error" : "invalid");
+ goto err_out;
+ }
+
+ /*
+ * OK, we found the file metadata extent where this
+ * data begins
+ * @local_offset - The offset within the current
+ * extent
+ * @ext_len_remainder - Remaining length of ext after
+ * skipping local_offset
+ * Outputs:
+ * iomap->addr: the offset within the dax device where
+ * the data starts
+ * iomap->offset: the file offset
+ * iomap->length: the valid length resolved here
+ */
+ iomap->addr = dax_ext_offset + local_offset;
+ iomap->offset = file_offset;
+ iomap->length = min_t(loff_t, len, ext_len_remainder);
+
+ iomap->dax_dev = fc->dax_devlist->devlist[daxdev_idx].devp;
+
+ iomap->type = IOMAP_MAPPED;
+ iomap->flags = flags;
+ return 0;
+ }
+ local_offset -= dax_ext_len; /* Get ready for the next extent */
+ }
+
+ err_out:
+ pr_err("%s: err_out\n", __func__);
+
+ /* We fell out the end of the extent list.
+ * Set iomap to zero length in this case, and return 0
+ * This just means that the r/w is past EOF
+ */
+ iomap->addr = 0; /* there is no valid dax device offset */
+ iomap->offset = file_offset; /* file offset */
+ iomap->length = 0; /* this had better result in no access to dax mem */
+ iomap->dax_dev = NULL;
+ iomap->type = IOMAP_MAPPED;
+ iomap->flags = flags;
+
+ return 0;
+}
+
+/**
+ * famfs_fuse_iomap_begin() - Handler for iomap_begin upcall from dax
+ *
+ * This function is pretty simple because files are
+ * * never partially allocated
+ * * never have holes (never sparse)
+ * * never "allocate on write"
+ *
+ * @inode: inode for the file being accessed
+ * @offset: offset within the file
+ * @length: Length being accessed at offset
+ * @flags:
+ * @iomap: iomap struct to be filled in, resolving (offset, length) to
+ * (daxdev, offset, len)
+ * @srcmap:
+ */
+static int
+famfs_fuse_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
+ unsigned int flags, struct iomap *iomap, struct iomap *srcmap)
+{
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ struct famfs_file_meta *meta = fi->famfs_meta;
+ size_t size;
+
+ size = i_size_read(inode);
+
+ WARN_ON(size != meta->file_size);
+
+ return famfs_fileofs_to_daxofs(inode, iomap, offset, length, flags);
+}
+
+/* Note: We never need a special set of write_iomap_ops because famfs never
+ * performs allocation on write.
+ */
+const struct iomap_ops famfs_iomap_ops = {
+ .iomap_begin = famfs_fuse_iomap_begin,
+};
+
+/*********************************************************************
+ * vm_operations
+ */
+static vm_fault_t
+__famfs_fuse_filemap_fault(struct vm_fault *vmf, unsigned int pe_size,
+ bool write_fault)
+{
+ struct inode *inode = file_inode(vmf->vma->vm_file);
+ vm_fault_t ret;
+ unsigned long pfn;
+
+ if (!IS_DAX(file_inode(vmf->vma->vm_file))) {
+ pr_err("%s: file not marked IS_DAX!!\n", __func__);
+ return VM_FAULT_SIGBUS;
+ }
+
+ if (write_fault) {
+ sb_start_pagefault(inode->i_sb);
+ file_update_time(vmf->vma->vm_file);
+ }
+
+ ret = dax_iomap_fault(vmf, pe_size, &pfn, NULL, &famfs_iomap_ops);
+ if (ret & VM_FAULT_NEEDDSYNC)
+ ret = dax_finish_sync_fault(vmf, pe_size, pfn);
+
+ if (write_fault)
+ sb_end_pagefault(inode->i_sb);
+
+ return ret;
+}
+
+static inline bool
+famfs_is_write_fault(struct vm_fault *vmf)
+{
+ return (vmf->flags & FAULT_FLAG_WRITE) &&
+ (vmf->vma->vm_flags & VM_SHARED);
+}
+
+static vm_fault_t
+famfs_filemap_fault(struct vm_fault *vmf)
+{
+ return __famfs_fuse_filemap_fault(vmf, 0, famfs_is_write_fault(vmf));
+}
+
+static vm_fault_t
+famfs_filemap_huge_fault(struct vm_fault *vmf, unsigned int pe_size)
+{
+ return __famfs_fuse_filemap_fault(vmf, pe_size, famfs_is_write_fault(vmf));
+}
+
+static vm_fault_t
+famfs_filemap_page_mkwrite(struct vm_fault *vmf)
+{
+ return __famfs_fuse_filemap_fault(vmf, 0, true);
+}
+
+static vm_fault_t
+famfs_filemap_pfn_mkwrite(struct vm_fault *vmf)
+{
+ return __famfs_fuse_filemap_fault(vmf, 0, true);
+}
+
+static vm_fault_t
+famfs_filemap_map_pages(struct vm_fault *vmf, pgoff_t start_pgoff,
+ pgoff_t end_pgoff)
+{
+ return filemap_map_pages(vmf, start_pgoff, end_pgoff);
+}
+
+const struct vm_operations_struct famfs_file_vm_ops = {
+ .fault = famfs_filemap_fault,
+ .huge_fault = famfs_filemap_huge_fault,
+ .map_pages = famfs_filemap_map_pages,
+ .page_mkwrite = famfs_filemap_page_mkwrite,
+ .pfn_mkwrite = famfs_filemap_pfn_mkwrite,
+};
+
+/*********************************************************************
+ * file_operations
+ */
+
+/**
+ * famfs_file_bad() - Check for files that aren't in a valid state
+ *
+ * @inode: inode
+ *
+ * Returns: 0=success
+ * -errno=failure
+ */
+static ssize_t
+famfs_file_bad(struct inode *inode)
+{
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ struct famfs_file_meta *meta = fi->famfs_meta;
+ size_t i_size = i_size_read(inode);
+
+ if (!meta) {
+ pr_err("%s: un-initialized famfs file\n", __func__);
+ return -EIO;
+ }
+ if (meta->error) {
+ pr_debug("%s: previously detected metadata errors\n", __func__);
+ return -EIO;
+ }
+ if (i_size != meta->file_size) {
+ pr_warn("%s: i_size overwritten from %zu to %zu\n",
+ __func__, meta->file_size, i_size);
+ meta->error = true;
+ return -ENXIO;
+ }
+ if (!IS_DAX(inode)) {
+ pr_debug("%s: inode %llx IS_DAX is false\n",
+ __func__, (u64)inode);
+ return -ENXIO;
+ }
+ return 0;
+}
+
+static ssize_t
+famfs_fuse_rw_prep(struct kiocb *iocb, struct iov_iter *ubuf)
+{
+ struct inode *inode = iocb->ki_filp->f_mapping->host;
+ size_t i_size = i_size_read(inode);
+ size_t count = iov_iter_count(ubuf);
+ size_t max_count;
+ ssize_t rc;
+
+ rc = famfs_file_bad(inode);
+ if (rc)
+ return rc;
+
+ /* Avoid unsigned underflow if position is past EOF */
+ if (iocb->ki_pos >= i_size)
+ max_count = 0;
+ else
+ max_count = i_size - iocb->ki_pos;
+
+ if (count > max_count)
+ iov_iter_truncate(ubuf, max_count);
+
+ if (!iov_iter_count(ubuf))
+ return 0;
+
+ return rc;
+}
+
+ssize_t
+famfs_fuse_read_iter(struct kiocb *iocb, struct iov_iter *to)
+{
+ ssize_t rc;
+
+ rc = famfs_fuse_rw_prep(iocb, to);
+ if (rc)
+ return rc;
+
+ if (!iov_iter_count(to))
+ return 0;
+
+ rc = dax_iomap_rw(iocb, to, &famfs_iomap_ops);
+
+ file_accessed(iocb->ki_filp);
+ return rc;
+}
+
+ssize_t
+famfs_fuse_write_iter(struct kiocb *iocb, struct iov_iter *from)
+{
+ ssize_t rc;
+
+ rc = famfs_fuse_rw_prep(iocb, from);
+ if (rc)
+ return rc;
+
+ if (!iov_iter_count(from))
+ return 0;
+
+ return dax_iomap_rw(iocb, from, &famfs_iomap_ops);
+}
+
+int
+famfs_fuse_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ struct inode *inode = file_inode(file);
+ ssize_t rc;
+
+ rc = famfs_file_bad(inode);
+ if (rc)
+ return (int)rc;
+
+ file_accessed(file);
+ vma->vm_ops = &famfs_file_vm_ops;
+ vm_flags_set(vma, VM_HUGEPAGE);
+ return 0;
+}
+
#define FMAP_BUFSIZE PAGE_SIZE
int
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 1f64bf68b5ee..45a09a7f0012 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1831,6 +1831,8 @@ static ssize_t fuse_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
if (FUSE_IS_VIRTIO_DAX(fi))
return fuse_dax_read_iter(iocb, to);
+ if (fuse_file_famfs(fi))
+ return famfs_fuse_read_iter(iocb, to);
/* FOPEN_DIRECT_IO overrides FOPEN_PASSTHROUGH */
if (ff->open_flags & FOPEN_DIRECT_IO)
@@ -1853,6 +1855,8 @@ static ssize_t fuse_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
if (FUSE_IS_VIRTIO_DAX(fi))
return fuse_dax_write_iter(iocb, from);
+ if (fuse_file_famfs(fi))
+ return famfs_fuse_write_iter(iocb, from);
/* FOPEN_DIRECT_IO overrides FOPEN_PASSTHROUGH */
if (ff->open_flags & FOPEN_DIRECT_IO)
@@ -1868,9 +1872,13 @@ static ssize_t fuse_splice_read(struct file *in, loff_t *ppos,
unsigned int flags)
{
struct fuse_file *ff = in->private_data;
+ struct inode *inode = file_inode(in);
+ struct fuse_inode *fi = get_fuse_inode(inode);
/* FOPEN_DIRECT_IO overrides FOPEN_PASSTHROUGH */
- if (fuse_file_passthrough(ff) && !(ff->open_flags & FOPEN_DIRECT_IO))
+ if (fuse_file_famfs(fi))
+ return -EIO; /* famfs does not use the page cache... */
+ else if (fuse_file_passthrough(ff) && !(ff->open_flags & FOPEN_DIRECT_IO))
return fuse_passthrough_splice_read(in, ppos, pipe, len, flags);
else
return filemap_splice_read(in, ppos, pipe, len, flags);
@@ -1880,9 +1888,13 @@ static ssize_t fuse_splice_write(struct pipe_inode_info *pipe, struct file *out,
loff_t *ppos, size_t len, unsigned int flags)
{
struct fuse_file *ff = out->private_data;
+ struct inode *inode = file_inode(out);
+ struct fuse_inode *fi = get_fuse_inode(inode);
/* FOPEN_DIRECT_IO overrides FOPEN_PASSTHROUGH */
- if (fuse_file_passthrough(ff) && !(ff->open_flags & FOPEN_DIRECT_IO))
+ if (fuse_file_famfs(fi))
+ return -EIO; /* famfs does not use the page cache... */
+ else if (fuse_file_passthrough(ff) && !(ff->open_flags & FOPEN_DIRECT_IO))
return fuse_passthrough_splice_write(pipe, out, ppos, len, flags);
else
return iter_file_splice_write(pipe, out, ppos, len, flags);
@@ -2390,6 +2402,8 @@ static int fuse_file_mmap(struct file *file, struct vm_area_struct *vma)
/* DAX mmap is superior to direct_io mmap */
if (FUSE_IS_VIRTIO_DAX(fi))
return fuse_dax_mmap(file, vma);
+ if (fuse_file_famfs(fi))
+ return famfs_fuse_mmap(file, vma);
/*
* If inode is in passthrough io mode, because it has some file open
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index d308b74c83ec..5e52c3ba6e94 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -1664,6 +1664,9 @@ extern void fuse_sysctl_unregister(void);
int famfs_file_init_dax(struct fuse_mount *fm,
struct inode *inode, void *fmap_buf,
size_t fmap_size);
+ssize_t famfs_fuse_write_iter(struct kiocb *iocb, struct iov_iter *from);
+ssize_t famfs_fuse_read_iter(struct kiocb *iocb, struct iov_iter *to);
+int famfs_fuse_mmap(struct file *file, struct vm_area_struct *vma);
void __famfs_meta_free(void *map);
void famfs_teardown(struct fuse_conn *fc);
#else
@@ -1673,6 +1676,21 @@ static inline void famfs_teardown(struct fuse_conn *fc)
kfree(fc->shadow);
#endif
}
+static inline ssize_t famfs_fuse_write_iter(struct kiocb *iocb,
+ struct iov_iter *from)
+{
+ return -ENODEV;
+}
+static inline ssize_t famfs_fuse_read_iter(struct kiocb *iocb,
+ struct iov_iter *to)
+{
+ return -ENODEV;
+}
+static inline int famfs_fuse_mmap(struct file *file,
+ struct vm_area_struct *vma)
+{
+ return -ENODEV;
+}
#endif
static inline void famfs_meta_init(struct fuse_inode *fi)
--
2.49.0
^ permalink raw reply related [flat|nested] 74+ messages in thread
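The EOF clamping that famfs_fuse_rw_prep() performs before calling dax_iomap_rw() can be sketched in user space as below. This is only an illustration under stated assumptions: clamp_to_eof() is a hypothetical helper, not a function from the patch, and it models just the arithmetic (no kiocb or iov_iter).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of the patch): return how many of the
 * requested 'count' bytes may be transferred at file position 'pos'
 * without crossing EOF of a file that is 'i_size' bytes long.
 */
static size_t
clamp_to_eof(uint64_t pos, uint64_t i_size, size_t count)
{
	uint64_t max_count;

	/* i_size - pos would wrap around if pos were past EOF */
	max_count = (pos >= i_size) ? 0 : i_size - pos;
	assert(max_count <= i_size);

	return (count > max_count) ? (size_t)max_count : count;
}
```

A result of 0 corresponds to the early "return 0" in the read and write paths: since famfs never allocates on write, a write at or past EOF transfers nothing rather than extending the file.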
* [PATCH V3 18/21] famfs_fuse: Add holder_operations for dax notify_failure()
2026-01-07 15:33 ` [PATCH V3 00/21] famfs: port into fuse John Groves
` (16 preceding siblings ...)
2026-01-07 15:33 ` [PATCH V3 17/21] famfs_fuse: Plumb dax iomap and fuse read/write/mmap John Groves
@ 2026-01-07 15:33 ` John Groves
2026-01-08 15:17 ` Jonathan Cameron
2026-01-07 15:33 ` [PATCH V3 19/21] famfs_fuse: Add DAX address_space_operations with noop_dirty_folio John Groves
` (2 subsequent siblings)
20 siblings, 1 reply; 74+ messages in thread
From: John Groves @ 2026-01-07 15:33 UTC (permalink / raw)
To: John Groves, Miklos Szeredi, Dan Williams, Bernd Schubert,
Alison Schofield
Cc: John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
Matthew Wilcox, Jan Kara, Alexander Viro, David Hildenbrand,
Christian Brauner, Darrick J . Wong, Randy Dunlap, Jeff Layton,
Amir Goldstein, Jonathan Cameron, Stefan Hajnoczi, Joanne Koong,
Josef Bacik, Bagas Sanjaya, Chen Linxuan, James Morse, Fuad Tabba,
Sean Christopherson, Shivank Garg, Ackerley Tng, Gregory Price,
Aravind Ramesh, Ajay Joshi, venkataravis, linux-doc, linux-kernel,
nvdimm, linux-cxl, linux-fsdevel, John Groves
Memory errors are at least somewhat more likely on disaggregated memory
than on-board memory. This commit registers to be notified by fsdev_dax
in the event that a memory failure is detected.
When a file access resolves to a daxdev with memory errors, it will fail
with an appropriate error.
If fs_dax_get() fails for a daxdev, set dd->dax_err. If a daxdev calls
our notify_failure(), set dd->error. When a file access resolves to a
daxdev in either state, set meta->error for that file and stop allowing access.
In general, the recovery from memory errors is to unmount the file
system and re-initialize the memory, but there may be usable degraded
modes of operation - particularly in the future when famfs supports
file systems backed by more than one daxdev. In those cases,
accessing data that is on a working daxdev can still work.
For now, return errors for any file that has encountered a memory or dax
error.
Signed-off-by: John Groves <john@groves.net>
---
fs/fuse/famfs.c | 115 +++++++++++++++++++++++++++++++++++++++---
fs/fuse/famfs_kfmap.h | 3 +-
2 files changed, 109 insertions(+), 9 deletions(-)
diff --git a/fs/fuse/famfs.c b/fs/fuse/famfs.c
index c02b14789c6e..4eb87c5c628e 100644
--- a/fs/fuse/famfs.c
+++ b/fs/fuse/famfs.c
@@ -20,6 +20,26 @@
#include "famfs_kfmap.h"
#include "fuse_i.h"
+static void famfs_set_daxdev_err(
+ struct fuse_conn *fc, struct dax_device *dax_devp);
+
+static int
+famfs_dax_notify_failure(struct dax_device *dax_devp, u64 offset,
+ u64 len, int mf_flags)
+{
+ struct fuse_conn *fc = dax_holder(dax_devp);
+
+ famfs_set_daxdev_err(fc, dax_devp);
+
+ return 0;
+}
+
+static const struct dax_holder_operations famfs_fuse_dax_holder_ops = {
+ .notify_failure = famfs_dax_notify_failure,
+};
+
+/*****************************************************************************/
+
/*
* famfs_teardown()
*
@@ -48,9 +68,12 @@ famfs_teardown(struct fuse_conn *fc)
if (!dd->valid)
continue;
- /* Release reference from dax_dev_get() */
- if (dd->devp)
+ /* Only call fs_put_dax if fs_dax_get succeeded */
+ if (dd->devp) {
+ if (!dd->dax_err)
+ fs_put_dax(dd->devp, fc);
put_dax(dd->devp);
+ }
kfree(dd->name);
}
@@ -174,6 +197,17 @@ famfs_fuse_get_daxdev(struct fuse_mount *fm, const u64 index)
goto out;
}
+ err = fs_dax_get(daxdev->devp, fc, &famfs_fuse_dax_holder_ops);
+ if (err) {
+ /* If fs_dax_get() fails, we don't attempt recovery;
+ * We mark the daxdev valid with dax_err
+ */
+ daxdev->dax_err = 1;
+ pr_err("%s: fs_dax_get(%lld) failed\n",
+ __func__, (u64)daxdev->devno);
+ err = -EBUSY;
+ }
+
daxdev->name = kstrdup(daxdev_out.name, GFP_KERNEL);
wmb(); /* all daxdev fields must be visible before marking it valid */
daxdev->valid = 1;
@@ -254,6 +288,38 @@ famfs_update_daxdev_table(
return 0;
}
+static void
+famfs_set_daxdev_err(
+ struct fuse_conn *fc,
+ struct dax_device *dax_devp)
+{
+ int i;
+
+ /* Gotta search the list by dax_devp;
+ * read lock because we're not adding or removing daxdev entries
+ */
+ down_read(&fc->famfs_devlist_sem);
+ for (i = 0; i < fc->dax_devlist->nslots; i++) {
+ if (fc->dax_devlist->devlist[i].valid) {
+ struct famfs_daxdev *dd = &fc->dax_devlist->devlist[i];
+
+ if (dd->devp != dax_devp)
+ continue;
+
+ dd->error = true;
+ up_read(&fc->famfs_devlist_sem);
+
+ pr_err("%s: memory error on daxdev %s (%d)\n",
+ __func__, dd->name, i);
+ goto done;
+ }
+ }
+ up_read(&fc->famfs_devlist_sem);
+ pr_err("%s: memory err on unrecognized daxdev\n", __func__);
+done:
+ return;
+}
+
/***************************************************************************/
void
@@ -611,6 +677,26 @@ famfs_file_init_dax(
static ssize_t famfs_file_bad(struct inode *inode);
+static int famfs_dax_err(struct famfs_daxdev *dd)
+{
+ if (!dd->valid) {
+ pr_err("%s: daxdev=%s invalid\n",
+ __func__, dd->name);
+ return -EIO;
+ }
+ if (dd->dax_err) {
+ pr_err("%s: daxdev=%s dax_err\n",
+ __func__, dd->name);
+ return -EIO;
+ }
+ if (dd->error) {
+ pr_err("%s: daxdev=%s memory error\n",
+ __func__, dd->name);
+ return -EHWPOISON;
+ }
+ return 0;
+}
+
static int
famfs_interleave_fileofs_to_daxofs(struct inode *inode, struct iomap *iomap,
loff_t file_offset, off_t len, unsigned int flags)
@@ -648,6 +734,7 @@ famfs_interleave_fileofs_to_daxofs(struct inode *inode, struct iomap *iomap,
/* Is the data is in this striped extent? */
if (local_offset < ext_size) {
+ struct famfs_daxdev *dd;
u64 chunk_num = local_offset / chunk_size;
u64 chunk_offset = local_offset % chunk_size;
u64 stripe_num = chunk_num / nstrips;
@@ -656,6 +743,7 @@ famfs_interleave_fileofs_to_daxofs(struct inode *inode, struct iomap *iomap,
u64 strip_offset = chunk_offset + (stripe_num * chunk_size);
u64 strip_dax_ofs = fei->ie_strips[strip_num].ext_offset;
u64 strip_devidx = fei->ie_strips[strip_num].dev_index;
+ int rc;
if (strip_devidx >= fc->dax_devlist->nslots) {
pr_err("%s: strip_devidx %llu >= nslots %d\n",
@@ -670,6 +758,15 @@ famfs_interleave_fileofs_to_daxofs(struct inode *inode, struct iomap *iomap,
goto err_out;
}
+ dd = &fc->dax_devlist->devlist[strip_devidx];
+
+ rc = famfs_dax_err(dd);
+ if (rc) {
+ /* Shut down access to this file */
+ meta->error = true;
+ return rc;
+ }
+
iomap->addr = strip_dax_ofs + strip_offset;
iomap->offset = file_offset;
iomap->length = min_t(loff_t, len, chunk_remainder);
@@ -767,6 +864,7 @@ famfs_fileofs_to_daxofs(struct inode *inode, struct iomap *iomap,
if (local_offset < dax_ext_len) {
loff_t ext_len_remainder = dax_ext_len - local_offset;
struct famfs_daxdev *dd;
+ int rc;
if (daxdev_idx >= fc->dax_devlist->nslots) {
pr_err("%s: daxdev_idx %llu >= nslots %d\n",
@@ -777,11 +875,11 @@ famfs_fileofs_to_daxofs(struct inode *inode, struct iomap *iomap,
dd = &fc->dax_devlist->devlist[daxdev_idx];
- if (!dd->valid || dd->error) {
- pr_err("%s: daxdev=%lld %s\n", __func__,
- daxdev_idx,
- dd->valid ? "error" : "invalid");
- goto err_out;
+ rc = famfs_dax_err(dd);
+ if (rc) {
+ /* Shut down access to this file */
+ meta->error = true;
+ return rc;
}
/*
@@ -966,7 +1064,8 @@ famfs_file_bad(struct inode *inode)
return -EIO;
}
if (meta->error) {
- pr_debug("%s: previously detected metadata errors\n", __func__);
+ pr_debug("%s: previously detected metadata errors\n",
+ __func__);
return -EIO;
}
if (i_size != meta->file_size) {
diff --git a/fs/fuse/famfs_kfmap.h b/fs/fuse/famfs_kfmap.h
index e76b9057a1e0..6a6420bdff48 100644
--- a/fs/fuse/famfs_kfmap.h
+++ b/fs/fuse/famfs_kfmap.h
@@ -73,7 +73,8 @@ struct famfs_file_meta {
struct famfs_daxdev {
/* Include dev uuid? */
bool valid;
- bool error;
+ bool error; /* Dax has reported a memory error (probably poison) */
+ bool dax_err; /* fs_dax_get() failed */
dev_t devno;
struct dax_device *devp;
char *name;
--
2.49.0
^ permalink raw reply related [flat|nested] 74+ messages in thread
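The precedence of the checks in famfs_dax_err() can be modelled in user space roughly as follows. This is a sketch, not the kernel code: struct daxdev_state and daxdev_check() are hypothetical names, and the EHWPOISON fallback value is the asm-generic one, provided only so the example builds on libcs that do not define it.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#ifndef EHWPOISON
#define EHWPOISON 133	/* asm-generic Linux value; fallback assumption */
#endif

/* Hypothetical model of the three per-daxdev state flags */
struct daxdev_state {
	bool valid;	/* table entry has been populated */
	bool error;	/* a memory failure was reported for the device */
	bool dax_err;	/* acquiring the device as holder failed */
};

/* Return 0 if the device is usable, otherwise a negative errno */
static int
daxdev_check(const struct daxdev_state *dd)
{
	assert(dd != NULL);

	/* Invalid entry or failed holder acquisition: generic I/O error */
	if (!dd->valid || dd->dax_err)
		return -EIO;

	/* Reported memory failure: surface it as poison */
	if (dd->error)
		return -EHWPOISON;

	return 0;
}
```

In the patch, any non-zero result also marks the file's metadata as bad, so later accesses to that file fail early in famfs_file_bad() without consulting the device table again.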
* [PATCH V3 19/21] famfs_fuse: Add DAX address_space_operations with noop_dirty_folio
2026-01-07 15:33 ` [PATCH V3 00/21] famfs: port into fuse John Groves
` (17 preceding siblings ...)
2026-01-07 15:33 ` [PATCH V3 18/21] famfs_fuse: Add holder_operations for dax notify_failure() John Groves
@ 2026-01-07 15:33 ` John Groves
2026-01-07 15:33 ` [PATCH V3 20/21] famfs_fuse: Add famfs fmap metadata documentation John Groves
2026-01-07 15:33 ` [PATCH V3 21/21] famfs_fuse: Add documentation John Groves
20 siblings, 0 replies; 74+ messages in thread
From: John Groves @ 2026-01-07 15:33 UTC (permalink / raw)
To: John Groves, Miklos Szeredi, Dan Williams, Bernd Schubert,
Alison Schofield
Cc: John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
Matthew Wilcox, Jan Kara, Alexander Viro, David Hildenbrand,
Christian Brauner, Darrick J . Wong, Randy Dunlap, Jeff Layton,
Amir Goldstein, Jonathan Cameron, Stefan Hajnoczi, Joanne Koong,
Josef Bacik, Bagas Sanjaya, Chen Linxuan, James Morse, Fuad Tabba,
Sean Christopherson, Shivank Garg, Ackerley Tng, Gregory Price,
Aravind Ramesh, Ajay Joshi, venkataravis, linux-doc, linux-kernel,
nvdimm, linux-cxl, linux-fsdevel, John Groves
From: John Groves <John@Groves.net>
Famfs is memory-backed; there is no place to write back to, and no
reason to mark pages dirty at all.
Signed-off-by: John Groves <john@groves.net>
---
fs/fuse/famfs.c | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/fs/fuse/famfs.c b/fs/fuse/famfs.c
index 4eb87c5c628e..32c3d0c2ec48 100644
--- a/fs/fuse/famfs.c
+++ b/fs/fuse/famfs.c
@@ -13,6 +13,7 @@
#include <linux/mm.h>
#include <linux/dax.h>
#include <linux/iomap.h>
+#include <linux/pagemap.h>
#include <linux/path.h>
#include <linux/namei.h>
#include <linux/string.h>
@@ -38,6 +39,15 @@ static const struct dax_holder_operations famfs_fuse_dax_holder_ops = {
.notify_failure = famfs_dax_notify_failure,
};
+/*
+ * DAX address_space_operations for famfs.
+ * famfs doesn't need dirty tracking - writes go directly to
+ * memory with no writeback required.
+ */
+static const struct address_space_operations famfs_dax_aops = {
+ .dirty_folio = noop_dirty_folio,
+};
+
/*****************************************************************************/
/*
@@ -657,6 +667,7 @@ famfs_file_init_dax(
}
i_size_write(inode, meta->file_size);
inode->i_flags |= S_DAX;
+ inode->i_data.a_ops = &famfs_dax_aops;
}
unlock_out:
inode_unlock(inode);
--
2.49.0
^ permalink raw reply related [flat|nested] 74+ messages in thread
* [PATCH V3 20/21] famfs_fuse: Add famfs fmap metadata documentation
2026-01-07 15:33 ` [PATCH V3 00/21] famfs: port into fuse John Groves
` (18 preceding siblings ...)
2026-01-07 15:33 ` [PATCH V3 19/21] famfs_fuse: Add DAX address_space_operations with noop_dirty_folio John Groves
@ 2026-01-07 15:33 ` John Groves
2026-01-07 15:33 ` [PATCH V3 21/21] famfs_fuse: Add documentation John Groves
20 siblings, 0 replies; 74+ messages in thread
From: John Groves @ 2026-01-07 15:33 UTC (permalink / raw)
To: John Groves, Miklos Szeredi, Dan Williams, Bernd Schubert,
Alison Schofield
Cc: John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
Matthew Wilcox, Jan Kara, Alexander Viro, David Hildenbrand,
Christian Brauner, Darrick J . Wong, Randy Dunlap, Jeff Layton,
Amir Goldstein, Jonathan Cameron, Stefan Hajnoczi, Joanne Koong,
Josef Bacik, Bagas Sanjaya, Chen Linxuan, James Morse, Fuad Tabba,
Sean Christopherson, Shivank Garg, Ackerley Tng, Gregory Price,
Aravind Ramesh, Ajay Joshi, venkataravis, linux-doc, linux-kernel,
nvdimm, linux-cxl, linux-fsdevel, John Groves
From: John Groves <John@Groves.net>
This describes the fmap metadata for both simple and interleaved extents.
Signed-off-by: John Groves <john@groves.net>
---
fs/fuse/famfs_kfmap.h | 73 +++++++++++++++++++++++++++++++++++++++++++
1 file changed, 73 insertions(+)
diff --git a/fs/fuse/famfs_kfmap.h b/fs/fuse/famfs_kfmap.h
index 6a6420bdff48..ac5971d4c63a 100644
--- a/fs/fuse/famfs_kfmap.h
+++ b/fs/fuse/famfs_kfmap.h
@@ -7,6 +7,79 @@
#ifndef FAMFS_KFMAP_H
#define FAMFS_KFMAP_H
+/* KABI version 43 (aka v2) fmap structures
+ *
+ * The location of the memory backing for a famfs file is described by
+ * the response to the GET_FMAP fuse message (defined in
+ * include/uapi/linux/fuse.h).
+ *
+ * There are currently two extent formats: Simple and Interleaved.
+ *
+ * Simple extents are just (devindex, offset, length) tuples, where devindex
+ * references a devdax device that must be retrievable via the GET_DAXDEV
+ * message/response.
+ *
+ * The extent list size must be >= file_size.
+ *
+ * Interleaved extents merit some additional explanation. Interleaved
+ * extents stripe data across a collection of strips. Each strip is a
+ * contiguous allocation from a single devdax device - and is described by
+ * a simple_extent structure.
+ *
+ * Interleaved_extent example:
+ * ie_nstrips = 4
+ * ie_chunk_size = 2MiB
+ * ie_nbytes = 24MiB
+ *
+ * ┌────────────┐────────────┐────────────┐────────────┐
+ * │Chunk = 0 │Chunk = 1 │Chunk = 2 │Chunk = 3 │
+ * │Strip = 0 │Strip = 1 │Strip = 2 │Strip = 3 │
+ * │Stripe = 0 │Stripe = 0 │Stripe = 0 │Stripe = 0 │
+ * │ │ │ │ │
+ * └────────────┘────────────┘────────────┘────────────┘
+ * │Chunk = 4 │Chunk = 5 │Chunk = 6 │Chunk = 7 │
+ * │Strip = 0 │Strip = 1 │Strip = 2 │Strip = 3 │
+ * │Stripe = 1 │Stripe = 1 │Stripe = 1 │Stripe = 1 │
+ * │ │ │ │ │
+ * └────────────┘────────────┘────────────┘────────────┘
+ * │Chunk = 8 │Chunk = 9 │Chunk = 10 │Chunk = 11 │
+ * │Strip = 0 │Strip = 1 │Strip = 2 │Strip = 3 │
+ * │Stripe = 2 │Stripe = 2 │Stripe = 2 │Stripe = 2 │
+ * │ │ │ │ │
+ * └────────────┘────────────┘────────────┘────────────┘
+ *
+ * * Data is laid out across chunks in chunk # order
+ * * Columns are strips
+ * * Strips are contiguous devdax extents, normally each coming from a
+ * different memory device
+ * * Rows are stripes
+ * * The number of chunks is (int)((file_size + chunk_size - 1) / chunk_size)
+ * (and obviously the last chunk could be partial)
+ * * The stripe_size = (nstrips * chunk_size)
+ * * chunk_num(offset) = offset / chunk_size //integer division
+ * * strip_num(offset) = chunk_num(offset) % nstrips
+ * * stripe_num(offset) = offset / stripe_size //integer division
+ * * ...You get the idea - see the code for more details...
+ *
+ * Some concrete examples from the layout above:
+ * * Offset 0 in the file is offset 0 in chunk 0, which is offset 0 in
+ * strip 0
+ * * Offset 4MiB in the file is offset 0 in chunk 2, which is offset 0 in
+ * strip 2
+ * * Offset 15MiB in the file is offset 1MiB in chunk 7, which is offset
+ * 3MiB in strip 3
+ *
+ * Notes about this metadata format:
+ *
+ * * For various reasons, chunk_size must be a multiple of the applicable
+ * PAGE_SIZE
+ * * Since chunk_size and nstrips are constant within an interleaved_extent,
+ * resolving a file offset to a strip offset within a single
+ * interleaved_ext is O(1).
+ * * If nstrips==1, a list of interleaved_ext structures degenerates to a
+ * regular extent list (albeit with some wasted struct space).
+ */
+
/*
* The structures below are the in-memory metadata format for famfs files.
* Metadata retrieved via the GET_FMAP response is converted to this format
--
2.49.0
^ permalink raw reply related [flat|nested] 74+ messages in thread
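The interleaved-extent arithmetic described in the comment above can be checked with a small user-space sketch. The struct and function names here are hypothetical (they are not famfs structures), and strip_num is computed as chunk_num % nstrips, which is an inference from the layout diagram rather than a quote from the kernel code.

```c
#include <assert.h>
#include <stdint.h>

#define MiB (1024ULL * 1024ULL)

/* Hypothetical result type; not a famfs structure */
struct strip_loc {
	uint64_t chunk_num;	/* chunk index within the interleaved extent */
	uint64_t chunk_offset;	/* byte offset within that chunk */
	uint64_t strip_num;	/* strip (column) that holds the chunk */
	uint64_t stripe_num;	/* stripe (row) that holds the chunk */
	uint64_t strip_offset;	/* byte offset within the strip */
};

/* Resolve an offset within one interleaved extent to a strip location */
static struct strip_loc
resolve_interleaved(uint64_t offset, uint64_t nstrips, uint64_t chunk_size)
{
	struct strip_loc loc;

	assert(nstrips > 0 && chunk_size > 0);

	loc.chunk_num = offset / chunk_size;
	loc.chunk_offset = offset % chunk_size;
	loc.strip_num = loc.chunk_num % nstrips;
	loc.stripe_num = loc.chunk_num / nstrips;
	/* Each earlier stripe contributes one full chunk to this strip */
	loc.strip_offset = loc.chunk_offset + loc.stripe_num * chunk_size;

	return loc;
}
```

With nstrips = 4 and chunk_size = 2MiB, offset 15MiB resolves to chunk 7, strip 3, stripe 1, and strip offset 3MiB, matching the third concrete example in the comment. The computation involves no loops, which is why resolution within a single interleaved extent is constant time.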
* [PATCH V3 21/21] famfs_fuse: Add documentation
2026-01-07 15:33 ` [PATCH V3 00/21] famfs: port into fuse John Groves
` (19 preceding siblings ...)
2026-01-07 15:33 ` [PATCH V3 20/21] famfs_fuse: Add famfs fmap metadata documentation John Groves
@ 2026-01-07 15:33 ` John Groves
2026-01-08 15:27 ` Jonathan Cameron
20 siblings, 1 reply; 74+ messages in thread
From: John Groves @ 2026-01-07 15:33 UTC (permalink / raw)
To: John Groves, Miklos Szeredi, Dan Williams, Bernd Schubert,
Alison Schofield
Cc: John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
Matthew Wilcox, Jan Kara, Alexander Viro, David Hildenbrand,
Christian Brauner, Darrick J . Wong, Randy Dunlap, Jeff Layton,
Amir Goldstein, Jonathan Cameron, Stefan Hajnoczi, Joanne Koong,
Josef Bacik, Bagas Sanjaya, Chen Linxuan, James Morse, Fuad Tabba,
Sean Christopherson, Shivank Garg, Ackerley Tng, Gregory Price,
Aravind Ramesh, Ajay Joshi, venkataravis, linux-doc, linux-kernel,
nvdimm, linux-cxl, linux-fsdevel, John Groves
Add Documentation/filesystems/famfs.rst and update MAINTAINERS
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: John Groves <john@groves.net>
---
Documentation/filesystems/famfs.rst | 142 ++++++++++++++++++++++++++++
Documentation/filesystems/index.rst | 1 +
MAINTAINERS | 1 +
3 files changed, 144 insertions(+)
create mode 100644 Documentation/filesystems/famfs.rst
diff --git a/Documentation/filesystems/famfs.rst b/Documentation/filesystems/famfs.rst
new file mode 100644
index 000000000000..0d3c9ba9b7a8
--- /dev/null
+++ b/Documentation/filesystems/famfs.rst
@@ -0,0 +1,142 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+.. _famfs_index:
+
+==================================================================
+famfs: The fabric-attached memory file system
+==================================================================
+
+- Copyright (C) 2024-2025 Micron Technology, Inc.
+
+Introduction
+============
+Compute Express Link (CXL) provides a mechanism for disaggregated or
+fabric-attached memory (FAM). This creates opportunities for data sharing;
+clustered apps that would otherwise have to shard or replicate data can
+share one copy in disaggregated memory.
+
+Famfs, which is not CXL-specific in any way, provides a mechanism for
+multiple hosts to concurrently access data in shared memory, by giving it
+a file system interface. With famfs, any app that understands files can
+access data sets in shared memory. Although famfs supports read and write,
+the real point is to support mmap, which provides direct (dax) access to
+the memory - either writable or read-only.
+
+Shared memory can pose complex coherency and synchronization issues, but
+there are also simple cases. Two simple and eminently useful patterns that
+occur frequently in data analytics and AI are:
+
+* Serial Sharing - Only one host or process at a time has access to a file
+* Read-only Sharing - Multiple hosts or processes share read-only access
+ to a file
+
+The famfs fuse file system is part of the famfs framework; user space
+components [1] handle metadata allocation and distribution, and provide a
+low-level fuse server to expose files that map directly to [presumably
+shared] memory.
+
+The famfs framework manages coherency of its own metadata and structures,
+but does not attempt to manage coherency for applications.
+
+Famfs also provides data isolation between files. That is, even though
+the host has access to an entire memory "device" (as a devdax device), apps
+cannot write to memory for which the file is read-only, and mapping one
+file provides isolation from the memory of all other files. This is pretty
+basic, but some experimental shared memory usage patterns provide no such
+isolation.
+
+Principles of Operation
+=======================
+
+Famfs is a file system with one or more devdax devices as first-class
+backing devices. Metadata maintenance and query operations happen
+entirely in user space.
+
+The famfs low-level fuse server daemon provides file maps (fmaps) and
+devdax device info to the fuse/famfs kernel component so that
+read/write/mapping faults can be handled without up-calls for all active
+files.
+
+The famfs user space is responsible for maintaining and distributing
+consistent metadata. This is currently handled via an append-only
+metadata log within the memory, but this is orthogonal to the fuse/famfs
+kernel code.
+
+Once instantiated, "the same file" on each host points to the same shared
+memory, but in-memory metadata (inodes, etc.) is ephemeral on each host
+that has a famfs instance mounted. Use cases are free to allow or not
+allow mutations to data on a file-by-file basis.
+
+When an app accesses a data object in a famfs file, there is no page cache
+involvement. The CPU cache is loaded directly from the shared memory. In
+some use cases, this is an enormous reduction in read amplification
+compared to loading an entire page into the page cache.
+
+
+Famfs is Not a Conventional File System
+---------------------------------------
+
+Famfs files can be accessed by conventional means, but there are
+limitations. The kernel component of fuse/famfs is not involved in the
+allocation of backing memory for files at all; the famfs user space
+creates files and responds as a low-level fuse server with fmaps and
+devdax device info upon request.
+
+Famfs differs in some important ways from conventional file systems:
+
+* Files must be pre-allocated by the famfs framework; allocation is never
+ performed on (or after) write.
+* Any operation that changes a file's size is considered to put the file
+ in an invalid state, disabling access to the data. It may be possible to
+ revisit this in the future. (Typically the famfs user space can restore
+ files to a valid state by replaying the famfs metadata log.)
+
+Famfs exists to apply the existing file system abstractions to shared
+memory so applications and workflows can more easily adapt to an
+environment with disaggregated shared memory.
+
+Memory Error Handling
+=====================
+
+Possible memory errors include timeouts, poison and unexpected
+reconfiguration of an underlying dax device. In all of these cases, famfs
+is notified by the devdax layer via dax_holder_operations->notify_failure().
+If any memory errors have been detected, access to the affected daxdev is
+disabled to avoid further errors or corruption.
+
+In all known cases, famfs can be unmounted cleanly. In most cases errors
+can be cleared by re-initializing the memory - at which point a new famfs
+file system can be created.
+
+Key Requirements
+================
+
+The primary requirements for famfs are:
+
+1. Must support a file system abstraction backed by sharable devdax memory
+2. Files must efficiently handle VMA faults
+3. Must support metadata distribution in a sharable way
+4. Must handle clients with a stale copy of metadata
+
+The famfs kernel component takes care of 1-2 above by caching each file's
+mapping metadata in the kernel.
+
+Requirements 3 and 4 are handled by the user space components, and are
+largely orthogonal to the functionality of the famfs kernel module.
+
+Requirements 3 and 4 cannot be met by conventional fs-dax file systems
+(e.g. xfs) because they use write-back metadata; it is not valid to mount
+such a file system on two hosts from the same in-memory image.
+
+
+Famfs Usage
+===========
+
+Famfs usage is documented at [1].
+
+
+References
+==========
+
+- [1] Famfs user space repository and documentation
+ https://github.com/cxl-micron-reskit/famfs
diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst
index f4873197587d..e6fb467c1680 100644
--- a/Documentation/filesystems/index.rst
+++ b/Documentation/filesystems/index.rst
@@ -89,6 +89,7 @@ Documentation for filesystem implementations.
ext3
ext4/index
f2fs
+ famfs
gfs2/index
hfs
hfsplus
diff --git a/MAINTAINERS b/MAINTAINERS
index 16b0606a3b85..b74ac9395264 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -10380,6 +10380,7 @@ M: John Groves <John@Groves.net>
L: linux-cxl@vger.kernel.org
L: linux-fsdevel@vger.kernel.org
S: Supported
+F: Documentation/filesystems/famfs.rst
F: fs/fuse/famfs.c
F: fs/fuse/famfs_kfmap.h
--
2.49.0
^ permalink raw reply related [flat|nested] 74+ messages in thread
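Since the documentation above names mmap as the primary access path, a minimal read-only mapping helper might look like the sketch below. It uses only standard POSIX calls and nothing famfs-specific; map_file_ro() is a hypothetical helper name, and any path under a famfs mount that a caller passes in is the caller's assumption.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: map an existing file read-only and shared.
 * Returns the mapping and stores its length in *len_out, or returns
 * NULL on any failure (including an empty file).
 */
static const void *
map_file_ro(const char *path, size_t *len_out)
{
	struct stat st;
	void *addr;
	int fd;

	assert(path != NULL && len_out != NULL);

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return NULL;

	if (fstat(fd, &st) < 0 || st.st_size <= 0) {
		close(fd);
		return NULL;
	}

	addr = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
	close(fd);	/* the mapping remains valid after the fd is closed */
	if (addr == MAP_FAILED)
		return NULL;

	*len_out = (size_t)st.st_size;
	return addr;
}
```

On a famfs file, loads through such a mapping would be served directly from the backing memory with no page cache copy, per the description above. The helper never changes the file size, which matters because famfs treats size-changing operations as invalidating the file.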
* [PATCH V3 0/4] libfuse: add basic famfs support to libfuse
2026-01-07 15:32 [PATCH BUNDLE] famfs: Fabric-Attached Memory File System John Groves
2026-01-07 15:33 ` [PATCH V3 00/21] famfs: port into fuse John Groves
@ 2026-01-07 15:34 ` John Groves
2026-01-07 15:34 ` [PATCH V3 1/4] fuse_kernel.h: bring up to baseline 6.19 John Groves
` (3 more replies)
2026-01-07 15:34 ` [PATCH 0/2] ndctl: Add daxctl support for the new "famfs" mode of devdax John Groves
2 siblings, 4 replies; 74+ messages in thread
From: John Groves @ 2026-01-07 15:34 UTC (permalink / raw)
To: John Groves, Miklos Szeredi, Dan Williams, Bernd Schubert,
Alison Schofield
Cc: John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
Matthew Wilcox, Jan Kara, Alexander Viro, David Hildenbrand,
Christian Brauner, Darrick J . Wong, Randy Dunlap, Jeff Layton,
Amir Goldstein, Jonathan Cameron, Stefan Hajnoczi, Joanne Koong,
Josef Bacik, Bagas Sanjaya, Chen Linxuan, James Morse, Fuad Tabba,
Sean Christopherson, Shivank Garg, Ackerley Tng, Gregory Price,
Aravind Ramesh, Ajay Joshi, venkataravis, linux-doc, linux-kernel,
nvdimm, linux-cxl, linux-fsdevel, John Groves
This short series adds the necessary support for famfs to libfuse.
This series is also a pull request at [1].
References
[1] - https://github.com/libfuse/libfuse/pull/1414
John Groves (4):
fuse_kernel.h: bring up to baseline 6.19
fuse_kernel.h: add famfs DAX fmap protocol definitions
fuse: add API to set kernel mount options
fuse: add famfs DAX fmap support
include/fuse_common.h | 5 +++
include/fuse_kernel.h | 98 ++++++++++++++++++++++++++++++++++++++++-
include/fuse_lowlevel.h | 47 ++++++++++++++++++++
lib/fuse_i.h | 1 +
lib/fuse_lowlevel.c | 36 ++++++++++++++-
lib/fuse_versionscript | 1 +
lib/mount.c | 8 ++++
7 files changed, 194 insertions(+), 2 deletions(-)
base-commit: 6278995cca991978abd25ebb2c20ebd3fc9e8a13
--
2.49.0
^ permalink raw reply [flat|nested] 74+ messages in thread
* [PATCH V3 1/4] fuse_kernel.h: bring up to baseline 6.19
2026-01-07 15:34 ` [PATCH V3 0/4] libfuse: add basic famfs support to libfuse John Groves
@ 2026-01-07 15:34 ` John Groves
2026-01-07 15:34 ` [PATCH V3 2/4] fuse_kernel.h: add famfs DAX fmap protocol definitions John Groves
` (2 subsequent siblings)
3 siblings, 0 replies; 74+ messages in thread
From: John Groves @ 2026-01-07 15:34 UTC (permalink / raw)
To: John Groves, Miklos Szeredi, Dan Williams, Bernd Schubert,
Alison Schofield
Cc: John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
Matthew Wilcox, Jan Kara, Alexander Viro, David Hildenbrand,
Christian Brauner, Darrick J . Wong, Randy Dunlap, Jeff Layton,
Amir Goldstein, Jonathan Cameron, Stefan Hajnoczi, Joanne Koong,
Josef Bacik, Bagas Sanjaya, Chen Linxuan, James Morse, Fuad Tabba,
Sean Christopherson, Shivank Garg, Ackerley Tng, Gregory Price,
Aravind Ramesh, Ajay Joshi, venkataravis, linux-doc, linux-kernel,
nvdimm, linux-cxl, linux-fsdevel, John Groves
This is copied from include/uapi/linux/fuse.h in 6.19 with no changes.
Signed-off-by: John Groves <john@groves.net>
---
include/fuse_kernel.h | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/include/fuse_kernel.h b/include/fuse_kernel.h
index 94621f6..c13e1f9 100644
--- a/include/fuse_kernel.h
+++ b/include/fuse_kernel.h
@@ -239,6 +239,7 @@
* 7.45
* - add FUSE_COPY_FILE_RANGE_64
* - add struct fuse_copy_file_range_out
+ * - add FUSE_NOTIFY_PRUNE
*/
#ifndef _LINUX_FUSE_H
@@ -680,7 +681,7 @@ enum fuse_notify_code {
FUSE_NOTIFY_DELETE = 6,
FUSE_NOTIFY_RESEND = 7,
FUSE_NOTIFY_INC_EPOCH = 8,
- FUSE_NOTIFY_CODE_MAX,
+ FUSE_NOTIFY_PRUNE = 9,
};
/* The read buffer is required to be at least 8k, but may be much larger */
@@ -1119,6 +1120,12 @@ struct fuse_notify_retrieve_in {
uint64_t dummy4;
};
+struct fuse_notify_prune_out {
+ uint32_t count;
+ uint32_t padding;
+ uint64_t spare;
+};
+
struct fuse_backing_map {
int32_t fd;
uint32_t flags;
@@ -1131,6 +1138,7 @@ struct fuse_backing_map {
#define FUSE_DEV_IOC_BACKING_OPEN _IOW(FUSE_DEV_IOC_MAGIC, 1, \
struct fuse_backing_map)
#define FUSE_DEV_IOC_BACKING_CLOSE _IOW(FUSE_DEV_IOC_MAGIC, 2, uint32_t)
+#define FUSE_DEV_IOC_SYNC_INIT _IO(FUSE_DEV_IOC_MAGIC, 3)
struct fuse_lseek_in {
uint64_t fh;
--
2.49.0
* [PATCH V3 2/4] fuse_kernel.h: add famfs DAX fmap protocol definitions
2026-01-07 15:34 ` [PATCH V3 0/4] libfuse: add basic famfs support to libfuse John Groves
2026-01-07 15:34 ` [PATCH V3 1/4] fuse_kernel.h: bring up to baseline 6.19 John Groves
@ 2026-01-07 15:34 ` John Groves
2026-01-07 15:34 ` [PATCH V3 3/4] fuse: add API to set kernel mount options John Groves
2026-01-07 15:34 ` [PATCH V3 4/4] fuse: add famfs DAX fmap support John Groves
3 siblings, 0 replies; 74+ messages in thread
From: John Groves @ 2026-01-07 15:34 UTC (permalink / raw)
To: John Groves, Miklos Szeredi, Dan Williams, Bernd Schubert,
Alison Schofield
Cc: John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
Matthew Wilcox, Jan Kara, Alexander Viro, David Hildenbrand,
Christian Brauner, Darrick J . Wong, Randy Dunlap, Jeff Layton,
Amir Goldstein, Jonathan Cameron, Stefan Hajnoczi, Joanne Koong,
Josef Bacik, Bagas Sanjaya, Chen Linxuan, James Morse, Fuad Tabba,
Sean Christopherson, Shivank Garg, Ackerley Tng, Gregory Price,
Aravind Ramesh, Ajay Joshi, venkataravis, linux-doc, linux-kernel,
nvdimm, linux-cxl, linux-fsdevel, John Groves
Add FUSE protocol version 7.46 definitions for famfs DAX file mapping:
Capability flag:
- FUSE_DAX_FMAP (bit 43): kernel supports DAX fmap operations
New opcodes:
- FUSE_GET_FMAP (54): retrieve file extent map for DAX mapping
- FUSE_GET_DAXDEV (55): retrieve DAX device info by index
New structures for GET_FMAP reply:
- struct fuse_famfs_fmap_header: file map header with type and extent info
- struct fuse_famfs_simple_ext: simple extent (device, offset, length)
- struct fuse_famfs_iext: interleaved extent for striped allocations
New structures for GET_DAXDEV:
- struct fuse_get_daxdev_in: request DAX device by index
- struct fuse_daxdev_out: DAX device name response
Supporting definitions:
- enum fuse_famfs_file_type: regular, superblock, or log file
- enum famfs_ext_type: simple or interleaved extent type
Signed-off-by: John Groves <john@groves.net>
---
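Note (illustrative only, not part of this patch): the GET_FMAP reply
layout described above can be sketched in user space as a header followed
by one simple extent. The struct and enum definitions below are copied
from this patch so the sketch stands alone; build_simple_fmap() is a
hypothetical helper name, and setting nextents to 1 for a single simple
extent is an assumption about how a server would fill the header.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Definitions copied from the fuse_kernel.h additions in this patch. */
#define FAMFS_FMAP_VERSION 1

enum fuse_famfs_file_type {
	FUSE_FAMFS_FILE_REG,
	FUSE_FAMFS_FILE_SUPERBLOCK,
	FUSE_FAMFS_FILE_LOG,
};

enum famfs_ext_type {
	FUSE_FAMFS_EXT_SIMPLE = 0,
	FUSE_FAMFS_EXT_INTERLEAVE = 1,
};

struct fuse_famfs_simple_ext {
	uint32_t se_devindex;
	uint32_t reserved;
	uint64_t se_offset;
	uint64_t se_len;
};

struct fuse_famfs_fmap_header {
	uint8_t file_type;
	uint8_t reserved;
	uint16_t fmap_version;
	uint32_t ext_type;
	uint32_t nextents;
	uint32_t reserved0;
	uint64_t file_size;
	uint64_t reserved1;
};

/*
 * Hypothetical helper (not in the patch): serialize a header plus one
 * simple extent into buf. Returns the number of bytes written, or -1
 * if buf is NULL or too small.
 */
static long build_simple_fmap(void *buf, size_t buflen, uint64_t file_size,
			      uint32_t devindex, uint64_t offset, uint64_t len)
{
	struct fuse_famfs_fmap_header hdr;
	struct fuse_famfs_simple_ext ext;
	const size_t need = sizeof(hdr) + sizeof(ext);

	if (!buf || buflen < need)
		return -1;

	/* Zero both structs so every reserved field is 0 */
	memset(&hdr, 0, sizeof(hdr));
	memset(&ext, 0, sizeof(ext));

	hdr.file_type = FUSE_FAMFS_FILE_REG;
	hdr.fmap_version = FAMFS_FMAP_VERSION;
	hdr.ext_type = FUSE_FAMFS_EXT_SIMPLE;
	hdr.nextents = 1;		/* assumption: one simple extent */
	hdr.file_size = file_size;

	ext.se_devindex = devindex;
	ext.se_offset = offset;
	ext.se_len = len;

	/* The extent immediately follows the header in the message */
	memcpy(buf, &hdr, sizeof(hdr));
	memcpy((char *)buf + sizeof(hdr), &ext, sizeof(ext));
	return (long)need;
}
```

The byte count returned for this smallest case is the same quantity that
fmap_msg_min_size() in this patch computes.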
include/fuse_kernel.h | 88 +++++++++++++++++++++++++++++++++++++++++++
1 file changed, 88 insertions(+)
diff --git a/include/fuse_kernel.h b/include/fuse_kernel.h
index c13e1f9..7fdfc30 100644
--- a/include/fuse_kernel.h
+++ b/include/fuse_kernel.h
@@ -240,6 +240,19 @@
* - add FUSE_COPY_FILE_RANGE_64
* - add struct fuse_copy_file_range_out
* - add FUSE_NOTIFY_PRUNE
+ *
+ * 7.46
+ * - Add FUSE_DAX_FMAP capability - ability to handle in-kernel fsdax maps
+ * - Add the following structures for the GET_FMAP message reply components:
+ * - struct fuse_famfs_simple_ext
+ * - struct fuse_famfs_iext
+ * - struct fuse_famfs_fmap_header
+ * - Add the following structs for the GET_DAXDEV message and reply
+ * - struct fuse_get_daxdev_in
+ * - struct fuse_daxdev_out
+ * - Add the following enumerated types
+ * - enum fuse_famfs_file_type
+ * - enum famfs_ext_type
*/
#ifndef _LINUX_FUSE_H
@@ -448,6 +461,7 @@ struct fuse_file_lock {
* FUSE_OVER_IO_URING: Indicate that client supports io-uring
* FUSE_REQUEST_TIMEOUT: kernel supports timing out requests.
* init_out.request_timeout contains the timeout (in secs)
+ * FUSE_DAX_FMAP: kernel supports dev_dax_iomap (aka famfs) fmaps
*/
#define FUSE_ASYNC_READ (1 << 0)
#define FUSE_POSIX_LOCKS (1 << 1)
@@ -495,6 +509,7 @@ struct fuse_file_lock {
#define FUSE_ALLOW_IDMAP (1ULL << 40)
#define FUSE_OVER_IO_URING (1ULL << 41)
#define FUSE_REQUEST_TIMEOUT (1ULL << 42)
+#define FUSE_DAX_FMAP (1ULL << 43)
/**
* CUSE INIT request/reply flags
@@ -664,6 +679,10 @@ enum fuse_opcode {
FUSE_STATX = 52,
FUSE_COPY_FILE_RANGE_64 = 53,
+ /* Famfs / devdax opcodes */
+ FUSE_GET_FMAP = 54,
+ FUSE_GET_DAXDEV = 55,
+
/* CUSE specific operations */
CUSE_INIT = 4096,
@@ -1308,4 +1327,73 @@ struct fuse_uring_cmd_req {
uint8_t padding[6];
};
+/* Famfs fmap message components */
+
+#define FAMFS_FMAP_VERSION 1
+
+#define FAMFS_FMAP_MAX 32768 /* Largest supported fmap message */
+#define FUSE_FAMFS_MAX_EXTENTS 32
+#define FUSE_FAMFS_MAX_STRIPS 32
+
+enum fuse_famfs_file_type {
+ FUSE_FAMFS_FILE_REG,
+ FUSE_FAMFS_FILE_SUPERBLOCK,
+ FUSE_FAMFS_FILE_LOG,
+};
+
+enum famfs_ext_type {
+ FUSE_FAMFS_EXT_SIMPLE = 0,
+ FUSE_FAMFS_EXT_INTERLEAVE = 1,
+};
+
+struct fuse_famfs_simple_ext {
+ uint32_t se_devindex;
+ uint32_t reserved;
+ uint64_t se_offset;
+ uint64_t se_len;
+};
+
+struct fuse_famfs_iext { /* Interleaved extent */
+ uint32_t ie_nstrips;
+ uint32_t ie_chunk_size;
+ uint64_t ie_nbytes; /* Total bytes for this interleaved_ext;
+ * sum of strips may be more
+ */
+ uint64_t reserved;
+};
+
+struct fuse_famfs_fmap_header {
+ uint8_t file_type; /* enum fuse_famfs_file_type */
+ uint8_t reserved;
+ uint16_t fmap_version;
+ uint32_t ext_type; /* enum famfs_ext_type */
+ uint32_t nextents;
+ uint32_t reserved0;
+ uint64_t file_size;
+ uint64_t reserved1;
+};
+
+struct fuse_get_daxdev_in {
+ uint32_t daxdev_num;
+};
+
+#define DAXDEV_NAME_MAX 256
+
+/* fuse_daxdev_out has enough space for a uuid if we need it */
+struct fuse_daxdev_out {
+ uint16_t index;
+ uint16_t reserved;
+ uint32_t reserved2;
+ uint64_t reserved3;
+ uint64_t reserved4;
+ char name[DAXDEV_NAME_MAX];
+};
+
+static __inline__ int32_t fmap_msg_min_size(void)
+{
+ /* Smallest fmap message is a header plus one simple extent */
+ return (sizeof(struct fuse_famfs_fmap_header)
+ + sizeof(struct fuse_famfs_simple_ext));
+}
+
#endif /* _LINUX_FUSE_H */
--
2.49.0
* [PATCH V3 3/4] fuse: add API to set kernel mount options
2026-01-07 15:34 ` [PATCH V3 0/4] libfuse: add basic famfs support to libfuse John Groves
2026-01-07 15:34 ` [PATCH V3 1/4] fuse_kernel.h: bring up to baseline 6.19 John Groves
2026-01-07 15:34 ` [PATCH V3 2/4] fuse_kernel.h: add famfs DAX fmap protocol definitions John Groves
@ 2026-01-07 15:34 ` John Groves
2026-01-07 15:34 ` [PATCH V3 4/4] fuse: add famfs DAX fmap support John Groves
3 siblings, 0 replies; 74+ messages in thread
From: John Groves @ 2026-01-07 15:34 UTC (permalink / raw)
To: John Groves, Miklos Szeredi, Dan Williams, Bernd Schubert,
Alison Schofield
Cc: John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
Matthew Wilcox, Jan Kara, Alexander Viro, David Hildenbrand,
Christian Brauner, Darrick J . Wong, Randy Dunlap, Jeff Layton,
Amir Goldstein, Jonathan Cameron, Stefan Hajnoczi, Joanne Koong,
Josef Bacik, Bagas Sanjaya, Chen Linxuan, James Morse, Fuad Tabba,
Sean Christopherson, Shivank Garg, Ackerley Tng, Gregory Price,
Aravind Ramesh, Ajay Joshi, venkataravis, linux-doc, linux-kernel,
nvdimm, linux-cxl, linux-fsdevel, John Groves
Add fuse_add_kernel_mount_opt() to allow libfuse callers to pass
additional mount options directly to the kernel. This enables
filesystem-specific kernel mount options that aren't exposed through
the standard libfuse mount option parsing.
For example, famfs uses this to set the "shadow=" mount option
for shadow file system mounts.
API addition:
int fuse_add_kernel_mount_opt(struct fuse_session *se, const char *mount_opt)
Signed-off-by: John Groves <john@groves.net>
---
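Note (illustrative only, not part of this patch): the new API appends one
option to the session's comma-separated kernel option string. The sketch
below is a self-contained model of that accumulation, not libfuse's
implementation; add_opt_model() is a hypothetical stand-in for the
fuse_opt_add_opt() call used in lib/mount.c, and the "shadow=" path is
made up.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical model (not libfuse code): append opt to a heap-allocated,
 * comma-separated option list, mirroring the NULL checks that
 * __fuse_add_kernel_mount_opt() performs before delegating.
 * Returns 0 on success, -1 on failure.
 */
static int add_opt_model(char **opts, const char *opt)
{
	size_t oldlen, optlen;
	char *n;

	if (!opts || !opt)
		return -1;

	oldlen = *opts ? strlen(*opts) : 0;
	optlen = strlen(opt);

	/* room for: existing list + ',' + new option + NUL */
	n = realloc(*opts, oldlen + 1 + optlen + 1);
	if (!n)
		return -1;

	if (oldlen)
		n[oldlen++] = ',';	/* separator replaces the old NUL */
	memcpy(n + oldlen, opt, optlen + 1);

	*opts = n;
	return 0;
}
```

In a real server the equivalent step would be a call to
fuse_add_kernel_mount_opt(se, "shadow=...") made before
fuse_session_mount(), since the options are consumed at mount time.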
include/fuse_lowlevel.h | 10 ++++++++++
lib/fuse_i.h | 1 +
lib/fuse_lowlevel.c | 5 +++++
lib/fuse_versionscript | 1 +
lib/mount.c | 8 ++++++++
5 files changed, 25 insertions(+)
diff --git a/include/fuse_lowlevel.h b/include/fuse_lowlevel.h
index 016f831..d2bbcca 100644
--- a/include/fuse_lowlevel.h
+++ b/include/fuse_lowlevel.h
@@ -2195,6 +2195,16 @@ static inline int fuse_session_custom_io(struct fuse_session *se,
}
#endif
+/**
+ * Allow a libfuse caller to directly add kernel mount opts
+ *
+ * @param se session object
+ * @param mount_opt the option to add
+ *
+ * @return 0 on success, -1 on failure
+ */
+int fuse_add_kernel_mount_opt(struct fuse_session *se, const char *mount_opt);
+
/**
* Mount a FUSE file system.
*
diff --git a/lib/fuse_i.h b/lib/fuse_i.h
index 65d2f68..41285d2 100644
--- a/lib/fuse_i.h
+++ b/lib/fuse_i.h
@@ -220,6 +220,7 @@ void destroy_mount_opts(struct mount_opts *mo);
void fuse_mount_version(void);
unsigned get_max_read(struct mount_opts *o);
void fuse_kern_unmount(const char *mountpoint, int fd);
+int __fuse_add_kernel_mount_opt(struct fuse_session *se, const char *mount_opt);
int fuse_kern_mount(const char *mountpoint, struct mount_opts *mo);
int fuse_send_reply_iov_nofree(fuse_req_t req, int error, struct iovec *iov,
diff --git a/lib/fuse_lowlevel.c b/lib/fuse_lowlevel.c
index 0cde3d4..413e7c3 100644
--- a/lib/fuse_lowlevel.c
+++ b/lib/fuse_lowlevel.c
@@ -4349,6 +4349,11 @@ int fuse_session_custom_io_30(struct fuse_session *se,
offsetof(struct fuse_custom_io, clone_fd), fd);
}
+int fuse_add_kernel_mount_opt(struct fuse_session *se, const char *mount_opt)
+{
+ return __fuse_add_kernel_mount_opt(se, mount_opt);
+}
+
int fuse_session_mount(struct fuse_session *se, const char *_mountpoint)
{
int fd;
diff --git a/lib/fuse_versionscript b/lib/fuse_versionscript
index f9562b6..536569a 100644
--- a/lib/fuse_versionscript
+++ b/lib/fuse_versionscript
@@ -220,6 +220,7 @@ FUSE_3.18 {
fuse_reply_statx;
fuse_fs_statx;
+ fuse_add_kernel_mount_opt;
} FUSE_3.17;
FUSE_3.19 {
diff --git a/lib/mount.c b/lib/mount.c
index 7a856c1..e6c2305 100644
--- a/lib/mount.c
+++ b/lib/mount.c
@@ -674,6 +674,14 @@ void destroy_mount_opts(struct mount_opts *mo)
free(mo);
}
+int __fuse_add_kernel_mount_opt(struct fuse_session *se, const char *mount_opt)
+{
+ if (!se->mo)
+ return -1;
+ if (!mount_opt)
+ return -1;
+ return fuse_opt_add_opt(&se->mo->kernel_opts, mount_opt);
+}
int fuse_kern_mount(const char *mountpoint, struct mount_opts *mo)
{
--
2.49.0
* [PATCH V3 4/4] fuse: add famfs DAX fmap support
2026-01-07 15:34 ` [PATCH V3 0/4] libfuse: add basic famfs support to libfuse John Groves
` (2 preceding siblings ...)
2026-01-07 15:34 ` [PATCH V3 3/4] fuse: add API to set kernel mount options John Groves
@ 2026-01-07 15:34 ` John Groves
2026-01-08 15:31 ` Jonathan Cameron
3 siblings, 1 reply; 74+ messages in thread
From: John Groves @ 2026-01-07 15:34 UTC (permalink / raw)
To: John Groves, Miklos Szeredi, Dan Williams, Bernd Schubert,
Alison Schofield
Cc: John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
Matthew Wilcox, Jan Kara, Alexander Viro, David Hildenbrand,
Christian Brauner, Darrick J . Wong, Randy Dunlap, Jeff Layton,
Amir Goldstein, Jonathan Cameron, Stefan Hajnoczi, Joanne Koong,
Josef Bacik, Bagas Sanjaya, Chen Linxuan, James Morse, Fuad Tabba,
Sean Christopherson, Shivank Garg, Ackerley Tng, Gregory Price,
Aravind Ramesh, Ajay Joshi, venkataravis, linux-doc, linux-kernel,
nvdimm, linux-cxl, linux-fsdevel, John Groves
Add new FUSE operations and capability for famfs DAX file mapping:
- FUSE_CAP_DAX_FMAP: New capability flag at bit 32 (using want_ext/capable_ext
fields) to indicate kernel and userspace support for DAX fmaps
- GET_FMAP: New operation to retrieve a file map for DAX-mapped files.
Returns a fuse_famfs_fmap_header followed by simple or interleaved
extent descriptors. The kernel passes the file size as an argument.
- GET_DAXDEV: New operation to retrieve DAX device info by index.
Called when GET_FMAP returns an fmap referencing a previously
unknown DAX device.
These operations enable FUSE filesystems to provide direct access
mappings to persistent memory, allowing the kernel to map files
directly to DAX devices without page cache intermediation.
Signed-off-by: John Groves <john@groves.net>
---
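Note (illustrative only, not part of this patch): the capability handshake
described above can be modeled as three steps: the kernel advertises
FUSE_DAX_FMAP in INIT, the server requests FUSE_CAP_DAX_FMAP, and libfuse
echoes FUSE_DAX_FMAP in the INIT reply. The bit positions are taken from
this series; struct conn_model and the three helper functions are
hypothetical names that only mirror the capable_ext/want_ext logic in
_do_init(). The capability is written with 1ULL here so the shift is
well-defined where long is 32 bits.

```c
#include <stdbool.h>
#include <stdint.h>

#define FUSE_DAX_FMAP		(1ULL << 43)	/* kernel INIT flag */
#define FUSE_CAP_DAX_FMAP	(1ULL << 32)	/* libfuse capability bit */

/* Hypothetical stand-in for the two fuse_conn_info fields involved */
struct conn_model {
	uint64_t capable_ext;	/* what the kernel offers */
	uint64_t want_ext;	/* what the server asks for */
};

/* Step 1: translate the kernel's INIT flags into capabilities */
static void init_in(struct conn_model *c, uint64_t inargflags)
{
	if (inargflags & FUSE_DAX_FMAP)
		c->capable_ext |= FUSE_CAP_DAX_FMAP;
}

/* Step 2: the server opts in, but only if the kernel offered it */
static bool request_dax_fmap(struct conn_model *c)
{
	if (!(c->capable_ext & FUSE_CAP_DAX_FMAP))
		return false;
	c->want_ext |= FUSE_CAP_DAX_FMAP;
	return true;
}

/* Step 3: translate the wanted capabilities into INIT reply flags */
static uint64_t init_out_flags(const struct conn_model *c)
{
	uint64_t outargflags = 0;

	if (c->want_ext & FUSE_CAP_DAX_FMAP)
		outargflags |= FUSE_DAX_FMAP;
	return outargflags;
}
```

With a kernel that does not set FUSE_DAX_FMAP, request_dax_fmap() returns
false and the reply carries no DAX fmap flag, so GET_FMAP and GET_DAXDEV
would never be issued.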
include/fuse_common.h | 5 +++++
include/fuse_lowlevel.h | 37 +++++++++++++++++++++++++++++++++++++
lib/fuse_lowlevel.c | 31 ++++++++++++++++++++++++++++++-
3 files changed, 72 insertions(+), 1 deletion(-)
diff --git a/include/fuse_common.h b/include/fuse_common.h
index 041188e..e428ddb 100644
--- a/include/fuse_common.h
+++ b/include/fuse_common.h
@@ -512,6 +512,11 @@ struct fuse_loop_config_v1 {
*/
#define FUSE_CAP_OVER_IO_URING (1UL << 31)
+/**
+ * handle files that use famfs dax fmaps
+ */
+#define FUSE_CAP_DAX_FMAP (1ULL << 32)
+
/**
* Ioctl flags
*
diff --git a/include/fuse_lowlevel.h b/include/fuse_lowlevel.h
index d2bbcca..55fcfd7 100644
--- a/include/fuse_lowlevel.h
+++ b/include/fuse_lowlevel.h
@@ -1341,6 +1341,43 @@ struct fuse_lowlevel_ops {
*/
void (*statx)(fuse_req_t req, fuse_ino_t ino, int flags, int mask,
struct fuse_file_info *fi);
+
+ /**
+ * Get a famfs/devdax/fsdax fmap
+ *
+ * Retrieve a file map (aka fmap) for a previously looked-up file.
+ * The fmap is serialized into the buffer, anchored by
+ * struct fuse_famfs_fmap_header, followed by one or more
+ * structs fuse_famfs_simple_ext, or fuse_famfs_iext (which itself
+ * is followed by one or more fuse_famfs_simple_ext).
+ *
+ * Valid replies:
+ * fuse_reply_buf (TODO: variable-size reply)
+ * fuse_reply_err
+ *
+ * @param req request handle
+ * @param ino the inode number
+ */
+ void (*get_fmap) (fuse_req_t req, fuse_ino_t ino, size_t size);
+
+ /**
+ * Get a daxdev by index
+ *
+ * Retrieve info on a daxdev by index. This will be called any time
+ * GET_FMAP has returned a file map that references a previously
+ * unused daxdev. struct fuse_famfs_simple_ext, which is used for all
+ * resolutions to daxdev offsets, references daxdevs by index.
+ * In user space we maintain a master list of all referenced daxdevs
+ * by index, which is queried by get_daxdev.
+ *
+ * Valid replies:
+ * fuse_reply_buf
+ * fuse_reply_err
+ *
+ * @param req request handle
+ * @param daxdev_index the index of the daxdev
+ */
+ void (*get_daxdev) (fuse_req_t req, int daxdev_index);
};
/**
diff --git a/lib/fuse_lowlevel.c b/lib/fuse_lowlevel.c
index 413e7c3..c3adfa2 100644
--- a/lib/fuse_lowlevel.c
+++ b/lib/fuse_lowlevel.c
@@ -2769,7 +2769,8 @@ _do_init(fuse_req_t req, const fuse_ino_t nodeid, const void *op_in,
se->conn.capable_ext |= FUSE_CAP_NO_EXPORT_SUPPORT;
if (inargflags & FUSE_OVER_IO_URING)
se->conn.capable_ext |= FUSE_CAP_OVER_IO_URING;
-
+ if (inargflags & FUSE_DAX_FMAP)
+ se->conn.capable_ext |= FUSE_CAP_DAX_FMAP;
} else {
se->conn.max_readahead = 0;
}
@@ -2932,6 +2933,8 @@ _do_init(fuse_req_t req, const fuse_ino_t nodeid, const void *op_in,
outargflags |= FUSE_REQUEST_TIMEOUT;
outarg.request_timeout = se->conn.request_timeout;
}
+ if (se->conn.want_ext & FUSE_CAP_DAX_FMAP)
+ outargflags |= FUSE_DAX_FMAP;
outarg.max_readahead = se->conn.max_readahead;
outarg.max_write = se->conn.max_write;
@@ -3035,6 +3038,30 @@ static void do_destroy(fuse_req_t req, fuse_ino_t nodeid, const void *inarg)
_do_destroy(req, nodeid, inarg, NULL);
}
+static void
+do_get_fmap(fuse_req_t req, fuse_ino_t nodeid, const void *inarg)
+{
+ struct fuse_session *se = req->se;
+ struct fuse_getxattr_in *arg = (struct fuse_getxattr_in *) inarg;
+
+ if (se->op.get_fmap)
+ se->op.get_fmap(req, nodeid, arg->size);
+ else
+ fuse_reply_err(req, EOPNOTSUPP);
+}
+
+static void
+do_get_daxdev(fuse_req_t req, fuse_ino_t nodeid, const void *inarg)
+{
+ struct fuse_session *se = req->se;
+ (void)inarg;
+
+ if (se->op.get_daxdev)
+ se->op.get_daxdev(req, nodeid); /* Use nodeid as daxdev_index */
+ else
+ fuse_reply_err(req, EOPNOTSUPP);
+}
+
static void list_del_nreq(struct fuse_notify_req *nreq)
{
struct fuse_notify_req *prev = nreq->prev;
@@ -3470,6 +3497,8 @@ static struct {
[FUSE_LSEEK] = { do_lseek, "LSEEK" },
[FUSE_STATX] = { do_statx, "STATX" },
[CUSE_INIT] = { cuse_lowlevel_init, "CUSE_INIT" },
+ [FUSE_GET_FMAP] = { do_get_fmap, "GET_FMAP" },
+ [FUSE_GET_DAXDEV] = { do_get_daxdev, "GET_DAXDEV" },
};
static struct {
--
2.49.0
* [PATCH 0/2] ndctl: Add daxctl support for the new "famfs" mode of devdax
2026-01-07 15:32 [PATCH BUNDLE] famfs: Fabric-Attached Memory File System John Groves
2026-01-07 15:33 ` [PATCH V3 00/21] famfs: port into fuse John Groves
2026-01-07 15:34 ` [PATCH V3 0/4] libfuse: add basic famfs support to libfuse John Groves
@ 2026-01-07 15:34 ` John Groves
2026-01-07 15:34 ` [PATCH 1/2] daxctl: Add support for famfs mode John Groves
2026-01-07 15:34 ` [PATCH 2/2] Add test/daxctl-famfs.sh to test famfs mode transitions: John Groves
2 siblings, 2 replies; 74+ messages in thread
From: John Groves @ 2026-01-07 15:34 UTC (permalink / raw)
To: John Groves, Miklos Szeredi, Dan Williams, Bernd Schubert,
Alison Schofield
Cc: John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
Matthew Wilcox, Jan Kara, Alexander Viro, David Hildenbrand,
Christian Brauner, Darrick J . Wong, Randy Dunlap, Jeff Layton,
Amir Goldstein, Jonathan Cameron, Stefan Hajnoczi, Joanne Koong,
Josef Bacik, Bagas Sanjaya, Chen Linxuan, James Morse, Fuad Tabba,
Sean Christopherson, Shivank Garg, Ackerley Tng, Gregory Price,
Aravind Ramesh, Ajay Joshi, venkataravis, linux-doc, linux-kernel,
nvdimm, linux-cxl, linux-fsdevel, John Groves
This short series adds support and tests to daxctl for famfs[1]. The
famfs kernel patch series, under the same "compound cover" as this
series, adds a new 'fsdev_dax' driver for devdax. When that driver
is bound (instead of device_dax), the device is in 'famfs' mode rather
than 'devdax' mode.
References
[1] - https://famfs.org
John Groves (2):
daxctl: Add support for famfs mode
Add test/daxctl-famfs.sh to test famfs mode transitions:
daxctl/device.c | 126 ++++++++++++++--
daxctl/json.c | 6 +-
daxctl/lib/libdaxctl-private.h | 2 +
daxctl/lib/libdaxctl.c | 77 ++++++++++
daxctl/lib/libdaxctl.sym | 7 +
daxctl/libdaxctl.h | 3 +
test/daxctl-famfs.sh | 253 +++++++++++++++++++++++++++++++++
test/meson.build | 2 +
8 files changed, 465 insertions(+), 11 deletions(-)
create mode 100755 test/daxctl-famfs.sh
base-commit: 4f7a1c63b3305c97013d3c46daa6c0f76feff10d
--
2.49.0
* [PATCH 1/2] daxctl: Add support for famfs mode
2026-01-07 15:34 ` [PATCH 0/2] ndctl: Add daxctl support for the new "famfs" mode of devdax John Groves
@ 2026-01-07 15:34 ` John Groves
2026-01-07 15:34 ` [PATCH 2/2] Add test/daxctl-famfs.sh to test famfs mode transitions: John Groves
1 sibling, 0 replies; 74+ messages in thread
From: John Groves @ 2026-01-07 15:34 UTC (permalink / raw)
To: John Groves, Miklos Szeredi, Dan Williams, Bernd Schubert,
Alison Schofield
Cc: John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
Matthew Wilcox, Jan Kara, Alexander Viro, David Hildenbrand,
Christian Brauner, Darrick J . Wong, Randy Dunlap, Jeff Layton,
Amir Goldstein, Jonathan Cameron, Stefan Hajnoczi, Joanne Koong,
Josef Bacik, Bagas Sanjaya, Chen Linxuan, James Morse, Fuad Tabba,
Sean Christopherson, Shivank Garg, Ackerley Tng, Gregory Price,
Aravind Ramesh, Ajay Joshi, venkataravis, linux-doc, linux-kernel,
nvdimm, linux-cxl, linux-fsdevel, John Groves
From: John Groves <John@Groves.net>
Putting a daxdev in famfs mode means binding it to fsdev_dax.ko
(drivers/dax/fsdev.c). Finding a daxdev bound to fsdev_dax means
it is in famfs mode.
With three modes (devdax, famfs, and system-ram), the previous logic that
assumed 'not in mode X means in mode Y' no longer holds.
Add explicit mode detection functions:
- daxctl_dev_is_famfs_mode(): check if bound to fsdev_dax driver
- daxctl_dev_is_devdax_mode(): check if bound to device_dax driver
Fix mode transition logic in device.c:
- disable_devdax_device(): verify device is actually in devdax mode
- disable_famfs_device(): verify device is actually in famfs mode
- All reconfig_mode_*() functions now explicitly check each mode
- Handle unknown mode with error instead of wrong assumption
Modify json.c to show 'unknown' if device is not in a recognized mode.
Signed-off-by: John Groves <john@groves.net>
---
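Note (illustrative only, not part of this patch): the mode detection
described above reduces to comparing the basename of the resolved driver
link against a known driver name. The sketch below shows that
classification on a plain string and does not touch sysfs. The driver
names come from the dax_modules[] table in this patch; enum mode_model and
mode_from_driver_path() are hypothetical names. Note that daxctl itself
detects system-ram mode through daxctl_dev_get_memory() rather than by
driver name, so the "kmem" branch here is an assumption for completeness.

```c
#include <string.h>

enum mode_model {
	MODE_UNKNOWN,
	MODE_DEVDAX,
	MODE_RAM,
	MODE_FAMFS,
};

/*
 * Hypothetical helper: classify a device by the last path component of
 * its already-resolved driver link. Anything unrecognized, including a
 * NULL path (no driver bound), is reported as unknown instead of being
 * assumed to be devdax.
 */
static enum mode_model mode_from_driver_path(const char *resolved)
{
	const char *base;

	if (!resolved)
		return MODE_UNKNOWN;

	base = strrchr(resolved, '/');
	base = base ? base + 1 : resolved;

	if (strcmp(base, "fsdev_dax") == 0)
		return MODE_FAMFS;
	if (strcmp(base, "device_dax") == 0)
		return MODE_DEVDAX;
	if (strcmp(base, "kmem") == 0)
		return MODE_RAM;
	return MODE_UNKNOWN;
}
```

Returning an explicit unknown value is the point of the change: callers
can report an error for an unrecognized driver instead of guessing.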
daxctl/device.c | 126 ++++++++++++++++++++++++++++++---
daxctl/json.c | 6 +-
daxctl/lib/libdaxctl-private.h | 2 +
daxctl/lib/libdaxctl.c | 77 ++++++++++++++++++++
daxctl/lib/libdaxctl.sym | 7 ++
daxctl/libdaxctl.h | 3 +
6 files changed, 210 insertions(+), 11 deletions(-)
diff --git a/daxctl/device.c b/daxctl/device.c
index e3993b1..14e1796 100644
--- a/daxctl/device.c
+++ b/daxctl/device.c
@@ -42,6 +42,7 @@ enum dev_mode {
DAXCTL_DEV_MODE_UNKNOWN,
DAXCTL_DEV_MODE_DEVDAX,
DAXCTL_DEV_MODE_RAM,
+ DAXCTL_DEV_MODE_FAMFS,
};
struct mapping {
@@ -471,6 +472,13 @@ static const char *parse_device_options(int argc, const char **argv,
"--no-online is incompatible with --mode=devdax\n");
rc = -EINVAL;
}
+ } else if (strcmp(param.mode, "famfs") == 0) {
+ reconfig_mode = DAXCTL_DEV_MODE_FAMFS;
+ if (param.no_online) {
+ fprintf(stderr,
+ "--no-online is incompatible with --mode=famfs\n");
+ rc = -EINVAL;
+ }
}
break;
case ACTION_CREATE:
@@ -696,8 +704,42 @@ static int disable_devdax_device(struct daxctl_dev *dev)
int rc;
if (mem) {
- fprintf(stderr, "%s was already in system-ram mode\n",
- devname);
+ fprintf(stderr, "%s is in system-ram mode\n", devname);
+ return 1;
+ }
+ if (daxctl_dev_is_famfs_mode(dev)) {
+ fprintf(stderr, "%s is in famfs mode\n", devname);
+ return 1;
+ }
+ if (!daxctl_dev_is_devdax_mode(dev)) {
+ fprintf(stderr, "%s is not in devdax mode\n", devname);
+ return 1;
+ }
+ rc = daxctl_dev_disable(dev);
+ if (rc) {
+ fprintf(stderr, "%s: disable failed: %s\n",
+ daxctl_dev_get_devname(dev), strerror(-rc));
+ return rc;
+ }
+ return 0;
+}
+
+static int disable_famfs_device(struct daxctl_dev *dev)
+{
+ struct daxctl_memory *mem = daxctl_dev_get_memory(dev);
+ const char *devname = daxctl_dev_get_devname(dev);
+ int rc;
+
+ if (mem) {
+ fprintf(stderr, "%s is in system-ram mode\n", devname);
+ return 1;
+ }
+ if (daxctl_dev_is_devdax_mode(dev)) {
+ fprintf(stderr, "%s is in devdax mode\n", devname);
+ return 1;
+ }
+ if (!daxctl_dev_is_famfs_mode(dev)) {
+ fprintf(stderr, "%s is not in famfs mode\n", devname);
return 1;
}
rc = daxctl_dev_disable(dev);
@@ -711,6 +753,7 @@ static int disable_devdax_device(struct daxctl_dev *dev)
static int reconfig_mode_system_ram(struct daxctl_dev *dev)
{
+ struct daxctl_memory *mem = daxctl_dev_get_memory(dev);
const char *devname = daxctl_dev_get_devname(dev);
int rc, skip_enable = 0;
@@ -724,11 +767,21 @@ static int reconfig_mode_system_ram(struct daxctl_dev *dev)
}
if (daxctl_dev_is_enabled(dev)) {
- rc = disable_devdax_device(dev);
- if (rc < 0)
- return rc;
- if (rc > 0)
+ if (mem) {
+ /* already in system-ram mode */
skip_enable = 1;
+ } else if (daxctl_dev_is_famfs_mode(dev)) {
+ rc = disable_famfs_device(dev);
+ if (rc)
+ return rc;
+ } else if (daxctl_dev_is_devdax_mode(dev)) {
+ rc = disable_devdax_device(dev);
+ if (rc)
+ return rc;
+ } else {
+ fprintf(stderr, "%s: unknown mode\n", devname);
+ return -EINVAL;
+ }
}
if (!skip_enable) {
@@ -750,7 +803,7 @@ static int disable_system_ram_device(struct daxctl_dev *dev)
int rc;
if (!mem) {
- fprintf(stderr, "%s was already in devdax mode\n", devname);
+ fprintf(stderr, "%s is not in system-ram mode\n", devname);
return 1;
}
@@ -786,12 +839,28 @@ static int disable_system_ram_device(struct daxctl_dev *dev)
static int reconfig_mode_devdax(struct daxctl_dev *dev)
{
+ struct daxctl_memory *mem = daxctl_dev_get_memory(dev);
+ const char *devname = daxctl_dev_get_devname(dev);
int rc;
if (daxctl_dev_is_enabled(dev)) {
- rc = disable_system_ram_device(dev);
- if (rc)
- return rc;
+ if (mem) {
+ rc = disable_system_ram_device(dev);
+ if (rc)
+ return rc;
+ } else if (daxctl_dev_is_famfs_mode(dev)) {
+ rc = disable_famfs_device(dev);
+ if (rc)
+ return rc;
+ } else if (daxctl_dev_is_devdax_mode(dev)) {
+ /* already in devdax mode, just re-enable */
+ rc = daxctl_dev_disable(dev);
+ if (rc)
+ return rc;
+ } else {
+ fprintf(stderr, "%s: unknown mode\n", devname);
+ return -EINVAL;
+ }
}
rc = daxctl_dev_enable_devdax(dev);
@@ -801,6 +870,40 @@ static int reconfig_mode_devdax(struct daxctl_dev *dev)
return 0;
}
+static int reconfig_mode_famfs(struct daxctl_dev *dev)
+{
+ struct daxctl_memory *mem = daxctl_dev_get_memory(dev);
+ const char *devname = daxctl_dev_get_devname(dev);
+ int rc;
+
+ if (daxctl_dev_is_enabled(dev)) {
+ if (mem) {
+ fprintf(stderr,
+ "%s is in system-ram mode, must be in devdax mode to convert to famfs\n",
+ devname);
+ return -EINVAL;
+ } else if (daxctl_dev_is_famfs_mode(dev)) {
+ /* already in famfs mode, just re-enable */
+ rc = daxctl_dev_disable(dev);
+ if (rc)
+ return rc;
+ } else if (daxctl_dev_is_devdax_mode(dev)) {
+ rc = disable_devdax_device(dev);
+ if (rc)
+ return rc;
+ } else {
+ fprintf(stderr, "%s: unknown mode\n", devname);
+ return -EINVAL;
+ }
+ }
+
+ rc = daxctl_dev_enable_famfs(dev);
+ if (rc)
+ return rc;
+
+ return 0;
+}
+
static int do_create(struct daxctl_region *region, long long val,
struct json_object **jdevs)
{
@@ -887,6 +990,9 @@ static int do_reconfig(struct daxctl_dev *dev, enum dev_mode mode,
case DAXCTL_DEV_MODE_DEVDAX:
rc = reconfig_mode_devdax(dev);
break;
+ case DAXCTL_DEV_MODE_FAMFS:
+ rc = reconfig_mode_famfs(dev);
+ break;
default:
fprintf(stderr, "%s: unknown mode requested: %d\n",
devname, mode);
diff --git a/daxctl/json.c b/daxctl/json.c
index 3cbce9d..01f139b 100644
--- a/daxctl/json.c
+++ b/daxctl/json.c
@@ -48,8 +48,12 @@ struct json_object *util_daxctl_dev_to_json(struct daxctl_dev *dev,
if (mem)
jobj = json_object_new_string("system-ram");
- else
+ else if (daxctl_dev_is_famfs_mode(dev))
+ jobj = json_object_new_string("famfs");
+ else if (daxctl_dev_is_devdax_mode(dev))
jobj = json_object_new_string("devdax");
+ else
+ jobj = json_object_new_string("unknown");
if (jobj)
json_object_object_add(jdev, "mode", jobj);
diff --git a/daxctl/lib/libdaxctl-private.h b/daxctl/lib/libdaxctl-private.h
index ae45311..0bb73e8 100644
--- a/daxctl/lib/libdaxctl-private.h
+++ b/daxctl/lib/libdaxctl-private.h
@@ -21,12 +21,14 @@ static const char *dax_subsystems[] = {
enum daxctl_dev_mode {
DAXCTL_DEV_MODE_DEVDAX = 0,
DAXCTL_DEV_MODE_RAM,
+ DAXCTL_DEV_MODE_FAMFS,
DAXCTL_DEV_MODE_END,
};
static const char *dax_modules[] = {
[DAXCTL_DEV_MODE_DEVDAX] = "device_dax",
[DAXCTL_DEV_MODE_RAM] = "kmem",
+ [DAXCTL_DEV_MODE_FAMFS] = "fsdev_dax",
};
enum memory_op {
diff --git a/daxctl/lib/libdaxctl.c b/daxctl/lib/libdaxctl.c
index b7fa0de..0a6cbfe 100644
--- a/daxctl/lib/libdaxctl.c
+++ b/daxctl/lib/libdaxctl.c
@@ -418,6 +418,78 @@ DAXCTL_EXPORT int daxctl_dev_is_system_ram_capable(struct daxctl_dev *dev)
return false;
}
+/*
+ * Check if device is currently in famfs mode (bound to fsdev_dax driver)
+ */
+DAXCTL_EXPORT int daxctl_dev_is_famfs_mode(struct daxctl_dev *dev)
+{
+ const char *devname = daxctl_dev_get_devname(dev);
+ struct daxctl_ctx *ctx = daxctl_dev_get_ctx(dev);
+ char *mod_path, *mod_base;
+ char path[200];
+ const int len = sizeof(path);
+
+ if (!device_model_is_dax_bus(dev))
+ return false;
+
+ if (!daxctl_dev_is_enabled(dev))
+ return false;
+
+ if (snprintf(path, len, "%s/driver", dev->dev_path) >= len) {
+ err(ctx, "%s: buffer too small!\n", devname);
+ return false;
+ }
+
+ mod_path = realpath(path, NULL);
+ if (!mod_path)
+ return false;
+
+ mod_base = basename(mod_path);
+ if (strcmp(mod_base, dax_modules[DAXCTL_DEV_MODE_FAMFS]) == 0) {
+ free(mod_path);
+ return true;
+ }
+
+ free(mod_path);
+ return false;
+}
+
+/*
+ * Check if device is currently in devdax mode (bound to device_dax driver)
+ */
+DAXCTL_EXPORT int daxctl_dev_is_devdax_mode(struct daxctl_dev *dev)
+{
+ const char *devname = daxctl_dev_get_devname(dev);
+ struct daxctl_ctx *ctx = daxctl_dev_get_ctx(dev);
+ char *mod_path, *mod_base;
+ char path[200];
+ const int len = sizeof(path);
+
+ if (!device_model_is_dax_bus(dev))
+ return false;
+
+ if (!daxctl_dev_is_enabled(dev))
+ return false;
+
+ if (snprintf(path, len, "%s/driver", dev->dev_path) >= len) {
+ err(ctx, "%s: buffer too small!\n", devname);
+ return false;
+ }
+
+ mod_path = realpath(path, NULL);
+ if (!mod_path)
+ return false;
+
+ mod_base = basename(mod_path);
+ if (strcmp(mod_base, dax_modules[DAXCTL_DEV_MODE_DEVDAX]) == 0) {
+ free(mod_path);
+ return true;
+ }
+
+ free(mod_path);
+ return false;
+}
+
/*
* This checks for the device to be in system-ram mode, so calling
* daxctl_dev_get_memory() on a devdax mode device will always return NULL.
@@ -982,6 +1054,11 @@ DAXCTL_EXPORT int daxctl_dev_enable_ram(struct daxctl_dev *dev)
return daxctl_dev_enable(dev, DAXCTL_DEV_MODE_RAM);
}
+DAXCTL_EXPORT int daxctl_dev_enable_famfs(struct daxctl_dev *dev)
+{
+ return daxctl_dev_enable(dev, DAXCTL_DEV_MODE_FAMFS);
+}
+
DAXCTL_EXPORT int daxctl_dev_disable(struct daxctl_dev *dev)
{
const char *devname = daxctl_dev_get_devname(dev);
diff --git a/daxctl/lib/libdaxctl.sym b/daxctl/lib/libdaxctl.sym
index 3098811..2a812c6 100644
--- a/daxctl/lib/libdaxctl.sym
+++ b/daxctl/lib/libdaxctl.sym
@@ -104,3 +104,10 @@ LIBDAXCTL_10 {
global:
daxctl_dev_is_system_ram_capable;
} LIBDAXCTL_9;
+
+LIBDAXCTL_11 {
+global:
+ daxctl_dev_enable_famfs;
+ daxctl_dev_is_famfs_mode;
+ daxctl_dev_is_devdax_mode;
+} LIBDAXCTL_10;
diff --git a/daxctl/libdaxctl.h b/daxctl/libdaxctl.h
index 53c6bbd..84fcdb4 100644
--- a/daxctl/libdaxctl.h
+++ b/daxctl/libdaxctl.h
@@ -72,12 +72,15 @@ int daxctl_dev_is_enabled(struct daxctl_dev *dev);
int daxctl_dev_disable(struct daxctl_dev *dev);
int daxctl_dev_enable_devdax(struct daxctl_dev *dev);
int daxctl_dev_enable_ram(struct daxctl_dev *dev);
+int daxctl_dev_enable_famfs(struct daxctl_dev *dev);
int daxctl_dev_get_target_node(struct daxctl_dev *dev);
int daxctl_dev_will_auto_online_memory(struct daxctl_dev *dev);
int daxctl_dev_has_online_memory(struct daxctl_dev *dev);
struct daxctl_memory;
int daxctl_dev_is_system_ram_capable(struct daxctl_dev *dev);
+int daxctl_dev_is_famfs_mode(struct daxctl_dev *dev);
+int daxctl_dev_is_devdax_mode(struct daxctl_dev *dev);
struct daxctl_memory *daxctl_dev_get_memory(struct daxctl_dev *dev);
struct daxctl_dev *daxctl_memory_get_dev(struct daxctl_memory *mem);
const char *daxctl_memory_get_node_path(struct daxctl_memory *mem);
--
2.49.0
* [PATCH 2/2] Add test/daxctl-famfs.sh to test famfs mode transitions:
2026-01-07 15:34 ` [PATCH 0/2] ndctl: Add daxctl support for the new "famfs" mode of devdax John Groves
2026-01-07 15:34 ` [PATCH 1/2] daxctl: Add support for famfs mode John Groves
@ 2026-01-07 15:34 ` John Groves
1 sibling, 0 replies; 74+ messages in thread
From: John Groves @ 2026-01-07 15:34 UTC (permalink / raw)
To: John Groves, Miklos Szeredi, Dan Williams, Bernd Schubert,
Alison Schofield
Cc: John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
Matthew Wilcox, Jan Kara, Alexander Viro, David Hildenbrand,
Christian Brauner, Darrick J . Wong, Randy Dunlap, Jeff Layton,
Amir Goldstein, Jonathan Cameron, Stefan Hajnoczi, Joanne Koong,
Josef Bacik, Bagas Sanjaya, Chen Linxuan, James Morse, Fuad Tabba,
Sean Christopherson, Shivank Garg, Ackerley Tng, Gregory Price,
Aravind Ramesh, Ajay Joshi, venkataravis, linux-doc, linux-kernel,
nvdimm, linux-cxl, linux-fsdevel, John Groves
From: John Groves <John@Groves.net>
- devdax <-> famfs mode switches
- Verify famfs -> system-ram is rejected (must go via devdax)
- Test JSON output shows correct mode
- Test error handling for invalid modes
The test is added to the destructive test suite since it
modifies device modes.
Signed-off-by: John Groves <john@groves.net>
---
test/daxctl-famfs.sh | 253 +++++++++++++++++++++++++++++++++++++++++++
test/meson.build | 2 +
2 files changed, 255 insertions(+)
create mode 100755 test/daxctl-famfs.sh
diff --git a/test/daxctl-famfs.sh b/test/daxctl-famfs.sh
new file mode 100755
index 0000000..12fbfef
--- /dev/null
+++ b/test/daxctl-famfs.sh
@@ -0,0 +1,253 @@
+#!/bin/bash -Ex
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (C) 2025 Micron Technology, Inc. All rights reserved.
+#
+# Test daxctl famfs mode transitions and mode detection
+
+rc=77
+. $(dirname $0)/common
+
+trap 'cleanup $LINENO' ERR
+
+daxdev=""
+original_mode=""
+
+cleanup()
+{
+ printf "Error at line %d\n" "$1"
+ # Try to restore to original mode if we know it
+ if [[ $daxdev && $original_mode ]]; then
+ "$DAXCTL" reconfigure-device -f -m "$original_mode" "$daxdev" 2>/dev/null || true
+ fi
+ exit $rc
+}
+
+# Check if fsdev_dax module is available
+check_fsdev_dax()
+{
+ if modinfo fsdev_dax &>/dev/null; then
+ return 0
+ fi
+ if grep -qF "fsdev_dax" "/lib/modules/$(uname -r)/modules.builtin" 2>/dev/null; then
+ return 0
+ fi
+ printf "fsdev_dax module not available, skipping\n"
+ exit 77
+}
+
+# Check if kmem module is available (needed for system-ram mode tests)
+check_kmem()
+{
+ if modinfo kmem &>/dev/null; then
+ return 0
+ fi
+ if grep -qF "kmem" "/lib/modules/$(uname -r)/modules.builtin" 2>/dev/null; then
+ return 0
+ fi
+ printf "kmem module not available, skipping system-ram tests\n"
+ return 1
+}
+
+# Find an existing dax device to test with
+find_daxdev()
+{
+ # Look for any available dax device
+ daxdev=$("$DAXCTL" list | jq -er '.[0].chardev // empty' 2>/dev/null) || true
+
+ if [[ ! $daxdev ]]; then
+ printf "No dax device found, skipping\n"
+ exit 77
+ fi
+
+ # Save the original mode so we can restore it
+ original_mode=$("$DAXCTL" list -d "$daxdev" | jq -er '.[].mode')
+
+ printf "Found dax device: %s (current mode: %s)\n" "$daxdev" "$original_mode"
+}
+
+daxctl_get_mode()
+{
+ "$DAXCTL" list -d "$1" | jq -er '.[].mode'
+}
+
+# Ensure device is in devdax mode for testing
+ensure_devdax_mode()
+{
+ local mode
+ mode=$(daxctl_get_mode "$daxdev")
+
+ if [[ "$mode" == "devdax" ]]; then
+ return 0
+ fi
+
+ if [[ "$mode" == "system-ram" ]]; then
+ printf "Device is in system-ram mode, attempting to convert to devdax...\n"
+ "$DAXCTL" reconfigure-device -f -m devdax "$daxdev"
+ elif [[ "$mode" == "famfs" ]]; then
+ printf "Device is in famfs mode, converting to devdax...\n"
+ "$DAXCTL" reconfigure-device -m devdax "$daxdev"
+ else
+ printf "Device is in unknown mode: %s\n" "$mode"
+ return 1
+ fi
+
+ [[ $(daxctl_get_mode "$daxdev") == "devdax" ]]
+}
+
+#
+# Test basic mode transitions involving famfs
+#
+test_famfs_mode_transitions()
+{
+ printf "\n=== Testing famfs mode transitions ===\n"
+
+ # Ensure starting in devdax mode
+ ensure_devdax_mode
+ [[ $(daxctl_get_mode "$daxdev") == "devdax" ]]
+ printf "Initial mode: devdax - OK\n"
+
+ # Test: devdax -> famfs
+ printf "Testing devdax -> famfs... "
+ "$DAXCTL" reconfigure-device -m famfs "$daxdev"
+ [[ $(daxctl_get_mode "$daxdev") == "famfs" ]]
+ printf "OK\n"
+
+ # Test: famfs -> famfs (re-enable in same mode)
+ printf "Testing famfs -> famfs (re-enable)... "
+ "$DAXCTL" reconfigure-device -m famfs "$daxdev"
+ [[ $(daxctl_get_mode "$daxdev") == "famfs" ]]
+ printf "OK\n"
+
+ # Test: famfs -> devdax
+ printf "Testing famfs -> devdax... "
+ "$DAXCTL" reconfigure-device -m devdax "$daxdev"
+ [[ $(daxctl_get_mode "$daxdev") == "devdax" ]]
+ printf "OK\n"
+
+ # Test: devdax -> devdax (re-enable in same mode)
+ printf "Testing devdax -> devdax (re-enable)... "
+ "$DAXCTL" reconfigure-device -m devdax "$daxdev"
+ [[ $(daxctl_get_mode "$daxdev") == "devdax" ]]
+ printf "OK\n"
+}
+
+#
+# Test mode transitions with system-ram (requires kmem)
+#
+test_system_ram_transitions()
+{
+ printf "\n=== Testing system-ram transitions with famfs ===\n"
+
+ # Ensure we start in devdax mode
+ ensure_devdax_mode
+ [[ $(daxctl_get_mode "$daxdev") == "devdax" ]]
+
+ # Test: devdax -> system-ram
+ printf "Testing devdax -> system-ram... "
+ "$DAXCTL" reconfigure-device -N -m system-ram "$daxdev"
+ [[ $(daxctl_get_mode "$daxdev") == "system-ram" ]]
+ printf "OK\n"
+
+ # Test: system-ram -> famfs should fail
+ printf "Testing system-ram -> famfs (should fail)... "
+ if "$DAXCTL" reconfigure-device -m famfs "$daxdev" 2>/dev/null; then
+ printf "FAILED - should have been rejected\n"
+ return 1
+ fi
+ printf "OK (correctly rejected)\n"
+
+ # Test: system-ram -> devdax -> famfs (proper path)
+ printf "Testing system-ram -> devdax -> famfs... "
+ "$DAXCTL" reconfigure-device -f -m devdax "$daxdev"
+ [[ $(daxctl_get_mode "$daxdev") == "devdax" ]]
+ "$DAXCTL" reconfigure-device -m famfs "$daxdev"
+ [[ $(daxctl_get_mode "$daxdev") == "famfs" ]]
+ printf "OK\n"
+
+ # Restore to devdax for subsequent tests
+ "$DAXCTL" reconfigure-device -m devdax "$daxdev"
+}
+
+#
+# Test JSON output shows correct mode
+#
+test_json_output()
+{
+ printf "\n=== Testing JSON output for mode field ===\n"
+
+ # Test devdax mode in JSON
+ ensure_devdax_mode
+ printf "Testing JSON output for devdax mode... "
+ mode=$("$DAXCTL" list -d "$daxdev" | jq -er '.[].mode')
+ [[ "$mode" == "devdax" ]]
+ printf "OK\n"
+
+ # Test famfs mode in JSON
+ "$DAXCTL" reconfigure-device -m famfs "$daxdev"
+ printf "Testing JSON output for famfs mode... "
+ mode=$("$DAXCTL" list -d "$daxdev" | jq -er '.[].mode')
+ [[ "$mode" == "famfs" ]]
+ printf "OK\n"
+
+ # Restore to devdax
+ "$DAXCTL" reconfigure-device -m devdax "$daxdev"
+}
+
+#
+# Test error messages for invalid transitions
+#
+test_error_handling()
+{
+ printf "\n=== Testing error handling ===\n"
+
+ # Ensure we're in famfs mode
+ "$DAXCTL" reconfigure-device -m famfs "$daxdev"
+
+ # Test that invalid mode is rejected
+ printf "Testing invalid mode rejection... "
+ if "$DAXCTL" reconfigure-device -m invalidmode "$daxdev" 2>/dev/null; then
+ printf "FAILED - invalid mode should be rejected\n"
+ return 1
+ fi
+ printf "OK (correctly rejected)\n"
+
+ # Restore to devdax
+ "$DAXCTL" reconfigure-device -m devdax "$daxdev"
+}
+
+#
+# Main test sequence
+#
+main()
+{
+ check_fsdev_dax
+ find_daxdev
+
+ rc=1 # From here on, failures are real failures
+
+ test_famfs_mode_transitions
+ test_json_output
+ test_error_handling
+
+ # System-ram tests require kmem module
+ if check_kmem; then
+ # Save and disable online policy for system-ram tests
+ saved_policy="$(cat /sys/devices/system/memory/auto_online_blocks)"
+ echo "offline" > /sys/devices/system/memory/auto_online_blocks
+
+ test_system_ram_transitions
+
+ # Restore online policy
+ echo "$saved_policy" > /sys/devices/system/memory/auto_online_blocks
+ fi
+
+ # Restore original mode
+ printf "\nRestoring device to original mode: %s\n" "$original_mode"
+ "$DAXCTL" reconfigure-device -f -m "$original_mode" "$daxdev"
+
+ printf "\n=== All famfs tests passed ===\n"
+
+ exit 0
+}
+
+main
diff --git a/test/meson.build b/test/meson.build
index 615376e..ad1d393 100644
--- a/test/meson.build
+++ b/test/meson.build
@@ -209,6 +209,7 @@ if get_option('destructive').enabled()
device_dax_fio = find_program('device-dax-fio.sh')
daxctl_devices = find_program('daxctl-devices.sh')
daxctl_create = find_program('daxctl-create.sh')
+ daxctl_famfs = find_program('daxctl-famfs.sh')
dm = find_program('dm.sh')
mmap_test = find_program('mmap.sh')
@@ -226,6 +227,7 @@ if get_option('destructive').enabled()
[ 'device-dax-fio.sh', device_dax_fio, 'dax' ],
[ 'daxctl-devices.sh', daxctl_devices, 'dax' ],
[ 'daxctl-create.sh', daxctl_create, 'dax' ],
+ [ 'daxctl-famfs.sh', daxctl_famfs, 'dax' ],
[ 'dm.sh', dm, 'dax' ],
[ 'mmap.sh', mmap_test, 'dax' ],
]
--
2.49.0
* Re: [PATCH V3 15/21] famfs_fuse: Create files with famfs fmaps
2026-01-07 15:33 ` [PATCH V3 15/21] famfs_fuse: Create files with famfs fmaps John Groves
@ 2026-01-07 21:30 ` John Groves
2026-01-08 13:14 ` Jonathan Cameron
1 sibling, 0 replies; 74+ messages in thread
From: John Groves @ 2026-01-07 21:30 UTC (permalink / raw)
To: Miklos Szeredi, Dan Williams, Bernd Schubert, Alison Schofield
Cc: John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
Matthew Wilcox, Jan Kara, Alexander Viro, David Hildenbrand,
Christian Brauner, Darrick J . Wong, Randy Dunlap, Jeff Layton,
Amir Goldstein, Jonathan Cameron, Stefan Hajnoczi, Joanne Koong,
Josef Bacik, Bagas Sanjaya, Chen Linxuan, James Morse, Fuad Tabba,
Sean Christopherson, Shivank Garg, Ackerley Tng, Gregory Price,
Aravind Ramesh, Ajay Joshi, venkataravis, linux-doc, linux-kernel,
nvdimm, linux-cxl, linux-fsdevel
On 26/01/07 09:33AM, John Groves wrote:
> On completion of GET_FMAP message/response, setup the full famfs
> metadata such that it's possible to handle read/write/mmap directly to
> dax. Note that the devdax_iomap plumbing is not in yet...
>
> * Add famfs_kfmap.h: in-memory structures for resolving famfs file maps
> (fmaps) to dax.
> * famfs.c: allocate, initialize and free fmaps
> * inode.c: only allow famfs mode if the fuse server has CAP_SYS_RAWIO
> * Update MAINTAINERS for the new files.
>
> Signed-off-by: John Groves <john@groves.net>
> ---
> MAINTAINERS | 1 +
> fs/fuse/famfs.c | 355 +++++++++++++++++++++++++++++++++++++-
> fs/fuse/famfs_kfmap.h | 67 +++++++
> fs/fuse/fuse_i.h | 22 ++-
> fs/fuse/inode.c | 21 ++-
> include/uapi/linux/fuse.h | 56 ++++++
> 6 files changed, 510 insertions(+), 12 deletions(-)
> create mode 100644 fs/fuse/famfs_kfmap.h
>
[ ... ]
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index 9e121a1d63b7..391ead26bfa2 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -121,7 +121,7 @@ static struct inode *fuse_alloc_inode(struct super_block *sb)
> fuse_inode_backing_set(fi, NULL);
>
> if (IS_ENABLED(CONFIG_FUSE_FAMFS_DAX))
> - famfs_meta_set(fi, NULL);
> + famfs_meta_init(fi);
>
> return &fi->inode;
>
> @@ -1485,8 +1485,21 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
> timeout = arg->request_timeout;
>
> if (IS_ENABLED(CONFIG_FUSE_FAMFS_DAX) &&
> - flags & FUSE_DAX_FMAP)
> - fc->famfs_iomap = 1;
> + flags & FUSE_DAX_FMAP) {
> + /* famfs_iomap is only allowed if the fuse
> + * server has CAP_SYS_RAWIO. This was checked
> + * in fuse_send_init, and FUSE_DAX_IOMAP was
> + * set in in_flags if so. Only allow enablement
> + * if we find it there. This function is
> + * normally not running in fuse server context,
> + * so we can do the capability check here...
^^^
Oops: this should be "can't" - we can't do the capability check here since we're not in
fuse server context. Will fix before merge...
[ ... ]
John
* Re: [PATCH V3 01/21] dax: move dax_pgoff_to_phys from [drivers/dax/] device.c to bus.c
2026-01-07 15:33 ` [PATCH V3 01/21] dax: move dax_pgoff_to_phys from [drivers/dax/] device.c to bus.c John Groves
@ 2026-01-08 10:43 ` Jonathan Cameron
2026-01-08 13:25 ` John Groves
0 siblings, 1 reply; 74+ messages in thread
From: Jonathan Cameron @ 2026-01-08 10:43 UTC (permalink / raw)
To: John Groves
Cc: Miklos Szeredi, Dan Williams, Bernd Schubert, Alison Schofield,
John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
Matthew Wilcox, Jan Kara, Alexander Viro, David Hildenbrand,
Christian Brauner, Darrick J . Wong, Randy Dunlap, Jeff Layton,
Amir Goldstein, Stefan Hajnoczi, Joanne Koong, Josef Bacik,
Bagas Sanjaya, Chen Linxuan, James Morse, Fuad Tabba,
Sean Christopherson, Shivank Garg, Ackerley Tng, Gregory Price,
Aravind Ramesh, Ajay Joshi, venkataravis, linux-doc, linux-kernel,
nvdimm, linux-cxl, linux-fsdevel
On Wed, 7 Jan 2026 09:33:10 -0600
John Groves <John@Groves.net> wrote:
> This function will be used by both device.c and fsdev.c, but both are
> loadable modules. Moving to bus.c puts it in core and makes it available
> to both.
>
> No code changes - just relocated.
>
> Signed-off-by: John Groves <john@groves.net>
Hi John,
I don't know the code well enough to offer an opinion on whether this
move causes any issues or if this is the best location, so review is superficial
stuff only.
Jonathan
> ---
> drivers/dax/bus.c | 27 +++++++++++++++++++++++++++
> drivers/dax/device.c | 23 -----------------------
> 2 files changed, 27 insertions(+), 23 deletions(-)
>
> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> index fde29e0ad68b..a2f9a3cc30a5 100644
> --- a/drivers/dax/bus.c
> +++ b/drivers/dax/bus.c
> @@ -7,6 +7,9 @@
> #include <linux/slab.h>
> #include <linux/dax.h>
> #include <linux/io.h>
> +#include <linux/backing-dev.h>
I'm not immediately spotting why this one. Maybe should be in a different
patch?
> +#include <linux/range.h>
> +#include <linux/uio.h>
Why this one?
Style wise, dax seems to use reverse xmas tree for includes, so
this should keep to that.
> #include "dax-private.h"
> #include "bus.h"
>
> @@ -1417,6 +1420,30 @@ static const struct device_type dev_dax_type = {
> .groups = dax_attribute_groups,
> };
>
> +/* see "strong" declaration in tools/testing/nvdimm/dax-dev.c */
Bonus space before that */
Curiously that wasn't there in the original.
> +__weak phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff,
> + unsigned long size)
> +{
> + int i;
> +
> + for (i = 0; i < dev_dax->nr_range; i++) {
> + struct dev_dax_range *dax_range = &dev_dax->ranges[i];
> + struct range *range = &dax_range->range;
> + unsigned long long pgoff_end;
> + phys_addr_t phys;
> +
> + pgoff_end = dax_range->pgoff + PHYS_PFN(range_len(range)) - 1;
> + if (pgoff < dax_range->pgoff || pgoff > pgoff_end)
> + continue;
> + phys = PFN_PHYS(pgoff - dax_range->pgoff) + range->start;
> + if (phys + size - 1 <= range->end)
> + return phys;
> + break;
> + }
> + return -1;
> +}
> +EXPORT_SYMBOL_GPL(dax_pgoff_to_phys);
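For readers following the lookup in the quoted function, here is a minimal user-space sketch of the same page-offset to physical-address translation. It is an illustration only, not the kernel code: struct sketch_range, sketch_pgoff_to_phys() and the 4 KiB page size are assumptions made for the example, and UINT64_MAX stands in for the "return -1".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Hypothetical stand-in for struct dev_dax_range: the physical span
 * [start, end] (inclusive) backs the device starting at page offset pgoff. */
struct sketch_range {
	uint64_t pgoff;
	uint64_t start;
	uint64_t end;
};

/* Translate a device page offset to a physical address. Returns UINT64_MAX
 * (the analogue of the "return -1" above) when pgoff is outside every range
 * or when [phys, phys + size) would run past the end of the matching range. */
static uint64_t sketch_pgoff_to_phys(const struct sketch_range *r, size_t nr,
				     uint64_t pgoff, uint64_t size)
{
	for (size_t i = 0; i < nr; i++) {
		uint64_t len = r[i].end - r[i].start + 1;
		uint64_t pgoff_end = r[i].pgoff + (len >> SKETCH_PAGE_SHIFT) - 1;
		uint64_t phys;

		if (pgoff < r[i].pgoff || pgoff > pgoff_end)
			continue;
		phys = ((pgoff - r[i].pgoff) << SKETCH_PAGE_SHIFT) + r[i].start;
		if (phys + size - 1 <= r[i].end)
			return phys;
		break;
	}
	return UINT64_MAX;
}

/* Usage: a 2 MiB range at pgoff 0 followed by a 1 MiB range at pgoff 512. */
static void sketch_pgoff_example(void)
{
	const struct sketch_range r[] = {
		{ .pgoff = 0,   .start = 0x100000000ULL, .end = 0x1001fffffULL },
		{ .pgoff = 512, .start = 0x200000000ULL, .end = 0x2000fffffULL },
	};

	/* Page 513 is the second page of the second range. */
	assert(sketch_pgoff_to_phys(r, 2, 513, 4096) == 0x200001000ULL);
	/* An 8 KiB request starting on the last page of range 0 is rejected. */
	assert(sketch_pgoff_to_phys(r, 2, 511, 8192) == UINT64_MAX);
}
```

The second assertion shows the effect of the "break": a request that starts inside one range but would spill past its end fails rather than continuing into the next range.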
> +
> static struct dev_dax *__devm_create_dev_dax(struct dev_dax_data *data)
> {
> struct dax_region *dax_region = data->dax_region;
> diff --git a/drivers/dax/device.c b/drivers/dax/device.c
> index 22999a402e02..132c1d03fd07 100644
> --- a/drivers/dax/device.c
> +++ b/drivers/dax/device.c
> @@ -57,29 +57,6 @@ static int check_vma(struct dev_dax *dev_dax, struct vm_area_struct *vma,
> vma->vm_file, func);
> }
>
> -/* see "strong" declaration in tools/testing/nvdimm/dax-dev.c */
> -__weak phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff,
> - unsigned long size)
> -{
> - int i;
> -
> - for (i = 0; i < dev_dax->nr_range; i++) {
> - struct dev_dax_range *dax_range = &dev_dax->ranges[i];
> - struct range *range = &dax_range->range;
> - unsigned long long pgoff_end;
> - phys_addr_t phys;
> -
> - pgoff_end = dax_range->pgoff + PHYS_PFN(range_len(range)) - 1;
> - if (pgoff < dax_range->pgoff || pgoff > pgoff_end)
> - continue;
> - phys = PFN_PHYS(pgoff - dax_range->pgoff) + range->start;
> - if (phys + size - 1 <= range->end)
> - return phys;
> - break;
> - }
> - return -1;
> -}
> -
> static void dax_set_mapping(struct vm_fault *vmf, unsigned long pfn,
> unsigned long fault_size)
> {
* Re: [PATCH V3 02/21] dax: add fsdev.c driver for fs-dax on character dax
2026-01-07 15:33 ` [PATCH V3 02/21] dax: add fsdev.c driver for fs-dax on character dax John Groves
@ 2026-01-08 11:31 ` Jonathan Cameron
2026-01-08 14:32 ` John Groves
2026-01-08 15:12 ` John Groves
0 siblings, 2 replies; 74+ messages in thread
From: Jonathan Cameron @ 2026-01-08 11:31 UTC (permalink / raw)
To: John Groves
Cc: Miklos Szeredi, Dan Williams, Bernd Schubert, Alison Schofield,
John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
Matthew Wilcox, Jan Kara, Alexander Viro, David Hildenbrand,
Christian Brauner, Darrick J . Wong, Randy Dunlap, Jeff Layton,
Amir Goldstein, Stefan Hajnoczi, Joanne Koong, Josef Bacik,
Bagas Sanjaya, Chen Linxuan, James Morse, Fuad Tabba,
Sean Christopherson, Shivank Garg, Ackerley Tng, Gregory Price,
Aravind Ramesh, Ajay Joshi, venkataravis, linux-doc, linux-kernel,
nvdimm, linux-cxl, linux-fsdevel
On Wed, 7 Jan 2026 09:33:11 -0600
John Groves <John@Groves.net> wrote:
> The new fsdev driver provides pages/folios initialized compatibly with
> fsdax - normal rather than devdax-style refcounting, and starting out
> with order-0 folios.
>
> When fsdev binds to a daxdev, it is usually (always?) switching from the
> devdax mode (device.c), which pre-initializes compound folios according
> to its alignment. Fsdev uses fsdev_clear_folio_state() to switch the
> folios into a fsdax-compatible state.
>
> A side effect of this is that raw mmap doesn't (can't?) work on an fsdev
> dax instance. Accordingly, the fsdev driver does not provide raw mmap -
> devices must be put in 'devdax' mode (drivers/dax/device.c) to get raw
> mmap capability.
>
> This commit contains just the framework, which remaps pages/folios compatibly
> with fsdax.
>
> Enabling dax changes:
>
> * bus.h: add DAXDRV_FSDEV_TYPE driver type
> * bus.c: allow DAXDRV_FSDEV_TYPE drivers to bind to daxdevs
> * dax.h: prototype inode_dax(), which fsdev needs
>
> Suggested-by: Dan Williams <dan.j.williams@intel.com>
> Suggested-by: Gregory Price <gourry@gourry.net>
> Signed-off-by: John Groves <john@groves.net>
> diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
> index d656e4c0eb84..491325d914a8 100644
> --- a/drivers/dax/Kconfig
> +++ b/drivers/dax/Kconfig
> @@ -78,4 +78,21 @@ config DEV_DAX_KMEM
>
> Say N if unsure.
>
> +config DEV_DAX_FS
> + tristate "FSDEV DAX: fs-dax compatible device driver"
> + depends on DEV_DAX
> + default DEV_DAX
What's the logic for the default? Generally I'd not expect a
default for something new like this (so default of default == no)
> + help
> + Support a device-dax driver mode that is compatible with fs-dax
...
> struct dax_device_driver {
> diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
> new file mode 100644
> index 000000000000..2a3249d1529c
> --- /dev/null
> +++ b/drivers/dax/fsdev.c
> @@ -0,0 +1,276 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright(c) 2026 Micron Technology, Inc. */
> +#include <linux/memremap.h>
> +#include <linux/pagemap.h>
> +#include <linux/module.h>
> +#include <linux/device.h>
> +#include <linux/cdev.h>
> +#include <linux/slab.h>
> +#include <linux/dax.h>
> +#include <linux/fs.h>
> +#include <linux/mm.h>
> +#include "dax-private.h"
> +#include "bus.h"
...
> +static void fsdev_cdev_del(void *cdev)
> +{
> + cdev_del(cdev);
> +}
> +
> +static void fsdev_kill(void *dev_dax)
> +{
> + kill_dev_dax(dev_dax);
> +}
...
> +/*
> + * Clear any stale folio state from pages in the given range.
> + * This is necessary because device_dax pre-initializes compound folios
> + * based on vmemmap_shift, and that state may persist after driver unbind.
What's the argument for not cleaning these out in the unbind path for device_dax?
I can see that it might be an optimization if some other code path blindly
overwrites all this state.
> + * Since fsdev_dax uses MEMORY_DEVICE_FS_DAX without vmemmap_shift, fs-dax
> + * expects to find clean order-0 folios that it can build into compound
> + * folios on demand.
> + *
> + * At probe time, no filesystem should be mounted yet, so all mappings
> + * are stale and must be cleared along with compound state.
> + */
> +static void fsdev_clear_folio_state(struct dev_dax *dev_dax)
> +{
> + int i;
It's becoming increasingly common to declare loop variables as
for (int i = 0; i <...
and given that saves us a few lines here it seems worth doing.
> +
> + for (i = 0; i < dev_dax->nr_range; i++) {
> + struct range *range = &dev_dax->ranges[i].range;
> + unsigned long pfn, end_pfn;
> +
> + pfn = PHYS_PFN(range->start);
> + end_pfn = PHYS_PFN(range->end) + 1;
Might as well do
unsigned long pfn = PHYS_PFN(range->start);
unsigned long end_pfn = PHYS_PFN(range->end) + 1;
> +
> + while (pfn < end_pfn) {
> + struct page *page = pfn_to_page(pfn);
> + struct folio *folio = (struct folio *)page;
> + struct dev_pagemap *pgmap = page_pgmap(page);
> + int order = folio_order(folio);
> +
> + /*
> + * Clear any stale mapping pointer. At probe time,
> + * no filesystem is mounted, so any mapping is stale.
> + */
> + folio->mapping = NULL;
> + folio->share = 0;
> +
> + if (order > 0) {
> + int j;
> +
> + folio_reset_order(folio);
> + for (j = 0; j < (1UL << order); j++) {
> + struct page *p = page + j;
> +
> + ClearPageHead(p);
> + clear_compound_head(p);
> + ((struct folio *)p)->mapping = NULL;
This code block is very similar to a chunk in dax_folio_put() in fs/dax.c
Can we create a helper for both to use?
I note that uses a local struct folio *new_folio to avoid multiple casts.
I'd do similar here even if it's a long line.
If not possible to use a common helper, it is probably still worth
having a helper here for the stuff in the while loop just to reduce indent
and improve readability a little.
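As a rough illustration of the helper shape being suggested, a user-space sketch follows. Everything in it is hypothetical: struct sketch_folio and struct sketch_pgmap are simplified stand-ins for the kernel types, and the page flag handling (ClearPageHead(), clear_compound_head(), folio_reset_order()) is reduced to clearing an order field. It only shows a local new_folio pointer replacing the repeated casts and a return value that lets the caller advance pfn.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-ins for struct dev_pagemap and
 * struct folio; only the fields touched in the quoted loop are modelled. */
struct sketch_pgmap { int id; };

struct sketch_folio {
	void *mapping;
	unsigned long share;
	struct sketch_pgmap *pgmap;
	unsigned int order;	/* meaningful on the head entry only */
};

/* Reset the folio starting at head, plus its tail entries if it is
 * compound, to clean order-0 state. Returns the number of entries covered
 * so the caller can advance its pfn by that amount. */
static unsigned long sketch_reset_folio(struct sketch_folio *head,
					struct sketch_pgmap *pgmap)
{
	unsigned long nr;

	assert(head && pgmap);
	/* Read the order before it is cleared below. */
	nr = 1UL << head->order;

	for (unsigned long j = 0; j < nr; j++) {
		struct sketch_folio *new_folio = head + j;

		new_folio->mapping = NULL;
		new_folio->share = 0;
		new_folio->pgmap = pgmap;
		new_folio->order = 0;
	}
	return nr;
}

/* Usage: walk n entries the way the quoted while loop does, returning how
 * many times the helper was called. */
static unsigned long sketch_reset_range(struct sketch_folio *f, unsigned long n,
					struct sketch_pgmap *pgmap)
{
	unsigned long calls = 0;

	for (unsigned long i = 0; i < n; calls++)
		i += sketch_reset_folio(&f[i], pgmap);
	return calls;
}
```

With the per-folio work in one helper, the order-0 and compound cases share a single code path and the caller's loop body shrinks to one line.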
> + ((struct folio *)p)->share = 0;
> + ((struct folio *)p)->pgmap = pgmap;
> + }
> + pfn += (1UL << order);
> + } else {
> + folio->pgmap = pgmap;
> + pfn++;
> + }
> + }
> + }
> +}
> +
> +static int fsdev_open(struct inode *inode, struct file *filp)
> +{
> + struct dax_device *dax_dev = inode_dax(inode);
> + struct dev_dax *dev_dax = dax_get_private(dax_dev);
> +
> + dev_dbg(&dev_dax->dev, "trace\n");
Hmm. This is somewhat odd, but I see dax/device.c does
the same thing and I guess that's because you are using
dynamic debug with function names turned on to provide the
'real' information.
> + filp->private_data = dev_dax;
> +
> + return 0;
> +}
> +static int fsdev_dax_probe(struct dev_dax *dev_dax)
> +{
> + struct dax_device *dax_dev = dev_dax->dax_dev;
> + struct device *dev = &dev_dax->dev;
> + struct dev_pagemap *pgmap;
> + u64 data_offset = 0;
> + struct inode *inode;
> + struct cdev *cdev;
> + void *addr;
> + int rc, i;
> +
A bunch of this is cut and paste from dax/device.c
If it carries on looking like this, can we have a helper module that
both drivers use with the common code in it? That would make the
difference more obvious as well.
> + if (static_dev_dax(dev_dax)) {
> + if (dev_dax->nr_range > 1) {
> + dev_warn(dev,
> + "static pgmap / multi-range device conflict\n");
> + return -EINVAL;
> + }
> +
> + pgmap = dev_dax->pgmap;
> + } else {
> + if (dev_dax->pgmap) {
> + dev_warn(dev,
> + "dynamic-dax with pre-populated page map\n");
Unless dax maintainers are very fussy about 80 chars, I'd go long on these as it's
only just over 80 chars on one line.
Given you are failing probe, not sure why dev_warn() is considered sufficient.
To me dev_err() seems more sensible. What you have matches dax/device.c though
so maybe there is a sound reason.
> + return -EINVAL;
> + }
> +
> + pgmap = devm_kzalloc(dev,
> + struct_size(pgmap, ranges, dev_dax->nr_range - 1),
> + GFP_KERNEL);
Pick an alignment style and stick to it. Either.
pgmap = devm_kzalloc(dev,
struct_size(pgmap, ranges, dev_dax->nr_range - 1),
GFP_KERNEL);
or go long for readability and do
pgmap = devm_kzalloc(dev,
struct_size(pgmap, ranges, dev_dax->nr_range - 1),
GFP_KERNEL);
> + if (!pgmap)
> + return -ENOMEM;
> +
> + pgmap->nr_range = dev_dax->nr_range;
> + dev_dax->pgmap = pgmap;
> +
> + for (i = 0; i < dev_dax->nr_range; i++) {
> + struct range *range = &dev_dax->ranges[i].range;
> +
> + pgmap->ranges[i] = *range;
> + }
> + }
> +
> + for (i = 0; i < dev_dax->nr_range; i++) {
> + struct range *range = &dev_dax->ranges[i].range;
> +
> + if (!devm_request_mem_region(dev, range->start,
> + range_len(range), dev_name(dev))) {
> + dev_warn(dev, "mapping%d: %#llx-%#llx could not reserve range\n",
> + i, range->start, range->end);
> + return -EBUSY;
> + }
> + }
> +
> + /*
> + * FS-DAX compatible mode: Use MEMORY_DEVICE_FS_DAX type and
> + * do NOT set vmemmap_shift. This leaves folios at order-0,
> + * allowing fs-dax to dynamically create compound folios as needed
> + * (similar to pmem behavior).
> + */
> + pgmap->type = MEMORY_DEVICE_FS_DAX;
> + pgmap->ops = &fsdev_pagemap_ops;
> + pgmap->owner = dev_dax;
> +
> + /*
> + * CRITICAL DIFFERENCE from device.c:
> + * We do NOT set vmemmap_shift here, even if align > PAGE_SIZE.
> + * This ensures folios remain order-0 and are compatible with
> + * fs-dax's folio management.
> + */
> +
> + addr = devm_memremap_pages(dev, pgmap);
> + if (IS_ERR(addr))
> + return PTR_ERR(addr);
> +
> + /*
> + * Clear any stale compound folio state left over from a previous
> + * driver (e.g., device_dax with vmemmap_shift).
> + */
> + fsdev_clear_folio_state(dev_dax);
> +
> + /* Detect whether the data is at a non-zero offset into the memory */
> + if (pgmap->range.start != dev_dax->ranges[0].range.start) {
> + u64 phys = dev_dax->ranges[0].range.start;
> + u64 pgmap_phys = dev_dax->pgmap[0].range.start;
> +
> + if (!WARN_ON(pgmap_phys > phys))
> + data_offset = phys - pgmap_phys;
> +
> + pr_debug("%s: offset detected phys=%llx pgmap_phys=%llx offset=%llx\n",
> + __func__, phys, pgmap_phys, data_offset);
> + }
> +
> + inode = dax_inode(dax_dev);
> + cdev = inode->i_cdev;
> + cdev_init(cdev, &fsdev_fops);
> + cdev->owner = dev->driver->owner;
> + cdev_set_parent(cdev, &dev->kobj);
> + rc = cdev_add(cdev, dev->devt, 1);
> + if (rc)
> + return rc;
> +
> + rc = devm_add_action_or_reset(dev, fsdev_cdev_del, cdev);
> + if (rc)
> + return rc;
> +
> + run_dax(dax_dev);
> + return devm_add_action_or_reset(dev, fsdev_kill, dev_dax);
> +}
> +
> +static struct dax_device_driver fsdev_dax_driver = {
> + .probe = fsdev_dax_probe,
> + .type = DAXDRV_FSDEV_TYPE,
> +};
> +
> +static int __init dax_init(void)
> +{
> + return dax_driver_register(&fsdev_dax_driver);
> +}
> +
> +static void __exit dax_exit(void)
> +{
> + dax_driver_unregister(&fsdev_dax_driver);
> +}
If these don't get more complex, maybe it's time for a dax specific define
using module_driver()
> +
> +MODULE_AUTHOR("John Groves");
> +MODULE_DESCRIPTION("FS-DAX Device: fs-dax compatible devdax driver");
> +MODULE_LICENSE("GPL");
> +module_init(dax_init);
> +module_exit(dax_exit);
> +MODULE_ALIAS_DAX_DEVICE(0);
Curious macro. Always has same parameter... Maybe ripe for just dropping the parameter?
* Re: [PATCH V3 03/21] dax: Save the kva from memremap
2026-01-07 15:33 ` [PATCH V3 03/21] dax: Save the kva from memremap John Groves
@ 2026-01-08 11:32 ` Jonathan Cameron
2026-01-08 15:15 ` John Groves
0 siblings, 1 reply; 74+ messages in thread
From: Jonathan Cameron @ 2026-01-08 11:32 UTC (permalink / raw)
To: John Groves
Cc: Miklos Szeredi, Dan Williams, Bernd Schubert, Alison Schofield,
John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
Matthew Wilcox, Jan Kara, Alexander Viro, David Hildenbrand,
Christian Brauner, Darrick J . Wong, Randy Dunlap, Jeff Layton,
Amir Goldstein, Stefan Hajnoczi, Joanne Koong, Josef Bacik,
Bagas Sanjaya, Chen Linxuan, James Morse, Fuad Tabba,
Sean Christopherson, Shivank Garg, Ackerley Tng, Gregory Price,
Aravind Ramesh, Ajay Joshi, venkataravis, linux-doc, linux-kernel,
nvdimm, linux-cxl, linux-fsdevel
On Wed, 7 Jan 2026 09:33:12 -0600
John Groves <John@Groves.net> wrote:
> Save the kva from memremap because we need it for iomap rw support.
>
> Prior to famfs, there were no iomap users of /dev/dax - so the virtual
> address from memremap was not needed.
>
> (also fill in missing kerneldoc comment fields for struct dev_dax)
Do that as a precursor that can be picked up ahead of the rest of the series.
>
> Signed-off-by: John Groves <john@groves.net>
> ---
> drivers/dax/dax-private.h | 4 ++++
> drivers/dax/fsdev.c | 1 +
> 2 files changed, 5 insertions(+)
>
> diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
> index 0867115aeef2..1bb1631af485 100644
> --- a/drivers/dax/dax-private.h
> +++ b/drivers/dax/dax-private.h
> @@ -69,18 +69,22 @@ struct dev_dax_range {
> * data while the device is activated in the driver.
> * @region - parent region
> * @dax_dev - core dax functionality
> + * @virt_addr - kva from memremap; used by fsdev_dax
> + * @align - alignment of this instance
> * @target_node: effective numa node if dev_dax memory range is onlined
> * @dyn_id: is this a dynamic or statically created instance
> * @id: ida allocated id when the dax_region is not static
> * @ida: mapping id allocator
> * @dev - device core
> * @pgmap - pgmap for memmap setup / lifetime (driver owned)
> + * @memmap_on_memory - allow kmem to put the memmap in the memory
> * @nr_range: size of @ranges
> * @ranges: range tuples of memory used
> */
> struct dev_dax {
> struct dax_region *region;
> struct dax_device *dax_dev;
> + void *virt_addr;
> unsigned int align;
> int target_node;
> bool dyn_id;
> diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
> index 2a3249d1529c..c5c660b193e5 100644
> --- a/drivers/dax/fsdev.c
> +++ b/drivers/dax/fsdev.c
> @@ -235,6 +235,7 @@ static int fsdev_dax_probe(struct dev_dax *dev_dax)
> pr_debug("%s: offset detected phys=%llx pgmap_phys=%llx offset=%llx\n",
> __func__, phys, pgmap_phys, data_offset);
> }
> + dev_dax->virt_addr = addr + data_offset;
>
> inode = dax_inode(dax_dev);
> cdev = inode->i_cdev;
* Re: [PATCH V3 04/21] dax: Add dax_operations for use by fs-dax on fsdev dax
2026-01-07 15:33 ` [PATCH V3 04/21] dax: Add dax_operations for use by fs-dax on fsdev dax John Groves
@ 2026-01-08 11:50 ` Jonathan Cameron
2026-01-08 15:59 ` John Groves
0 siblings, 1 reply; 74+ messages in thread
From: Jonathan Cameron @ 2026-01-08 11:50 UTC (permalink / raw)
To: John Groves
Cc: Miklos Szeredi, Dan Williams, Bernd Schubert, Alison Schofield,
John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
Matthew Wilcox, Jan Kara, Alexander Viro, David Hildenbrand,
Christian Brauner, Darrick J . Wong, Randy Dunlap, Jeff Layton,
Amir Goldstein, Stefan Hajnoczi, Joanne Koong, Josef Bacik,
Bagas Sanjaya, Chen Linxuan, James Morse, Fuad Tabba,
Sean Christopherson, Shivank Garg, Ackerley Tng, Gregory Price,
Aravind Ramesh, Ajay Joshi, venkataravis, linux-doc, linux-kernel,
nvdimm, linux-cxl, linux-fsdevel
On Wed, 7 Jan 2026 09:33:13 -0600
John Groves <John@Groves.net> wrote:
> From: John Groves <John@Groves.net>
>
Hi John
The description should generally make sense without the title.
Sometimes that means more or less repeating the title.
A few other things inline.
> * These methods are based on pmem_dax_ops from drivers/nvdimm/pmem.c
> * fsdev_dax_direct_access() returns the hpa, pfn and kva. The kva was
> newly stored as dev_dax->virt_addr by dev_dax_probe().
> * The hpa/pfn are used for mmap (dax_iomap_fault()), and the kva is used
> for read/write (dax_iomap_rw())
> * fsdev_dax_recovery_write() and dev_dax_zero_page_range() have not been
> tested yet. I'm looking for suggestions as to how to test those.
> * dax-private.h: add dev_dax->cached_size, which fsdev needs to
> remember. The dev_dax size cannot change while a driver is bound
> (dev_dax_resize returns -EBUSY if dev->driver is set). Caching the size
> at probe time allows fsdev's direct_access path to use it without
> acquiring dax_dev_rwsem (which isn't exported anyway).
>
> Signed-off-by: John Groves <john@groves.net>
> diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
> index c5c660b193e5..9e2f83aa2584 100644
> --- a/drivers/dax/fsdev.c
> +++ b/drivers/dax/fsdev.c
> @@ -27,6 +27,81 @@
> * - No mmap support - all access is through fs-dax/iomap
> */
>
> +static void fsdev_write_dax(void *pmem_addr, struct page *page,
> + unsigned int off, unsigned int len)
> +{
> + while (len) {
> + void *mem = kmap_local_page(page);
I guess it's pretty simple, but do we care about HIGHMEM for this
new feature? Maybe it's just easier to support it than argue about it however ;)
> + unsigned int chunk = min_t(unsigned int, len, PAGE_SIZE - off);
> +
> + memcpy_flushcache(pmem_addr, mem + off, chunk);
> + kunmap_local(mem);
> + len -= chunk;
> + off = 0;
> + page++;
> + pmem_addr += chunk;
> + }
> +}
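The chunking in the quoted loop can be sketched in user space as below. This is an illustration under stated assumptions, not the driver code: the source pages are modelled as one contiguous buffer (so "page++" becomes a pointer bump and no kmap_local_page() is needed), plain memcpy() stands in for memcpy_flushcache(), and sketch_write_chunks() and SKETCH_PAGE_SIZE are names invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumption: 4 KiB pages */

/* Copy len bytes to dst, starting off bytes into the first source page and
 * never letting a single memcpy() cross a source page boundary. Returns the
 * number of chunks used. Requires off < SKETCH_PAGE_SIZE. */
static unsigned int sketch_write_chunks(unsigned char *dst,
					const unsigned char *src_pages,
					unsigned int off, unsigned int len)
{
	unsigned int chunks = 0;

	assert(off < SKETCH_PAGE_SIZE);
	while (len) {
		unsigned int room = SKETCH_PAGE_SIZE - off;
		unsigned int chunk = len < room ? len : room;

		memcpy(dst, src_pages + off, chunk);
		len -= chunk;
		off = 0;			/* later pages are read from offset 0 */
		src_pages += SKETCH_PAGE_SIZE;	/* models page++ */
		dst += chunk;			/* models pmem_addr += chunk */
		chunks++;
	}
	return chunks;
}
```

For example, a 5000-byte copy starting 4000 bytes into the first page is split into chunks of 96, 4096 and 808 bytes, so only the first chunk uses a non-zero offset.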
> +
> +static long __fsdev_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
> + long nr_pages, enum dax_access_mode mode, void **kaddr,
> + unsigned long *pfn)
> +{
> + struct dev_dax *dev_dax = dax_get_private(dax_dev);
> + size_t size = nr_pages << PAGE_SHIFT;
> + size_t offset = pgoff << PAGE_SHIFT;
> + void *virt_addr = dev_dax->virt_addr + offset;
> + phys_addr_t phys;
> + unsigned long local_pfn;
> +
> + WARN_ON(!dev_dax->virt_addr);
> +
> + phys = dax_pgoff_to_phys(dev_dax, pgoff, nr_pages << PAGE_SHIFT);
Use size given you already computed it.
> +
> + if (kaddr)
> + *kaddr = virt_addr;
> +
> + local_pfn = PHYS_PFN(phys);
> + if (pfn)
> + *pfn = local_pfn;
> +
> + /*
> + * Use cached_size which was computed at probe time. The size cannot
> + * change while the driver is bound (resize returns -EBUSY).
> + */
> + return PHYS_PFN(min_t(size_t, size, dev_dax->cached_size - offset));
Is the min_t() needed? min() is pretty good at picking right types these days.
> +}
> +
> +static int fsdev_dax_zero_page_range(struct dax_device *dax_dev,
> + pgoff_t pgoff, size_t nr_pages)
> +{
> + void *kaddr;
> +
> + WARN_ONCE(nr_pages > 1, "%s: nr_pages > 1\n", __func__);
> + __fsdev_dax_direct_access(dax_dev, pgoff, 1, DAX_ACCESS, &kaddr, NULL);
> + fsdev_write_dax(kaddr, ZERO_PAGE(0), 0, PAGE_SIZE);
> + return 0;
> +}
> +
> +static long fsdev_dax_direct_access(struct dax_device *dax_dev,
> + pgoff_t pgoff, long nr_pages, enum dax_access_mode mode,
> + void **kaddr, unsigned long *pfn)
> +{
> + return __fsdev_dax_direct_access(dax_dev, pgoff, nr_pages, mode,
> + kaddr, pfn);
Alignment in this file is a bit random, but I'd at least align this one
after the (
> +}
* Re: [PATCH V3 05/21] dax: Add dax_set_ops() for setting dax_operations at bind time
2026-01-07 15:33 ` [PATCH V3 05/21] dax: Add dax_set_ops() for setting dax_operations at bind time John Groves
@ 2026-01-08 12:06 ` Jonathan Cameron
2026-01-08 16:20 ` John Groves
0 siblings, 1 reply; 74+ messages in thread
From: Jonathan Cameron @ 2026-01-08 12:06 UTC (permalink / raw)
To: John Groves
Cc: Miklos Szeredi, Dan Williams, Bernd Schubert, Alison Schofield,
John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
Matthew Wilcox, Jan Kara, Alexander Viro, David Hildenbrand,
Christian Brauner, Darrick J . Wong, Randy Dunlap, Jeff Layton,
Amir Goldstein, Stefan Hajnoczi, Joanne Koong, Josef Bacik,
Bagas Sanjaya, Chen Linxuan, James Morse, Fuad Tabba,
Sean Christopherson, Shivank Garg, Ackerley Tng, Gregory Price,
Aravind Ramesh, Ajay Joshi, venkataravis, linux-doc, linux-kernel,
nvdimm, linux-cxl, linux-fsdevel
On Wed, 7 Jan 2026 09:33:14 -0600
John Groves <John@Groves.net> wrote:
> From: John Groves <John@Groves.net>
>
> The dax_device is created (in the non-pmem case) at hmem probe time via
> devm_create_dev_dax(), before we know which driver (device_dax,
> fsdev_dax, or kmem) will bind - by calling alloc_dax() with NULL ops,
> drivers (i.e. fsdev_dax) that need specific dax_operations must set
> them later.
>
> Add dax_set_ops() exported function so fsdev_dax can set its ops at
> probe time and clear them on remove. device_dax doesn't need ops since
> it uses the mmap fault path directly.
>
> Use cmpxchg() to atomically set ops only if currently NULL, returning
> -EBUSY if ops are already set. This prevents accidental double-binding.
> Clearing ops (NULL) always succeeds.
>
> Signed-off-by: John Groves <john@groves.net>
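The set-once behaviour described in the commit message can be modelled in userspace. This is only a sketch under stated assumptions, not the patch's code: C11 atomic_compare_exchange_strong() stands in for the kernel's cmpxchg(), and struct model_ops, struct model_dev and model_set_ops() are hypothetical names invented for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct dax_operations / struct dax_device. */
struct model_ops {
	int id;
};

struct model_dev {
	_Atomic(const struct model_ops *) ops;
};

/*
 * Install ops only if none are currently set; clearing (NULL) always
 * succeeds. Returns 0 on success, -EBUSY if ops were already installed.
 */
static int model_set_ops(struct model_dev *dev, const struct model_ops *ops)
{
	const struct model_ops *expected = NULL;

	if (!ops) {
		atomic_store(&dev->ops, NULL);
		return 0;
	}

	/* The exchange succeeds only while the current value is still NULL. */
	if (!atomic_compare_exchange_strong(&dev->ops, &expected, ops))
		return -EBUSY;

	return 0;
}
```

A second caller that tries to install ops while the first set is still in place gets -EBUSY; once the ops are cleared, a new set can be installed.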
Hi John
This one runs into the fun mess of mixing devm and other calls.
I'd advise you just don't do it because it makes code much harder
to review and hits the 'smells bad' button.
Jonathan
> ---
> drivers/dax/fsdev.c | 12 ++++++++++++
> drivers/dax/super.c | 38 +++++++++++++++++++++++++++++++++++++-
> include/linux/dax.h | 1 +
> 3 files changed, 50 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
> index 9e2f83aa2584..3f4f593896e3 100644
> --- a/drivers/dax/fsdev.c
> +++ b/drivers/dax/fsdev.c
> @@ -330,12 +330,24 @@ static int fsdev_dax_probe(struct dev_dax *dev_dax)
> if (rc)
> return rc;
>
> + /* Set the dax operations for fs-dax access path */
> + rc = dax_set_ops(dax_dev, &dev_dax_ops);
> + if (rc)
> + return rc;
> +
> run_dax(dax_dev);
> return devm_add_action_or_reset(dev, fsdev_kill, dev_dax);
> }
>
> +static void fsdev_dax_remove(struct dev_dax *dev_dax)
> +{
> + /* Clear ops on unbind so they aren't used with a different driver */
> + dax_set_ops(dev_dax->dax_dev, NULL);
Generally orderings of calls that mix devm and stuff done manually in remove are
a bad idea. They can be safe (and this one probably is) but it adds a review
burden that is best avoided.
Once you stop using devm_ you need to stop it for everything. So either
use a devm_add_action_or_reset for this or drop the one for fsdev_kill and
call that code here instead.
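To make the ordering concern concrete, here is a small userspace model. It assumes the usual driver-core behaviour that a driver's remove callback runs before its devm actions are released, and that devm actions run in reverse order of registration. Every name in it (struct model_device, model_add_action(), model_unbind(), model_clear_ops(), model_kill()) is hypothetical; it is a sketch, not driver code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_ACTIONS 8

/* Hypothetical userspace stand-in for a device's list of devm actions. */
struct model_device {
	void (*action[MAX_ACTIONS])(void *);
	void *data[MAX_ACTIONS];
	size_t nr_actions;
};

/* Register a cleanup action; returns 0 on success, -1 if the list is full. */
static int model_add_action(struct model_device *dev,
			    void (*fn)(void *), void *data)
{
	if (dev->nr_actions == MAX_ACTIONS)
		return -1;
	dev->action[dev->nr_actions] = fn;
	dev->data[dev->nr_actions] = data;
	dev->nr_actions++;
	return 0;
}

/* Unbind: the manual remove callback (if any) runs first, then the
 * registered actions run in reverse order of registration. */
static void model_unbind(struct model_device *dev,
			 void (*remove_cb)(void *), void *data)
{
	if (remove_cb)
		remove_cb(data);
	while (dev->nr_actions) {
		dev->nr_actions--;
		dev->action[dev->nr_actions](dev->data[dev->nr_actions]);
	}
}

/* Teardown steps append a letter to a trace so the order can be checked. */
static void model_clear_ops(void *trace)
{
	strcat(trace, "C");
}

static void model_kill(void *trace)
{
	strcat(trace, "K");
}
```

With probe doing "set ops, then register kill", a manual remove that clears the ops produces the teardown order clear, kill. Registering both steps as actions produces kill, clear, which is the strict reverse of probe. Either may be safe, but only the second can be verified by reading probe alone.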
> +}
> +
> static struct dax_device_driver fsdev_dax_driver = {
> .probe = fsdev_dax_probe,
> + .remove = fsdev_dax_remove,
> .type = DAXDRV_FSDEV_TYPE,
> };
* Re: [PATCH V3 06/21] dax: Add fs_dax_get() func to prepare dax for fs-dax usage
2026-01-07 15:33 ` [PATCH V3 06/21] dax: Add fs_dax_get() func to prepare dax for fs-dax usage John Groves
@ 2026-01-08 12:27 ` Jonathan Cameron
2026-01-08 16:45 ` John Groves
0 siblings, 1 reply; 74+ messages in thread
From: Jonathan Cameron @ 2026-01-08 12:27 UTC (permalink / raw)
To: John Groves
Cc: Miklos Szeredi, Dan Williams, Bernd Schubert, Alison Schofield,
John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
Matthew Wilcox, Jan Kara, Alexander Viro, David Hildenbrand,
Christian Brauner, Darrick J . Wong, Randy Dunlap, Jeff Layton,
Amir Goldstein, Stefan Hajnoczi, Joanne Koong, Josef Bacik,
Bagas Sanjaya, Chen Linxuan, James Morse, Fuad Tabba,
Sean Christopherson, Shivank Garg, Ackerley Tng, Gregory Price,
Aravind Ramesh, Ajay Joshi, venkataravis, linux-doc, linux-kernel,
nvdimm, linux-cxl, linux-fsdevel
On Wed, 7 Jan 2026 09:33:15 -0600
John Groves <John@Groves.net> wrote:
> The fs_dax_get() function should be called by fs-dax file systems after
> opening a fsdev dax device. This adds holder_operations, which provides
> a memory failure callback path and effects exclusivity between callers
> of fs_dax_get().
>
> fs_dax_get() is specific to fsdev_dax, so it checks the driver type
> (which required touching bus.[ch]). fs_dax_get() fails if fsdev_dax is
> not bound to the memory.
>
> This function serves the same role as fs_dax_get_by_bdev(), which dax
> file systems call after opening the pmem block device.
>
> This can't be located in fsdev.c because struct dax_device is opaque
> there.
>
> This will be called by fs/fuse/famfs.c in a subsequent commit.
>
> Signed-off-by: John Groves <john@groves.net>
Hi John,
A few passing comments on this one.
Jonathan
> ---
> #define dax_driver_register(driver) \
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index ba0b4cd18a77..68c45b918cff 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -14,6 +14,7 @@
> #include <linux/fs.h>
> #include <linux/cacheinfo.h>
> #include "dax-private.h"
> +#include "bus.h"
>
> /**
> * struct dax_device - anchor object for dax services
> @@ -121,6 +122,59 @@ void fs_put_dax(struct dax_device *dax_dev, void *holder)
> EXPORT_SYMBOL_GPL(fs_put_dax);
> #endif /* CONFIG_BLOCK && CONFIG_FS_DAX */
>
> +#if IS_ENABLED(CONFIG_DEV_DAX_FS)
> +/**
> + * fs_dax_get() - get ownership of a devdax via holder/holder_ops
> + *
> + * fs-dax file systems call this function to prepare to use a devdax device for
> + * fsdax. This is like fs_dax_get_by_bdev(), but the caller already has struct
> + * dev_dax (and there is no bdev). The holder makes this exclusive.
> + *
> + * @dax_dev: dev to be prepared for fs-dax usage
> + * @holder: filesystem or mapped device inside the dax_device
> + * @hops: operations for the inner holder
> + *
> + * Returns: 0 on success, <0 on failure
> + */
> +int fs_dax_get(struct dax_device *dax_dev, void *holder,
> + const struct dax_holder_operations *hops)
> +{
> + struct dev_dax *dev_dax;
> + struct dax_device_driver *dax_drv;
> + int id;
> +
> + id = dax_read_lock();
Given this is an srcu_read_lock under the hood, you could do something similar
to the DEFINE_LOCK_GUARD_1 for srcu (srcu.h), though here it's a
DEFINE_LOCK_GUARD_0 given the lock itself isn't a parameter, and then
use scoped_guard() here. Might not be worth the hassle and would need
a wrapper macro to poke &dax_srcu in, which means exposing that at least
a little in a header.
DEFINE_LOCK_GUARD_0(dax_read, _T->idx = dax_read_lock(), dax_read_unlock(_T->idx), int idx);
Based loosely on the irqflags.h irqsave one.
> + if (!dax_dev || !dax_alive(dax_dev) || !igrab(&dax_dev->inode)) {
> + dax_read_unlock(id);
> + return -ENODEV;
> + }
> + dax_read_unlock(id);
> +
> + /* Verify the device is bound to fsdev_dax driver */
> + dev_dax = dax_get_private(dax_dev);
> + if (!dev_dax || !dev_dax->dev.driver) {
> + iput(&dax_dev->inode);
> + return -ENODEV;
> + }
> +
> + dax_drv = to_dax_drv(dev_dax->dev.driver);
> + if (dax_drv->type != DAXDRV_FSDEV_TYPE) {
> + iput(&dax_dev->inode);
> + return -EOPNOTSUPP;
> + }
> +
> + if (cmpxchg(&dax_dev->holder_data, NULL, holder)) {
> + iput(&dax_dev->inode);
> + return -EBUSY;
> + }
> +
> + dax_dev->holder_ops = hops;
> +
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(fs_dax_get);
> +#endif /* DEV_DAX_FS */
> +
> enum dax_device_flags {
> /* !alive + rcu grace period == no new operations / mappings */
> DAXDEV_ALIVE,
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index 3fcd8562b72b..76f2a75f3144 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -53,6 +53,7 @@ struct dax_holder_operations {
> struct dax_device *alloc_dax(void *private, const struct dax_operations *ops);
>
> #if IS_ENABLED(CONFIG_DEV_DAX_FS)
> +int fs_dax_get(struct dax_device *dax_dev, void *holder, const struct dax_holder_operations *hops);
I'd wrap this. It's rather long and there isn't a huge readability benefit in keeping
it on one line.
> struct dax_device *inode_dax(struct inode *inode);
> #endif
> void *dax_holder(struct dax_device *dax_dev);
* Re: [PATCH V3 07/21] dax: prevent driver unbind while filesystem holds device
2026-01-07 15:33 ` [PATCH V3 07/21] dax: prevent driver unbind while filesystem holds device John Groves
@ 2026-01-08 12:34 ` Jonathan Cameron
2026-01-08 18:08 ` John Groves
2026-01-12 18:55 ` John Groves
1 sibling, 1 reply; 74+ messages in thread
From: Jonathan Cameron @ 2026-01-08 12:34 UTC (permalink / raw)
To: John Groves
Cc: Miklos Szeredi, Dan Williams, Bernd Schubert, Alison Schofield,
John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
Matthew Wilcox, Jan Kara, Alexander Viro, David Hildenbrand,
Christian Brauner, Darrick J . Wong, Randy Dunlap, Jeff Layton,
Amir Goldstein, Stefan Hajnoczi, Joanne Koong, Josef Bacik,
Bagas Sanjaya, Chen Linxuan, James Morse, Fuad Tabba,
Sean Christopherson, Shivank Garg, Ackerley Tng, Gregory Price,
Aravind Ramesh, Ajay Joshi, venkataravis, linux-doc, linux-kernel,
nvdimm, linux-cxl, linux-fsdevel
On Wed, 7 Jan 2026 09:33:16 -0600
John Groves <John@Groves.net> wrote:
> From: John Groves <John@Groves.net>
>
> Add custom bind/unbind sysfs attributes for the dax bus that check
> whether a filesystem has registered as a holder (via fs_dax_get())
> before allowing driver unbind.
>
> When a filesystem like famfs mounts on a dax device, it registers
> itself as the holder via dax_holder_ops. Previously, there was no
> mechanism to prevent driver unbind while the filesystem was mounted,
> which could cause some havoc.
>
> The new unbind_store() checks dax_holder() and returns -EBUSY if
> a holder is registered, giving userspace proper feedback that the
> device is in use.
>
> To use our custom bind/unbind handlers instead of the default ones,
> set suppress_bind_attrs=true on all dax drivers during registration.
Whilst I appreciate that it is painful, so are many other driver unbinds
where services are provided to another driver. Is there any precedent
for doing something like this? If not, I'd like to see a review on this
from one of the driver core folk. Maybe Greg KH.
Might just be a case of calling it something else to avoid userspace
tooling getting a surprise.
>
> Signed-off-by: John Groves <john@groves.net>
> ---
> drivers/dax/bus.c | 53 +++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 53 insertions(+)
>
> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> index 6e0e28116edc..ed453442739d 100644
> --- a/drivers/dax/bus.c
> +++ b/drivers/dax/bus.c
> @@ -151,9 +151,61 @@ static ssize_t remove_id_store(struct device_driver *drv, const char *buf,
> }
> static DRIVER_ATTR_WO(remove_id);
>
> +static const struct bus_type dax_bus_type;
> +
> +/*
> + * Custom bind/unbind handlers for dax bus.
> + * The unbind handler checks if a filesystem holds the dax device and
> + * returns -EBUSY if so, preventing driver unbind while in use.
> + */
> +static ssize_t unbind_store(struct device_driver *drv, const char *buf,
> + size_t count)
> +{
> + struct device *dev;
> + int rc = -ENODEV;
> +
> + dev = bus_find_device_by_name(&dax_bus_type, NULL, buf);
struct device *dev __free(put_device) = bus_find_device_by_name()...
and you can just return on error.
> + if (dev && dev->driver == drv) {
With the __free I'd flip this
if (!dev || dev->driver != drv)
return -ENODEV;
...
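The shape this suggestion leads to can be tried in userspace with the compiler's cleanup attribute, which is what the kernel's __free() helper is built on. The sketch below is an illustration under assumptions: struct model_obj, model_get(), model_put() and model_unbind_store() are hypothetical stand-ins for struct device, bus_find_device_by_name(), put_device() and unbind_store(), and it needs GCC or Clang.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical refcounted object standing in for struct device. */
struct model_obj {
	int refcount;
	int bound;	/* non-zero while a driver is attached */
	int held;	/* non-zero while a holder is registered */
};

/* Take a reference, as a lookup helper would on success. */
static struct model_obj *model_get(struct model_obj *obj)
{
	if (obj)
		obj->refcount++;
	return obj;
}

static void model_put(struct model_obj *obj)
{
	if (obj)
		obj->refcount--;
}

/* Cleanup handler: receives a pointer to the annotated variable. */
static void model_put_cleanup(struct model_obj **objp)
{
	model_put(*objp);
}

#define model_auto_put __attribute__((cleanup(model_put_cleanup)))

/* Returns count on success, -ENODEV or -EBUSY on failure. */
static long model_unbind_store(struct model_obj *found, long count)
{
	struct model_obj *obj model_auto_put = model_get(found);

	if (!obj || !obj->bound)
		return -ENODEV;	/* reference is dropped automatically */

	if (obj->held)
		return -EBUSY;	/* likewise, no goto or label needed */

	obj->bound = 0;
	return count;
}
```

Every return path drops the reference taken at the top, so the error checks can return directly and the out: label goes away.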
> + struct dev_dax *dev_dax = to_dev_dax(dev);
> +
> + if (dax_holder(dev_dax->dax_dev)) {
> + dev_dbg(dev,
> + "%s: blocking unbind due to active holder\n",
> + __func__);
> + rc = -EBUSY;
> + goto out;
> + }
> + device_release_driver(dev);
> + rc = count;
> + }
> +out:
> + put_device(dev);
> + return rc;
> +}
> +static DRIVER_ATTR_WO(unbind);
> +
> +static ssize_t bind_store(struct device_driver *drv, const char *buf,
> + size_t count)
> +{
> + struct device *dev;
> + int rc = -ENODEV;
> +
> + dev = bus_find_device_by_name(&dax_bus_type, NULL, buf);
Use __free magic here as well..
> + if (dev) {
> + rc = device_driver_attach(drv, dev);
> + if (!rc)
> + rc = count;
then this can be
if (rc)
return rc;
return count;
> + }
> + put_device(dev);
> + return rc;
> +}
> +static DRIVER_ATTR_WO(bind);
> +
> static struct attribute *dax_drv_attrs[] = {
> &driver_attr_new_id.attr,
> &driver_attr_remove_id.attr,
> + &driver_attr_bind.attr,
> + &driver_attr_unbind.attr,
> NULL,
> };
> ATTRIBUTE_GROUPS(dax_drv);
> @@ -1591,6 +1643,7 @@ int __dax_driver_register(struct dax_device_driver *dax_drv,
> drv->name = mod_name;
> drv->mod_name = mod_name;
> drv->bus = &dax_bus_type;
> + drv->suppress_bind_attrs = true;
>
> return driver_register(drv);
> }
* Re: [PATCH V3 10/21] famfs_fuse: Kconfig
2026-01-07 15:33 ` [PATCH V3 10/21] famfs_fuse: Kconfig John Groves
@ 2026-01-08 12:36 ` Jonathan Cameron
2026-01-12 16:46 ` John Groves
0 siblings, 1 reply; 74+ messages in thread
From: Jonathan Cameron @ 2026-01-08 12:36 UTC (permalink / raw)
To: John Groves
Cc: Miklos Szeredi, Dan Williams, Bernd Schubert, Alison Schofield,
John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
Matthew Wilcox, Jan Kara, Alexander Viro, David Hildenbrand,
Christian Brauner, Darrick J . Wong, Randy Dunlap, Jeff Layton,
Amir Goldstein, Stefan Hajnoczi, Joanne Koong, Josef Bacik,
Bagas Sanjaya, Chen Linxuan, James Morse, Fuad Tabba,
Sean Christopherson, Shivank Garg, Ackerley Tng, Gregory Price,
Aravind Ramesh, Ajay Joshi, venkataravis, linux-doc, linux-kernel,
nvdimm, linux-cxl, linux-fsdevel
On Wed, 7 Jan 2026 09:33:19 -0600
John Groves <John@Groves.net> wrote:
> Add FUSE_FAMFS_DAX config parameter, to control compilation of famfs
> within fuse.
>
> Signed-off-by: John Groves <john@groves.net>
A separate commit for this doesn't obviously add anything over combining
it with the first place the CONFIG_xxx is used.
Maybe it's a convention for fs/fuse though. If it is, ignore me.
> ---
> fs/fuse/Kconfig | 14 ++++++++++++++
> 1 file changed, 14 insertions(+)
>
> diff --git a/fs/fuse/Kconfig b/fs/fuse/Kconfig
> index 3a4ae632c94a..3b6d3121fe40 100644
> --- a/fs/fuse/Kconfig
> +++ b/fs/fuse/Kconfig
> @@ -76,3 +76,17 @@ config FUSE_IO_URING
>
> If you want to allow fuse server/client communication through io-uring,
> answer Y
> +
> +config FUSE_FAMFS_DAX
> + bool "FUSE support for fs-dax filesystems backed by devdax"
> + depends on FUSE_FS
> + depends on DEV_DAX
> + default FUSE_FS
> + select DEV_DAX_FS
> + help
> + This enables the fabric-attached memory file system (famfs),
> + which enables formatting devdax memory as a file system. Famfs
> + is primarily intended for scale-out shared access to
> + disaggregated memory.
> +
> + To enable famfs or other fuse/fs-dax file systems, answer Y
* Re: [PATCH V3 14/21] famfs_fuse: Plumb the GET_FMAP message/response
2026-01-07 15:33 ` [PATCH V3 14/21] famfs_fuse: Plumb the GET_FMAP message/response John Groves
@ 2026-01-08 12:49 ` Jonathan Cameron
2026-01-09 2:12 ` John Groves
0 siblings, 1 reply; 74+ messages in thread
From: Jonathan Cameron @ 2026-01-08 12:49 UTC (permalink / raw)
To: John Groves
Cc: Miklos Szeredi, Dan Williams, Bernd Schubert, Alison Schofield,
John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
Matthew Wilcox, Jan Kara, Alexander Viro, David Hildenbrand,
Christian Brauner, Darrick J . Wong, Randy Dunlap, Jeff Layton,
Amir Goldstein, Stefan Hajnoczi, Joanne Koong, Josef Bacik,
Bagas Sanjaya, Chen Linxuan, James Morse, Fuad Tabba,
Sean Christopherson, Shivank Garg, Ackerley Tng, Gregory Price,
Aravind Ramesh, Ajay Joshi, venkataravis, linux-doc, linux-kernel,
nvdimm, linux-cxl, linux-fsdevel
On Wed, 7 Jan 2026 09:33:23 -0600
John Groves <John@Groves.net> wrote:
> Upon completion of an OPEN, if we're in famfs-mode we do a GET_FMAP to
> retrieve and cache up the file-to-dax map in the kernel. If this
> succeeds, read/write/mmap are resolved direct-to-dax with no upcalls.
>
> Signed-off-by: John Groves <john@groves.net>
A few things inline.
J
> diff --git a/fs/fuse/famfs.c b/fs/fuse/famfs.c
> new file mode 100644
> index 000000000000..0f7e3f00e1e7
> --- /dev/null
> +++ b/fs/fuse/famfs.c
> @@ -0,0 +1,74 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * famfs - dax file system for shared fabric-attached memory
> + *
> + * Copyright 2023-2025 Micron Technology, Inc.
> + *
> + * This file system, originally based on ramfs plus the dax support from xfs,
> + * is intended to allow multiple host systems to mount a common file system
> + * view of dax files that map to shared memory.
> + */
> +
> +#include <linux/fs.h>
> +#include <linux/mm.h>
> +#include <linux/dax.h>
> +#include <linux/iomap.h>
> +#include <linux/path.h>
> +#include <linux/namei.h>
> +#include <linux/string.h>
> +
> +#include "fuse_i.h"
> +
> +
> +#define FMAP_BUFSIZE PAGE_SIZE
> +
> +int
> +fuse_get_fmap(struct fuse_mount *fm, struct inode *inode)
> +{
> + struct fuse_inode *fi = get_fuse_inode(inode);
> + size_t fmap_bufsize = FMAP_BUFSIZE;
> + u64 nodeid = get_node_id(inode);
> + ssize_t fmap_size;
> + void *fmap_buf;
> + int rc;
> +
> + FUSE_ARGS(args);
> +
> + /* Don't retrieve if we already have the famfs metadata */
> + if (fi->famfs_meta)
> + return 0;
> +
> + fmap_buf = kcalloc(1, FMAP_BUFSIZE, GFP_KERNEL);
If there is only ever 1, does kcalloc() make sense over kzalloc()?
> + if (!fmap_buf)
> + return -EIO;
> +
> + args.opcode = FUSE_GET_FMAP;
> + args.nodeid = nodeid;
> +
> + /* Variable-sized output buffer
> + * this causes fuse_simple_request() to return the size of the
> + * output payload
> + */
> + args.out_argvar = true;
> + args.out_numargs = 1;
> + args.out_args[0].size = fmap_bufsize;
> + args.out_args[0].value = fmap_buf;
> +
> + /* Send GET_FMAP command */
> + rc = fuse_simple_request(fm, &args);
> + if (rc < 0) {
> + pr_err("%s: err=%d from fuse_simple_request()\n",
> + __func__, rc);
Leaks the fmap_buf? Maybe use a __free() so no need to keep track of that.
> + return rc;
> + }
> + fmap_size = rc;
> +
> + /* We retrieved the "fmap" (the file's map to memory), but
> + * we haven't used it yet. A call to famfs_file_init_dax() will be added
> + * here in a subsequent patch, when we add the ability to attach
> + * fmaps to files.
> + */
> +
> + kfree(fmap_buf);
> + return 0;
> +}
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index 84d0ee2a501d..691c7850cf4e 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -223,6 +223,14 @@ struct fuse_inode {
>
> +static inline struct fuse_backing *famfs_meta_set(struct fuse_inode *fi,
> + void *meta)
> +{
> +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> + return xchg(&fi->famfs_meta, meta);
> +#else
> + return NULL;
> +#endif
> +}
> +
> +static inline void famfs_meta_free(struct fuse_inode *fi)
> +{
> + /* Stub will be connected in a subsequent commit */
> +}
> +
> +static inline int fuse_file_famfs(struct fuse_inode *fi)
> +{
> +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> + return (READ_ONCE(fi->famfs_meta) != NULL);
> +#else
> + return 0;
> +#endif
> +}
> +
> +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> +int fuse_get_fmap(struct fuse_mount *fm, struct inode *inode);
> +#else
> +static inline int
> +fuse_get_fmap(struct fuse_mount *fm, struct inode *inode)
> +{
> + return 0;
> +}
> +#endif
I'd do a single block under one if IS_ENABLED() and then use an else
for the stubs. Should end up more readable.
Jonathan
* Re: [PATCH V3 15/21] famfs_fuse: Create files with famfs fmaps
2026-01-07 15:33 ` [PATCH V3 15/21] famfs_fuse: Create files with famfs fmaps John Groves
2026-01-07 21:30 ` John Groves
@ 2026-01-08 13:14 ` Jonathan Cameron
2026-01-09 14:30 ` John Groves
1 sibling, 1 reply; 74+ messages in thread
From: Jonathan Cameron @ 2026-01-08 13:14 UTC (permalink / raw)
To: John Groves
Cc: Miklos Szeredi, Dan Williams, Bernd Schubert, Alison Schofield,
John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
Matthew Wilcox, Jan Kara, Alexander Viro, David Hildenbrand,
Christian Brauner, Darrick J . Wong, Randy Dunlap, Jeff Layton,
Amir Goldstein, Stefan Hajnoczi, Joanne Koong, Josef Bacik,
Bagas Sanjaya, Chen Linxuan, James Morse, Fuad Tabba,
Sean Christopherson, Shivank Garg, Ackerley Tng, Gregory Price,
Aravind Ramesh, Ajay Joshi, venkataravis, linux-doc, linux-kernel,
nvdimm, linux-cxl, linux-fsdevel
On Wed, 7 Jan 2026 09:33:24 -0600
John Groves <John@Groves.net> wrote:
> On completion of GET_FMAP message/response, set up the full famfs
> metadata such that it's possible to handle read/write/mmap directly to
> dax. Note that the devdax_iomap plumbing is not in yet...
>
> * Add famfs_kfmap.h: in-memory structures for resolving famfs file maps
> (fmaps) to dax.
> * famfs.c: allocate, initialize and free fmaps
> * inode.c: only allow famfs mode if the fuse server has CAP_SYS_RAWIO
> * Update MAINTAINERS for the new files.
>
> Signed-off-by: John Groves <john@groves.net>
> diff --git a/fs/fuse/famfs.c b/fs/fuse/famfs.c
> index 0f7e3f00e1e7..2aabd1d589fd 100644
> --- a/fs/fuse/famfs.c
> +++ b/fs/fuse/famfs.c
> @@ -17,9 +17,355 @@
> #include <linux/namei.h>
> #include <linux/string.h>
>
> +#include "famfs_kfmap.h"
> #include "fuse_i.h"
>
>
> +/***************************************************************************/
Who doesn't like stars? Why have them here?
> +
> +void
> +__famfs_meta_free(void *famfs_meta)
Maybe a local convention, but if not, put this on one line.
Same for other cases.
> +{
> + struct famfs_file_meta *fmap = famfs_meta;
> +
> + if (!fmap)
> + return;
> +
> + if (fmap) {
Well that's never going to fail given 2 lines above.
> + switch (fmap->fm_extent_type) {
> + case SIMPLE_DAX_EXTENT:
> + kfree(fmap->se);
> + break;
> + case INTERLEAVED_EXTENT:
> + if (fmap->ie)
> + kfree(fmap->ie->ie_strips);
> +
> + kfree(fmap->ie);
> + break;
> + default:
> + pr_err("%s: invalid fmap type\n", __func__);
> + break;
> + }
> + }
> + kfree(fmap);
> +}
> +/**
> + * famfs_fuse_meta_alloc() - Allocate famfs file metadata
> + * @metap: Pointer to an mcache_map_meta pointer
> + * @ext_count: The number of extents needed
Run kernel-doc over the file, as those aren't the parameters...
> + *
> + * Returns: 0=success
> + * -errno=failure
> + */
> +static int
> +famfs_fuse_meta_alloc(
> + void *fmap_buf,
> + size_t fmap_buf_size,
> + struct famfs_file_meta **metap)
> +{
> + struct famfs_file_meta *meta = NULL;
> + struct fuse_famfs_fmap_header *fmh;
> + size_t extent_total = 0;
> + size_t next_offset = 0;
> + int errs = 0;
> + int i, j;
> + int rc;
> +
> + fmh = (struct fuse_famfs_fmap_header *)fmap_buf;
void * so cast not needed and hence just assign it at the
declaration.
> +
> + /* Move past fmh in fmap_buf */
> + next_offset += sizeof(*fmh);
> + if (next_offset > fmap_buf_size) {
> + pr_err("%s:%d: fmap_buf underflow offset/size %ld/%ld\n",
> + __func__, __LINE__, next_offset, fmap_buf_size);
> + return -EINVAL;
> + }
> +
> + if (fmh->nextents < 1) {
> + pr_err("%s: nextents %d < 1\n", __func__, fmh->nextents);
> + return -EINVAL;
> + }
> +
> + if (fmh->nextents > FUSE_FAMFS_MAX_EXTENTS) {
> + pr_err("%s: nextents %d > max (%d) 1\n",
> + __func__, fmh->nextents, FUSE_FAMFS_MAX_EXTENTS);
> + return -E2BIG;
> + }
> +
> + meta = kzalloc(sizeof(*meta), GFP_KERNEL);
Maybe sprinkle some __free magic on this then you can return in
all the goto error_out places which to me makes this more readable.
> + if (!meta)
> + return -ENOMEM;
> +
> + meta->error = false;
> + meta->file_type = fmh->file_type;
> + meta->file_size = fmh->file_size;
> + meta->fm_extent_type = fmh->ext_type;
> +
> + switch (fmh->ext_type) {
> + case FUSE_FAMFS_EXT_SIMPLE: {
> + struct fuse_famfs_simple_ext *se_in;
> +
> + se_in = (struct fuse_famfs_simple_ext *)(fmap_buf + next_offset);
void * so no need for cast. Though you could keep the cast but apply it to
fmh + 1 to take advantage of that type.
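A standalone illustration of both points (no cast when assigning from void *, and hdr + 1 to reach the payload) might look like the following. The layout and names (struct model_hdr, struct model_ext, model_extents()) are hypothetical and are not the fuse uapi structures. The division-based bounds check is an extra precaution chosen for this sketch, not something taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical message layout: a header followed by nextents extents. */
struct model_hdr {
	uint32_t nextents;
	uint32_t reserved;
};

struct model_ext {
	uint64_t offset;
	uint64_t len;
};

/*
 * Return a pointer to the extent array that follows the header, or NULL if
 * the buffer is too small to hold the header plus nextents extents.
 */
static const struct model_ext *model_extents(const void *buf, size_t buf_size)
{
	const struct model_hdr *hdr = buf;	/* void *: no cast needed */

	if (buf_size < sizeof(*hdr))
		return NULL;

	/* Divide rather than multiply so the check itself cannot overflow. */
	if (hdr->nextents > (buf_size - sizeof(*hdr)) / sizeof(struct model_ext))
		return NULL;

	/* hdr + 1 points just past the header, scaled by the header's type. */
	return (const struct model_ext *)(hdr + 1);
}
```

Because hdr + 1 is computed from the header's own type, the payload pointer stays correct if the header grows, whereas a separately tracked byte offset has to be kept in sync by hand.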
> +
> + /* Move past simple extents */
> + next_offset += fmh->nextents * sizeof(*se_in);
> + if (next_offset > fmap_buf_size) {
> + pr_err("%s:%d: fmap_buf underflow offset/size %ld/%ld\n",
> + __func__, __LINE__, next_offset, fmap_buf_size);
> + rc = -EINVAL;
> + goto errout;
> + }
> +
> + meta->fm_nextents = fmh->nextents;
> +
> + meta->se = kcalloc(meta->fm_nextents, sizeof(*(meta->se)),
> + GFP_KERNEL);
> + if (!meta->se) {
> + rc = -ENOMEM;
> + goto errout;
> + }
> +
> + if ((meta->fm_nextents > FUSE_FAMFS_MAX_EXTENTS) ||
> + (meta->fm_nextents < 1)) {
> + rc = -EINVAL;
> + goto errout;
> + }
> +
> + for (i = 0; i < fmh->nextents; i++) {
> + meta->se[i].dev_index = se_in[i].se_devindex;
> + meta->se[i].ext_offset = se_in[i].se_offset;
> + meta->se[i].ext_len = se_in[i].se_len;
> +
> + /* Record bitmap of referenced daxdev indices */
> + meta->dev_bitmap |= (1 << meta->se[i].dev_index);
> +
> + errs += famfs_check_ext_alignment(&meta->se[i]);
> +
> + extent_total += meta->se[i].ext_len;
> + }
> + break;
> + }
> +
> + case FUSE_FAMFS_EXT_INTERLEAVE: {
> + s64 size_remainder = meta->file_size;
> + struct fuse_famfs_iext *ie_in;
> + int niext = fmh->nextents;
> +
> + meta->fm_niext = niext;
> +
> + /* Allocate interleaved extent */
> + meta->ie = kcalloc(niext, sizeof(*(meta->ie)), GFP_KERNEL);
> + if (!meta->ie) {
> + rc = -ENOMEM;
> + goto errout;
> + }
> +
> + /*
> + * Each interleaved extent has a simple extent list of strips.
> + * Outer loop is over separate interleaved extents
> + */
> + for (i = 0; i < niext; i++) {
> + u64 nstrips;
> + struct fuse_famfs_simple_ext *sie_in;
> +
> + /* ie_in = one interleaved extent in fmap_buf */
> + ie_in = (struct fuse_famfs_iext *)
> + (fmap_buf + next_offset);
void * so no cast needed.
> +
> + /* Move past one interleaved extent header in fmap_buf */
> + next_offset += sizeof(*ie_in);
> + if (next_offset > fmap_buf_size) {
> + pr_err("%s:%d: fmap_buf underflow offset/size %ld/%ld\n",
> + __func__, __LINE__, next_offset,
> + fmap_buf_size);
> + rc = -EINVAL;
> + goto errout;
> + }
> +
> + nstrips = ie_in->ie_nstrips;
> + meta->ie[i].fie_chunk_size = ie_in->ie_chunk_size;
> + meta->ie[i].fie_nstrips = ie_in->ie_nstrips;
> + meta->ie[i].fie_nbytes = ie_in->ie_nbytes;
> +
> + if (!meta->ie[i].fie_nbytes) {
> + pr_err("%s: zero-length interleave!\n",
> + __func__);
> + rc = -EINVAL;
> + goto errout;
> + }
> +
> + /* sie_in = the strip extents in fmap_buf */
> + sie_in = (struct fuse_famfs_simple_ext *)
> + (fmap_buf + next_offset);
no cast needed.
> +
> + /* Move past strip extents in fmap_buf */
> + next_offset += nstrips * sizeof(*sie_in);
> + if (next_offset > fmap_buf_size) {
> + pr_err("%s:%d: fmap_buf underflow offset/size %ld/%ld\n",
> + __func__, __LINE__, next_offset,
> + fmap_buf_size);
> + rc = -EINVAL;
> + goto errout;
> + }
> +
> + if ((nstrips > FUSE_FAMFS_MAX_STRIPS) || (nstrips < 1)) {
> + pr_err("%s: invalid nstrips=%lld (max=%d)\n",
> + __func__, nstrips,
> + FUSE_FAMFS_MAX_STRIPS);
> + errs++;
> + }
> +
> + /* Allocate strip extent array */
> + meta->ie[i].ie_strips = kcalloc(ie_in->ie_nstrips,
> + sizeof(meta->ie[i].ie_strips[0]),
> + GFP_KERNEL);
Align all lines after 1st one to same point.
...
> +
> +/**
> + * famfs_file_init_dax() - init famfs dax file metadata
> + *
> + * @fm: fuse_mount
> + * @inode: the inode
> + * @fmap_buf: fmap response message
> + * @fmap_size: Size of the fmap message
> + *
> + * Initialize famfs metadata for a file, based on the contents of the GET_FMAP
> + * response
> + *
> + * Return: 0=success
> + * -errno=failure
> + */
> +int
> +famfs_file_init_dax(
> + struct fuse_mount *fm,
> + struct inode *inode,
> + void *fmap_buf,
> + size_t fmap_size)
> +{
> + struct fuse_inode *fi = get_fuse_inode(inode);
> + struct famfs_file_meta *meta = NULL;
> + int rc = 0;
Always set before use.
> +
> + if (fi->famfs_meta) {
> + pr_notice("%s: i_no=%ld fmap_size=%ld ALREADY INITIALIZED\n",
> + __func__,
> + inode->i_ino, fmap_size);
> + return 0;
> + }
> +
> + rc = famfs_fuse_meta_alloc(fmap_buf, fmap_size, &meta);
> + if (rc)
> + goto errout;
> +
> + /* Publish the famfs metadata on fi->famfs_meta */
> + inode_lock(inode);
> + if (fi->famfs_meta) {
> + rc = -EEXIST; /* file already has famfs metadata */
> + } else {
> + if (famfs_meta_set(fi, meta) != NULL) {
> + pr_debug("%s: file already had metadata\n", __func__);
> + __famfs_meta_free(meta);
> + /* rc is 0 - the file is valid */
> + goto unlock_out;
> + }
> + i_size_write(inode, meta->file_size);
> + inode->i_flags |= S_DAX;
> + }
> + unlock_out:
> + inode_unlock(inode);
> +
> +errout:
> + if (rc)
> + __famfs_meta_free(meta);
For readability I'd split the good and bad exit paths, even if the unlock
needs to happen in two places.
> +
> + return rc;
> +}
> +
> diff --git a/fs/fuse/famfs_kfmap.h b/fs/fuse/famfs_kfmap.h
> new file mode 100644
> index 000000000000..058645cb10a1
> --- /dev/null
> +++ b/fs/fuse/famfs_kfmap.h
> @@ -0,0 +1,67 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * famfs - dax file system for shared fabric-attached memory
> + *
> + * Copyright 2023-2025 Micron Technology, Inc.
> + */
> +#ifndef FAMFS_KFMAP_H
> +#define FAMFS_KFMAP_H
> +
> +/*
> + * The structures below are the in-memory metadata format for famfs files.
> + * Metadata retrieved via the GET_FMAP response is converted to this format
> + * for use in resolving file mapping faults.
bonus space after in
> + *
> + * The GET_FMAP response contains the same information, but in a more
> + * message-and-versioning-friendly format. Those structs can be found in the
> + * famfs section of include/uapi/linux/fuse.h (aka fuse_kernel.h in libfuse)
> + */
> +/*
> + * Each famfs dax file has this hanging from its fuse_inode->famfs_meta
> + */
> +struct famfs_file_meta {
> + bool error;
> + enum famfs_file_type file_type;
> + size_t file_size;
> + enum famfs_extent_type fm_extent_type;
> + u64 dev_bitmap; /* bitmap of referenced daxdevs by index */
> + union { /* This will make code a bit more readable */
Not sure what the comment is for. I'd drop it.
> + struct {
> + size_t fm_nextents;
> + struct famfs_meta_simple_ext *se;
> + };
> + struct {
> + size_t fm_niext;
> + struct famfs_meta_interleaved_ext *ie;
> + };
> + };
> +};
> +
> +#endif /* FAMFS_KFMAP_H */
* Re: [PATCH V3 01/21] dax: move dax_pgoff_to_phys from [drivers/dax/] device.c to bus.c
2026-01-08 10:43 ` Jonathan Cameron
@ 2026-01-08 13:25 ` John Groves
2026-01-08 15:20 ` Jonathan Cameron
0 siblings, 1 reply; 74+ messages in thread
From: John Groves @ 2026-01-08 13:25 UTC (permalink / raw)
To: Jonathan Cameron
Cc: Miklos Szeredi, Dan Williams, Bernd Schubert, Alison Schofield,
John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
Matthew Wilcox, Jan Kara, Alexander Viro, David Hildenbrand,
Christian Brauner, Darrick J . Wong, Randy Dunlap, Jeff Layton,
Amir Goldstein, Stefan Hajnoczi, Joanne Koong, Josef Bacik,
Bagas Sanjaya, Chen Linxuan, James Morse, Fuad Tabba,
Sean Christopherson, Shivank Garg, Ackerley Tng, Gregory Price,
Aravind Ramesh, Ajay Joshi, venkataravis, linux-doc, linux-kernel,
nvdimm, linux-cxl, linux-fsdevel
On 26/01/08 10:43AM, Jonathan Cameron wrote:
> On Wed, 7 Jan 2026 09:33:10 -0600
> John Groves <John@Groves.net> wrote:
>
> > This function will be used by both device.c and fsdev.c, but both are
> > loadable modules. Moving to bus.c puts it in core and makes it available
> > to both.
> >
> > No code changes - just relocated.
> >
> > Signed-off-by: John Groves <john@groves.net>
> Hi John,
>
> I don't know the code well enough to offer an opinion on whether this
> move causes any issues or if this is the best location, so review is superficial
> stuff only.
>
> Jonathan
>
> > ---
> > drivers/dax/bus.c | 27 +++++++++++++++++++++++++++
> > drivers/dax/device.c | 23 -----------------------
> > 2 files changed, 27 insertions(+), 23 deletions(-)
> >
> > diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> > index fde29e0ad68b..a2f9a3cc30a5 100644
> > --- a/drivers/dax/bus.c
> > +++ b/drivers/dax/bus.c
> > @@ -7,6 +7,9 @@
> > #include <linux/slab.h>
> > #include <linux/dax.h>
> > #include <linux/io.h>
> > +#include <linux/backing-dev.h>
>
> I'm not immediately spotting why this one. Maybe should be in a different
> patch?
>
> > +#include <linux/range.h>
> > +#include <linux/uio.h>
>
> Why this one?
Good eye, thanks. These must have leaked from some of the many dead ends
that I tried before coming up with this approach.
I've dropped all new includes and it still builds :D
>
> Style wise, dax seems to use reverse xmas tree for includes, so
> this should keep to that.
>
> > #include "dax-private.h"
> > #include "bus.h"
> >
> > @@ -1417,6 +1420,30 @@ static const struct device_type dev_dax_type = {
> > .groups = dax_attribute_groups,
> > };
> >
> > +/* see "strong" declaration in tools/testing/nvdimm/dax-dev.c */
> Bonus space before that */
> Curiously that wasn't there in the original.
Removed.
[ ... ]
Thanks,
John
* Re: [PATCH V3 02/21] dax: add fsdev.c driver for fs-dax on character dax
2026-01-08 11:31 ` Jonathan Cameron
@ 2026-01-08 14:32 ` John Groves
2026-01-08 15:12 ` John Groves
1 sibling, 0 replies; 74+ messages in thread
From: John Groves @ 2026-01-08 14:32 UTC (permalink / raw)
To: Jonathan Cameron
Cc: Miklos Szeredi, Dan Williams, Bernd Schubert, Alison Schofield,
John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
Matthew Wilcox, Jan Kara, Alexander Viro, David Hildenbrand,
Christian Brauner, Darrick J . Wong, Randy Dunlap, Jeff Layton,
Amir Goldstein, Stefan Hajnoczi, Joanne Koong, Josef Bacik,
Bagas Sanjaya, Chen Linxuan, James Morse, Fuad Tabba,
Sean Christopherson, Shivank Garg, Ackerley Tng, Gregory Price,
Aravind Ramesh, Ajay Joshi, venkataravis, linux-doc, linux-kernel,
nvdimm, linux-cxl, linux-fsdevel
On 26/01/08 11:31AM, Jonathan Cameron wrote:
> On Wed, 7 Jan 2026 09:33:11 -0600
> John Groves <John@Groves.net> wrote:
>
> > The new fsdev driver provides pages/folios initialized compatibly with
> > fsdax - normal rather than devdax-style refcounting, and starting out
> > with order-0 folios.
> >
> > When fsdev binds to a daxdev, it is usually (always?) switching from the
> > devdax mode (device.c), which pre-initializes compound folios according
> > to its alignment. Fsdev uses fsdev_clear_folio_state() to switch the
> > folios into a fsdax-compatible state.
> >
> > A side effect of this is that raw mmap doesn't (can't?) work on an fsdev
> > dax instance. Accordingly, the fsdev driver does not provide raw mmap -
> > devices must be put in 'devdax' mode (drivers/dax/device.c) to get raw
> > mmap capability.
> >
> > This commit is just the framework, which remaps pages/folios compatibly
> > with fsdax.
> >
> > Enabling dax changes:
> >
> > * bus.h: add DAXDRV_FSDEV_TYPE driver type
> > * bus.c: allow DAXDRV_FSDEV_TYPE drivers to bind to daxdevs
> > * dax.h: prototype inode_dax(), which fsdev needs
> >
> > Suggested-by: Dan Williams <dan.j.williams@intel.com>
> > Suggested-by: Gregory Price <gourry@gourry.net>
> > Signed-off-by: John Groves <john@groves.net>
>
> > diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
> > index d656e4c0eb84..491325d914a8 100644
> > --- a/drivers/dax/Kconfig
> > +++ b/drivers/dax/Kconfig
> > @@ -78,4 +78,21 @@ config DEV_DAX_KMEM
> >
> > Say N if unsure.
> >
> > +config DEV_DAX_FS
> > + tristate "FSDEV DAX: fs-dax compatible device driver"
> > + depends on DEV_DAX
> > + default DEV_DAX
>
> What's the logic for the default? Generally I'd not expect a
> default for something new like this (so default of default == no)
>
> > + help
> > + Support a device-dax driver mode that is compatible with fs-dax
>
> ...
>
>
>
> > struct dax_device_driver {
> > diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
> > new file mode 100644
> > index 000000000000..2a3249d1529c
> > --- /dev/null
> > +++ b/drivers/dax/fsdev.c
> > @@ -0,0 +1,276 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/* Copyright(c) 2026 Micron Technology, Inc. */
> > +#include <linux/memremap.h>
> > +#include <linux/pagemap.h>
> > +#include <linux/module.h>
> > +#include <linux/device.h>
> > +#include <linux/cdev.h>
> > +#include <linux/slab.h>
> > +#include <linux/dax.h>
> > +#include <linux/fs.h>
> > +#include <linux/mm.h>
> > +#include "dax-private.h"
> > +#include "bus.h"
>
> ...
>
> > +static void fsdev_cdev_del(void *cdev)
> > +{
> > + cdev_del(cdev);
> > +}
> > +
> > +static void fsdev_kill(void *dev_dax)
> > +{
> > + kill_dev_dax(dev_dax);
> > +}
>
> ...
>
> > +/*
> > + * Clear any stale folio state from pages in the given range.
> > + * This is necessary because device_dax pre-initializes compound folios
> > + * based on vmemmap_shift, and that state may persist after driver unbind.
>
> What's the argument for not cleaning these out in the unbind path for device_dax?
> I can see that it might be an optimization if some other code path blindly
> overwrites all this state.
I prefer this because it doesn't rely on some other module having done the
right thing. Dax maintainers might have thoughts too though.
>
> > + * Since fsdev_dax uses MEMORY_DEVICE_FS_DAX without vmemmap_shift, fs-dax
> > + * expects to find clean order-0 folios that it can build into compound
> > + * folios on demand.
> > + *
> > + * At probe time, no filesystem should be mounted yet, so all mappings
> > + * are stale and must be cleared along with compound state.
> > + */
> > +static void fsdev_clear_folio_state(struct dev_dax *dev_dax)
> > +{
> > + int i;
>
> It's becoming increasingly common to declare loop variables as
> for (int i = 0; i <...
>
> and given that saves us a few lines here it seems worth doing.
Done thanks
>
> > +
> > + for (i = 0; i < dev_dax->nr_range; i++) {
> > + struct range *range = &dev_dax->ranges[i].range;
> > + unsigned long pfn, end_pfn;
> > +
> > + pfn = PHYS_PFN(range->start);
> > + end_pfn = PHYS_PFN(range->end) + 1;
>
> Might as well do
> unsigned long pfn = PHYS_PFN(range->start);
> unsigned long end_pfn = PHYS_PFN(range->end) + 1;
Sounds good, done
> > +
> > + while (pfn < end_pfn) {
> > + struct page *page = pfn_to_page(pfn);
> > + struct folio *folio = (struct folio *)page;
> > + struct dev_pagemap *pgmap = page_pgmap(page);
> > + int order = folio_order(folio);
> > +
> > + /*
> > + * Clear any stale mapping pointer. At probe time,
> > + * no filesystem is mounted, so any mapping is stale.
> > + */
> > + folio->mapping = NULL;
> > + folio->share = 0;
> > +
> > + if (order > 0) {
> > + int j;
> > +
> > + folio_reset_order(folio);
> > + for (j = 0; j < (1UL << order); j++) {
> > + struct page *p = page + j;
> > +
> > + ClearPageHead(p);
> > + clear_compound_head(p);
> > + ((struct folio *)p)->mapping = NULL;
>
> This code block is very similar to a chunk in dax_folio_put() in fs/dax.c
>
> Can we create a helper for both to use?
>
> I note that uses a local struct folio *new_folio to avoid multiple casts.
> I'd do similar here even if it's a long line.
>
> If not possible to use a common helper, it is probably still worth
> having a helper here for the stuff in the while loop just to reduce indent
> and improve readability a little.
Good catch! You shall have a Suggested-by in the next version, which will
inject a commit right before this that factors out dax_folio_reset_order()
from dax_folio_put(). Then fsdev_clear_folio_state() will also call that.
>
> > + ((struct folio *)p)->share = 0;
> > + ((struct folio *)p)->pgmap = pgmap;
> > + }
> > + pfn += (1UL << order);
> > + } else {
> > + folio->pgmap = pgmap;
> > + pfn++;
> > + }
> > + }
> > + }
> > +}
> > +
> > +static int fsdev_open(struct inode *inode, struct file *filp)
> > +{
> > + struct dax_device *dax_dev = inode_dax(inode);
> > + struct dev_dax *dev_dax = dax_get_private(dax_dev);
> > +
> > + dev_dbg(&dev_dax->dev, "trace\n");
>
> Hmm. This is somewhat odd, but I see dax/device.c does
> the same thing and I guess that's because you are using
> dynamic debug with function names turned on to provide the
> 'real' information.
Actually I just have it from the gut-and-repurpose of device.c.
Dropping from fsdev.c as I'm not using it.
>
>
>
> > + filp->private_data = dev_dax;
> > +
> > + return 0;
> > +}
>
> > +static int fsdev_dax_probe(struct dev_dax *dev_dax)
> > +{
> > + struct dax_device *dax_dev = dev_dax->dax_dev;
> > + struct device *dev = &dev_dax->dev;
> > + struct dev_pagemap *pgmap;
> > + u64 data_offset = 0;
> > + struct inode *inode;
> > + struct cdev *cdev;
> > + void *addr;
> > + int rc, i;
> > +
>
> A bunch of this is cut and paste from dax/device.c
> If it carries on looking like this, can we have a helper module that
> both drivers use with the common code in it? That would make the
> difference more obvious as well.
Makes sense. I'll wait for thoughts from the dax people before
flipping bits on this though.
>
> > + if (static_dev_dax(dev_dax)) {
> > + if (dev_dax->nr_range > 1) {
> > + dev_warn(dev,
> > + "static pgmap / multi-range device conflict\n");
> > + return -EINVAL;
> > + }
> > +
> > + pgmap = dev_dax->pgmap;
> > + } else {
> > + if (dev_dax->pgmap) {
> > + dev_warn(dev,
> > + "dynamic-dax with pre-populated page map\n");
> Unless dax maintainers are very fussy about 80 chars, I'd go long on these as it's
> only just over 80 chars on one line.
>
> Given you are failing probe, not sure why dev_warn() is considered sufficient.
> To me dev_err() seems more sensible. What you have matches dax/device.c though
> so maybe there is a sound reason.
I'm personally a bit fussy about 80 column code, being kinda old and favoring
80 column emacs windows :D - mulling it over.
>
> > + return -EINVAL;
> > + }
> > +
> > + pgmap = devm_kzalloc(dev,
> > + struct_size(pgmap, ranges, dev_dax->nr_range - 1),
> > + GFP_KERNEL);
> Pick an alignment style and stick to it. Either.
> pgmap = devm_kzalloc(dev,
> struct_size(pgmap, ranges, dev_dax->nr_range - 1),
> GFP_KERNEL);
>
> or go long for readability and do
> pgmap = devm_kzalloc(dev,
> struct_size(pgmap, ranges, dev_dax->nr_range - 1),
> GFP_KERNEL);
Will do something cleaner. This is the aforementioned 80 column curmudgeonliness
at work...
>
>
>
> > + if (!pgmap)
> > + return -ENOMEM;
> > +
> > + pgmap->nr_range = dev_dax->nr_range;
> > + dev_dax->pgmap = pgmap;
> > +
> > + for (i = 0; i < dev_dax->nr_range; i++) {
> > + struct range *range = &dev_dax->ranges[i].range;
> > +
> > + pgmap->ranges[i] = *range;
> > + }
> > + }
> > +
> > + for (i = 0; i < dev_dax->nr_range; i++) {
> > + struct range *range = &dev_dax->ranges[i].range;
> > +
> > + if (!devm_request_mem_region(dev, range->start,
> > + range_len(range), dev_name(dev))) {
> > + dev_warn(dev, "mapping%d: %#llx-%#llx could not reserve range\n",
> > + i, range->start, range->end);
> > + return -EBUSY;
> > + }
> > + }
> > +
> > + /*
> > + * FS-DAX compatible mode: Use MEMORY_DEVICE_FS_DAX type and
> > + * do NOT set vmemmap_shift. This leaves folios at order-0,
> > + * allowing fs-dax to dynamically create compound folios as needed
> > + * (similar to pmem behavior).
> > + */
> > + pgmap->type = MEMORY_DEVICE_FS_DAX;
> > + pgmap->ops = &fsdev_pagemap_ops;
> > + pgmap->owner = dev_dax;
> > +
> > + /*
> > + * CRITICAL DIFFERENCE from device.c:
> > + * We do NOT set vmemmap_shift here, even if align > PAGE_SIZE.
> > + * This ensures folios remain order-0 and are compatible with
> > + * fs-dax's folio management.
> > + */
> > +
> > + addr = devm_memremap_pages(dev, pgmap);
> > + if (IS_ERR(addr))
> > + return PTR_ERR(addr);
> > +
> > + /*
> > + * Clear any stale compound folio state left over from a previous
> > + * driver (e.g., device_dax with vmemmap_shift).
> > + */
> > + fsdev_clear_folio_state(dev_dax);
> > +
> > + /* Detect whether the data is at a non-zero offset into the memory */
> > + if (pgmap->range.start != dev_dax->ranges[0].range.start) {
> > + u64 phys = dev_dax->ranges[0].range.start;
> > + u64 pgmap_phys = dev_dax->pgmap[0].range.start;
> > +
> > + if (!WARN_ON(pgmap_phys > phys))
> > + data_offset = phys - pgmap_phys;
> > +
> > + pr_debug("%s: offset detected phys=%llx pgmap_phys=%llx offset=%llx\n",
> > + __func__, phys, pgmap_phys, data_offset);
> > + }
> > +
> > + inode = dax_inode(dax_dev);
> > + cdev = inode->i_cdev;
> > + cdev_init(cdev, &fsdev_fops);
> > + cdev->owner = dev->driver->owner;
> > + cdev_set_parent(cdev, &dev->kobj);
> > + rc = cdev_add(cdev, dev->devt, 1);
> > + if (rc)
> > + return rc;
> > +
> > + rc = devm_add_action_or_reset(dev, fsdev_cdev_del, cdev);
> > + if (rc)
> > + return rc;
> > +
> > + run_dax(dax_dev);
> > + return devm_add_action_or_reset(dev, fsdev_kill, dev_dax);
> > +}
> > +
> > +static struct dax_device_driver fsdev_dax_driver = {
> > + .probe = fsdev_dax_probe,
> > + .type = DAXDRV_FSDEV_TYPE,
> > +};
> > +
> > +static int __init dax_init(void)
> > +{
> > + return dax_driver_register(&fsdev_dax_driver);
> > +}
> > +
> > +static void __exit dax_exit(void)
> > +{
> > + dax_driver_unregister(&fsdev_dax_driver);
> > +}
> If these don't get more complex, maybe it's time for a dax specific define
> using module_driver()
I'll defer to the dax folks here
>
> > +
> > +MODULE_AUTHOR("John Groves");
> > +MODULE_DESCRIPTION("FS-DAX Device: fs-dax compatible devdax driver");
> > +MODULE_LICENSE("GPL");
> > +module_init(dax_init);
> > +module_exit(dax_exit);
> > +MODULE_ALIAS_DAX_DEVICE(0);
>
> Curious macro. Always has same parameter... Maybe ripe for just dropping the parameter?
>
>
Thanks!
John
* Re: [PATCH V3 16/21] famfs_fuse: GET_DAXDEV message and daxdev_table
2026-01-07 15:33 ` [PATCH V3 16/21] famfs_fuse: GET_DAXDEV message and daxdev_table John Groves
@ 2026-01-08 14:45 ` Jonathan Cameron
0 siblings, 0 replies; 74+ messages in thread
From: Jonathan Cameron @ 2026-01-08 14:45 UTC (permalink / raw)
To: John Groves
Cc: Miklos Szeredi, Dan Williams, Bernd Schubert, Alison Schofield,
John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
Matthew Wilcox, Jan Kara, Alexander Viro, David Hildenbrand,
Christian Brauner, Darrick J . Wong, Randy Dunlap, Jeff Layton,
Amir Goldstein, Stefan Hajnoczi, Joanne Koong, Josef Bacik,
Bagas Sanjaya, Chen Linxuan, James Morse, Fuad Tabba,
Sean Christopherson, Shivank Garg, Ackerley Tng, Gregory Price,
Aravind Ramesh, Ajay Joshi, venkataravis, linux-doc, linux-kernel,
nvdimm, linux-cxl, linux-fsdevel
On Wed, 7 Jan 2026 09:33:25 -0600
John Groves <John@Groves.net> wrote:
> * The new GET_DAXDEV message/response is added
> * The famfs.c:famfs_teardown() function is added as a primary teardown
> function for famfs.
> * The command is triggered by the update_daxdev_table() call, if there
> are any daxdevs in the subject fmap that are not represented in the
> daxdev_table yet.
> * fs/namei.c: export may_open_dev()
>
> Signed-off-by: John Groves <john@groves.net>
Hi John,
A few things inline
Thanks,
Jonathan
> ---
> fs/fuse/famfs.c | 236 ++++++++++++++++++++++++++++++++++++++
> fs/fuse/famfs_kfmap.h | 26 +++++
> fs/fuse/fuse_i.h | 13 ++-
> fs/fuse/inode.c | 4 +-
> fs/namei.c | 1 +
> include/uapi/linux/fuse.h | 20 ++++
> 6 files changed, 298 insertions(+), 2 deletions(-)
>
> diff --git a/fs/fuse/famfs.c b/fs/fuse/famfs.c
> index 2aabd1d589fd..b5cd1b5c1d6c 100644
> --- a/fs/fuse/famfs.c
> +++ b/fs/fuse/famfs.c
> @@ -20,6 +20,239 @@
> #include "famfs_kfmap.h"
> #include "fuse_i.h"
>
> +/*
> + * famfs_teardown()
> + *
> + * Deallocate famfs metadata for a fuse_conn
> + */
> +void
> +famfs_teardown(struct fuse_conn *fc)
> +{
> + struct famfs_dax_devlist *devlist = fc->dax_devlist;
> + int i;
> +
> + kfree(fc->shadow);
> +
> + fc->dax_devlist = NULL;
> +
> + if (!devlist)
> + return;
> +
> + if (!devlist->devlist)
I'm going to assume that if this is true, devlist->nslots == 0?
If so I'd skip this check and just let the rest of the code happen.
> + goto out;
> +
> + /* Close & release all the daxdevs in our table */
> + for (i = 0; i < devlist->nslots; i++) {
> + struct famfs_daxdev *dd = &devlist->devlist[i];
> +
> + if (!dd->valid)
> + continue;
> +
> + /* Release reference from dax_dev_get() */
> + if (dd->devp)
> + put_dax(dd->devp);
> +
> + kfree(dd->name);
> + }
> + kfree(devlist->devlist);
> +
> +out:
> + kfree(devlist);
> +}
> +/**
> + * famfs_fuse_get_daxdev() - Retrieve info for a DAX device from fuse server
> + *
> + * Send a GET_DAXDEV message to the fuse server to retrieve info on a
> + * dax device.
> + *
> + * @fm: fuse_mount
> + * @index: the index of the dax device; daxdevs are referred to by index
> + * in fmaps, and the server resolves the index to a particular daxdev
> + *
> + * Returns: 0=success
> + * -errno=failure
> + */
> +static int
> +famfs_fuse_get_daxdev(struct fuse_mount *fm, const u64 index)
> +{
> + struct fuse_daxdev_out daxdev_out = { 0 };
> + struct fuse_conn *fc = fm->fc;
> + struct famfs_daxdev *daxdev;
> + int err = 0;
Always set before use so no need to init.
> +
> + FUSE_ARGS(args);
> +
> + /* Store the daxdev in our table */
> + if (index >= fc->dax_devlist->nslots) {
> + pr_err("%s: index(%lld) > nslots(%d)\n",
> + __func__, index, fc->dax_devlist->nslots);
> + err = -EINVAL;
> + goto out;
I'd return here as nothing to do.
> + }
> +
> + args.opcode = FUSE_GET_DAXDEV;
> + args.nodeid = index;
> +
> + args.in_numargs = 0;
> +
> + args.out_numargs = 1;
> + args.out_args[0].size = sizeof(daxdev_out);
> + args.out_args[0].value = &daxdev_out;
> +
> + /* Send GET_DAXDEV command */
> + err = fuse_simple_request(fm, &args);
> + if (err) {
> + pr_err("%s: err=%d from fuse_simple_request()\n",
> + __func__, err);
> + /*
I'm not sure what the local comment style is, but be consistent about
whether there is a blank line or not.
> + * Error will be that the payload is smaller than FMAP_BUFSIZE,
> + * which is the max we can handle. Empty payload handled below.
> + */
> + goto out;
return here is probably simpler.
> + }
> +
> + down_write(&fc->famfs_devlist_sem);
> +
> + daxdev = &fc->dax_devlist->devlist[index];
> +
> + /* Abort if daxdev is now valid (race - another thread got it first) */
> + if (daxdev->valid) {
> + up_write(&fc->famfs_devlist_sem);
> + /* We already have a valid entry at this index */
> + pr_debug("%s: daxdev already known\n", __func__);
> + goto out;
> + }
> +
> + /* Verify that the dev is valid and can be opened and gets the devno */
> + err = famfs_verify_daxdev(daxdev_out.name, &daxdev->devno);
> + if (err) {
> + up_write(&fc->famfs_devlist_sem);
> + pr_err("%s: err=%d from famfs_verify_daxdev()\n", __func__, err);
> + goto out;
> + }
> +
> + /* This will fail if it's not a dax device */
> + daxdev->devp = dax_dev_get(daxdev->devno);
> + if (!daxdev->devp) {
> + up_write(&fc->famfs_devlist_sem);
Move the label before the up_write, so you don't need to do it in each
error case or use a guard()
> + pr_warn("%s: device %s not found or not dax\n",
> + __func__, daxdev_out.name);
> + err = -ENODEV;
> + goto out;
> + }
> +
> + daxdev->name = kstrdup(daxdev_out.name, GFP_KERNEL);
Can fail.
> + wmb(); /* all daxdev fields must be visible before marking it valid */
> + daxdev->valid = 1;
> +
> + up_write(&fc->famfs_devlist_sem);
> +
> +out:
> + return err;
> +}
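The review points on the function above (a single unlock site reached through a label placed before the unlock, and a string duplication whose failure is checked) can be sketched in userspace C. This is a minimal sketch under stated assumptions, not the famfs code: `fake_sem`, `slot`, and `slot_publish()` are hypothetical stand-ins for the rw_semaphore, a daxdev table entry, and the tail of famfs_fuse_get_daxdev(), and malloc()/memcpy() stand in for kstrdup().

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the rw_semaphore; it only counts lock depth. */
struct fake_sem { int held; };

static void fake_down_write(struct fake_sem *s) { s->held++; }
static void fake_up_write(struct fake_sem *s) { s->held--; }

/* Hypothetical stand-in for one daxdev table entry. */
struct slot {
	int valid;
	char *name;
};

/*
 * Populate @slot under the lock. Every exit taken after the lock is
 * acquired goes through out_unlock, so the unlock appears exactly once.
 * Returns 0 on success (or if the slot was already valid), -errno otherwise.
 */
static int slot_publish(struct fake_sem *sem, struct slot *slot,
			const char *name)
{
	char *copy;
	size_t n;
	int err = 0;

	assert(sem && slot);

	fake_down_write(sem);

	if (slot->valid)		/* another caller got here first */
		goto out_unlock;

	if (!name || !*name) {		/* stand-in for "device not found" */
		err = -ENODEV;
		goto out_unlock;
	}

	n = strlen(name) + 1;
	copy = malloc(n);		/* can fail, like kstrdup() */
	if (!copy) {
		err = -ENOMEM;
		goto out_unlock;
	}
	memcpy(copy, name, n);

	slot->name = copy;
	slot->valid = 1;		/* mark valid only after all fields are set */

out_unlock:
	fake_up_write(sem);
	return err;
}
```

With this shape, adding another failure case needs only an `err = ...; goto out_unlock;` pair, and the lock depth is balanced on every path. In the kernel a scope-based guard would be the other option mentioned in the review.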
> +
> +/**
> + * famfs_update_daxdev_table() - Update the daxdev table
> + * @fm - fuse_mount
> + * @meta - famfs_file_meta, in-memory format, built from a GET_FMAP response
> + *
> + * This function is called for each new file fmap, to verify whether all
> + * referenced daxdevs are already known (i.e. in the table). Any daxdev
> + * indices referenced in @meta but not in the table will be retrieved via
> + * famfs_fuse_get_daxdev() and added to the table
> + *
> + * Return: 0=success
> + * -errno=failure
> + */
> +static int
> +famfs_update_daxdev_table(
> + struct fuse_mount *fm,
> + const struct famfs_file_meta *meta)
> +{
> + struct famfs_dax_devlist *local_devlist;
> + struct fuse_conn *fc = fm->fc;
> + int err;
> + int i;
Might as well put those on one line or move i down to the loop init.
> +
> + /* First time through we will need to allocate the dax_devlist */
> + if (unlikely(!fc->dax_devlist)) {
I'd avoid unlikely markings unless you have good evidence they are needed.
Let the branch predictors figure it out.
> + local_devlist = kcalloc(1, sizeof(*fc->dax_devlist), GFP_KERNEL);
> + if (!local_devlist)
> + return -ENOMEM;
> +
> + local_devlist->nslots = MAX_DAXDEVS;
> +
> + local_devlist->devlist = kcalloc(MAX_DAXDEVS,
> + sizeof(struct famfs_daxdev),
> + GFP_KERNEL);
> + if (!local_devlist->devlist) {
> + kfree(local_devlist);
> + return -ENOMEM;
> + }
> +
> + /* We don't need famfs_devlist_sem here because we use cmpxchg */
> + if (cmpxchg(&fc->dax_devlist, NULL, local_devlist) != NULL) {
> + kfree(local_devlist->devlist);
> + kfree(local_devlist); /* another thread beat us to it */
> + }
> + }
> +
> + down_read(&fc->famfs_devlist_sem);
> + for (i = 0; i < fc->dax_devlist->nslots; i++) {
> + if (!(meta->dev_bitmap & (1ULL << i)))
Could you do for_each_set_bit() on that bitmap?
Might end up clearer.
> + continue;
> +
> + /* This file meta struct references devindex i
> + * if devindex i isn't in the table; get it...
> + */
> + if (!(fc->dax_devlist->devlist[i].valid)) {
Maybe flip logic and do a continue as you do with the condition above.
> + up_read(&fc->famfs_devlist_sem);
> +
> + err = famfs_fuse_get_daxdev(fm, i);
> + if (err)
> + pr_err("%s: failed to get daxdev=%d\n",
> + __func__, i);
> +
> + down_read(&fc->famfs_devlist_sem);
> + }
> + }
> + up_read(&fc->famfs_devlist_sem);
> +
> + return 0;
> +}
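The two loop suggestions above (visit only the set bits of the bitmap, and use an early `continue` for entries that are already valid) can be illustrated in userspace C. This is a sketch under assumptions rather than the kernel helper: `collect_missing()` and its arguments are hypothetical, and `__builtin_ctzll()` is the GCC/Clang builtin used here in place of for_each_set_bit().

```c
#include <assert.h>
#include <stdint.h>

/*
 * Record in @missing the indices that are set in @dev_bitmap but not yet
 * marked in @valid. Returns the number of indices recorded.
 */
static unsigned int
collect_missing(uint64_t dev_bitmap, const uint8_t *valid,
		unsigned int nslots, unsigned int *missing)
{
	unsigned int n = 0;

	assert(nslots <= 64);

	/* bits &= bits - 1 clears the lowest set bit on each pass */
	for (uint64_t bits = dev_bitmap; bits; bits &= bits - 1) {
		unsigned int i = (unsigned int)__builtin_ctzll(bits);

		if (i >= nslots)
			break;		/* bits are visited in ascending order */
		if (valid[i])
			continue;	/* already in the table */
		missing[n++] = i;
	}
	return n;
}
```

The loop body runs once per referenced device instead of once per table slot, and the "needs fetching" work sits at the outermost indent level.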
* Re: [PATCH V3 02/21] dax: add fsdev.c driver for fs-dax on character dax
2026-01-08 11:31 ` Jonathan Cameron
2026-01-08 14:32 ` John Groves
@ 2026-01-08 15:12 ` John Groves
2026-01-08 21:15 ` John Groves
1 sibling, 1 reply; 74+ messages in thread
From: John Groves @ 2026-01-08 15:12 UTC (permalink / raw)
To: Jonathan Cameron
Cc: Miklos Szeredi, Dan Williams, Bernd Schubert, Alison Schofield,
John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
Matthew Wilcox, Jan Kara, Alexander Viro, David Hildenbrand,
Christian Brauner, Darrick J . Wong, Randy Dunlap, Jeff Layton,
Amir Goldstein, Stefan Hajnoczi, Joanne Koong, Josef Bacik,
Bagas Sanjaya, Chen Linxuan, James Morse, Fuad Tabba,
Sean Christopherson, Shivank Garg, Ackerley Tng, Gregory Price,
Aravind Ramesh, Ajay Joshi, venkataravis, linux-doc, linux-kernel,
nvdimm, linux-cxl, linux-fsdevel
On 26/01/08 11:31AM, Jonathan Cameron wrote:
> On Wed, 7 Jan 2026 09:33:11 -0600
> John Groves <John@Groves.net> wrote:
>
> > The new fsdev driver provides pages/folios initialized compatibly with
> > fsdax - normal rather than devdax-style refcounting, and starting out
> > with order-0 folios.
> >
> > When fsdev binds to a daxdev, it is usually (always?) switching from the
> > devdax mode (device.c), which pre-initializes compound folios according
> > to its alignment. Fsdev uses fsdev_clear_folio_state() to switch the
> > folios into a fsdax-compatible state.
> >
> > A side effect of this is that raw mmap doesn't (can't?) work on an fsdev
> > dax instance. Accordingly, the fsdev driver does not provide raw mmap -
> > devices must be put in 'devdax' mode (drivers/dax/device.c) to get raw
> > mmap capability.
> >
> > This commit is just the framework, which remaps pages/folios compatibly
> > with fsdax.
> >
> > Enabling dax changes:
> >
> > * bus.h: add DAXDRV_FSDEV_TYPE driver type
> > * bus.c: allow DAXDRV_FSDEV_TYPE drivers to bind to daxdevs
> > * dax.h: prototype inode_dax(), which fsdev needs
> >
> > Suggested-by: Dan Williams <dan.j.williams@intel.com>
> > Suggested-by: Gregory Price <gourry@gourry.net>
> > Signed-off-by: John Groves <john@groves.net>
>
> > diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
> > index d656e4c0eb84..491325d914a8 100644
> > --- a/drivers/dax/Kconfig
> > +++ b/drivers/dax/Kconfig
> > @@ -78,4 +78,21 @@ config DEV_DAX_KMEM
> >
> > Say N if unsure.
> >
> > +config DEV_DAX_FS
> > + tristate "FSDEV DAX: fs-dax compatible device driver"
> > + depends on DEV_DAX
> > + default DEV_DAX
>
> What's the logic for the default? Generally I'd not expect a
> default for something new like this (so default of default == no)
My thinking is that this is harmless unless you use it, but if you
need it you need it. So defaulting to include the module seems
viable.
[ ... ]
John
* Re: [PATCH V3 17/21] famfs_fuse: Plumb dax iomap and fuse read/write/mmap
2026-01-07 15:33 ` [PATCH V3 17/21] famfs_fuse: Plumb dax iomap and fuse read/write/mmap John Groves
@ 2026-01-08 15:13 ` Jonathan Cameron
2026-01-09 17:44 ` John Groves
0 siblings, 1 reply; 74+ messages in thread
From: Jonathan Cameron @ 2026-01-08 15:13 UTC (permalink / raw)
To: John Groves
Cc: Miklos Szeredi, Dan Williams, Bernd Schubert, Alison Schofield,
John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
Matthew Wilcox, Jan Kara, Alexander Viro, David Hildenbrand,
Christian Brauner, Darrick J . Wong, Randy Dunlap, Jeff Layton,
Amir Goldstein, Stefan Hajnoczi, Joanne Koong, Josef Bacik,
Bagas Sanjaya, Chen Linxuan, James Morse, Fuad Tabba,
Sean Christopherson, Shivank Garg, Ackerley Tng, Gregory Price,
Aravind Ramesh, Ajay Joshi, venkataravis, linux-doc, linux-kernel,
nvdimm, linux-cxl, linux-fsdevel
On Wed, 7 Jan 2026 09:33:26 -0600
John Groves <John@Groves.net> wrote:
> This commit fills in read/write/mmap handling for famfs files. The
> dev_dax_iomap interface is used - just like xfs in fs-dax mode.
>
> * Read/write are handled by famfs_fuse_[read|write]_iter() via
> dax_iomap_rw() to fsdev_dax.
> * Mmap is handled by famfs_fuse_mmap()
> * Faults are handled by famfs_filemap*fault(), using dax_iomap_fault()
> to fsdev_dax.
> * File offset to dax offset resolution is handled via
> famfs_fuse_iomap_begin(), which uses famfs "fmaps" to resolve the
> requested (file, offset) to an offset on a dax device (by way of
> famfs_fileofs_to_daxofs() and famfs_interleave_fileofs_to_daxofs())
>
> Signed-off-by: John Groves <john@groves.net>
A few minor comments and suggestions inline.
Thanks,
Jonathan
> ---
> fs/fuse/famfs.c | 458 +++++++++++++++++++++++++++++++++++++++++++++++
> fs/fuse/file.c | 18 +-
> fs/fuse/fuse_i.h | 18 ++
> 3 files changed, 492 insertions(+), 2 deletions(-)
>
> diff --git a/fs/fuse/famfs.c b/fs/fuse/famfs.c
> index b5cd1b5c1d6c..c02b14789c6e 100644
> --- a/fs/fuse/famfs.c
> +++ b/fs/fuse/famfs.c
> @@ -602,6 +602,464 @@ famfs_file_init_dax(
> return rc;
> }
>
> +/*********************************************************************
> + * iomap_operations
> + *
> + * This stuff uses the iomap (dax-related) helpers to resolve file offsets to
> + * offsets within a dax device.
> + */
> +
> +static ssize_t famfs_file_bad(struct inode *inode);
> +
> +static int
> +famfs_interleave_fileofs_to_daxofs(struct inode *inode, struct iomap *iomap,
> + loff_t file_offset, off_t len, unsigned int flags)
> +{
> + struct fuse_inode *fi = get_fuse_inode(inode);
> + struct famfs_file_meta *meta = fi->famfs_meta;
> + struct fuse_conn *fc = get_fuse_conn(inode);
> + loff_t local_offset = file_offset;
> + int i;
> +
> + /* This function is only for extent_type INTERLEAVED_EXTENT */
> + if (meta->fm_extent_type != INTERLEAVED_EXTENT) {
> + pr_err("%s: bad extent type\n", __func__);
> + goto err_out;
> + }
> +
> + if (famfs_file_bad(inode))
> + goto err_out;
> +
> + iomap->offset = file_offset;
> +
> + for (i = 0; i < meta->fm_niext; i++) {
> + struct famfs_meta_interleaved_ext *fei = &meta->ie[i];
> + u64 chunk_size = fei->fie_chunk_size;
> + u64 nstrips = fei->fie_nstrips;
> + u64 ext_size = fei->fie_nbytes;
> +
> + ext_size = min_t(u64, ext_size, meta->file_size);
min() probably fine. Also, how about avoiding the assignment that
is immediately overwritten.
u64 ext_size = min(fei->fie_nbytes, meta->file_size);
> +
> + if (ext_size == 0) {
> + pr_err("%s: ext_size=%lld file_size=%ld\n",
> + __func__, fei->fie_nbytes, meta->file_size);
> + goto err_out;
> + }
> +
> + /* Is the data in this striped extent? */
> + if (local_offset < ext_size) {
Similar comments to below, though here that would mean not being able
to scope these local variables as tightly so maybe not worth it to reduce
indent.
> + u64 chunk_num = local_offset / chunk_size;
> + u64 chunk_offset = local_offset % chunk_size;
> + u64 stripe_num = chunk_num / nstrips;
> + u64 strip_num = chunk_num % nstrips;
> + u64 chunk_remainder = chunk_size - chunk_offset;
I'd group chunk stuff, then strip stuff.
> + u64 strip_offset = chunk_offset + (stripe_num * chunk_size);
> + u64 strip_dax_ofs = fei->ie_strips[strip_num].ext_offset;
> + u64 strip_devidx = fei->ie_strips[strip_num].dev_index;
> +
> + if (strip_devidx >= fc->dax_devlist->nslots) {
> + pr_err("%s: strip_devidx %llu >= nslots %d\n",
> + __func__, strip_devidx,
> + fc->dax_devlist->nslots);
> + goto err_out;
> + }
> +
> + if (!fc->dax_devlist->devlist[strip_devidx].valid) {
> + pr_err("%s: daxdev=%lld invalid\n", __func__,
> + strip_devidx);
> + goto err_out;
> + }
> +
> + iomap->addr = strip_dax_ofs + strip_offset;
> + iomap->offset = file_offset;
> + iomap->length = min_t(loff_t, len, chunk_remainder);
> +
> + iomap->dax_dev = fc->dax_devlist->devlist[strip_devidx].devp;
> +
> + iomap->type = IOMAP_MAPPED;
> + iomap->flags = flags;
> +
> + return 0;
> + }
> + local_offset -= ext_size; /* offset is beyond this striped extent */
> + }
> +
> + err_out:
> + pr_err("%s: err_out\n", __func__);
> +
> + /* We fell out the end of the extent list.
> + * Set iomap to zero length in this case, and return 0
> + * This just means that the r/w is past EOF
> + */
> + iomap->addr = 0; /* there is no valid dax device offset */
> + iomap->offset = file_offset; /* file offset */
> + iomap->length = 0; /* this had better result in no access to dax mem */
> + iomap->dax_dev = NULL;
> + iomap->type = IOMAP_MAPPED;
> + iomap->flags = flags;
> +
> + return 0;
> +}
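The chunk and strip arithmetic in the function above can be checked in isolation with a small userspace sketch. It assumes the same formulas as the quoted patch; `struct strip`, `struct resolved`, and `resolve_interleaved()` are hypothetical names introduced only for illustration, and the declarations are grouped chunk-first, then strip, as suggested in the review.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical userspace mirror of one strip of an interleaved extent. */
struct strip {
	uint64_t dev_index;	/* which daxdev holds this strip */
	uint64_t ext_offset;	/* where the strip starts on that daxdev */
};

/* Hypothetical result: the subset of iomap fields computed here. */
struct resolved {
	uint64_t dev_index;
	uint64_t dax_offset;
	uint64_t length;
};

/* Resolve an offset within one interleaved extent to (device, offset, length). */
static struct resolved
resolve_interleaved(const struct strip *strips, uint64_t nstrips,
		    uint64_t chunk_size, uint64_t local_offset, uint64_t len)
{
	assert(nstrips != 0 && chunk_size != 0);

	/* chunk values: which chunk, and where inside it */
	uint64_t chunk_num = local_offset / chunk_size;
	uint64_t chunk_offset = local_offset % chunk_size;
	uint64_t chunk_remainder = chunk_size - chunk_offset;

	/* strip values: chunks are dealt round-robin across the strips */
	uint64_t stripe_num = chunk_num / nstrips;
	uint64_t strip_num = chunk_num % nstrips;
	uint64_t strip_offset = chunk_offset + stripe_num * chunk_size;

	struct resolved r = {
		.dev_index = strips[strip_num].dev_index,
		.dax_offset = strips[strip_num].ext_offset + strip_offset,
		/* a mapping never crosses a chunk boundary */
		.length = len < chunk_remainder ? len : chunk_remainder,
	};
	return r;
}
```

For example, with two strips and a 2 MiB chunk size, an offset 0x100 bytes into chunk 5 lands on the second strip (5 % 2 == 1), in its third chunk-sized slot (5 / 2 == 2), and the returned length is capped at the remainder of that chunk.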
> +
> +/**
> + * famfs_fileofs_to_daxofs() - Resolve (file, offset, len) to (daxdev, offset, len)
> + *
> + * This function is called by famfs_fuse_iomap_begin() to resolve an offset in a
> + * file to an offset in a dax device. This is upcalled from dax from calls to
> + * both dax_iomap_fault() and dax_iomap_rw(). Dax finishes the job resolving
> + * a fault to a specific physical page (the fault case) or doing a memcpy
> + * variant (the rw case)
> + *
> + * Pages can be PTE (4k), PMD (2MiB) or (theoretically) PuD (1GiB)
> + * (these sizes are for X86; may vary on other cpu architectures)
> + *
> + * @inode: The file where the fault occurred
> + * @iomap: To be filled in to indicate where to find the right memory,
> + * relative to a dax device.
> + * @file_offset: Within the file where the fault occurred (will be page boundary)
> + * @len: The length of the faulted mapping (will be a page multiple)
> + * (will be trimmed in *iomap if it's disjoint in the extent list)
> + * @flags:
As below. All should have docs, even if trivial.
> + *
> + * Return values: 0. (info is returned in a modified @iomap struct)
> + */
> +static int
> +famfs_fileofs_to_daxofs(struct inode *inode, struct iomap *iomap,
> + loff_t file_offset, off_t len, unsigned int flags)
> +{
> + struct fuse_inode *fi = get_fuse_inode(inode);
> + struct famfs_file_meta *meta = fi->famfs_meta;
> + struct fuse_conn *fc = get_fuse_conn(inode);
> + loff_t local_offset = file_offset;
> + int i;
> +
> + if (!fc->dax_devlist) {
> + pr_err("%s: null dax_devlist\n", __func__);
> + goto err_out;
> + }
> +
> + if (famfs_file_bad(inode))
> + goto err_out;
> +
> + if (meta->fm_extent_type == INTERLEAVED_EXTENT)
> + return famfs_interleave_fileofs_to_daxofs(inode, iomap,
> + file_offset,
> + len, flags);
> +
> + iomap->offset = file_offset;
> +
> + for (i = 0; i < meta->fm_nextents; i++) {
I'd drag declaration of i into the loop init.
> + /* TODO: check devindex too */
> + loff_t dax_ext_offset = meta->se[i].ext_offset;
> + loff_t dax_ext_len = meta->se[i].ext_len;
> + u64 daxdev_idx = meta->se[i].dev_index;
> +
> +
> + /* TODO: test that superblock and log offsets only happen
> + * with superblock and log files. Requires instrumentation
> + * from user space...
> + */
> +
> + /* local_offset is the offset minus the size of extents skipped
> + * so far; If local_offset < dax_ext_len, the data of interest
> + * starts in this extent
> + */
> + if (local_offset < dax_ext_len) {
Maybe flip logic and use a continue. Mostly to reduce indent of the rest of
this. Or maybe a helper function for this bit.
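A rough userspace sketch of the flipped form, assuming simplified
stand-ins for the patch's types (struct simple_ext, struct resolved and
resolve_simple() are hypothetical, and the daxdev validity checks are
left out). The loop index is also declared in the loop init.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the simple-extent metadata in the patch. */
struct simple_ext {
	uint64_t dev_index;
	uint64_t ext_offset;
	uint64_t ext_len;
};

/* Hypothetical result; plays the role of the filled-in iomap. */
struct resolved {
	uint64_t dev_index;
	uint64_t addr;    /* offset within the dax device */
	uint64_t length;  /* 0 means the access is past EOF */
};

static uint64_t min_u64(uint64_t a, uint64_t b)
{
	return a < b ? a : b;
}

static struct resolved resolve_simple(const struct simple_ext *se, size_t n,
				      uint64_t file_offset, uint64_t len)
{
	struct resolved r = { 0, 0, 0 };
	uint64_t local_offset = file_offset;

	for (size_t i = 0; i < n; i++) {
		if (local_offset >= se[i].ext_len) {
			/* Offset is beyond this extent: skip it */
			local_offset -= se[i].ext_len;
			continue;
		}

		/* The data of interest starts in this extent */
		r.dev_index = se[i].dev_index;
		r.addr = se[i].ext_offset + local_offset;
		r.length = min_u64(len, se[i].ext_len - local_offset);
		return r;
	}

	/* Fell off the end of the extent list: zero length (past EOF) */
	return r;
}
```

The body that fills in the result drops one indent level, and the skip
case reads as the exception rather than the fall-through.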
> + loff_t ext_len_remainder = dax_ext_len - local_offset;
> + struct famfs_daxdev *dd;
> +
> + if (daxdev_idx >= fc->dax_devlist->nslots) {
> + pr_err("%s: daxdev_idx %llu >= nslots %d\n",
> + __func__, daxdev_idx,
> + fc->dax_devlist->nslots);
> + goto err_out;
> + }
> +
> + dd = &fc->dax_devlist->devlist[daxdev_idx];
> +
> + if (!dd->valid || dd->error) {
> + pr_err("%s: daxdev=%lld %s\n", __func__,
> + daxdev_idx,
> + dd->valid ? "error" : "invalid");
> + goto err_out;
> + }
> +
> + /*
> + * OK, we found the file metadata extent where this
> + * data begins
> + * @local_offset - The offset within the current
> + * extent
> + * @ext_len_remainder - Remaining length of ext after
> + * skipping local_offset
> + * Outputs:
> + * iomap->addr: the offset within the dax device where
> + * the data starts
> + * iomap->offset: the file offset
> + * iomap->length: the valid length resolved here
> + */
> + iomap->addr = dax_ext_offset + local_offset;
> + iomap->offset = file_offset;
> + iomap->length = min_t(loff_t, len, ext_len_remainder);
> +
> + iomap->dax_dev = fc->dax_devlist->devlist[daxdev_idx].devp;
> +
> + iomap->type = IOMAP_MAPPED;
> + iomap->flags = flags;
> + return 0;
> + }
> + local_offset -= dax_ext_len; /* Get ready for the next extent */
> + }
> +
> + err_out:
> + pr_err("%s: err_out\n", __func__);
> +
> + /* We fell out the end of the extent list.
> + * Set iomap to zero length in this case, and return 0
> + * This just means that the r/w is past EOF
> + */
> + iomap->addr = 0; /* there is no valid dax device offset */
> + iomap->offset = file_offset; /* file offset */
> + iomap->length = 0; /* this had better result in no access to dax mem */
> + iomap->dax_dev = NULL;
> + iomap->type = IOMAP_MAPPED;
> + iomap->flags = flags;
> +
> + return 0;
> +}
> +
> +/**
> + * famfs_fuse_iomap_begin() - Handler for iomap_begin upcall from dax
> + *
> + * This function is pretty simple because files are
> + * * never partially allocated
> + * * never have holes (never sparse)
> + * * never "allocate on write"
> + *
> + * @inode: inode for the file being accessed
> + * @offset: offset within the file
> + * @length: Length being accessed at offset
> + * @flags:
> + * @iomap: iomap struct to be filled in, resolving (offset, length) to
> + * (daxdev, offset, len)
> + * @srcmap:
All parameters should have description.
> + */
> +static int
> +famfs_fuse_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> + unsigned int flags, struct iomap *iomap, struct iomap *srcmap)
> +{
> + struct fuse_inode *fi = get_fuse_inode(inode);
> + struct famfs_file_meta *meta = fi->famfs_meta;
> + size_t size;
> +
> + size = i_size_read(inode);
> +
> + WARN_ON(size != meta->file_size);
> +
> + return famfs_fileofs_to_daxofs(inode, iomap, offset, length, flags);
> +}
> +
> +static inline bool
> +famfs_is_write_fault(struct vm_fault *vmf)
> +{
> + return (vmf->flags & FAULT_FLAG_WRITE) &&
> + (vmf->vma->vm_flags & VM_SHARED);
> +}
> +
> +static vm_fault_t
> +famfs_filemap_fault(struct vm_fault *vmf)
> +{
> + return __famfs_fuse_filemap_fault(vmf, 0, famfs_is_write_fault(vmf));
> +}
> +
> +static vm_fault_t
> +famfs_filemap_huge_fault(struct vm_fault *vmf, unsigned int pe_size)
> +{
> + return __famfs_fuse_filemap_fault(vmf, pe_size, famfs_is_write_fault(vmf));
> +}
> +
> +static vm_fault_t
> +famfs_filemap_page_mkwrite(struct vm_fault *vmf)
> +{
> + return __famfs_fuse_filemap_fault(vmf, 0, true);
I'm not an fs person but I note ext4 etc are able to use the
same callback for all of these and can figure out the write fault
question inside that callback. Is there a reason that doesn't work here?
Looks like an appropriate vmf flag is set for each type of callback.
> +}
> +
> +static vm_fault_t
Similar to earlier comments. I'd put these on one line unless you
have to split them due to length.
> +famfs_filemap_pfn_mkwrite(struct vm_fault *vmf)
Given this and the previous page_mkwrite one are identical, just
use one more generically named callback. Lots of FS seem to do this
when these match. E.g. ext4_dax_fault()
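A userspace mock of the single-callback shape, for illustration only.
The flag values and struct layouts below are simplified stand-ins for
the real definitions in <linux/mm.h>, common_fault() stands in for
__famfs_fuse_filemap_fault(), and famfs_dax_fault() /
famfs_dax_huge_fault() are hypothetical names. It assumes, as noted
above, that FAULT_FLAG_WRITE is set for the mkwrite callbacks.

```c
#include <stdbool.h>

/* Mock flag values and types; not the kernel's actual definitions. */
#define FAULT_FLAG_WRITE 0x1u
#define VM_SHARED        0x8u

struct vm_area_struct { unsigned long vm_flags; };
struct vm_fault {
	unsigned int flags;
	struct vm_area_struct *vma;
};

/* Records what the common handler was asked to do. */
struct fault_record { unsigned int order; bool write; };
static struct fault_record last_fault;

/* Stand-in for __famfs_fuse_filemap_fault() from the patch. */
static int common_fault(struct vm_fault *vmf, unsigned int order, bool write)
{
	(void)vmf;
	last_fault.order = order;
	last_fault.write = write;
	return 0;
}

/* The write question is answered here, from the fault itself. */
static int famfs_dax_huge_fault(struct vm_fault *vmf, unsigned int order)
{
	bool write = (vmf->flags & FAULT_FLAG_WRITE) &&
		     (vmf->vma->vm_flags & VM_SHARED);

	return common_fault(vmf, order, write);
}

/* Candidate for .fault, .page_mkwrite and .pfn_mkwrite alike. */
static int famfs_dax_fault(struct vm_fault *vmf)
{
	return famfs_dax_huge_fault(vmf, 0);
}
```

In this shape the vm_operations_struct would point .fault, .page_mkwrite
and .pfn_mkwrite at the same function, leaving only .huge_fault separate.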
> +{
> + return __famfs_fuse_filemap_fault(vmf, 0, true);
> +}
> +
> +static vm_fault_t
> +famfs_filemap_map_pages(struct vm_fault *vmf, pgoff_t start_pgoff,
> + pgoff_t end_pgoff)
> +{
> + return filemap_map_pages(vmf, start_pgoff, end_pgoff);
Why not just use this directly as the vm_operation? shmem does
this for instance.
> +}
> +
> +const struct vm_operations_struct famfs_file_vm_ops = {
> + .fault = famfs_filemap_fault,
> + .huge_fault = famfs_filemap_huge_fault,
> + .map_pages = famfs_filemap_map_pages,
> + .page_mkwrite = famfs_filemap_page_mkwrite,
> + .pfn_mkwrite = famfs_filemap_pfn_mkwrite,
> +};
> +
> +/*********************************************************************
> + * file_operations
> + */
> +
> +/**
> + * famfs_file_bad() - Check for files that aren't in a valid state
> + *
> + * @inode - inode
> + *
> + * Returns: 0=success
> + * -errno=failure
> + */
> +static ssize_t
Odd return type. Why not int?
> +famfs_file_bad(struct inode *inode)
> +{
> + struct fuse_inode *fi = get_fuse_inode(inode);
> + struct famfs_file_meta *meta = fi->famfs_meta;
> + size_t i_size = i_size_read(inode);
> +
> + if (!meta) {
> + pr_err("%s: un-initialized famfs file\n", __func__);
> + return -EIO;
> + }
> + if (meta->error) {
> + pr_debug("%s: previously detected metadata errors\n", __func__);
> + return -EIO;
> + }
> + if (i_size != meta->file_size) {
> + pr_warn("%s: i_size overwritten from %ld to %ld\n",
> + __func__, meta->file_size, i_size);
> + meta->error = true;
> + return -ENXIO;
> + }
> + if (!IS_DAX(inode)) {
> + pr_debug("%s: inode %llx IS_DAX is false\n",
> + __func__, (u64)inode);
> + return -ENXIO;
> + }
> + return 0;
> +}
> +
> +static ssize_t
This can probably just return an int, given the type seems to be driven
by famfs_file_bad(), which doesn't make much sense returning a ssize_t.
Storing an int into a ssize_t without a cast should be fine.
> +famfs_fuse_rw_prep(struct kiocb *iocb, struct iov_iter *ubuf)
> +{
> + struct inode *inode = iocb->ki_filp->f_mapping->host;
> + size_t i_size = i_size_read(inode);
> + size_t count = iov_iter_count(ubuf);
> + size_t max_count;
> + ssize_t rc;
> +
> + rc = famfs_file_bad(inode);
> + if (rc)
> + return rc;
> +
> + /* Avoid unsigned underflow if position is past EOF */
> + if (iocb->ki_pos >= i_size)
> + max_count = 0;
> + else
> + max_count = i_size - iocb->ki_pos;
> +
> + if (count > max_count)
> + iov_iter_truncate(ubuf, max_count);
> +
> + if (!iov_iter_count(ubuf))
> + return 0;
> +
> + return rc;
> +}
> +
> +ssize_t
> +famfs_fuse_read_iter(struct kiocb *iocb, struct iov_iter *to)
> +{
> + ssize_t rc;
> +
> + rc = famfs_fuse_rw_prep(iocb, to);
> + if (rc)
> + return rc;
> +
> + if (!iov_iter_count(to))
> + return 0;
> +
> + rc = dax_iomap_rw(iocb, to, &famfs_iomap_ops);
> +
> + file_accessed(iocb->ki_filp);
> + return rc;
> +}
> +
> +int
> +famfs_fuse_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> + struct inode *inode = file_inode(file);
> + ssize_t rc;
> +
> + rc = famfs_file_bad(inode);
> + if (rc)
> + return (int)rc;
This was odd so I went and looked. famfs_file_bad() should probably just return an int.
> +
> + file_accessed(file);
> + vma->vm_ops = &famfs_file_vm_ops;
> + vm_flags_set(vma, VM_HUGEPAGE);
> + return 0;
> +}
> +
> #define FMAP_BUFSIZE PAGE_SIZE
>
> int
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 1f64bf68b5ee..45a09a7f0012 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -1831,6 +1831,8 @@ static ssize_t fuse_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
>
> if (FUSE_IS_VIRTIO_DAX(fi))
> return fuse_dax_read_iter(iocb, to);
> + if (fuse_file_famfs(fi))
> + return famfs_fuse_read_iter(iocb, to);
>
> /* FOPEN_DIRECT_IO overrides FOPEN_PASSTHROUGH */
> if (ff->open_flags & FOPEN_DIRECT_IO)
> @@ -1853,6 +1855,8 @@ static ssize_t fuse_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
>
> if (FUSE_IS_VIRTIO_DAX(fi))
> return fuse_dax_write_iter(iocb, from);
> + if (fuse_file_famfs(fi))
> + return famfs_fuse_write_iter(iocb, from);
>
> /* FOPEN_DIRECT_IO overrides FOPEN_PASSTHROUGH */
> if (ff->open_flags & FOPEN_DIRECT_IO)
> @@ -1868,9 +1872,13 @@ static ssize_t fuse_splice_read(struct file *in, loff_t *ppos,
> unsigned int flags)
> {
> struct fuse_file *ff = in->private_data;
> + struct inode *inode = file_inode(in);
> + struct fuse_inode *fi = get_fuse_inode(inode);
>
> /* FOPEN_DIRECT_IO overrides FOPEN_PASSTHROUGH */
> - if (fuse_file_passthrough(ff) && !(ff->open_flags & FOPEN_DIRECT_IO))
> + if (fuse_file_famfs(fi))
> + return -EIO; /* famfs does not use the page cache... */
As below.
> + else if (fuse_file_passthrough(ff) && !(ff->open_flags & FOPEN_DIRECT_IO))
> return fuse_passthrough_splice_read(in, ppos, pipe, len, flags);
> else
> return filemap_splice_read(in, ppos, pipe, len, flags);
> @@ -1880,9 +1888,13 @@ static ssize_t fuse_splice_write(struct pipe_inode_info *pipe, struct file *out,
> loff_t *ppos, size_t len, unsigned int flags)
> {
> struct fuse_file *ff = out->private_data;
> + struct inode *inode = file_inode(out);
> + struct fuse_inode *fi = get_fuse_inode(inode);
>
> /* FOPEN_DIRECT_IO overrides FOPEN_PASSTHROUGH */
> - if (fuse_file_passthrough(ff) && !(ff->open_flags & FOPEN_DIRECT_IO))
> + if (fuse_file_famfs(fi))
> + return -EIO; /* famfs does not use the page cache... */
Not sure why the original code had an else, but it's not needed given the
branch returns. Maybe stick to local style.
> + else if (fuse_file_passthrough(ff) && !(ff->open_flags & FOPEN_DIRECT_IO))
> return fuse_passthrough_splice_write(pipe, out, ppos, len, flags);
> else
> return iter_file_splice_write(pipe, out, ppos, len, flags);
> @@ -2390,6 +2402,8 @@ static int fuse_file_mmap(struct file *file, struct vm_area_struct *vma)
> /* DAX mmap is superior to direct_io mmap */
> if (FUSE_IS_VIRTIO_DAX(fi))
> return fuse_dax_mmap(file, vma);
> + if (fuse_file_famfs(fi))
> + return famfs_fuse_mmap(file, vma);
>
> /*
> * If inode is in passthrough io mode, because it has some file open
* Re: [PATCH V3 03/21] dax: Save the kva from memremap
2026-01-08 11:32 ` Jonathan Cameron
@ 2026-01-08 15:15 ` John Groves
0 siblings, 0 replies; 74+ messages in thread
From: John Groves @ 2026-01-08 15:15 UTC (permalink / raw)
To: Jonathan Cameron
Cc: Miklos Szeredi, Dan Williams, Bernd Schubert, Alison Schofield,
John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
Matthew Wilcox, Jan Kara, Alexander Viro, David Hildenbrand,
Christian Brauner, Darrick J . Wong, Randy Dunlap, Jeff Layton,
Amir Goldstein, Stefan Hajnoczi, Joanne Koong, Josef Bacik,
Bagas Sanjaya, Chen Linxuan, James Morse, Fuad Tabba,
Sean Christopherson, Shivank Garg, Ackerley Tng, Gregory Price,
Aravind Ramesh, Ajay Joshi, venkataravis, linux-doc, linux-kernel,
nvdimm, linux-cxl, linux-fsdevel
On 26/01/08 11:32AM, Jonathan Cameron wrote:
> On Wed, 7 Jan 2026 09:33:12 -0600
> John Groves <John@Groves.net> wrote:
>
> > Save the kva from memremap because we need it for iomap rw support.
> >
> > Prior to famfs, there were no iomap users of /dev/dax - so the virtual
> > address from memremap was not needed.
> >
> > (also fill in missing kerneldoc comment fields for struct dev_dax)
>
> Do that as a precursor that can be picked up ahead of the rest of the series.
Makes sense. Actually, I'll just send it as a separate standalone patch...
Thanks,
John
[ ... ]
* Re: [PATCH V3 18/21] famfs_fuse: Add holder_operations for dax notify_failure()
2026-01-07 15:33 ` [PATCH V3 18/21] famfs_fuse: Add holder_operations for dax notify_failure() John Groves
@ 2026-01-08 15:17 ` Jonathan Cameron
2026-01-09 21:00 ` John Groves
0 siblings, 1 reply; 74+ messages in thread
From: Jonathan Cameron @ 2026-01-08 15:17 UTC (permalink / raw)
To: John Groves
Cc: Miklos Szeredi, Dan Williams, Bernd Schubert, Alison Schofield,
John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
Matthew Wilcox, Jan Kara, Alexander Viro, David Hildenbrand,
Christian Brauner, Darrick J . Wong, Randy Dunlap, Jeff Layton,
Amir Goldstein, Stefan Hajnoczi, Joanne Koong, Josef Bacik,
Bagas Sanjaya, Chen Linxuan, James Morse, Fuad Tabba,
Sean Christopherson, Shivank Garg, Ackerley Tng, Gregory Price,
Aravind Ramesh, Ajay Joshi, venkataravis, linux-doc, linux-kernel,
nvdimm, linux-cxl, linux-fsdevel
On Wed, 7 Jan 2026 09:33:27 -0600
John Groves <John@Groves.net> wrote:
> Memory errors are at least somewhat more likely on disaggregated memory
> than on-board memory. This commit registers to be notified by fsdev_dax
> in the event that a memory failure is detected.
>
> When a file access resolves to a daxdev with memory errors, it will fail
> with an appropriate error.
>
> If a daxdev failed fs_dax_get(), we set dd->dax_err. If a daxdev called
> our notify_failure(), set dd->error. When any of the above happens, set
> (file)->error and stop allowing access.
>
> In general, the recovery from memory errors is to unmount the file
> system and re-initialize the memory, but there may be usable degraded
> modes of operation - particularly in the future when famfs supports
> file systems backed by more than one daxdev. In those cases,
> accessing data that is on a working daxdev can still work.
>
> For now, return errors for any file that has encountered a memory or dax
> error.
>
> Signed-off-by: John Groves <john@groves.net>
> ---
> fs/fuse/famfs.c | 115 +++++++++++++++++++++++++++++++++++++++---
> fs/fuse/famfs_kfmap.h | 3 +-
> 2 files changed, 109 insertions(+), 9 deletions(-)
>
> diff --git a/fs/fuse/famfs.c b/fs/fuse/famfs.c
> index c02b14789c6e..4eb87c5c628e 100644
> --- a/fs/fuse/famfs.c
> +++ b/fs/fuse/famfs.c
> @@ -254,6 +288,38 @@ famfs_update_daxdev_table(
> return 0;
> }
>
> +static void
> +famfs_set_daxdev_err(
> + struct fuse_conn *fc,
> + struct dax_device *dax_devp)
> +{
> + int i;
> +
> + /* Gotta search the list by dax_devp;
> + * read lock because we're not adding or removing daxdev entries
> + */
> + down_read(&fc->famfs_devlist_sem);
Use a guard()
> + for (i = 0; i < fc->dax_devlist->nslots; i++) {
> + if (fc->dax_devlist->devlist[i].valid) {
> + struct famfs_daxdev *dd = &fc->dax_devlist->devlist[i];
> +
> + if (dd->devp != dax_devp)
> + continue;
> +
> + dd->error = true;
> + up_read(&fc->famfs_devlist_sem);
> +
> + pr_err("%s: memory error on daxdev %s (%d)\n",
> + __func__, dd->name, i);
> + goto done;
> + }
> + }
> + up_read(&fc->famfs_devlist_sem);
> + pr_err("%s: memory err on unrecognized daxdev\n", __func__);
> +
> +done:
If this isn't getting more interesting, just return above.
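A userspace sketch of what both suggestions could look like together,
under stated assumptions. The kernel's guard() is built on the
compiler's cleanup attribute, so this mimics it with a toy reader
count instead of a real rw_semaphore. Everything prefixed toy_ or TOY_
is hypothetical, and the function returns an index only so the result
can be checked (the patch's function returns void).

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy read lock: counts readers so releases can be observed. */
struct toy_rwsem { int readers; };

static void toy_down_read(struct toy_rwsem *sem) { sem->readers++; }
static void toy_up_read(struct toy_rwsem *sem)   { sem->readers--; }

/* Cleanup handler: runs when the guard variable leaves scope. */
static void toy_guard_release(struct toy_rwsem **semp)
{
	if (*semp)
		toy_up_read(*semp);
}

/* Hypothetical macro mimicking the shape of the kernel's guard(). */
#define TOY_GUARD_READ(sem) \
	struct toy_rwsem *toy_guard_ \
		__attribute__((cleanup(toy_guard_release), unused)) = \
		(toy_down_read(sem), (sem))

struct toy_daxdev { bool valid; bool error; const void *devp; };

/* Returns the slot that was marked, or -1 if devp was not found. */
static int toy_set_daxdev_err(struct toy_rwsem *sem, struct toy_daxdev *list,
			      int nslots, const void *devp)
{
	TOY_GUARD_READ(sem);

	for (int i = 0; i < nslots; i++) {
		if (!list[i].valid || list[i].devp != devp)
			continue;

		list[i].error = true;
		return i;	/* lock is dropped automatically here */
	}

	return -1;		/* and here */
}
```

In the kernel this would presumably reduce to a single
guard(rwsem_read)(&fc->famfs_devlist_sem) at the top of the function,
assuming the rwsem_read guard class is available in the target tree;
the explicit up_read() calls and the done: label then go away.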
> +}
> +
> /***************************************************************************/
>
> void
> @@ -611,6 +677,26 @@ famfs_file_init_dax(
>
> static ssize_t famfs_file_bad(struct inode *inode);
>
> +static int famfs_dax_err(struct famfs_daxdev *dd)
I'd introduce this earlier in the series to reduce need
to refactor below.
> +{
> + if (!dd->valid) {
> + pr_err("%s: daxdev=%s invalid\n",
> + __func__, dd->name);
> + return -EIO;
> + }
> + if (dd->dax_err) {
> + pr_err("%s: daxdev=%s dax_err\n",
> + __func__, dd->name);
> + return -EIO;
> + }
> + if (dd->error) {
> + pr_err("%s: daxdev=%s memory error\n",
> + __func__, dd->name);
> + return -EHWPOISON;
> + }
> + return 0;
> +}
...
> @@ -966,7 +1064,8 @@ famfs_file_bad(struct inode *inode)
> return -EIO;
> }
> if (meta->error) {
> - pr_debug("%s: previously detected metadata errors\n", __func__);
> + pr_debug("%s: previously detected metadata errors\n",
> + __func__);
Spurious change.
> return -EIO;
> }
* Re: [PATCH V3 01/21] dax: move dax_pgoff_to_phys from [drivers/dax/] device.c to bus.c
2026-01-08 13:25 ` John Groves
@ 2026-01-08 15:20 ` Jonathan Cameron
0 siblings, 0 replies; 74+ messages in thread
From: Jonathan Cameron @ 2026-01-08 15:20 UTC (permalink / raw)
To: John Groves
Cc: Miklos Szeredi, Dan Williams, Bernd Schubert, Alison Schofield,
John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
Matthew Wilcox, Jan Kara, Alexander Viro, David Hildenbrand,
Christian Brauner, Darrick J . Wong, Randy Dunlap, Jeff Layton,
Amir Goldstein, Stefan Hajnoczi, Joanne Koong, Josef Bacik,
Bagas Sanjaya, Chen Linxuan, James Morse, Fuad Tabba,
Sean Christopherson, Shivank Garg, Ackerley Tng, Gregory Price,
Aravind Ramesh, Ajay Joshi, venkataravis, linux-doc, linux-kernel,
nvdimm, linux-cxl, linux-fsdevel
On Thu, 8 Jan 2026 07:25:47 -0600
John Groves <John@groves.net> wrote:
> On 26/01/08 10:43AM, Jonathan Cameron wrote:
> > On Wed, 7 Jan 2026 09:33:10 -0600
> > John Groves <John@Groves.net> wrote:
> >
> > > This function will be used by both device.c and fsdev.c, but both are
> > > loadable modules. Moving to bus.c puts it in core and makes it available
> > > to both.
> > >
> > > No code changes - just relocated.
> > >
> > > Signed-off-by: John Groves <john@groves.net>
> > Hi John,
> >
> > I don't know the code well enough to offer an opinion on whether this
> > move causes any issues or if this is the best location, so review is superficial
> > stuff only.
> >
> > Jonathan
> >
> > > ---
> > > drivers/dax/bus.c | 27 +++++++++++++++++++++++++++
> > > drivers/dax/device.c | 23 -----------------------
> > > 2 files changed, 27 insertions(+), 23 deletions(-)
> > >
> > > diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> > > index fde29e0ad68b..a2f9a3cc30a5 100644
> > > --- a/drivers/dax/bus.c
> > > +++ b/drivers/dax/bus.c
> > > @@ -7,6 +7,9 @@
> > > #include <linux/slab.h>
> > > #include <linux/dax.h>
> > > #include <linux/io.h>
> > > +#include <linux/backing-dev.h>
> >
> > I'm not immediately spotting why this one. Maybe should be in a different
> > patch?
> >
> > > +#include <linux/range.h>
> > > +#include <linux/uio.h>
> >
> > Why this one?
>
> Good eye, thanks. These must have leaked from some of the many dead ends
> that I tried before coming up with this approach.
>
> I've dropped all new includes and it still builds :D
Range one should be there...
>
> >
> > Style wise, dax seems to use reverse xmas tree for includes, so
> > this should keep to that.
> >
> > > #include "dax-private.h"
> > > #include "bus.h"
> > >
> > > @@ -1417,6 +1420,30 @@ static const struct device_type dev_dax_type = {
> > > .groups = dax_attribute_groups,
> > > };
> > >
> > > +/* see "strong" declaration in tools/testing/nvdimm/dax-dev.c */
> > Bonus space before that */
> > Curiously that wasn't there in the original.
>
> Removed.
>
> [ ... ]
>
> Thanks,
> John
* Re: [PATCH V3 21/21] famfs_fuse: Add documentation
2026-01-07 15:33 ` [PATCH V3 21/21] famfs_fuse: Add documentation John Groves
@ 2026-01-08 15:27 ` Jonathan Cameron
2026-01-11 18:53 ` John Groves
0 siblings, 1 reply; 74+ messages in thread
From: Jonathan Cameron @ 2026-01-08 15:27 UTC (permalink / raw)
To: John Groves
Cc: Miklos Szeredi, Dan Williams, Bernd Schubert, Alison Schofield,
John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
Matthew Wilcox, Jan Kara, Alexander Viro, David Hildenbrand,
Christian Brauner, Darrick J . Wong, Randy Dunlap, Jeff Layton,
Amir Goldstein, Stefan Hajnoczi, Joanne Koong, Josef Bacik,
Bagas Sanjaya, Chen Linxuan, James Morse, Fuad Tabba,
Sean Christopherson, Shivank Garg, Ackerley Tng, Gregory Price,
Aravind Ramesh, Ajay Joshi, venkataravis, linux-doc, linux-kernel,
nvdimm, linux-cxl, linux-fsdevel
On Wed, 7 Jan 2026 09:33:30 -0600
John Groves <John@Groves.net> wrote:
> Add Documentation/filesystems/famfs.rst and update MAINTAINERS
>
> Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
> Tested-by: Randy Dunlap <rdunlap@infradead.org>
> Signed-off-by: John Groves <john@groves.net>
> ---
> Documentation/filesystems/famfs.rst | 142 ++++++++++++++++++++++++++++
> Documentation/filesystems/index.rst | 1 +
> MAINTAINERS | 1 +
> 3 files changed, 144 insertions(+)
> create mode 100644 Documentation/filesystems/famfs.rst
>
> diff --git a/Documentation/filesystems/famfs.rst b/Documentation/filesystems/famfs.rst
> new file mode 100644
> index 000000000000..0d3c9ba9b7a8
> --- /dev/null
> +++ b/Documentation/filesystems/famfs.rst
> +Principles of Operation
> +=======================
....
> +When an app accesses a data object in a famfs file, there is no page cache
> +involvement. The CPU cache is loaded directly from the shared memory. In
> +some use cases, this is an enormous reduction in read amplification compared
> +to loading an entire page into the page cache.
> +
Trivial but this double blank line seems inconsistent.
I don't mind if it's one or two, but do the same everywhere.
> +
> +Famfs is Not a Conventional File System
> +---------------------------------------
Nice doc.
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
* Re: [PATCH V3 4/4] fuse: add famfs DAX fmap support
2026-01-07 15:34 ` [PATCH V3 4/4] fuse: add famfs DAX fmap support John Groves
@ 2026-01-08 15:31 ` Jonathan Cameron
2026-01-11 18:24 ` John Groves
0 siblings, 1 reply; 74+ messages in thread
From: Jonathan Cameron @ 2026-01-08 15:31 UTC (permalink / raw)
To: John Groves
Cc: Miklos Szeredi, Dan Williams, Bernd Schubert, Alison Schofield,
John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
Matthew Wilcox, Jan Kara, Alexander Viro, David Hildenbrand,
Christian Brauner, Darrick J . Wong, Randy Dunlap, Jeff Layton,
Amir Goldstein, Stefan Hajnoczi, Joanne Koong, Josef Bacik,
Bagas Sanjaya, Chen Linxuan, James Morse, Fuad Tabba,
Sean Christopherson, Shivank Garg, Ackerley Tng, Gregory Price,
Aravind Ramesh, Ajay Joshi, venkataravis, linux-doc, linux-kernel,
nvdimm, linux-cxl, linux-fsdevel
On Wed, 7 Jan 2026 09:34:43 -0600
John Groves <John@Groves.net> wrote:
> Add new FUSE operations and capability for famfs DAX file mapping:
>
> - FUSE_CAP_DAX_FMAP: New capability flag at bit 32 (using want_ext/capable_ext
> fields) to indicate kernel and userspace support for DAX fmaps
>
> - GET_FMAP: New operation to retrieve a file map for DAX-mapped files.
> Returns a fuse_famfs_fmap_header followed by simple or interleaved
> extent descriptors. The kernel passes the file size as an argument.
>
> - GET_DAXDEV: New operation to retrieve DAX device info by index.
> Called when GET_FMAP returns an fmap referencing a previously
> unknown DAX device.
>
> These operations enable FUSE filesystems to provide direct access
> mappings to persistent memory, allowing the kernel to map files
> directly to DAX devices without page cache intermediation.
>
> Signed-off-by: John Groves <john@groves.net>
> ---
> include/fuse_common.h | 5 +++++
> include/fuse_lowlevel.h | 37 +++++++++++++++++++++++++++++++++++++
> lib/fuse_lowlevel.c | 31 ++++++++++++++++++++++++++++++-
> 3 files changed, 72 insertions(+), 1 deletion(-)
>
> diff --git a/include/fuse_common.h b/include/fuse_common.h
> index 041188e..e428ddb 100644
> --- a/include/fuse_common.h
> +++ b/include/fuse_common.h
> @@ -512,6 +512,11 @@ struct fuse_loop_config_v1 {
> */
> #define FUSE_CAP_OVER_IO_URING (1UL << 31)
>
> +/**
> + * handle files that use famfs dax fmaps
> + */
> +#define FUSE_CAP_DAX_FMAP (1UL<<32)
From the context above, looks like local style is spaces around <<
That's about the level of my understanding of the fuse code ;)
> +
> /**
> * Ioctl flags
> *
* Re: [PATCH V3 04/21] dax: Add dax_operations for use by fs-dax on fsdev dax
2026-01-08 11:50 ` Jonathan Cameron
@ 2026-01-08 15:59 ` John Groves
2026-01-08 16:10 ` Jonathan Cameron
0 siblings, 1 reply; 74+ messages in thread
From: John Groves @ 2026-01-08 15:59 UTC (permalink / raw)
To: Jonathan Cameron
Cc: Miklos Szeredi, Dan Williams, Bernd Schubert, Alison Schofield,
John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
Matthew Wilcox, Jan Kara, Alexander Viro, David Hildenbrand,
Christian Brauner, Darrick J . Wong, Randy Dunlap, Jeff Layton,
Amir Goldstein, Stefan Hajnoczi, Joanne Koong, Josef Bacik,
Bagas Sanjaya, Chen Linxuan, James Morse, Fuad Tabba,
Sean Christopherson, Shivank Garg, Ackerley Tng, Gregory Price,
Aravind Ramesh, Ajay Joshi, venkataravis, linux-doc, linux-kernel,
nvdimm, linux-cxl, linux-fsdevel
On 26/01/08 11:50AM, Jonathan Cameron wrote:
> On Wed, 7 Jan 2026 09:33:13 -0600
> John Groves <John@Groves.net> wrote:
>
> > From: John Groves <John@Groves.net>
> >
> Hi John
>
> The description should generally make sense without the title.
> Sometimes that means more or less repeating the title.
>
> A few other things inline.
Will do
>
> > * These methods are based on pmem_dax_ops from drivers/nvdimm/pmem.c
> > * fsdev_dax_direct_access() returns the hpa, pfn and kva. The kva was
> > newly stored as dev_dax->virt_addr by dev_dax_probe().
> > * The hpa/pfn are used for mmap (dax_iomap_fault()), and the kva is used
> > for read/write (dax_iomap_rw())
> > * fsdev_dax_recovery_write() and dev_dax_zero_page_range() have not been
> > tested yet. I'm looking for suggestions as to how to test those.
> > * dax-private.h: add dev_dax->cached_size, which fsdev needs to
> > remember. The dev_dax size cannot change while a driver is bound
> > (dev_dax_resize returns -EBUSY if dev->driver is set). Caching the size
> > at probe time allows fsdev's direct_access path to use it without
> > acquiring dax_dev_rwsem (which isn't exported anyway).
> >
> > Signed-off-by: John Groves <john@groves.net>
>
> > diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
> > index c5c660b193e5..9e2f83aa2584 100644
> > --- a/drivers/dax/fsdev.c
> > +++ b/drivers/dax/fsdev.c
> > @@ -27,6 +27,81 @@
> > * - No mmap support - all access is through fs-dax/iomap
> > */
> >
> > +static void fsdev_write_dax(void *pmem_addr, struct page *page,
> > + unsigned int off, unsigned int len)
> > +{
> > + while (len) {
> > + void *mem = kmap_local_page(page);
>
> I guess it's pretty simple, but do we care about HIGHMEM for this
> new feature? Maybe it's just easier to support it than argue about it however ;)
I think this compiles to zero overhead, and is an established pattern -
but I'm ok following a consensus elsewhere...
>
> > + unsigned int chunk = min_t(unsigned int, len, PAGE_SIZE - off);
> > +
> > + memcpy_flushcache(pmem_addr, mem + off, chunk);
> > + kunmap_local(mem);
> > + len -= chunk;
> > + off = 0;
> > + page++;
> > + pmem_addr += chunk;
> > + }
> > +}
> > +
> > +static long __fsdev_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
> > + long nr_pages, enum dax_access_mode mode, void **kaddr,
> > + unsigned long *pfn)
> > +{
> > + struct dev_dax *dev_dax = dax_get_private(dax_dev);
> > + size_t size = nr_pages << PAGE_SHIFT;
> > + size_t offset = pgoff << PAGE_SHIFT;
> > + void *virt_addr = dev_dax->virt_addr + offset;
> > + phys_addr_t phys;
> > + unsigned long local_pfn;
> > +
> > + WARN_ON(!dev_dax->virt_addr);
> > +
> > + phys = dax_pgoff_to_phys(dev_dax, pgoff, nr_pages << PAGE_SHIFT);
>
> Use size given you already computed it.
Not sure I follow. nr_pages is the size of the access or fault, not the size
of the device.
>
> > +
> > + if (kaddr)
> > + *kaddr = virt_addr;
> > +
> > + local_pfn = PHYS_PFN(phys);
> > + if (pfn)
> > + *pfn = local_pfn;
> > +
> > + /*
> > + * Use cached_size which was computed at probe time. The size cannot
> > + * change while the driver is bound (resize returns -EBUSY).
> > + */
> > + return PHYS_PFN(min_t(size_t, size, dev_dax->cached_size - offset));
>
> Is the min_t() needed? min() is pretty good at picking right types these days.
Changed to min()
>
> > +}
> > +
> > +static int fsdev_dax_zero_page_range(struct dax_device *dax_dev,
> > + pgoff_t pgoff, size_t nr_pages)
> > +{
> > + void *kaddr;
> > +
> > + WARN_ONCE(nr_pages > 1, "%s: nr_pages > 1\n", __func__);
> > + __fsdev_dax_direct_access(dax_dev, pgoff, 1, DAX_ACCESS, &kaddr, NULL);
> > + fsdev_write_dax(kaddr, ZERO_PAGE(0), 0, PAGE_SIZE);
> > + return 0;
> > +}
> > +
> > +static long fsdev_dax_direct_access(struct dax_device *dax_dev,
> > + pgoff_t pgoff, long nr_pages, enum dax_access_mode mode,
> > + void **kaddr, unsigned long *pfn)
> > +{
> > + return __fsdev_dax_direct_access(dax_dev, pgoff, nr_pages, mode,
> > + kaddr, pfn);
>
> Alignment in this file is a bit random, but I'd at least align this one
> after the (
Done, thanks!
John
* Re: [PATCH V3 04/21] dax: Add dax_operations for use by fs-dax on fsdev dax
2026-01-08 15:59 ` John Groves
@ 2026-01-08 16:10 ` Jonathan Cameron
0 siblings, 0 replies; 74+ messages in thread
From: Jonathan Cameron @ 2026-01-08 16:10 UTC (permalink / raw)
To: John Groves
Cc: Miklos Szeredi, Dan Williams, Bernd Schubert, Alison Schofield,
John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
Matthew Wilcox, Jan Kara, Alexander Viro, David Hildenbrand,
Christian Brauner, Darrick J . Wong, Randy Dunlap, Jeff Layton,
Amir Goldstein, Stefan Hajnoczi, Joanne Koong, Josef Bacik,
Bagas Sanjaya, Chen Linxuan, James Morse, Fuad Tabba,
Sean Christopherson, Shivank Garg, Ackerley Tng, Gregory Price,
Aravind Ramesh, Ajay Joshi, venkataravis, linux-doc, linux-kernel,
nvdimm, linux-cxl, linux-fsdevel
On Thu, 8 Jan 2026 09:59:08 -0600
John Groves <John@groves.net> wrote:
> On 26/01/08 11:50AM, Jonathan Cameron wrote:
> > On Wed, 7 Jan 2026 09:33:13 -0600
> > John Groves <John@Groves.net> wrote:
> >
> > > From: John Groves <John@Groves.net>
> > >
> > Hi John
> >
> > The description should generally make sense without the title.
> > Sometimes that means more or less repeating the title.
> >
> > A few other things inline.
>
> Will do
>
> >
> > > * These methods are based on pmem_dax_ops from drivers/nvdimm/pmem.c
> > > * fsdev_dax_direct_access() returns the hpa, pfn and kva. The kva was
> > > newly stored as dev_dax->virt_addr by dev_dax_probe().
> > > * The hpa/pfn are used for mmap (dax_iomap_fault()), and the kva is used
> > > for read/write (dax_iomap_rw())
> > > * fsdev_dax_recovery_write() and dev_dax_zero_page_range() have not been
> > > tested yet. I'm looking for suggestions as to how to test those.
> > > * dax-private.h: add dev_dax->cached_size, which fsdev needs to
> > > remember. The dev_dax size cannot change while a driver is bound
> > > (dev_dax_resize returns -EBUSY if dev->driver is set). Caching the size
> > > at probe time allows fsdev's direct_access path to use it without
> > > acquiring dax_dev_rwsem (which isn't exported anyway).
> > >
> > > Signed-off-by: John Groves <john@groves.net>
> >
> > > diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
> > > index c5c660b193e5..9e2f83aa2584 100644
> > > --- a/drivers/dax/fsdev.c
> > > +++ b/drivers/dax/fsdev.c
> > > @@ -27,6 +27,81 @@
> > > * - No mmap support - all access is through fs-dax/iomap
> > > */
> > >
> > > +static void fsdev_write_dax(void *pmem_addr, struct page *page,
> > > + unsigned int off, unsigned int len)
> > > +{
> > > + while (len) {
> > > + void *mem = kmap_local_page(page);
> >
> > I guess it's pretty simple, but do we care about HIGHMEM for this
> > new feature? Maybe it's just easier to support it than argue about it however ;)
>
> I think this compiles to zero overhead, and is an established pattern -
> but I'm ok following a consensus elsewhere...
That's fair, probably just keep it.
> > > +static long __fsdev_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
> > > + long nr_pages, enum dax_access_mode mode, void **kaddr,
> > > + unsigned long *pfn)
> > > +{
> > > + struct dev_dax *dev_dax = dax_get_private(dax_dev);
> > > + size_t size = nr_pages << PAGE_SHIFT;
> > > + size_t offset = pgoff << PAGE_SHIFT;
> > > + void *virt_addr = dev_dax->virt_addr + offset;
> > > + phys_addr_t phys;
> > > + unsigned long local_pfn;
> > > +
> > > + WARN_ON(!dev_dax->virt_addr);
> > > +
> > > + phys = dax_pgoff_to_phys(dev_dax, pgoff, nr_pages << PAGE_SHIFT);
> >
> > Use size given you already computed it.
>
> Not sure I follow. nr_pages is the size of the access or fault, not the size
> of the device.
Just above:
size_t size = nr_pages << PAGE_SHIFT;
Jonathan
* Re: [PATCH V3 05/21] dax: Add dax_set_ops() for setting dax_operations at bind time
2026-01-08 12:06 ` Jonathan Cameron
@ 2026-01-08 16:20 ` John Groves
0 siblings, 0 replies; 74+ messages in thread
From: John Groves @ 2026-01-08 16:20 UTC (permalink / raw)
To: Jonathan Cameron
Cc: Miklos Szeredi, Dan Williams, Bernd Schubert, Alison Schofield,
John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
Matthew Wilcox, Jan Kara, Alexander Viro, David Hildenbrand,
Christian Brauner, Darrick J . Wong, Randy Dunlap, Jeff Layton,
Amir Goldstein, Stefan Hajnoczi, Joanne Koong, Josef Bacik,
Bagas Sanjaya, Chen Linxuan, James Morse, Fuad Tabba,
Sean Christopherson, Shivank Garg, Ackerley Tng, Gregory Price,
Aravind Ramesh, Ajay Joshi, venkataravis, linux-doc, linux-kernel,
nvdimm, linux-cxl, linux-fsdevel
On 26/01/08 12:06PM, Jonathan Cameron wrote:
> On Wed, 7 Jan 2026 09:33:14 -0600
> John Groves <John@Groves.net> wrote:
>
> > From: John Groves <John@Groves.net>
> >
> > The dax_device is created (in the non-pmem case) at hmem probe time via
> > devm_create_dev_dax(), before we know which driver (device_dax,
> > fsdev_dax, or kmem) will bind. Because alloc_dax() is called with NULL
> > ops at that point, drivers (e.g. fsdev_dax) that need specific
> > dax_operations must set them later.
> >
> > Add dax_set_ops() exported function so fsdev_dax can set its ops at
> > probe time and clear them on remove. device_dax doesn't need ops since
> > it uses the mmap fault path directly.
> >
> > Use cmpxchg() to atomically set ops only if currently NULL, returning
> > -EBUSY if ops are already set. This prevents accidental double-binding.
> > Clearing ops (NULL) always succeeds.
> >
> > Signed-off-by: John Groves <john@groves.net>
> Hi John
>
> This one runs into the fun mess of mixing devm and other calls.
> I'd advise you just don't do it because it makes code much harder
> to review and hits the 'smells bad' button.
>
> Jonathan
If I don't stink up something, I'm not trying hard enough :D
Next iteration will be full-devm.
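
For anyone following along, the set-once semantics described in the
commit message above (ops are installed only if currently NULL, -EBUSY
otherwise, clearing always succeeds) can be sketched in userspace with
C11 atomics. This is only an illustration, not the patch code: the
demo_* names and struct layouts are made up, and
atomic_compare_exchange_strong() stands in for the kernel's cmpxchg().

```c
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; not the real definitions. */
struct demo_ops { int unused; };

struct demo_dax_device {
	_Atomic(const struct demo_ops *) ops;
};

/*
 * Install ops only if none are set; passing NULL clears unconditionally.
 * Returns 0 on success, -EBUSY if ops were already installed.
 */
static int demo_set_ops(struct demo_dax_device *dev,
			const struct demo_ops *ops)
{
	const struct demo_ops *expected = NULL;

	if (!ops) {
		atomic_store(&dev->ops, NULL);
		return 0;
	}

	/* Userspace analogue of cmpxchg(&dev->ops, NULL, ops) */
	if (!atomic_compare_exchange_strong(&dev->ops, &expected, ops))
		return -EBUSY;

	return 0;
}
```

A second caller that races with the first sees -EBUSY rather than
silently overwriting the ops, which is the double-bind protection the
commit message describes.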
[ ... ]
Thanks,
John
* Re: [PATCH V3 06/21] dax: Add fs_dax_get() func to prepare dax for fs-dax usage
2026-01-08 12:27 ` Jonathan Cameron
@ 2026-01-08 16:45 ` John Groves
0 siblings, 0 replies; 74+ messages in thread
From: John Groves @ 2026-01-08 16:45 UTC (permalink / raw)
To: Jonathan Cameron
Cc: Miklos Szeredi, Dan Williams, Bernd Schubert, Alison Schofield,
John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
Matthew Wilcox, Jan Kara, Alexander Viro, David Hildenbrand,
Christian Brauner, Darrick J . Wong, Randy Dunlap, Jeff Layton,
Amir Goldstein, Stefan Hajnoczi, Joanne Koong, Josef Bacik,
Bagas Sanjaya, Chen Linxuan, James Morse, Fuad Tabba,
Sean Christopherson, Shivank Garg, Ackerley Tng, Gregory Price,
Aravind Ramesh, Ajay Joshi, venkataravis, linux-doc, linux-kernel,
nvdimm, linux-cxl, linux-fsdevel
On 26/01/08 12:27PM, Jonathan Cameron wrote:
> On Wed, 7 Jan 2026 09:33:15 -0600
> John Groves <John@Groves.net> wrote:
>
> > The fs_dax_get() function should be called by fs-dax file systems after
> > opening a fsdev dax device. This adds holder_operations, which provides
> > a memory failure callback path and effects exclusivity between callers
> > of fs_dax_get().
> >
> > fs_dax_get() is specific to fsdev_dax, so it checks the driver type
> > (which required touching bus.[ch]). fs_dax_get() fails if fsdev_dax is
> > not bound to the memory.
> >
> > This function serves the same role as fs_dax_get_by_bdev(), which dax
> > file systems call after opening the pmem block device.
> >
> > This can't be located in fsdev.c because struct dax_device is opaque
> > there.
> >
> > This will be called by fs/fuse/famfs.c in a subsequent commit.
> >
> > Signed-off-by: John Groves <john@groves.net>
> Hi John,
>
> A few passing comments on this one.
>
> Jonathan
>
> > ---
>
> > #define dax_driver_register(driver) \
> > diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> > index ba0b4cd18a77..68c45b918cff 100644
> > --- a/drivers/dax/super.c
> > +++ b/drivers/dax/super.c
> > @@ -14,6 +14,7 @@
> > #include <linux/fs.h>
> > #include <linux/cacheinfo.h>
> > #include "dax-private.h"
> > +#include "bus.h"
> >
> > /**
> > * struct dax_device - anchor object for dax services
> > @@ -121,6 +122,59 @@ void fs_put_dax(struct dax_device *dax_dev, void *holder)
> > EXPORT_SYMBOL_GPL(fs_put_dax);
> > #endif /* CONFIG_BLOCK && CONFIG_FS_DAX */
> >
> > +#if IS_ENABLED(CONFIG_DEV_DAX_FS)
> > +/**
> > + * fs_dax_get() - get ownership of a devdax via holder/holder_ops
> > + *
> > + * fs-dax file systems call this function to prepare to use a devdax device for
> > + * fsdax. This is like fs_dax_get_by_bdev(), but the caller already has struct
> > + * dev_dax (and there is no bdev). The holder makes this exclusive.
> > + *
> > + * @dax_dev: dev to be prepared for fs-dax usage
> > + * @holder: filesystem or mapped device inside the dax_device
> > + * @hops: operations for the inner holder
> > + *
> > + * Returns: 0 on success, <0 on failure
> > + */
> > +int fs_dax_get(struct dax_device *dax_dev, void *holder,
> > + const struct dax_holder_operations *hops)
> > +{
> > + struct dev_dax *dev_dax;
> > + struct dax_device_driver *dax_drv;
> > + int id;
> > +
> > + id = dax_read_lock();
>
> Given this is an srcu_read_lock under the hood you could do similar
> to the DEFINE_LOCK_GUARD_1 for the srcu (srcu.h) (though here it's a
> DEFINE_LOCK_GUARD_0 given the lock itself isn't a parameter) and then
> use scoped_guard() here. Might not be worth the hassle and would need
> a wrapper macro to poke &dax_srcu in, which means exposing that at least
> a little in a header.
>
> DEFINE_LOCK_GUARD_0(_T->idx = dax_read_lock, dax_read_lock(_T->idx), idx);
> Based loosely on the irqflags.h irqsave one.
I'm getting more comfortable with scoped_guard(), but this feels like
a good cleanup patch addressing all call sites of dax_read_lock() - after
the famfs dust settles.
If feelings are strong about this I'm open...
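
To make the idea concrete for later, here is a rough userspace sketch
of a scope-based guard for a lock that returns an index. It relies on
the GCC/Clang cleanup attribute, which is the same mechanism the kernel
guard macros build on. Everything named demo_* is hypothetical and
merely stands in for dax_read_lock()/dax_read_unlock(); it is not a
proposed kernel interface.

```c
/* Hypothetical stand-ins for dax_read_lock()/dax_read_unlock(). */
static int demo_lock_depth;

static int demo_read_lock(void)
{
	return demo_lock_depth++;	/* index must be handed back to unlock */
}

static void demo_read_unlock(int idx)
{
	(void)idx;
	demo_lock_depth--;
}

/* Guard object: remembers the index returned by the lock call. */
struct demo_read_guard { int idx; };

static void demo_read_guard_release(struct demo_read_guard *g)
{
	demo_read_unlock(g->idx);
}

/* Take the lock now; drop it automatically when the scope ends. */
#define DEMO_READ_GUARD(name)						\
	struct demo_read_guard name					\
	__attribute__((cleanup(demo_read_guard_release))) =		\
		{ .idx = demo_read_lock() }

static int demo_check(int alive)
{
	DEMO_READ_GUARD(guard);

	if (!alive)
		return -1;	/* early return still drops the lock */

	return 0;
}
```

The payoff is that error paths no longer need a matching unlock call,
which is where the repeated dax_read_unlock(id) lines come from today.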
>
> > + if (!dax_dev || !dax_alive(dax_dev) || !igrab(&dax_dev->inode)) {
> > + dax_read_unlock(id);
> > + return -ENODEV;
> > + }
> > + dax_read_unlock(id);
> > +
> > + /* Verify the device is bound to fsdev_dax driver */
> > + dev_dax = dax_get_private(dax_dev);
> > + if (!dev_dax || !dev_dax->dev.driver) {
> > + iput(&dax_dev->inode);
> > + return -ENODEV;
> > + }
> > +
> > + dax_drv = to_dax_drv(dev_dax->dev.driver);
> > + if (dax_drv->type != DAXDRV_FSDEV_TYPE) {
> > + iput(&dax_dev->inode);
> > + return -EOPNOTSUPP;
> > + }
> > +
> > + if (cmpxchg(&dax_dev->holder_data, NULL, holder)) {
> > + iput(&dax_dev->inode);
> > + return -EBUSY;
> > + }
> > +
> > + dax_dev->holder_ops = hops;
> > +
> > + return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(fs_dax_get);
> > +#endif /* DEV_DAX_FS */
> > +
> > enum dax_device_flags {
> > /* !alive + rcu grace period == no new operations / mappings */
> > DAXDEV_ALIVE,
> > diff --git a/include/linux/dax.h b/include/linux/dax.h
> > index 3fcd8562b72b..76f2a75f3144 100644
> > --- a/include/linux/dax.h
> > +++ b/include/linux/dax.h
> > @@ -53,6 +53,7 @@ struct dax_holder_operations {
> > struct dax_device *alloc_dax(void *private, const struct dax_operations *ops);
> >
> > #if IS_ENABLED(CONFIG_DEV_DAX_FS)
> > +int fs_dax_get(struct dax_device *dax_dev, void *holder, const struct dax_holder_operations *hops);
> I'd wrap this. It's rather long and there isn't a huge readability benefit in keeping
> it on one line.
Done, thanks!
John
* Re: [PATCH V3 07/21] dax: prevent driver unbind while filesystem holds device
2026-01-08 12:34 ` Jonathan Cameron
@ 2026-01-08 18:08 ` John Groves
0 siblings, 0 replies; 74+ messages in thread
From: John Groves @ 2026-01-08 18:08 UTC (permalink / raw)
To: Jonathan Cameron
Cc: Miklos Szeredi, Dan Williams, Bernd Schubert, Alison Schofield,
John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
Matthew Wilcox, Jan Kara, Alexander Viro, David Hildenbrand,
Christian Brauner, Darrick J . Wong, Randy Dunlap, Jeff Layton,
Amir Goldstein, Stefan Hajnoczi, Joanne Koong, Josef Bacik,
Bagas Sanjaya, Chen Linxuan, James Morse, Fuad Tabba,
Sean Christopherson, Shivank Garg, Ackerley Tng, Gregory Price,
Aravind Ramesh, Ajay Joshi, venkataravis, linux-doc, linux-kernel,
nvdimm, linux-cxl, linux-fsdevel
On 26/01/08 12:34PM, Jonathan Cameron wrote:
> On Wed, 7 Jan 2026 09:33:16 -0600
> John Groves <John@Groves.net> wrote:
>
> > From: John Groves <John@Groves.net>
> >
> > Add custom bind/unbind sysfs attributes for the dax bus that check
> > whether a filesystem has registered as a holder (via fs_dax_get())
> > before allowing driver unbind.
> >
> > When a filesystem like famfs mounts on a dax device, it registers
> > itself as the holder via dax_holder_ops. Previously, there was no
> > mechanism to prevent driver unbind while the filesystem was mounted,
> > which could cause some havoc.
> >
> > The new unbind_store() checks dax_holder() and returns -EBUSY if
> > a holder is registered, giving userspace proper feedback that the
> > device is in use.
> >
> > To use our custom bind/unbind handlers instead of the default ones,
> > set suppress_bind_attrs=true on all dax drivers during registration.
>
> Whilst I appreciate that it is painful, so are many other driver unbinds
> where services are provided to another driver. Is there any precedent
> for doing something like this? If not, I'd like to see a review on this
> from one of the driver core folk. Maybe Greg KH.
>
> Might just be a case of calling it something else to avoid userspace
> tooling getting a surprise.
I'll do more digging to see if there are other patterns; feedback/ideas
requested...
>
> >
> > Signed-off-by: John Groves <john@groves.net>
> > ---
> > drivers/dax/bus.c | 53 +++++++++++++++++++++++++++++++++++++++++++++++
> > 1 file changed, 53 insertions(+)
> >
> > diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> > index 6e0e28116edc..ed453442739d 100644
> > --- a/drivers/dax/bus.c
> > +++ b/drivers/dax/bus.c
> > @@ -151,9 +151,61 @@ static ssize_t remove_id_store(struct device_driver *drv, const char *buf,
> > }
> > static DRIVER_ATTR_WO(remove_id);
> >
> > +static const struct bus_type dax_bus_type;
> > +
> > +/*
> > + * Custom bind/unbind handlers for dax bus.
> > + * The unbind handler checks if a filesystem holds the dax device and
> > + * returns -EBUSY if so, preventing driver unbind while in use.
> > + */
> > +static ssize_t unbind_store(struct device_driver *drv, const char *buf,
> > + size_t count)
> > +{
> > + struct device *dev;
> > + int rc = -ENODEV;
> > +
> > + dev = bus_find_device_by_name(&dax_bus_type, NULL, buf);
>
> struct device *dev __free(put_device) = bus_find_device_by_name()...
>
> and you can just return on error.
>
> > + if (dev && dev->driver == drv) {
> With the __free I'd flip this
> if (!dev || dev->driver != drv)
> return -ENODEV;
>
> ...
>
I like it; done.
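
For reference, the resulting shape looks roughly like the userspace
sketch below. It is an approximation under stated assumptions, not the
reworked patch: the demo_* types and helpers are invented, the cleanup
attribute stands in for __free(put_device), and clearing dev->driver
stands in for device_release_driver().

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins; none of these are the kernel definitions. */
struct demo_driver { const char *name; };

struct demo_device {
	const char *name;
	struct demo_driver *driver;
	void *holder;		/* non-NULL while a filesystem holds it */
	int refcount;
};

static struct demo_device *demo_find_device(struct demo_device *devs,
					    size_t n, const char *name)
{
	for (size_t i = 0; i < n; i++) {
		if (!strcmp(devs[i].name, name)) {
			devs[i].refcount++;	/* lookup takes a reference */
			return &devs[i];
		}
	}
	return NULL;
}

/* Cleanup handler: receives a pointer to the annotated variable. */
static void demo_put_device(struct demo_device **devp)
{
	if (*devp)
		(*devp)->refcount--;
}

#define DEMO_FREE_PUT_DEVICE __attribute__((cleanup(demo_put_device)))

static long demo_unbind(struct demo_driver *drv, struct demo_device *devs,
			size_t n, const char *name, size_t count)
{
	struct demo_device *dev DEMO_FREE_PUT_DEVICE =
		demo_find_device(devs, n, name);

	if (!dev || dev->driver != drv)
		return -ENODEV;

	if (dev->holder)
		return -EBUSY;	/* blocked: a holder is registered */

	dev->driver = NULL;	/* stands in for device_release_driver() */
	return (long)count;
}
```

Every return path drops the reference taken by the lookup, so the
out: label and the explicit put_device() both go away.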
> > + struct dev_dax *dev_dax = to_dev_dax(dev);
> > +
> > + if (dax_holder(dev_dax->dax_dev)) {
> > + dev_dbg(dev,
> > + "%s: blocking unbind due to active holder\n",
> > + __func__);
> > + rc = -EBUSY;
> > + goto out;
> > + }
> > + device_release_driver(dev);
> > + rc = count;
> > + }
> > +out:
> > + put_device(dev);
> > + return rc;
> > +}
> > +static DRIVER_ATTR_WO(unbind);
> > +
> > +static ssize_t bind_store(struct device_driver *drv, const char *buf,
> > + size_t count)
> > +{
> > + struct device *dev;
> > + int rc = -ENODEV;
> > +
> > + dev = bus_find_device_by_name(&dax_bus_type, NULL, buf);
> Use __free magic here as well..
> > + if (dev) {
> > + rc = device_driver_attach(drv, dev);
> > + if (!rc)
> > + rc = count;
> then this can be
> if (rc)
> return rc;
> return count;
>
Done
Thanks!
John
* Re: [PATCH V3 02/21] dax: add fsdev.c driver for fs-dax on character dax
2026-01-08 15:12 ` John Groves
@ 2026-01-08 21:15 ` John Groves
2026-01-08 23:25 ` Gregory Price
0 siblings, 1 reply; 74+ messages in thread
From: John Groves @ 2026-01-08 21:15 UTC (permalink / raw)
To: Jonathan Cameron
Cc: Miklos Szeredi, Dan Williams, Bernd Schubert, Alison Schofield,
John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
Matthew Wilcox, Jan Kara, Alexander Viro, David Hildenbrand,
Christian Brauner, Darrick J . Wong, Randy Dunlap, Jeff Layton,
Amir Goldstein, Stefan Hajnoczi, Joanne Koong, Josef Bacik,
Bagas Sanjaya, Chen Linxuan, James Morse, Fuad Tabba,
Sean Christopherson, Shivank Garg, Ackerley Tng, Gregory Price,
Aravind Ramesh, Ajay Joshi, venkataravis, linux-doc, linux-kernel,
nvdimm, linux-cxl, linux-fsdevel
On 26/01/08 09:12AM, John Groves wrote:
> On 26/01/08 11:31AM, Jonathan Cameron wrote:
> > On Wed, 7 Jan 2026 09:33:11 -0600
> > John Groves <John@Groves.net> wrote:
[ ... ]
> > > diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
> > > index d656e4c0eb84..491325d914a8 100644
> > > --- a/drivers/dax/Kconfig
> > > +++ b/drivers/dax/Kconfig
> > > @@ -78,4 +78,21 @@ config DEV_DAX_KMEM
> > >
> > > Say N if unsure.
> > >
> > > +config DEV_DAX_FS
> > > + tristate "FSDEV DAX: fs-dax compatible device driver"
> > > + depends on DEV_DAX
> > > + default DEV_DAX
> >
> > What's the logic for the default? Generally I'd not expect a
> > default for something new like this (so default of default == no)
>
> My thinking is that this is harmless unless you use it, but if you
> need it you need it. So defaulting to include the module seems
> viable.
>
> [ ... ]
On further deliberation, I think I'd like to get rid of
CONFIG_DEV_DAX_FS, and just include the fsdev_dax driver if DEV_DAX
and FS_DAX are configured. Then CONFIG_FUSE_FAMFS_DAX (controlling the
famfs code in fuse) can just depend on DEV_DAX, FS_DAX and FUSE_FS.
That's where I'm leaning for the next rev of the series...
John
* Re: [PATCH V3 02/21] dax: add fsdev.c driver for fs-dax on character dax
2026-01-08 21:15 ` John Groves
@ 2026-01-08 23:25 ` Gregory Price
0 siblings, 0 replies; 74+ messages in thread
From: Gregory Price @ 2026-01-08 23:25 UTC (permalink / raw)
To: John Groves
Cc: Jonathan Cameron, Miklos Szeredi, Dan Williams, Bernd Schubert,
Alison Schofield, John Groves, Jonathan Corbet, Vishal Verma,
Dave Jiang, Matthew Wilcox, Jan Kara, Alexander Viro,
David Hildenbrand, Christian Brauner, Darrick J . Wong,
Randy Dunlap, Jeff Layton, Amir Goldstein, Stefan Hajnoczi,
Joanne Koong, Josef Bacik, Bagas Sanjaya, Chen Linxuan,
James Morse, Fuad Tabba, Sean Christopherson, Shivank Garg,
Ackerley Tng, Aravind Ramesh, Ajay Joshi, venkataravis, linux-doc,
linux-kernel, nvdimm, linux-cxl, linux-fsdevel
On Thu, Jan 08, 2026 at 03:15:10PM -0600, John Groves wrote:
> On 26/01/08 09:12AM, John Groves wrote:
> > On 26/01/08 11:31AM, Jonathan Cameron wrote:
> > > On Wed, 7 Jan 2026 09:33:11 -0600
> > > John Groves <John@Groves.net> wrote:
>
> [ ... ]
>
> > > > diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
> > > > index d656e4c0eb84..491325d914a8 100644
> > > > --- a/drivers/dax/Kconfig
> > > > +++ b/drivers/dax/Kconfig
> > > > @@ -78,4 +78,21 @@ config DEV_DAX_KMEM
> > > >
> > > > Say N if unsure.
> > > >
> > > > +config DEV_DAX_FS
> > > > + tristate "FSDEV DAX: fs-dax compatible device driver"
> > > > + depends on DEV_DAX
> > > > + default DEV_DAX
> > >
> > > What's the logic for the default? Generally I'd not expect a
> > > default for something new like this (so default of default == no)
> >
> > My thinking is that this is harmless unless you use it, but if you
> > need it you need it. So defaulting to include the module seems
> > viable.
> >
> > [ ... ]
>
> On further deliberation, I think I'd like to get rid of
> CONFIG_DEV_DAX_FS, and just include the fsdev_dax driver if DEV_DAX
> and FS_DAX are configured. Then CONFIG_FUSE_FAMFS_DAX (controlling the
> famfs code in fuse) can just depend on DEV_DAX, FS_DAX and FUSE_FS.
>
> That's where I'm leaning for the next rev of the series...
>
> John
>
Please do that for CXL_DAX or whatever because it's really annoying to
have CXL and DAX configured but not have your dax device show up because
CXL_DAX wasn't configured.
:P
~Gregory
* Re: [PATCH V3 14/21] famfs_fuse: Plumb the GET_FMAP message/response
2026-01-08 12:49 ` Jonathan Cameron
@ 2026-01-09 2:12 ` John Groves
0 siblings, 0 replies; 74+ messages in thread
From: John Groves @ 2026-01-09 2:12 UTC (permalink / raw)
To: Jonathan Cameron
Cc: Miklos Szeredi, Dan Williams, Bernd Schubert, Alison Schofield,
John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
Matthew Wilcox, Jan Kara, Alexander Viro, David Hildenbrand,
Christian Brauner, Darrick J . Wong, Randy Dunlap, Jeff Layton,
Amir Goldstein, Stefan Hajnoczi, Joanne Koong, Josef Bacik,
Bagas Sanjaya, Chen Linxuan, James Morse, Fuad Tabba,
Sean Christopherson, Shivank Garg, Ackerley Tng, Gregory Price,
Aravind Ramesh, Ajay Joshi, venkataravis, linux-doc, linux-kernel,
nvdimm, linux-cxl, linux-fsdevel
On 26/01/08 12:49PM, Jonathan Cameron wrote:
> On Wed, 7 Jan 2026 09:33:23 -0600
> John Groves <John@Groves.net> wrote:
>
> > Upon completion of an OPEN, if we're in famfs-mode we do a GET_FMAP to
> > retrieve and cache up the file-to-dax map in the kernel. If this
> > succeeds, read/write/mmap are resolved direct-to-dax with no upcalls.
> >
> > Signed-off-by: John Groves <john@groves.net>
> A few things inline.
>
> J
>
> > diff --git a/fs/fuse/famfs.c b/fs/fuse/famfs.c
> > new file mode 100644
> > index 000000000000..0f7e3f00e1e7
> > --- /dev/null
> > +++ b/fs/fuse/famfs.c
> > @@ -0,0 +1,74 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * famfs - dax file system for shared fabric-attached memory
> > + *
> > + * Copyright 2023-2025 Micron Technology, Inc.
> > + *
> > + * This file system, originally based on ramfs and the dax support from xfs,
> > + * is intended to allow multiple host systems to mount a common file system
> > + * view of dax files that map to shared memory.
> > + */
> > +
> > +#include <linux/fs.h>
> > +#include <linux/mm.h>
> > +#include <linux/dax.h>
> > +#include <linux/iomap.h>
> > +#include <linux/path.h>
> > +#include <linux/namei.h>
> > +#include <linux/string.h>
> > +
> > +#include "fuse_i.h"
> > +
> > +
> > +#define FMAP_BUFSIZE PAGE_SIZE
> > +
> > +int
> > +fuse_get_fmap(struct fuse_mount *fm, struct inode *inode)
> > +{
> > + struct fuse_inode *fi = get_fuse_inode(inode);
> > + size_t fmap_bufsize = FMAP_BUFSIZE;
> > + u64 nodeid = get_node_id(inode);
> > + ssize_t fmap_size;
> > + void *fmap_buf;
> > + int rc;
> > +
> > + FUSE_ARGS(args);
> > +
> > + /* Don't retrieve if we already have the famfs metadata */
> > + if (fi->famfs_meta)
> > + return 0;
> > +
> > + fmap_buf = kcalloc(1, FMAP_BUFSIZE, GFP_KERNEL);
>
> If there is only ever 1, does kcalloc() make sense over kzalloc()?
Muscle memory? Good call, done.
>
> > + if (!fmap_buf)
> > + return -EIO;
> > +
> > + args.opcode = FUSE_GET_FMAP;
> > + args.nodeid = nodeid;
> > +
> > + /* Variable-sized output buffer
> > + * this causes fuse_simple_request() to return the size of the
> > + * output payload
> > + */
> > + args.out_argvar = true;
> > + args.out_numargs = 1;
> > + args.out_args[0].size = fmap_bufsize;
> > + args.out_args[0].value = fmap_buf;
> > +
> > + /* Send GET_FMAP command */
> > + rc = fuse_simple_request(fm, &args);
> > + if (rc < 0) {
> > + pr_err("%s: err=%d from fuse_simple_request()\n",
> > + __func__, rc);
>
> Leaks the fmap_buf? Maybe use a __free() so no need to keep track of that.
Another good one - done.
>
>
> > + return rc;
> > + }
> > + fmap_size = rc;
> > +
> > + /* We retrieved the "fmap" (the file's map to memory), but
> > + * we haven't used it yet. A call to famfs_file_init_dax() will be added
> > + * here in a subsequent patch, when we add the ability to attach
> > + * fmaps to files.
> > + */
> > +
> > + kfree(fmap_buf);
> > + return 0;
> > +}
>
> > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> > index 84d0ee2a501d..691c7850cf4e 100644
> > --- a/fs/fuse/fuse_i.h
> > +++ b/fs/fuse/fuse_i.h
> > @@ -223,6 +223,14 @@ struct fuse_inode {
>
> >
> > +static inline struct fuse_backing *famfs_meta_set(struct fuse_inode *fi,
> > + void *meta)
> > +{
> > +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> > + return xchg(&fi->famfs_meta, meta);
> > +#else
> > + return NULL;
> > +#endif
> > +}
> > +
> > +static inline void famfs_meta_free(struct fuse_inode *fi)
> > +{
> > + /* Stub will be connected in a subsequent commit */
> > +}
> > +
> > +static inline int fuse_file_famfs(struct fuse_inode *fi)
> > +{
> > +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> > + return (READ_ONCE(fi->famfs_meta) != NULL);
> > +#else
> > + return 0;
> > +#endif
> > +}
> > +
> > +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> > +int fuse_get_fmap(struct fuse_mount *fm, struct inode *inode);
> > +#else
> > +static inline int
> > +fuse_get_fmap(struct fuse_mount *fm, struct inode *inode)
> > +{
> > + return 0;
> > +}
> > +#endif
> I'd do a single block under one if IS_ENABLED() and then use an else
> for the stubs. Should end up more readable.
OK, this sounds good, but it's rebase hell (oh, the humanity! :D). Multiple
additional commits flesh out this stuff, and for now I'm giving up on that
rebase. I tried the flip, but I don't have all night (tonight), so I'm
leaving it alone.
I'll happily clean this up after the series is complete.
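
So the target is written down somewhere, the eventual layout would be
along the lines of the sketch below: one conditional block holding the
real helpers, with the stubs collected under the else. This is a
userspace approximation only. DEMO_FAMFS_ENABLED is a made-up switch
standing in for IS_ENABLED(CONFIG_FUSE_FAMFS_DAX), the demo_* names are
hypothetical, and the plain assignment stands in for xchg().

```c
#include <stddef.h>

/* Hypothetical config switch; define to 1 to build the "real" side. */
#ifndef DEMO_FAMFS_ENABLED
#define DEMO_FAMFS_ENABLED 0
#endif

struct demo_inode { void *famfs_meta; };

#if DEMO_FAMFS_ENABLED

/* Real helpers live together in one block... */
static inline void *demo_meta_set(struct demo_inode *fi, void *meta)
{
	void *old = fi->famfs_meta;

	fi->famfs_meta = meta;	/* the kernel code uses xchg() here */
	return old;
}

static inline int demo_file_famfs(struct demo_inode *fi)
{
	return fi->famfs_meta != NULL;
}

#else /* !DEMO_FAMFS_ENABLED */

/* ...and every stub sits under the single else. */
static inline void *demo_meta_set(struct demo_inode *fi, void *meta)
{
	(void)fi;
	(void)meta;
	return NULL;
}

static inline int demo_file_famfs(struct demo_inode *fi)
{
	(void)fi;
	return 0;
}

#endif /* DEMO_FAMFS_ENABLED */
```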
Thanks!
John
* Re: [PATCH V3 15/21] famfs_fuse: Create files with famfs fmaps
2026-01-08 13:14 ` Jonathan Cameron
@ 2026-01-09 14:30 ` John Groves
0 siblings, 0 replies; 74+ messages in thread
From: John Groves @ 2026-01-09 14:30 UTC (permalink / raw)
To: Jonathan Cameron
Cc: Miklos Szeredi, Dan Williams, Bernd Schubert, Alison Schofield,
John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
Matthew Wilcox, Jan Kara, Alexander Viro, David Hildenbrand,
Christian Brauner, Darrick J . Wong, Randy Dunlap, Jeff Layton,
Amir Goldstein, Stefan Hajnoczi, Joanne Koong, Josef Bacik,
Bagas Sanjaya, Chen Linxuan, James Morse, Fuad Tabba,
Sean Christopherson, Shivank Garg, Ackerley Tng, Gregory Price,
Aravind Ramesh, Ajay Joshi, venkataravis, linux-doc, linux-kernel,
nvdimm, linux-cxl, linux-fsdevel
On 26/01/08 01:14PM, Jonathan Cameron wrote:
> On Wed, 7 Jan 2026 09:33:24 -0600
> John Groves <John@Groves.net> wrote:
>
> > On completion of GET_FMAP message/response, setup the full famfs
> > metadata such that it's possible to handle read/write/mmap directly to
> > dax. Note that the devdax_iomap plumbing is not in yet...
> >
> > * Add famfs_kfmap.h: in-memory structures for resolving famfs file maps
> > (fmaps) to dax.
> > * famfs.c: allocate, initialize and free fmaps
> > * inode.c: only allow famfs mode if the fuse server has CAP_SYS_RAWIO
> > * Update MAINTAINERS for the new files.
> >
> > Signed-off-by: John Groves <john@groves.net>
>
> > diff --git a/fs/fuse/famfs.c b/fs/fuse/famfs.c
> > index 0f7e3f00e1e7..2aabd1d589fd 100644
> > --- a/fs/fuse/famfs.c
> > +++ b/fs/fuse/famfs.c
> > @@ -17,9 +17,355 @@
> > #include <linux/namei.h>
> > #include <linux/string.h>
> >
> > +#include "famfs_kfmap.h"
> > #include "fuse_i.h"
> >
> >
> > +/***************************************************************************/
> Who doesn't like stars? Why have them here?
>
> > +
> > +void
> > +__famfs_meta_free(void *famfs_meta)
>
> Maybe a local convention, but if not one line.
> Same for other cases.
Done
>
> > +{
> > + struct famfs_file_meta *fmap = famfs_meta;
> > +
> > + if (!fmap)
> > + return;
> > +
> > + if (fmap) {
>
> Well that's never going to fail given 2 lines above.
Good eye. Thanks.
>
>
> > + switch (fmap->fm_extent_type) {
> > + case SIMPLE_DAX_EXTENT:
> > + kfree(fmap->se);
> > + break;
> > + case INTERLEAVED_EXTENT:
> > + if (fmap->ie)
> > + kfree(fmap->ie->ie_strips);
> > +
> > + kfree(fmap->ie);
> > + break;
> > + default:
> > + pr_err("%s: invalid fmap type\n", __func__);
> > + break;
> > + }
> > + }
> > + kfree(fmap);
> > +}
>
> > +/**
> > + * famfs_fuse_meta_alloc() - Allocate famfs file metadata
> > + * @metap: Pointer to an mcache_map_meta pointer
> > + * @ext_count: The number of extents needed
>
> run kernel-doc over the file as that's not the parameters...
Not sure how I managed that; Fixed, thanks!
>
> > + *
> > + * Returns: 0=success
> > + * -errno=failure
> > + */
> > +static int
> > +famfs_fuse_meta_alloc(
> > + void *fmap_buf,
> > + size_t fmap_buf_size,
> > + struct famfs_file_meta **metap)
> > +{
> > + struct famfs_file_meta *meta = NULL;
> > + struct fuse_famfs_fmap_header *fmh;
> > + size_t extent_total = 0;
> > + size_t next_offset = 0;
> > + int errs = 0;
> > + int i, j;
> > + int rc;
> > +
> > + fmh = (struct fuse_famfs_fmap_header *)fmap_buf;
>
> void * so cast not needed and hence just assign it at the
> declaration.
Indeed, thanks.
>
> > +
> > + /* Move past fmh in fmap_buf */
> > + next_offset += sizeof(*fmh);
> > + if (next_offset > fmap_buf_size) {
> > + pr_err("%s:%d: fmap_buf underflow offset/size %ld/%ld\n",
> > + __func__, __LINE__, next_offset, fmap_buf_size);
> > + return -EINVAL;
> > + }
> > +
> > + if (fmh->nextents < 1) {
> > + pr_err("%s: nextents %d < 1\n", __func__, fmh->nextents);
> > + return -EINVAL;
> > + }
> > +
> > + if (fmh->nextents > FUSE_FAMFS_MAX_EXTENTS) {
> > + pr_err("%s: nextents %d > max (%d) 1\n",
> > + __func__, fmh->nextents, FUSE_FAMFS_MAX_EXTENTS);
> > + return -E2BIG;
> > + }
> > +
> > + meta = kzalloc(sizeof(*meta), GFP_KERNEL);
>
> Maybe sprinkle some __free magic on this then you can return in
> all the goto error_out places which to me makes this more readable.
I like it, and I learned how to make __famfs_meta_free() into a
__free() handler.
Done
>
> > + if (!meta)
> > + return -ENOMEM;
> > +
> > + meta->error = false;
> > + meta->file_type = fmh->file_type;
> > + meta->file_size = fmh->file_size;
> > + meta->fm_extent_type = fmh->ext_type;
> > +
> > + switch (fmh->ext_type) {
> > + case FUSE_FAMFS_EXT_SIMPLE: {
> > + struct fuse_famfs_simple_ext *se_in;
> > +
> > + se_in = (struct fuse_famfs_simple_ext *)(fmap_buf + next_offset);
>
> void * so no need for cast. Though you could keep the cast but apply it to
> fmh + 1 to take advantage of that type.
done, thanks
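
For illustration, the cast-free header assignment plus the fmh + 1
form, combined with the same offset-versus-size checks, looks roughly
like the userspace sketch below. The demo_* structs are invented for
the example and do not match the real fuse_famfs_* wire layout; only
the parsing pattern is meant to carry over.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical wire structs; names and layout are illustrative only. */
struct demo_fmap_header {
	uint32_t nextents;
	uint32_t reserved;
};

struct demo_simple_ext {
	uint64_t offset;
	uint64_t len;
};

#define DEMO_MAX_EXTENTS 32

/*
 * Check that buf holds a header followed by nextents extents, then
 * return the first extent via *extp and the summed length via *total_len.
 */
static int demo_parse_fmap(const void *buf, size_t buf_size,
			   const struct demo_simple_ext **extp,
			   uint64_t *total_len)
{
	const struct demo_fmap_header *fmh = buf; /* void *: no cast needed */
	const struct demo_simple_ext *ext;
	size_t next_offset = sizeof(*fmh);
	uint64_t total = 0;

	/* The header itself must fit before any field is read. */
	if (next_offset > buf_size)
		return -EINVAL;

	if (fmh->nextents < 1)
		return -EINVAL;
	if (fmh->nextents > DEMO_MAX_EXTENTS)
		return -E2BIG;

	/* Extents start immediately after the header. */
	ext = (const struct demo_simple_ext *)(fmh + 1);

	/* Move past the extents and confirm they fit in the buffer. */
	next_offset += (size_t)fmh->nextents * sizeof(*ext);
	if (next_offset > buf_size)
		return -EINVAL;

	for (uint32_t i = 0; i < fmh->nextents; i++)
		total += ext[i].len;

	*extp = ext;
	*total_len = total;
	return 0;
}
```

Bounding nextents before the multiplication also keeps the size
computation from wrapping.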
>
>
> > +
> > + /* Move past simple extents */
> > + next_offset += fmh->nextents * sizeof(*se_in);
> > + if (next_offset > fmap_buf_size) {
> > + pr_err("%s:%d: fmap_buf underflow offset/size %ld/%ld\n",
> > + __func__, __LINE__, next_offset, fmap_buf_size);
> > + rc = -EINVAL;
> > + goto errout;
> > + }
> > +
> > + meta->fm_nextents = fmh->nextents;
> > +
> > + meta->se = kcalloc(meta->fm_nextents, sizeof(*(meta->se)),
> > + GFP_KERNEL);
> > + if (!meta->se) {
> > + rc = -ENOMEM;
> > + goto errout;
> > + }
> > +
> > + if ((meta->fm_nextents > FUSE_FAMFS_MAX_EXTENTS) ||
> > + (meta->fm_nextents < 1)) {
> > + rc = -EINVAL;
> > + goto errout;
> > + }
> > +
> > + for (i = 0; i < fmh->nextents; i++) {
> > + meta->se[i].dev_index = se_in[i].se_devindex;
> > + meta->se[i].ext_offset = se_in[i].se_offset;
> > + meta->se[i].ext_len = se_in[i].se_len;
> > +
> > + /* Record bitmap of referenced daxdev indices */
> > + meta->dev_bitmap |= (1 << meta->se[i].dev_index);
> > +
> > + errs += famfs_check_ext_alignment(&meta->se[i]);
> > +
> > + extent_total += meta->se[i].ext_len;
> > + }
> > + break;
> > + }
> > +
> > + case FUSE_FAMFS_EXT_INTERLEAVE: {
> > + s64 size_remainder = meta->file_size;
> > + struct fuse_famfs_iext *ie_in;
> > + int niext = fmh->nextents;
> > +
> > + meta->fm_niext = niext;
> > +
> > + /* Allocate interleaved extent */
> > + meta->ie = kcalloc(niext, sizeof(*(meta->ie)), GFP_KERNEL);
> > + if (!meta->ie) {
> > + rc = -ENOMEM;
> > + goto errout;
> > + }
> > +
> > + /*
> > + * Each interleaved extent has a simple extent list of strips.
> > + * Outer loop is over separate interleaved extents
> > + */
> > + for (i = 0; i < niext; i++) {
> > + u64 nstrips;
> > + struct fuse_famfs_simple_ext *sie_in;
> > +
> > + /* ie_in = one interleaved extent in fmap_buf */
> > + ie_in = (struct fuse_famfs_iext *)
> > + (fmap_buf + next_offset);
>
> void * so no cast needed.
Right, thanks
>
> > +
> > + /* Move past one interleaved extent header in fmap_buf */
> > + next_offset += sizeof(*ie_in);
> > + if (next_offset > fmap_buf_size) {
> > + pr_err("%s:%d: fmap_buf underflow offset/size %ld/%ld\n",
> > + __func__, __LINE__, next_offset,
> > + fmap_buf_size);
> > + rc = -EINVAL;
> > + goto errout;
> > + }
> > +
> > + nstrips = ie_in->ie_nstrips;
> > + meta->ie[i].fie_chunk_size = ie_in->ie_chunk_size;
> > + meta->ie[i].fie_nstrips = ie_in->ie_nstrips;
> > + meta->ie[i].fie_nbytes = ie_in->ie_nbytes;
> > +
> > + if (!meta->ie[i].fie_nbytes) {
> > + pr_err("%s: zero-length interleave!\n",
> > + __func__);
> > + rc = -EINVAL;
> > + goto errout;
> > + }
> > +
> > + /* sie_in = the strip extents in fmap_buf */
> > + sie_in = (struct fuse_famfs_simple_ext *)
> > + (fmap_buf + next_offset);
> no cast needed.
Done, thanks
>
> > +
> > + /* Move past strip extents in fmap_buf */
> > + next_offset += nstrips * sizeof(*sie_in);
> > + if (next_offset > fmap_buf_size) {
> > + pr_err("%s:%d: fmap_buf underflow offset/size %ld/%ld\n",
> > + __func__, __LINE__, next_offset,
> > + fmap_buf_size);
> > + rc = -EINVAL;
> > + goto errout;
> > + }
> > +
> > + if ((nstrips > FUSE_FAMFS_MAX_STRIPS) || (nstrips < 1)) {
> > + pr_err("%s: invalid nstrips=%lld (max=%d)\n",
> > + __func__, nstrips,
> > + FUSE_FAMFS_MAX_STRIPS);
> > + errs++;
> > + }
> > +
> > + /* Allocate strip extent array */
> > + meta->ie[i].ie_strips = kcalloc(ie_in->ie_nstrips,
> > + sizeof(meta->ie[i].ie_strips[0]),
> > + GFP_KERNEL);
>
> Align all lines after 1st one to same point.
Yeah
>
> ...
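(Aside for readers of the archive: the parsing loop quoted above advances next_offset through fmap_buf and bounds-checks after each step. Below is a minimal userspace sketch of one way to order those checks so that the strip count is validated before it is used in a size calculation. It is not the famfs code: struct wire_iext, struct wire_strip, MAX_STRIPS and parse_iext() are hypothetical names, and the field layout is assumed for illustration only.)

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_STRIPS 16 /* hypothetical stand-in for FUSE_FAMFS_MAX_STRIPS */

/* Hypothetical wire layout: one header followed by nstrips strip records */
struct wire_iext {
	uint64_t nstrips;
	uint64_t chunk_size;
	uint64_t nbytes;
};

struct wire_strip {
	uint64_t dev_index;
	uint64_t offset;
	uint64_t len;
};

/*
 * Validate one interleaved extent starting at @offset in @buf.
 * Returns the number of bytes it occupies (header plus strips),
 * or -1 if the buffer is too short or the strip count is out of range.
 */
static long parse_iext(const void *buf, size_t buf_size, size_t offset,
		       struct wire_iext *hdr)
{
	size_t strips_bytes;

	/* The header must fit before any of its fields are read */
	if (offset > buf_size || buf_size - offset < sizeof(*hdr))
		return -1;
	memcpy(hdr, (const char *)buf + offset, sizeof(*hdr));

	/* Range-check nstrips before multiplying by it */
	if (hdr->nstrips < 1 || hdr->nstrips > MAX_STRIPS)
		return -1;
	strips_bytes = (size_t)hdr->nstrips * sizeof(struct wire_strip);

	/* The strip array must fit in what remains of the buffer */
	if (buf_size - offset - sizeof(*hdr) < strips_bytes)
		return -1;

	return (long)(sizeof(*hdr) + strips_bytes);
}
```

A caller would add the return value to its running offset and stop at the first -1. Writing the comparisons as subtractions from buf_size avoids the addition wrapping on a hostile length.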
>
> > +
> > +/**
> > + * famfs_file_init_dax() - init famfs dax file metadata
> > + *
> > + * @fm: fuse_mount
> > + * @inode: the inode
> > + * @fmap_buf: fmap response message
> > + * @fmap_size: Size of the fmap message
> > + *
> > + * Initialize famfs metadata for a file, based on the contents of the GET_FMAP
> > + * response
> > + *
> > + * Return: 0=success
> > + * -errno=failure
> > + */
> > +int
> > +famfs_file_init_dax(
> > + struct fuse_mount *fm,
> > + struct inode *inode,
> > + void *fmap_buf,
> > + size_t fmap_size)
> > +{
> > + struct fuse_inode *fi = get_fuse_inode(inode);
> > + struct famfs_file_meta *meta = NULL;
> > + int rc = 0;
>
> Always set before use.
Roger that - and it went away with the __free thingy anyway
>
> > +
> > + if (fi->famfs_meta) {
> > + pr_notice("%s: i_no=%ld fmap_size=%ld ALREADY INITIALIZED\n",
> > + __func__,
> > + inode->i_ino, fmap_size);
> > + return 0;
> > + }
> > +
> > + rc = famfs_fuse_meta_alloc(fmap_buf, fmap_size, &meta);
> > + if (rc)
> > + goto errout;
> > +
> > + /* Publish the famfs metadata on fi->famfs_meta */
> > + inode_lock(inode);
> > + if (fi->famfs_meta) {
> > + rc = -EEXIST; /* file already has famfs metadata */
> > + } else {
> > + if (famfs_meta_set(fi, meta) != NULL) {
> > + pr_debug("%s: file already had metadata\n", __func__);
> > + __famfs_meta_free(meta);
> > + /* rc is 0 - the file is valid */
> > + goto unlock_out;
> > + }
> > + i_size_write(inode, meta->file_size);
> > + inode->i_flags |= S_DAX;
> > + }
> > + unlock_out:
> > + inode_unlock(inode);
> > +
> > +errout:
> > + if (rc)
> > + __famfs_meta_free(meta);
>
> For readability I'd split the good and bad exit paths even if unlock
> needs to happen in two places.
Done
>
>
> > +
> > + return rc;
> > +}
> > +
>
> > diff --git a/fs/fuse/famfs_kfmap.h b/fs/fuse/famfs_kfmap.h
> > new file mode 100644
> > index 000000000000..058645cb10a1
> > --- /dev/null
> > +++ b/fs/fuse/famfs_kfmap.h
> > @@ -0,0 +1,67 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +/*
> > + * famfs - dax file system for shared fabric-attached memory
> > + *
> > + * Copyright 2023-2025 Micron Technology, Inc.
> > + */
> > +#ifndef FAMFS_KFMAP_H
> > +#define FAMFS_KFMAP_H
> > +
> > +/*
> > + * The structures below are the in-memory metadata format for famfs files.
> > + * Metadata retrieved via the GET_FMAP response is converted to this format
> > + * for use in resolving file mapping faults.
>
> bonus space after in
Removed
>
> > + *
> > + * The GET_FMAP response contains the same information, but in a more
> > + * message-and-versioning-friendly format. Those structs can be found in the
> > + * famfs section of include/uapi/linux/fuse.h (aka fuse_kernel.h in libfuse)
> > + */
>
> > +/*
> > + * Each famfs dax file has this hanging from its fuse_inode->famfs_meta
> > + */
> > +struct famfs_file_meta {
> > + bool error;
> > + enum famfs_file_type file_type;
> > + size_t file_size;
> > + enum famfs_extent_type fm_extent_type;
> > + u64 dev_bitmap; /* bitmap of referenced daxdevs by index */
> > + union { /* This will make code a bit more readable */
>
> Not sure what the comment is for. I'd drop it.
I'm sure it made sense to me at some point but not now. Gone.
>
>
> > + struct {
> > + size_t fm_nextents;
> > + struct famfs_meta_simple_ext *se;
> > + };
> > + struct {
> > + size_t fm_niext;
> > + struct famfs_meta_interleaved_ext *ie;
> > + };
> > + };
> > +};
> > +
> > +#endif /* FAMFS_KFMAP_H */
Thanks!
John
* Re: [PATCH V3 17/21] famfs_fuse: Plumb dax iomap and fuse read/write/mmap
2026-01-08 15:13 ` Jonathan Cameron
@ 2026-01-09 17:44 ` John Groves
0 siblings, 0 replies; 74+ messages in thread
From: John Groves @ 2026-01-09 17:44 UTC (permalink / raw)
To: Jonathan Cameron
Cc: Miklos Szeredi, Dan Williams, Bernd Schubert, Alison Schofield,
John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
Matthew Wilcox, Jan Kara, Alexander Viro, David Hildenbrand,
Christian Brauner, Darrick J . Wong, Randy Dunlap, Jeff Layton,
Amir Goldstein, Stefan Hajnoczi, Joanne Koong, Josef Bacik,
Bagas Sanjaya, Chen Linxuan, James Morse, Fuad Tabba,
Sean Christopherson, Shivank Garg, Ackerley Tng, Gregory Price,
Aravind Ramesh, Ajay Joshi, venkataravis, linux-doc, linux-kernel,
nvdimm, linux-cxl, linux-fsdevel
On 26/01/08 03:13PM, Jonathan Cameron wrote:
> On Wed, 7 Jan 2026 09:33:26 -0600
> John Groves <John@Groves.net> wrote:
>
> > This commit fills in read/write/mmap handling for famfs files. The
> > dev_dax_iomap interface is used - just like xfs in fs-dax mode.
> >
> > * Read/write are handled by famfs_fuse_[read|write]_iter() via
> > dax_iomap_rw() to fsdev_dax.
> > * Mmap is handled by famfs_fuse_mmap()
> > * Faults are handled by famfs_filemap*fault(), using dax_iomap_fault()
> > to fsdev_dax.
> > * File offset to dax offset resolution is handled via
> > famfs_fuse_iomap_begin(), which uses famfs "fmaps" to resolve the
> > requested (file, offset) to an offset on a dax device (by way of
> > famfs_fileofs_to_daxofs() and famfs_interleave_fileofs_to_daxofs())
> >
> > Signed-off-by: John Groves <john@groves.net>
> A few minor comments and suggestions inline.
>
> Thanks,
>
> Jonathan
>
> > ---
> > fs/fuse/famfs.c | 458 +++++++++++++++++++++++++++++++++++++++++++++++
> > fs/fuse/file.c | 18 +-
> > fs/fuse/fuse_i.h | 18 ++
> > 3 files changed, 492 insertions(+), 2 deletions(-)
> >
> > diff --git a/fs/fuse/famfs.c b/fs/fuse/famfs.c
> > index b5cd1b5c1d6c..c02b14789c6e 100644
> > --- a/fs/fuse/famfs.c
> > +++ b/fs/fuse/famfs.c
> > @@ -602,6 +602,464 @@ famfs_file_init_dax(
> > return rc;
> > }
> >
> > +/*********************************************************************
> > + * iomap_operations
> > + *
> > + * This stuff uses the iomap (dax-related) helpers to resolve file offsets to
> > + * offsets within a dax device.
> > + */
> > +
> > +static ssize_t famfs_file_bad(struct inode *inode);
> > +
> > +static int
> > +famfs_interleave_fileofs_to_daxofs(struct inode *inode, struct iomap *iomap,
> > + loff_t file_offset, off_t len, unsigned int flags)
> > +{
> > + struct fuse_inode *fi = get_fuse_inode(inode);
> > + struct famfs_file_meta *meta = fi->famfs_meta;
> > + struct fuse_conn *fc = get_fuse_conn(inode);
> > + loff_t local_offset = file_offset;
> > + int i;
> > +
> > + /* This function is only for extent_type INTERLEAVED_EXTENT */
> > + if (meta->fm_extent_type != INTERLEAVED_EXTENT) {
> > + pr_err("%s: bad extent type\n", __func__);
> > + goto err_out;
> > + }
> > +
> > + if (famfs_file_bad(inode))
> > + goto err_out;
> > +
> > + iomap->offset = file_offset;
> > +
> > + for (i = 0; i < meta->fm_niext; i++) {
> > + struct famfs_meta_interleaved_ext *fei = &meta->ie[i];
> > + u64 chunk_size = fei->fie_chunk_size;
> > + u64 nstrips = fei->fie_nstrips;
> > + u64 ext_size = fei->fie_nbytes;
> > +
> > + ext_size = min_t(u64, ext_size, meta->file_size);
> min() probably fine. Also, how about avoiding the assignment that
> is immediately overwritten.
>
> u64 ext_size = min(fei->fie_nbytes, meta->file_size);
Done and done, thanks
>
> > +
> > + if (ext_size == 0) {
> > + pr_err("%s: ext_size=%lld file_size=%ld\n",
> > + __func__, fei->fie_nbytes, meta->file_size);
> > + goto err_out;
> > + }
> > +
> > + /* Is the data in this striped extent? */
> > + if (local_offset < ext_size) {
> Similar comments to below, though here that would mean not being able
> to scope these local variables as tightly so maybe not worth it to reduce
> indent.
I'll look at refactoring the fault handlers after the rebase-hell dust
settles on review stuff. They're quite stable as is, so I don't want to risk
a mistake while I'm branch-wrangling
>
> > + u64 chunk_num = local_offset / chunk_size;
> > + u64 chunk_offset = local_offset % chunk_size;
> > + u64 stripe_num = chunk_num / nstrips;
> > + u64 strip_num = chunk_num % nstrips;
> > + u64 chunk_remainder = chunk_size - chunk_offset;
>
> I'd group chunk stuff, then strip stuff.
chunk, stripe, strip. Done
(Had to stare at it to make sure inputs were set first...)
>
> > + u64 strip_offset = chunk_offset + (stripe_num * chunk_size);
> > + u64 strip_dax_ofs = fei->ie_strips[strip_num].ext_offset;
> > + u64 strip_devidx = fei->ie_strips[strip_num].dev_index;
> > +
> > + if (strip_devidx >= fc->dax_devlist->nslots) {
> > + pr_err("%s: strip_devidx %llu >= nslots %d\n",
> > + __func__, strip_devidx,
> > + fc->dax_devlist->nslots);
> > + goto err_out;
> > + }
> > +
> > + if (!fc->dax_devlist->devlist[strip_devidx].valid) {
> > + pr_err("%s: daxdev=%lld invalid\n", __func__,
> > + strip_devidx);
> > + goto err_out;
> > + }
> > +
> > + iomap->addr = strip_dax_ofs + strip_offset;
> > + iomap->offset = file_offset;
> > + iomap->length = min_t(loff_t, len, chunk_remainder);
> > +
> > + iomap->dax_dev = fc->dax_devlist->devlist[strip_devidx].devp;
> > +
> > + iomap->type = IOMAP_MAPPED;
> > + iomap->flags = flags;
> > +
> > + return 0;
> > + }
> > + local_offset -= ext_size; /* offset is beyond this striped extent */
> > + }
> > +
> > + err_out:
> > + pr_err("%s: err_out\n", __func__);
> > +
> > + /* We fell out the end of the extent list.
> > + * Set iomap to zero length in this case, and return 0
> > + * This just means that the r/w is past EOF
> > + */
> > + iomap->addr = 0; /* there is no valid dax device offset */
> > + iomap->offset = file_offset; /* file offset */
> > + iomap->length = 0; /* this had better result in no access to dax mem */
> > + iomap->dax_dev = NULL;
> > + iomap->type = IOMAP_MAPPED;
> > + iomap->flags = flags;
> > +
> > + return 0;
> > +}
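(Aside for readers of the archive: the chunk/stripe/strip arithmetic in the function above can be followed more easily in isolation. The sketch below is a userspace restatement of that math for a single interleaved extent. struct strip, struct iext, struct resolved and interleave_resolve() are hypothetical names, not the patch's types, and the dax device validation is left out.)

```c
#include <stdint.h>

/* Hypothetical stand-ins for the famfs strip and interleaved-extent metadata */
struct strip {
	uint64_t dev_index;   /* which dax device holds this strip */
	uint64_t ext_offset;  /* where the strip starts on that device */
};

struct iext {
	uint64_t chunk_size;
	uint64_t nstrips;
	uint64_t nbytes;      /* bytes of file data covered by this extent */
	const struct strip *strips;
};

struct resolved {
	uint64_t dev_index;
	uint64_t dax_offset;
	uint64_t length;
};

/*
 * Resolve @offset (relative to the start of the interleaved extent) to a
 * device, a device offset and a contiguous length.
 * Returns 0 on success, -1 if the extent is malformed or the offset is
 * outside it.
 */
static int interleave_resolve(const struct iext *ie, uint64_t offset,
			      uint64_t len, struct resolved *out)
{
	if (!ie->chunk_size || !ie->nstrips || offset >= ie->nbytes)
		return -1;

	/* chunk: which chunk of the file, and where inside it */
	uint64_t chunk_num = offset / ie->chunk_size;
	uint64_t chunk_offset = offset % ie->chunk_size;
	uint64_t chunk_remainder = ie->chunk_size - chunk_offset;

	/* stripe: one full pass across all strips */
	uint64_t stripe_num = chunk_num / ie->nstrips;

	/* strip: the per-device extent this chunk lands on */
	uint64_t strip_num = chunk_num % ie->nstrips;
	uint64_t strip_offset = chunk_offset + stripe_num * ie->chunk_size;

	out->dev_index = ie->strips[strip_num].dev_index;
	out->dax_offset = ie->strips[strip_num].ext_offset + strip_offset;
	/* A mapping never crosses a chunk boundary */
	out->length = len < chunk_remainder ? len : chunk_remainder;
	return 0;
}
```

With two strips and a 4 KiB chunk, file offset 3 * 4096 + 100 lands in chunk 3, which is stripe 1 on strip 1, at strip offset 4096 + 100. The returned length is capped at the 3996 bytes left in that chunk.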
> > +
> > +/**
> > + * famfs_fileofs_to_daxofs() - Resolve (file, offset, len) to (daxdev, offset, len)
> > + *
> > + * This function is called by famfs_fuse_iomap_begin() to resolve an offset in a
> > + * file to an offset in a dax device. This is upcalled from dax from calls to
> > + * both dax_iomap_fault() and dax_iomap_rw(). Dax finishes the job resolving
> > + * a fault to a specific physical page (the fault case) or doing a memcpy
> > + * variant (the rw case)
> > + *
> > + * Pages can be PTE (4k), PMD (2MiB) or (theoretically) PuD (1GiB)
> > + * (these sizes are for X86; may vary on other cpu architectures)
> > + *
> > + * @inode: The file where the fault occurred
> > + * @iomap: To be filled in to indicate where to find the right memory,
> > + * relative to a dax device.
> > + * @file_offset: Within the file where the fault occurred (will be page boundary)
> > + * @len: The length of the faulted mapping (will be a page multiple)
> > + * (will be trimmed in *iomap if it's disjoint in the extent list)
> > + * @flags:
>
> As below. All should have docs, even if trivial.
Done, thanks
>
> > + *
> > + * Return values: 0. (info is returned in a modified @iomap struct)
> > + */
> > +static int
> > +famfs_fileofs_to_daxofs(struct inode *inode, struct iomap *iomap,
> > + loff_t file_offset, off_t len, unsigned int flags)
> > +{
> > + struct fuse_inode *fi = get_fuse_inode(inode);
> > + struct famfs_file_meta *meta = fi->famfs_meta;
> > + struct fuse_conn *fc = get_fuse_conn(inode);
> > + loff_t local_offset = file_offset;
> > + int i;
> > +
> > + if (!fc->dax_devlist) {
> > + pr_err("%s: null dax_devlist\n", __func__);
> > + goto err_out;
> > + }
> > +
> > + if (famfs_file_bad(inode))
> > + goto err_out;
> > +
> > + if (meta->fm_extent_type == INTERLEAVED_EXTENT)
> > + return famfs_interleave_fileofs_to_daxofs(inode, iomap,
> > + file_offset,
> > + len, flags);
> > +
> > + iomap->offset = file_offset;
> > +
> > + for (i = 0; i < meta->fm_nextents; i++) {
>
> I'd drag declaration of i into the loop init.
Done, thanks
>
> > + /* TODO: check devindex too */
> > + loff_t dax_ext_offset = meta->se[i].ext_offset;
> > + loff_t dax_ext_len = meta->se[i].ext_len;
> > + u64 daxdev_idx = meta->se[i].dev_index;
> > +
> > +
> > + /* TODO: test that superblock and log offsets only happen
> > + * with superblock and log files. Requires instrumentation
> > + * from user space...
> > + */
> > +
> > + /* local_offset is the offset minus the size of extents skipped
> > + * so far; If local_offset < dax_ext_len, the data of interest
> > + * starts in this extent
> > + */
> > + if (local_offset < dax_ext_len) {
>
> Maybe flip logic and use a continue. Mostly to reduce indent of the rest of
> this. Or maybe a helper function for this bit.
May do. I don't want to rush changes to the primary fault handlers because
they're quite stable and are absolute core functionality.
>
>
> > + loff_t ext_len_remainder = dax_ext_len - local_offset;
> > + struct famfs_daxdev *dd;
> > +
> > + if (daxdev_idx >= fc->dax_devlist->nslots) {
> > + pr_err("%s: daxdev_idx %llu >= nslots %d\n",
> > + __func__, daxdev_idx,
> > + fc->dax_devlist->nslots);
> > + goto err_out;
> > + }
> > +
> > + dd = &fc->dax_devlist->devlist[daxdev_idx];
> > +
> > + if (!dd->valid || dd->error) {
> > + pr_err("%s: daxdev=%lld %s\n", __func__,
> > + daxdev_idx,
> > + dd->valid ? "error" : "invalid");
> > + goto err_out;
> > + }
> > +
> > + /*
> > + * OK, we found the file metadata extent where this
> > + * data begins
> > + * @local_offset - The offset within the current
> > + * extent
> > + * @ext_len_remainder - Remaining length of ext after
> > + * skipping local_offset
> > + * Outputs:
> > + * iomap->addr: the offset within the dax device where
> > + * the data starts
> > + * iomap->offset: the file offset
> > + * iomap->length: the valid length resolved here
> > + */
> > + iomap->addr = dax_ext_offset + local_offset;
> > + iomap->offset = file_offset;
> > + iomap->length = min_t(loff_t, len, ext_len_remainder);
> > +
> > + iomap->dax_dev = fc->dax_devlist->devlist[daxdev_idx].devp;
> > +
> > + iomap->type = IOMAP_MAPPED;
> > + iomap->flags = flags;
> > + return 0;
> > + }
> > + local_offset -= dax_ext_len; /* Get ready for the next extent */
> > + }
> > +
> > + err_out:
> > + pr_err("%s: err_out\n", __func__);
> > +
> > + /* We fell out the end of the extent list.
> > + * Set iomap to zero length in this case, and return 0
> > + * This just means that the r/w is past EOF
> > + */
> > + iomap->addr = 0; /* there is no valid dax device offset */
> > + iomap->offset = file_offset; /* file offset */
> > + iomap->length = 0; /* this had better result in no access to dax mem */
> > + iomap->dax_dev = NULL;
> > + iomap->type = IOMAP_MAPPED;
> > + iomap->flags = flags;
> > +
> > + return 0;
> > +}
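(Aside for readers of the archive: the simple-extent walk above, restated as a userspace sketch using the flipped condition and continue that the review suggests. struct simple_ext, struct mapping and extent_resolve() are hypothetical names. Unlike the patch, which returns 0 with a zero-length iomap when it runs off the end of the list, this sketch returns -1 so the two cases are easy to tell apart.)

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one simple extent in the file's metadata */
struct simple_ext {
	uint64_t dev_index;
	uint64_t ext_offset;  /* offset of the extent on the dax device */
	uint64_t ext_len;
};

struct mapping {
	uint64_t dev_index;
	uint64_t dax_offset;
	uint64_t length;
};

/*
 * Walk the extent list, subtracting each skipped extent's length from the
 * running offset until the extent containing @file_offset is found.
 * Returns 0 on success, -1 if @file_offset is past the last extent.
 */
static int extent_resolve(const struct simple_ext *se, size_t nextents,
			  uint64_t file_offset, uint64_t len,
			  struct mapping *out)
{
	uint64_t local_offset = file_offset;

	for (size_t i = 0; i < nextents; i++) {
		uint64_t remainder;

		if (local_offset >= se[i].ext_len) {
			/* Not in this extent; get ready for the next one */
			local_offset -= se[i].ext_len;
			continue;
		}

		/* Found it: trim the length to what is left of this extent */
		remainder = se[i].ext_len - local_offset;
		out->dev_index = se[i].dev_index;
		out->dax_offset = se[i].ext_offset + local_offset;
		out->length = len < remainder ? len : remainder;
		return 0;
	}

	return -1;
}
```

For extents of 100 and 200 bytes, file offset 150 resolves to offset 50 within the second extent, and a 500-byte request is trimmed to the 150 bytes that remain there.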
> > +
> > +/**
> > + * famfs_fuse_iomap_begin() - Handler for iomap_begin upcall from dax
> > + *
> > + * This function is pretty simple because files are
> > + * * never partially allocated
> > + * * never have holes (never sparse)
> > + * * never "allocate on write"
> > + *
> > + * @inode: inode for the file being accessed
> > + * @offset: offset within the file
> > + * @length: Length being accessed at offset
> > + * @flags:
> > + * @iomap: iomap struct to be filled in, resolving (offset, length) to
> > + * (daxdev, offset, len)
> > + * @srcmap:
>
> All parameters should have description.
Done
>
> > + */
> > +static int
> > +famfs_fuse_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> > + unsigned int flags, struct iomap *iomap, struct iomap *srcmap)
> > +{
> > + struct fuse_inode *fi = get_fuse_inode(inode);
> > + struct famfs_file_meta *meta = fi->famfs_meta;
> > + size_t size;
> > +
> > + size = i_size_read(inode);
> > +
> > + WARN_ON(size != meta->file_size);
> > +
> > + return famfs_fileofs_to_daxofs(inode, iomap, offset, length, flags);
> > +}
>
> > +
> > +static inline bool
> > +famfs_is_write_fault(struct vm_fault *vmf)
> > +{
> > + return (vmf->flags & FAULT_FLAG_WRITE) &&
> > + (vmf->vma->vm_flags & VM_SHARED);
> > +}
> > +
> > +static vm_fault_t
> > +famfs_filemap_fault(struct vm_fault *vmf)
> > +{
> > + return __famfs_fuse_filemap_fault(vmf, 0, famfs_is_write_fault(vmf));
> > +}
> > +
> > +static vm_fault_t
> > +famfs_filemap_huge_fault(struct vm_fault *vmf, unsigned int pe_size)
> > +{
> > + return __famfs_fuse_filemap_fault(vmf, pe_size, famfs_is_write_fault(vmf));
> > +}
> > +
> > +static vm_fault_t
> > +famfs_filemap_page_mkwrite(struct vm_fault *vmf)
> > +{
> > + return __famfs_fuse_filemap_fault(vmf, 0, true);
> I'm not an fs person but I note ext4 etc are able to use the
> same callback for all of these and can figure out the write fault
> question inside that callback. Is there a reason that doesn't work here?
> Looks like an appropriate vmf flag is set for each type of callback.
Thanks for digging in!
I've merged the mkwrites (below), which is a no-brainer. I'm gonna
take further re-factoring of the rw/fault path under advisement. Possibly
for later cleanup. This code is quite stable and I want to be cautious
during the review process.
> > +}
> > +
> > +static vm_fault_t
> Similar to earlier comments. I'd put these on one line unless you
> have to split them due to length.
This is a common file system pattern - see fs/xfs/xfs_file.c
I kinda like it, so I think I'll stick with this one unless Miklos prefers
not to have it in fuse.
>
> > +famfs_filemap_pfn_mkwrite(struct vm_fault *vmf)
> Given this and the previous page_mkwrite one are identical, just
> use one more generically named callback. Lots of FS seem to do this
> when these match. E.g. ext4_dax_fault()
Right, done.
>
> > +{
> > + return __famfs_fuse_filemap_fault(vmf, 0, true);
> > +}
> > +
> > +static vm_fault_t
> > +famfs_filemap_map_pages(struct vm_fault *vmf, pgoff_t start_pgoff,
> > + pgoff_t end_pgoff)
> > +{
> > + return filemap_map_pages(vmf, start_pgoff, end_pgoff);
>
> Why not just use this directly as the vm_operation? shmem does
> this for instance.
Good idea :D
Done
>
>
> > +}
> > +
> > +const struct vm_operations_struct famfs_file_vm_ops = {
> > + .fault = famfs_filemap_fault,
> > + .huge_fault = famfs_filemap_huge_fault,
> > + .map_pages = famfs_filemap_map_pages,
> > + .page_mkwrite = famfs_filemap_page_mkwrite,
> > + .pfn_mkwrite = famfs_filemap_pfn_mkwrite,
> > +};
> > +
> > +/*********************************************************************
> > + * file_operations
> > + */
> > +
> > +/**
> > + * famfs_file_bad() - Check for files that aren't in a valid state
> > + *
> > + * @inode - inode
> > + *
> > + * Returns: 0=success
> > + * -errno=failure
> > + */
> > +static ssize_t
> Odd return type. Why not int?
Because reasons (not necessarily good ones). One of the callers wanted ssize_t,
but it looks better to me to switch to int and adapt the one caller that wanted
ssize_t.
Done, thanks
> > +famfs_file_bad(struct inode *inode)
> > +{
> > + struct fuse_inode *fi = get_fuse_inode(inode);
> > + struct famfs_file_meta *meta = fi->famfs_meta;
> > + size_t i_size = i_size_read(inode);
> > +
> > + if (!meta) {
> > + pr_err("%s: un-initialized famfs file\n", __func__);
> > + return -EIO;
> > + }
> > + if (meta->error) {
> > + pr_debug("%s: previously detected metadata errors\n", __func__);
> > + return -EIO;
> > + }
> > + if (i_size != meta->file_size) {
> > + pr_warn("%s: i_size overwritten from %ld to %ld\n",
> > + __func__, meta->file_size, i_size);
> > + meta->error = true;
> > + return -ENXIO;
> > + }
> > + if (!IS_DAX(inode)) {
> > + pr_debug("%s: inode %llx IS_DAX is false\n",
> > + __func__, (u64)inode);
> > + return -ENXIO;
> > + }
> > + return 0;
> > +}
> > +
> > +static ssize_t
>
> This can probably just return an int given type seems to be driven
> by famfs_file_bad() which doesn't make much sense as returning a ssize_t
> Storing an int into a ssize_t without cast should be fine.
Done
>
> > +famfs_fuse_rw_prep(struct kiocb *iocb, struct iov_iter *ubuf)
> > +{
> > + struct inode *inode = iocb->ki_filp->f_mapping->host;
> > + size_t i_size = i_size_read(inode);
> > + size_t count = iov_iter_count(ubuf);
> > + size_t max_count;
> > + ssize_t rc;
> > +
> > + rc = famfs_file_bad(inode);
> > + if (rc)
> > + return rc;
> > +
> > + /* Avoid unsigned underflow if position is past EOF */
> > + if (iocb->ki_pos >= i_size)
> > + max_count = 0;
> > + else
> > + max_count = i_size - iocb->ki_pos;
> > +
> > + if (count > max_count)
> > + iov_iter_truncate(ubuf, max_count);
> > +
> > + if (!iov_iter_count(ubuf))
> > + return 0;
> > +
> > + return rc;
> > +}
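(Aside for readers of the archive: the EOF clamp in famfs_fuse_rw_prep() above, isolated as a userspace sketch. clamp_to_eof() is a hypothetical helper, not part of the patch. The point it illustrates is that the subtraction is only performed when the position is known to be before EOF, so it cannot wrap.)

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Return how many of @count bytes may be transferred starting at @pos
 * in a file of @i_size bytes. Files here never grow on write, so both
 * reads and writes are trimmed at EOF.
 */
static size_t clamp_to_eof(uint64_t pos, uint64_t i_size, size_t count)
{
	/* Avoid unsigned underflow if the position is at or past EOF */
	uint64_t max_count = pos >= i_size ? 0 : i_size - pos;

	return count > max_count ? (size_t)max_count : count;
}
```

A 50-byte request at position 90 of a 100-byte file is trimmed to 10 bytes, and any request at or beyond position 100 is trimmed to 0.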
> > +
> > +ssize_t
> > +famfs_fuse_read_iter(struct kiocb *iocb, struct iov_iter *to)
> > +{
> > + ssize_t rc;
> > +
> > + rc = famfs_fuse_rw_prep(iocb, to);
> > + if (rc)
> > + return rc;
> > +
> > + if (!iov_iter_count(to))
> > + return 0;
> > +
> > + rc = dax_iomap_rw(iocb, to, &famfs_iomap_ops);
> > +
> > + file_accessed(iocb->ki_filp);
> > + return rc;
> > +}
>
> > +
> > +int
> > +famfs_fuse_mmap(struct file *file, struct vm_area_struct *vma)
> > +{
> > + struct inode *inode = file_inode(file);
> > + ssize_t rc;
> > +
> > + rc = famfs_file_bad(inode);
> > + if (rc)
> > + return (int)rc;
> This was odd so I went and looked. famfs_file_bad() should probably just return an int.
Fixed
> > +
> > + file_accessed(file);
> > + vma->vm_ops = &famfs_file_vm_ops;
> > + vm_flags_set(vma, VM_HUGEPAGE);
> > + return 0;
> > +}
> > +
> > #define FMAP_BUFSIZE PAGE_SIZE
> >
> > int
> > diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> > index 1f64bf68b5ee..45a09a7f0012 100644
> > --- a/fs/fuse/file.c
> > +++ b/fs/fuse/file.c
> > @@ -1831,6 +1831,8 @@ static ssize_t fuse_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
> >
> > if (FUSE_IS_VIRTIO_DAX(fi))
> > return fuse_dax_read_iter(iocb, to);
> > + if (fuse_file_famfs(fi))
> > + return famfs_fuse_read_iter(iocb, to);
> >
> > /* FOPEN_DIRECT_IO overrides FOPEN_PASSTHROUGH */
> > if (ff->open_flags & FOPEN_DIRECT_IO)
> > @@ -1853,6 +1855,8 @@ static ssize_t fuse_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
> >
> > if (FUSE_IS_VIRTIO_DAX(fi))
> > return fuse_dax_write_iter(iocb, from);
> > + if (fuse_file_famfs(fi))
> > + return famfs_fuse_write_iter(iocb, from);
> >
> > /* FOPEN_DIRECT_IO overrides FOPEN_PASSTHROUGH */
> > if (ff->open_flags & FOPEN_DIRECT_IO)
> > @@ -1868,9 +1872,13 @@ static ssize_t fuse_splice_read(struct file *in, loff_t *ppos,
> > unsigned int flags)
> > {
> > struct fuse_file *ff = in->private_data;
> > + struct inode *inode = file_inode(in);
> > + struct fuse_inode *fi = get_fuse_inode(inode);
> >
> > /* FOPEN_DIRECT_IO overrides FOPEN_PASSTHROUGH */
> > - if (fuse_file_passthrough(ff) && !(ff->open_flags & FOPEN_DIRECT_IO))
> > + if (fuse_file_famfs(fi))
> > + return -EIO; /* famfs does not use the page cache... */
>
> As below.
Hmm. Fuse has multiple instances of these - maybe it's considered more readable,
since only one branch is hit.
Comments Miklos?
>
> > + else if (fuse_file_passthrough(ff) && !(ff->open_flags & FOPEN_DIRECT_IO))
> > return fuse_passthrough_splice_read(in, ppos, pipe, len, flags);
> > else
> > return filemap_splice_read(in, ppos, pipe, len, flags);
> > @@ -1880,9 +1888,13 @@ static ssize_t fuse_splice_write(struct pipe_inode_info *pipe, struct file *out,
> > loff_t *ppos, size_t len, unsigned int flags)
> > {
> > struct fuse_file *ff = out->private_data;
> > + struct inode *inode = file_inode(out);
> > + struct fuse_inode *fi = get_fuse_inode(inode);
> >
> > /* FOPEN_DIRECT_IO overrides FOPEN_PASSTHROUGH */
> > - if (fuse_file_passthrough(ff) && !(ff->open_flags & FOPEN_DIRECT_IO))
> > + if (fuse_file_famfs(fi))
> > + return -EIO; /* famfs does not use the page cache... */
>
> Not sure why original code had else, but not needed given returned.
> Maybe stick to local style.
Same as previous. Leaving them alone for now.
Thanks Jonathan - you did some work here.
John
* Re: [PATCH V3 11/21] famfs_fuse: Update macro s/FUSE_IS_DAX/FUSE_IS_VIRTIO_DAX/
2026-01-07 15:33 ` [PATCH V3 11/21] famfs_fuse: Update macro s/FUSE_IS_DAX/FUSE_IS_VIRTIO_DAX/ John Groves
@ 2026-01-09 18:16 ` Joanne Koong
2026-01-09 22:15 ` [PATCH V3 11/21] famfs_fuse: Update macro s/FUSE_IS_DAX/FUSE_IS_VIRTIO_DAX John Groves
0 siblings, 1 reply; 74+ messages in thread
From: Joanne Koong @ 2026-01-09 18:16 UTC (permalink / raw)
To: John Groves
Cc: Miklos Szeredi, Dan Williams, Bernd Schubert, Alison Schofield,
John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
Matthew Wilcox, Jan Kara, Alexander Viro, David Hildenbrand,
Christian Brauner, Darrick J . Wong, Randy Dunlap, Jeff Layton,
Amir Goldstein, Jonathan Cameron, Stefan Hajnoczi, Josef Bacik,
Bagas Sanjaya, Chen Linxuan, James Morse, Fuad Tabba,
Sean Christopherson, Shivank Garg, Ackerley Tng, Gregory Price,
Aravind Ramesh, Ajay Joshi, venkataravis, linux-doc, linux-kernel,
nvdimm, linux-cxl, linux-fsdevel
On Wed, Jan 7, 2026 at 7:34 AM John Groves <John@groves.net> wrote:
>
> Virtio_fs now needs to determine if an inode is DAX && not famfs.
nit: it was unclear to me why this patch changed the macro to take in
a struct fuse_inode until I looked at patch 14. It might be useful
here to add a line about that.
>
> Signed-off-by: John Groves <john@groves.net>
> ---
> fs/fuse/dir.c | 2 +-
> fs/fuse/file.c | 13 ++++++++-----
> fs/fuse/fuse_i.h | 6 +++++-
> fs/fuse/inode.c | 4 ++--
> fs/fuse/iomode.c | 2 +-
> 5 files changed, 17 insertions(+), 10 deletions(-)
>
> diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
> index 4b6b3d2758ff..1400c9d733ba 100644
> --- a/fs/fuse/dir.c
> +++ b/fs/fuse/dir.c
> @@ -2153,7 +2153,7 @@ int fuse_do_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
> is_truncate = true;
> }
>
> - if (FUSE_IS_DAX(inode) && is_truncate) {
> + if (FUSE_IS_VIRTIO_DAX(fi) && is_truncate) {
> filemap_invalidate_lock(mapping);
> fault_blocked = true;
> err = fuse_dax_break_layouts(inode, 0, -1);
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 01bc894e9c2b..093569033ed1 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -252,7 +252,7 @@ static int fuse_open(struct inode *inode, struct file *file)
> int err;
> bool is_truncate = (file->f_flags & O_TRUNC) && fc->atomic_o_trunc;
> bool is_wb_truncate = is_truncate && fc->writeback_cache;
> - bool dax_truncate = is_truncate && FUSE_IS_DAX(inode);
> + bool dax_truncate = is_truncate && FUSE_IS_VIRTIO_DAX(fi);
>
> if (fuse_is_bad(inode))
> return -EIO;
> @@ -1812,11 +1812,12 @@ static ssize_t fuse_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
> struct file *file = iocb->ki_filp;
> struct fuse_file *ff = file->private_data;
> struct inode *inode = file_inode(file);
> + struct fuse_inode *fi = get_fuse_inode(inode);
>
> if (fuse_is_bad(inode))
> return -EIO;
>
> - if (FUSE_IS_DAX(inode))
> + if (FUSE_IS_VIRTIO_DAX(fi))
> return fuse_dax_read_iter(iocb, to);
>
> /* FOPEN_DIRECT_IO overrides FOPEN_PASSTHROUGH */
> @@ -1833,11 +1834,12 @@ static ssize_t fuse_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
> struct file *file = iocb->ki_filp;
> struct fuse_file *ff = file->private_data;
> struct inode *inode = file_inode(file);
> + struct fuse_inode *fi = get_fuse_inode(inode);
>
> if (fuse_is_bad(inode))
> return -EIO;
>
> - if (FUSE_IS_DAX(inode))
> + if (FUSE_IS_VIRTIO_DAX(fi))
> return fuse_dax_write_iter(iocb, from);
>
> /* FOPEN_DIRECT_IO overrides FOPEN_PASSTHROUGH */
> @@ -2370,10 +2372,11 @@ static int fuse_file_mmap(struct file *file, struct vm_area_struct *vma)
> struct fuse_file *ff = file->private_data;
> struct fuse_conn *fc = ff->fm->fc;
> struct inode *inode = file_inode(file);
> + struct fuse_inode *fi = get_fuse_inode(inode);
> int rc;
>
> /* DAX mmap is superior to direct_io mmap */
> - if (FUSE_IS_DAX(inode))
> + if (FUSE_IS_VIRTIO_DAX(fi))
> return fuse_dax_mmap(file, vma);
>
> /*
> @@ -2934,7 +2937,7 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
> .mode = mode
> };
> int err;
> - bool block_faults = FUSE_IS_DAX(inode) &&
> + bool block_faults = FUSE_IS_VIRTIO_DAX(fi) &&
> (!(mode & FALLOC_FL_KEEP_SIZE) ||
> (mode & (FALLOC_FL_PUNCH_HOLE | FALLOC_FL_ZERO_RANGE)));
>
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index 7f16049387d1..17736c0a6d2f 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -1508,7 +1508,11 @@ void fuse_free_conn(struct fuse_conn *fc);
>
> /* dax.c */
>
> -#define FUSE_IS_DAX(inode) (IS_ENABLED(CONFIG_FUSE_DAX) && IS_DAX(inode))
> +/* This macro is used by virtio_fs, but now it also needs to filter for
> + * "not famfs"
> + */
Did you mean to add this comment to "patch 14/21: famfs_fuse: Plumb
the GET_FMAP message/response" instead? It seems like that's the patch
that adds the "&& !fuse_file_famfs(fuse_inode))" part to this.
Reviewed-by: Joanne Koong <joannelkoong@gmail.com>
Thanks,
Joanne
* Re: [PATCH V3 12/21] famfs_fuse: Basic fuse kernel ABI enablement for famfs
2026-01-07 15:33 ` [PATCH V3 12/21] famfs_fuse: Basic fuse kernel ABI enablement for famfs John Groves
@ 2026-01-09 18:29 ` Joanne Koong
2026-01-09 22:58 ` John Groves
0 siblings, 1 reply; 74+ messages in thread
From: Joanne Koong @ 2026-01-09 18:29 UTC (permalink / raw)
To: John Groves
Cc: Miklos Szeredi, Dan Williams, Bernd Schubert, Alison Schofield,
John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
Matthew Wilcox, Jan Kara, Alexander Viro, David Hildenbrand,
Christian Brauner, Darrick J . Wong, Randy Dunlap, Jeff Layton,
Amir Goldstein, Jonathan Cameron, Stefan Hajnoczi, Josef Bacik,
Bagas Sanjaya, Chen Linxuan, James Morse, Fuad Tabba,
Sean Christopherson, Shivank Garg, Ackerley Tng, Gregory Price,
Aravind Ramesh, Ajay Joshi, venkataravis, linux-doc, linux-kernel,
nvdimm, linux-cxl, linux-fsdevel
On Wed, Jan 7, 2026 at 7:34 AM John Groves <John@groves.net> wrote:
>
> * FUSE_DAX_FMAP flag in INIT request/reply
>
> * fuse_conn->famfs_iomap (enable famfs-mapped files) to denote a
> famfs-enabled connection
>
> Signed-off-by: John Groves <john@groves.net>
> ---
> fs/fuse/fuse_i.h | 3 +++
> fs/fuse/inode.c | 6 ++++++
> include/uapi/linux/fuse.h | 5 +++++
> 3 files changed, 14 insertions(+)
>
> diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> index c13e1f9a2f12..5e2c93433823 100644
> --- a/include/uapi/linux/fuse.h
> +++ b/include/uapi/linux/fuse.h
> @@ -240,6 +240,9 @@
> * - add FUSE_COPY_FILE_RANGE_64
> * - add struct fuse_copy_file_range_out
> * - add FUSE_NOTIFY_PRUNE
> + *
> + * 7.46
> + * - Add FUSE_DAX_FMAP capability - ability to handle in-kernel fsdax maps
very minor nit: the extra spacing before this line (and subsequent
lines in later patches) should be removed
> */
>
Reviewed-by: Joanne Koong <joannelkoong@gmail.com>
* Re: [PATCH V3 13/21] famfs_fuse: Famfs mount opt: -o shadow=<shadowpath>
2026-01-07 15:33 ` [PATCH V3 13/21] famfs_fuse: Famfs mount opt: -o shadow=<shadowpath> John Groves
@ 2026-01-09 19:22 ` Joanne Koong
2026-01-10 0:38 ` John Groves
0 siblings, 1 reply; 74+ messages in thread
From: Joanne Koong @ 2026-01-09 19:22 UTC (permalink / raw)
To: John Groves
Cc: Miklos Szeredi, Dan Williams, Bernd Schubert, Alison Schofield,
John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
Matthew Wilcox, Jan Kara, Alexander Viro, David Hildenbrand,
Christian Brauner, Darrick J . Wong, Randy Dunlap, Jeff Layton,
Amir Goldstein, Jonathan Cameron, Stefan Hajnoczi, Josef Bacik,
Bagas Sanjaya, Chen Linxuan, James Morse, Fuad Tabba,
Sean Christopherson, Shivank Garg, Ackerley Tng, Gregory Price,
Aravind Ramesh, Ajay Joshi, venkataravis, linux-doc, linux-kernel,
nvdimm, linux-cxl, linux-fsdevel
On Wed, Jan 7, 2026 at 7:34 AM John Groves <John@groves.net> wrote:
>
> The shadow path is a (usually in tmpfs) file system area used by the
> famfs user space to communicate with the famfs fuse server. There is a
> minor dilemma that the user space tools must be able to resolve from a
> mount point path to a shadow path. Passing in the 'shadow=<path>'
> argument at mount time causes the shadow path to be exposed via
> /proc/mounts, solving this dilemma. The shadow path is not otherwise
> used in the kernel.
Instead of using mount options to pass the userspace metadata, could
/sys/fs be used instead? The client is able to get the connection id
by stat-ing the famfs mount path. There could be a
/sys/fs/fuse/connections/{id}/metadata file that the server fills out
with whatever metadata needs to be read by the client. Having
something like this would be useful to non-famfs servers as well.
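For illustration only, the client-side lookup described above might be sketched in C as follows. This assumes the fusectl directory name equals minor(st_dev) of the mount, which is how the connection directories are normally named. The helper names are made up for this sketch, and the "metadata" file itself is only a proposal here, so the code stops at building the directory path.

```c
#include <stdio.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <sys/types.h>

/*
 * Format the fusectl directory for a connection id.
 * Returns 0 on success, -1 if the buffer is too small.
 */
static int fuse_conn_dir(unsigned int conn_id, char *buf, size_t len)
{
	int n = snprintf(buf, len, "/sys/fs/fuse/connections/%u", conn_id);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Resolve a mount point to its fusectl directory.
 * Assumption: the connection id equals minor(st_dev) of the mount.
 */
static int fuse_conn_dir_for_mount(const char *mnt, char *buf, size_t len)
{
	struct stat st;

	if (stat(mnt, &st) != 0)
		return -1;
	return fuse_conn_dir(minor(st.st_dev), buf, len);
}
```

A client would then open a file under that directory; which file, and who fills it in, is exactly what is being discussed.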
>
> Signed-off-by: John Groves <john@groves.net>
> ---
> fs/fuse/fuse_i.h | 25 ++++++++++++++++++++++++-
> fs/fuse/inode.c | 28 +++++++++++++++++++++++++++-
> 2 files changed, 51 insertions(+), 2 deletions(-)
>
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index ec2446099010..84d0ee2a501d 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -620,9 +620,11 @@ struct fuse_fs_context {
> unsigned int blksize;
> const char *subtype;
>
> - /* DAX device, may be NULL */
> + /* DAX device for virtiofs, may be NULL */
> struct dax_device *dax_dev;
>
> + const char *shadow; /* famfs - null if not famfs */
> +
> /* fuse_dev pointer to fill in, should contain NULL on entry */
> void **fudptr;
> };
> @@ -998,6 +1000,18 @@ struct fuse_conn {
> /* Request timeout (in jiffies). 0 = no timeout */
> unsigned int req_timeout;
> } timeout;
> +
> + /*
> + * This is a workaround until fuse uses iomap for reads.
> + * For fuseblk servers, this represents the blocksize passed in at
> + * mount time and for regular fuse servers, this is equivalent to
> + * inode->i_blkbits.
> + */
> + u8 blkbits;
> +
I think you meant to remove these lines?
> +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> + char *shadow;
Should this be const char * too?
> +#endif
> };
>
> /*
> @@ -1631,4 +1645,13 @@ extern void fuse_sysctl_unregister(void);
> #define fuse_sysctl_unregister() do { } while (0)
> #endif /* CONFIG_SYSCTL */
>
> +/* famfs.c */
> +
> +static inline void famfs_teardown(struct fuse_conn *fc)
> +{
> +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> + kfree(fc->shadow);
> +#endif
> +}
> +
> #endif /* _FS_FUSE_I_H */
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index acabf92a11f8..2e0844aabbae 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -783,6 +783,9 @@ enum {
> OPT_ALLOW_OTHER,
> OPT_MAX_READ,
> OPT_BLKSIZE,
> +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> + OPT_SHADOW,
> +#endif
> OPT_ERR
> };
>
> @@ -797,6 +800,9 @@ static const struct fs_parameter_spec fuse_fs_parameters[] = {
> fsparam_u32 ("max_read", OPT_MAX_READ),
> fsparam_u32 ("blksize", OPT_BLKSIZE),
> fsparam_string ("subtype", OPT_SUBTYPE),
> +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> + fsparam_string("shadow", OPT_SHADOW),
nit: having the spacing for ("shadow", align with the lines above
would be aesthetically nice
> +#endif
> {}
> };
>
> @@ -892,6 +898,15 @@ static int fuse_parse_param(struct fs_context *fsc, struct fs_parameter *param)
> ctx->blksize = result.uint_32;
> break;
>
> +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> + case OPT_SHADOW:
> + if (ctx->shadow)
> + return invalfc(fsc, "Multiple shadows specified");
> + ctx->shadow = param->string;
> + param->string = NULL;
> + break;
> +#endif
> +
> default:
> return -EINVAL;
> }
> @@ -905,6 +920,7 @@ static void fuse_free_fsc(struct fs_context *fsc)
>
> if (ctx) {
> kfree(ctx->subtype);
> + kfree(ctx->shadow);
> kfree(ctx);
> }
> }
> @@ -936,7 +952,10 @@ static int fuse_show_options(struct seq_file *m, struct dentry *root)
> else if (fc->dax_mode == FUSE_DAX_INODE_USER)
> seq_puts(m, ",dax=inode");
> #endif
> -
> +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> + if (fc->shadow)
> + seq_printf(m, ",shadow=%s", fc->shadow);
> +#endif
> return 0;
> }
>
> @@ -1041,6 +1060,8 @@ void fuse_conn_put(struct fuse_conn *fc)
> WARN_ON(atomic_read(&bucket->count) != 1);
> kfree(bucket);
> }
> + famfs_teardown(fc);
imo it looks a bit cleaner with
if (IS_ENABLED(CONFIG_FUSE_FAMFS_DAX))
famfs_teardown(fc);
which also matches the pattern the passthrough config below uses
> +
> if (IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
> fuse_backing_files_free(fc);
> call_rcu(&fc->rcu, delayed_release);
> @@ -1916,6 +1937,11 @@ int fuse_fill_super_common(struct super_block *sb, struct fuse_fs_context *ctx)
> *ctx->fudptr = fud;
> wake_up_all(&fuse_dev_waitq);
> }
> +
> +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> + fc->shadow = kstrdup(ctx->shadow, GFP_KERNEL);
Is a shadow path a must-have for a famfs mount? If so, then should the
mount fail if the allocation here fails?
Thanks,
Joanne
> +#endif
> +
> mutex_unlock(&fuse_mutex);
> return 0;
>
> --
> 2.49.0
>
* Re: [PATCH V3 18/21] famfs_fuse: Add holder_operations for dax notify_failure()
2026-01-08 15:17 ` Jonathan Cameron
@ 2026-01-09 21:00 ` John Groves
0 siblings, 0 replies; 74+ messages in thread
From: John Groves @ 2026-01-09 21:00 UTC (permalink / raw)
To: Jonathan Cameron
Cc: Miklos Szeredi, Dan Williams, Bernd Schubert, Alison Schofield,
John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
Matthew Wilcox, Jan Kara, Alexander Viro, David Hildenbrand,
Christian Brauner, Darrick J . Wong, Randy Dunlap, Jeff Layton,
Amir Goldstein, Stefan Hajnoczi, Joanne Koong, Josef Bacik,
Bagas Sanjaya, Chen Linxuan, James Morse, Fuad Tabba,
Sean Christopherson, Shivank Garg, Ackerley Tng, Gregory Price,
Aravind Ramesh, Ajay Joshi, venkataravis, linux-doc, linux-kernel,
nvdimm, linux-cxl, linux-fsdevel
On 26/01/08 03:17PM, Jonathan Cameron wrote:
> On Wed, 7 Jan 2026 09:33:27 -0600
> John Groves <John@Groves.net> wrote:
>
> > Memory errors are at least somewhat more likely on disaggregated memory
> > than on-board memory. This commit registers to be notified by fsdev_dax
> > in the event that a memory failure is detected.
> >
> > When a file access resolves to a daxdev with memory errors, it will fail
> > with an appropriate error.
> >
> > If a daxdev failed fs_dax_get(), we set dd->dax_err. If a daxdev called
> > our notify_failure(), set dd->error. When any of the above happens, set
> > (file)->error and stop allowing access.
> >
> > In general, the recovery from memory errors is to unmount the file
> > system and re-initialize the memory, but there may be usable degraded
> > modes of operation - particularly in the future when famfs supports
> > file systems backed by more than one daxdev. In those cases,
> > accessing data that is on a working daxdev can still work.
> >
> > For now, return errors for any file that has encountered a memory or dax
> > error.
> >
> > Signed-off-by: John Groves <john@groves.net>
> > ---
> > fs/fuse/famfs.c | 115 +++++++++++++++++++++++++++++++++++++++---
> > fs/fuse/famfs_kfmap.h | 3 +-
> > 2 files changed, 109 insertions(+), 9 deletions(-)
> >
> > diff --git a/fs/fuse/famfs.c b/fs/fuse/famfs.c
> > index c02b14789c6e..4eb87c5c628e 100644
> > --- a/fs/fuse/famfs.c
> > +++ b/fs/fuse/famfs.c
>
> > @@ -254,6 +288,38 @@ famfs_update_daxdev_table(
> > return 0;
> > }
> >
> > +static void
> > +famfs_set_daxdev_err(
> > + struct fuse_conn *fc,
> > + struct dax_device *dax_devp)
> > +{
> > + int i;
> > +
> > + /* Gotta search the list by dax_devp;
> > + * read lock because we're not adding or removing daxdev entries
> > + */
> > + down_read(&fc->famfs_devlist_sem);
>
> Use a guard()
Done
>
> > + for (i = 0; i < fc->dax_devlist->nslots; i++) {
> > + if (fc->dax_devlist->devlist[i].valid) {
> > + struct famfs_daxdev *dd = &fc->dax_devlist->devlist[i];
> > +
> > + if (dd->devp != dax_devp)
> > + continue;
> > +
> > + dd->error = true;
> > + up_read(&fc->famfs_devlist_sem);
> > +
> > + pr_err("%s: memory error on daxdev %s (%d)\n",
> > + __func__, dd->name, i);
> > + goto done;
> > + }
> > + }
> > + up_read(&fc->famfs_devlist_sem);
> > + pr_err("%s: memory err on unrecognized daxdev\n", __func__);
> > +
> > +done:
>
> If this isn't getting more interesting, just return above.
Right - simplified.
>
> > +}
> > +
> > /***************************************************************************/
> >
> > void
> > @@ -611,6 +677,26 @@ famfs_file_init_dax(
> >
> > static ssize_t famfs_file_bad(struct inode *inode);
> >
> > +static int famfs_dax_err(struct famfs_daxdev *dd)
>
> I'd introduce this earlier in the series to reduce need
> to refactor below.
Will mull that over when I further mull the helpers in fuse_i.h that are
hard to rebase...
>
> > +{
> > + if (!dd->valid) {
> > + pr_err("%s: daxdev=%s invalid\n",
> > + __func__, dd->name);
> > + return -EIO;
> > + }
> > + if (dd->dax_err) {
> > + pr_err("%s: daxdev=%s dax_err\n",
> > + __func__, dd->name);
> > + return -EIO;
> > + }
> > + if (dd->error) {
> > + pr_err("%s: daxdev=%s memory error\n",
> > + __func__, dd->name);
> > + return -EHWPOISON;
> > + }
> > + return 0;
> > +}
>
> ...
>
> > @@ -966,7 +1064,8 @@ famfs_file_bad(struct inode *inode)
> > return -EIO;
> > }
> > if (meta->error) {
> > - pr_debug("%s: previously detected metadata errors\n", __func__);
> > + pr_debug("%s: previously detected metadata errors\n",
> > + __func__);
>
> Spurious change.
Derp. Reverted out
Thanks Jonathan
* Re: [PATCH V3 11/21] famfs_fuse: Update macro s/FUSE_IS_DAX/FUSE_IS_VIRTIO_DAX
2026-01-09 18:16 ` Joanne Koong
@ 2026-01-09 22:15 ` John Groves
0 siblings, 0 replies; 74+ messages in thread
From: John Groves @ 2026-01-09 22:15 UTC (permalink / raw)
To: Joanne Koong
Cc: Miklos Szeredi, Dan Williams, Bernd Schubert, Alison Schofield,
John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
Matthew Wilcox, Jan Kara, Alexander Viro, David Hildenbrand,
Christian Brauner, Darrick J . Wong, Randy Dunlap, Jeff Layton,
Amir Goldstein, Jonathan Cameron, Stefan Hajnoczi, Josef Bacik,
Bagas Sanjaya, Chen Linxuan, James Morse, Fuad Tabba,
Sean Christopherson, Shivank Garg, Ackerley Tng, Gregory Price,
Aravind Ramesh, Ajay Joshi, venkataravis, linux-doc, linux-kernel,
nvdimm, linux-cxl, linux-fsdevel
On 26/01/09 10:16AM, Joanne Koong wrote:
> On Wed, Jan 7, 2026 at 7:34 AM John Groves <John@groves.net> wrote:
> >
> > Virtio_fs now needs to determine if an inode is DAX && not famfs.
>
> nit: it was unclear to me why this patch changed the macro to take in
> a struct fuse_inode until I looked at patch 14. it might be useful
> here to add a line about that
Thanks Joanne; I beefed up the comment, and also added a dummy
fuse_file_famfs() macro so the new FUSE_IS_VIRTIO_DAX() macro shows
what it's gonna do. I should have done a better commit message...
Next rev will have a better one.
>
> >
> > Signed-off-by: John Groves <john@groves.net>
> > ---
> > fs/fuse/dir.c | 2 +-
> > fs/fuse/file.c | 13 ++++++++-----
> > fs/fuse/fuse_i.h | 6 +++++-
> > fs/fuse/inode.c | 4 ++--
> > fs/fuse/iomode.c | 2 +-
> > 5 files changed, 17 insertions(+), 10 deletions(-)
> >
> > diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
> > index 4b6b3d2758ff..1400c9d733ba 100644
> > --- a/fs/fuse/dir.c
> > +++ b/fs/fuse/dir.c
> > @@ -2153,7 +2153,7 @@ int fuse_do_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
> > is_truncate = true;
> > }
> >
> > - if (FUSE_IS_DAX(inode) && is_truncate) {
> > + if (FUSE_IS_VIRTIO_DAX(fi) && is_truncate) {
> > filemap_invalidate_lock(mapping);
> > fault_blocked = true;
> > err = fuse_dax_break_layouts(inode, 0, -1);
> > diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> > index 01bc894e9c2b..093569033ed1 100644
> > --- a/fs/fuse/file.c
> > +++ b/fs/fuse/file.c
> > @@ -252,7 +252,7 @@ static int fuse_open(struct inode *inode, struct file *file)
> > int err;
> > bool is_truncate = (file->f_flags & O_TRUNC) && fc->atomic_o_trunc;
> > bool is_wb_truncate = is_truncate && fc->writeback_cache;
> > - bool dax_truncate = is_truncate && FUSE_IS_DAX(inode);
> > + bool dax_truncate = is_truncate && FUSE_IS_VIRTIO_DAX(fi);
> >
> > if (fuse_is_bad(inode))
> > return -EIO;
> > @@ -1812,11 +1812,12 @@ static ssize_t fuse_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
> > struct file *file = iocb->ki_filp;
> > struct fuse_file *ff = file->private_data;
> > struct inode *inode = file_inode(file);
> > + struct fuse_inode *fi = get_fuse_inode(inode);
> >
> > if (fuse_is_bad(inode))
> > return -EIO;
> >
> > - if (FUSE_IS_DAX(inode))
> > + if (FUSE_IS_VIRTIO_DAX(fi))
> > return fuse_dax_read_iter(iocb, to);
> >
> > /* FOPEN_DIRECT_IO overrides FOPEN_PASSTHROUGH */
> > @@ -1833,11 +1834,12 @@ static ssize_t fuse_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
> > struct file *file = iocb->ki_filp;
> > struct fuse_file *ff = file->private_data;
> > struct inode *inode = file_inode(file);
> > + struct fuse_inode *fi = get_fuse_inode(inode);
> >
> > if (fuse_is_bad(inode))
> > return -EIO;
> >
> > - if (FUSE_IS_DAX(inode))
> > + if (FUSE_IS_VIRTIO_DAX(fi))
> > return fuse_dax_write_iter(iocb, from);
> >
> > /* FOPEN_DIRECT_IO overrides FOPEN_PASSTHROUGH */
> > @@ -2370,10 +2372,11 @@ static int fuse_file_mmap(struct file *file, struct vm_area_struct *vma)
> > struct fuse_file *ff = file->private_data;
> > struct fuse_conn *fc = ff->fm->fc;
> > struct inode *inode = file_inode(file);
> > + struct fuse_inode *fi = get_fuse_inode(inode);
> > int rc;
> >
> > /* DAX mmap is superior to direct_io mmap */
> > - if (FUSE_IS_DAX(inode))
> > + if (FUSE_IS_VIRTIO_DAX(fi))
> > return fuse_dax_mmap(file, vma);
> >
> > /*
> > @@ -2934,7 +2937,7 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
> > .mode = mode
> > };
> > int err;
> > - bool block_faults = FUSE_IS_DAX(inode) &&
> > + bool block_faults = FUSE_IS_VIRTIO_DAX(fi) &&
> > (!(mode & FALLOC_FL_KEEP_SIZE) ||
> > (mode & (FALLOC_FL_PUNCH_HOLE | FALLOC_FL_ZERO_RANGE)));
> >
> > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> > index 7f16049387d1..17736c0a6d2f 100644
> > --- a/fs/fuse/fuse_i.h
> > +++ b/fs/fuse/fuse_i.h
> > @@ -1508,7 +1508,11 @@ void fuse_free_conn(struct fuse_conn *fc);
> >
> > /* dax.c */
> >
> > -#define FUSE_IS_DAX(inode) (IS_ENABLED(CONFIG_FUSE_DAX) && IS_DAX(inode))
> > +/* This macro is used by virtio_fs, but now it also needs to filter for
> > + * "not famfs"
> > + */
>
> Did you mean to add this comment to "patch 14/21: famfs_fuse: Plumb
> the GET_FMAP message/response" instead? it seems like that's the patch
> that adds the "&& !fuse_file_famfs(fuse_inode))" part to this.
The idea I was going for is for this commit to substitute the new macro name
(FUSE_IS_VIRTIO_DAX()) without otherwise changing functionality - and then
to plumb the famfs test later.
The revised version of this commit adds a dummy test (fuse_file_famfs(inode)), so
it's more apparent what this commit is trying to do. So I hope it will make
more sense ;)
>
> Reviewed-by: Joanne Koong <joannelkoong@gmail.com>
>
> Thanks,
> Joanne
Thanks Joanne!
John
* Re: [PATCH V3 12/21] famfs_fuse: Basic fuse kernel ABI enablement for famfs
2026-01-09 18:29 ` Joanne Koong
@ 2026-01-09 22:58 ` John Groves
0 siblings, 0 replies; 74+ messages in thread
From: John Groves @ 2026-01-09 22:58 UTC (permalink / raw)
To: Joanne Koong
Cc: Miklos Szeredi, Dan Williams, Bernd Schubert, Alison Schofield,
John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
Matthew Wilcox, Jan Kara, Alexander Viro, David Hildenbrand,
Christian Brauner, Darrick J . Wong, Randy Dunlap, Jeff Layton,
Amir Goldstein, Jonathan Cameron, Stefan Hajnoczi, Josef Bacik,
Bagas Sanjaya, Chen Linxuan, James Morse, Fuad Tabba,
Sean Christopherson, Shivank Garg, Ackerley Tng, Gregory Price,
Aravind Ramesh, Ajay Joshi, venkataravis, linux-doc, linux-kernel,
nvdimm, linux-cxl, linux-fsdevel
On 26/01/09 10:29AM, Joanne Koong wrote:
> On Wed, Jan 7, 2026 at 7:34 AM John Groves <John@groves.net> wrote:
> >
> > * FUSE_DAX_FMAP flag in INIT request/reply
> >
> > * fuse_conn->famfs_iomap (enable famfs-mapped files) to denote a
> > famfs-enabled connection
> >
> > Signed-off-by: John Groves <john@groves.net>
> > ---
> > fs/fuse/fuse_i.h | 3 +++
> > fs/fuse/inode.c | 6 ++++++
> > include/uapi/linux/fuse.h | 5 +++++
> > 3 files changed, 14 insertions(+)
> >
> > diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> > index c13e1f9a2f12..5e2c93433823 100644
> > --- a/include/uapi/linux/fuse.h
> > +++ b/include/uapi/linux/fuse.h
> > @@ -240,6 +240,9 @@
> > * - add FUSE_COPY_FILE_RANGE_64
> > * - add struct fuse_copy_file_range_out
> > * - add FUSE_NOTIFY_PRUNE
> > + *
> > + * 7.46
> > + * - Add FUSE_DAX_FMAP capability - ability to handle in-kernel fsdax maps
>
> very minor nit: the extra spacing before this line (and subsequent
> lines in later patches) should be removed
>
> > */
> >
>
> Reviewed-by: Joanne Koong <joannelkoong@gmail.com>
Thanks Joanne - fixed!
John
* Re: [PATCH V3 13/21] famfs_fuse: Famfs mount opt: -o shadow=<shadowpath>
2026-01-09 19:22 ` Joanne Koong
@ 2026-01-10 0:38 ` John Groves
2026-01-11 18:20 ` John Groves
0 siblings, 1 reply; 74+ messages in thread
From: John Groves @ 2026-01-10 0:38 UTC (permalink / raw)
To: Joanne Koong
Cc: Miklos Szeredi, Dan Williams, Bernd Schubert, Alison Schofield,
John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
Matthew Wilcox, Jan Kara, Alexander Viro, David Hildenbrand,
Christian Brauner, Darrick J . Wong, Randy Dunlap, Jeff Layton,
Amir Goldstein, Jonathan Cameron, Stefan Hajnoczi, Josef Bacik,
Bagas Sanjaya, Chen Linxuan, James Morse, Fuad Tabba,
Sean Christopherson, Shivank Garg, Ackerley Tng, Gregory Price,
Aravind Ramesh, Ajay Joshi, venkataravis, linux-doc, linux-kernel,
nvdimm, linux-cxl, linux-fsdevel
On 26/01/09 11:22AM, Joanne Koong wrote:
> On Wed, Jan 7, 2026 at 7:34 AM John Groves <John@groves.net> wrote:
> >
> > The shadow path is a (usually in tmpfs) file system area used by the
> > famfs user space to communicate with the famfs fuse server. There is a
> > minor dilemma that the user space tools must be able to resolve from a
> > mount point path to a shadow path. Passing in the 'shadow=<path>'
> > argument at mount time causes the shadow path to be exposed via
> > /proc/mounts, solving this dilemma. The shadow path is not otherwise
> > used in the kernel.
>
> Instead of using mount options to pass the userspace metadata, could
> /sys/fs be used instead? The client is able to get the connection id
> by stat-ing the famfs mount path. There could be a
> /sys/fs/fuse/connections/{id}/metadata file that the server fills out
> with whatever metadata needs to be read by the client. Having
> something like this would be useful to non-famfs servers as well.
The shadow option isn't the only possible way to get what famfs needs,
but I do like it - I find it to be an elegant solution to the problem.
What's the problem? Well, for that you need to know some implementation
details of the famfs userspace. For the *structure* of a mounted file
system, famfs is very passthrough-like. The structure that is being
passed through is the shadow file system, which is an actual file system
(usually tmpfs). Directories are just directories, but shadow files
contain yaml that describes the file-to-dax map of the *actual* file.
On lookup, the famfs fuse server (famfs_fused), rather than stat the
file like passthrough, reads the yaml and decodes the stat and fmap info
from that.
One other detail. The shadow path must be known or created (usually
as a tmpdir, to guarantee it starts empty) at mount time. The kernel
knows about it through "-o shadow=<path>", but otherwise doesn't use
it. The famfs fuse server receives the path as an input from
'famfs mount'. The problem is that pretty much every famfs-related
user space command needs the shadow path.
In fact, the structure of the mounted file system is at
<shadow_path>/root. Also located in <shadow path> (above ./root) is a
unix domain socket for REST communication with famfs_fused. We have
plans for other files at <shadow path> and above ./root (mount-specific
config options, for example).
Playing the famfs metadata log requires finding the shadow path,
parsing the log, and creating (or potentially modifying) shadow files
in the shadow path for the mount.
So to communicate with the fuse server we parse the shadow path from
/proc/mounts and that finds the <shadow_path>/socket that can be used
to communicate with famfs_fused. And we can play the metadata log
(accessed via MPT/.meta/.log) to <shadow_path>/root/...
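As a rough illustration of the /proc/mounts lookup described above, a tool could do something like the following. This is a sketch, not the famfs implementation: the function names are invented, it relies only on the standard getmntent(3) interface, and it ignores any escaping of special characters in the option string.

```c
#include <mntent.h>
#include <stdio.h>
#include <string.h>

/*
 * Copy the value of "shadow=<path>" out of a comma-separated mount
 * option string. Returns 0 on success, -1 if absent or too long.
 */
static int shadow_from_opts(const char *opts, char *out, size_t len)
{
	static const char key[] = "shadow=";
	const char *p = opts;

	while (p && *p) {
		size_t n = strcspn(p, ",");	/* length of this option */

		if (n > sizeof(key) - 1 &&
		    strncmp(p, key, sizeof(key) - 1) == 0) {
			size_t vlen = n - (sizeof(key) - 1);

			if (vlen >= len)
				return -1;
			memcpy(out, p + sizeof(key) - 1, vlen);
			out[vlen] = '\0';
			return 0;
		}
		p += n;
		if (*p == ',')
			p++;
	}
	return -1;
}

/* Scan /proc/mounts for mnt_dir; the last matching entry wins. */
static int shadow_for_mount(const char *mnt_dir, char *out, size_t len)
{
	FILE *fp = setmntent("/proc/mounts", "r");
	struct mntent *me;
	int rc = -1;

	if (!fp)
		return -1;
	while ((me = getmntent(fp)) != NULL) {
		if (strcmp(me->mnt_dir, mnt_dir) == 0)
			rc = shadow_from_opts(me->mnt_opts, out, len);
	}
	endmntent(fp);
	return rc;
}
```

The socket and root locations would then be derived from the returned path, per the layout described above.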
Having something in sysfs would be fine, but unless we pass it into
the kernel somehow (hey, like -o shadow=<shadow path>), the kernel
won't know it and can't reveal it.
A big no-go, I think, is trying to parse the shadow path from the
famfs fuse server via 'ps -ef' or 'ps -ax'. The famfs cli etc. might
be running in a container that doesn't have access to that.
Happy to discuss further...
>
> >
> > Signed-off-by: John Groves <john@groves.net>
> > ---
> > fs/fuse/fuse_i.h | 25 ++++++++++++++++++++++++-
> > fs/fuse/inode.c | 28 +++++++++++++++++++++++++++-
> > 2 files changed, 51 insertions(+), 2 deletions(-)
> >
> > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> > index ec2446099010..84d0ee2a501d 100644
> > --- a/fs/fuse/fuse_i.h
> > +++ b/fs/fuse/fuse_i.h
> > @@ -620,9 +620,11 @@ struct fuse_fs_context {
> > unsigned int blksize;
> > const char *subtype;
> >
> > - /* DAX device, may be NULL */
> > + /* DAX device for virtiofs, may be NULL */
> > struct dax_device *dax_dev;
> >
> > + const char *shadow; /* famfs - null if not famfs */
> > +
> > /* fuse_dev pointer to fill in, should contain NULL on entry */
> > void **fudptr;
> > };
> > @@ -998,6 +1000,18 @@ struct fuse_conn {
> > /* Request timeout (in jiffies). 0 = no timeout */
> > unsigned int req_timeout;
> > } timeout;
> > +
> > + /*
> > + * This is a workaround until fuse uses iomap for reads.
> > + * For fuseblk servers, this represents the blocksize passed in at
> > + * mount time and for regular fuse servers, this is equivalent to
> > + * inode->i_blkbits.
> > + */
> > + u8 blkbits;
> > +
>
> I think you meant to remove these lines?
I was gonna say those are Darrick's lines...but they came in through my patch.
So yes, I will drop them. Oops :D
I'm not sure how this leaked into my patch, but that's one of the reasons why
reviews are good - thanks!
>
> > +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> > + char *shadow;
>
> Should this be const char * too?
> > +#endif
> > };
> >
> > /*
> > @@ -1631,4 +1645,13 @@ extern void fuse_sysctl_unregister(void);
> > #define fuse_sysctl_unregister() do { } while (0)
> > #endif /* CONFIG_SYSCTL */
> >
> > +/* famfs.c */
> > +
> > +static inline void famfs_teardown(struct fuse_conn *fc)
> > +{
> > +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> > + kfree(fc->shadow);
> > +#endif
> > +}
> > +
> > #endif /* _FS_FUSE_I_H */
> > diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> > index acabf92a11f8..2e0844aabbae 100644
> > --- a/fs/fuse/inode.c
> > +++ b/fs/fuse/inode.c
> > @@ -783,6 +783,9 @@ enum {
> > OPT_ALLOW_OTHER,
> > OPT_MAX_READ,
> > OPT_BLKSIZE,
> > +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> > + OPT_SHADOW,
> > +#endif
> > OPT_ERR
> > };
> >
> > @@ -797,6 +800,9 @@ static const struct fs_parameter_spec fuse_fs_parameters[] = {
> > fsparam_u32 ("max_read", OPT_MAX_READ),
> > fsparam_u32 ("blksize", OPT_BLKSIZE),
> > fsparam_string ("subtype", OPT_SUBTYPE),
> > +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> > + fsparam_string("shadow", OPT_SHADOW),
>
> nit: having the spacing for ("shadow", align with the lines above
> would be aesthetically nice
Done, thanks
>
> > +#endif
> > {}
> > };
> >
> > @@ -892,6 +898,15 @@ static int fuse_parse_param(struct fs_context *fsc, struct fs_parameter *param)
> > ctx->blksize = result.uint_32;
> > break;
> >
> > +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> > + case OPT_SHADOW:
> > + if (ctx->shadow)
> > + return invalfc(fsc, "Multiple shadows specified");
> > + ctx->shadow = param->string;
> > + param->string = NULL;
> > + break;
> > +#endif
> > +
> > default:
> > return -EINVAL;
> > }
> > @@ -905,6 +920,7 @@ static void fuse_free_fsc(struct fs_context *fsc)
> >
> > if (ctx) {
> > kfree(ctx->subtype);
> > + kfree(ctx->shadow);
> > kfree(ctx);
> > }
> > }
> > @@ -936,7 +952,10 @@ static int fuse_show_options(struct seq_file *m, struct dentry *root)
> > else if (fc->dax_mode == FUSE_DAX_INODE_USER)
> > seq_puts(m, ",dax=inode");
> > #endif
> > -
> > +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> > + if (fc->shadow)
> > + seq_printf(m, ",shadow=%s", fc->shadow);
> > +#endif
> > return 0;
> > }
> >
> > @@ -1041,6 +1060,8 @@ void fuse_conn_put(struct fuse_conn *fc)
> > WARN_ON(atomic_read(&bucket->count) != 1);
> > kfree(bucket);
> > }
> > + famfs_teardown(fc);
>
> imo it looks a bit cleaner with
>
> if (IS_ENABLED(CONFIG_FUSE_FAMFS_DAX))
> famfs_teardown(fc);
>
> which also matches the pattern the passthrough config below uses
Done
>
> > +
> > if (IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
> > fuse_backing_files_free(fc);
> > call_rcu(&fc->rcu, delayed_release);
> > @@ -1916,6 +1937,11 @@ int fuse_fill_super_common(struct super_block *sb, struct fuse_fs_context *ctx)
> > *ctx->fudptr = fud;
> > wake_up_all(&fuse_dev_waitq);
> > }
> > +
> > +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> > + fc->shadow = kstrdup(ctx->shadow, GFP_KERNEL);
>
> Is a shadow path a must-have for a famfs mount? If so, then should the
> mount fail if the allocation here fails?
Summarized above...
>
> Thanks,
> Joanne
> > +#endif
> > +
> > mutex_unlock(&fuse_mutex);
> > return 0;
> >
> > --
> > 2.49.0
> >
Thanks Joanne!
John
* Re: [PATCH V3 13/21] famfs_fuse: Famfs mount opt: -o shadow=<shadowpath>
2026-01-10 0:38 ` John Groves
@ 2026-01-11 18:20 ` John Groves
0 siblings, 0 replies; 74+ messages in thread
From: John Groves @ 2026-01-11 18:20 UTC (permalink / raw)
To: Joanne Koong
Cc: Miklos Szeredi, Dan Williams, Bernd Schubert, Alison Schofield,
John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
Matthew Wilcox, Jan Kara, Alexander Viro, David Hildenbrand,
Christian Brauner, Darrick J . Wong, Randy Dunlap, Jeff Layton,
Amir Goldstein, Jonathan Cameron, Stefan Hajnoczi, Josef Bacik,
Bagas Sanjaya, Chen Linxuan, James Morse, Fuad Tabba,
Sean Christopherson, Shivank Garg, Ackerley Tng, Gregory Price,
Aravind Ramesh, Ajay Joshi, venkataravis, linux-doc, linux-kernel,
nvdimm, linux-cxl, linux-fsdevel
On 26/01/09 06:38PM, John Groves wrote:
> On 26/01/09 11:22AM, Joanne Koong wrote:
> > On Wed, Jan 7, 2026 at 7:34 AM John Groves <John@groves.net> wrote:
> > >
> > > The shadow path is a (usually in tmpfs) file system area used by the
> > > famfs user space to communicate with the famfs fuse server. There is a
> > > minor dilemma that the user space tools must be able to resolve from a
> > > mount point path to a shadow path. Passing in the 'shadow=<path>'
> > > argument at mount time causes the shadow path to be exposed via
> > > /proc/mounts, solving this dilemma. The shadow path is not otherwise
> > > used in the kernel.
> >
> > Instead of using mount options to pass the userspace metadata, could
> > /sys/fs be used instead? The client is able to get the connection id
> > by stat-ing the famfs mount path. There could be a
> > /sys/fs/fuse/connections/{id}/metadata file that the server fills out
> > with whatever metadata needs to be read by the client. Having
> > something like this would be useful to non-famfs servers as well.
>
> The shadow option isn't the only possible way to get what famfs needs,
> but I do like it - I find it to be an elegant solution to the problem.
>
> What's the problem? Well, for that you need to know some implementation
> details of the famfs userspace. For the *structure* of a mounted file
> system, famfs is very passthrough-like. The structure that is being
> passed through is the shadow file system, which is an actual file system
> (usually tmpfs). Directories are just directories, but shadow files
> contain yaml that describes the file-to-dax map of the *actual* file.
> On lookup, the famfs fuse server (famfs_fused), rather than stat the
> file like passthrough, reads the yaml and decodes the stat and fmap info
> from that.
>
> One other detail. The shadow path must be known or created (usually
> as a tmpdir, to guarantee it starts empty) at mount time. The kernel
> knows about it through "-o shadow=<path>", but otherwise doesn't use
> it. The famfs fuse server receives the path as an input from
> 'famfs mount'. The problem is that pretty much every famfs-related
> user space command needs the shadow path.
>
> In fact, the structure of the mounted file system is at
> <shadow_path>/root. Also located in <shadow path> (above ./root) is a
> unix domain socket for REST communication with famfs_fused. We have
> plans for other files at <shadow path> and above ./root (mount-specific
> config options, for example).
>
> Playing the famfs metadata log requires finding the shadow path,
> parsing the log, and creating (or potentially modifying) shadow files
> in the shadow path for the mount.
>
> So to communicate with the fuse server we parse the shadow path from
> /proc/mounts and that finds the <shadow_path>/socket that can be used
> to communicate with famfs_fused. And we can play the metadata log
> (accessed via MPT/.meta/.log) to <shadow_path>/root/...
>
> Having something in sysfs would be fine, but unless we pass it into
> the kernel somehow (hey, like -o shadow=<shadow path>), the kernel
> won't know it and can't reveal it.
>
> A big no-go, I think, is trying to parse the shadow path from the
> famfs fuse server via 'ps -ef' or 'ps -ax'. The famfs cli etc. might
> be running in a container that doesn't have access to that.
>
> Happy to discuss further...
After all that blather (from me), I've been thinking about resolving
mount points to shadow paths, and I came to the realization that it's
actually easy to enable retrieving the shadow path from the fuse
server as an extended attribute.
I implemented that this morning, and it appears to be passing all tests.
So I anticipate that I'll be able to drop this patch from the series
when I send V4 - which should be in the next few days unless discussion
heats up in the meantime.
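To make the xattr approach concrete, a client-side lookup could look roughly like the sketch below. The attribute name is purely hypothetical (nothing in this thread says what V4 will use), as is the helper name; only getxattr(2) itself is a real interface.

```c
#include <sys/types.h>
#include <sys/xattr.h>

/* Hypothetical attribute name; the thread does not say what V4 uses. */
#define FAMFS_SHADOW_XATTR "user.famfs.shadow"

/*
 * Read the shadow path for a mount point into buf as a C string.
 * Returns the path length, or -1 on error (errno set by getxattr).
 */
static ssize_t shadow_from_xattr(const char *mnt, char *buf, size_t len)
{
	ssize_t n;

	if (len == 0)
		return -1;
	n = getxattr(mnt, FAMFS_SHADOW_XATTR, buf, len - 1);
	if (n < 0)
		return -1;
	buf[n] = '\0';	/* xattr values are not NUL-terminated */
	return n;
}
```

Since the request is answered by the fuse server, the kernel would not need to know the shadow path at all, which fits with dropping the mount option.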
Thinking back... when I implemented the '-o shadow=<path>' thingy
more than a year ago, I still had a *lot* of unsolved problems to
tackle. Once I had "a solution" I moved on - but the xattr idea looks
solid to me (though if anybody can point out flaws, I'd appreciate it).
(there's an Alice's Restaurant joke in there somewhere if you squint,
about not having to take out the garbage for a long time, but probably
only for old people like me...)
Regards,
John
[ ... ]
* Re: [PATCH V3 4/4] fuse: add famfs DAX fmap support
2026-01-08 15:31 ` Jonathan Cameron
@ 2026-01-11 18:24 ` John Groves
0 siblings, 0 replies; 74+ messages in thread
From: John Groves @ 2026-01-11 18:24 UTC (permalink / raw)
To: Jonathan Cameron
Cc: Miklos Szeredi, Dan Williams, Bernd Schubert, Alison Schofield,
John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
Matthew Wilcox, Jan Kara, Alexander Viro, David Hildenbrand,
Christian Brauner, Darrick J . Wong, Randy Dunlap, Jeff Layton,
Amir Goldstein, Stefan Hajnoczi, Joanne Koong, Josef Bacik,
Bagas Sanjaya, Chen Linxuan, James Morse, Fuad Tabba,
Sean Christopherson, Shivank Garg, Ackerley Tng, Gregory Price,
Aravind Ramesh, Ajay Joshi, venkataravis, linux-doc, linux-kernel,
nvdimm, linux-cxl, linux-fsdevel
On 26/01/08 03:31PM, Jonathan Cameron wrote:
> On Wed, 7 Jan 2026 09:34:43 -0600
> John Groves <John@Groves.net> wrote:
>
> > Add new FUSE operations and capability for famfs DAX file mapping:
> >
> > - FUSE_CAP_DAX_FMAP: New capability flag at bit 32 (using want_ext/capable_ext
> > fields) to indicate kernel and userspace support for DAX fmaps
> >
> > - GET_FMAP: New operation to retrieve a file map for DAX-mapped files.
> > Returns a fuse_famfs_fmap_header followed by simple or interleaved
> > extent descriptors. The kernel passes the file size as an argument.
> >
> > - GET_DAXDEV: New operation to retrieve DAX device info by index.
> > Called when GET_FMAP returns an fmap referencing a previously
> > unknown DAX device.
> >
> > These operations enable FUSE filesystems to provide direct access
> > mappings to persistent memory, allowing the kernel to map files
> > directly to DAX devices without page cache intermediation.
> >
> > Signed-off-by: John Groves <john@groves.net>
>
>
> > ---
> > include/fuse_common.h | 5 +++++
> > include/fuse_lowlevel.h | 37 +++++++++++++++++++++++++++++++++++++
> > lib/fuse_lowlevel.c | 31 ++++++++++++++++++++++++++++++-
> > 3 files changed, 72 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/fuse_common.h b/include/fuse_common.h
> > index 041188e..e428ddb 100644
> > --- a/include/fuse_common.h
> > +++ b/include/fuse_common.h
> > @@ -512,6 +512,11 @@ struct fuse_loop_config_v1 {
> > */
> > #define FUSE_CAP_OVER_IO_URING (1UL << 31)
> >
> > +/**
> > + * handle files that use famfs dax fmaps
> > + */
> > +#define FUSE_CAP_DAX_FMAP (1UL<<32)
>
> From the context above, looks like local style is spaces around <<
Fixed, thanks!
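For readers following the capability discussion, here is a minimal sketch of how a flag at bit 32 can be negotiated through 64-bit extended fields. The struct and names below (ex_conn_info, EX_CAP_DAX_FMAP, ex_request_dax_fmap) are hypothetical stand-ins for illustration, not the libfuse definitions. The sketch uses UINT64_C(1) rather than 1UL because shifting an unsigned long by 32 is only well-defined where that type is wider than 32 bits.

```c
#include <stdint.h>

/* Hypothetical stand-in for a capability at bit 32; not the libfuse macro. */
#define EX_CAP_DAX_FMAP (UINT64_C(1) << 32)

/* Hypothetical connection info holding 64-bit extended capability fields. */
struct ex_conn_info {
	uint64_t capable_ext;	/* what the kernel side offers */
	uint64_t want_ext;	/* what the server asks for */
};

/*
 * Request the capability only if it is offered.
 * Returns 1 if the flag was requested, 0 if it is not available.
 */
static int ex_request_dax_fmap(struct ex_conn_info *conn)
{
	if (!(conn->capable_ext & EX_CAP_DAX_FMAP))
		return 0;
	conn->want_ext |= EX_CAP_DAX_FMAP;
	return 1;
}
```

A server would typically do this kind of check once, in its init path, and fall back to non-DAX behavior when the function returns 0.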
[ ... ]
John
* Re: [PATCH V3 21/21] famfs_fuse: Add documentation
2026-01-08 15:27 ` Jonathan Cameron
@ 2026-01-11 18:53 ` John Groves
0 siblings, 0 replies; 74+ messages in thread
From: John Groves @ 2026-01-11 18:53 UTC (permalink / raw)
To: Jonathan Cameron
Cc: Miklos Szeredi, Dan Williams, Bernd Schubert, Alison Schofield,
John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
Matthew Wilcox, Jan Kara, Alexander Viro, David Hildenbrand,
Christian Brauner, Darrick J . Wong, Randy Dunlap, Jeff Layton,
Amir Goldstein, Stefan Hajnoczi, Joanne Koong, Josef Bacik,
Bagas Sanjaya, Chen Linxuan, James Morse, Fuad Tabba,
Sean Christopherson, Shivank Garg, Ackerley Tng, Gregory Price,
Aravind Ramesh, Ajay Joshi, venkataravis, linux-doc, linux-kernel,
nvdimm, linux-cxl, linux-fsdevel
On 26/01/08 03:27PM, Jonathan Cameron wrote:
> On Wed, 7 Jan 2026 09:33:30 -0600
> John Groves <John@Groves.net> wrote:
>
> > Add Documentation/filesystems/famfs.rst and update MAINTAINERS
> >
> > Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
> > Tested-by: Randy Dunlap <rdunlap@infradead.org>
> > Signed-off-by: John Groves <john@groves.net>
> > ---
> > Documentation/filesystems/famfs.rst | 142 ++++++++++++++++++++++++++++
> > Documentation/filesystems/index.rst | 1 +
> > MAINTAINERS | 1 +
> > 3 files changed, 144 insertions(+)
> > create mode 100644 Documentation/filesystems/famfs.rst
> >
> > diff --git a/Documentation/filesystems/famfs.rst b/Documentation/filesystems/famfs.rst
> > new file mode 100644
> > index 000000000000..0d3c9ba9b7a8
> > --- /dev/null
> > +++ b/Documentation/filesystems/famfs.rst
>
> > +Principles of Operation
> > +=======================
> ....
> > +When an app accesses a data object in a famfs file, there is no page cache
> > +involvement. The CPU cache is loaded directly from the shared memory. In
> > +some use cases, this is an enormous reduction in read amplification compared
> > +to loading an entire page into the page cache.
> > +
> Trivial but this double blank line seems inconsistent.
> I don't mind if it's one or two, but do the same everywhere.
This doc is identical to the previous series, because I kept the Reviewed-by
and Tested-by tags from Randy. I'm happy to remove the extra blank line if he
or somebody from the doc team thinks I should.
>
> > +
> > +Famfs is Not a Conventional File System
> > +---------------------------------------
>
> Nice doc.
>
> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Thanks Jonathan!
John
* Re: [PATCH V3 10/21] famfs_fuse: Kconfig
2026-01-08 12:36 ` Jonathan Cameron
@ 2026-01-12 16:46 ` John Groves
0 siblings, 0 replies; 74+ messages in thread
From: John Groves @ 2026-01-12 16:46 UTC (permalink / raw)
To: Jonathan Cameron
Cc: Miklos Szeredi, Dan Williams, Bernd Schubert, Alison Schofield,
John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
Matthew Wilcox, Jan Kara, Alexander Viro, David Hildenbrand,
Christian Brauner, Darrick J . Wong, Randy Dunlap, Jeff Layton,
Amir Goldstein, Stefan Hajnoczi, Joanne Koong, Josef Bacik,
Bagas Sanjaya, Chen Linxuan, James Morse, Fuad Tabba,
Sean Christopherson, Shivank Garg, Ackerley Tng, Gregory Price,
Aravind Ramesh, Ajay Joshi, venkataravis, linux-doc, linux-kernel,
nvdimm, linux-cxl, linux-fsdevel
On 26/01/08 12:36PM, Jonathan Cameron wrote:
> On Wed, 7 Jan 2026 09:33:19 -0600
> John Groves <John@Groves.net> wrote:
>
> > Add FUSE_FAMFS_DAX config parameter, to control compilation of famfs
> > within fuse.
> >
> > Signed-off-by: John Groves <john@groves.net>
>
> A separate commit for this doesn't obviously add anything over combining
> it with first place the CONFIG_xxx is used.
>
> Maybe it's a convention for fs/fuse though. If it is ignore me.
I've squashed this into the first commit that uses FUSE_FAMFS_DAX,
which is 2 commits later...
Thanks,
John
* Re: [PATCH V3 07/21] dax: prevent driver unbind while filesystem holds device
2026-01-07 15:33 ` [PATCH V3 07/21] dax: prevent driver unbind while filesystem holds device John Groves
2026-01-08 12:34 ` Jonathan Cameron
@ 2026-01-12 18:55 ` John Groves
1 sibling, 0 replies; 74+ messages in thread
From: John Groves @ 2026-01-12 18:55 UTC (permalink / raw)
To: Miklos Szeredi, Dan Williams, Bernd Schubert, Alison Schofield
Cc: John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
Matthew Wilcox, Jan Kara, Alexander Viro, David Hildenbrand,
Christian Brauner, Darrick J . Wong, Randy Dunlap, Jeff Layton,
Amir Goldstein, Jonathan Cameron, Stefan Hajnoczi, Joanne Koong,
Josef Bacik, Bagas Sanjaya, Chen Linxuan, James Morse, Fuad Tabba,
Sean Christopherson, Shivank Garg, Ackerley Tng, Gregory Price,
Aravind Ramesh, Ajay Joshi, venkataravis, linux-doc, linux-kernel,
nvdimm, linux-cxl, linux-fsdevel
On 26/01/07 09:33AM, John Groves wrote:
> From: John Groves <John@Groves.net>
>
> Add custom bind/unbind sysfs attributes for the dax bus that check
> whether a filesystem has registered as a holder (via fs_dax_get())
> before allowing driver unbind.
>
> When a filesystem like famfs mounts on a dax device, it registers
> itself as the holder via dax_holder_ops. Previously, there was no
> mechanism to prevent driver unbind while the filesystem was mounted,
> which could cause some havoc.
>
> The new unbind_store() checks dax_holder() and returns -EBUSY if
> a holder is registered, giving userspace proper feedback that the
> device is in use.
>
> To use our custom bind/unbind handlers instead of the default ones,
> set suppress_bind_attrs=true on all dax drivers during registration.
>
> Signed-off-by: John Groves <john@groves.net>
After a discussion with Dan Williams, I will be dropping this patch
from the series. If the fsdev-mode driver gets unbound under famfs,
famfs will just stop working.
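For context, the check described in the quoted commit message amounts to roughly the following. This is a userspace model only, assuming a hypothetical struct ex_dax_dev and function ex_unbind; it is not the kernel's dax_device, dax_holder(), or unbind_store() code.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical userspace model; not the kernel's struct dax_device. */
struct ex_dax_dev {
	void *holder;	/* non-NULL while a filesystem has claimed the device */
	int bound;	/* 1 while a driver is bound to the device */
};

/*
 * Model of the unbind check: refuse with -EBUSY while a holder is
 * registered, otherwise mark the device as unbound.
 */
static int ex_unbind(struct ex_dax_dev *dev)
{
	if (dev->holder != NULL)
		return -EBUSY;
	dev->bound = 0;
	return 0;
}
```

With the patch dropped, nothing like this guard is applied, so an unbind succeeds even while a filesystem is mounted.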
Based on feedback so far, V4 should be coming in the next few days.
Regards,
John
end of thread, other threads:[~2026-01-12 18:55 UTC | newest]
Thread overview: 74+ messages
2026-01-07 15:32 [PATCH BUNDLE] famfs: Fabric-Attached Memory File System John Groves
2026-01-07 15:33 ` [PATCH V3 00/21] famfs: port into fuse John Groves
2026-01-07 15:33 ` [PATCH V3 01/21] dax: move dax_pgoff_to_phys from [drivers/dax/] device.c to bus.c John Groves
2026-01-08 10:43 ` Jonathan Cameron
2026-01-08 13:25 ` John Groves
2026-01-08 15:20 ` Jonathan Cameron
2026-01-07 15:33 ` [PATCH V3 02/21] dax: add fsdev.c driver for fs-dax on character dax John Groves
2026-01-08 11:31 ` Jonathan Cameron
2026-01-08 14:32 ` John Groves
2026-01-08 15:12 ` John Groves
2026-01-08 21:15 ` John Groves
2026-01-08 23:25 ` Gregory Price
2026-01-07 15:33 ` [PATCH V3 03/21] dax: Save the kva from memremap John Groves
2026-01-08 11:32 ` Jonathan Cameron
2026-01-08 15:15 ` John Groves
2026-01-07 15:33 ` [PATCH V3 04/21] dax: Add dax_operations for use by fs-dax on fsdev dax John Groves
2026-01-08 11:50 ` Jonathan Cameron
2026-01-08 15:59 ` John Groves
2026-01-08 16:10 ` Jonathan Cameron
2026-01-07 15:33 ` [PATCH V3 05/21] dax: Add dax_set_ops() for setting dax_operations at bind time John Groves
2026-01-08 12:06 ` Jonathan Cameron
2026-01-08 16:20 ` John Groves
2026-01-07 15:33 ` [PATCH V3 06/21] dax: Add fs_dax_get() func to prepare dax for fs-dax usage John Groves
2026-01-08 12:27 ` Jonathan Cameron
2026-01-08 16:45 ` John Groves
2026-01-07 15:33 ` [PATCH V3 07/21] dax: prevent driver unbind while filesystem holds device John Groves
2026-01-08 12:34 ` Jonathan Cameron
2026-01-08 18:08 ` John Groves
2026-01-12 18:55 ` John Groves
2026-01-07 15:33 ` [PATCH V3 08/21] dax: export dax_dev_get() John Groves
2026-01-07 15:33 ` [PATCH V3 09/21] famfs_fuse: magic.h: Add famfs magic numbers John Groves
2026-01-07 15:33 ` [PATCH V3 10/21] famfs_fuse: Kconfig John Groves
2026-01-08 12:36 ` Jonathan Cameron
2026-01-12 16:46 ` John Groves
2026-01-07 15:33 ` [PATCH V3 11/21] famfs_fuse: Update macro s/FUSE_IS_DAX/FUSE_IS_VIRTIO_DAX/ John Groves
2026-01-09 18:16 ` Joanne Koong
2026-01-09 22:15 ` [PATCH V3 11/21] famfs_fuse: Update macro s/FUSE_IS_DAX/FUSE_IS_VIRTIO_DAX John Groves
2026-01-07 15:33 ` [PATCH V3 12/21] famfs_fuse: Basic fuse kernel ABI enablement for famfs John Groves
2026-01-09 18:29 ` Joanne Koong
2026-01-09 22:58 ` John Groves
2026-01-07 15:33 ` [PATCH V3 13/21] famfs_fuse: Famfs mount opt: -o shadow=<shadowpath> John Groves
2026-01-09 19:22 ` Joanne Koong
2026-01-10 0:38 ` John Groves
2026-01-11 18:20 ` John Groves
2026-01-07 15:33 ` [PATCH V3 14/21] famfs_fuse: Plumb the GET_FMAP message/response John Groves
2026-01-08 12:49 ` Jonathan Cameron
2026-01-09 2:12 ` John Groves
2026-01-07 15:33 ` [PATCH V3 15/21] famfs_fuse: Create files with famfs fmaps John Groves
2026-01-07 21:30 ` John Groves
2026-01-08 13:14 ` Jonathan Cameron
2026-01-09 14:30 ` John Groves
2026-01-07 15:33 ` [PATCH V3 16/21] famfs_fuse: GET_DAXDEV message and daxdev_table John Groves
2026-01-08 14:45 ` Jonathan Cameron
2026-01-07 15:33 ` [PATCH V3 17/21] famfs_fuse: Plumb dax iomap and fuse read/write/mmap John Groves
2026-01-08 15:13 ` Jonathan Cameron
2026-01-09 17:44 ` John Groves
2026-01-07 15:33 ` [PATCH V3 18/21] famfs_fuse: Add holder_operations for dax notify_failure() John Groves
2026-01-08 15:17 ` Jonathan Cameron
2026-01-09 21:00 ` John Groves
2026-01-07 15:33 ` [PATCH V3 19/21] famfs_fuse: Add DAX address_space_operations with noop_dirty_folio John Groves
2026-01-07 15:33 ` [PATCH V3 20/21] famfs_fuse: Add famfs fmap metadata documentation John Groves
2026-01-07 15:33 ` [PATCH V3 21/21] famfs_fuse: Add documentation John Groves
2026-01-08 15:27 ` Jonathan Cameron
2026-01-11 18:53 ` John Groves
2026-01-07 15:34 ` [PATCH V3 0/4] libfuse: add basic famfs support to libfuse John Groves
2026-01-07 15:34 ` [PATCH V3 1/4] fuse_kernel.h: bring up to baseline 6.19 John Groves
2026-01-07 15:34 ` [PATCH V3 2/4] fuse_kernel.h: add famfs DAX fmap protocol definitions John Groves
2026-01-07 15:34 ` [PATCH V3 3/4] fuse: add API to set kernel mount options John Groves
2026-01-07 15:34 ` [PATCH V3 4/4] fuse: add famfs DAX fmap support John Groves
2026-01-08 15:31 ` Jonathan Cameron
2026-01-11 18:24 ` John Groves
2026-01-07 15:34 ` [PATCH 0/2] ndctl: Add daxctl support for the new "famfs" mode of devdax John Groves
2026-01-07 15:34 ` [PATCH 1/2] daxctl: Add support for famfs mode John Groves
2026-01-07 15:34 ` [PATCH 2/2] Add test/daxctl-famfs.sh to test famfs mode transitions: John Groves