linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC V2 00/18] famfs: port into fuse
@ 2025-07-03 18:50 John Groves
  2025-07-03 18:50 ` [RFC V2 01/18] dev_dax_iomap: Move dax_pgoff_to_phys() from device.c to bus.c John Groves
                   ` (18 more replies)
  0 siblings, 19 replies; 91+ messages in thread
From: John Groves @ 2025-07-03 18:50 UTC (permalink / raw)
  To: John Groves, Dan Williams, Miklos Szeredi, Bernd Schubert
  Cc: John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Darrick J . Wong, Randy Dunlap, Jeff Layton, Kent Overstreet,
	linux-doc, linux-kernel, nvdimm, linux-cxl, linux-fsdevel,
	Amir Goldstein, Jonathan Cameron, Stefan Hajnoczi, Joanne Koong,
	Josef Bacik, Aravind Ramesh, Ajay Joshi, John Groves

Changes since v1:

- The GET_FMAP message/response has been moved from LOOKUP to OPEN, as was
  the pretty much unanimous consensus.
- Made the response payload to GET_FMAP variable sized (patch 12)
- Dodgy kerneldoc comments cleaned up or removed.
- Fixed memory leak of fc->shadow in patch 11 (thanks Joanne)
- Dropped many pr_debug and pr_notice calls

Open Issues:

- This is still marked RFC because I have not tackled the "poisoned page
  problem" yet (see below the original description below). That's next on my
  agenda for this patch set; I'm planning to address that in V3, and to drop
  RFC and make V3 mergeable.
- Note: this patch is still against 6.14 because of the interaction of the
  poisoned page issue with Alistair Popple's multitudinous recent DAX
  patches. ;) I have some work to do to move forward, but the next rev will
  do that.
- Because I haven't moved forward past 6.14, the related libfuse patch [2.1]
  is out of sync with the libfuse master branch. This will be addressed in the
  next version.

Other Notes:
- This patch is available as a git branch at [2.2]

References to V2
[2.1] - https://github.com/libfuse/libfuse/pull/1271
[2.2] - https://github.com/cxl-micron-reskit/famfs-linux/tree/famfs-fuse-v2


Original Description:

This is the initial RFC for the fabric-attached memory file system (famfs)
integration into fuse. In order to function, this requires a related patch
to libfuse [1] and the famfs user space [2]. 

This RFC is mainly intended to socialize the approach and get feedback from
the fuse developers and maintainers. There is some dax work that needs to
be done before this should be merged (see the "poisoned page|folio problem"
below).

This patch set fully works with Linux 6.14 -- passing all existing famfs
smoke and unit tests -- and I encourage existing famfs users to test it.

This is really two patch sets mashed up:

* The patches with the dev_dax_iomap: prefix fill in missing functionality for
  devdax to host an fs-dax file system.
* The famfs_fuse: patches add famfs into fs/fuse/. These are effectively
  unchanged since last year.

Because this is not ready to merge yet, I have felt free to leave some debug
prints in place because we still find them useful; those will be cleaned up
in a subsequent revision.

Famfs Overview

Famfs exposes shared memory as a file system. Famfs consumes shared memory
from dax devices, and provides memory-mappable files that map directly to
the memory - no page cache involvement. Famfs differs from conventional
file systems in fs-dax mode, in that it handles in-memory metadata in a
sharable way (which begins with never caching dirty shared metadata).

Famfs started as a standalone file system [3,4], but the consensus at LSFMM
2024 [5] was that it should be ported into fuse - and this RFC is the first
public evidence that I've been working on that.

The key performance requirement is that famfs must resolve mapping faults
without upcalls. This is achieved by fully caching the file-to-devdax
metadata for all active files. This is done via two fuse client/server
message/response pairs: GET_FMAP and GET_DAXDEV.

Famfs remains the first fs-dax file system that is backed by devdax rather
than pmem in fs-dax mode (hence the need for the dev_dax_iomap fixups).

Notes

* Once the dev_dax_iomap patches land, I suspect it may make sense for
  virtiofs to update to use the improved interface.

* I'm currently maintaining compatibility between the famfs user space and
  both the standalone famfs kernel file system and this new fuse
  implementation. In the near future I'll be running performance comparisons
  and sharing them - but there is no reason to expect significant degradation
  with fuse, since famfs caches entire "fmaps" in the kernel to resolve
  faults with no upcalls. This patch has a bit too much debug turned on to
  to that testing quite yet. A branch 

* Two new fuse messages / responses are added: GET_FMAP and GET_DAXDEV.

* When a file is looked up in a famfs mount, the LOOKUP is followed by a
  GET_FMAP message and response. The "fmap" is the full file-to-dax mapping,
  allowing the fuse/famfs kernel code to handle read/write/fault without any
  upcalls.

* After each GET_FMAP, the fmap is checked for extents that reference
  previously-unknown daxdevs. Each such occurrence is handled with a
  GET_DAXDEV message and response.

* Daxdevs are stored in a table (which might become an xarray at some point).
  When entries are added to the table, we acquire exclusive access to the
  daxdev via the fs_dax_get() call (modeled after how fs-dax handles this
  with pmem devices). famfs provides holder_operations to devdax, providing
  a notification path in the event of memory errors.

* If devdax notifies famfs of memory errors on a dax device, famfs currently
  blocks all subsequent accesses to data on that device. The recovery is to
  re-initialize the memory and file system. Famfs is memory, not storage...

* Because famfs uses backing (devdax) devices, only privileged mounts are
  supported.

* The famfs kernel code never accesses the memory directly - it only
  facilitates read, write and mmap on behalf of user processes. As such,
  the RAS of the shared memory affects applications, but not the kernel.

* Famfs has backing device(s), but they are devdax (char) rather than
  block. Right now there is no way to tell the vfs layer that famfs has a
  char backing device (unless we say it's block, but it's not). Currently
  we use the standard anonymous fuse fs_type - but I'm not sure that's
  ultimately optimal (thoughts?)

The "poisoned page|folio problem"

* Background: before doing a kernel mount, the famfs user space [2] validates
  the superblock and log. This is done via raw mmap of the primary devdax
  device. If valid, the file system is mounted, and the superblock and log
  get exposed through a pair of files (.meta/.superblock and .meta/.log) -
  because we can't be using raw device mmap when a file system is mounted
  on the device. But this exposes a devdax bug and warning...

* Pages that have been memory mapped via devdax are left in a permanently
  problematic state. Devdax sets page|folio->mapping when a page is accessed
  via raw devdax mmap (as famfs does before mount), but never cleans it up.
  When the pages of the famfs superblock and log are accessed via the "meta"
  files after mount, we see a WARN_ONCE() in dax_insert_entry(), which
  notices that page|folio->mapping is still set. I intend to address this
  prior to asking for the famfs patches to be merged.

* Alistair Popple's recent dax patch series [6], which has been merged
  for 6.15, addresses some dax issues, but sadly does not fix the poisoned
  page|folio problem - its enhanced refcount checking turns the warning into
  an error.

* This 6.14 patch set disables the warning; a proper fix will be required for
  famfs to work at all in 6.15. Dan W. and I are actively discussing how to do
  this properly...

* In terms of the correct functionality of famfs, the warning can be ignored.

References

[1] - https://github.com/libfuse/libfuse/pull/1200
[2] - https://github.com/cxl-micron-reskit/famfs
[3] - https://lore.kernel.org/linux-cxl/cover.1708709155.git.john@groves.net/
[4] - https://lore.kernel.org/linux-cxl/cover.1714409084.git.john@groves.net/
[5] - https://lwn.net/Articles/983105/
[6] - https://lore.kernel.org/linux-cxl/cover.8068ad144a7eea4a813670301f4d2a86a8e68ec4.1740713401.git-series.apopple@nvidia.com/


John Groves (18):
  dev_dax_iomap: Move dax_pgoff_to_phys() from device.c to bus.c
  dev_dax_iomap: Add fs_dax_get() func to prepare dax for fs-dax usage
  dev_dax_iomap: Save the kva from memremap
  dev_dax_iomap: Add dax_operations for use by fs-dax on devdax
  dev_dax_iomap: export dax_dev_get()
  dev_dax_iomap: (ignore!) Drop poisoned page warning in fs/dax.c
  famfs_fuse: magic.h: Add famfs magic numbers
  famfs_fuse: Kconfig
  famfs_fuse: Update macro s/FUSE_IS_DAX/FUSE_IS_VIRTIO_DAX/
  famfs_fuse: Basic fuse kernel ABI enablement for famfs
  famfs_fuse: Basic famfs mount opts
  famfs_fuse: Plumb the GET_FMAP message/response
  famfs_fuse: Create files with famfs fmaps
  famfs_fuse: GET_DAXDEV message and daxdev_table
  famfs_fuse: Plumb dax iomap and fuse read/write/mmap
  famfs_fuse: Add holder_operations for dax notify_failure()
  famfs_fuse: Add famfs metadata documentation
  famfs_fuse: Add documentation

 Documentation/filesystems/famfs.rst |  142 ++++
 Documentation/filesystems/index.rst |    1 +
 MAINTAINERS                         |   10 +
 drivers/dax/Kconfig                 |    6 +
 drivers/dax/bus.c                   |  144 +++-
 drivers/dax/dax-private.h           |    1 +
 drivers/dax/device.c                |   38 +-
 drivers/dax/super.c                 |   33 +-
 fs/dax.c                            |    1 -
 fs/fuse/Kconfig                     |   13 +
 fs/fuse/Makefile                    |    2 +-
 fs/fuse/dir.c                       |    2 +-
 fs/fuse/famfs.c                     | 1087 +++++++++++++++++++++++++++
 fs/fuse/famfs_kfmap.h               |  166 ++++
 fs/fuse/file.c                      |  124 ++-
 fs/fuse/fuse_i.h                    |   67 +-
 fs/fuse/inode.c                     |   59 +-
 fs/fuse/iomode.c                    |    2 +-
 fs/namei.c                          |    1 +
 include/linux/dax.h                 |    6 +
 include/uapi/linux/fuse.h           |   96 +++
 include/uapi/linux/magic.h          |    2 +
 22 files changed, 1961 insertions(+), 42 deletions(-)
 create mode 100644 Documentation/filesystems/famfs.rst
 create mode 100644 fs/fuse/famfs.c
 create mode 100644 fs/fuse/famfs_kfmap.h


base-commit: b9d5d463c216763cec719c04536ea9e14512cad4
-- 
2.49.0


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [RFC V2 01/18] dev_dax_iomap: Move dax_pgoff_to_phys() from device.c to bus.c
  2025-07-03 18:50 [RFC V2 00/18] famfs: port into fuse John Groves
@ 2025-07-03 18:50 ` John Groves
  2025-07-03 18:50 ` [RFC V2 02/18] dev_dax_iomap: Add fs_dax_get() func to prepare dax for fs-dax usage John Groves
                   ` (17 subsequent siblings)
  18 siblings, 0 replies; 91+ messages in thread
From: John Groves @ 2025-07-03 18:50 UTC (permalink / raw)
  To: John Groves, Dan Williams, Miklos Szeredi, Bernd Schubert
  Cc: John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Darrick J . Wong, Randy Dunlap, Jeff Layton, Kent Overstreet,
	linux-doc, linux-kernel, nvdimm, linux-cxl, linux-fsdevel,
	Amir Goldstein, Jonathan Cameron, Stefan Hajnoczi, Joanne Koong,
	Josef Bacik, Aravind Ramesh, Ajay Joshi, John Groves

No changes to the function - just moved it.

dev_dax_iomap needs to call this function from
drivers/dax/bus.c.

drivers/dax/bus.c can't call functions in drivers/dax/device.c -
that creates a circular linkage dependency - but device.c can
call functions in bus.c. Also exports dax_pgoff_to_phys() since
both bus.c and device.c now call it.

Signed-off-by: John Groves <john@groves.net>
---
 drivers/dax/bus.c    | 24 ++++++++++++++++++++++++
 drivers/dax/device.c | 23 -----------------------
 2 files changed, 24 insertions(+), 23 deletions(-)

diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index fde29e0ad68b..9d9a4ae7bbc0 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -1417,6 +1417,30 @@ static const struct device_type dev_dax_type = {
 	.groups = dax_attribute_groups,
 };
 
+/* see "strong" declaration in tools/testing/nvdimm/dax-dev.c  */
+__weak phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff,
+			      unsigned long size)
+{
+	int i;
+
+	for (i = 0; i < dev_dax->nr_range; i++) {
+		struct dev_dax_range *dax_range = &dev_dax->ranges[i];
+		struct range *range = &dax_range->range;
+		unsigned long long pgoff_end;
+		phys_addr_t phys;
+
+		pgoff_end = dax_range->pgoff + PHYS_PFN(range_len(range)) - 1;
+		if (pgoff < dax_range->pgoff || pgoff > pgoff_end)
+			continue;
+		phys = PFN_PHYS(pgoff - dax_range->pgoff) + range->start;
+		if (phys + size - 1 <= range->end)
+			return phys;
+		break;
+	}
+	return -1;
+}
+EXPORT_SYMBOL_GPL(dax_pgoff_to_phys);
+
 static struct dev_dax *__devm_create_dev_dax(struct dev_dax_data *data)
 {
 	struct dax_region *dax_region = data->dax_region;
diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index 6d74e62bbee0..29f61771fef0 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -50,29 +50,6 @@ static int check_vma(struct dev_dax *dev_dax, struct vm_area_struct *vma,
 	return 0;
 }
 
-/* see "strong" declaration in tools/testing/nvdimm/dax-dev.c */
-__weak phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff,
-		unsigned long size)
-{
-	int i;
-
-	for (i = 0; i < dev_dax->nr_range; i++) {
-		struct dev_dax_range *dax_range = &dev_dax->ranges[i];
-		struct range *range = &dax_range->range;
-		unsigned long long pgoff_end;
-		phys_addr_t phys;
-
-		pgoff_end = dax_range->pgoff + PHYS_PFN(range_len(range)) - 1;
-		if (pgoff < dax_range->pgoff || pgoff > pgoff_end)
-			continue;
-		phys = PFN_PHYS(pgoff - dax_range->pgoff) + range->start;
-		if (phys + size - 1 <= range->end)
-			return phys;
-		break;
-	}
-	return -1;
-}
-
 static void dax_set_mapping(struct vm_fault *vmf, pfn_t pfn,
 			      unsigned long fault_size)
 {
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [RFC V2 02/18] dev_dax_iomap: Add fs_dax_get() func to prepare dax for fs-dax usage
  2025-07-03 18:50 [RFC V2 00/18] famfs: port into fuse John Groves
  2025-07-03 18:50 ` [RFC V2 01/18] dev_dax_iomap: Move dax_pgoff_to_phys() from device.c to bus.c John Groves
@ 2025-07-03 18:50 ` John Groves
  2025-07-04 10:39   ` Jonathan Cameron
  2025-07-03 18:50 ` [RFC V2 03/18] dev_dax_iomap: Save the kva from memremap John Groves
                   ` (16 subsequent siblings)
  18 siblings, 1 reply; 91+ messages in thread
From: John Groves @ 2025-07-03 18:50 UTC (permalink / raw)
  To: John Groves, Dan Williams, Miklos Szeredi, Bernd Schubert
  Cc: John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Darrick J . Wong, Randy Dunlap, Jeff Layton, Kent Overstreet,
	linux-doc, linux-kernel, nvdimm, linux-cxl, linux-fsdevel,
	Amir Goldstein, Jonathan Cameron, Stefan Hajnoczi, Joanne Koong,
	Josef Bacik, Aravind Ramesh, Ajay Joshi, John Groves

This function should be called by fs-dax file systems after opening the
devdax device. This adds holder_operations, which effects exclusivity
between callers of fs_dax_get().

This function serves the same role as fs_dax_get_by_bdev(), which dax
file systems call after opening the pmem block device.

This also adds the CONFIG_DEV_DAX_IOMAP Kconfig parameter

Signed-off-by: John Groves <john@groves.net>
---
 drivers/dax/Kconfig |  6 ++++++
 drivers/dax/super.c | 30 ++++++++++++++++++++++++++++++
 include/linux/dax.h |  5 +++++
 3 files changed, 41 insertions(+)

diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
index d656e4c0eb84..ad19fa966b8b 100644
--- a/drivers/dax/Kconfig
+++ b/drivers/dax/Kconfig
@@ -78,4 +78,10 @@ config DEV_DAX_KMEM
 
 	  Say N if unsure.
 
+config DEV_DAX_IOMAP
+       depends on DEV_DAX && DAX
+       def_bool y
+       help
+         Support iomap mapping of devdax devices (for FS-DAX file
+         systems that reside on character /dev/dax devices)
 endif
diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index e16d1d40d773..48bab9b5f341 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -122,6 +122,36 @@ void fs_put_dax(struct dax_device *dax_dev, void *holder)
 EXPORT_SYMBOL_GPL(fs_put_dax);
 #endif /* CONFIG_BLOCK && CONFIG_FS_DAX */
 
+#if IS_ENABLED(CONFIG_DEV_DAX_IOMAP)
+/**
+ * fs_dax_get()
+ *
+ * fs-dax file systems call this function to prepare to use a devdax device for
+ * fsdax. This is like fs_dax_get_by_bdev(), but the caller already has struct
+ * dev_dax (and there  * is no bdev). The holder makes this exclusive.
+ *
+ * @dax_dev: dev to be prepared for fs-dax usage
+ * @holder: filesystem or mapped device inside the dax_device
+ * @hops: operations for the inner holder
+ *
+ * Returns: 0 on success, <0 on failure
+ */
+int fs_dax_get(struct dax_device *dax_dev, void *holder,
+	const struct dax_holder_operations *hops)
+{
+	if (!dax_dev || !dax_alive(dax_dev) || !igrab(&dax_dev->inode))
+		return -ENODEV;
+
+	if (cmpxchg(&dax_dev->holder_data, NULL, holder))
+		return -EBUSY;
+
+	dax_dev->holder_ops = hops;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(fs_dax_get);
+#endif /* DEV_DAX_IOMAP */
+
 enum dax_device_flags {
 	/* !alive + rcu grace period == no new operations / mappings */
 	DAXDEV_ALIVE,
diff --git a/include/linux/dax.h b/include/linux/dax.h
index df41a0017b31..86bf5922f1b0 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -51,6 +51,11 @@ struct dax_holder_operations {
 
 #if IS_ENABLED(CONFIG_DAX)
 struct dax_device *alloc_dax(void *private, const struct dax_operations *ops);
+
+#if IS_ENABLED(CONFIG_DEV_DAX_IOMAP)
+int fs_dax_get(struct dax_device *dax_dev, void *holder, const struct dax_holder_operations *hops);
+struct dax_device *inode_dax(struct inode *inode);
+#endif
 void *dax_holder(struct dax_device *dax_dev);
 void put_dax(struct dax_device *dax_dev);
 void kill_dax(struct dax_device *dax_dev);
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [RFC V2 03/18] dev_dax_iomap: Save the kva from memremap
  2025-07-03 18:50 [RFC V2 00/18] famfs: port into fuse John Groves
  2025-07-03 18:50 ` [RFC V2 01/18] dev_dax_iomap: Move dax_pgoff_to_phys() from device.c to bus.c John Groves
  2025-07-03 18:50 ` [RFC V2 02/18] dev_dax_iomap: Add fs_dax_get() func to prepare dax for fs-dax usage John Groves
@ 2025-07-03 18:50 ` John Groves
  2025-07-04 11:11   ` Jonathan Cameron
  2025-07-03 18:50 ` [RFC V2 04/18] dev_dax_iomap: Add dax_operations for use by fs-dax on devdax John Groves
                   ` (15 subsequent siblings)
  18 siblings, 1 reply; 91+ messages in thread
From: John Groves @ 2025-07-03 18:50 UTC (permalink / raw)
  To: John Groves, Dan Williams, Miklos Szeredi, Bernd Schubert
  Cc: John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Darrick J . Wong, Randy Dunlap, Jeff Layton, Kent Overstreet,
	linux-doc, linux-kernel, nvdimm, linux-cxl, linux-fsdevel,
	Amir Goldstein, Jonathan Cameron, Stefan Hajnoczi, Joanne Koong,
	Josef Bacik, Aravind Ramesh, Ajay Joshi, John Groves

Save the kva from memremap because we need it for iomap rw support.

Prior to famfs, there were no iomap users of /dev/dax - so the virtual
address from memremap was not needed.

Also: in some cases dev_dax_probe() is called with the first
dev_dax->range offset past the start of pgmap[0].range. In those cases
we need to add the difference to virt_addr in order to have the physaddr's
in dev_dax->ranges match dev_dax->virt_addr.

This happens with devdax devices that started as pmem and got converted
to devdax. I'm not sure whether the offset is due to label storage, or
page tables, but this works in all known cases.

Signed-off-by: John Groves <john@groves.net>
---
 drivers/dax/dax-private.h |  1 +
 drivers/dax/device.c      | 15 +++++++++++++++
 2 files changed, 16 insertions(+)

diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
index 0867115aeef2..2a6b07813f9f 100644
--- a/drivers/dax/dax-private.h
+++ b/drivers/dax/dax-private.h
@@ -81,6 +81,7 @@ struct dev_dax_range {
 struct dev_dax {
 	struct dax_region *region;
 	struct dax_device *dax_dev;
+	void *virt_addr;
 	unsigned int align;
 	int target_node;
 	bool dyn_id;
diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index 29f61771fef0..583150478dcc 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -372,6 +372,7 @@ static int dev_dax_probe(struct dev_dax *dev_dax)
 	struct dax_device *dax_dev = dev_dax->dax_dev;
 	struct device *dev = &dev_dax->dev;
 	struct dev_pagemap *pgmap;
+	u64 data_offset = 0;
 	struct inode *inode;
 	struct cdev *cdev;
 	void *addr;
@@ -426,6 +427,20 @@ static int dev_dax_probe(struct dev_dax *dev_dax)
 	if (IS_ERR(addr))
 		return PTR_ERR(addr);
 
+	/* Detect whether the data is at a non-zero offset into the memory */
+	if (pgmap->range.start != dev_dax->ranges[0].range.start) {
+		u64 phys = dev_dax->ranges[0].range.start;
+		u64 pgmap_phys = dev_dax->pgmap[0].range.start;
+		u64 vmemmap_shift = dev_dax->pgmap[0].vmemmap_shift;
+
+		if (!WARN_ON(pgmap_phys > phys))
+			data_offset = phys - pgmap_phys;
+
+		pr_debug("%s: offset detected phys=%llx pgmap_phys=%llx offset=%llx shift=%llx\n",
+		       __func__, phys, pgmap_phys, data_offset, vmemmap_shift);
+	}
+	dev_dax->virt_addr = addr + data_offset;
+
 	inode = dax_inode(dax_dev);
 	cdev = inode->i_cdev;
 	cdev_init(cdev, &dax_fops);
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [RFC V2 04/18] dev_dax_iomap: Add dax_operations for use by fs-dax on devdax
  2025-07-03 18:50 [RFC V2 00/18] famfs: port into fuse John Groves
                   ` (2 preceding siblings ...)
  2025-07-03 18:50 ` [RFC V2 03/18] dev_dax_iomap: Save the kva from memremap John Groves
@ 2025-07-03 18:50 ` John Groves
  2025-07-04 12:47   ` Jonathan Cameron
  2025-07-03 18:50 ` [RFC V2 05/18] dev_dax_iomap: export dax_dev_get() John Groves
                   ` (14 subsequent siblings)
  18 siblings, 1 reply; 91+ messages in thread
From: John Groves @ 2025-07-03 18:50 UTC (permalink / raw)
  To: John Groves, Dan Williams, Miklos Szeredi, Bernd Schubert
  Cc: John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Darrick J . Wong, Randy Dunlap, Jeff Layton, Kent Overstreet,
	linux-doc, linux-kernel, nvdimm, linux-cxl, linux-fsdevel,
	Amir Goldstein, Jonathan Cameron, Stefan Hajnoczi, Joanne Koong,
	Josef Bacik, Aravind Ramesh, Ajay Joshi, John Groves

Notes about this commit:

* These methods are based on pmem_dax_ops from drivers/nvdimm/pmem.c

* dev_dax_direct_access() is returns the hpa, pfn and kva. The kva was
  newly stored as dev_dax->virt_addr by dev_dax_probe().

* The hpa/pfn are used for mmap (dax_iomap_fault()), and the kva is used
  for read/write (dax_iomap_rw())

* dev_dax_recovery_write() and dev_dax_zero_page_range() have not been
  tested yet. I'm looking for suggestions as to how to test those.

Signed-off-by: John Groves <john@groves.net>
---
 drivers/dax/bus.c | 120 ++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 115 insertions(+), 5 deletions(-)

diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 9d9a4ae7bbc0..61a8d1b3c07a 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -7,6 +7,10 @@
 #include <linux/slab.h>
 #include <linux/dax.h>
 #include <linux/io.h>
+#include <linux/backing-dev.h>
+#include <linux/pfn_t.h>
+#include <linux/range.h>
+#include <linux/uio.h>
 #include "dax-private.h"
 #include "bus.h"
 
@@ -1441,6 +1445,105 @@ __weak phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff,
 }
 EXPORT_SYMBOL_GPL(dax_pgoff_to_phys);
 
+#if IS_ENABLED(CONFIG_DEV_DAX_IOMAP)
+
+static void write_dax(void *pmem_addr, struct page *page,
+		unsigned int off, unsigned int len)
+{
+	unsigned int chunk;
+	void *mem;
+
+	while (len) {
+		mem = kmap_local_page(page);
+		chunk = min_t(unsigned int, len, PAGE_SIZE - off);
+		memcpy_flushcache(pmem_addr, mem + off, chunk);
+		kunmap_local(mem);
+		len -= chunk;
+		off = 0;
+		page++;
+		pmem_addr += chunk;
+	}
+}
+
+static long __dev_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
+			long nr_pages, enum dax_access_mode mode, void **kaddr,
+			pfn_t *pfn)
+{
+	struct dev_dax *dev_dax = dax_get_private(dax_dev);
+	size_t size = nr_pages << PAGE_SHIFT;
+	size_t offset = pgoff << PAGE_SHIFT;
+	void *virt_addr = dev_dax->virt_addr + offset;
+	u64 flags = PFN_DEV|PFN_MAP;
+	phys_addr_t phys;
+	pfn_t local_pfn;
+	size_t dax_size;
+
+	WARN_ON(!dev_dax->virt_addr);
+
+	if (down_read_interruptible(&dax_dev_rwsem))
+		return 0; /* no valid data since we were killed */
+	dax_size = dev_dax_size(dev_dax);
+	up_read(&dax_dev_rwsem);
+
+	phys = dax_pgoff_to_phys(dev_dax, pgoff, nr_pages << PAGE_SHIFT);
+
+	if (kaddr)
+		*kaddr = virt_addr;
+
+	local_pfn = phys_to_pfn_t(phys, flags); /* are flags correct? */
+	if (pfn)
+		*pfn = local_pfn;
+
+	/* This the valid size at the specified address */
+	return PHYS_PFN(min_t(size_t, size, dax_size - offset));
+}
+
+static int dev_dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
+				    size_t nr_pages)
+{
+	long resid = nr_pages << PAGE_SHIFT;
+	long offset = pgoff << PAGE_SHIFT;
+
+	/* Break into one write per dax region */
+	while (resid > 0) {
+		void *kaddr;
+		pgoff_t poff = offset >> PAGE_SHIFT;
+		long len = __dev_dax_direct_access(dax_dev, poff,
+						   nr_pages, DAX_ACCESS, &kaddr, NULL);
+		len = min_t(long, len, PAGE_SIZE);
+		write_dax(kaddr, ZERO_PAGE(0), offset, len);
+
+		offset += len;
+		resid  -= len;
+	}
+	return 0;
+}
+
+static long dev_dax_direct_access(struct dax_device *dax_dev,
+		pgoff_t pgoff, long nr_pages, enum dax_access_mode mode,
+		void **kaddr, pfn_t *pfn)
+{
+	return __dev_dax_direct_access(dax_dev, pgoff, nr_pages, mode, kaddr, pfn);
+}
+
+static size_t dev_dax_recovery_write(struct dax_device *dax_dev, pgoff_t pgoff,
+		void *addr, size_t bytes, struct iov_iter *i)
+{
+	size_t off;
+
+	off = offset_in_page(addr);
+
+	return _copy_from_iter_flushcache(addr, bytes, i);
+}
+
+static const struct dax_operations dev_dax_ops = {
+	.direct_access = dev_dax_direct_access,
+	.zero_page_range = dev_dax_zero_page_range,
+	.recovery_write = dev_dax_recovery_write,
+};
+
+#endif /* IS_ENABLED(CONFIG_DEV_DAX_IOMAP) */
+
 static struct dev_dax *__devm_create_dev_dax(struct dev_dax_data *data)
 {
 	struct dax_region *dax_region = data->dax_region;
@@ -1496,11 +1599,18 @@ static struct dev_dax *__devm_create_dev_dax(struct dev_dax_data *data)
 		}
 	}
 
-	/*
-	 * No dax_operations since there is no access to this device outside of
-	 * mmap of the resulting character device.
-	 */
-	dax_dev = alloc_dax(dev_dax, NULL);
+	if (IS_ENABLED(CONFIG_DEV_DAX_IOMAP))
+		/* holder_ops currently populated separately in a slightly
+		 * hacky way
+		 */
+		dax_dev = alloc_dax(dev_dax, &dev_dax_ops);
+	else
+		/*
+		 * No dax_operations since there is no access to this device
+		 * outside of mmap of the resulting character device.
+		 */
+		dax_dev = alloc_dax(dev_dax, NULL);
+
 	if (IS_ERR(dax_dev)) {
 		rc = PTR_ERR(dax_dev);
 		goto err_alloc_dax;
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [RFC V2 05/18] dev_dax_iomap: export dax_dev_get()
  2025-07-03 18:50 [RFC V2 00/18] famfs: port into fuse John Groves
                   ` (3 preceding siblings ...)
  2025-07-03 18:50 ` [RFC V2 04/18] dev_dax_iomap: Add dax_operations for use by fs-dax on devdax John Groves
@ 2025-07-03 18:50 ` John Groves
  2025-07-03 18:50 ` [RFC V2 06/18] dev_dax_iomap: (ignore!) Drop poisoned page warning in fs/dax.c John Groves
                   ` (13 subsequent siblings)
  18 siblings, 0 replies; 91+ messages in thread
From: John Groves @ 2025-07-03 18:50 UTC (permalink / raw)
  To: John Groves, Dan Williams, Miklos Szeredi, Bernd Schubert
  Cc: John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Darrick J . Wong, Randy Dunlap, Jeff Layton, Kent Overstreet,
	linux-doc, linux-kernel, nvdimm, linux-cxl, linux-fsdevel,
	Amir Goldstein, Jonathan Cameron, Stefan Hajnoczi, Joanne Koong,
	Josef Bacik, Aravind Ramesh, Ajay Joshi, John Groves

famfs needs access to dev_dax_get()

Signed-off-by: John Groves <john@groves.net>
---
 drivers/dax/super.c | 3 ++-
 include/linux/dax.h | 1 +
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 48bab9b5f341..033fd841c2bb 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -452,7 +452,7 @@ static int dax_set(struct inode *inode, void *data)
 	return 0;
 }
 
-static struct dax_device *dax_dev_get(dev_t devt)
+struct dax_device *dax_dev_get(dev_t devt)
 {
 	struct dax_device *dax_dev;
 	struct inode *inode;
@@ -475,6 +475,7 @@ static struct dax_device *dax_dev_get(dev_t devt)
 
 	return dax_dev;
 }
+EXPORT_SYMBOL_GPL(dax_dev_get);
 
 struct dax_device *alloc_dax(void *private, const struct dax_operations *ops)
 {
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 86bf5922f1b0..c7bf03535b52 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -55,6 +55,7 @@ struct dax_device *alloc_dax(void *private, const struct dax_operations *ops);
 #if IS_ENABLED(CONFIG_DEV_DAX_IOMAP)
 int fs_dax_get(struct dax_device *dax_dev, void *holder, const struct dax_holder_operations *hops);
 struct dax_device *inode_dax(struct inode *inode);
+struct dax_device *dax_dev_get(dev_t devt);
 #endif
 void *dax_holder(struct dax_device *dax_dev);
 void put_dax(struct dax_device *dax_dev);
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [RFC V2 06/18] dev_dax_iomap: (ignore!) Drop poisoned page warning in fs/dax.c
  2025-07-03 18:50 [RFC V2 00/18] famfs: port into fuse John Groves
                   ` (4 preceding siblings ...)
  2025-07-03 18:50 ` [RFC V2 05/18] dev_dax_iomap: export dax_dev_get() John Groves
@ 2025-07-03 18:50 ` John Groves
  2025-07-03 18:50 ` [RFC V2 07/18] famfs_fuse: magic.h: Add famfs magic numbers John Groves
                   ` (12 subsequent siblings)
  18 siblings, 0 replies; 91+ messages in thread
From: John Groves @ 2025-07-03 18:50 UTC (permalink / raw)
  To: John Groves, Dan Williams, Miklos Szeredi, Bernd Schubert
  Cc: John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Darrick J . Wong, Randy Dunlap, Jeff Layton, Kent Overstreet,
	linux-doc, linux-kernel, nvdimm, linux-cxl, linux-fsdevel,
	Amir Goldstein, Jonathan Cameron, Stefan Hajnoczi, Joanne Koong,
	Josef Bacik, Aravind Ramesh, Ajay Joshi, John Groves

This just works around a the "poisoned page" warning that will be
properly fixed in a future version of this patch set. Please ignore
for the moment.

Signed-off-by: John Groves <john@groves.net>
---
 fs/dax.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/dax.c b/fs/dax.c
index 21b47402b3dc..635937593d5e 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -369,7 +369,6 @@ static void dax_associate_entry(void *entry, struct address_space *mapping,
 		if (shared) {
 			dax_page_share_get(page);
 		} else {
-			WARN_ON_ONCE(page->mapping);
 			page->mapping = mapping;
 			page->index = index + i++;
 		}
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [RFC V2 07/18] famfs_fuse: magic.h: Add famfs magic numbers
  2025-07-03 18:50 [RFC V2 00/18] famfs: port into fuse John Groves
                   ` (5 preceding siblings ...)
  2025-07-03 18:50 ` [RFC V2 06/18] dev_dax_iomap: (ignore!) Drop poisoned page warning in fs/dax.c John Groves
@ 2025-07-03 18:50 ` John Groves
  2025-07-03 18:50 ` [RFC V2 08/18] famfs_fuse: Kconfig John Groves
                   ` (11 subsequent siblings)
  18 siblings, 0 replies; 91+ messages in thread
From: John Groves @ 2025-07-03 18:50 UTC (permalink / raw)
  To: John Groves, Dan Williams, Miklos Szeredi, Bernd Schubert
  Cc: John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Darrick J . Wong, Randy Dunlap, Jeff Layton, Kent Overstreet,
	linux-doc, linux-kernel, nvdimm, linux-cxl, linux-fsdevel,
	Amir Goldstein, Jonathan Cameron, Stefan Hajnoczi, Joanne Koong,
	Josef Bacik, Aravind Ramesh, Ajay Joshi, John Groves

Famfs distinguishes between its on-media and in-memory superblocks

Signed-off-by: John Groves <john@groves.net>
---
 include/uapi/linux/magic.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index bb575f3ab45e..ee497665d8d7 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -38,6 +38,8 @@
 #define OVERLAYFS_SUPER_MAGIC	0x794c7630
 #define FUSE_SUPER_MAGIC	0x65735546
 #define BCACHEFS_SUPER_MAGIC	0xca451a4e
+#define FAMFS_SUPER_MAGIC	0x87b282ff
+#define FAMFS_STATFS_MAGIC      0x87b282fd
 
 #define MINIX_SUPER_MAGIC	0x137F		/* minix v1 fs, 14 char names */
 #define MINIX_SUPER_MAGIC2	0x138F		/* minix v1 fs, 30 char names */
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [RFC V2 08/18] famfs_fuse: Kconfig
  2025-07-03 18:50 [RFC V2 00/18] famfs: port into fuse John Groves
                   ` (6 preceding siblings ...)
  2025-07-03 18:50 ` [RFC V2 07/18] famfs_fuse: magic.h: Add famfs magic numbers John Groves
@ 2025-07-03 18:50 ` John Groves
  2025-07-03 18:50 ` [RFC V2 09/18] famfs_fuse: Update macro s/FUSE_IS_DAX/FUSE_IS_VIRTIO_DAX/ John Groves
                   ` (10 subsequent siblings)
  18 siblings, 0 replies; 91+ messages in thread
From: John Groves @ 2025-07-03 18:50 UTC (permalink / raw)
  To: John Groves, Dan Williams, Miklos Szeredi, Bernd Schubert
  Cc: John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Darrick J . Wong, Randy Dunlap, Jeff Layton, Kent Overstreet,
	linux-doc, linux-kernel, nvdimm, linux-cxl, linux-fsdevel,
	Amir Goldstein, Jonathan Cameron, Stefan Hajnoczi, Joanne Koong,
	Josef Bacik, Aravind Ramesh, Ajay Joshi, John Groves

Add FUSE_FAMFS_DAX config parameter, to control compilation of famfs
within fuse.

Signed-off-by: John Groves <john@groves.net>
---
 fs/fuse/Kconfig | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/fs/fuse/Kconfig b/fs/fuse/Kconfig
index ca215a3cba3e..e6d554f2a21c 100644
--- a/fs/fuse/Kconfig
+++ b/fs/fuse/Kconfig
@@ -75,3 +75,16 @@ config FUSE_IO_URING
 
 	  If you want to allow fuse server/client communication through io-uring,
 	  answer Y
+
+config FUSE_FAMFS_DAX
+	bool "FUSE support for fs-dax filesystems backed by devdax"
+	depends on FUSE_FS
+	default FUSE_FS
+	select DEV_DAX_IOMAP
+	help
+	  This enables the fabric-attached memory file system (famfs),
+	  which enables formatting devdax memory as a file system. Famfs
+	  is primarily intended for scale-out shared access to
+	  disaggregated memory.
+
+	  To enable famfs or other fuse/fs-dax file systems, answer Y
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [RFC V2 09/18] famfs_fuse: Update macro s/FUSE_IS_DAX/FUSE_IS_VIRTIO_DAX/
  2025-07-03 18:50 [RFC V2 00/18] famfs: port into fuse John Groves
                   ` (7 preceding siblings ...)
  2025-07-03 18:50 ` [RFC V2 08/18] famfs_fuse: Kconfig John Groves
@ 2025-07-03 18:50 ` John Groves
  2025-07-04  8:44   ` Amir Goldstein
  2025-07-03 18:50 ` [RFC V2 10/18] famfs_fuse: Basic fuse kernel ABI enablement for famfs John Groves
                   ` (9 subsequent siblings)
  18 siblings, 1 reply; 91+ messages in thread
From: John Groves @ 2025-07-03 18:50 UTC (permalink / raw)
  To: John Groves, Dan Williams, Miklos Szeredi, Bernd Schubert
  Cc: John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Darrick J . Wong, Randy Dunlap, Jeff Layton, Kent Overstreet,
	linux-doc, linux-kernel, nvdimm, linux-cxl, linux-fsdevel,
	Amir Goldstein, Jonathan Cameron, Stefan Hajnoczi, Joanne Koong,
	Josef Bacik, Aravind Ramesh, Ajay Joshi, John Groves

Virtio_fs now needs to determine if an inode is DAX && not famfs.

Signed-off-by: John Groves <john@groves.net>
---
 fs/fuse/dir.c    |  2 +-
 fs/fuse/file.c   | 13 ++++++++-----
 fs/fuse/fuse_i.h |  6 +++++-
 fs/fuse/inode.c  |  2 +-
 fs/fuse/iomode.c |  2 +-
 5 files changed, 16 insertions(+), 9 deletions(-)

diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 8f699c67561f..ad8cdf7b864a 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -1939,7 +1939,7 @@ int fuse_do_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
 		is_truncate = true;
 	}
 
-	if (FUSE_IS_DAX(inode) && is_truncate) {
+	if (FUSE_IS_VIRTIO_DAX(fi) && is_truncate) {
 		filemap_invalidate_lock(mapping);
 		fault_blocked = true;
 		err = fuse_dax_break_layouts(inode, 0, -1);
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 754378dd9f71..93b82660f0c8 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -239,7 +239,7 @@ static int fuse_open(struct inode *inode, struct file *file)
 	int err;
 	bool is_truncate = (file->f_flags & O_TRUNC) && fc->atomic_o_trunc;
 	bool is_wb_truncate = is_truncate && fc->writeback_cache;
-	bool dax_truncate = is_truncate && FUSE_IS_DAX(inode);
+	bool dax_truncate = is_truncate && FUSE_IS_VIRTIO_DAX(fi);
 
 	if (fuse_is_bad(inode))
 		return -EIO;
@@ -1770,11 +1770,12 @@ static ssize_t fuse_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	struct file *file = iocb->ki_filp;
 	struct fuse_file *ff = file->private_data;
 	struct inode *inode = file_inode(file);
+	struct fuse_inode *fi = get_fuse_inode(inode);
 
 	if (fuse_is_bad(inode))
 		return -EIO;
 
-	if (FUSE_IS_DAX(inode))
+	if (FUSE_IS_VIRTIO_DAX(fi))
 		return fuse_dax_read_iter(iocb, to);
 
 	/* FOPEN_DIRECT_IO overrides FOPEN_PASSTHROUGH */
@@ -1791,11 +1792,12 @@ static ssize_t fuse_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	struct file *file = iocb->ki_filp;
 	struct fuse_file *ff = file->private_data;
 	struct inode *inode = file_inode(file);
+	struct fuse_inode *fi = get_fuse_inode(inode);
 
 	if (fuse_is_bad(inode))
 		return -EIO;
 
-	if (FUSE_IS_DAX(inode))
+	if (FUSE_IS_VIRTIO_DAX(fi))
 		return fuse_dax_write_iter(iocb, from);
 
 	/* FOPEN_DIRECT_IO overrides FOPEN_PASSTHROUGH */
@@ -2627,10 +2629,11 @@ static int fuse_file_mmap(struct file *file, struct vm_area_struct *vma)
 	struct fuse_file *ff = file->private_data;
 	struct fuse_conn *fc = ff->fm->fc;
 	struct inode *inode = file_inode(file);
+	struct fuse_inode *fi = get_fuse_inode(inode);
 	int rc;
 
 	/* DAX mmap is superior to direct_io mmap */
-	if (FUSE_IS_DAX(inode))
+	if (FUSE_IS_VIRTIO_DAX(fi))
 		return fuse_dax_mmap(file, vma);
 
 	/*
@@ -3191,7 +3194,7 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
 		.mode = mode
 	};
 	int err;
-	bool block_faults = FUSE_IS_DAX(inode) &&
+	bool block_faults = FUSE_IS_VIRTIO_DAX(fi) &&
 		(!(mode & FALLOC_FL_KEEP_SIZE) ||
 		 (mode & (FALLOC_FL_PUNCH_HOLE | FALLOC_FL_ZERO_RANGE)));
 
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 2086dac7243b..9d87ac48d724 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -1426,7 +1426,11 @@ void fuse_free_conn(struct fuse_conn *fc);
 
 /* dax.c */
 
-#define FUSE_IS_DAX(inode) (IS_ENABLED(CONFIG_FUSE_DAX) && IS_DAX(inode))
+/* This macro is used by virtio_fs, but now it also needs to filter for
+ * "not famfs"
+ */
+#define FUSE_IS_VIRTIO_DAX(fuse_inode) (IS_ENABLED(CONFIG_FUSE_DAX)	\
+					&& IS_DAX(&fuse_inode->inode))
 
 ssize_t fuse_dax_read_iter(struct kiocb *iocb, struct iov_iter *to);
 ssize_t fuse_dax_write_iter(struct kiocb *iocb, struct iov_iter *from);
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index e9db2cb8c150..29147657a99f 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -164,7 +164,7 @@ static void fuse_evict_inode(struct inode *inode)
 	if (inode->i_sb->s_flags & SB_ACTIVE) {
 		struct fuse_conn *fc = get_fuse_conn(inode);
 
-		if (FUSE_IS_DAX(inode))
+		if (FUSE_IS_VIRTIO_DAX(fi))
 			fuse_dax_inode_cleanup(inode);
 		if (fi->nlookup) {
 			fuse_queue_forget(fc, fi->forget, fi->nodeid,
diff --git a/fs/fuse/iomode.c b/fs/fuse/iomode.c
index c99e285f3183..aec4aecb5d79 100644
--- a/fs/fuse/iomode.c
+++ b/fs/fuse/iomode.c
@@ -204,7 +204,7 @@ int fuse_file_io_open(struct file *file, struct inode *inode)
 	 * io modes are not relevant with DAX and with server that does not
 	 * implement open.
 	 */
-	if (FUSE_IS_DAX(inode) || !ff->args)
+	if (FUSE_IS_VIRTIO_DAX(fi) || !ff->args)
 		return 0;
 
 	/*
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [RFC V2 10/18] famfs_fuse: Basic fuse kernel ABI enablement for famfs
  2025-07-03 18:50 [RFC V2 00/18] famfs: port into fuse John Groves
                   ` (8 preceding siblings ...)
  2025-07-03 18:50 ` [RFC V2 09/18] famfs_fuse: Update macro s/FUSE_IS_DAX/FUSE_IS_VIRTIO_DAX/ John Groves
@ 2025-07-03 18:50 ` John Groves
  2025-07-03 22:45   ` John Groves
  2025-07-04  7:54   ` Amir Goldstein
  2025-07-03 18:50 ` [RFC V2 11/18] famfs_fuse: Basic famfs mount opts John Groves
                   ` (8 subsequent siblings)
  18 siblings, 2 replies; 91+ messages in thread
From: John Groves @ 2025-07-03 18:50 UTC (permalink / raw)
  To: John Groves, Dan Williams, Miklos Szeredi, Bernd Schubert
  Cc: John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Darrick J . Wong, Randy Dunlap, Jeff Layton, Kent Overstreet,
	linux-doc, linux-kernel, nvdimm, linux-cxl, linux-fsdevel,
	Amir Goldstein, Jonathan Cameron, Stefan Hajnoczi, Joanne Koong,
	Josef Bacik, Aravind Ramesh, Ajay Joshi, John Groves

* FUSE_DAX_FMAP flag in INIT request/reply

* fuse_conn->famfs_iomap (enable famfs-mapped files) to denote a
  famfs-enabled connection

Signed-off-by: John Groves <john@groves.net>
---
 fs/fuse/fuse_i.h          |  3 +++
 fs/fuse/inode.c           | 14 ++++++++++++++
 include/uapi/linux/fuse.h |  4 ++++
 3 files changed, 21 insertions(+)

diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 9d87ac48d724..a592c1002861 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -873,6 +873,9 @@ struct fuse_conn {
 	/* Use io_uring for communication */
 	unsigned int io_uring;
 
+	/* dev_dax_iomap support for famfs */
+	unsigned int famfs_iomap:1;
+
 	/** Maximum stack depth for passthrough backing files */
 	int max_stack_depth;
 
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 29147657a99f..e48e11c3f9f3 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -1392,6 +1392,18 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
 			}
 			if (flags & FUSE_OVER_IO_URING && fuse_uring_enabled())
 				fc->io_uring = 1;
+			if (IS_ENABLED(CONFIG_FUSE_FAMFS_DAX) &&
+			    flags & FUSE_DAX_FMAP) {
+				/* XXX: Should also check that fuse server
+				 * has CAP_SYS_RAWIO and/or CAP_SYS_ADMIN,
+				 * since it is directing the kernel to access
+				 * dax memory directly - but this function
+				 * appears not to be called in fuse server
+				 * process context (b/c even if it drops
+				 * those capabilities, they are held here).
+				 */
+				fc->famfs_iomap = 1;
+			}
 		} else {
 			ra_pages = fc->max_read / PAGE_SIZE;
 			fc->no_lock = 1;
@@ -1450,6 +1462,8 @@ void fuse_send_init(struct fuse_mount *fm)
 		flags |= FUSE_SUBMOUNTS;
 	if (IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
 		flags |= FUSE_PASSTHROUGH;
+	if (IS_ENABLED(CONFIG_FUSE_FAMFS_DAX))
+		flags |= FUSE_DAX_FMAP;
 
 	/*
 	 * This is just an information flag for fuse server. No need to check
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 5e0eb41d967e..6c384640c79b 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -229,6 +229,8 @@
  *    - FUSE_URING_IN_OUT_HEADER_SZ
  *    - FUSE_URING_OP_IN_OUT_SZ
  *    - enum fuse_uring_cmd
+ *  7.43
+ *    - Add FUSE_DAX_FMAP capability - ability to handle in-kernel fsdax maps
  */
 
 #ifndef _LINUX_FUSE_H
@@ -435,6 +437,7 @@ struct fuse_file_lock {
  *		    of the request ID indicates resend requests
  * FUSE_ALLOW_IDMAP: allow creation of idmapped mounts
  * FUSE_OVER_IO_URING: Indicate that client supports io-uring
+ * FUSE_DAX_FMAP: kernel supports dev_dax_iomap (aka famfs) fmaps
  */
 #define FUSE_ASYNC_READ		(1 << 0)
 #define FUSE_POSIX_LOCKS	(1 << 1)
@@ -482,6 +485,7 @@ struct fuse_file_lock {
 #define FUSE_DIRECT_IO_RELAX	FUSE_DIRECT_IO_ALLOW_MMAP
 #define FUSE_ALLOW_IDMAP	(1ULL << 40)
 #define FUSE_OVER_IO_URING	(1ULL << 41)
+#define FUSE_DAX_FMAP		(1ULL << 42)
 
 /**
  * CUSE INIT request/reply flags
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [RFC V2 11/18] famfs_fuse: Basic famfs mount opts
  2025-07-03 18:50 [RFC V2 00/18] famfs: port into fuse John Groves
                   ` (9 preceding siblings ...)
  2025-07-03 18:50 ` [RFC V2 10/18] famfs_fuse: Basic fuse kernel ABI enablement for famfs John Groves
@ 2025-07-03 18:50 ` John Groves
  2025-07-09  3:59   ` Darrick J. Wong
  2025-07-03 18:50 ` [RFC V2 12/18] famfs_fuse: Plumb the GET_FMAP message/response John Groves
                   ` (7 subsequent siblings)
  18 siblings, 1 reply; 91+ messages in thread
From: John Groves @ 2025-07-03 18:50 UTC (permalink / raw)
  To: John Groves, Dan Williams, Miklos Szeredi, Bernd Schubert
  Cc: John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Darrick J . Wong, Randy Dunlap, Jeff Layton, Kent Overstreet,
	linux-doc, linux-kernel, nvdimm, linux-cxl, linux-fsdevel,
	Amir Goldstein, Jonathan Cameron, Stefan Hajnoczi, Joanne Koong,
	Josef Bacik, Aravind Ramesh, Ajay Joshi, John Groves

* -o shadow=<shadowpath>
* -o daxdev=<daxdev>

Signed-off-by: John Groves <john@groves.net>
---
 fs/fuse/fuse_i.h |  8 +++++++-
 fs/fuse/inode.c  | 28 +++++++++++++++++++++++++++-
 2 files changed, 34 insertions(+), 2 deletions(-)

diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index a592c1002861..f4ee61046578 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -583,9 +583,11 @@ struct fuse_fs_context {
 	unsigned int blksize;
 	const char *subtype;
 
-	/* DAX device, may be NULL */
+	/* DAX device for virtiofs, may be NULL */
 	struct dax_device *dax_dev;
 
+	const char *shadow; /* famfs - null if not famfs */
+
 	/* fuse_dev pointer to fill in, should contain NULL on entry */
 	void **fudptr;
 };
@@ -941,6 +943,10 @@ struct fuse_conn {
 	/**  uring connection information*/
 	struct fuse_ring *ring;
 #endif
+
+#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
+	char *shadow;
+#endif
 };
 
 /*
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index e48e11c3f9f3..a7e1cf8257b0 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -766,6 +766,9 @@ enum {
 	OPT_ALLOW_OTHER,
 	OPT_MAX_READ,
 	OPT_BLKSIZE,
+#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
+	OPT_SHADOW,
+#endif
 	OPT_ERR
 };
 
@@ -780,6 +783,9 @@ static const struct fs_parameter_spec fuse_fs_parameters[] = {
 	fsparam_u32	("max_read",		OPT_MAX_READ),
 	fsparam_u32	("blksize",		OPT_BLKSIZE),
 	fsparam_string	("subtype",		OPT_SUBTYPE),
+#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
+	fsparam_string("shadow",		OPT_SHADOW),
+#endif
 	{}
 };
 
@@ -875,6 +881,15 @@ static int fuse_parse_param(struct fs_context *fsc, struct fs_parameter *param)
 		ctx->blksize = result.uint_32;
 		break;
 
+#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
+	case OPT_SHADOW:
+		if (ctx->shadow)
+			return invalfc(fsc, "Multiple shadows specified");
+		ctx->shadow = param->string;
+		param->string = NULL;
+		break;
+#endif
+
 	default:
 		return -EINVAL;
 	}
@@ -888,6 +903,7 @@ static void fuse_free_fsc(struct fs_context *fsc)
 
 	if (ctx) {
 		kfree(ctx->subtype);
+		kfree(ctx->shadow);
 		kfree(ctx);
 	}
 }
@@ -919,7 +935,10 @@ static int fuse_show_options(struct seq_file *m, struct dentry *root)
 	else if (fc->dax_mode == FUSE_DAX_INODE_USER)
 		seq_puts(m, ",dax=inode");
 #endif
-
+#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
+	if (fc->shadow)
+		seq_printf(m, ",shadow=%s", fc->shadow);
+#endif
 	return 0;
 }
 
@@ -1017,6 +1036,9 @@ void fuse_conn_put(struct fuse_conn *fc)
 		}
 		if (IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
 			fuse_backing_files_free(fc);
+#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
+		kfree(fc->shadow);
+#endif
 		call_rcu(&fc->rcu, delayed_release);
 	}
 }
@@ -1834,6 +1856,10 @@ int fuse_fill_super_common(struct super_block *sb, struct fuse_fs_context *ctx)
 	sb->s_root = root_dentry;
 	if (ctx->fudptr)
 		*ctx->fudptr = fud;
+
+#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
+	fc->shadow = kstrdup(ctx->shadow, GFP_KERNEL);
+#endif
 	mutex_unlock(&fuse_mutex);
 	return 0;
 
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [RFC V2 12/18] famfs_fuse: Plumb the GET_FMAP message/response
  2025-07-03 18:50 [RFC V2 00/18] famfs: port into fuse John Groves
                   ` (10 preceding siblings ...)
  2025-07-03 18:50 ` [RFC V2 11/18] famfs_fuse: Basic famfs mount opts John Groves
@ 2025-07-03 18:50 ` John Groves
  2025-07-04  8:54   ` Amir Goldstein
                     ` (2 more replies)
  2025-07-03 18:50 ` [RFC V2 13/18] famfs_fuse: Create files with famfs fmaps John Groves
                   ` (6 subsequent siblings)
  18 siblings, 3 replies; 91+ messages in thread
From: John Groves @ 2025-07-03 18:50 UTC (permalink / raw)
  To: John Groves, Dan Williams, Miklos Szeredi, Bernd Schubert
  Cc: John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Darrick J . Wong, Randy Dunlap, Jeff Layton, Kent Overstreet,
	linux-doc, linux-kernel, nvdimm, linux-cxl, linux-fsdevel,
	Amir Goldstein, Jonathan Cameron, Stefan Hajnoczi, Joanne Koong,
	Josef Bacik, Aravind Ramesh, Ajay Joshi, John Groves

Upon completion of an OPEN, if we're in famfs-mode we do a GET_FMAP to
retrieve and cache up the file-to-dax map in the kernel. If this
succeeds, read/write/mmap are resolved direct-to-dax with no upcalls.

GET_FMAP has a variable-size response payload, and the allocated size
is sent in the in_args[0].size field. If the fmap would overflow the
message, the fuse server sends a reply of size 'sizeof(uint32_t)' which
specifies the size of the fmap message. Then the kernel can realloc a
large enough buffer and try again.

Signed-off-by: John Groves <john@groves.net>
---
 fs/fuse/file.c            | 84 +++++++++++++++++++++++++++++++++++++++
 fs/fuse/fuse_i.h          | 36 ++++++++++++++++-
 fs/fuse/inode.c           | 19 +++++++--
 fs/fuse/iomode.c          |  2 +-
 include/uapi/linux/fuse.h | 18 +++++++++
 5 files changed, 154 insertions(+), 5 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 93b82660f0c8..8616fb0a6d61 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -230,6 +230,77 @@ static void fuse_truncate_update_attr(struct inode *inode, struct file *file)
 	fuse_invalidate_attr_mask(inode, FUSE_STATX_MODSIZE);
 }
 
+#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
+
+#define FMAP_BUFSIZE 4096
+
+static int
+fuse_get_fmap(struct fuse_mount *fm, struct inode *inode, u64 nodeid)
+{
+	struct fuse_get_fmap_in inarg = { 0 };
+	size_t fmap_bufsize = FMAP_BUFSIZE;
+	ssize_t fmap_size;
+	int retries = 1;
+	void *fmap_buf;
+	int rc;
+
+	FUSE_ARGS(args);
+
+	fmap_buf = kcalloc(1, FMAP_BUFSIZE, GFP_KERNEL);
+	if (!fmap_buf)
+		return -EIO;
+
+ retry_once:
+	inarg.size = fmap_bufsize;
+
+	args.opcode = FUSE_GET_FMAP;
+	args.nodeid = nodeid;
+
+	args.in_numargs = 1;
+	args.in_args[0].size = sizeof(inarg);
+	args.in_args[0].value = &inarg;
+
+	/* Variable-sized output buffer
+	 * this causes fuse_simple_request() to return the size of the
+	 * output payload
+	 */
+	args.out_argvar = true;
+	args.out_numargs = 1;
+	args.out_args[0].size = fmap_bufsize;
+	args.out_args[0].value = fmap_buf;
+
+	/* Send GET_FMAP command */
+	rc = fuse_simple_request(fm, &args);
+	if (rc < 0) {
+		pr_err("%s: err=%d from fuse_simple_request()\n",
+		       __func__, rc);
+		return rc;
+	}
+	fmap_size = rc;
+
+	if (retries && fmap_size == sizeof(uint32_t)) {
+		/* fmap size exceeded fmap_bufsize;
+		 * actual fmap size returned in fmap_buf;
+		 * realloc and retry once
+		 */
+		fmap_bufsize = *((uint32_t *)fmap_buf);
+
+		--retries;
+		kfree(fmap_buf);
+		fmap_buf = kcalloc(1, fmap_bufsize, GFP_KERNEL);
+		if (!fmap_buf)
+			return -EIO;
+
+		goto retry_once;
+	}
+
+	/* Will call famfs_file_init_dax() when that gets added */
+
+	kfree(fmap_buf);
+	return 0;
+}
+#endif
+
 static int fuse_open(struct inode *inode, struct file *file)
 {
 	struct fuse_mount *fm = get_fuse_mount(inode);
@@ -263,6 +334,19 @@ static int fuse_open(struct inode *inode, struct file *file)
 
 	err = fuse_do_open(fm, get_node_id(inode), file, false);
 	if (!err) {
+#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
+		if (fm->fc->famfs_iomap) {
+			if (S_ISREG(inode->i_mode)) {
+				int rc;
+				/* Get the famfs fmap */
+				rc = fuse_get_fmap(fm, inode,
+						   get_node_id(inode));
+				if (rc)
+					pr_err("%s: fuse_get_fmap err=%d\n",
+					       __func__, rc);
+			}
+		}
+#endif
 		ff = file->private_data;
 		err = fuse_finish_open(inode, file);
 		if (err)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index f4ee61046578..e01d6e5c6e93 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -193,6 +193,10 @@ struct fuse_inode {
 	/** Reference to backing file in passthrough mode */
 	struct fuse_backing *fb;
 #endif
+
+#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
+	void *famfs_meta;
+#endif
 };
 
 /** FUSE inode state bits */
@@ -945,6 +949,8 @@ struct fuse_conn {
 #endif
 
 #if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
+	struct rw_semaphore famfs_devlist_sem;
+	struct famfs_dax_devlist *dax_devlist;
 	char *shadow;
 #endif
 };
@@ -1435,11 +1441,14 @@ void fuse_free_conn(struct fuse_conn *fc);
 
 /* dax.c */
 
+static inline int fuse_file_famfs(struct fuse_inode *fi); /* forward */
+
 /* This macro is used by virtio_fs, but now it also needs to filter for
  * "not famfs"
  */
 #define FUSE_IS_VIRTIO_DAX(fuse_inode) (IS_ENABLED(CONFIG_FUSE_DAX)	\
-					&& IS_DAX(&fuse_inode->inode))
+					&& IS_DAX(&fuse_inode->inode)	\
+					&& !fuse_file_famfs(fuse_inode))
 
 ssize_t fuse_dax_read_iter(struct kiocb *iocb, struct iov_iter *to);
 ssize_t fuse_dax_write_iter(struct kiocb *iocb, struct iov_iter *from);
@@ -1550,4 +1559,29 @@ extern void fuse_sysctl_unregister(void);
 #define fuse_sysctl_unregister()	do { } while (0)
 #endif /* CONFIG_SYSCTL */
 
+/* famfs.c */
+static inline struct fuse_backing *famfs_meta_set(struct fuse_inode *fi,
+						       void *meta)
+{
+#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
+	return xchg(&fi->famfs_meta, meta);
+#else
+	return NULL;
+#endif
+}
+
+static inline void famfs_meta_free(struct fuse_inode *fi)
+{
+	/* Stub wil be connected in a subsequent commit */
+}
+
+static inline int fuse_file_famfs(struct fuse_inode *fi)
+{
+#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
+	return (READ_ONCE(fi->famfs_meta) != NULL);
+#else
+	return 0;
+#endif
+}
+
 #endif /* _FS_FUSE_I_H */
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index a7e1cf8257b0..b071d16f7d04 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -117,6 +117,9 @@ static struct inode *fuse_alloc_inode(struct super_block *sb)
 	if (IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
 		fuse_inode_backing_set(fi, NULL);
 
+	if (IS_ENABLED(CONFIG_FUSE_FAMFS_DAX))
+		famfs_meta_set(fi, NULL);
+
 	return &fi->inode;
 
 out_free_forget:
@@ -138,6 +141,13 @@ static void fuse_free_inode(struct inode *inode)
 	if (IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
 		fuse_backing_put(fuse_inode_backing(fi));
 
+#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
+	if (S_ISREG(inode->i_mode) && fi->famfs_meta) {
+		famfs_meta_free(fi);
+		famfs_meta_set(fi, NULL);
+	}
+#endif
+
 	kmem_cache_free(fuse_inode_cachep, fi);
 }
 
@@ -1002,6 +1012,9 @@ void fuse_conn_init(struct fuse_conn *fc, struct fuse_mount *fm,
 	if (IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
 		fuse_backing_files_init(fc);
 
+	if (IS_ENABLED(CONFIG_FUSE_FAMFS_DAX))
+		pr_notice("%s: Kernel is FUSE_FAMFS_DAX capable\n", __func__);
+
 	INIT_LIST_HEAD(&fc->mounts);
 	list_add(&fm->fc_entry, &fc->mounts);
 	fm->fc = fc;
@@ -1036,9 +1049,8 @@ void fuse_conn_put(struct fuse_conn *fc)
 		}
 		if (IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
 			fuse_backing_files_free(fc);
-#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
-		kfree(fc->shadow);
-#endif
+		if (IS_ENABLED(CONFIG_FUSE_FAMFS_DAX))
+			kfree(fc->shadow);
 		call_rcu(&fc->rcu, delayed_release);
 	}
 }
@@ -1425,6 +1437,7 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
 				 * those capabilities, they are held here).
 				 */
 				fc->famfs_iomap = 1;
+				init_rwsem(&fc->famfs_devlist_sem);
 			}
 		} else {
 			ra_pages = fc->max_read / PAGE_SIZE;
diff --git a/fs/fuse/iomode.c b/fs/fuse/iomode.c
index aec4aecb5d79..443b337b0c05 100644
--- a/fs/fuse/iomode.c
+++ b/fs/fuse/iomode.c
@@ -204,7 +204,7 @@ int fuse_file_io_open(struct file *file, struct inode *inode)
 	 * io modes are not relevant with DAX and with server that does not
 	 * implement open.
 	 */
-	if (FUSE_IS_VIRTIO_DAX(fi) || !ff->args)
+	if (FUSE_IS_VIRTIO_DAX(fi) || fuse_file_famfs(fi) || !ff->args)
 		return 0;
 
 	/*
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 6c384640c79b..dff5aa62543e 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -654,6 +654,10 @@ enum fuse_opcode {
 	FUSE_TMPFILE		= 51,
 	FUSE_STATX		= 52,
 
+	/* Famfs / devdax opcodes */
+	FUSE_GET_FMAP           = 53,
+	FUSE_GET_DAXDEV         = 54,
+
 	/* CUSE specific operations */
 	CUSE_INIT		= 4096,
 
@@ -888,6 +892,16 @@ struct fuse_access_in {
 	uint32_t	padding;
 };
 
+struct fuse_get_fmap_in {
+	uint32_t	size;
+	uint32_t	padding;
+};
+
+struct fuse_get_fmap_out {
+	uint32_t	size;
+	uint32_t	padding;
+};
+
 struct fuse_init_in {
 	uint32_t	major;
 	uint32_t	minor;
@@ -1284,4 +1298,8 @@ struct fuse_uring_cmd_req {
 	uint8_t padding[6];
 };
 
+/* Famfs fmap message components */
+
+#define FAMFS_FMAP_MAX 32768 /* Largest supported fmap message */
+
 #endif /* _LINUX_FUSE_H */
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [RFC V2 13/18] famfs_fuse: Create files with famfs fmaps
  2025-07-03 18:50 [RFC V2 00/18] famfs: port into fuse John Groves
                   ` (11 preceding siblings ...)
  2025-07-03 18:50 ` [RFC V2 12/18] famfs_fuse: Plumb the GET_FMAP message/response John Groves
@ 2025-07-03 18:50 ` John Groves
  2025-07-04  9:01   ` Amir Goldstein
  2025-07-03 18:50 ` [RFC V2 14/18] famfs_fuse: GET_DAXDEV message and daxdev_table John Groves
                   ` (5 subsequent siblings)
  18 siblings, 1 reply; 91+ messages in thread
From: John Groves @ 2025-07-03 18:50 UTC (permalink / raw)
  To: John Groves, Dan Williams, Miklos Szeredi, Bernd Schubert
  Cc: John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Darrick J . Wong, Randy Dunlap, Jeff Layton, Kent Overstreet,
	linux-doc, linux-kernel, nvdimm, linux-cxl, linux-fsdevel,
	Amir Goldstein, Jonathan Cameron, Stefan Hajnoczi, Joanne Koong,
	Josef Bacik, Aravind Ramesh, Ajay Joshi, John Groves

On completion of GET_FMAP message/response, setup the full famfs
metadata such that it's possible to handle read/write/mmap directly to
dax. Note that the devdax_iomap plumbing is not in yet...

Update MAINTAINERS for the new files.

Signed-off-by: John Groves <john@groves.net>
---
 MAINTAINERS               |   9 +
 fs/fuse/Makefile          |   2 +-
 fs/fuse/famfs.c           | 360 ++++++++++++++++++++++++++++++++++++++
 fs/fuse/famfs_kfmap.h     |  63 +++++++
 fs/fuse/file.c            |  15 +-
 fs/fuse/fuse_i.h          |  16 +-
 fs/fuse/inode.c           |   2 +-
 include/uapi/linux/fuse.h |  56 ++++++
 8 files changed, 518 insertions(+), 5 deletions(-)
 create mode 100644 fs/fuse/famfs.c
 create mode 100644 fs/fuse/famfs_kfmap.h

diff --git a/MAINTAINERS b/MAINTAINERS
index c0d5232a473b..02688f27a4d0 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8808,6 +8808,15 @@ F:	Documentation/networking/failover.rst
 F:	include/net/failover.h
 F:	net/core/failover.c
 
+FAMFS
+M:	John Groves <jgroves@micron.com>
+M:	John Groves <John@Groves.net>
+L:	linux-cxl@vger.kernel.org
+L:	linux-fsdevel@vger.kernel.org
+S:	Supported
+F:	fs/fuse/famfs.c
+F:	fs/fuse/famfs_kfmap.h
+
 FANOTIFY
 M:	Jan Kara <jack@suse.cz>
 R:	Amir Goldstein <amir73il@gmail.com>
diff --git a/fs/fuse/Makefile b/fs/fuse/Makefile
index 3f0f312a31c1..65a12975d734 100644
--- a/fs/fuse/Makefile
+++ b/fs/fuse/Makefile
@@ -16,5 +16,5 @@ fuse-$(CONFIG_FUSE_DAX) += dax.o
 fuse-$(CONFIG_FUSE_PASSTHROUGH) += passthrough.o
 fuse-$(CONFIG_SYSCTL) += sysctl.o
 fuse-$(CONFIG_FUSE_IO_URING) += dev_uring.o
-
+fuse-$(CONFIG_FUSE_FAMFS_DAX) += famfs.o
 virtiofs-y := virtio_fs.o
diff --git a/fs/fuse/famfs.c b/fs/fuse/famfs.c
new file mode 100644
index 000000000000..41c4d92f1451
--- /dev/null
+++ b/fs/fuse/famfs.c
@@ -0,0 +1,360 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * famfs - dax file system for shared fabric-attached memory
+ *
+ * Copyright 2023-2025 Micron Technology, Inc.
+ *
+ * This file system, originally based on ramfs the dax support from xfs,
+ * is intended to allow multiple host systems to mount a common file system
+ * view of dax files that map to shared memory.
+ */
+
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/dax.h>
+#include <linux/iomap.h>
+#include <linux/path.h>
+#include <linux/namei.h>
+#include <linux/string.h>
+
+#include "famfs_kfmap.h"
+#include "fuse_i.h"
+
+
+void
+__famfs_meta_free(void *famfs_meta)
+{
+	struct famfs_file_meta *fmap = famfs_meta;
+
+	if (!fmap)
+		return;
+
+	if (fmap) {
+		switch (fmap->fm_extent_type) {
+		case SIMPLE_DAX_EXTENT:
+			kfree(fmap->se);
+			break;
+		case INTERLEAVED_EXTENT:
+			if (fmap->ie)
+				kfree(fmap->ie->ie_strips);
+
+			kfree(fmap->ie);
+			break;
+		default:
+			pr_err("%s: invalid fmap type\n", __func__);
+			break;
+		}
+	}
+	kfree(fmap);
+}
+
+static int
+famfs_check_ext_alignment(struct famfs_meta_simple_ext *se)
+{
+	int errs = 0;
+
+	if (se->dev_index != 0)
+		errs++;
+
+	/* TODO: pass in alignment so we can support the other page sizes */
+	if (!IS_ALIGNED(se->ext_offset, PMD_SIZE))
+		errs++;
+
+	if (!IS_ALIGNED(se->ext_len, PMD_SIZE))
+		errs++;
+
+	return errs;
+}
+
+/**
+ * famfs_fuse_meta_alloc() - Allocate famfs file metadata
+ * @metap:       Pointer to an mcache_map_meta pointer
+ * @ext_count:  The number of extents needed
+ *
+ * Returns: 0=success
+ *          -errno=failure
+ */
+static int
+famfs_fuse_meta_alloc(
+	void *fmap_buf,
+	size_t fmap_buf_size,
+	struct famfs_file_meta **metap)
+{
+	struct famfs_file_meta *meta = NULL;
+	struct fuse_famfs_fmap_header *fmh;
+	size_t extent_total = 0;
+	size_t next_offset = 0;
+	int errs = 0;
+	int i, j;
+	int rc;
+
+	fmh = (struct fuse_famfs_fmap_header *)fmap_buf;
+
+	/* Move past fmh in fmap_buf */
+	next_offset += sizeof(*fmh);
+	if (next_offset > fmap_buf_size) {
+		pr_err("%s:%d: fmap_buf underflow offset/size %ld/%ld\n",
+		       __func__, __LINE__, next_offset, fmap_buf_size);
+		return -EINVAL;
+	}
+
+	if (fmh->nextents < 1) {
+		pr_err("%s: nextents %d < 1\n", __func__, fmh->nextents);
+		return -EINVAL;
+	}
+
+	if (fmh->nextents > FUSE_FAMFS_MAX_EXTENTS) {
+		pr_err("%s: nextents %d > max (%d) 1\n",
+		       __func__, fmh->nextents, FUSE_FAMFS_MAX_EXTENTS);
+		return -E2BIG;
+	}
+
+	meta = kzalloc(sizeof(*meta), GFP_KERNEL);
+	if (!meta)
+		return -ENOMEM;
+
+	meta->error = false;
+	meta->file_type = fmh->file_type;
+	meta->file_size = fmh->file_size;
+	meta->fm_extent_type = fmh->ext_type;
+
+	switch (fmh->ext_type) {
+	case FUSE_FAMFS_EXT_SIMPLE: {
+		struct fuse_famfs_simple_ext *se_in;
+
+		se_in = (struct fuse_famfs_simple_ext *)(fmap_buf + next_offset);
+
+		/* Move past simple extents */
+		next_offset += fmh->nextents * sizeof(*se_in);
+		if (next_offset > fmap_buf_size) {
+			pr_err("%s:%d: fmap_buf underflow offset/size %ld/%ld\n",
+			       __func__, __LINE__, next_offset, fmap_buf_size);
+			rc = -EINVAL;
+			goto errout;
+		}
+
+		meta->fm_nextents = fmh->nextents;
+
+		meta->se = kcalloc(meta->fm_nextents, sizeof(*(meta->se)),
+				   GFP_KERNEL);
+		if (!meta->se) {
+			rc = -ENOMEM;
+			goto errout;
+		}
+
+		if ((meta->fm_nextents > FUSE_FAMFS_MAX_EXTENTS) ||
+		    (meta->fm_nextents < 1)) {
+			rc = -EINVAL;
+			goto errout;
+		}
+
+		for (i = 0; i < fmh->nextents; i++) {
+			meta->se[i].dev_index  = se_in[i].se_devindex;
+			meta->se[i].ext_offset = se_in[i].se_offset;
+			meta->se[i].ext_len    = se_in[i].se_len;
+
+			/* Record bitmap of referenced daxdev indices */
+			meta->dev_bitmap |= (1 << meta->se[i].dev_index);
+
+			errs += famfs_check_ext_alignment(&meta->se[i]);
+
+			extent_total += meta->se[i].ext_len;
+		}
+		break;
+	}
+
+	case FUSE_FAMFS_EXT_INTERLEAVE: {
+		s64 size_remainder = meta->file_size;
+		struct fuse_famfs_iext *ie_in;
+		int niext = fmh->nextents;
+
+		meta->fm_niext = niext;
+
+		/* Allocate interleaved extent */
+		meta->ie = kcalloc(niext, sizeof(*(meta->ie)), GFP_KERNEL);
+		if (!meta->ie) {
+			rc = -ENOMEM;
+			goto errout;
+		}
+
+		/*
+		 * Each interleaved extent has a simple extent list of strips.
+		 * Outer loop is over separate interleaved extents
+		 */
+		for (i = 0; i < niext; i++) {
+			u64 nstrips;
+			struct fuse_famfs_simple_ext *sie_in;
+
+			/* ie_in = one interleaved extent in fmap_buf */
+			ie_in = (struct fuse_famfs_iext *)
+				(fmap_buf + next_offset);
+
+			/* Move past one interleaved extent header in fmap_buf */
+			next_offset += sizeof(*ie_in);
+			if (next_offset > fmap_buf_size) {
+				pr_err("%s:%d: fmap_buf underflow offset/size %ld/%ld\n",
+				       __func__, __LINE__, next_offset,
+				       fmap_buf_size);
+				rc = -EINVAL;
+				goto errout;
+			}
+
+			nstrips = ie_in->ie_nstrips;
+			meta->ie[i].fie_chunk_size = ie_in->ie_chunk_size;
+			meta->ie[i].fie_nstrips    = ie_in->ie_nstrips;
+			meta->ie[i].fie_nbytes     = ie_in->ie_nbytes;
+
+			if (!meta->ie[i].fie_nbytes) {
+				pr_err("%s: zero-length interleave!\n",
+				       __func__);
+				rc = -EINVAL;
+				goto errout;
+			}
+
+			/* sie_in = the strip extents in fmap_buf */
+			sie_in = (struct fuse_famfs_simple_ext *)
+				(fmap_buf + next_offset);
+
+			/* Move past strip extents in fmap_buf */
+			next_offset += nstrips * sizeof(*sie_in);
+			if (next_offset > fmap_buf_size) {
+				pr_err("%s:%d: fmap_buf underflow offset/size %ld/%ld\n",
+				       __func__, __LINE__, next_offset,
+				       fmap_buf_size);
+				rc = -EINVAL;
+				goto errout;
+			}
+
+			if ((nstrips > FUSE_FAMFS_MAX_STRIPS) || (nstrips < 1)) {
+				pr_err("%s: invalid nstrips=%lld (max=%d)\n",
+				       __func__, nstrips,
+				       FUSE_FAMFS_MAX_STRIPS);
+				errs++;
+			}
+
+			/* Allocate strip extent array */
+			meta->ie[i].ie_strips = kcalloc(ie_in->ie_nstrips,
+					sizeof(meta->ie[i].ie_strips[0]),
+							GFP_KERNEL);
+			if (!meta->ie[i].ie_strips) {
+				rc = -ENOMEM;
+				goto errout;
+			}
+
+			/* Inner loop is over strips */
+			for (j = 0; j < nstrips; j++) {
+				struct famfs_meta_simple_ext *strips_out;
+				u64 devindex = sie_in[j].se_devindex;
+				u64 offset   = sie_in[j].se_offset;
+				u64 len      = sie_in[j].se_len;
+
+				strips_out = meta->ie[i].ie_strips;
+				strips_out[j].dev_index  = devindex;
+				strips_out[j].ext_offset = offset;
+				strips_out[j].ext_len    = len;
+
+				/* Record bitmap of referenced daxdev indices */
+				meta->dev_bitmap |= (1 << devindex);
+
+				extent_total += len;
+				errs += famfs_check_ext_alignment(&strips_out[j]);
+				size_remainder -= len;
+			}
+		}
+
+		if (size_remainder > 0) {
+			/* Sum of interleaved extent sizes is less than file size! */
+			pr_err("%s: size_remainder %lld (0x%llx)\n",
+			       __func__, size_remainder, size_remainder);
+			rc = -EINVAL;
+			goto errout;
+		}
+		break;
+	}
+
+	default:
+		pr_err("%s: invalid ext_type %d\n", __func__, fmh->ext_type);
+		rc = -EINVAL;
+		goto errout;
+	}
+
+	if (errs > 0) {
+		pr_err("%s: %d alignment errors found\n", __func__, errs);
+		rc = -EINVAL;
+		goto errout;
+	}
+
+	/* More sanity checks */
+	if (extent_total < meta->file_size) {
+		pr_err("%s: file size %ld larger than map size %ld\n",
+		       __func__, meta->file_size, extent_total);
+		rc = -EINVAL;
+		goto errout;
+	}
+
+	*metap = meta;
+
+	return 0;
+errout:
+	__famfs_meta_free(meta);
+	return rc;
+}
+
+/**
+ * famfs_file_init_dax() - init famfs dax file metadata
+ *
+ * @fm:        fuse_mount
+ * @inode:     the inode
+ * @fmap_buf:  fmap response message
+ * @fmap_size: Size of the fmap message
+ *
+ * Initialize famfs metadata for a file, based on the contents of the GET_FMAP
+ * response
+ *
+ * Return: 0=success
+ *          -errno=failure
+ */
+int
+famfs_file_init_dax(
+	struct fuse_mount *fm,
+	struct inode *inode,
+	void *fmap_buf,
+	size_t fmap_size)
+{
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	struct famfs_file_meta *meta = NULL;
+	int rc;
+
+	if (fi->famfs_meta) {
+		pr_notice("%s: i_no=%ld fmap_size=%ld ALREADY INITIALIZED\n",
+			  __func__,
+			  inode->i_ino, fmap_size);
+		return -EEXIST;
+	}
+
+	rc = famfs_fuse_meta_alloc(fmap_buf, fmap_size, &meta);
+	if (rc)
+		goto errout;
+
+	/* Publish the famfs metadata on fi->famfs_meta */
+	inode_lock(inode);
+	if (fi->famfs_meta) {
+		rc = -EEXIST; /* file already has famfs metadata */
+	} else {
+		if (famfs_meta_set(fi, meta) != NULL) {
+			pr_err("%s: file already had metadata\n", __func__);
+			rc = -EALREADY;
+			goto errout;
+		}
+		i_size_write(inode, meta->file_size);
+		inode->i_flags |= S_DAX;
+	}
+	inode_unlock(inode);
+
+ errout:
+	if (rc)
+		__famfs_meta_free(meta);
+
+	return rc;
+}
+
diff --git a/fs/fuse/famfs_kfmap.h b/fs/fuse/famfs_kfmap.h
new file mode 100644
index 000000000000..ce785d76719c
--- /dev/null
+++ b/fs/fuse/famfs_kfmap.h
@@ -0,0 +1,63 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * famfs - dax file system for shared fabric-attached memory
+ *
+ * Copyright 2023-2025 Micron Technology, Inc.
+ */
+#ifndef FAMFS_KFMAP_H
+#define FAMFS_KFMAP_H
+
+/*
+ * These structures are the in-memory metadata format for famfs files. Metadata
+ * retrieved via the GET_FMAP response is converted to this format for use in
+ * resolving file mapping faults.
+ */
+
+enum famfs_file_type {
+	FAMFS_REG,
+	FAMFS_SUPERBLOCK,
+	FAMFS_LOG,
+};
+
+/* We anticipate the possiblity of supporting additional types of extents */
+enum famfs_extent_type {
+	SIMPLE_DAX_EXTENT,
+	INTERLEAVED_EXTENT,
+	INVALID_EXTENT_TYPE,
+};
+
+struct famfs_meta_simple_ext {
+	u64 dev_index;
+	u64 ext_offset;
+	u64 ext_len;
+};
+
+struct famfs_meta_interleaved_ext {
+	u64 fie_nstrips;
+	u64 fie_chunk_size;
+	u64 fie_nbytes;
+	struct famfs_meta_simple_ext *ie_strips;
+};
+
+/*
+ * Each famfs dax file has this hanging from its fuse_inode->famfs_meta
+ */
+struct famfs_file_meta {
+	bool                   error;
+	enum famfs_file_type   file_type;
+	size_t                 file_size;
+	enum famfs_extent_type fm_extent_type;
+	u64 dev_bitmap; /* bitmap of referenced daxdevs by index */
+	union { /* This will make code a bit more readable */
+		struct {
+			size_t         fm_nextents;
+			struct famfs_meta_simple_ext  *se;
+		};
+		struct {
+			size_t         fm_niext;
+			struct famfs_meta_interleaved_ext *ie;
+		};
+	};
+};
+
+#endif /* FAMFS_KFMAP_H */
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 8616fb0a6d61..5d205eadb48f 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -237,6 +237,7 @@ static void fuse_truncate_update_attr(struct inode *inode, struct file *file)
 static int
 fuse_get_fmap(struct fuse_mount *fm, struct inode *inode, u64 nodeid)
 {
+	struct fuse_inode *fi = get_fuse_inode(inode);
 	struct fuse_get_fmap_in inarg = { 0 };
 	size_t fmap_bufsize = FMAP_BUFSIZE;
 	ssize_t fmap_size;
@@ -246,6 +247,10 @@ fuse_get_fmap(struct fuse_mount *fm, struct inode *inode, u64 nodeid)
 
 	FUSE_ARGS(args);
 
+	/* Don't retrieve if we already have the famfs metadata */
+	if (fi->famfs_meta)
+		return 0;
+
 	fmap_buf = kcalloc(1, FMAP_BUFSIZE, GFP_KERNEL);
 	if (!fmap_buf)
 		return -EIO;
@@ -285,6 +290,13 @@ fuse_get_fmap(struct fuse_mount *fm, struct inode *inode, u64 nodeid)
 		 */
 		fmap_bufsize = *((uint32_t *)fmap_buf);
 
+		if (fmap_bufsize < fmap_msg_min_size()
+		    || fmap_bufsize > FAMFS_FMAP_MAX) {
+			pr_err("%s: fmap_size=%ld out of range\n",
+			       __func__, fmap_bufsize);
+			return -EIO;
+		}
+
 		--retries;
 		kfree(fmap_buf);
 		fmap_buf = kcalloc(1, fmap_bufsize, GFP_KERNEL);
@@ -294,7 +306,8 @@ fuse_get_fmap(struct fuse_mount *fm, struct inode *inode, u64 nodeid)
 		goto retry_once;
 	}
 
-	/* Will call famfs_file_init_dax() when that gets added */
+	/* Convert fmap into in-memory format and hang from inode */
+	famfs_file_init_dax(fm, inode, fmap_buf, fmap_size);
 
 	kfree(fmap_buf);
 	return 0;
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index e01d6e5c6e93..fb6095655403 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -1560,11 +1560,18 @@ extern void fuse_sysctl_unregister(void);
 #endif /* CONFIG_SYSCTL */
 
 /* famfs.c */
+#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
+int famfs_file_init_dax(struct fuse_mount *fm,
+			     struct inode *inode, void *fmap_buf,
+			     size_t fmap_size);
+void __famfs_meta_free(void *map);
+#endif
+
 static inline struct fuse_backing *famfs_meta_set(struct fuse_inode *fi,
 						       void *meta)
 {
 #if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
-	return xchg(&fi->famfs_meta, meta);
+	return cmpxchg(&fi->famfs_meta, NULL, meta);
 #else
 	return NULL;
 #endif
@@ -1572,7 +1579,12 @@ static inline struct fuse_backing *famfs_meta_set(struct fuse_inode *fi,
 
 static inline void famfs_meta_free(struct fuse_inode *fi)
 {
-	/* Stub wil be connected in a subsequent commit */
+#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
+	if (fi->famfs_meta != NULL) {
+		__famfs_meta_free(fi->famfs_meta);
+		famfs_meta_set(fi, NULL);
+	}
+#endif
 }
 
 static inline int fuse_file_famfs(struct fuse_inode *fi)
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index b071d16f7d04..1682755abf30 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -118,7 +118,7 @@ static struct inode *fuse_alloc_inode(struct super_block *sb)
 		fuse_inode_backing_set(fi, NULL);
 
 	if (IS_ENABLED(CONFIG_FUSE_FAMFS_DAX))
-		famfs_meta_set(fi, NULL);
+		fi->famfs_meta = NULL; /* XXX new inodes currently not zeroed; why not? */
 
 	return &fi->inode;
 
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index dff5aa62543e..ecaaa62910f0 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -231,6 +231,13 @@
  *    - enum fuse_uring_cmd
  *  7.43
  *    - Add FUSE_DAX_FMAP capability - ability to handle in-kernel fsdax maps
+ *    - Add the following structures for the GET_FMAP message reply components:
+ *      - struct fuse_famfs_simple_ext
+ *      - struct fuse_famfs_iext
+ *      - struct fuse_famfs_fmap_header
+ *    - Add the following enumerated types
+ *      - enum fuse_famfs_file_type
+ *      - enum famfs_ext_type
  */
 
 #ifndef _LINUX_FUSE_H
@@ -1300,6 +1307,55 @@ struct fuse_uring_cmd_req {
 
 /* Famfs fmap message components */
 
+#define FAMFS_FMAP_VERSION 1
+
 #define FAMFS_FMAP_MAX 32768 /* Largest supported fmap message */
+#define FUSE_FAMFS_MAX_EXTENTS 32
+#define FUSE_FAMFS_MAX_STRIPS 32
+
+enum fuse_famfs_file_type {
+	FUSE_FAMFS_FILE_REG,
+	FUSE_FAMFS_FILE_SUPERBLOCK,
+	FUSE_FAMFS_FILE_LOG,
+};
+
+enum famfs_ext_type {
+	FUSE_FAMFS_EXT_SIMPLE = 0,
+	FUSE_FAMFS_EXT_INTERLEAVE = 1,
+};
+
+struct fuse_famfs_simple_ext {
+	uint32_t se_devindex;
+	uint32_t reserved;
+	uint64_t se_offset;
+	uint64_t se_len;
+};
+
+struct fuse_famfs_iext { /* Interleaved extent */
+	uint32_t ie_nstrips;
+	uint32_t ie_chunk_size;
+	uint64_t ie_nbytes; /* Total bytes for this interleaved_ext;
+			     * sum of strips may be more
+			     */
+	uint64_t reserved;
+};
+
+struct fuse_famfs_fmap_header {
+	uint8_t file_type; /* enum famfs_file_type */
+	uint8_t reserved;
+	uint16_t fmap_version;
+	uint32_t ext_type; /* enum famfs_log_ext_type */
+	uint32_t nextents;
+	uint32_t reserved0;
+	uint64_t file_size;
+	uint64_t reserved1;
+};
+
+static inline int32_t fmap_msg_min_size(void)
+{
+	/* Smallest fmap message is a header plus one simple extent */
+	return (sizeof(struct fuse_famfs_fmap_header)
+		+ sizeof(struct fuse_famfs_simple_ext));
+}
 
 #endif /* _LINUX_FUSE_H */
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [RFC V2 14/18] famfs_fuse: GET_DAXDEV message and daxdev_table
  2025-07-03 18:50 [RFC V2 00/18] famfs: port into fuse John Groves
                   ` (12 preceding siblings ...)
  2025-07-03 18:50 ` [RFC V2 13/18] famfs_fuse: Create files with famfs fmaps John Groves
@ 2025-07-03 18:50 ` John Groves
  2025-07-04 13:20   ` Jonathan Cameron
  2025-08-14 13:58   ` Miklos Szeredi
  2025-07-03 18:50 ` [RFC V2 15/18] famfs_fuse: Plumb dax iomap and fuse read/write/mmap John Groves
                   ` (4 subsequent siblings)
  18 siblings, 2 replies; 91+ messages in thread
From: John Groves @ 2025-07-03 18:50 UTC (permalink / raw)
  To: John Groves, Dan Williams, Miklos Szeredi, Bernd Schubert
  Cc: John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Darrick J . Wong, Randy Dunlap, Jeff Layton, Kent Overstreet,
	linux-doc, linux-kernel, nvdimm, linux-cxl, linux-fsdevel,
	Amir Goldstein, Jonathan Cameron, Stefan Hajnoczi, Joanne Koong,
	Josef Bacik, Aravind Ramesh, Ajay Joshi, John Groves

* The new GET_DAXDEV message/response is enabled
* The command it triggered by the update_daxdev_table() call, if there
  are any daxdevs in the subject fmap that are not represented in the
  daxdev_dable yet.

Signed-off-by: John Groves <john@groves.net>
---
 fs/fuse/famfs.c           | 227 ++++++++++++++++++++++++++++++++++++++
 fs/fuse/famfs_kfmap.h     |  26 +++++
 fs/fuse/fuse_i.h          |   1 +
 fs/fuse/inode.c           |   4 +-
 fs/namei.c                |   1 +
 include/uapi/linux/fuse.h |  18 +++
 6 files changed, 276 insertions(+), 1 deletion(-)

diff --git a/fs/fuse/famfs.c b/fs/fuse/famfs.c
index 41c4d92f1451..f5e01032b825 100644
--- a/fs/fuse/famfs.c
+++ b/fs/fuse/famfs.c
@@ -20,6 +20,230 @@
 #include "famfs_kfmap.h"
 #include "fuse_i.h"
 
+/*
+ * famfs_teardown()
+ *
+ * Deallocate famfs metadata for a fuse_conn
+ */
+void /* XXX valid xfs or fuse format? */
+famfs_teardown(struct fuse_conn *fc)
+{
+	struct famfs_dax_devlist *devlist = fc->dax_devlist;
+	int i;
+
+	fc->dax_devlist = NULL;
+
+	if (!devlist)
+		return;
+
+	if (!devlist->devlist)
+		goto out;
+
+	/* Close & release all the daxdevs in our table */
+	for (i = 0; i < devlist->nslots; i++) {
+		if (devlist->devlist[i].valid && devlist->devlist[i].devp)
+			fs_put_dax(devlist->devlist[i].devp, fc);
+	}
+	kfree(devlist->devlist);
+
+out:
+	kfree(devlist);
+}
+
+static int
+famfs_verify_daxdev(const char *pathname, dev_t *devno)
+{
+	struct inode *inode;
+	struct path path;
+	int err;
+
+	if (!pathname || !*pathname)
+		return -EINVAL;
+
+	err = kern_path(pathname, LOOKUP_FOLLOW, &path);
+	if (err)
+		return err;
+
+	inode = d_backing_inode(path.dentry);
+	if (!S_ISCHR(inode->i_mode)) {
+		err = -EINVAL;
+		goto out_path_put;
+	}
+
+	if (!may_open_dev(&path)) { /* had to export this */
+		err = -EACCES;
+		goto out_path_put;
+	}
+
+	*devno = inode->i_rdev;
+
+out_path_put:
+	path_put(&path);
+	return err;
+}
+
+/**
+ * famfs_fuse_get_daxdev() - Retrieve info for a DAX device from fuse server
+ *
+ * Send a GET_DAXDEV message to the fuse server to retrieve info on a
+ * dax device.
+ *
+ * @fm:     fuse_mount
+ * @index:  the index of the dax device; daxdevs are referred to by index
+ *          in fmaps, and the server resolves the index to a particular daxdev
+ *
+ * Returns: 0=success
+ *          -errno=failure
+ */
+static int
+famfs_fuse_get_daxdev(struct fuse_mount *fm, const u64 index)
+{
+	struct fuse_daxdev_out daxdev_out = { 0 };
+	struct fuse_conn *fc = fm->fc;
+	struct famfs_daxdev *daxdev;
+	int err = 0;
+
+	FUSE_ARGS(args);
+
+	/* Store the daxdev in our table */
+	if (index >= fc->dax_devlist->nslots) {
+		pr_err("%s: index(%lld) > nslots(%d)\n",
+		       __func__, index, fc->dax_devlist->nslots);
+		err = -EINVAL;
+		goto out;
+	}
+
+	args.opcode = FUSE_GET_DAXDEV;
+	args.nodeid = index;
+
+	args.in_numargs = 0;
+
+	args.out_numargs = 1;
+	args.out_args[0].size = sizeof(daxdev_out);
+	args.out_args[0].value = &daxdev_out;
+
+	/* Send GET_DAXDEV command */
+	err = fuse_simple_request(fm, &args);
+	if (err) {
+		pr_err("%s: err=%d from fuse_simple_request()\n",
+		       __func__, err);
+		/*
+		 * Error will be that the payload is smaller than FMAP_BUFSIZE,
+		 * which is the max we can handle. Empty payload handled below.
+		 */
+		goto out;
+	}
+
+	down_write(&fc->famfs_devlist_sem);
+
+	daxdev = &fc->dax_devlist->devlist[index];
+
+	/* Abort if daxdev is now valid */
+	if (daxdev->valid) {
+		up_write(&fc->famfs_devlist_sem);
+		/* We already have a valid entry at this index */
+		err = -EALREADY;
+		goto out;
+	}
+
+	/* Verify that the dev is valid and can be opened and gets the devno */
+	err = famfs_verify_daxdev(daxdev_out.name, &daxdev->devno);
+	if (err) {
+		up_write(&fc->famfs_devlist_sem);
+		pr_err("%s: err=%d from famfs_verify_daxdev()\n", __func__, err);
+		goto out;
+	}
+
+	/* This will fail if it's not a dax device */
+	daxdev->devp = dax_dev_get(daxdev->devno);
+	if (!daxdev->devp) {
+		up_write(&fc->famfs_devlist_sem);
+		pr_warn("%s: device %s not found or not dax\n",
+			__func__, daxdev_out.name);
+		err = -ENODEV;
+		goto out;
+	}
+
+	daxdev->name = kstrdup(daxdev_out.name, GFP_KERNEL);
+	wmb(); /* all daxdev fields must be visible before marking it valid */
+	daxdev->valid = 1;
+
+	up_write(&fc->famfs_devlist_sem);
+
+out:
+	return err;
+}
+
+/**
+ * famfs_update_daxdev_table() - Update the daxdev table
+ * @fm   - fuse_mount
+ * @meta - famfs_file_meta, in-memory format, built from a GET_FMAP response
+ *
+ * This function is called for each new file fmap, to verify whether all
+ * referenced daxdevs are already known (i.e. in the table). Any daxdev
+ * indices that referenced in @meta but not in the table will be retrieved via
+ * famfs_fuse_get_daxdev() and added to the table
+ *
+ * Return: 0=success
+ *         -errno=failure
+ */
+static int
+famfs_update_daxdev_table(
+	struct fuse_mount *fm,
+	const struct famfs_file_meta *meta)
+{
+	struct famfs_dax_devlist *local_devlist;
+	struct fuse_conn *fc = fm->fc;
+	int err;
+	int i;
+
+	/* First time through we will need to allocate the dax_devlist */
+	if (!fc->dax_devlist) {
+		local_devlist = kcalloc(1, sizeof(*fc->dax_devlist), GFP_KERNEL);
+		if (!local_devlist)
+			return -ENOMEM;
+
+		local_devlist->nslots = MAX_DAXDEVS;
+
+		local_devlist->devlist = kcalloc(MAX_DAXDEVS,
+						 sizeof(struct famfs_daxdev),
+						 GFP_KERNEL);
+		if (!local_devlist->devlist) {
+			kfree(local_devlist);
+			return -ENOMEM;
+		}
+
+		/* We don't need the famfs_devlist_sem here because we use cmpxchg... */
+		if (cmpxchg(&fc->dax_devlist, NULL, local_devlist) != NULL) {
+			kfree(local_devlist->devlist);
+			kfree(local_devlist); /* another thread beat us to it */
+		}
+	}
+
+	down_read(&fc->famfs_devlist_sem);
+	for (i = 0; i < fc->dax_devlist->nslots; i++) {
+		if (meta->dev_bitmap & (1ULL << i)) {
+			/* This file meta struct references devindex i
+			 * if devindex i isn't in the table; get it...
+			 */
+			if (!(fc->dax_devlist->devlist[i].valid)) {
+				up_read(&fc->famfs_devlist_sem);
+
+				err = famfs_fuse_get_daxdev(fm, i);
+				if (err)
+					pr_err("%s: failed to get daxdev=%d\n",
+					       __func__, i);
+
+				down_read(&fc->famfs_devlist_sem);
+			}
+		}
+	}
+	up_read(&fc->famfs_devlist_sem);
+
+	return 0;
+}
+
+/***************************************************************************/
 
 void
 __famfs_meta_free(void *famfs_meta)
@@ -336,6 +560,9 @@ famfs_file_init_dax(
 	if (rc)
 		goto errout;
 
+	/* Make sure this fmap doesn't reference any unknown daxdevs */
+	famfs_update_daxdev_table(fm, meta);
+
 	/* Publish the famfs metadata on fi->famfs_meta */
 	inode_lock(inode);
 	if (fi->famfs_meta) {
diff --git a/fs/fuse/famfs_kfmap.h b/fs/fuse/famfs_kfmap.h
index ce785d76719c..f79707b9f761 100644
--- a/fs/fuse/famfs_kfmap.h
+++ b/fs/fuse/famfs_kfmap.h
@@ -60,4 +60,30 @@ struct famfs_file_meta {
 	};
 };
 
+/**
+ * famfs_daxdev - tracking struct for a daxdev within a famfs file system
+ *
+ * This is the in-memory daxdev metadata that is populated by
+ * the responses to GET_FMAP messages
+ */
+struct famfs_daxdev {
+	/* Include dev uuid? */
+	bool valid;
+	bool error;
+	dev_t devno;
+	struct dax_device *devp;
+	char *name;
+};
+
+#define MAX_DAXDEVS 24
+
+/**
+ * famfs_dax_devlist - list of famfs_daxdev's
+ */
+struct famfs_dax_devlist {
+	int nslots;
+	int ndevs;
+	struct famfs_daxdev *devlist; /* XXX: make this an xarray! */
+};
+
 #endif /* FAMFS_KFMAP_H */
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index fb6095655403..37298551539c 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -1565,6 +1565,7 @@ int famfs_file_init_dax(struct fuse_mount *fm,
 			     struct inode *inode, void *fmap_buf,
 			     size_t fmap_size);
 void __famfs_meta_free(void *map);
+void famfs_teardown(struct fuse_conn *fc);
 #endif
 
 static inline struct fuse_backing *famfs_meta_set(struct fuse_inode *fi,
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 1682755abf30..c29e9d96ea92 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -1049,8 +1049,10 @@ void fuse_conn_put(struct fuse_conn *fc)
 		}
 		if (IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
 			fuse_backing_files_free(fc);
-		if (IS_ENABLED(CONFIG_FUSE_FAMFS_DAX))
+		if (IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)) {
 			kfree(fc->shadow);
+			famfs_teardown(fc);
+		}
 		call_rcu(&fc->rcu, delayed_release);
 	}
 }
diff --git a/fs/namei.c b/fs/namei.c
index ecb7b95c2ca3..75a1e1d46593 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3380,6 +3380,7 @@ bool may_open_dev(const struct path *path)
 	return !(path->mnt->mnt_flags & MNT_NODEV) &&
 		!(path->mnt->mnt_sb->s_iflags & SB_I_NODEV);
 }
+EXPORT_SYMBOL(may_open_dev);
 
 static int may_open(struct mnt_idmap *idmap, const struct path *path,
 		    int acc_mode, int flag)
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index ecaaa62910f0..8a81b6c334fe 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -235,6 +235,9 @@
  *      - struct fuse_famfs_simple_ext
  *      - struct fuse_famfs_iext
  *      - struct fuse_famfs_fmap_header
+ *    - Add the following structs for the GET_DAXDEV message and reply
+ *      - struct fuse_get_daxdev_in
+ *      - struct fuse_get_daxdev_out
  *    - Add the following enumerated types
  *      - enum fuse_famfs_file_type
  *      - enum famfs_ext_type
@@ -1351,6 +1354,20 @@ struct fuse_famfs_fmap_header {
 	uint64_t reserved1;
 };
 
+struct fuse_get_daxdev_in {
+	uint32_t        daxdev_num;
+};
+
+#define DAXDEV_NAME_MAX 256
+struct fuse_daxdev_out {
+	uint16_t index;
+	uint16_t reserved;
+	uint32_t reserved2;
+	uint64_t reserved3; /* enough space for a uuid if we need it */
+	uint64_t reserved4;
+	char name[DAXDEV_NAME_MAX];
+};
+
 static inline int32_t fmap_msg_min_size(void)
 {
 	/* Smallest fmap message is a header plus one simple extent */
@@ -1358,4 +1375,5 @@ static inline int32_t fmap_msg_min_size(void)
 		+ sizeof(struct fuse_famfs_simple_ext));
 }
 
+
 #endif /* _LINUX_FUSE_H */
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [RFC V2 15/18] famfs_fuse: Plumb dax iomap and fuse read/write/mmap
  2025-07-03 18:50 [RFC V2 00/18] famfs: port into fuse John Groves
                   ` (13 preceding siblings ...)
  2025-07-03 18:50 ` [RFC V2 14/18] famfs_fuse: GET_DAXDEV message and daxdev_table John Groves
@ 2025-07-03 18:50 ` John Groves
  2025-07-04  9:13   ` Amir Goldstein
  2025-07-03 18:50 ` [RFC V2 16/18] famfs_fuse: Add holder_operations for dax notify_failure() John Groves
                   ` (3 subsequent siblings)
  18 siblings, 1 reply; 91+ messages in thread
From: John Groves @ 2025-07-03 18:50 UTC (permalink / raw)
  To: John Groves, Dan Williams, Miklos Szeredi, Bernd Schubert
  Cc: John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Darrick J . Wong, Randy Dunlap, Jeff Layton, Kent Overstreet,
	linux-doc, linux-kernel, nvdimm, linux-cxl, linux-fsdevel,
	Amir Goldstein, Jonathan Cameron, Stefan Hajnoczi, Joanne Koong,
	Josef Bacik, Aravind Ramesh, Ajay Joshi, John Groves

This commit fills in read/write/mmap handling for famfs files. The
dev_dax_iomap interface is used - just like xfs in fs-dax mode.

Signed-off-by: John Groves <john@groves.net>
---
 fs/fuse/famfs.c  | 436 +++++++++++++++++++++++++++++++++++++++++++++++
 fs/fuse/file.c   |  14 ++
 fs/fuse/fuse_i.h |   3 +
 3 files changed, 453 insertions(+)

diff --git a/fs/fuse/famfs.c b/fs/fuse/famfs.c
index f5e01032b825..1973eb10b60b 100644
--- a/fs/fuse/famfs.c
+++ b/fs/fuse/famfs.c
@@ -585,3 +585,439 @@ famfs_file_init_dax(
 	return rc;
 }
 
+/*********************************************************************
+ * iomap_operations
+ *
+ * This stuff uses the iomap (dax-related) helpers to resolve file offsets to
+ * offsets within a dax device.
+ */
+
+static ssize_t famfs_file_bad(struct inode *inode);
+
+static int
+famfs_interleave_fileofs_to_daxofs(struct inode *inode, struct iomap *iomap,
+			 loff_t file_offset, off_t len, unsigned int flags)
+{
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	struct famfs_file_meta *meta = fi->famfs_meta;
+	struct fuse_conn *fc = get_fuse_conn(inode);
+	loff_t local_offset = file_offset;
+	int i;
+
+	/* This function is only for extent_type INTERLEAVED_EXTENT */
+	if (meta->fm_extent_type != INTERLEAVED_EXTENT) {
+		pr_err("%s: bad extent type\n", __func__);
+		goto err_out;
+	}
+
+	if (famfs_file_bad(inode))
+		goto err_out;
+
+	iomap->offset = file_offset;
+
+	for (i = 0; i < meta->fm_niext; i++) {
+		struct famfs_meta_interleaved_ext *fei = &meta->ie[i];
+		u64 chunk_size = fei->fie_chunk_size;
+		u64 nstrips = fei->fie_nstrips;
+		u64 ext_size = fei->fie_nbytes;
+
+		ext_size = min_t(u64, ext_size, meta->file_size);
+
+		if (ext_size == 0) {
+			pr_err("%s: ext_size=%lld file_size=%ld\n",
+			       __func__, fei->fie_nbytes, meta->file_size);
+			goto err_out;
+		}
+
+		/* Is the data is in this striped extent? */
+		if (local_offset < ext_size) {
+			u64 chunk_num       = local_offset / chunk_size;
+			u64 chunk_offset    = local_offset % chunk_size;
+			u64 stripe_num      = chunk_num / nstrips;
+			u64 strip_num       = chunk_num % nstrips;
+			u64 chunk_remainder = chunk_size - chunk_offset;
+			u64 strip_offset    = chunk_offset + (stripe_num * chunk_size);
+			u64 strip_dax_ofs = fei->ie_strips[strip_num].ext_offset;
+			u64 strip_devidx = fei->ie_strips[strip_num].dev_index;
+
+			if (!fc->dax_devlist->devlist[strip_devidx].valid) {
+				pr_err("%s: daxdev=%lld invalid\n", __func__,
+					strip_devidx);
+				goto err_out;
+			}
+			iomap->addr    = strip_dax_ofs + strip_offset;
+			iomap->offset  = file_offset;
+			iomap->length  = min_t(loff_t, len, chunk_remainder);
+
+			iomap->dax_dev = fc->dax_devlist->devlist[strip_devidx].devp;
+
+			iomap->type    = IOMAP_MAPPED;
+			iomap->flags   = flags;
+
+			return 0;
+		}
+		local_offset -= ext_size; /* offset is beyond this striped extent */
+	}
+
+ err_out:
+	pr_err("%s: err_out\n", __func__);
+
+	/* We fell out the end of the extent list.
+	 * Set iomap to zero length in this case, and return 0
+	 * This just means that the r/w is past EOF
+	 */
+	iomap->addr    = 0; /* there is no valid dax device offset */
+	iomap->offset  = file_offset; /* file offset */
+	iomap->length  = 0; /* this had better result in no access to dax mem */
+	iomap->dax_dev = NULL;
+	iomap->type    = IOMAP_MAPPED;
+	iomap->flags   = flags;
+
+	return 0;
+}
+
+/**
+ * famfs_fileofs_to_daxofs() - Resolve (file, offset, len) to (daxdev, offset, len)
+ *
+ * This function is called by famfs_fuse_iomap_begin() to resolve an offset in a
+ * file to an offset in a dax device. This is upcalled from dax from calls to
+ * both  * dax_iomap_fault() and dax_iomap_rw(). Dax finishes the job resolving
+ * a fault to a specific physical page (the fault case) or doing a memcpy
+ * variant (the rw case)
+ *
+ * Pages can be PTE (4k), PMD (2MiB) or (theoretically) PuD (1GiB)
+ * (these sizes are for X86; may vary on other cpu architectures
+ *
+ * @inode:  The file where the fault occurred
+ * @iomap:       To be filled in to indicate where to find the right memory,
+ *               relative  to a dax device.
+ * @file_offset: Within the file where the fault occurred (will be page boundary)
+ * @len:         The length of the faulted mapping (will be a page multiple)
+ *               (will be trimmed in *iomap if it's disjoint in the extent list)
+ * @flags:
+ *
+ * Return values: 0. (info is returned in a modified @iomap struct)
+ */
+static int
+famfs_fileofs_to_daxofs(struct inode *inode, struct iomap *iomap,
+			 loff_t file_offset, off_t len, unsigned int flags)
+{
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	struct famfs_file_meta *meta = fi->famfs_meta;
+	struct fuse_conn *fc = get_fuse_conn(inode);
+	loff_t local_offset = file_offset;
+	int i;
+
+	if (!fc->dax_devlist) {
+		pr_err("%s: null dax_devlist\n", __func__);
+		goto err_out;
+	}
+
+	if (famfs_file_bad(inode))
+		goto err_out;
+
+	if (meta->fm_extent_type == INTERLEAVED_EXTENT)
+		return famfs_interleave_fileofs_to_daxofs(inode, iomap,
+							  file_offset,
+							  len, flags);
+
+	iomap->offset = file_offset;
+
+	for (i = 0; i < meta->fm_nextents; i++) {
+		/* TODO: check devindex too */
+		loff_t dax_ext_offset = meta->se[i].ext_offset;
+		loff_t dax_ext_len    = meta->se[i].ext_len;
+		u64 daxdev_idx = meta->se[i].dev_index;
+
+		if ((dax_ext_offset == 0) &&
+		    (meta->file_type != FAMFS_SUPERBLOCK))
+			pr_warn("%s: zero offset on non-superblock file!!\n",
+				__func__);
+
+		/* local_offset is the offset minus the size of extents skipped
+		 * so far; If local_offset < dax_ext_len, the data of interest
+		 * starts in this extent
+		 */
+		if (local_offset < dax_ext_len) {
+			loff_t ext_len_remainder = dax_ext_len - local_offset;
+			struct famfs_daxdev *dd;
+
+			dd = &fc->dax_devlist->devlist[daxdev_idx];
+
+			if (!dd->valid || dd->error) {
+				pr_err("%s: daxdev=%lld %s\n", __func__,
+				       daxdev_idx,
+				       dd->valid ? "error" : "invalid");
+				goto err_out;
+			}
+
+			/*
+			 * OK, we found the file metadata extent where this
+			 * data begins
+			 * @local_offset      - The offset within the current
+			 *                      extent
+			 * @ext_len_remainder - Remaining length of ext after
+			 *                      skipping local_offset
+			 * Outputs:
+			 * iomap->addr:   the offset within the dax device where
+			 *                the  data starts
+			 * iomap->offset: the file offset
+			 * iomap->length: the valid length resolved here
+			 */
+			iomap->addr    = dax_ext_offset + local_offset;
+			iomap->offset  = file_offset;
+			iomap->length  = min_t(loff_t, len, ext_len_remainder);
+
+			iomap->dax_dev = fc->dax_devlist->devlist[daxdev_idx].devp;
+
+			iomap->type    = IOMAP_MAPPED;
+			iomap->flags   = flags;
+			return 0;
+		}
+		local_offset -= dax_ext_len; /* Get ready for the next extent */
+	}
+
+ err_out:
+	pr_err("%s: err_out\n", __func__);
+
+	/* We fell out the end of the extent list.
+	 * Set iomap to zero length in this case, and return 0
+	 * This just means that the r/w is past EOF
+	 */
+	iomap->addr    = 0; /* there is no valid dax device offset */
+	iomap->offset  = file_offset; /* file offset */
+	iomap->length  = 0; /* this had better result in no access to dax mem */
+	iomap->dax_dev = NULL;
+	iomap->type    = IOMAP_MAPPED;
+	iomap->flags   = flags;
+
+	return 0;
+}
+
+/**
+ * famfs_fuse_iomap_begin() - Handler for iomap_begin upcall from dax
+ *
+ * This function is pretty simple because files are
+ * * never partially allocated
+ * * never have holes (never sparse)
+ * * never "allocate on write"
+ *
+ * @inode:  inode for the file being accessed
+ * @offset: offset within the file
+ * @length: Length being accessed at offset
+ * @flags:
+ * @iomap:  iomap struct to be filled in, resolving (offset, length) to
+ *          (daxdev, offset, len)
+ * @srcmap:
+ */
+static int
+famfs_fuse_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
+		  unsigned int flags, struct iomap *iomap, struct iomap *srcmap)
+{
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	struct famfs_file_meta *meta = fi->famfs_meta;
+	size_t size;
+
+	size = i_size_read(inode);
+
+	WARN_ON(size != meta->file_size);
+
+	return famfs_fileofs_to_daxofs(inode, iomap, offset, length, flags);
+}
+
+/* Note: We never need a special set of write_iomap_ops because famfs never
+ * performs allocation on write.
+ */
+const struct iomap_ops famfs_iomap_ops = {
+	.iomap_begin		= famfs_fuse_iomap_begin,
+};
+
+/*********************************************************************
+ * vm_operations
+ */
+static vm_fault_t
+__famfs_fuse_filemap_fault(struct vm_fault *vmf, unsigned int pe_size,
+		      bool write_fault)
+{
+	struct inode *inode = file_inode(vmf->vma->vm_file);
+	vm_fault_t ret;
+	pfn_t pfn;
+
+	if (!IS_DAX(file_inode(vmf->vma->vm_file))) {
+		pr_err("%s: file not marked IS_DAX!!\n", __func__);
+		return VM_FAULT_SIGBUS;
+	}
+
+	if (write_fault) {
+		sb_start_pagefault(inode->i_sb);
+		file_update_time(vmf->vma->vm_file);
+	}
+
+	ret = dax_iomap_fault(vmf, pe_size, &pfn, NULL, &famfs_iomap_ops);
+	if (ret & VM_FAULT_NEEDDSYNC)
+		ret = dax_finish_sync_fault(vmf, pe_size, pfn);
+
+	if (write_fault)
+		sb_end_pagefault(inode->i_sb);
+
+	return ret;
+}
+
+static inline bool
+famfs_is_write_fault(struct vm_fault *vmf)
+{
+	return (vmf->flags & FAULT_FLAG_WRITE) &&
+	       (vmf->vma->vm_flags & VM_SHARED);
+}
+
+static vm_fault_t
+famfs_filemap_fault(struct vm_fault *vmf)
+{
+	return __famfs_fuse_filemap_fault(vmf, 0, famfs_is_write_fault(vmf));
+}
+
+static vm_fault_t
+famfs_filemap_huge_fault(struct vm_fault *vmf, unsigned int pe_size)
+{
+	return __famfs_fuse_filemap_fault(vmf, pe_size, famfs_is_write_fault(vmf));
+}
+
+static vm_fault_t
+famfs_filemap_page_mkwrite(struct vm_fault *vmf)
+{
+	return __famfs_fuse_filemap_fault(vmf, 0, true);
+}
+
+static vm_fault_t
+famfs_filemap_pfn_mkwrite(struct vm_fault *vmf)
+{
+	return __famfs_fuse_filemap_fault(vmf, 0, true);
+}
+
+static vm_fault_t
+famfs_filemap_map_pages(struct vm_fault	*vmf, pgoff_t start_pgoff,
+			pgoff_t	end_pgoff)
+{
+	return filemap_map_pages(vmf, start_pgoff, end_pgoff);
+}
+
+const struct vm_operations_struct famfs_file_vm_ops = {
+	.fault		= famfs_filemap_fault,
+	.huge_fault	= famfs_filemap_huge_fault,
+	.map_pages	= famfs_filemap_map_pages,
+	.page_mkwrite	= famfs_filemap_page_mkwrite,
+	.pfn_mkwrite	= famfs_filemap_pfn_mkwrite,
+};
+
+/*********************************************************************
+ * file_operations
+ */
+
+/**
+ * famfs_file_bad() - Check for files that aren't in a valid state
+ *
+ * @inode - inode
+ *
+ * Returns: 0=success
+ *          -errno=failure
+ */
+static ssize_t
+famfs_file_bad(struct inode *inode)
+{
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	struct famfs_file_meta *meta = fi->famfs_meta;
+	size_t i_size = i_size_read(inode);
+
+	if (!meta) {
+		pr_err("%s: un-initialized famfs file\n", __func__);
+		return -EIO;
+	}
+	if (meta->error) {
+		pr_debug("%s: previously detected metadata errors\n", __func__);
+		return -EIO;
+	}
+	if (i_size != meta->file_size) {
+		pr_warn("%s: i_size overwritten from %ld to %ld\n",
+		       __func__, meta->file_size, i_size);
+		meta->error = true;
+		return -ENXIO;
+	}
+	if (!IS_DAX(inode)) {
+		pr_debug("%s: inode %llx IS_DAX is false\n", __func__, (u64)inode);
+		return -ENXIO;
+	}
+	return 0;
+}
+
+static ssize_t
+famfs_fuse_rw_prep(struct kiocb *iocb, struct iov_iter *ubuf)
+{
+	struct inode *inode = iocb->ki_filp->f_mapping->host;
+	size_t i_size = i_size_read(inode);
+	size_t count = iov_iter_count(ubuf);
+	size_t max_count;
+	ssize_t rc;
+
+	rc = famfs_file_bad(inode);
+	if (rc)
+		return rc;
+
+	max_count = max_t(size_t, 0, i_size - iocb->ki_pos);
+
+	if (count > max_count)
+		iov_iter_truncate(ubuf, max_count);
+
+	if (!iov_iter_count(ubuf))
+		return 0;
+
+	return rc;
+}
+
+ssize_t
+famfs_fuse_read_iter(struct kiocb *iocb, struct iov_iter	*to)
+{
+	ssize_t rc;
+
+	rc = famfs_fuse_rw_prep(iocb, to);
+	if (rc)
+		return rc;
+
+	if (!iov_iter_count(to))
+		return 0;
+
+	rc = dax_iomap_rw(iocb, to, &famfs_iomap_ops);
+
+	file_accessed(iocb->ki_filp);
+	return rc;
+}
+
+ssize_t
+famfs_fuse_write_iter(struct kiocb *iocb, struct iov_iter *from)
+{
+	ssize_t rc;
+
+	rc = famfs_fuse_rw_prep(iocb, from);
+	if (rc)
+		return rc;
+
+	if (!iov_iter_count(from))
+		return 0;
+
+	return dax_iomap_rw(iocb, from, &famfs_iomap_ops);
+}
+
+int
+famfs_fuse_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	struct inode *inode = file_inode(file);
+	ssize_t rc;
+
+	rc = famfs_file_bad(inode);
+	if (rc)
+		return (int)rc;
+
+	file_accessed(file);
+	vma->vm_ops = &famfs_file_vm_ops;
+	vm_flags_set(vma, VM_HUGEPAGE);
+	return 0;
+}
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 5d205eadb48f..24a14b176510 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1874,6 +1874,8 @@ static ssize_t fuse_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 
 	if (FUSE_IS_VIRTIO_DAX(fi))
 		return fuse_dax_read_iter(iocb, to);
+	if (fuse_file_famfs(fi))
+		return famfs_fuse_read_iter(iocb, to);
 
 	/* FOPEN_DIRECT_IO overrides FOPEN_PASSTHROUGH */
 	if (ff->open_flags & FOPEN_DIRECT_IO)
@@ -1896,6 +1898,8 @@ static ssize_t fuse_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 
 	if (FUSE_IS_VIRTIO_DAX(fi))
 		return fuse_dax_write_iter(iocb, from);
+	if (fuse_file_famfs(fi))
+		return famfs_fuse_write_iter(iocb, from);
 
 	/* FOPEN_DIRECT_IO overrides FOPEN_PASSTHROUGH */
 	if (ff->open_flags & FOPEN_DIRECT_IO)
@@ -1911,10 +1915,14 @@ static ssize_t fuse_splice_read(struct file *in, loff_t *ppos,
 				unsigned int flags)
 {
 	struct fuse_file *ff = in->private_data;
+	struct inode *inode = file_inode(in);
+	struct fuse_inode *fi = get_fuse_inode(inode);
 
 	/* FOPEN_DIRECT_IO overrides FOPEN_PASSTHROUGH */
 	if (fuse_file_passthrough(ff) && !(ff->open_flags & FOPEN_DIRECT_IO))
 		return fuse_passthrough_splice_read(in, ppos, pipe, len, flags);
+	else if (fuse_file_famfs(fi))
+		return -EIO; /* direct I/O doesn't make sense in dax_iomap */
 	else
 		return filemap_splice_read(in, ppos, pipe, len, flags);
 }
@@ -1923,10 +1931,14 @@ static ssize_t fuse_splice_write(struct pipe_inode_info *pipe, struct file *out,
 				 loff_t *ppos, size_t len, unsigned int flags)
 {
 	struct fuse_file *ff = out->private_data;
+	struct inode *inode = file_inode(out);
+	struct fuse_inode *fi = get_fuse_inode(inode);
 
 	/* FOPEN_DIRECT_IO overrides FOPEN_PASSTHROUGH */
 	if (fuse_file_passthrough(ff) && !(ff->open_flags & FOPEN_DIRECT_IO))
 		return fuse_passthrough_splice_write(pipe, out, ppos, len, flags);
+	else if (fuse_file_famfs(fi))
+		return -EIO; /* direct I/O doesn't make sense in dax_iomap */
 	else
 		return iter_file_splice_write(pipe, out, ppos, len, flags);
 }
@@ -2732,6 +2744,8 @@ static int fuse_file_mmap(struct file *file, struct vm_area_struct *vma)
 	/* DAX mmap is superior to direct_io mmap */
 	if (FUSE_IS_VIRTIO_DAX(fi))
 		return fuse_dax_mmap(file, vma);
+	if (fuse_file_famfs(fi))
+		return famfs_fuse_mmap(file, vma);
 
 	/*
 	 * If inode is in passthrough io mode, because it has some file open
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 37298551539c..3b3a1d95367f 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -1564,6 +1564,9 @@ extern void fuse_sysctl_unregister(void);
 int famfs_file_init_dax(struct fuse_mount *fm,
 			     struct inode *inode, void *fmap_buf,
 			     size_t fmap_size);
+ssize_t famfs_fuse_write_iter(struct kiocb *iocb, struct iov_iter *from);
+ssize_t famfs_fuse_read_iter(struct kiocb *iocb, struct iov_iter	*to);
+int famfs_fuse_mmap(struct file *file, struct vm_area_struct *vma);
 void __famfs_meta_free(void *map);
 void famfs_teardown(struct fuse_conn *fc);
 #endif
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [RFC V2 16/18] famfs_fuse: Add holder_operations for dax notify_failure()
  2025-07-03 18:50 [RFC V2 00/18] famfs: port into fuse John Groves
                   ` (14 preceding siblings ...)
  2025-07-03 18:50 ` [RFC V2 15/18] famfs_fuse: Plumb dax iomap and fuse read/write/mmap John Groves
@ 2025-07-03 18:50 ` John Groves
  2025-07-03 18:50 ` [RFC V2 17/18] famfs_fuse: Add famfs metadata documentation John Groves
                   ` (2 subsequent siblings)
  18 siblings, 0 replies; 91+ messages in thread
From: John Groves @ 2025-07-03 18:50 UTC (permalink / raw)
  To: John Groves, Dan Williams, Miklos Szeredi, Bernd Schubert
  Cc: John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Darrick J . Wong, Randy Dunlap, Jeff Layton, Kent Overstreet,
	linux-doc, linux-kernel, nvdimm, linux-cxl, linux-fsdevel,
	Amir Goldstein, Jonathan Cameron, Stefan Hajnoczi, Joanne Koong,
	Josef Bacik, Aravind Ramesh, Ajay Joshi, John Groves

If we get a notify_failure() call on a daxdev, set its error flag and
prevent further access to that device.

Signed-off-by: John Groves <john@groves.net>
---
 fs/fuse/famfs.c | 70 ++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 67 insertions(+), 3 deletions(-)

diff --git a/fs/fuse/famfs.c b/fs/fuse/famfs.c
index 1973eb10b60b..62c01d5b9d78 100644
--- a/fs/fuse/famfs.c
+++ b/fs/fuse/famfs.c
@@ -20,6 +20,26 @@
 #include "famfs_kfmap.h"
 #include "fuse_i.h"
 
+static void famfs_set_daxdev_err(
+	struct fuse_conn *fc, struct dax_device *dax_devp);
+
+static int
+famfs_dax_notify_failure(struct dax_device *dax_devp, u64 offset,
+			u64 len, int mf_flags)
+{
+	struct fuse_conn *fc = dax_holder(dax_devp);
+
+	famfs_set_daxdev_err(fc, dax_devp);
+
+	return 0;
+}
+
+static const struct dax_holder_operations famfs_fuse_dax_holder_ops = {
+	.notify_failure		= famfs_dax_notify_failure,
+};
+
+/*****************************************************************************/
+
 /*
  * famfs_teardown()
  *
@@ -164,6 +184,15 @@ famfs_fuse_get_daxdev(struct fuse_mount *fm, const u64 index)
 		goto out;
 	}
 
+	err = fs_dax_get(daxdev->devp, fc, &famfs_fuse_dax_holder_ops);
+	if (err) {
+		up_write(&fc->famfs_devlist_sem);
+		pr_err("%s: fs_dax_get(%lld) failed\n",
+		       __func__, (u64)daxdev->devno);
+		err = -EBUSY;
+		goto out;
+	}
+
 	daxdev->name = kstrdup(daxdev_out.name, GFP_KERNEL);
 	wmb(); /* all daxdev fields must be visible before marking it valid */
 	daxdev->valid = 1;
@@ -243,6 +272,38 @@ famfs_update_daxdev_table(
 	return 0;
 }
 
+static void
+famfs_set_daxdev_err(
+	struct fuse_conn *fc,
+	struct dax_device *dax_devp)
+{
+	int i;
+
+	/* Gotta search the list by dax_devp;
+	 * read lock because we're not adding or removing daxdev entries
+	 */
+	down_read(&fc->famfs_devlist_sem);
+	for (i = 0; i < fc->dax_devlist->nslots; i++) {
+		if (fc->dax_devlist->devlist[i].valid) {
+			struct famfs_daxdev *dd = &fc->dax_devlist->devlist[i];
+
+			if (dd->devp != dax_devp)
+				continue;
+
+			dd->error = true;
+			up_read(&fc->famfs_devlist_sem);
+
+			pr_err("%s: memory error on daxdev %s (%d)\n",
+			       __func__, dd->name, i);
+			goto done;
+		}
+	}
+	up_read(&fc->famfs_devlist_sem);
+	pr_err("%s: memory err on unrecognized daxdev\n", __func__);
+
+done:
+}
+
 /***************************************************************************/
 
 void
@@ -631,6 +692,7 @@ famfs_interleave_fileofs_to_daxofs(struct inode *inode, struct iomap *iomap,
 
 		/* Is the data is in this striped extent? */
 		if (local_offset < ext_size) {
+			struct famfs_daxdev *dd;
 			u64 chunk_num       = local_offset / chunk_size;
 			u64 chunk_offset    = local_offset % chunk_size;
 			u64 stripe_num      = chunk_num / nstrips;
@@ -640,9 +702,11 @@ famfs_interleave_fileofs_to_daxofs(struct inode *inode, struct iomap *iomap,
 			u64 strip_dax_ofs = fei->ie_strips[strip_num].ext_offset;
 			u64 strip_devidx = fei->ie_strips[strip_num].dev_index;
 
-			if (!fc->dax_devlist->devlist[strip_devidx].valid) {
-				pr_err("%s: daxdev=%lld invalid\n", __func__,
-					strip_devidx);
+			dd = &fc->dax_devlist->devlist[strip_devidx];
+			if (!dd->valid || dd->error) {
+				pr_err("%s: daxdev=%lld %s\n", __func__,
+				       strip_devidx,
+				       dd->valid ? "error" : "invalid");
 				goto err_out;
 			}
 			iomap->addr    = strip_dax_ofs + strip_offset;
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [RFC V2 17/18] famfs_fuse: Add famfs metadata documentation
  2025-07-03 18:50 [RFC V2 00/18] famfs: port into fuse John Groves
                   ` (15 preceding siblings ...)
  2025-07-03 18:50 ` [RFC V2 16/18] famfs_fuse: Add holder_operations for dax notify_failure() John Groves
@ 2025-07-03 18:50 ` John Groves
  2025-07-03 18:50 ` [RFC V2 18/18] famfs_fuse: Add documentation John Groves
  2025-07-03 18:56 ` [RFC V2 00/18] famfs: port into fuse John Groves
  18 siblings, 0 replies; 91+ messages in thread
From: John Groves @ 2025-07-03 18:50 UTC (permalink / raw)
  To: John Groves, Dan Williams, Miklos Szeredi, Bernd Schubert
  Cc: John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Darrick J . Wong, Randy Dunlap, Jeff Layton, Kent Overstreet,
	linux-doc, linux-kernel, nvdimm, linux-cxl, linux-fsdevel,
	Amir Goldstein, Jonathan Cameron, Stefan Hajnoczi, Joanne Koong,
	Josef Bacik, Aravind Ramesh, Ajay Joshi, John Groves

From: John Groves <John@Groves.net>

This describes the fmap metadata - both simple and interleaved

Signed-off-by: John Groves <john@groves.net>
---
 fs/fuse/famfs_kfmap.h | 87 ++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 82 insertions(+), 5 deletions(-)

diff --git a/fs/fuse/famfs_kfmap.h b/fs/fuse/famfs_kfmap.h
index f79707b9f761..2c317554b151 100644
--- a/fs/fuse/famfs_kfmap.h
+++ b/fs/fuse/famfs_kfmap.h
@@ -7,10 +7,87 @@
 #ifndef FAMFS_KFMAP_H
 #define FAMFS_KFMAP_H
 
+/* KABI version 43 (aka v2) fmap structures
+ *
+ * The location of the memory backing for a famfs file is described by
+ * the response to the GET_FMAP fuse message (defined in
+ * include/uapi/linux/fuse.h
+ *
+ * There are currently two extent formats: Simple and Interleaved.
+ *
+ * Simple extents are just (devindex, offset, length) tuples, where devindex
+ * references a devdax device that must be retrievable via the GET_DAXDEV
+ * message/response.
+ *
+ * The extent list size must be >= file_size.
+ *
+ * Interleaved extents merit some additional explanation. Interleaved
+ * extents stripe data across a collection of strips. Each strip is a
+ * contiguous allocation from a single devdax device - and is described by
+ * a simple_extent structure.
+ *
+ * Interleaved_extent example:
+ *   ie_nstrips = 4
+ *   ie_chunk_size = 2MiB
+ *   ie_nbytes = 24MiB
+ *
+ * ┌────────────┐────────────┐────────────┐────────────┐
+ * │Chunk = 0   │Chunk = 1   │Chunk = 2   │Chunk = 3   │
+ * │Strip = 0   │Strip = 1   │Strip = 2   │Strip = 3   │
+ * │Stripe = 0  │Stripe = 0  │Stripe = 0  │Stripe = 0  │
+ * │            │            │            │            │
+ * └────────────┘────────────┘────────────┘────────────┘
+ * │Chunk = 4   │Chunk = 5   │Chunk = 6   │Chunk = 7   │
+ * │Strip = 0   │Strip = 1   │Strip = 2   │Strip = 3   │
+ * │Stripe = 1  │Stripe = 1  │Stripe = 1  │Stripe = 1  │
+ * │            │            │            │            │
+ * └────────────┘────────────┘────────────┘────────────┘
+ * │Chunk = 8   │Chunk = 9   │Chunk = 10  │Chunk = 11  │
+ * │Strip = 0   │Strip = 1   │Strip = 2   │Strip = 3   │
+ * │Stripe = 2  │Stripe = 2  │Stripe = 2  │Stripe = 2  │
+ * │            │            │            │            │
+ * └────────────┘────────────┘────────────┘────────────┘
+ *
+ * * Data is laid out across chunks in chunk # order
+ * * Columns are strips
+ * * Strips are contiguous devdax extents, normally each coming from a
+ *   different memory device
+ * * Rows are stripes
+ * * The number of chunks is (int)((file_size + chunk_size - 1) / chunk_size)
+ *   (and obviously the last chunk could be partial)
+ * * The stripe_size = (nstrips * chunk_size)
+ * * chunk_num(offset) = offset / chunk_size    //integer division
+ * * strip_num(offset) = chunk_num(offset) % nchunks
+ * * stripe_num(offset) = offset / stripe_size  //integer division
+ * * ...You get the idea - see the code for more details...
+ *
+ * Some concrete examples from the layout above:
+ * * Offset 0 in the file is offset 0 in chunk 0, which is offset 0 in
+ *   strip 0
+ * * Offset 4MiB in the file is offset 0 in chunk 2, which is offset 0 in
+ *   strip 2
+ * * Offset 15MiB in the file is offset 1MiB in chunk 7, which is offset
+ *   3MiB in strip 3
+ *
+ * Notes about this metadata format:
+ *
+ * * For various reasons, chunk_size must be a multiple of the applicable
+ *   PAGE_SIZE
+ * * Since chunk_size and nstrips are constant within an interleaved_extent,
+ *   resolving a file offset to a strip offset within a single
+ *   interleaved_ext is order 1.
+ * * If nstrips==1, a list of interleaved_ext structures degenerates to a
+ *   regular extent list (albeit with some wasted struct space).
+ */
+
 /*
- * These structures are the in-memory metadata format for famfs files. Metadata
- * retrieved via the GET_FMAP response is converted to this format for use in
- * resolving file mapping faults.
+ * The structures below are the in-memory metadata format for famfs files.
+ * Metadata retrieved via the GET_FMAP response is converted to this format
+ * for use in  resolving file mapping faults.
+ *
+ * The GET_FMAP response contains the same information, but in a more
+ * message-and-versioning-friendly format. Those structs can be found in the
+ * famfs section of include/uapi/linux/fuse.h (aka fuse_kernel.h in libfuse)
  */
 
 enum famfs_file_type {
@@ -19,7 +96,7 @@ enum famfs_file_type {
 	FAMFS_LOG,
 };
 
-/* We anticipate the possiblity of supporting additional types of extents */
+/* We anticipate the possibility of supporting additional types of extents */
 enum famfs_extent_type {
 	SIMPLE_DAX_EXTENT,
 	INTERLEAVED_EXTENT,
@@ -63,7 +140,7 @@ struct famfs_file_meta {
 /**
  * famfs_daxdev - tracking struct for a daxdev within a famfs file system
  *
- * This is the in-memory daxdev metadata that is populated by
+ * This is the in-memory daxdev metadata that is populated by parsing
  * the responses to GET_FMAP messages
  */
 struct famfs_daxdev {
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [RFC V2 18/18] famfs_fuse: Add documentation
  2025-07-03 18:50 [RFC V2 00/18] famfs: port into fuse John Groves
                   ` (16 preceding siblings ...)
  2025-07-03 18:50 ` [RFC V2 17/18] famfs_fuse: Add famfs metadata documentation John Groves
@ 2025-07-03 18:50 ` John Groves
  2025-07-04  0:27   ` Bagas Sanjaya
                     ` (2 more replies)
  2025-07-03 18:56 ` [RFC V2 00/18] famfs: port into fuse John Groves
  18 siblings, 3 replies; 91+ messages in thread
From: John Groves @ 2025-07-03 18:50 UTC (permalink / raw)
  To: John Groves, Dan Williams, Miklos Szeredi, Bernd Schubert
  Cc: John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Darrick J . Wong, Randy Dunlap, Jeff Layton, Kent Overstreet,
	linux-doc, linux-kernel, nvdimm, linux-cxl, linux-fsdevel,
	Amir Goldstein, Jonathan Cameron, Stefan Hajnoczi, Joanne Koong,
	Josef Bacik, Aravind Ramesh, Ajay Joshi, John Groves

Add Documentation/filesystems/famfs.rst and update MAINTAINERS

Signed-off-by: John Groves <john@groves.net>
---
 Documentation/filesystems/famfs.rst | 142 ++++++++++++++++++++++++++++
 Documentation/filesystems/index.rst |   1 +
 MAINTAINERS                         |   1 +
 3 files changed, 144 insertions(+)
 create mode 100644 Documentation/filesystems/famfs.rst

diff --git a/Documentation/filesystems/famfs.rst b/Documentation/filesystems/famfs.rst
new file mode 100644
index 000000000000..0d3c9ba9b7a8
--- /dev/null
+++ b/Documentation/filesystems/famfs.rst
@@ -0,0 +1,142 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+.. _famfs_index:
+
+==================================================================
+famfs: The fabric-attached memory file system
+==================================================================
+
+- Copyright (C) 2024-2025 Micron Technology, Inc.
+
+Introduction
+============
+Compute Express Link (CXL) provides a mechanism for disaggregated or
+fabric-attached memory (FAM). This creates opportunities for data sharing;
+clustered apps that would otherwise have to shard or replicate data can
+share one copy in disaggregated memory.
+
+Famfs, which is not CXL-specific in any way, provides a mechanism for
+multiple hosts to concurrently access data in shared memory, by giving it
+a file system interface. With famfs, any app that understands files can
+access data sets in shared memory. Although famfs supports read and write,
+the real point is to support mmap, which provides direct (dax) access to
+the memory - either writable or read-only.
+
+Shared memory can pose complex coherency and synchronization issues, but
+there are also simple cases. Two simple and eminently useful patterns that
+occur frequently in data analytics and AI are:
+
+* Serial Sharing - Only one host or process at a time has access to a file
+* Read-only Sharing - Multiple hosts or processes share read-only access
+  to a file
+
+The famfs fuse file system is part of the famfs framework; user space
+components [1] handle metadata allocation and distribution, and provide a
+low-level fuse server to expose files that map directly to [presumably
+shared] memory.
+
+The famfs framework manages coherency of its own metadata and structures,
+but does not attempt to manage coherency for applications.
+
+Famfs also provides data isolation between files. That is, even though
+the host has access to an entire memory "device" (as a devdax device), apps
+cannot write to memory for which the file is read-only, and mapping one
+file provides isolation from the memory of all other files. This is pretty
+basic, but some experimental shared memory usage patterns provide no such
+isolation.
+
+Principles of Operation
+=======================
+
+Famfs is a file system with one or more devdax devices as a first-class
+backing device(s). Metadata maintenance and query operations happen
+entirely in user space.
+
+The famfs low-level fuse server daemon provides file maps (fmaps) and
+devdax device info to the fuse/famfs kernel component so that
+read/write/mapping faults can be handled without up-calls for all active
+files.
+
+The famfs user space is responsible for maintaining and distributing
+consistent metadata. This is currently handled via an append-only
+metadata log within the memory, but this is orthogonal to the fuse/famfs
+kernel code.
+
+Once instantiated, "the same file" on each host points to the same shared
+memory, but in-memory metadata (inodes, etc.) is ephemeral on each host
+that has a famfs instance mounted. Use cases are free to allow or not
+allow mutations to data on a file-by-file basis.
+
+When an app accesses a data object in a famfs file, there is no page cache
+involvement. The CPU cache is loaded directly from the shared memory. In
+some use cases, this is an enormous reduction read amplification compared
+to loading an entire page into the page cache.
+
+
+Famfs is Not a Conventional File System
+---------------------------------------
+
+Famfs files can be accessed by conventional means, but there are
+limitations. The kernel component of fuse/famfs is not involved in the
+allocation of backing memory for files at all; the famfs user space
+creates files and responds as a low-level fuse server with fmaps and
+devdax device info upon request.
+
+Famfs differs in some important ways from conventional file systems:
+
+* Files must be pre-allocated by the famfs framework; allocation is never
+  performed on (or after) write.
+* Any operation that changes a file's size is considered to put the file
+  in an invalid state, disabling access to the data. It may be possible to
+  revisit this in the future. (Typically the famfs user space can restore
+  files to a valid state by replaying the famfs metadata log.)
+
+Famfs exists to apply the existing file system abstractions to shared
+memory so applications and workflows can more easily adapt to an
+environment with disaggregated shared memory.
+
+Memory Error Handling
+=====================
+
+Possible memory errors include timeouts, poison and unexpected
+reconfiguration of an underlying dax device. In all of these cases, famfs
+receives a call from the devdax layer via its iomap_ops->notify_failure()
+function. If any memory errors have been detected, access to the affected
+daxdev is disabled to avoid further errors or corruption.
+
+In all known cases, famfs can be unmounted cleanly. In most cases errors
+can be cleared by re-initializing the memory - at which point a new famfs
+file system can be created.
+
+Key Requirements
+================
+
+The primary requirements for famfs are:
+
+1. Must support a file system abstraction backed by sharable devdax memory
+2. Files must efficiently handle VMA faults
+3. Must support metadata distribution in a sharable way
+4. Must handle clients with a stale copy of metadata
+
+The famfs kernel component takes care of 1-2 above by caching each file's
+mapping metadata in the kernel.
+
+Requirements 3 and 4 are handled by the user space components, and are
+largely orthogonal to the functionality of the famfs kernel module.
+
+Requirements 3 and 4 cannot be met by conventional fs-dax file systems
+(e.g. xfs) because they use write-back metadata; it is not valid to mount
+such a file system on two hosts from the same in-memory image.
+
+
+Famfs Usage
+===========
+
+Famfs usage is documented at [1].
+
+
+References
+==========
+
+- [1] Famfs user space repository and documentation
+      https://github.com/cxl-micron-reskit/famfs
diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst
index 2636f2a41bd3..5aad315206ee 100644
--- a/Documentation/filesystems/index.rst
+++ b/Documentation/filesystems/index.rst
@@ -90,6 +90,7 @@ Documentation for filesystem implementations.
    ext3
    ext4/index
    f2fs
+   famfs
    gfs2
    gfs2-uevents
    gfs2-glocks
diff --git a/MAINTAINERS b/MAINTAINERS
index 02688f27a4d0..faa7de4a43de 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8814,6 +8814,7 @@ M:	John Groves <John@Groves.net>
 L:	linux-cxl@vger.kernel.org
 L:	linux-fsdevel@vger.kernel.org
 S:	Supported
+F:	Documentation/filesystems/famfs.rst
 F:	fs/fuse/famfs.c
 F:	fs/fuse/famfs_kfmap.h
 
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 91+ messages in thread

* Re: [RFC V2 00/18] famfs: port into fuse
  2025-07-03 18:50 [RFC V2 00/18] famfs: port into fuse John Groves
                   ` (17 preceding siblings ...)
  2025-07-03 18:50 ` [RFC V2 18/18] famfs_fuse: Add documentation John Groves
@ 2025-07-03 18:56 ` John Groves
  2025-07-09  3:26   ` Miklos Szeredi
  18 siblings, 1 reply; 91+ messages in thread
From: John Groves @ 2025-07-03 18:56 UTC (permalink / raw)
  To: Dan Williams, Miklos Szeredi, Bernd Schubert
  Cc: John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Darrick J . Wong, Randy Dunlap, Jeff Layton, Kent Overstreet,
	linux-doc, linux-kernel, nvdimm, linux-cxl, linux-fsdevel,
	Amir Goldstein, Jonathan Cameron, Stefan Hajnoczi, Joanne Koong,
	Josef Bacik, Aravind Ramesh, Ajay Joshi

DERP: I did it again; Miklos' email is wrong in this series. 
Forwarding to him...

John


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC V2 10/18] famfs_fuse: Basic fuse kernel ABI enablement for famfs
  2025-07-03 18:50 ` [RFC V2 10/18] famfs_fuse: Basic fuse kernel ABI enablement for famfs John Groves
@ 2025-07-03 22:45   ` John Groves
  2025-07-07 17:32     ` Darrick J. Wong
  2025-07-04  7:54   ` Amir Goldstein
  1 sibling, 1 reply; 91+ messages in thread
From: John Groves @ 2025-07-03 22:45 UTC (permalink / raw)
  To: Dan Williams, Miklos Szeredi, Bernd Schubert
  Cc: John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Darrick J . Wong, Randy Dunlap, Jeff Layton, Kent Overstreet,
	linux-doc, linux-kernel, nvdimm, linux-cxl, linux-fsdevel,
	Amir Goldstein, Jonathan Cameron, Stefan Hajnoczi, Joanne Koong,
	Josef Bacik, Aravind Ramesh, Ajay Joshi

On 25/07/03 01:50PM, John Groves wrote:
> * FUSE_DAX_FMAP flag in INIT request/reply
> 
> * fuse_conn->famfs_iomap (enable famfs-mapped files) to denote a
>   famfs-enabled connection
> 
> Signed-off-by: John Groves <john@groves.net>
> ---
>  fs/fuse/fuse_i.h          |  3 +++
>  fs/fuse/inode.c           | 14 ++++++++++++++
>  include/uapi/linux/fuse.h |  4 ++++
>  3 files changed, 21 insertions(+)
> 
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index 9d87ac48d724..a592c1002861 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -873,6 +873,9 @@ struct fuse_conn {
>  	/* Use io_uring for communication */
>  	unsigned int io_uring;
>  
> +	/* dev_dax_iomap support for famfs */
> +	unsigned int famfs_iomap:1;
> +
>  	/** Maximum stack depth for passthrough backing files */
>  	int max_stack_depth;
>  
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index 29147657a99f..e48e11c3f9f3 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -1392,6 +1392,18 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
>  			}
>  			if (flags & FUSE_OVER_IO_URING && fuse_uring_enabled())
>  				fc->io_uring = 1;
> +			if (IS_ENABLED(CONFIG_FUSE_FAMFS_DAX) &&
> +			    flags & FUSE_DAX_FMAP) {
> +				/* XXX: Should also check that fuse server
> +				 * has CAP_SYS_RAWIO and/or CAP_SYS_ADMIN,
> +				 * since it is directing the kernel to access
> +				 * dax memory directly - but this function
> +				 * appears not to be called in fuse server
> +				 * process context (b/c even if it drops
> +				 * those capabilities, they are held here).
> +				 */
> +				fc->famfs_iomap = 1;

I think there should be a check here that the fuse server is 
capable(CAP_SYS_RAWIO) (or maybe CAP_SYS_ADMIN), but this function doesn't 
run in fuse server context. A famfs fuse server is providing fmaps, which 
map files to devdax memory, which should not be an unprivileged operation.

1) Does fs/fuse already store the capabilities of the fuse server?
2) If not, where do you suggest I do that, and where do you suggest I store
that info? The only dead-obvious place (to me) that fs/fuse runs in server
context is in fuse_dev_open(), but it doesn't store anything...

@Miklos, I'd appreciate your advice here.

Thanks!
John


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC V2 18/18] famfs_fuse: Add documentation
  2025-07-03 18:50 ` [RFC V2 18/18] famfs_fuse: Add documentation John Groves
@ 2025-07-04  0:27   ` Bagas Sanjaya
  2025-07-04  2:22     ` Jonathan Corbet
  2025-07-04  6:09   ` Randy Dunlap
  2025-07-04  8:27   ` Amir Goldstein
  2 siblings, 1 reply; 91+ messages in thread
From: Bagas Sanjaya @ 2025-07-04  0:27 UTC (permalink / raw)
  To: John Groves, Dan Williams, Miklos Szeredi, Bernd Schubert
  Cc: John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Darrick J . Wong, Randy Dunlap, Jeff Layton, Kent Overstreet,
	linux-doc, linux-kernel, nvdimm, linux-cxl, linux-fsdevel,
	Amir Goldstein, Jonathan Cameron, Stefan Hajnoczi, Joanne Koong,
	Josef Bacik, Aravind Ramesh, Ajay Joshi

[-- Attachment #1: Type: text/plain, Size: 568 bytes --]

On Thu, Jul 03, 2025 at 01:50:32PM -0500, John Groves wrote:
> +Requirements 3 and 4 are handled by the user space components, and are
> +largely orthogonal to the functionality of the famfs kernel module.
> +
> +Requirements 3 and 4 cannot be met by conventional fs-dax file systems

"Such requirements, however, cannot be met by ..."

> +(e.g. xfs) because they use write-back metadata; it is not valid to mount
> +such a file system on two hosts from the same in-memory image.
> +

Thanks.

-- 
An old man doll... just what I always wanted! - Clara

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC V2 18/18] famfs_fuse: Add documentation
  2025-07-04  0:27   ` Bagas Sanjaya
@ 2025-07-04  2:22     ` Jonathan Corbet
  2025-07-04  3:53       ` Bagas Sanjaya
  0 siblings, 1 reply; 91+ messages in thread
From: Jonathan Corbet @ 2025-07-04  2:22 UTC (permalink / raw)
  To: Bagas Sanjaya, John Groves, Dan Williams, Miklos Szeredi,
	Bernd Schubert
  Cc: John Groves, Vishal Verma, Dave Jiang, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Darrick J . Wong, Randy Dunlap,
	Jeff Layton, Kent Overstreet, linux-doc, linux-kernel, nvdimm,
	linux-cxl, linux-fsdevel, Amir Goldstein, Jonathan Cameron,
	Stefan Hajnoczi, Joanne Koong, Josef Bacik, Aravind Ramesh,
	Ajay Joshi

Bagas Sanjaya <bagasdotme@gmail.com> writes:

> On Thu, Jul 03, 2025 at 01:50:32PM -0500, John Groves wrote:
>> +Requirements 3 and 4 are handled by the user space components, and are
>> +largely orthogonal to the functionality of the famfs kernel module.
>> +
>> +Requirements 3 and 4 cannot be met by conventional fs-dax file systems
>
> "Such requirements, however, cannot be met by ..."

Bagas.  Stop.

John has written documentation, that is great.  Do not add needless
friction to this process.  Seriously.

Why do I have to keep telling you this?

jon

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC V2 18/18] famfs_fuse: Add documentation
  2025-07-04  2:22     ` Jonathan Corbet
@ 2025-07-04  3:53       ` Bagas Sanjaya
  2025-07-04 18:58         ` Matthew Wilcox
  0 siblings, 1 reply; 91+ messages in thread
From: Bagas Sanjaya @ 2025-07-04  3:53 UTC (permalink / raw)
  To: Jonathan Corbet, John Groves, Dan Williams, Miklos Szeredi,
	Bernd Schubert
  Cc: John Groves, Vishal Verma, Dave Jiang, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Darrick J . Wong, Randy Dunlap,
	Jeff Layton, Kent Overstreet, linux-doc, linux-kernel, nvdimm,
	linux-cxl, linux-fsdevel, Amir Goldstein, Jonathan Cameron,
	Stefan Hajnoczi, Joanne Koong, Josef Bacik, Aravind Ramesh,
	Ajay Joshi

On Thu, Jul 03, 2025 at 08:22:58PM -0600, Jonathan Corbet wrote:
> Bagas.  Stop.
> 
> John has written documentation, that is great.  Do not add needless
> friction to this process.  Seriously.
> 
> Why do I have to keep telling you this?

Cause I'm more of perfectionist (detail-oriented)...

-- 
An old man doll... just what I always wanted! - Clara

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC V2 18/18] famfs_fuse: Add documentation
  2025-07-03 18:50 ` [RFC V2 18/18] famfs_fuse: Add documentation John Groves
  2025-07-04  0:27   ` Bagas Sanjaya
@ 2025-07-04  6:09   ` Randy Dunlap
  2025-07-04  8:27   ` Amir Goldstein
  2 siblings, 0 replies; 91+ messages in thread
From: Randy Dunlap @ 2025-07-04  6:09 UTC (permalink / raw)
  To: John Groves, Dan Williams, Miklos Szeredi, Bernd Schubert
  Cc: John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Darrick J . Wong, Jeff Layton, Kent Overstreet, linux-doc,
	linux-kernel, nvdimm, linux-cxl, linux-fsdevel, Amir Goldstein,
	Jonathan Cameron, Stefan Hajnoczi, Joanne Koong, Josef Bacik,
	Aravind Ramesh, Ajay Joshi



On 7/3/25 11:50 AM, John Groves wrote:
> Add Documentation/filesystems/famfs.rst and update MAINTAINERS
> 
> Signed-off-by: John Groves <john@groves.net>
> ---
>  Documentation/filesystems/famfs.rst | 142 ++++++++++++++++++++++++++++
>  Documentation/filesystems/index.rst |   1 +
>  MAINTAINERS                         |   1 +
>  3 files changed, 144 insertions(+)
>  create mode 100644 Documentation/filesystems/famfs.rst
> 


Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org>

Thanks.


-- 
~Randy

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC V2 10/18] famfs_fuse: Basic fuse kernel ABI enablement for famfs
  2025-07-03 18:50 ` [RFC V2 10/18] famfs_fuse: Basic fuse kernel ABI enablement for famfs John Groves
  2025-07-03 22:45   ` John Groves
@ 2025-07-04  7:54   ` Amir Goldstein
  2025-07-04 13:39     ` John Groves
  1 sibling, 1 reply; 91+ messages in thread
From: Amir Goldstein @ 2025-07-04  7:54 UTC (permalink / raw)
  To: John Groves, Darrick J . Wong
  Cc: Dan Williams, Miklos Szeredi, Bernd Schubert, John Groves,
	Jonathan Corbet, Vishal Verma, Dave Jiang, Matthew Wilcox,
	Jan Kara, Alexander Viro, Christian Brauner, Randy Dunlap,
	Jeff Layton, Kent Overstreet, linux-doc, linux-kernel, nvdimm,
	linux-cxl, linux-fsdevel, Jonathan Cameron, Stefan Hajnoczi,
	Joanne Koong, Josef Bacik, Aravind Ramesh, Ajay Joshi

On Thu, Jul 3, 2025 at 8:51 PM John Groves <John@groves.net> wrote:
>
> * FUSE_DAX_FMAP flag in INIT request/reply
>
> * fuse_conn->famfs_iomap (enable famfs-mapped files) to denote a
>   famfs-enabled connection
>
> Signed-off-by: John Groves <john@groves.net>
> ---
>  fs/fuse/fuse_i.h          |  3 +++
>  fs/fuse/inode.c           | 14 ++++++++++++++
>  include/uapi/linux/fuse.h |  4 ++++
>  3 files changed, 21 insertions(+)
>
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index 9d87ac48d724..a592c1002861 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -873,6 +873,9 @@ struct fuse_conn {
>         /* Use io_uring for communication */
>         unsigned int io_uring;
>
> +       /* dev_dax_iomap support for famfs */
> +       unsigned int famfs_iomap:1;
> +

pls move up to the bit fields members.

>         /** Maximum stack depth for passthrough backing files */
>         int max_stack_depth;
>
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index 29147657a99f..e48e11c3f9f3 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -1392,6 +1392,18 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
>                         }
>                         if (flags & FUSE_OVER_IO_URING && fuse_uring_enabled())
>                                 fc->io_uring = 1;
> +                       if (IS_ENABLED(CONFIG_FUSE_FAMFS_DAX) &&
> +                           flags & FUSE_DAX_FMAP) {
> +                               /* XXX: Should also check that fuse server
> +                                * has CAP_SYS_RAWIO and/or CAP_SYS_ADMIN,
> +                                * since it is directing the kernel to access
> +                                * dax memory directly - but this function
> +                                * appears not to be called in fuse server
> +                                * process context (b/c even if it drops
> +                                * those capabilities, they are held here).
> +                                */
> +                               fc->famfs_iomap = 1;
> +                       }

1. As long as the mapping requests are checking capabilities we should be ok
    Right?
2. What's the deal with capable(CAP_SYS_ADMIN) in process_init_limits then?
3. Darrick mentioned the need for a synchronic INIT variant for his work on
    blockdev iomap support [1]

I also wonder how much of your patches and Darrick's patches end up
being an overlap?

Thanks,
Amir.

[1] https://lore.kernel.org/linux-fsdevel/20250613174413.GM6138@frogsfrogsfrogs/

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC V2 18/18] famfs_fuse: Add documentation
  2025-07-03 18:50 ` [RFC V2 18/18] famfs_fuse: Add documentation John Groves
  2025-07-04  0:27   ` Bagas Sanjaya
  2025-07-04  6:09   ` Randy Dunlap
@ 2025-07-04  8:27   ` Amir Goldstein
  2025-07-04 23:36     ` Bagas Sanjaya
  2 siblings, 1 reply; 91+ messages in thread
From: Amir Goldstein @ 2025-07-04  8:27 UTC (permalink / raw)
  To: John Groves
  Cc: Dan Williams, Bernd Schubert, John Groves, Jonathan Corbet,
	Vishal Verma, Dave Jiang, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Darrick J . Wong, Randy Dunlap,
	Jeff Layton, Kent Overstreet, linux-doc, linux-kernel, nvdimm,
	linux-cxl, linux-fsdevel, Jonathan Cameron, Stefan Hajnoczi,
	Joanne Koong, Josef Bacik, Aravind Ramesh, Ajay Joshi,
	Miklos Szeredi

On Thu, Jul 3, 2025 at 8:51 PM John Groves <John@groves.net> wrote:
>
> Add Documentation/filesystems/famfs.rst and update MAINTAINERS
>
> Signed-off-by: John Groves <john@groves.net>
> ---
>  Documentation/filesystems/famfs.rst | 142 ++++++++++++++++++++++++++++
>  Documentation/filesystems/index.rst |   1 +
>  MAINTAINERS                         |   1 +
>  3 files changed, 144 insertions(+)
>  create mode 100644 Documentation/filesystems/famfs.rst


Considering "Documentation: fuse: Consolidate FUSE docs into its own
subdirectory"
https://lore.kernel.org/linux-fsdevel/20250612032239.17561-1-bagasdotme@gmail.com/

I wonder if famfs and virtiofs should be moved into fuse subdir?
To me it makes more sense, but it's not a clear cut.

>
> diff --git a/Documentation/filesystems/famfs.rst b/Documentation/filesystems/famfs.rst
> new file mode 100644
> index 000000000000..0d3c9ba9b7a8
> --- /dev/null
> +++ b/Documentation/filesystems/famfs.rst
> @@ -0,0 +1,142 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +.. _famfs_index:
> +
> +==================================================================
> +famfs: The fabric-attached memory file system
> +==================================================================
> +
> +- Copyright (C) 2024-2025 Micron Technology, Inc.
> +
> +Introduction
> +============
> +Compute Express Link (CXL) provides a mechanism for disaggregated or
> +fabric-attached memory (FAM). This creates opportunities for data sharing;
> +clustered apps that would otherwise have to shard or replicate data can
> +share one copy in disaggregated memory.
> +
> +Famfs, which is not CXL-specific in any way, provides a mechanism for
> +multiple hosts to concurrently access data in shared memory, by giving it
> +a file system interface. With famfs, any app that understands files can
> +access data sets in shared memory. Although famfs supports read and write,
> +the real point is to support mmap, which provides direct (dax) access to
> +the memory - either writable or read-only.
> +
> +Shared memory can pose complex coherency and synchronization issues, but
> +there are also simple cases. Two simple and eminently useful patterns that
> +occur frequently in data analytics and AI are:
> +
> +* Serial Sharing - Only one host or process at a time has access to a file
> +* Read-only Sharing - Multiple hosts or processes share read-only access
> +  to a file
> +
> +The famfs fuse file system is part of the famfs framework; user space
> +components [1] handle metadata allocation and distribution, and provide a
> +low-level fuse server to expose files that map directly to [presumably
> +shared] memory.
> +
> +The famfs framework manages coherency of its own metadata and structures,
> +but does not attempt to manage coherency for applications.
> +
> +Famfs also provides data isolation between files. That is, even though
> +the host has access to an entire memory "device" (as a devdax device), apps
> +cannot write to memory for which the file is read-only, and mapping one
> +file provides isolation from the memory of all other files. This is pretty
> +basic, but some experimental shared memory usage patterns provide no such
> +isolation.
> +
> +Principles of Operation
> +=======================
> +
> +Famfs is a file system with one or more devdax devices as a first-class
> +backing device(s). Metadata maintenance and query operations happen
> +entirely in user space.
> +
> +The famfs low-level fuse server daemon provides file maps (fmaps) and
> +devdax device info to the fuse/famfs kernel component so that
> +read/write/mapping faults can be handled without up-calls for all active
> +files.
> +
> +The famfs user space is responsible for maintaining and distributing
> +consistent metadata. This is currently handled via an append-only
> +metadata log within the memory, but this is orthogonal to the fuse/famfs
> +kernel code.
> +
> +Once instantiated, "the same file" on each host points to the same shared
> +memory, but in-memory metadata (inodes, etc.) is ephemeral on each host
> +that has a famfs instance mounted. Use cases are free to allow or not
> +allow mutations to data on a file-by-file basis.
> +
> +When an app accesses a data object in a famfs file, there is no page cache
> +involvement. The CPU cache is loaded directly from the shared memory. In
> +some use cases, this is an enormous reduction read amplification compared
> +to loading an entire page into the page cache.
> +
> +
> +Famfs is Not a Conventional File System
> +---------------------------------------
> +
> +Famfs files can be accessed by conventional means, but there are
> +limitations. The kernel component of fuse/famfs is not involved in the
> +allocation of backing memory for files at all; the famfs user space
> +creates files and responds as a low-level fuse server with fmaps and
> +devdax device info upon request.
> +
> +Famfs differs in some important ways from conventional file systems:
> +
> +* Files must be pre-allocated by the famfs framework; allocation is never
> +  performed on (or after) write.
> +* Any operation that changes a file's size is considered to put the file
> +  in an invalid state, disabling access to the data. It may be possible to
> +  revisit this in the future. (Typically the famfs user space can restore
> +  files to a valid state by replaying the famfs metadata log.)
> +
> +Famfs exists to apply the existing file system abstractions to shared
> +memory so applications and workflows can more easily adapt to an
> +environment with disaggregated shared memory.
> +
> +Memory Error Handling
> +=====================
> +
> +Possible memory errors include timeouts, poison and unexpected
> +reconfiguration of an underlying dax device. In all of these cases, famfs
> +receives a call from the devdax layer via its iomap_ops->notify_failure()
> +function. If any memory errors have been detected, access to the affected
> +daxdev is disabled to avoid further errors or corruption.
> +
> +In all known cases, famfs can be unmounted cleanly. In most cases errors
> +can be cleared by re-initializing the memory - at which point a new famfs
> +file system can be created.
> +
> +Key Requirements
> +================
> +
> +The primary requirements for famfs are:
> +
> +1. Must support a file system abstraction backed by sharable devdax memory
> +2. Files must efficiently handle VMA faults
> +3. Must support metadata distribution in a sharable way
> +4. Must handle clients with a stale copy of metadata
> +
> +The famfs kernel component takes care of 1-2 above by caching each file's
> +mapping metadata in the kernel.
> +
> +Requirements 3 and 4 are handled by the user space components, and are
> +largely orthogonal to the functionality of the famfs kernel module.
> +
> +Requirements 3 and 4 cannot be met by conventional fs-dax file systems
> +(e.g. xfs) because they use write-back metadata; it is not valid to mount
> +such a file system on two hosts from the same in-memory image.
> +
> +
> +Famfs Usage
> +===========
> +
> +Famfs usage is documented at [1].
> +
> +
> +References
> +==========
> +
> +- [1] Famfs user space repository and documentation
> +      https://github.com/cxl-micron-reskit/famfs
> diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst
> index 2636f2a41bd3..5aad315206ee 100644
> --- a/Documentation/filesystems/index.rst
> +++ b/Documentation/filesystems/index.rst
> @@ -90,6 +90,7 @@ Documentation for filesystem implementations.
>     ext3
>     ext4/index
>     f2fs
> +   famfs
>     gfs2
>     gfs2-uevents
>     gfs2-glocks
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 02688f27a4d0..faa7de4a43de 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -8814,6 +8814,7 @@ M:        John Groves <John@Groves.net>
>  L:     linux-cxl@vger.kernel.org
>  L:     linux-fsdevel@vger.kernel.org
>  S:     Supported
> +F:     Documentation/filesystems/famfs.rst
>  F:     fs/fuse/famfs.c
>  F:     fs/fuse/famfs_kfmap.h
>
> --
> 2.49.0
>

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC V2 09/18] famfs_fuse: Update macro s/FUSE_IS_DAX/FUSE_IS_VIRTIO_DAX/
  2025-07-03 18:50 ` [RFC V2 09/18] famfs_fuse: Update macro s/FUSE_IS_DAX/FUSE_IS_VIRTIO_DAX/ John Groves
@ 2025-07-04  8:44   ` Amir Goldstein
  0 siblings, 0 replies; 91+ messages in thread
From: Amir Goldstein @ 2025-07-04  8:44 UTC (permalink / raw)
  To: John Groves
  Cc: Dan Williams, Bernd Schubert, John Groves, Jonathan Corbet,
	Vishal Verma, Dave Jiang, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Darrick J . Wong, Randy Dunlap,
	Jeff Layton, Kent Overstreet, linux-doc, linux-kernel, nvdimm,
	linux-cxl, linux-fsdevel, Jonathan Cameron, Stefan Hajnoczi,
	Joanne Koong, Josef Bacik, Aravind Ramesh, Ajay Joshi,
	Miklos Szeredi

On Thu, Jul 3, 2025 at 8:51 PM John Groves <John@groves.net> wrote:
>
> Virtio_fs now needs to determine if an inode is DAX && not famfs.
>
> Signed-off-by: John Groves <john@groves.net>
> ---
>  fs/fuse/dir.c    |  2 +-
>  fs/fuse/file.c   | 13 ++++++++-----
>  fs/fuse/fuse_i.h |  6 +++++-
>  fs/fuse/inode.c  |  2 +-
>  fs/fuse/iomode.c |  2 +-
>  5 files changed, 16 insertions(+), 9 deletions(-)
>
> diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
> index 8f699c67561f..ad8cdf7b864a 100644
> --- a/fs/fuse/dir.c
> +++ b/fs/fuse/dir.c
> @@ -1939,7 +1939,7 @@ int fuse_do_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
>                 is_truncate = true;
>         }
>
> -       if (FUSE_IS_DAX(inode) && is_truncate) {
> +       if (FUSE_IS_VIRTIO_DAX(fi) && is_truncate) {
>                 filemap_invalidate_lock(mapping);
>                 fault_blocked = true;
>                 err = fuse_dax_break_layouts(inode, 0, -1);
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 754378dd9f71..93b82660f0c8 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -239,7 +239,7 @@ static int fuse_open(struct inode *inode, struct file *file)
>         int err;
>         bool is_truncate = (file->f_flags & O_TRUNC) && fc->atomic_o_trunc;
>         bool is_wb_truncate = is_truncate && fc->writeback_cache;
> -       bool dax_truncate = is_truncate && FUSE_IS_DAX(inode);
> +       bool dax_truncate = is_truncate && FUSE_IS_VIRTIO_DAX(fi);
>
>         if (fuse_is_bad(inode))
>                 return -EIO;
> @@ -1770,11 +1770,12 @@ static ssize_t fuse_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
>         struct file *file = iocb->ki_filp;
>         struct fuse_file *ff = file->private_data;
>         struct inode *inode = file_inode(file);
> +       struct fuse_inode *fi = get_fuse_inode(inode);
>
>         if (fuse_is_bad(inode))
>                 return -EIO;
>
> -       if (FUSE_IS_DAX(inode))
> +       if (FUSE_IS_VIRTIO_DAX(fi))
>                 return fuse_dax_read_iter(iocb, to);
>
>         /* FOPEN_DIRECT_IO overrides FOPEN_PASSTHROUGH */
> @@ -1791,11 +1792,12 @@ static ssize_t fuse_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
>         struct file *file = iocb->ki_filp;
>         struct fuse_file *ff = file->private_data;
>         struct inode *inode = file_inode(file);
> +       struct fuse_inode *fi = get_fuse_inode(inode);
>
>         if (fuse_is_bad(inode))
>                 return -EIO;
>
> -       if (FUSE_IS_DAX(inode))
> +       if (FUSE_IS_VIRTIO_DAX(fi))
>                 return fuse_dax_write_iter(iocb, from);
>
>         /* FOPEN_DIRECT_IO overrides FOPEN_PASSTHROUGH */
> @@ -2627,10 +2629,11 @@ static int fuse_file_mmap(struct file *file, struct vm_area_struct *vma)
>         struct fuse_file *ff = file->private_data;
>         struct fuse_conn *fc = ff->fm->fc;
>         struct inode *inode = file_inode(file);
> +       struct fuse_inode *fi = get_fuse_inode(inode);
>         int rc;
>
>         /* DAX mmap is superior to direct_io mmap */
> -       if (FUSE_IS_DAX(inode))
> +       if (FUSE_IS_VIRTIO_DAX(fi))
>                 return fuse_dax_mmap(file, vma);
>
>         /*
> @@ -3191,7 +3194,7 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
>                 .mode = mode
>         };
>         int err;
> -       bool block_faults = FUSE_IS_DAX(inode) &&
> +       bool block_faults = FUSE_IS_VIRTIO_DAX(fi) &&
>                 (!(mode & FALLOC_FL_KEEP_SIZE) ||
>                  (mode & (FALLOC_FL_PUNCH_HOLE | FALLOC_FL_ZERO_RANGE)));
>
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index 2086dac7243b..9d87ac48d724 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -1426,7 +1426,11 @@ void fuse_free_conn(struct fuse_conn *fc);
>
>  /* dax.c */
>
> -#define FUSE_IS_DAX(inode) (IS_ENABLED(CONFIG_FUSE_DAX) && IS_DAX(inode))
> +/* This macro is used by virtio_fs, but now it also needs to filter for
> + * "not famfs"
> + */
> +#define FUSE_IS_VIRTIO_DAX(fuse_inode) (IS_ENABLED(CONFIG_FUSE_DAX)    \
> +                                       && IS_DAX(&fuse_inode->inode))

I think we should take this opportunity to make it
static inline fuse_is_virtio_dax()

and because some of the call site really do want to know if this is dax
should also leave a helper

static inline fuse_is_dax()

...

>
>  ssize_t fuse_dax_read_iter(struct kiocb *iocb, struct iov_iter *to);
>  ssize_t fuse_dax_write_iter(struct kiocb *iocb, struct iov_iter *from);
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index e9db2cb8c150..29147657a99f 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -164,7 +164,7 @@ static void fuse_evict_inode(struct inode *inode)
>         if (inode->i_sb->s_flags & SB_ACTIVE) {
>                 struct fuse_conn *fc = get_fuse_conn(inode);
>
> -               if (FUSE_IS_DAX(inode))
> +               if (FUSE_IS_VIRTIO_DAX(fi))
>                         fuse_dax_inode_cleanup(inode);
>                 if (fi->nlookup) {
>                         fuse_queue_forget(fc, fi->forget, fi->nodeid,
> diff --git a/fs/fuse/iomode.c b/fs/fuse/iomode.c
> index c99e285f3183..aec4aecb5d79 100644
> --- a/fs/fuse/iomode.c
> +++ b/fs/fuse/iomode.c
> @@ -204,7 +204,7 @@ int fuse_file_io_open(struct file *file, struct inode *inode)
>          * io modes are not relevant with DAX and with server that does not
>          * implement open.
>          */
> -       if (FUSE_IS_DAX(inode) || !ff->args)
> +       if (FUSE_IS_VIRTIO_DAX(fi) || !ff->args)
>                 return 0;
>

... Like here, later in your patch set you add explicit check for
!fuse_famfs_file()

If you make this into if (fuse_is_dax(fi)
it will be easier and nicer in the end.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC V2 12/18] famfs_fuse: Plumb the GET_FMAP message/response
  2025-07-03 18:50 ` [RFC V2 12/18] famfs_fuse: Plumb the GET_FMAP message/response John Groves
@ 2025-07-04  8:54   ` Amir Goldstein
  2025-07-04 20:30     ` John Groves
  2025-07-09  4:27   ` Darrick J. Wong
  2025-08-14 13:36   ` Miklos Szeredi
  2 siblings, 1 reply; 91+ messages in thread
From: Amir Goldstein @ 2025-07-04  8:54 UTC (permalink / raw)
  To: John Groves
  Cc: Dan Williams, Miklos Szeredi, Bernd Schubert, John Groves,
	Jonathan Corbet, Vishal Verma, Dave Jiang, Matthew Wilcox,
	Jan Kara, Alexander Viro, Christian Brauner, Darrick J . Wong,
	Randy Dunlap, Jeff Layton, Kent Overstreet, linux-doc,
	linux-kernel, nvdimm, linux-cxl, linux-fsdevel, Jonathan Cameron,
	Stefan Hajnoczi, Joanne Koong, Josef Bacik, Aravind Ramesh,
	Ajay Joshi

On Thu, Jul 3, 2025 at 8:51 PM John Groves <John@groves.net> wrote:
>
> Upon completion of an OPEN, if we're in famfs-mode we do a GET_FMAP to
> retrieve and cache up the file-to-dax map in the kernel. If this
> succeeds, read/write/mmap are resolved direct-to-dax with no upcalls.
>
> GET_FMAP has a variable-size response payload, and the allocated size
> is sent in the in_args[0].size field. If the fmap would overflow the
> message, the fuse server sends a reply of size 'sizeof(uint32_t)' which
> specifies the size of the fmap message. Then the kernel can realloc a
> large enough buffer and try again.
>
> Signed-off-by: John Groves <john@groves.net>
> ---
>  fs/fuse/file.c            | 84 +++++++++++++++++++++++++++++++++++++++
>  fs/fuse/fuse_i.h          | 36 ++++++++++++++++-
>  fs/fuse/inode.c           | 19 +++++++--
>  fs/fuse/iomode.c          |  2 +-
>  include/uapi/linux/fuse.h | 18 +++++++++
>  5 files changed, 154 insertions(+), 5 deletions(-)
>
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 93b82660f0c8..8616fb0a6d61 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -230,6 +230,77 @@ static void fuse_truncate_update_attr(struct inode *inode, struct file *file)
>         fuse_invalidate_attr_mask(inode, FUSE_STATX_MODSIZE);
>  }
>
> +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)

We generally try to avoid #ifdef blocks in c files
keep them mostly in h files and use in c files
   if (IS_ENABLED(CONFIG_FUSE_FAMFS_DAX))

also #if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
it a bit strange for a bool Kconfig because it looks too
much like the c code, so I prefer
#ifdef CONFIG_FUSE_FAMFS_DAX
when you have to use it

If you need entire functions compiled out, why not put them in famfs.c?

> +
> +#define FMAP_BUFSIZE 4096
> +
> +static int
> +fuse_get_fmap(struct fuse_mount *fm, struct inode *inode, u64 nodeid)
> +{
> +       struct fuse_get_fmap_in inarg = { 0 };
> +       size_t fmap_bufsize = FMAP_BUFSIZE;
> +       ssize_t fmap_size;
> +       int retries = 1;
> +       void *fmap_buf;
> +       int rc;
> +
> +       FUSE_ARGS(args);
> +
> +       fmap_buf = kcalloc(1, FMAP_BUFSIZE, GFP_KERNEL);
> +       if (!fmap_buf)
> +               return -EIO;
> +
> + retry_once:
> +       inarg.size = fmap_bufsize;
> +
> +       args.opcode = FUSE_GET_FMAP;
> +       args.nodeid = nodeid;
> +
> +       args.in_numargs = 1;
> +       args.in_args[0].size = sizeof(inarg);
> +       args.in_args[0].value = &inarg;
> +
> +       /* Variable-sized output buffer
> +        * this causes fuse_simple_request() to return the size of the
> +        * output payload
> +        */
> +       args.out_argvar = true;
> +       args.out_numargs = 1;
> +       args.out_args[0].size = fmap_bufsize;
> +       args.out_args[0].value = fmap_buf;
> +
> +       /* Send GET_FMAP command */
> +       rc = fuse_simple_request(fm, &args);
> +       if (rc < 0) {
> +               pr_err("%s: err=%d from fuse_simple_request()\n",
> +                      __func__, rc);
> +               return rc;
> +       }
> +       fmap_size = rc;
> +
> +       if (retries && fmap_size == sizeof(uint32_t)) {
> +               /* fmap size exceeded fmap_bufsize;
> +                * actual fmap size returned in fmap_buf;
> +                * realloc and retry once
> +                */
> +               fmap_bufsize = *((uint32_t *)fmap_buf);
> +
> +               --retries;
> +               kfree(fmap_buf);
> +               fmap_buf = kcalloc(1, fmap_bufsize, GFP_KERNEL);
> +               if (!fmap_buf)
> +                       return -EIO;
> +
> +               goto retry_once;
> +       }
> +
> +       /* Will call famfs_file_init_dax() when that gets added */
> +
> +       kfree(fmap_buf);
> +       return 0;
> +}
> +#endif
> +
>  static int fuse_open(struct inode *inode, struct file *file)
>  {
>         struct fuse_mount *fm = get_fuse_mount(inode);
> @@ -263,6 +334,19 @@ static int fuse_open(struct inode *inode, struct file *file)
>
>         err = fuse_do_open(fm, get_node_id(inode), file, false);
>         if (!err) {
> +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> +               if (fm->fc->famfs_iomap) {

That should be
> +               if (IS_ENABLED(CONFIG_FUSE_FAMFS_DAX) &&
> +                   fm->fc->famfs_iomap) {

> +                       if (S_ISREG(inode->i_mode)) {
> +                               int rc;
> +                               /* Get the famfs fmap */
> +                               rc = fuse_get_fmap(fm, inode,
> +                                                  get_node_id(inode));
> +                               if (rc)
> +                                       pr_err("%s: fuse_get_fmap err=%d\n",
> +                                              __func__, rc);
> +                       }
> +               }
> +#endif
>                 ff = file->private_data;
>                 err = fuse_finish_open(inode, file);
>                 if (err)
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index f4ee61046578..e01d6e5c6e93 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -193,6 +193,10 @@ struct fuse_inode {
>         /** Reference to backing file in passthrough mode */
>         struct fuse_backing *fb;
>  #endif
> +
> +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> +       void *famfs_meta;
> +#endif
>  };
>
>  /** FUSE inode state bits */
> @@ -945,6 +949,8 @@ struct fuse_conn {
>  #endif
>
>  #if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> +       struct rw_semaphore famfs_devlist_sem;
> +       struct famfs_dax_devlist *dax_devlist;
>         char *shadow;
>  #endif
>  };
> @@ -1435,11 +1441,14 @@ void fuse_free_conn(struct fuse_conn *fc);
>
>  /* dax.c */
>
> +static inline int fuse_file_famfs(struct fuse_inode *fi); /* forward */
> +
>  /* This macro is used by virtio_fs, but now it also needs to filter for
>   * "not famfs"
>   */
>  #define FUSE_IS_VIRTIO_DAX(fuse_inode) (IS_ENABLED(CONFIG_FUSE_DAX)    \
> -                                       && IS_DAX(&fuse_inode->inode))
> +                                       && IS_DAX(&fuse_inode->inode)   \
> +                                       && !fuse_file_famfs(fuse_inode))
>
>  ssize_t fuse_dax_read_iter(struct kiocb *iocb, struct iov_iter *to);
>  ssize_t fuse_dax_write_iter(struct kiocb *iocb, struct iov_iter *from);
> @@ -1550,4 +1559,29 @@ extern void fuse_sysctl_unregister(void);
>  #define fuse_sysctl_unregister()       do { } while (0)
>  #endif /* CONFIG_SYSCTL */
>
> +/* famfs.c */
> +static inline struct fuse_backing *famfs_meta_set(struct fuse_inode *fi,
> +                                                      void *meta)
> +{
> +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> +       return xchg(&fi->famfs_meta, meta);
> +#else
> +       return NULL;
> +#endif
> +}
> +
> +static inline void famfs_meta_free(struct fuse_inode *fi)
> +{
> +       /* Stub wil be connected in a subsequent commit */
> +}
> +
> +static inline int fuse_file_famfs(struct fuse_inode *fi)
> +{
> +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> +       return (READ_ONCE(fi->famfs_meta) != NULL);
> +#else
> +       return 0;
> +#endif
> +}
> +
>  #endif /* _FS_FUSE_I_H */
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index a7e1cf8257b0..b071d16f7d04 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -117,6 +117,9 @@ static struct inode *fuse_alloc_inode(struct super_block *sb)
>         if (IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
>                 fuse_inode_backing_set(fi, NULL);
>
> +       if (IS_ENABLED(CONFIG_FUSE_FAMFS_DAX))
> +               famfs_meta_set(fi, NULL);
> +
>         return &fi->inode;
>
>  out_free_forget:
> @@ -138,6 +141,13 @@ static void fuse_free_inode(struct inode *inode)
>         if (IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
>                 fuse_backing_put(fuse_inode_backing(fi));
>
> +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> +       if (S_ISREG(inode->i_mode) && fi->famfs_meta) {
> +               famfs_meta_free(fi);
> +               famfs_meta_set(fi, NULL);
> +       }
> +#endif
> +
>         kmem_cache_free(fuse_inode_cachep, fi);
>  }
>
> @@ -1002,6 +1012,9 @@ void fuse_conn_init(struct fuse_conn *fc, struct fuse_mount *fm,
>         if (IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
>                 fuse_backing_files_init(fc);
>
> +       if (IS_ENABLED(CONFIG_FUSE_FAMFS_DAX))
> +               pr_notice("%s: Kernel is FUSE_FAMFS_DAX capable\n", __func__);
> +
>         INIT_LIST_HEAD(&fc->mounts);
>         list_add(&fm->fc_entry, &fc->mounts);
>         fm->fc = fc;
> @@ -1036,9 +1049,8 @@ void fuse_conn_put(struct fuse_conn *fc)
>                 }
>                 if (IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
>                         fuse_backing_files_free(fc);
> -#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> -               kfree(fc->shadow);
> -#endif
> +               if (IS_ENABLED(CONFIG_FUSE_FAMFS_DAX))
> +                       kfree(fc->shadow);
>                 call_rcu(&fc->rcu, delayed_release);
>         }
>  }
> @@ -1425,6 +1437,7 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
>                                  * those capabilities, they are held here).
>                                  */
>                                 fc->famfs_iomap = 1;
> +                               init_rwsem(&fc->famfs_devlist_sem);
>                         }
>                 } else {
>                         ra_pages = fc->max_read / PAGE_SIZE;
> diff --git a/fs/fuse/iomode.c b/fs/fuse/iomode.c
> index aec4aecb5d79..443b337b0c05 100644
> --- a/fs/fuse/iomode.c
> +++ b/fs/fuse/iomode.c
> @@ -204,7 +204,7 @@ int fuse_file_io_open(struct file *file, struct inode *inode)
>          * io modes are not relevant with DAX and with server that does not
>          * implement open.
>          */
> -       if (FUSE_IS_VIRTIO_DAX(fi) || !ff->args)
> +       if (FUSE_IS_VIRTIO_DAX(fi) || fuse_file_famfs(fi) || !ff->args)
>                 return 0;

This is where fuse_is_dax() helper would be handy.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC V2 13/18] famfs_fuse: Create files with famfs fmaps
  2025-07-03 18:50 ` [RFC V2 13/18] famfs_fuse: Create files with famfs fmaps John Groves
@ 2025-07-04  9:01   ` Amir Goldstein
  2025-07-05 19:27     ` John Groves
  0 siblings, 1 reply; 91+ messages in thread
From: Amir Goldstein @ 2025-07-04  9:01 UTC (permalink / raw)
  To: John Groves
  Cc: Dan Williams, Bernd Schubert, John Groves, Jonathan Corbet,
	Vishal Verma, Dave Jiang, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Darrick J . Wong, Randy Dunlap,
	Jeff Layton, Kent Overstreet, linux-doc, linux-kernel, nvdimm,
	linux-cxl, linux-fsdevel, Jonathan Cameron, Stefan Hajnoczi,
	Joanne Koong, Josef Bacik, Aravind Ramesh, Ajay Joshi,
	Miklos Szeredi

On Thu, Jul 3, 2025 at 8:51 PM John Groves <John@groves.net> wrote:
>
> On completion of GET_FMAP message/response, setup the full famfs
> metadata such that it's possible to handle read/write/mmap directly to
> dax. Note that the devdax_iomap plumbing is not in yet...
>
> Update MAINTAINERS for the new files.
>
> Signed-off-by: John Groves <john@groves.net>
> ---
>  MAINTAINERS               |   9 +
>  fs/fuse/Makefile          |   2 +-
>  fs/fuse/famfs.c           | 360 ++++++++++++++++++++++++++++++++++++++
>  fs/fuse/famfs_kfmap.h     |  63 +++++++
>  fs/fuse/file.c            |  15 +-
>  fs/fuse/fuse_i.h          |  16 +-
>  fs/fuse/inode.c           |   2 +-
>  include/uapi/linux/fuse.h |  56 ++++++
>  8 files changed, 518 insertions(+), 5 deletions(-)
>  create mode 100644 fs/fuse/famfs.c
>  create mode 100644 fs/fuse/famfs_kfmap.h
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index c0d5232a473b..02688f27a4d0 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -8808,6 +8808,15 @@ F:       Documentation/networking/failover.rst
>  F:     include/net/failover.h
>  F:     net/core/failover.c
>
> +FAMFS
> +M:     John Groves <jgroves@micron.com>
> +M:     John Groves <John@Groves.net>
> +L:     linux-cxl@vger.kernel.org
> +L:     linux-fsdevel@vger.kernel.org
> +S:     Supported
> +F:     fs/fuse/famfs.c
> +F:     fs/fuse/famfs_kfmap.h
> +

I suggest to follow the pattern of MAINTAINERS sub entries
FILESYSTEMS [EXPORTFS]
FILESYSTEMS [IOMAP]

and call this sub entry
FUSE [FAMFS]

to order it following FUSE entry

Thanks,
Amir.

>  FANOTIFY
>  M:     Jan Kara <jack@suse.cz>
>  R:     Amir Goldstein <amir73il@gmail.com>
> diff --git a/fs/fuse/Makefile b/fs/fuse/Makefile
> index 3f0f312a31c1..65a12975d734 100644
> --- a/fs/fuse/Makefile
> +++ b/fs/fuse/Makefile
> @@ -16,5 +16,5 @@ fuse-$(CONFIG_FUSE_DAX) += dax.o
>  fuse-$(CONFIG_FUSE_PASSTHROUGH) += passthrough.o
>  fuse-$(CONFIG_SYSCTL) += sysctl.o
>  fuse-$(CONFIG_FUSE_IO_URING) += dev_uring.o
> -
> +fuse-$(CONFIG_FUSE_FAMFS_DAX) += famfs.o
>  virtiofs-y := virtio_fs.o
> diff --git a/fs/fuse/famfs.c b/fs/fuse/famfs.c
> new file mode 100644
> index 000000000000..41c4d92f1451
> --- /dev/null
> +++ b/fs/fuse/famfs.c
> @@ -0,0 +1,360 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * famfs - dax file system for shared fabric-attached memory
> + *
> + * Copyright 2023-2025 Micron Technology, Inc.
> + *
> + * This file system, originally based on ramfs the dax support from xfs,
> + * is intended to allow multiple host systems to mount a common file system
> + * view of dax files that map to shared memory.
> + */
> +
> +#include <linux/fs.h>
> +#include <linux/mm.h>
> +#include <linux/dax.h>
> +#include <linux/iomap.h>
> +#include <linux/path.h>
> +#include <linux/namei.h>
> +#include <linux/string.h>
> +
> +#include "famfs_kfmap.h"
> +#include "fuse_i.h"
> +
> +
> +void
> +__famfs_meta_free(void *famfs_meta)
> +{
> +       struct famfs_file_meta *fmap = famfs_meta;
> +
> +       if (!fmap)
> +               return;
> +
> +       if (fmap) {
> +               switch (fmap->fm_extent_type) {
> +               case SIMPLE_DAX_EXTENT:
> +                       kfree(fmap->se);
> +                       break;
> +               case INTERLEAVED_EXTENT:
> +                       if (fmap->ie)
> +                               kfree(fmap->ie->ie_strips);
> +
> +                       kfree(fmap->ie);
> +                       break;
> +               default:
> +                       pr_err("%s: invalid fmap type\n", __func__);
> +                       break;
> +               }
> +       }
> +       kfree(fmap);
> +}
> +
> +static int
> +famfs_check_ext_alignment(struct famfs_meta_simple_ext *se)
> +{
> +       int errs = 0;
> +
> +       if (se->dev_index != 0)
> +               errs++;
> +
> +       /* TODO: pass in alignment so we can support the other page sizes */
> +       if (!IS_ALIGNED(se->ext_offset, PMD_SIZE))
> +               errs++;
> +
> +       if (!IS_ALIGNED(se->ext_len, PMD_SIZE))
> +               errs++;
> +
> +       return errs;
> +}
> +
> +/**
> + * famfs_fuse_meta_alloc() - Allocate famfs file metadata
> + * @metap:       Pointer to an mcache_map_meta pointer
> + * @ext_count:  The number of extents needed
> + *
> + * Returns: 0=success
> + *          -errno=failure
> + */
> +static int
> +famfs_fuse_meta_alloc(
> +       void *fmap_buf,
> +       size_t fmap_buf_size,
> +       struct famfs_file_meta **metap)
> +{
> +       struct famfs_file_meta *meta = NULL;
> +       struct fuse_famfs_fmap_header *fmh;
> +       size_t extent_total = 0;
> +       size_t next_offset = 0;
> +       int errs = 0;
> +       int i, j;
> +       int rc;
> +
> +       fmh = (struct fuse_famfs_fmap_header *)fmap_buf;
> +
> +       /* Move past fmh in fmap_buf */
> +       next_offset += sizeof(*fmh);
> +       if (next_offset > fmap_buf_size) {
> +               pr_err("%s:%d: fmap_buf underflow offset/size %ld/%ld\n",
> +                      __func__, __LINE__, next_offset, fmap_buf_size);
> +               return -EINVAL;
> +       }
> +
> +       if (fmh->nextents < 1) {
> +               pr_err("%s: nextents %d < 1\n", __func__, fmh->nextents);
> +               return -EINVAL;
> +       }
> +
> +       if (fmh->nextents > FUSE_FAMFS_MAX_EXTENTS) {
> +               pr_err("%s: nextents %d > max (%d) 1\n",
> +                      __func__, fmh->nextents, FUSE_FAMFS_MAX_EXTENTS);
> +               return -E2BIG;
> +       }
> +
> +       meta = kzalloc(sizeof(*meta), GFP_KERNEL);
> +       if (!meta)
> +               return -ENOMEM;
> +
> +       meta->error = false;
> +       meta->file_type = fmh->file_type;
> +       meta->file_size = fmh->file_size;
> +       meta->fm_extent_type = fmh->ext_type;
> +
> +       switch (fmh->ext_type) {
> +       case FUSE_FAMFS_EXT_SIMPLE: {
> +               struct fuse_famfs_simple_ext *se_in;
> +
> +               se_in = (struct fuse_famfs_simple_ext *)(fmap_buf + next_offset);
> +
> +               /* Move past simple extents */
> +               next_offset += fmh->nextents * sizeof(*se_in);
> +               if (next_offset > fmap_buf_size) {
> +                       pr_err("%s:%d: fmap_buf underflow offset/size %ld/%ld\n",
> +                              __func__, __LINE__, next_offset, fmap_buf_size);
> +                       rc = -EINVAL;
> +                       goto errout;
> +               }
> +
> +               meta->fm_nextents = fmh->nextents;
> +
> +               meta->se = kcalloc(meta->fm_nextents, sizeof(*(meta->se)),
> +                                  GFP_KERNEL);
> +               if (!meta->se) {
> +                       rc = -ENOMEM;
> +                       goto errout;
> +               }
> +
> +               if ((meta->fm_nextents > FUSE_FAMFS_MAX_EXTENTS) ||
> +                   (meta->fm_nextents < 1)) {
> +                       rc = -EINVAL;
> +                       goto errout;
> +               }
> +
> +               for (i = 0; i < fmh->nextents; i++) {
> +                       meta->se[i].dev_index  = se_in[i].se_devindex;
> +                       meta->se[i].ext_offset = se_in[i].se_offset;
> +                       meta->se[i].ext_len    = se_in[i].se_len;
> +
> +                       /* Record bitmap of referenced daxdev indices */
> +                       meta->dev_bitmap |= (1 << meta->se[i].dev_index);
> +
> +                       errs += famfs_check_ext_alignment(&meta->se[i]);
> +
> +                       extent_total += meta->se[i].ext_len;
> +               }
> +               break;
> +       }
> +
> +       case FUSE_FAMFS_EXT_INTERLEAVE: {
> +               s64 size_remainder = meta->file_size;
> +               struct fuse_famfs_iext *ie_in;
> +               int niext = fmh->nextents;
> +
> +               meta->fm_niext = niext;
> +
> +               /* Allocate interleaved extent */
> +               meta->ie = kcalloc(niext, sizeof(*(meta->ie)), GFP_KERNEL);
> +               if (!meta->ie) {
> +                       rc = -ENOMEM;
> +                       goto errout;
> +               }
> +
> +               /*
> +                * Each interleaved extent has a simple extent list of strips.
> +                * Outer loop is over separate interleaved extents
> +                */
> +               for (i = 0; i < niext; i++) {
> +                       u64 nstrips;
> +                       struct fuse_famfs_simple_ext *sie_in;
> +
> +                       /* ie_in = one interleaved extent in fmap_buf */
> +                       ie_in = (struct fuse_famfs_iext *)
> +                               (fmap_buf + next_offset);
> +
> +                       /* Move past one interleaved extent header in fmap_buf */
> +                       next_offset += sizeof(*ie_in);
> +                       if (next_offset > fmap_buf_size) {
> +                               pr_err("%s:%d: fmap_buf underflow offset/size %ld/%ld\n",
> +                                      __func__, __LINE__, next_offset,
> +                                      fmap_buf_size);
> +                               rc = -EINVAL;
> +                               goto errout;
> +                       }
> +
> +                       nstrips = ie_in->ie_nstrips;
> +                       meta->ie[i].fie_chunk_size = ie_in->ie_chunk_size;
> +                       meta->ie[i].fie_nstrips    = ie_in->ie_nstrips;
> +                       meta->ie[i].fie_nbytes     = ie_in->ie_nbytes;
> +
> +                       if (!meta->ie[i].fie_nbytes) {
> +                               pr_err("%s: zero-length interleave!\n",
> +                                      __func__);
> +                               rc = -EINVAL;
> +                               goto errout;
> +                       }
> +
> +                       /* sie_in = the strip extents in fmap_buf */
> +                       sie_in = (struct fuse_famfs_simple_ext *)
> +                               (fmap_buf + next_offset);
> +
> +                       /* Move past strip extents in fmap_buf */
> +                       next_offset += nstrips * sizeof(*sie_in);
> +                       if (next_offset > fmap_buf_size) {
> +                               pr_err("%s:%d: fmap_buf underflow offset/size %ld/%ld\n",
> +                                      __func__, __LINE__, next_offset,
> +                                      fmap_buf_size);
> +                               rc = -EINVAL;
> +                               goto errout;
> +                       }
> +
> +                       if ((nstrips > FUSE_FAMFS_MAX_STRIPS) || (nstrips < 1)) {
> +                               pr_err("%s: invalid nstrips=%lld (max=%d)\n",
> +                                      __func__, nstrips,
> +                                      FUSE_FAMFS_MAX_STRIPS);
> +                               errs++;
> +                       }
> +
> +                       /* Allocate strip extent array */
> +                       meta->ie[i].ie_strips = kcalloc(ie_in->ie_nstrips,
> +                                       sizeof(meta->ie[i].ie_strips[0]),
> +                                                       GFP_KERNEL);
> +                       if (!meta->ie[i].ie_strips) {
> +                               rc = -ENOMEM;
> +                               goto errout;
> +                       }
> +
> +                       /* Inner loop is over strips */
> +                       for (j = 0; j < nstrips; j++) {
> +                               struct famfs_meta_simple_ext *strips_out;
> +                               u64 devindex = sie_in[j].se_devindex;
> +                               u64 offset   = sie_in[j].se_offset;
> +                               u64 len      = sie_in[j].se_len;
> +
> +                               strips_out = meta->ie[i].ie_strips;
> +                               strips_out[j].dev_index  = devindex;
> +                               strips_out[j].ext_offset = offset;
> +                               strips_out[j].ext_len    = len;
> +
> +                               /* Record bitmap of referenced daxdev indices */
> +                               meta->dev_bitmap |= (1 << devindex);
> +
> +                               extent_total += len;
> +                               errs += famfs_check_ext_alignment(&strips_out[j]);
> +                               size_remainder -= len;
> +                       }
> +               }
> +
> +               if (size_remainder > 0) {
> +                       /* Sum of interleaved extent sizes is less than file size! */
> +                       pr_err("%s: size_remainder %lld (0x%llx)\n",
> +                              __func__, size_remainder, size_remainder);
> +                       rc = -EINVAL;
> +                       goto errout;
> +               }
> +               break;
> +       }
> +
> +       default:
> +               pr_err("%s: invalid ext_type %d\n", __func__, fmh->ext_type);
> +               rc = -EINVAL;
> +               goto errout;
> +       }
> +
> +       if (errs > 0) {
> +               pr_err("%s: %d alignment errors found\n", __func__, errs);
> +               rc = -EINVAL;
> +               goto errout;
> +       }
> +
> +       /* More sanity checks */
> +       if (extent_total < meta->file_size) {
> +               pr_err("%s: file size %ld larger than map size %ld\n",
> +                      __func__, meta->file_size, extent_total);
> +               rc = -EINVAL;
> +               goto errout;
> +       }
> +
> +       *metap = meta;
> +
> +       return 0;
> +errout:
> +       __famfs_meta_free(meta);
> +       return rc;
> +}
> +
> +/**
> + * famfs_file_init_dax() - init famfs dax file metadata
> + *
> + * @fm:        fuse_mount
> + * @inode:     the inode
> + * @fmap_buf:  fmap response message
> + * @fmap_size: Size of the fmap message
> + *
> + * Initialize famfs metadata for a file, based on the contents of the GET_FMAP
> + * response
> + *
> + * Return: 0=success
> + *          -errno=failure
> + */
> +int
> +famfs_file_init_dax(
> +       struct fuse_mount *fm,
> +       struct inode *inode,
> +       void *fmap_buf,
> +       size_t fmap_size)
> +{
> +       struct fuse_inode *fi = get_fuse_inode(inode);
> +       struct famfs_file_meta *meta = NULL;
> +       int rc;
> +
> +       if (fi->famfs_meta) {
> +               pr_notice("%s: i_no=%ld fmap_size=%ld ALREADY INITIALIZED\n",
> +                         __func__,
> +                         inode->i_ino, fmap_size);
> +               return -EEXIST;
> +       }
> +
> +       rc = famfs_fuse_meta_alloc(fmap_buf, fmap_size, &meta);
> +       if (rc)
> +               goto errout;
> +
> +       /* Publish the famfs metadata on fi->famfs_meta */
> +       inode_lock(inode);
> +       if (fi->famfs_meta) {
> +               rc = -EEXIST; /* file already has famfs metadata */
> +       } else {
> +               if (famfs_meta_set(fi, meta) != NULL) {
> +                       pr_err("%s: file already had metadata\n", __func__);
> +                       rc = -EALREADY;
> +                       goto errout;
> +               }
> +               i_size_write(inode, meta->file_size);
> +               inode->i_flags |= S_DAX;
> +       }
> +       inode_unlock(inode);
> +
> + errout:
> +       if (rc)
> +               __famfs_meta_free(meta);
> +
> +       return rc;
> +}
> +
> diff --git a/fs/fuse/famfs_kfmap.h b/fs/fuse/famfs_kfmap.h
> new file mode 100644
> index 000000000000..ce785d76719c
> --- /dev/null
> +++ b/fs/fuse/famfs_kfmap.h
> @@ -0,0 +1,63 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * famfs - dax file system for shared fabric-attached memory
> + *
> + * Copyright 2023-2025 Micron Technology, Inc.
> + */
> +#ifndef FAMFS_KFMAP_H
> +#define FAMFS_KFMAP_H
> +
> +/*
> + * These structures are the in-memory metadata format for famfs files. Metadata
> + * retrieved via the GET_FMAP response is converted to this format for use in
> + * resolving file mapping faults.
> + */
> +
> +enum famfs_file_type {
> +       FAMFS_REG,
> +       FAMFS_SUPERBLOCK,
> +       FAMFS_LOG,
> +};
> +
> +/* We anticipate the possiblity of supporting additional types of extents */
> +enum famfs_extent_type {
> +       SIMPLE_DAX_EXTENT,
> +       INTERLEAVED_EXTENT,
> +       INVALID_EXTENT_TYPE,
> +};
> +
> +struct famfs_meta_simple_ext {
> +       u64 dev_index;
> +       u64 ext_offset;
> +       u64 ext_len;
> +};
> +
> +struct famfs_meta_interleaved_ext {
> +       u64 fie_nstrips;
> +       u64 fie_chunk_size;
> +       u64 fie_nbytes;
> +       struct famfs_meta_simple_ext *ie_strips;
> +};
> +
> +/*
> + * Each famfs dax file has this hanging from its fuse_inode->famfs_meta
> + */
> +struct famfs_file_meta {
> +       bool                   error;
> +       enum famfs_file_type   file_type;
> +       size_t                 file_size;
> +       enum famfs_extent_type fm_extent_type;
> +       u64 dev_bitmap; /* bitmap of referenced daxdevs by index */
> +       union { /* This will make code a bit more readable */
> +               struct {
> +                       size_t         fm_nextents;
> +                       struct famfs_meta_simple_ext  *se;
> +               };
> +               struct {
> +                       size_t         fm_niext;
> +                       struct famfs_meta_interleaved_ext *ie;
> +               };
> +       };
> +};
> +
> +#endif /* FAMFS_KFMAP_H */
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 8616fb0a6d61..5d205eadb48f 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -237,6 +237,7 @@ static void fuse_truncate_update_attr(struct inode *inode, struct file *file)
>  static int
>  fuse_get_fmap(struct fuse_mount *fm, struct inode *inode, u64 nodeid)
>  {
> +       struct fuse_inode *fi = get_fuse_inode(inode);
>         struct fuse_get_fmap_in inarg = { 0 };
>         size_t fmap_bufsize = FMAP_BUFSIZE;
>         ssize_t fmap_size;
> @@ -246,6 +247,10 @@ fuse_get_fmap(struct fuse_mount *fm, struct inode *inode, u64 nodeid)
>
>         FUSE_ARGS(args);
>
> +       /* Don't retrieve if we already have the famfs metadata */
> +       if (fi->famfs_meta)
> +               return 0;
> +
>         fmap_buf = kcalloc(1, FMAP_BUFSIZE, GFP_KERNEL);
>         if (!fmap_buf)
>                 return -EIO;
> @@ -285,6 +290,13 @@ fuse_get_fmap(struct fuse_mount *fm, struct inode *inode, u64 nodeid)
>                  */
>                 fmap_bufsize = *((uint32_t *)fmap_buf);
>
> +               if (fmap_bufsize < fmap_msg_min_size()
> +                   || fmap_bufsize > FAMFS_FMAP_MAX) {
> +                       pr_err("%s: fmap_size=%ld out of range\n",
> +                              __func__, fmap_bufsize);
> +                       return -EIO;
> +               }
> +
>                 --retries;
>                 kfree(fmap_buf);
>                 fmap_buf = kcalloc(1, fmap_bufsize, GFP_KERNEL);
> @@ -294,7 +306,8 @@ fuse_get_fmap(struct fuse_mount *fm, struct inode *inode, u64 nodeid)
>                 goto retry_once;
>         }
>
> -       /* Will call famfs_file_init_dax() when that gets added */
> +       /* Convert fmap into in-memory format and hang from inode */
> +       famfs_file_init_dax(fm, inode, fmap_buf, fmap_size);
>
>         kfree(fmap_buf);
>         return 0;
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index e01d6e5c6e93..fb6095655403 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -1560,11 +1560,18 @@ extern void fuse_sysctl_unregister(void);
>  #endif /* CONFIG_SYSCTL */
>
>  /* famfs.c */
> +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> +int famfs_file_init_dax(struct fuse_mount *fm,
> +                            struct inode *inode, void *fmap_buf,
> +                            size_t fmap_size);
> +void __famfs_meta_free(void *map);
> +#endif
> +
>  static inline struct fuse_backing *famfs_meta_set(struct fuse_inode *fi,
>                                                        void *meta)
>  {
>  #if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> -       return xchg(&fi->famfs_meta, meta);
> +       return cmpxchg(&fi->famfs_meta, NULL, meta);
>  #else
>         return NULL;
>  #endif
> @@ -1572,7 +1579,12 @@ static inline struct fuse_backing *famfs_meta_set(struct fuse_inode *fi,
>
>  static inline void famfs_meta_free(struct fuse_inode *fi)
>  {
> -       /* Stub wil be connected in a subsequent commit */
> +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> +       if (fi->famfs_meta != NULL) {
> +               __famfs_meta_free(fi->famfs_meta);
> +               famfs_meta_set(fi, NULL);
> +       }
> +#endif
>  }
>
>  static inline int fuse_file_famfs(struct fuse_inode *fi)
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index b071d16f7d04..1682755abf30 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -118,7 +118,7 @@ static struct inode *fuse_alloc_inode(struct super_block *sb)
>                 fuse_inode_backing_set(fi, NULL);
>
>         if (IS_ENABLED(CONFIG_FUSE_FAMFS_DAX))
> -               famfs_meta_set(fi, NULL);
> +               fi->famfs_meta = NULL; /* XXX new inodes currently not zeroed; why not? */
>
>         return &fi->inode;
>
> diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> index dff5aa62543e..ecaaa62910f0 100644
> --- a/include/uapi/linux/fuse.h
> +++ b/include/uapi/linux/fuse.h
> @@ -231,6 +231,13 @@
>   *    - enum fuse_uring_cmd
>   *  7.43
>   *    - Add FUSE_DAX_FMAP capability - ability to handle in-kernel fsdax maps
> + *    - Add the following structures for the GET_FMAP message reply components:
> + *      - struct fuse_famfs_simple_ext
> + *      - struct fuse_famfs_iext
> + *      - struct fuse_famfs_fmap_header
> + *    - Add the following enumerated types
> + *      - enum fuse_famfs_file_type
> + *      - enum famfs_ext_type
>   */
>
>  #ifndef _LINUX_FUSE_H
> @@ -1300,6 +1307,55 @@ struct fuse_uring_cmd_req {
>
>  /* Famfs fmap message components */
>
> +#define FAMFS_FMAP_VERSION 1
> +
>  #define FAMFS_FMAP_MAX 32768 /* Largest supported fmap message */
> +#define FUSE_FAMFS_MAX_EXTENTS 32
> +#define FUSE_FAMFS_MAX_STRIPS 32
> +
> +enum fuse_famfs_file_type {
> +       FUSE_FAMFS_FILE_REG,
> +       FUSE_FAMFS_FILE_SUPERBLOCK,
> +       FUSE_FAMFS_FILE_LOG,
> +};
> +
> +enum famfs_ext_type {
> +       FUSE_FAMFS_EXT_SIMPLE = 0,
> +       FUSE_FAMFS_EXT_INTERLEAVE = 1,
> +};
> +
> +struct fuse_famfs_simple_ext {
> +       uint32_t se_devindex;
> +       uint32_t reserved;
> +       uint64_t se_offset;
> +       uint64_t se_len;
> +};
> +
> +struct fuse_famfs_iext { /* Interleaved extent */
> +       uint32_t ie_nstrips;
> +       uint32_t ie_chunk_size;
> +       uint64_t ie_nbytes; /* Total bytes for this interleaved_ext;
> +                            * sum of strips may be more
> +                            */
> +       uint64_t reserved;
> +};
> +
> +struct fuse_famfs_fmap_header {
> +       uint8_t file_type; /* enum famfs_file_type */
> +       uint8_t reserved;
> +       uint16_t fmap_version;
> +       uint32_t ext_type; /* enum famfs_log_ext_type */
> +       uint32_t nextents;
> +       uint32_t reserved0;
> +       uint64_t file_size;
> +       uint64_t reserved1;
> +};
> +
> +static inline int32_t fmap_msg_min_size(void)
> +{
> +       /* Smallest fmap message is a header plus one simple extent */
> +       return (sizeof(struct fuse_famfs_fmap_header)
> +               + sizeof(struct fuse_famfs_simple_ext));
> +}
>
>  #endif /* _LINUX_FUSE_H */
> --
> 2.49.0
>

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC V2 15/18] famfs_fuse: Plumb dax iomap and fuse read/write/mmap
  2025-07-03 18:50 ` [RFC V2 15/18] famfs_fuse: Plumb dax iomap and fuse read/write/mmap John Groves
@ 2025-07-04  9:13   ` Amir Goldstein
  2025-07-05 19:44     ` John Groves
  0 siblings, 1 reply; 91+ messages in thread
From: Amir Goldstein @ 2025-07-04  9:13 UTC (permalink / raw)
  To: John Groves
  Cc: Dan Williams, Bernd Schubert, John Groves, Jonathan Corbet,
	Vishal Verma, Dave Jiang, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Darrick J . Wong, Randy Dunlap,
	Jeff Layton, Kent Overstreet, linux-doc, linux-kernel, nvdimm,
	linux-cxl, linux-fsdevel, Jonathan Cameron, Stefan Hajnoczi,
	Joanne Koong, Josef Bacik, Aravind Ramesh, Ajay Joshi,
	Miklos Szeredi

On Thu, Jul 3, 2025 at 8:51 PM John Groves <John@groves.net> wrote:
>
> This commit fills in read/write/mmap handling for famfs files. The
> dev_dax_iomap interface is used - just like xfs in fs-dax mode.
>
> Signed-off-by: John Groves <john@groves.net>
> ---
>  fs/fuse/famfs.c  | 436 +++++++++++++++++++++++++++++++++++++++++++++++
>  fs/fuse/file.c   |  14 ++
>  fs/fuse/fuse_i.h |   3 +
>  3 files changed, 453 insertions(+)
>
> diff --git a/fs/fuse/famfs.c b/fs/fuse/famfs.c
> index f5e01032b825..1973eb10b60b 100644
> --- a/fs/fuse/famfs.c
> +++ b/fs/fuse/famfs.c
> @@ -585,3 +585,439 @@ famfs_file_init_dax(
>         return rc;
>  }
>
> +/*********************************************************************
> + * iomap_operations
> + *
> + * This stuff uses the iomap (dax-related) helpers to resolve file offsets to
> + * offsets within a dax device.
> + */
> +
> +static ssize_t famfs_file_bad(struct inode *inode);
> +
> +static int
> +famfs_interleave_fileofs_to_daxofs(struct inode *inode, struct iomap *iomap,
> +                        loff_t file_offset, off_t len, unsigned int flags)
> +{
> +       struct fuse_inode *fi = get_fuse_inode(inode);
> +       struct famfs_file_meta *meta = fi->famfs_meta;
> +       struct fuse_conn *fc = get_fuse_conn(inode);
> +       loff_t local_offset = file_offset;
> +       int i;
> +
> +       /* This function is only for extent_type INTERLEAVED_EXTENT */
> +       if (meta->fm_extent_type != INTERLEAVED_EXTENT) {
> +               pr_err("%s: bad extent type\n", __func__);
> +               goto err_out;
> +       }
> +
> +       if (famfs_file_bad(inode))
> +               goto err_out;
> +
> +       iomap->offset = file_offset;
> +
> +       for (i = 0; i < meta->fm_niext; i++) {
> +               struct famfs_meta_interleaved_ext *fei = &meta->ie[i];
> +               u64 chunk_size = fei->fie_chunk_size;
> +               u64 nstrips = fei->fie_nstrips;
> +               u64 ext_size = fei->fie_nbytes;
> +
> +               ext_size = min_t(u64, ext_size, meta->file_size);
> +
> +               if (ext_size == 0) {
> +                       pr_err("%s: ext_size=%lld file_size=%ld\n",
> +                              __func__, fei->fie_nbytes, meta->file_size);
> +                       goto err_out;
> +               }
> +
> +               /* Is the data is in this striped extent? */
> +               if (local_offset < ext_size) {
> +                       u64 chunk_num       = local_offset / chunk_size;
> +                       u64 chunk_offset    = local_offset % chunk_size;
> +                       u64 stripe_num      = chunk_num / nstrips;
> +                       u64 strip_num       = chunk_num % nstrips;
> +                       u64 chunk_remainder = chunk_size - chunk_offset;
> +                       u64 strip_offset    = chunk_offset + (stripe_num * chunk_size);
> +                       u64 strip_dax_ofs = fei->ie_strips[strip_num].ext_offset;
> +                       u64 strip_devidx = fei->ie_strips[strip_num].dev_index;
> +
> +                       if (!fc->dax_devlist->devlist[strip_devidx].valid) {
> +                               pr_err("%s: daxdev=%lld invalid\n", __func__,
> +                                       strip_devidx);
> +                               goto err_out;
> +                       }
> +                       iomap->addr    = strip_dax_ofs + strip_offset;
> +                       iomap->offset  = file_offset;
> +                       iomap->length  = min_t(loff_t, len, chunk_remainder);
> +
> +                       iomap->dax_dev = fc->dax_devlist->devlist[strip_devidx].devp;
> +
> +                       iomap->type    = IOMAP_MAPPED;
> +                       iomap->flags   = flags;
> +
> +                       return 0;
> +               }
> +               local_offset -= ext_size; /* offset is beyond this striped extent */
> +       }
> +
> + err_out:
> +       pr_err("%s: err_out\n", __func__);
> +
> +       /* We fell out the end of the extent list.
> +        * Set iomap to zero length in this case, and return 0
> +        * This just means that the r/w is past EOF
> +        */
> +       iomap->addr    = 0; /* there is no valid dax device offset */
> +       iomap->offset  = file_offset; /* file offset */
> +       iomap->length  = 0; /* this had better result in no access to dax mem */
> +       iomap->dax_dev = NULL;
> +       iomap->type    = IOMAP_MAPPED;
> +       iomap->flags   = flags;
> +
> +       return 0;
> +}
> +
> +/**
> + * famfs_fileofs_to_daxofs() - Resolve (file, offset, len) to (daxdev, offset, len)
> + *
> + * This function is called by famfs_fuse_iomap_begin() to resolve an offset in a
> + * file to an offset in a dax device. This is upcalled from dax from calls to
> + * both  * dax_iomap_fault() and dax_iomap_rw(). Dax finishes the job resolving
> + * a fault to a specific physical page (the fault case) or doing a memcpy
> + * variant (the rw case)
> + *
> + * Pages can be PTE (4k), PMD (2MiB) or (theoretically) PuD (1GiB)
> + * (these sizes are for X86; may vary on other cpu architectures
> + *
> + * @inode:  The file where the fault occurred
> + * @iomap:       To be filled in to indicate where to find the right memory,
> + *               relative  to a dax device.
> + * @file_offset: Within the file where the fault occurred (will be page boundary)
> + * @len:         The length of the faulted mapping (will be a page multiple)
> + *               (will be trimmed in *iomap if it's disjoint in the extent list)
> + * @flags:
> + *
> + * Return values: 0. (info is returned in a modified @iomap struct)
> + */
> +static int
> +famfs_fileofs_to_daxofs(struct inode *inode, struct iomap *iomap,
> +                        loff_t file_offset, off_t len, unsigned int flags)
> +{
> +       struct fuse_inode *fi = get_fuse_inode(inode);
> +       struct famfs_file_meta *meta = fi->famfs_meta;
> +       struct fuse_conn *fc = get_fuse_conn(inode);
> +       loff_t local_offset = file_offset;
> +       int i;
> +
> +       if (!fc->dax_devlist) {
> +               pr_err("%s: null dax_devlist\n", __func__);
> +               goto err_out;
> +       }
> +
> +       if (famfs_file_bad(inode))
> +               goto err_out;
> +
> +       if (meta->fm_extent_type == INTERLEAVED_EXTENT)
> +               return famfs_interleave_fileofs_to_daxofs(inode, iomap,
> +                                                         file_offset,
> +                                                         len, flags);
> +
> +       iomap->offset = file_offset;
> +
> +       for (i = 0; i < meta->fm_nextents; i++) {
> +               /* TODO: check devindex too */
> +               loff_t dax_ext_offset = meta->se[i].ext_offset;
> +               loff_t dax_ext_len    = meta->se[i].ext_len;
> +               u64 daxdev_idx = meta->se[i].dev_index;
> +
> +               if ((dax_ext_offset == 0) &&
> +                   (meta->file_type != FAMFS_SUPERBLOCK))
> +                       pr_warn("%s: zero offset on non-superblock file!!\n",
> +                               __func__);
> +
> +               /* local_offset is the offset minus the size of extents skipped
> +                * so far; If local_offset < dax_ext_len, the data of interest
> +                * starts in this extent
> +                */
> +               if (local_offset < dax_ext_len) {
> +                       loff_t ext_len_remainder = dax_ext_len - local_offset;
> +                       struct famfs_daxdev *dd;
> +
> +                       dd = &fc->dax_devlist->devlist[daxdev_idx];
> +
> +                       if (!dd->valid || dd->error) {
> +                               pr_err("%s: daxdev=%lld %s\n", __func__,
> +                                      daxdev_idx,
> +                                      dd->valid ? "error" : "invalid");
> +                               goto err_out;
> +                       }
> +
> +                       /*
> +                        * OK, we found the file metadata extent where this
> +                        * data begins
> +                        * @local_offset      - The offset within the current
> +                        *                      extent
> +                        * @ext_len_remainder - Remaining length of ext after
> +                        *                      skipping local_offset
> +                        * Outputs:
> +                        * iomap->addr:   the offset within the dax device where
> +                        *                the  data starts
> +                        * iomap->offset: the file offset
> +                        * iomap->length: the valid length resolved here
> +                        */
> +                       iomap->addr    = dax_ext_offset + local_offset;
> +                       iomap->offset  = file_offset;
> +                       iomap->length  = min_t(loff_t, len, ext_len_remainder);
> +
> +                       iomap->dax_dev = fc->dax_devlist->devlist[daxdev_idx].devp;
> +
> +                       iomap->type    = IOMAP_MAPPED;
> +                       iomap->flags   = flags;
> +                       return 0;
> +               }
> +               local_offset -= dax_ext_len; /* Get ready for the next extent */
> +       }
> +
> + err_out:
> +       pr_err("%s: err_out\n", __func__);
> +
> +       /* We fell out the end of the extent list.
> +        * Set iomap to zero length in this case, and return 0
> +        * This just means that the r/w is past EOF
> +        */
> +       iomap->addr    = 0; /* there is no valid dax device offset */
> +       iomap->offset  = file_offset; /* file offset */
> +       iomap->length  = 0; /* this had better result in no access to dax mem */
> +       iomap->dax_dev = NULL;
> +       iomap->type    = IOMAP_MAPPED;
> +       iomap->flags   = flags;
> +
> +       return 0;
> +}
> +
> +/**
> + * famfs_fuse_iomap_begin() - Handler for iomap_begin upcall from dax
> + *
> + * This function is pretty simple because files are
> + * * never partially allocated
> + * * never have holes (never sparse)
> + * * never "allocate on write"
> + *
> + * @inode:  inode for the file being accessed
> + * @offset: offset within the file
> + * @length: Length being accessed at offset
> + * @flags:
> + * @iomap:  iomap struct to be filled in, resolving (offset, length) to
> + *          (daxdev, offset, len)
> + * @srcmap:
> + */
> +static int
> +famfs_fuse_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> +                 unsigned int flags, struct iomap *iomap, struct iomap *srcmap)
> +{
> +       struct fuse_inode *fi = get_fuse_inode(inode);
> +       struct famfs_file_meta *meta = fi->famfs_meta;
> +       size_t size;
> +
> +       size = i_size_read(inode);
> +
> +       WARN_ON(size != meta->file_size);
> +
> +       return famfs_fileofs_to_daxofs(inode, iomap, offset, length, flags);
> +}
> +
> +/* Note: We never need a special set of write_iomap_ops because famfs never
> + * performs allocation on write.
> + */
> +const struct iomap_ops famfs_iomap_ops = {
> +       .iomap_begin            = famfs_fuse_iomap_begin,
> +};
> +
> +/*********************************************************************
> + * vm_operations
> + */
> +static vm_fault_t
> +__famfs_fuse_filemap_fault(struct vm_fault *vmf, unsigned int pe_size,
> +                     bool write_fault)
> +{
> +       struct inode *inode = file_inode(vmf->vma->vm_file);
> +       vm_fault_t ret;
> +       pfn_t pfn;
> +
> +       if (!IS_DAX(file_inode(vmf->vma->vm_file))) {
> +               pr_err("%s: file not marked IS_DAX!!\n", __func__);
> +               return VM_FAULT_SIGBUS;
> +       }
> +
> +       if (write_fault) {
> +               sb_start_pagefault(inode->i_sb);
> +               file_update_time(vmf->vma->vm_file);
> +       }
> +
> +       ret = dax_iomap_fault(vmf, pe_size, &pfn, NULL, &famfs_iomap_ops);
> +       if (ret & VM_FAULT_NEEDDSYNC)
> +               ret = dax_finish_sync_fault(vmf, pe_size, pfn);
> +
> +       if (write_fault)
> +               sb_end_pagefault(inode->i_sb);
> +
> +       return ret;
> +}
> +
> +static inline bool
> +famfs_is_write_fault(struct vm_fault *vmf)
> +{
> +       return (vmf->flags & FAULT_FLAG_WRITE) &&
> +              (vmf->vma->vm_flags & VM_SHARED);
> +}
> +
> +static vm_fault_t
> +famfs_filemap_fault(struct vm_fault *vmf)
> +{
> +       return __famfs_fuse_filemap_fault(vmf, 0, famfs_is_write_fault(vmf));
> +}
> +
> +static vm_fault_t
> +famfs_filemap_huge_fault(struct vm_fault *vmf, unsigned int pe_size)
> +{
> +       return __famfs_fuse_filemap_fault(vmf, pe_size, famfs_is_write_fault(vmf));
> +}
> +
> +static vm_fault_t
> +famfs_filemap_page_mkwrite(struct vm_fault *vmf)
> +{
> +       return __famfs_fuse_filemap_fault(vmf, 0, true);
> +}
> +
> +static vm_fault_t
> +famfs_filemap_pfn_mkwrite(struct vm_fault *vmf)
> +{
> +       return __famfs_fuse_filemap_fault(vmf, 0, true);
> +}
> +
> +static vm_fault_t
> +famfs_filemap_map_pages(struct vm_fault        *vmf, pgoff_t start_pgoff,
> +                       pgoff_t end_pgoff)
> +{
> +       return filemap_map_pages(vmf, start_pgoff, end_pgoff);
> +}
> +
> +const struct vm_operations_struct famfs_file_vm_ops = {
> +       .fault          = famfs_filemap_fault,
> +       .huge_fault     = famfs_filemap_huge_fault,
> +       .map_pages      = famfs_filemap_map_pages,
> +       .page_mkwrite   = famfs_filemap_page_mkwrite,
> +       .pfn_mkwrite    = famfs_filemap_pfn_mkwrite,
> +};
> +
> +/*********************************************************************
> + * file_operations
> + */
> +
> +/**
> + * famfs_file_bad() - Check for files that aren't in a valid state
> + *
> + * @inode - inode
> + *
> + * Returns: 0=success
> + *          -errno=failure
> + */
> +static ssize_t
> +famfs_file_bad(struct inode *inode)
> +{
> +       struct fuse_inode *fi = get_fuse_inode(inode);
> +       struct famfs_file_meta *meta = fi->famfs_meta;
> +       size_t i_size = i_size_read(inode);
> +
> +       if (!meta) {
> +               pr_err("%s: un-initialized famfs file\n", __func__);
> +               return -EIO;
> +       }
> +       if (meta->error) {
> +               pr_debug("%s: previously detected metadata errors\n", __func__);
> +               return -EIO;
> +       }
> +       if (i_size != meta->file_size) {
> +               pr_warn("%s: i_size overwritten from %ld to %ld\n",
> +                      __func__, meta->file_size, i_size);
> +               meta->error = true;
> +               return -ENXIO;
> +       }
> +       if (!IS_DAX(inode)) {
> +               pr_debug("%s: inode %llx IS_DAX is false\n", __func__, (u64)inode);
> +               return -ENXIO;
> +       }
> +       return 0;
> +}
> +
> +static ssize_t
> +famfs_fuse_rw_prep(struct kiocb *iocb, struct iov_iter *ubuf)
> +{
> +       struct inode *inode = iocb->ki_filp->f_mapping->host;
> +       size_t i_size = i_size_read(inode);
> +       size_t count = iov_iter_count(ubuf);
> +       size_t max_count;
> +       ssize_t rc;
> +
> +       rc = famfs_file_bad(inode);
> +       if (rc)
> +               return rc;
> +
> +       max_count = max_t(size_t, 0, i_size - iocb->ki_pos);
> +
> +       if (count > max_count)
> +               iov_iter_truncate(ubuf, max_count);
> +
> +       if (!iov_iter_count(ubuf))
> +               return 0;
> +
> +       return rc;
> +}
> +
> +ssize_t
> +famfs_fuse_read_iter(struct kiocb *iocb, struct iov_iter       *to)
> +{
> +       ssize_t rc;
> +
> +       rc = famfs_fuse_rw_prep(iocb, to);
> +       if (rc)
> +               return rc;
> +
> +       if (!iov_iter_count(to))
> +               return 0;
> +
> +       rc = dax_iomap_rw(iocb, to, &famfs_iomap_ops);
> +
> +       file_accessed(iocb->ki_filp);
> +       return rc;
> +}
> +
> +ssize_t
> +famfs_fuse_write_iter(struct kiocb *iocb, struct iov_iter *from)
> +{
> +       ssize_t rc;
> +
> +       rc = famfs_fuse_rw_prep(iocb, from);
> +       if (rc)
> +               return rc;
> +
> +       if (!iov_iter_count(from))
> +               return 0;
> +
> +       return dax_iomap_rw(iocb, from, &famfs_iomap_ops);
> +}
> +
> +int
> +famfs_fuse_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> +       struct inode *inode = file_inode(file);
> +       ssize_t rc;
> +
> +       rc = famfs_file_bad(inode);
> +       if (rc)
> +               return (int)rc;
> +
> +       file_accessed(file);
> +       vma->vm_ops = &famfs_file_vm_ops;
> +       vm_flags_set(vma, VM_HUGEPAGE);
> +       return 0;
> +}
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 5d205eadb48f..24a14b176510 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -1874,6 +1874,8 @@ static ssize_t fuse_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
>
>         if (FUSE_IS_VIRTIO_DAX(fi))
>                 return fuse_dax_read_iter(iocb, to);
> +       if (fuse_file_famfs(fi))
> +               return famfs_fuse_read_iter(iocb, to);
>
>         /* FOPEN_DIRECT_IO overrides FOPEN_PASSTHROUGH */
>         if (ff->open_flags & FOPEN_DIRECT_IO)
> @@ -1896,6 +1898,8 @@ static ssize_t fuse_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
>
>         if (FUSE_IS_VIRTIO_DAX(fi))
>                 return fuse_dax_write_iter(iocb, from);
> +       if (fuse_file_famfs(fi))
> +               return famfs_fuse_write_iter(iocb, from);
>
>         /* FOPEN_DIRECT_IO overrides FOPEN_PASSTHROUGH */
>         if (ff->open_flags & FOPEN_DIRECT_IO)
> @@ -1911,10 +1915,14 @@ static ssize_t fuse_splice_read(struct file *in, loff_t *ppos,
>                                 unsigned int flags)
>  {
>         struct fuse_file *ff = in->private_data;
> +       struct inode *inode = file_inode(in);
> +       struct fuse_inode *fi = get_fuse_inode(inode);
>
>         /* FOPEN_DIRECT_IO overrides FOPEN_PASSTHROUGH */
>         if (fuse_file_passthrough(ff) && !(ff->open_flags & FOPEN_DIRECT_IO))
>                 return fuse_passthrough_splice_read(in, ppos, pipe, len, flags);
> +       else if (fuse_file_famfs(fi))
> +               return -EIO; /* direct I/O doesn't make sense in dax_iomap */
>         else
>                 return filemap_splice_read(in, ppos, pipe, len, flags);
>  }
> @@ -1923,10 +1931,14 @@ static ssize_t fuse_splice_write(struct pipe_inode_info *pipe, struct file *out,
>                                  loff_t *ppos, size_t len, unsigned int flags)
>  {
>         struct fuse_file *ff = out->private_data;
> +       struct inode *inode = file_inode(out);
> +       struct fuse_inode *fi = get_fuse_inode(inode);
>
>         /* FOPEN_DIRECT_IO overrides FOPEN_PASSTHROUGH */
>         if (fuse_file_passthrough(ff) && !(ff->open_flags & FOPEN_DIRECT_IO))
>                 return fuse_passthrough_splice_write(pipe, out, ppos, len, flags);
> +       else if (fuse_file_famfs(fi))
> +               return -EIO; /* direct I/O doesn't make sense in dax_iomap */
>         else
>                 return iter_file_splice_write(pipe, out, ppos, len, flags);
>  }

This looks odd.

Usually, the methods first check for FUSE_IS_VIRTIO_DAX() and
fuse_file_famfs() to get this condition out of the way so I never needed
to think about whether or not the code verifies that fuse_file_passthrough()
and fuse_file_famfs() cannot co-exist.

Is there a reason why you did not follow the same pattern here?

Also, your comment makes no sense.
splice is not the case of direct I/O - quite the contrary.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC V2 02/18] dev_dax_iomap: Add fs_dax_get() func to prepare dax for fs-dax usage
  2025-07-03 18:50 ` [RFC V2 02/18] dev_dax_iomap: Add fs_dax_get() func to prepare dax for fs-dax usage John Groves
@ 2025-07-04 10:39   ` Jonathan Cameron
  2025-07-04 12:54     ` John Groves
  0 siblings, 1 reply; 91+ messages in thread
From: Jonathan Cameron @ 2025-07-04 10:39 UTC (permalink / raw)
  To: John Groves
  Cc: Dan Williams, Miklos Szeredi, Bernd Schubert, John Groves,
	Jonathan Corbet, Vishal Verma, Dave Jiang, Matthew Wilcox,
	Jan Kara, Alexander Viro, Christian Brauner, Darrick J . Wong,
	Randy Dunlap, Jeff Layton, Kent Overstreet, linux-doc,
	linux-kernel, nvdimm, linux-cxl, linux-fsdevel, Amir Goldstein,
	Stefan Hajnoczi, Joanne Koong, Josef Bacik, Aravind Ramesh,
	Ajay Joshi

On Thu,  3 Jul 2025 13:50:16 -0500
John Groves <John@Groves.net> wrote:

> This function should be called by fs-dax file systems after opening the
> devdax device. This adds holder_operations, which effects exclusivity
> between callers of fs_dax_get().
> 
> This function serves the same role as fs_dax_get_by_bdev(), which dax
> file systems call after opening the pmem block device.
> 
> This also adds the CONFIG_DEV_DAX_IOMAP Kconfig parameter
> 
> Signed-off-by: John Groves <john@groves.net>
Trivial stuff inline.


> ---
>  drivers/dax/Kconfig |  6 ++++++
>  drivers/dax/super.c | 30 ++++++++++++++++++++++++++++++
>  include/linux/dax.h |  5 +++++
>  3 files changed, 41 insertions(+)
> 
> diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
> index d656e4c0eb84..ad19fa966b8b 100644
> --- a/drivers/dax/Kconfig
> +++ b/drivers/dax/Kconfig
> @@ -78,4 +78,10 @@ config DEV_DAX_KMEM
>  
>  	  Say N if unsure.
>  
> +config DEV_DAX_IOMAP
> +       depends on DEV_DAX && DAX
> +       def_bool y
> +       help
> +         Support iomap mapping of devdax devices (for FS-DAX file
> +         systems that reside on character /dev/dax devices)
>  endif
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index e16d1d40d773..48bab9b5f341 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -122,6 +122,36 @@ void fs_put_dax(struct dax_device *dax_dev, void *holder)
>  EXPORT_SYMBOL_GPL(fs_put_dax);
>  #endif /* CONFIG_BLOCK && CONFIG_FS_DAX */
>  
> +#if IS_ENABLED(CONFIG_DEV_DAX_IOMAP)
> +/**
> + * fs_dax_get()

Trivial but from what I recall kernel-doc isn't going to like this.
Needs a short description.

> + *
> + * fs-dax file systems call this function to prepare to use a devdax device for
> + * fsdax. This is like fs_dax_get_by_bdev(), but the caller already has struct
> + * dev_dax (and there  * is no bdev). The holder makes this exclusive.

there is no *bdev?  So * in wrong place.

> + *
> + * @dax_dev: dev to be prepared for fs-dax usage
> + * @holder: filesystem or mapped device inside the dax_device
> + * @hops: operations for the inner holder
> + *
> + * Returns: 0 on success, <0 on failure
> + */
> +int fs_dax_get(struct dax_device *dax_dev, void *holder,
> +	const struct dax_holder_operations *hops)
> +{
> +	if (!dax_dev || !dax_alive(dax_dev) || !igrab(&dax_dev->inode))
> +		return -ENODEV;
> +
> +	if (cmpxchg(&dax_dev->holder_data, NULL, holder))
> +		return -EBUSY;
> +
> +	dax_dev->holder_ops = hops;
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(fs_dax_get);
> +#endif /* DEV_DAX_IOMAP */
> +
>  enum dax_device_flags {
>  	/* !alive + rcu grace period == no new operations / mappings */
>  	DAXDEV_ALIVE,
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index df41a0017b31..86bf5922f1b0 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -51,6 +51,11 @@ struct dax_holder_operations {
>  
>  #if IS_ENABLED(CONFIG_DAX)
>  struct dax_device *alloc_dax(void *private, const struct dax_operations *ops);
> +
> +#if IS_ENABLED(CONFIG_DEV_DAX_IOMAP)
> +int fs_dax_get(struct dax_device *dax_dev, void *holder, const struct dax_holder_operations *hops);
> +struct dax_device *inode_dax(struct inode *inode);
> +#endif
>  void *dax_holder(struct dax_device *dax_dev);
>  void put_dax(struct dax_device *dax_dev);
>  void kill_dax(struct dax_device *dax_dev);


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC V2 03/18] dev_dax_iomap: Save the kva from memremap
  2025-07-03 18:50 ` [RFC V2 03/18] dev_dax_iomap: Save the kva from memremap John Groves
@ 2025-07-04 11:11   ` Jonathan Cameron
  0 siblings, 0 replies; 91+ messages in thread
From: Jonathan Cameron @ 2025-07-04 11:11 UTC (permalink / raw)
  To: John Groves
  Cc: Dan Williams, Miklos Szeredi, Bernd Schubert, John Groves,
	Jonathan Corbet, Vishal Verma, Dave Jiang, Matthew Wilcox,
	Jan Kara, Alexander Viro, Christian Brauner, Darrick J . Wong,
	Randy Dunlap, Jeff Layton, Kent Overstreet, linux-doc,
	linux-kernel, nvdimm, linux-cxl, linux-fsdevel, Amir Goldstein,
	Stefan Hajnoczi, Joanne Koong, Josef Bacik, Aravind Ramesh,
	Ajay Joshi

On Thu,  3 Jul 2025 13:50:17 -0500
John Groves <John@Groves.net> wrote:

> Save the kva from memremap because we need it for iomap rw support.
> 
> Prior to famfs, there were no iomap users of /dev/dax - so the virtual
> address from memremap was not needed.
> 
> Also: in some cases dev_dax_probe() is called with the first
> dev_dax->range offset past the start of pgmap[0].range. In those cases
> we need to add the difference to virt_addr in order to have the physaddr's
> in dev_dax->ranges match dev_dax->virt_addr.
> 
> This happens with devdax devices that started as pmem and got converted
> to devdax. I'm not sure whether the offset is due to label storage, or
> page tables, but this works in all known cases.

Clearly a question we need to resolve to understand if this is correct
handling.

> 
> Signed-off-by: John Groves <john@groves.net>
> ---
>  drivers/dax/dax-private.h |  1 +
>  drivers/dax/device.c      | 15 +++++++++++++++
>  2 files changed, 16 insertions(+)
> 
> diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
> index 0867115aeef2..2a6b07813f9f 100644
> --- a/drivers/dax/dax-private.h
> +++ b/drivers/dax/dax-private.h
> @@ -81,6 +81,7 @@ struct dev_dax_range {
>  struct dev_dax {
>  	struct dax_region *region;
>  	struct dax_device *dax_dev;
> +	void *virt_addr;
>  	unsigned int align;
>  	int target_node;
>  	bool dyn_id;
> diff --git a/drivers/dax/device.c b/drivers/dax/device.c
> index 29f61771fef0..583150478dcc 100644
> --- a/drivers/dax/device.c
> +++ b/drivers/dax/device.c
> @@ -372,6 +372,7 @@ static int dev_dax_probe(struct dev_dax *dev_dax)
>  	struct dax_device *dax_dev = dev_dax->dax_dev;
>  	struct device *dev = &dev_dax->dev;
>  	struct dev_pagemap *pgmap;
> +	u64 data_offset = 0;
>  	struct inode *inode;
>  	struct cdev *cdev;
>  	void *addr;
> @@ -426,6 +427,20 @@ static int dev_dax_probe(struct dev_dax *dev_dax)
>  	if (IS_ERR(addr))
>  		return PTR_ERR(addr);
>  
> +	/* Detect whether the data is at a non-zero offset into the memory */
> +	if (pgmap->range.start != dev_dax->ranges[0].range.start) {

Using pgmap->range.start here but then getting to the same (I think)
with  dev_dax->pgmap[0].range.start is rather inconsistent.

Also, perhaps drag the assignment of phys and pgmap_phys out of this
scope so that you can use them for the condition check above and
then reuse the same in here.


> +		u64 phys = dev_dax->ranges[0].range.start;
> +		u64 pgmap_phys = dev_dax->pgmap[0].range.start;
> +		u64 vmemmap_shift = dev_dax->pgmap[0].vmemmap_shift;
> +
> +		if (!WARN_ON(pgmap_phys > phys))
> +			data_offset = phys - pgmap_phys;

In the event of the condition above being false.
phys == pgmap_phys and data_offset == 0.

So why not do this unconditionally replacing this block with something like

	/* Apply necessary offset */

	dev_dax->virt_addr = addr +
		(dev_dax->ranges[0].range.start - pgmap->range.start);
> +
> +		pr_debug("%s: offset detected phys=%llx pgmap_phys=%llx offset=%llx shift=%llx\n",
> +		       __func__, phys, pgmap_phys, data_offset, vmemmap_shift);

If it's only used in the print, I'd just put the path to vmemmap_shift directly in here
and probably get to it via pgmap->vmemmap_shift



> +	}
> +	dev_dax->virt_addr = addr + data_offset;
> +
>  	inode = dax_inode(dax_dev);
>  	cdev = inode->i_cdev;
>  	cdev_init(cdev, &dax_fops);


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC V2 04/18] dev_dax_iomap: Add dax_operations for use by fs-dax on devdax
  2025-07-03 18:50 ` [RFC V2 04/18] dev_dax_iomap: Add dax_operations for use by fs-dax on devdax John Groves
@ 2025-07-04 12:47   ` Jonathan Cameron
  2025-07-05 22:56     ` John Groves
  0 siblings, 1 reply; 91+ messages in thread
From: Jonathan Cameron @ 2025-07-04 12:47 UTC (permalink / raw)
  To: John Groves
  Cc: Dan Williams, Miklos Szeredi, Bernd Schubert, John Groves,
	Jonathan Corbet, Vishal Verma, Dave Jiang, Matthew Wilcox,
	Jan Kara, Alexander Viro, Christian Brauner, Darrick J . Wong,
	Randy Dunlap, Jeff Layton, Kent Overstreet, linux-doc,
	linux-kernel, nvdimm, linux-cxl, linux-fsdevel, Amir Goldstein,
	Stefan Hajnoczi, Joanne Koong, Josef Bacik, Aravind Ramesh,
	Ajay Joshi

On Thu,  3 Jul 2025 13:50:18 -0500
John Groves <John@Groves.net> wrote:

> Notes about this commit:
> 
> * These methods are based on pmem_dax_ops from drivers/nvdimm/pmem.c
> 
> * dev_dax_direct_access() is returns the hpa, pfn and kva. The kva was
>   newly stored as dev_dax->virt_addr by dev_dax_probe().
> 
> * The hpa/pfn are used for mmap (dax_iomap_fault()), and the kva is used
>   for read/write (dax_iomap_rw())
> 
> * dev_dax_recovery_write() and dev_dax_zero_page_range() have not been
>   tested yet. I'm looking for suggestions as to how to test those.
> 
> Signed-off-by: John Groves <john@groves.net>
A few trivial things noticed whilst reading through.

> ---
>  drivers/dax/bus.c | 120 ++++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 115 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> index 9d9a4ae7bbc0..61a8d1b3c07a 100644
> --- a/drivers/dax/bus.c
> +++ b/drivers/dax/bus.c
> @@ -7,6 +7,10 @@
>  #include <linux/slab.h>
>  #include <linux/dax.h>
>  #include <linux/io.h>
> +#include <linux/backing-dev.h>
> +#include <linux/pfn_t.h>
> +#include <linux/range.h>
> +#include <linux/uio.h>
>  #include "dax-private.h"
>  #include "bus.h"
>  
> @@ -1441,6 +1445,105 @@ __weak phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff,
>  }
>  EXPORT_SYMBOL_GPL(dax_pgoff_to_phys);
>  
> +#if IS_ENABLED(CONFIG_DEV_DAX_IOMAP)
> +
> +static void write_dax(void *pmem_addr, struct page *page,
> +		unsigned int off, unsigned int len)
> +{
> +	unsigned int chunk;
> +	void *mem;

I'd move these two into the loop - similar to what you have
in other cases with more local scope.

> +
> +	while (len) {
> +		mem = kmap_local_page(page);
> +		chunk = min_t(unsigned int, len, PAGE_SIZE - off);
> +		memcpy_flushcache(pmem_addr, mem + off, chunk);
> +		kunmap_local(mem);
> +		len -= chunk;
> +		off = 0;
> +		page++;
> +		pmem_addr += chunk;
> +	}
> +}
> +
> +static long __dev_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
> +			long nr_pages, enum dax_access_mode mode, void **kaddr,
> +			pfn_t *pfn)
> +{
> +	struct dev_dax *dev_dax = dax_get_private(dax_dev);
> +	size_t size = nr_pages << PAGE_SHIFT;
> +	size_t offset = pgoff << PAGE_SHIFT;
> +	void *virt_addr = dev_dax->virt_addr + offset;
> +	u64 flags = PFN_DEV|PFN_MAP;

spaces around the |

Though given it's in just one place, just put these inline next
to the question...


> +	phys_addr_t phys;
> +	pfn_t local_pfn;
> +	size_t dax_size;
> +
> +	WARN_ON(!dev_dax->virt_addr);
> +
> +	if (down_read_interruptible(&dax_dev_rwsem))
> +		return 0; /* no valid data since we were killed */
> +	dax_size = dev_dax_size(dev_dax);
> +	up_read(&dax_dev_rwsem);
> +
> +	phys = dax_pgoff_to_phys(dev_dax, pgoff, nr_pages << PAGE_SHIFT);
> +
> +	if (kaddr)
> +		*kaddr = virt_addr;
> +
> +	local_pfn = phys_to_pfn_t(phys, flags); /* are flags correct? */
> +	if (pfn)
> +		*pfn = local_pfn;
> +
> +	/* This the valid size at the specified address */
> +	return PHYS_PFN(min_t(size_t, size, dax_size - offset));
> +}

> +static size_t dev_dax_recovery_write(struct dax_device *dax_dev, pgoff_t pgoff,
> +		void *addr, size_t bytes, struct iov_iter *i)
> +{
> +	size_t off;
> +
> +	off = offset_in_page(addr);

Unused.
> +
> +	return _copy_from_iter_flushcache(addr, bytes, i);
> +}



^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC V2 02/18] dev_dax_iomap: Add fs_dax_get() func to prepare dax for fs-dax usage
  2025-07-04 10:39   ` Jonathan Cameron
@ 2025-07-04 12:54     ` John Groves
  0 siblings, 0 replies; 91+ messages in thread
From: John Groves @ 2025-07-04 12:54 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Dan Williams, Miklos Szeredi, Bernd Schubert, John Groves,
	Jonathan Corbet, Vishal Verma, Dave Jiang, Matthew Wilcox,
	Jan Kara, Alexander Viro, Christian Brauner, Darrick J . Wong,
	Randy Dunlap, Jeff Layton, Kent Overstreet, linux-doc,
	linux-kernel, nvdimm, linux-cxl, linux-fsdevel, Amir Goldstein,
	Stefan Hajnoczi, Joanne Koong, Josef Bacik, Aravind Ramesh,
	Ajay Joshi

On 25/07/04 11:39AM, Jonathan Cameron wrote:
> On Thu,  3 Jul 2025 13:50:16 -0500
> John Groves <John@Groves.net> wrote:
> 
> > This function should be called by fs-dax file systems after opening the
> > devdax device. This adds holder_operations, which effects exclusivity
> > between callers of fs_dax_get().
> > 
> > This function serves the same role as fs_dax_get_by_bdev(), which dax
> > file systems call after opening the pmem block device.
> > 
> > This also adds the CONFIG_DEV_DAX_IOMAP Kconfig parameter
> > 
> > Signed-off-by: John Groves <john@groves.net>
> Trivial stuff inline.
> 
> 
> > ---
> >  drivers/dax/Kconfig |  6 ++++++
> >  drivers/dax/super.c | 30 ++++++++++++++++++++++++++++++
> >  include/linux/dax.h |  5 +++++
> >  3 files changed, 41 insertions(+)
> > 
> > diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
> > index d656e4c0eb84..ad19fa966b8b 100644
> > --- a/drivers/dax/Kconfig
> > +++ b/drivers/dax/Kconfig
> > @@ -78,4 +78,10 @@ config DEV_DAX_KMEM
> >  
> >  	  Say N if unsure.
> >  
> > +config DEV_DAX_IOMAP
> > +       depends on DEV_DAX && DAX
> > +       def_bool y
> > +       help
> > +         Support iomap mapping of devdax devices (for FS-DAX file
> > +         systems that reside on character /dev/dax devices)
> >  endif
> > diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> > index e16d1d40d773..48bab9b5f341 100644
> > --- a/drivers/dax/super.c
> > +++ b/drivers/dax/super.c
> > @@ -122,6 +122,36 @@ void fs_put_dax(struct dax_device *dax_dev, void *holder)
> >  EXPORT_SYMBOL_GPL(fs_put_dax);
> >  #endif /* CONFIG_BLOCK && CONFIG_FS_DAX */
> >  
> > +#if IS_ENABLED(CONFIG_DEV_DAX_IOMAP)
> > +/**
> > + * fs_dax_get()
> 
> Trivial but from what I recall kernel-doc isn't going to like this.
> Needs a short description.

Right you are. I thought I'd checked all those, but missed this one.
Queued to -next.

> 
> > + *
> > + * fs-dax file systems call this function to prepare to use a devdax device for
> > + * fsdax. This is like fs_dax_get_by_bdev(), but the caller already has struct
> > + * dev_dax (and there  * is no bdev). The holder makes this exclusive.
> 
> there is no *bdev?  So * in wrong place.

I think that's a line-break-refactor malfunction on my part. Aueued to -next.

Thanks,
John


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC V2 14/18] famfs_fuse: GET_DAXDEV message and daxdev_table
  2025-07-03 18:50 ` [RFC V2 14/18] famfs_fuse: GET_DAXDEV message and daxdev_table John Groves
@ 2025-07-04 13:20   ` Jonathan Cameron
  2025-07-06 17:07     ` John Groves
  2025-08-14 13:58   ` Miklos Szeredi
  1 sibling, 1 reply; 91+ messages in thread
From: Jonathan Cameron @ 2025-07-04 13:20 UTC (permalink / raw)
  To: John Groves
  Cc: Dan Williams, Miklos Szeredi, Bernd Schubert, John Groves,
	Jonathan Corbet, Vishal Verma, Dave Jiang, Matthew Wilcox,
	Jan Kara, Alexander Viro, Christian Brauner, Darrick J . Wong,
	Randy Dunlap, Jeff Layton, Kent Overstreet, linux-doc,
	linux-kernel, nvdimm, linux-cxl, linux-fsdevel, Amir Goldstein,
	Stefan Hajnoczi, Joanne Koong, Josef Bacik, Aravind Ramesh,
	Ajay Joshi

On Thu,  3 Jul 2025 13:50:28 -0500
John Groves <John@Groves.net> wrote:

> * The new GET_DAXDEV message/response is enabled
> * The command it triggered by the update_daxdev_table() call, if there
>   are any daxdevs in the subject fmap that are not represented in the
>   daxdev_dable yet.
> 
> Signed-off-by: John Groves <john@groves.net>

More drive by stuff you can ignore for now if you like.

> ---
>  fs/fuse/famfs.c           | 227 ++++++++++++++++++++++++++++++++++++++
>  fs/fuse/famfs_kfmap.h     |  26 +++++
>  fs/fuse/fuse_i.h          |   1 +
>  fs/fuse/inode.c           |   4 +-
>  fs/namei.c                |   1 +
>  include/uapi/linux/fuse.h |  18 +++
>  6 files changed, 276 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/fuse/famfs.c b/fs/fuse/famfs.c
> index 41c4d92f1451..f5e01032b825 100644
> --- a/fs/fuse/famfs.c
> +++ b/fs/fuse/famfs.c

> +/**
> + * famfs_fuse_get_daxdev() - Retrieve info for a DAX device from fuse server
> + *
> + * Send a GET_DAXDEV message to the fuse server to retrieve info on a
> + * dax device.
> + *
> + * @fm:     fuse_mount
> + * @index:  the index of the dax device; daxdevs are referred to by index
> + *          in fmaps, and the server resolves the index to a particular daxdev
> + *
> + * Returns: 0=success
> + *          -errno=failure
> + */
> +static int
> +famfs_fuse_get_daxdev(struct fuse_mount *fm, const u64 index)
> +{
> +	struct fuse_daxdev_out daxdev_out = { 0 };
> +	struct fuse_conn *fc = fm->fc;
> +	struct famfs_daxdev *daxdev;
> +	int err = 0;
> +
> +	FUSE_ARGS(args);
> +
> +	/* Store the daxdev in our table */
> +	if (index >= fc->dax_devlist->nslots) {
> +		pr_err("%s: index(%lld) > nslots(%d)\n",
> +		       __func__, index, fc->dax_devlist->nslots);
> +		err = -EINVAL;
> +		goto out;
> +	}
> +
> +	args.opcode = FUSE_GET_DAXDEV;
> +	args.nodeid = index;
> +
> +	args.in_numargs = 0;
> +
> +	args.out_numargs = 1;
> +	args.out_args[0].size = sizeof(daxdev_out);
> +	args.out_args[0].value = &daxdev_out;
> +
> +	/* Send GET_DAXDEV command */
> +	err = fuse_simple_request(fm, &args);
> +	if (err) {
> +		pr_err("%s: err=%d from fuse_simple_request()\n",
> +		       __func__, err);
> +		/*
> +		 * Error will be that the payload is smaller than FMAP_BUFSIZE,
> +		 * which is the max we can handle. Empty payload handled below.
> +		 */
> +		goto out;
> +	}
> +
> +	down_write(&fc->famfs_devlist_sem);

Worth thinking about guard() in this code in general.
Simplify some of the error paths at least.

> +
> +	daxdev = &fc->dax_devlist->devlist[index];
> +
> +	/* Abort if daxdev is now valid */
> +	if (daxdev->valid) {
> +		up_write(&fc->famfs_devlist_sem);
> +		/* We already have a valid entry at this index */
> +		err = -EALREADY;
> +		goto out;
> +	}
> +
> +	/* Verify that the dev is valid and can be opened and gets the devno */
> +	err = famfs_verify_daxdev(daxdev_out.name, &daxdev->devno);
> +	if (err) {
> +		up_write(&fc->famfs_devlist_sem);
> +		pr_err("%s: err=%d from famfs_verify_daxdev()\n", __func__, err);
> +		goto out;
> +	}
> +
> +	/* This will fail if it's not a dax device */
> +	daxdev->devp = dax_dev_get(daxdev->devno);
> +	if (!daxdev->devp) {
> +		up_write(&fc->famfs_devlist_sem);
> +		pr_warn("%s: device %s not found or not dax\n",
> +			__func__, daxdev_out.name);
> +		err = -ENODEV;
> +		goto out;
> +	}
> +
> +	daxdev->name = kstrdup(daxdev_out.name, GFP_KERNEL);
> +	wmb(); /* all daxdev fields must be visible before marking it valid */
> +	daxdev->valid = 1;
> +
> +	up_write(&fc->famfs_devlist_sem);
> +
> +out:
> +	return err;
> +}
> +
> +/**
> + * famfs_update_daxdev_table() - Update the daxdev table
> + * @fm   - fuse_mount
> + * @meta - famfs_file_meta, in-memory format, built from a GET_FMAP response
> + *
> + * This function is called for each new file fmap, to verify whether all
> + * referenced daxdevs are already known (i.e. in the table). Any daxdev
> + * indices that referenced in @meta but not in the table will be retrieved via
> + * famfs_fuse_get_daxdev() and added to the table
> + *
> + * Return: 0=success
> + *         -errno=failure
> + */
> +static int
> +famfs_update_daxdev_table(
> +	struct fuse_mount *fm,
> +	const struct famfs_file_meta *meta)
> +{
> +	struct famfs_dax_devlist *local_devlist;
> +	struct fuse_conn *fc = fm->fc;
> +	int err;
> +	int i;
> +
> +	/* First time through we will need to allocate the dax_devlist */
> +	if (!fc->dax_devlist) {
> +		local_devlist = kcalloc(1, sizeof(*fc->dax_devlist), GFP_KERNEL);
> +		if (!local_devlist)
> +			return -ENOMEM;
> +
> +		local_devlist->nslots = MAX_DAXDEVS;
> +
> +		local_devlist->devlist = kcalloc(MAX_DAXDEVS,
> +						 sizeof(struct famfs_daxdev),
> +						 GFP_KERNEL);
> +		if (!local_devlist->devlist) {
> +			kfree(local_devlist);
> +			return -ENOMEM;
> +		}
> +
> +		/* We don't need the famfs_devlist_sem here because we use cmpxchg... */
> +		if (cmpxchg(&fc->dax_devlist, NULL, local_devlist) != NULL) {
> +			kfree(local_devlist->devlist);
> +			kfree(local_devlist); /* another thread beat us to it */
> +		}
> +	}
> +
> +	down_read(&fc->famfs_devlist_sem);
> +	for (i = 0; i < fc->dax_devlist->nslots; i++) {
> +		if (meta->dev_bitmap & (1ULL << i)) {
Flip for readability.
		if (!(meta->dev_bitmap & (1ULL << i))
			continue;

Or can we use bitmap_from_arr64() and
for_each_set_bit() to optimize this a little.

> +			/* This file meta struct references devindex i
> +			 * if devindex i isn't in the table; get it...
> +			 */
> +			if (!(fc->dax_devlist->devlist[i].valid)) {
> +				up_read(&fc->famfs_devlist_sem);
> +
> +				err = famfs_fuse_get_daxdev(fm, i);
> +				if (err)
> +					pr_err("%s: failed to get daxdev=%d\n",
> +					       __func__, i);
Don't want to surface that error?
> +
> +				down_read(&fc->famfs_devlist_sem);
> +			}
> +		}
> +	}
> +	up_read(&fc->famfs_devlist_sem);
> +
> +	return 0;
> +}
> +
> +/***************************************************************************/

?

> diff --git a/fs/fuse/famfs_kfmap.h b/fs/fuse/famfs_kfmap.h
> index ce785d76719c..f79707b9f761 100644
> --- a/fs/fuse/famfs_kfmap.h
> +++ b/fs/fuse/famfs_kfmap.h
> @@ -60,4 +60,30 @@ struct famfs_file_meta {
>  	};
>  };
>  
> +/**
> + * famfs_daxdev - tracking struct for a daxdev within a famfs file system
> + *
> + * This is the in-memory daxdev metadata that is populated by
> + * the responses to GET_FMAP messages
> + */
> +struct famfs_daxdev {
> +	/* Include dev uuid? */
> +	bool valid;
> +	bool error;
> +	dev_t devno;
> +	struct dax_device *devp;
> +	char *name;
> +};
> +
> +#define MAX_DAXDEVS 24
> +
> +/**
> + * famfs_dax_devlist - list of famfs_daxdev's

Run kernel-doc script over these. It gets grumpy about partial
documentation.

> + */
> +struct famfs_dax_devlist {
> +	int nslots;
> +	int ndevs;
> +	struct famfs_daxdev *devlist; /* XXX: make this an xarray! */
> +};
> +
>  #endif /* FAMFS_KFMAP_H */

> diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> index ecaaa62910f0..8a81b6c334fe 100644
> --- a/include/uapi/linux/fuse.h
> +++ b/include/uapi/linux/fuse.h
> @@ -235,6 +235,9 @@
>   *      - struct fuse_famfs_simple_ext
>   *      - struct fuse_famfs_iext
>   *      - struct fuse_famfs_fmap_header
> + *    - Add the following structs for the GET_DAXDEV message and reply
> + *      - struct fuse_get_daxdev_in
> + *      - struct fuse_get_daxdev_out
>   *    - Add the following enumerated types
>   *      - enum fuse_famfs_file_type
>   *      - enum famfs_ext_type
> @@ -1351,6 +1354,20 @@ struct fuse_famfs_fmap_header {
>  	uint64_t reserved1;
>  };
>  
> +struct fuse_get_daxdev_in {
> +	uint32_t        daxdev_num;
> +};
> +
> +#define DAXDEV_NAME_MAX 256
> +struct fuse_daxdev_out {
> +	uint16_t index;
> +	uint16_t reserved;
> +	uint32_t reserved2;
> +	uint64_t reserved3; /* enough space for a uuid if we need it */

Odd place for the comment. If it just refers to reserved3 then nope
not enough space.  If you mean that and reserved4 then fiar enough
but that's not obvious as it stands.

> +	uint64_t reserved4;
> +	char name[DAXDEV_NAME_MAX];
> +};
> +
>  static inline int32_t fmap_msg_min_size(void)
>  {
>  	/* Smallest fmap message is a header plus one simple extent */
> @@ -1358,4 +1375,5 @@ static inline int32_t fmap_msg_min_size(void)
>  		+ sizeof(struct fuse_famfs_simple_ext));
>  }
>  
> +
Stray change.  Worth a quick scrub to clean these out (even in an RFC) as they just add
noise to the bits you want people to look at!

>  #endif /* _LINUX_FUSE_H */


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC V2 10/18] famfs_fuse: Basic fuse kernel ABI enablement for famfs
  2025-07-04  7:54   ` Amir Goldstein
@ 2025-07-04 13:39     ` John Groves
  2025-07-07 17:39       ` Darrick J. Wong
  0 siblings, 1 reply; 91+ messages in thread
From: John Groves @ 2025-07-04 13:39 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Darrick J . Wong, Dan Williams, Miklos Szeredi, Bernd Schubert,
	John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Randy Dunlap, Jeff Layton, Kent Overstreet, linux-doc,
	linux-kernel, nvdimm, linux-cxl, linux-fsdevel, Jonathan Cameron,
	Stefan Hajnoczi, Joanne Koong, Josef Bacik, Aravind Ramesh,
	Ajay Joshi

On 25/07/04 09:54AM, Amir Goldstein wrote:
> On Thu, Jul 3, 2025 at 8:51 PM John Groves <John@groves.net> wrote:
> >
> > * FUSE_DAX_FMAP flag in INIT request/reply
> >
> > * fuse_conn->famfs_iomap (enable famfs-mapped files) to denote a
> >   famfs-enabled connection
> >
> > Signed-off-by: John Groves <john@groves.net>
> > ---
> >  fs/fuse/fuse_i.h          |  3 +++
> >  fs/fuse/inode.c           | 14 ++++++++++++++
> >  include/uapi/linux/fuse.h |  4 ++++
> >  3 files changed, 21 insertions(+)
> >
> > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> > index 9d87ac48d724..a592c1002861 100644
> > --- a/fs/fuse/fuse_i.h
> > +++ b/fs/fuse/fuse_i.h
> > @@ -873,6 +873,9 @@ struct fuse_conn {
> >         /* Use io_uring for communication */
> >         unsigned int io_uring;
> >
> > +       /* dev_dax_iomap support for famfs */
> > +       unsigned int famfs_iomap:1;
> > +
> 
> pls move up to the bit fields members.

Oops, done, thanks.

> 
> >         /** Maximum stack depth for passthrough backing files */
> >         int max_stack_depth;
> >
> > diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> > index 29147657a99f..e48e11c3f9f3 100644
> > --- a/fs/fuse/inode.c
> > +++ b/fs/fuse/inode.c
> > @@ -1392,6 +1392,18 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
> >                         }
> >                         if (flags & FUSE_OVER_IO_URING && fuse_uring_enabled())
> >                                 fc->io_uring = 1;
> > +                       if (IS_ENABLED(CONFIG_FUSE_FAMFS_DAX) &&
> > +                           flags & FUSE_DAX_FMAP) {
> > +                               /* XXX: Should also check that fuse server
> > +                                * has CAP_SYS_RAWIO and/or CAP_SYS_ADMIN,
> > +                                * since it is directing the kernel to access
> > +                                * dax memory directly - but this function
> > +                                * appears not to be called in fuse server
> > +                                * process context (b/c even if it drops
> > +                                * those capabilities, they are held here).
> > +                                */
> > +                               fc->famfs_iomap = 1;
> > +                       }
> 
> 1. As long as the mapping requests are checking capabilities we should be ok
>     Right?

It depends on the definition of "are", or maybe of "mapping requests" ;)

Forgive me if this *is* obvious, but the fuse server capabilities are what
I think need to be checked here - not the app that it accessing a file.

An app accessing a regular file doesn't need permission to do raw access to
the underlying block dev, but the fuse server does - becuase it is directing
the kernel to access that for apps.

> 2. What's the deal with capable(CAP_SYS_ADMIN) in process_init_limits then?

I *think* that's checking the capabilities of the app that is accessing the
file, and not the fuse server. But I might be wrong - I have not pulled very
hard on that thread yet.

> 3. Darrick mentioned the need for a synchronic INIT variant for his work on
>     blockdev iomap support [1]

I'm not sure that's the same thing (Darrick?), but I do think Darrick's
use case probably needs to check capabilities for a server that is sending
apps (via files) off to access extents of block devices.

> 
> I also wonder how much of your patches and Darrick's patches end up
> being an overlap?

Darrick and I spent some time hashing through this, and came to the conclusion
that the actual overlap is slim-to-none. 

> 
> Thanks,
> Amir.
> 
> [1] https://lore.kernel.org/linux-fsdevel/20250613174413.GM6138@frogsfrogsfrogs/

Thank you!
John

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC V2 18/18] famfs_fuse: Add documentation
  2025-07-04  3:53       ` Bagas Sanjaya
@ 2025-07-04 18:58         ` Matthew Wilcox
  2025-07-04 23:29           ` Bagas Sanjaya
  0 siblings, 1 reply; 91+ messages in thread
From: Matthew Wilcox @ 2025-07-04 18:58 UTC (permalink / raw)
  To: Bagas Sanjaya
  Cc: Jonathan Corbet, John Groves, Dan Williams, Miklos Szeredi,
	Bernd Schubert, John Groves, Vishal Verma, Dave Jiang, Jan Kara,
	Alexander Viro, Christian Brauner, Darrick J . Wong, Randy Dunlap,
	Jeff Layton, Kent Overstreet, linux-doc, linux-kernel, nvdimm,
	linux-cxl, linux-fsdevel, Amir Goldstein, Jonathan Cameron,
	Stefan Hajnoczi, Joanne Koong, Josef Bacik, Aravind Ramesh,
	Ajay Joshi

On Fri, Jul 04, 2025 at 10:53:23AM +0700, Bagas Sanjaya wrote:
> On Thu, Jul 03, 2025 at 08:22:58PM -0600, Jonathan Corbet wrote:
> > Bagas.  Stop.
> > 
> > John has written documentation, that is great.  Do not add needless
> > friction to this process.  Seriously.
> > 
> > Why do I have to keep telling you this?
> 
> Cause I'm more of perfectionist (detail-oriented)...

Reviews aren't about you.  They're about producing a better patch.
Do your reviews produce better patches or do they make the perfect the
enemy of the good?

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC V2 12/18] famfs_fuse: Plumb the GET_FMAP message/response
  2025-07-04  8:54   ` Amir Goldstein
@ 2025-07-04 20:30     ` John Groves
  2025-07-05  0:06       ` John Groves
  0 siblings, 1 reply; 91+ messages in thread
From: John Groves @ 2025-07-04 20:30 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Dan Williams, Miklos Szeredi, Bernd Schubert, John Groves,
	Jonathan Corbet, Vishal Verma, Dave Jiang, Matthew Wilcox,
	Jan Kara, Alexander Viro, Christian Brauner, Darrick J . Wong,
	Randy Dunlap, Jeff Layton, Kent Overstreet, linux-doc,
	linux-kernel, nvdimm, linux-cxl, linux-fsdevel, Jonathan Cameron,
	Stefan Hajnoczi, Joanne Koong, Josef Bacik, Aravind Ramesh,
	Ajay Joshi

On 25/07/04 10:54AM, Amir Goldstein wrote:
> On Thu, Jul 3, 2025 at 8:51 PM John Groves <John@groves.net> wrote:
> >
> > Upon completion of an OPEN, if we're in famfs-mode we do a GET_FMAP to
> > retrieve and cache up the file-to-dax map in the kernel. If this
> > succeeds, read/write/mmap are resolved direct-to-dax with no upcalls.
> >
> > GET_FMAP has a variable-size response payload, and the allocated size
> > is sent in the in_args[0].size field. If the fmap would overflow the
> > message, the fuse server sends a reply of size 'sizeof(uint32_t)' which
> > specifies the size of the fmap message. Then the kernel can realloc a
> > large enough buffer and try again.
> >
> > Signed-off-by: John Groves <john@groves.net>
> > ---
> >  fs/fuse/file.c            | 84 +++++++++++++++++++++++++++++++++++++++
> >  fs/fuse/fuse_i.h          | 36 ++++++++++++++++-
> >  fs/fuse/inode.c           | 19 +++++++--
> >  fs/fuse/iomode.c          |  2 +-
> >  include/uapi/linux/fuse.h | 18 +++++++++
> >  5 files changed, 154 insertions(+), 5 deletions(-)
> >
> > diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> > index 93b82660f0c8..8616fb0a6d61 100644
> > --- a/fs/fuse/file.c
> > +++ b/fs/fuse/file.c
> > @@ -230,6 +230,77 @@ static void fuse_truncate_update_attr(struct inode *inode, struct file *file)
> >         fuse_invalidate_attr_mask(inode, FUSE_STATX_MODSIZE);
> >  }
> >
> > +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> 
> We generally try to avoid #ifdef blocks in c files
> keep them mostly in h files and use in c files
>    if (IS_ENABLED(CONFIG_FUSE_FAMFS_DAX))
> 
> also #if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> it a bit strange for a bool Kconfig because it looks too
> much like the c code, so I prefer
> #ifdef CONFIG_FUSE_FAMFS_DAX
> when you have to use it
> 
> If you need entire functions compiled out, why not put them in famfs.c?

Perhaps moving fuse_get_fmap() to famfs.c is the best approach. Will try that
first.

Regarding '#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)', vs.
'#ifdef CONFIG_FUSE_FAMFS_DAX' vs. '#if CONFIG_FUSE_FAMFS_DAX'...

I've learned to be cautious there because the latter two are undefined if
CONFIG_FUSE_FAMFS_DAX=m. I've been burned by this.

My original thinking was that famfs made sense as a module, but I'm leaning
the other way now - and in this series fs/fuse/Kconfig makes it a bool - 
meaning all three macro tests will work because a bool can't be set to 'm'. 

So to the extent that I need conditional compilation macros I can switch
to '#ifdef...'.


> 
> > +
> > +#define FMAP_BUFSIZE 4096
> > +
> > +static int
> > +fuse_get_fmap(struct fuse_mount *fm, struct inode *inode, u64 nodeid)
> > +{
> > +       struct fuse_get_fmap_in inarg = { 0 };
> > +       size_t fmap_bufsize = FMAP_BUFSIZE;
> > +       ssize_t fmap_size;
> > +       int retries = 1;
> > +       void *fmap_buf;
> > +       int rc;
> > +
> > +       FUSE_ARGS(args);
> > +
> > +       fmap_buf = kcalloc(1, FMAP_BUFSIZE, GFP_KERNEL);
> > +       if (!fmap_buf)
> > +               return -EIO;
> > +
> > + retry_once:
> > +       inarg.size = fmap_bufsize;
> > +
> > +       args.opcode = FUSE_GET_FMAP;
> > +       args.nodeid = nodeid;
> > +
> > +       args.in_numargs = 1;
> > +       args.in_args[0].size = sizeof(inarg);
> > +       args.in_args[0].value = &inarg;
> > +
> > +       /* Variable-sized output buffer
> > +        * this causes fuse_simple_request() to return the size of the
> > +        * output payload
> > +        */
> > +       args.out_argvar = true;
> > +       args.out_numargs = 1;
> > +       args.out_args[0].size = fmap_bufsize;
> > +       args.out_args[0].value = fmap_buf;
> > +
> > +       /* Send GET_FMAP command */
> > +       rc = fuse_simple_request(fm, &args);
> > +       if (rc < 0) {
> > +               pr_err("%s: err=%d from fuse_simple_request()\n",
> > +                      __func__, rc);
> > +               return rc;
> > +       }
> > +       fmap_size = rc;
> > +
> > +       if (retries && fmap_size == sizeof(uint32_t)) {
> > +               /* fmap size exceeded fmap_bufsize;
> > +                * actual fmap size returned in fmap_buf;
> > +                * realloc and retry once
> > +                */
> > +               fmap_bufsize = *((uint32_t *)fmap_buf);
> > +
> > +               --retries;
> > +               kfree(fmap_buf);
> > +               fmap_buf = kcalloc(1, fmap_bufsize, GFP_KERNEL);
> > +               if (!fmap_buf)
> > +                       return -EIO;
> > +
> > +               goto retry_once;
> > +       }
> > +
> > +       /* Will call famfs_file_init_dax() when that gets added */
> > +
> > +       kfree(fmap_buf);
> > +       return 0;
> > +}
> > +#endif
> > +
> >  static int fuse_open(struct inode *inode, struct file *file)
> >  {
> >         struct fuse_mount *fm = get_fuse_mount(inode);
> > @@ -263,6 +334,19 @@ static int fuse_open(struct inode *inode, struct file *file)
> >
> >         err = fuse_do_open(fm, get_node_id(inode), file, false);
> >         if (!err) {
> > +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> > +               if (fm->fc->famfs_iomap) {
> 
> That should be
> > +               if (IS_ENABLED(CONFIG_FUSE_FAMFS_DAX) &&
> > +                   fm->fc->famfs_iomap) {
> 
> > +                       if (S_ISREG(inode->i_mode)) {
> > +                               int rc;
> > +                               /* Get the famfs fmap */
> > +                               rc = fuse_get_fmap(fm, inode,
> > +                                                  get_node_id(inode));
> > +                               if (rc)
> > +                                       pr_err("%s: fuse_get_fmap err=%d\n",
> > +                                              __func__, rc);
> > +                       }
> > +               }
> > +#endif
> >                 ff = file->private_data;
> >                 err = fuse_finish_open(inode, file);
> >                 if (err)
> > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> > index f4ee61046578..e01d6e5c6e93 100644
> > --- a/fs/fuse/fuse_i.h
> > +++ b/fs/fuse/fuse_i.h
> > @@ -193,6 +193,10 @@ struct fuse_inode {
> >         /** Reference to backing file in passthrough mode */
> >         struct fuse_backing *fb;
> >  #endif
> > +
> > +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> > +       void *famfs_meta;
> > +#endif
> >  };
> >
> >  /** FUSE inode state bits */
> > @@ -945,6 +949,8 @@ struct fuse_conn {
> >  #endif
> >
> >  #if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> > +       struct rw_semaphore famfs_devlist_sem;
> > +       struct famfs_dax_devlist *dax_devlist;
> >         char *shadow;
> >  #endif
> >  };
> > @@ -1435,11 +1441,14 @@ void fuse_free_conn(struct fuse_conn *fc);
> >
> >  /* dax.c */
> >
> > +static inline int fuse_file_famfs(struct fuse_inode *fi); /* forward */
> > +
> >  /* This macro is used by virtio_fs, but now it also needs to filter for
> >   * "not famfs"
> >   */
> >  #define FUSE_IS_VIRTIO_DAX(fuse_inode) (IS_ENABLED(CONFIG_FUSE_DAX)    \
> > -                                       && IS_DAX(&fuse_inode->inode))
> > +                                       && IS_DAX(&fuse_inode->inode)   \
> > +                                       && !fuse_file_famfs(fuse_inode))
> >
> >  ssize_t fuse_dax_read_iter(struct kiocb *iocb, struct iov_iter *to);
> >  ssize_t fuse_dax_write_iter(struct kiocb *iocb, struct iov_iter *from);
> > @@ -1550,4 +1559,29 @@ extern void fuse_sysctl_unregister(void);
> >  #define fuse_sysctl_unregister()       do { } while (0)
> >  #endif /* CONFIG_SYSCTL */
> >
> > +/* famfs.c */
> > +static inline struct fuse_backing *famfs_meta_set(struct fuse_inode *fi,
> > +                                                      void *meta)
> > +{
> > +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> > +       return xchg(&fi->famfs_meta, meta);
> > +#else
> > +       return NULL;
> > +#endif
> > +}
> > +
> > +static inline void famfs_meta_free(struct fuse_inode *fi)
> > +{
> > +       /* Stub wil be connected in a subsequent commit */
> > +}
> > +
> > +static inline int fuse_file_famfs(struct fuse_inode *fi)
> > +{
> > +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> > +       return (READ_ONCE(fi->famfs_meta) != NULL);
> > +#else
> > +       return 0;
> > +#endif
> > +}
> > +
> >  #endif /* _FS_FUSE_I_H */
> > diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> > index a7e1cf8257b0..b071d16f7d04 100644
> > --- a/fs/fuse/inode.c
> > +++ b/fs/fuse/inode.c
> > @@ -117,6 +117,9 @@ static struct inode *fuse_alloc_inode(struct super_block *sb)
> >         if (IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
> >                 fuse_inode_backing_set(fi, NULL);
> >
> > +       if (IS_ENABLED(CONFIG_FUSE_FAMFS_DAX))
> > +               famfs_meta_set(fi, NULL);
> > +
> >         return &fi->inode;
> >
> >  out_free_forget:
> > @@ -138,6 +141,13 @@ static void fuse_free_inode(struct inode *inode)
> >         if (IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
> >                 fuse_backing_put(fuse_inode_backing(fi));
> >
> > +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> > +       if (S_ISREG(inode->i_mode) && fi->famfs_meta) {
> > +               famfs_meta_free(fi);
> > +               famfs_meta_set(fi, NULL);
> > +       }
> > +#endif
> > +
> >         kmem_cache_free(fuse_inode_cachep, fi);
> >  }
> >
> > @@ -1002,6 +1012,9 @@ void fuse_conn_init(struct fuse_conn *fc, struct fuse_mount *fm,
> >         if (IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
> >                 fuse_backing_files_init(fc);
> >
> > +       if (IS_ENABLED(CONFIG_FUSE_FAMFS_DAX))
> > +               pr_notice("%s: Kernel is FUSE_FAMFS_DAX capable\n", __func__);
> > +
> >         INIT_LIST_HEAD(&fc->mounts);
> >         list_add(&fm->fc_entry, &fc->mounts);
> >         fm->fc = fc;
> > @@ -1036,9 +1049,8 @@ void fuse_conn_put(struct fuse_conn *fc)
> >                 }
> >                 if (IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
> >                         fuse_backing_files_free(fc);
> > -#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> > -               kfree(fc->shadow);
> > -#endif
> > +               if (IS_ENABLED(CONFIG_FUSE_FAMFS_DAX))
> > +                       kfree(fc->shadow);
> >                 call_rcu(&fc->rcu, delayed_release);
> >         }
> >  }
> > @@ -1425,6 +1437,7 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
> >                                  * those capabilities, they are held here).
> >                                  */
> >                                 fc->famfs_iomap = 1;
> > +                               init_rwsem(&fc->famfs_devlist_sem);
> >                         }
> >                 } else {
> >                         ra_pages = fc->max_read / PAGE_SIZE;
> > diff --git a/fs/fuse/iomode.c b/fs/fuse/iomode.c
> > index aec4aecb5d79..443b337b0c05 100644
> > --- a/fs/fuse/iomode.c
> > +++ b/fs/fuse/iomode.c
> > @@ -204,7 +204,7 @@ int fuse_file_io_open(struct file *file, struct inode *inode)
> >          * io modes are not relevant with DAX and with server that does not
> >          * implement open.
> >          */
> > -       if (FUSE_IS_VIRTIO_DAX(fi) || !ff->args)
> > +       if (FUSE_IS_VIRTIO_DAX(fi) || fuse_file_famfs(fi) || !ff->args)
> >                 return 0;
> 
> This is where fuse_is_dax() helper would be handy.

Roger that. Thinking through it...

> 
> Thanks,
> Amir.

Thank you,
John


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC V2 18/18] famfs_fuse: Add documentation
  2025-07-04 18:58         ` Matthew Wilcox
@ 2025-07-04 23:29           ` Bagas Sanjaya
  2025-07-04 23:43             ` Matthew Wilcox
  0 siblings, 1 reply; 91+ messages in thread
From: Bagas Sanjaya @ 2025-07-04 23:29 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jonathan Corbet, John Groves, Dan Williams, Miklos Szeredi,
	Bernd Schubert, John Groves, Vishal Verma, Dave Jiang, Jan Kara,
	Alexander Viro, Christian Brauner, Darrick J . Wong, Randy Dunlap,
	Jeff Layton, Kent Overstreet, linux-doc, linux-kernel, nvdimm,
	linux-cxl, linux-fsdevel, Amir Goldstein, Jonathan Cameron,
	Stefan Hajnoczi, Joanne Koong, Josef Bacik, Aravind Ramesh,
	Ajay Joshi

On Fri, Jul 04, 2025 at 07:58:28PM +0100, Matthew Wilcox wrote:
> On Fri, Jul 04, 2025 at 10:53:23AM +0700, Bagas Sanjaya wrote:
> > On Thu, Jul 03, 2025 at 08:22:58PM -0600, Jonathan Corbet wrote:
> > > Bagas.  Stop.
> > > 
> > > John has written documentation, that is great.  Do not add needless
> > > friction to this process.  Seriously.
> > > 
> > > Why do I have to keep telling you this?
> > 
> > Cause I'm more of perfectionist (detail-oriented)...
> 
> Reviews aren't about you.  They're about producing a better patch.
> Do your reviews produce better patches or do they make the perfect the
> enemy of the good?

I'm looking for any Sphinx warnings, but if there's none, I check for
better wording or improving the docs output.

-- 
An old man doll... just what I always wanted! - Clara

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC V2 18/18] famfs_fuse: Add documentation
  2025-07-04  8:27   ` Amir Goldstein
@ 2025-07-04 23:36     ` Bagas Sanjaya
  0 siblings, 0 replies; 91+ messages in thread
From: Bagas Sanjaya @ 2025-07-04 23:36 UTC (permalink / raw)
  To: Amir Goldstein, John Groves
  Cc: Dan Williams, Bernd Schubert, John Groves, Jonathan Corbet,
	Vishal Verma, Dave Jiang, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Darrick J . Wong, Randy Dunlap,
	Jeff Layton, Kent Overstreet, linux-doc, linux-kernel, nvdimm,
	linux-cxl, linux-fsdevel, Jonathan Cameron, Stefan Hajnoczi,
	Joanne Koong, Josef Bacik, Aravind Ramesh, Ajay Joshi,
	Miklos Szeredi

[-- Attachment #1: Type: text/plain, Size: 1199 bytes --]

On Fri, Jul 04, 2025 at 10:27:03AM +0200, Amir Goldstein wrote:
> On Thu, Jul 3, 2025 at 8:51 PM John Groves <John@groves.net> wrote:
> >
> > Add Documentation/filesystems/famfs.rst and update MAINTAINERS
> >
> > Signed-off-by: John Groves <john@groves.net>
> > ---
> >  Documentation/filesystems/famfs.rst | 142 ++++++++++++++++++++++++++++
> >  Documentation/filesystems/index.rst |   1 +
> >  MAINTAINERS                         |   1 +
> >  3 files changed, 144 insertions(+)
> >  create mode 100644 Documentation/filesystems/famfs.rst
> 
> 
> Considering "Documentation: fuse: Consolidate FUSE docs into its own
> subdirectory"
> https://lore.kernel.org/linux-fsdevel/20250612032239.17561-1-bagasdotme@gmail.com/
> 
> I wonder if famfs and virtiofs should be moved into fuse subdir?
> To me it makes more sense, but it's not a clear cut.
> 

I guess these can stay in their place as-is for now. However, if we later have
more fuse-based filesystems (at least 3 or 4), placing them in
Documentation/filesystems/fuse-based might make sense (fuse subdir documents
fuse framework itself, though).

Thanks.

-- 
An old man doll... just what I always wanted! - Clara

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC V2 18/18] famfs_fuse: Add documentation
  2025-07-04 23:29           ` Bagas Sanjaya
@ 2025-07-04 23:43             ` Matthew Wilcox
  2025-07-05  1:11               ` Bagas Sanjaya
  0 siblings, 1 reply; 91+ messages in thread
From: Matthew Wilcox @ 2025-07-04 23:43 UTC (permalink / raw)
  To: Bagas Sanjaya
  Cc: Jonathan Corbet, John Groves, Dan Williams, Miklos Szeredi,
	Bernd Schubert, John Groves, Vishal Verma, Dave Jiang, Jan Kara,
	Alexander Viro, Christian Brauner, Darrick J . Wong, Randy Dunlap,
	Jeff Layton, Kent Overstreet, linux-doc, linux-kernel, nvdimm,
	linux-cxl, linux-fsdevel, Amir Goldstein, Jonathan Cameron,
	Stefan Hajnoczi, Joanne Koong, Josef Bacik, Aravind Ramesh,
	Ajay Joshi

On Sat, Jul 05, 2025 at 06:29:03AM +0700, Bagas Sanjaya wrote:
> On Fri, Jul 04, 2025 at 07:58:28PM +0100, Matthew Wilcox wrote:
> > On Fri, Jul 04, 2025 at 10:53:23AM +0700, Bagas Sanjaya wrote:
> > > On Thu, Jul 03, 2025 at 08:22:58PM -0600, Jonathan Corbet wrote:
> > > > Bagas.  Stop.
> > > > 
> > > > John has written documentation, that is great.  Do not add needless
> > > > friction to this process.  Seriously.
> > > > 
> > > > Why do I have to keep telling you this?
> > > 
> > > Cause I'm more of perfectionist (detail-oriented)...
> > 
> > Reviews aren't about you.  They're about producing a better patch.
> > Do your reviews produce better patches or do they make the perfect the
> > enemy of the good?
> 
> I'm looking for any Sphinx warnings, but if there's none, I check for
> better wording or improving the docs output.

That's appreciated.  Really.  But what you should be looking for is
unclear or misleading wording.  Not "this should be 'may' instead of
'might'".  The review you give is often closer to nitpicking than
serious review.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC V2 12/18] famfs_fuse: Plumb the GET_FMAP message/response
  2025-07-04 20:30     ` John Groves
@ 2025-07-05  0:06       ` John Groves
  2025-07-05  7:58         ` Amir Goldstein
  0 siblings, 1 reply; 91+ messages in thread
From: John Groves @ 2025-07-05  0:06 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Dan Williams, Miklos Szeredi, Bernd Schubert, John Groves,
	Jonathan Corbet, Vishal Verma, Dave Jiang, Matthew Wilcox,
	Jan Kara, Alexander Viro, Christian Brauner, Darrick J . Wong,
	Randy Dunlap, Jeff Layton, Kent Overstreet, linux-doc,
	linux-kernel, nvdimm, linux-cxl, linux-fsdevel, Jonathan Cameron,
	Stefan Hajnoczi, Joanne Koong, Josef Bacik, Aravind Ramesh,
	Ajay Joshi

On 25/07/04 03:30PM, John Groves wrote:
> On 25/07/04 10:54AM, Amir Goldstein wrote:
> > On Thu, Jul 3, 2025 at 8:51 PM John Groves <John@groves.net> wrote:
> > >
> > > Upon completion of an OPEN, if we're in famfs-mode we do a GET_FMAP to
> > > retrieve and cache up the file-to-dax map in the kernel. If this
> > > succeeds, read/write/mmap are resolved direct-to-dax with no upcalls.
> > >
> > > GET_FMAP has a variable-size response payload, and the allocated size
> > > is sent in the in_args[0].size field. If the fmap would overflow the
> > > message, the fuse server sends a reply of size 'sizeof(uint32_t)' which
> > > specifies the size of the fmap message. Then the kernel can realloc a
> > > large enough buffer and try again.
> > >
> > > Signed-off-by: John Groves <john@groves.net>
> > > ---
> > >  fs/fuse/file.c            | 84 +++++++++++++++++++++++++++++++++++++++
> > >  fs/fuse/fuse_i.h          | 36 ++++++++++++++++-
> > >  fs/fuse/inode.c           | 19 +++++++--
> > >  fs/fuse/iomode.c          |  2 +-
> > >  include/uapi/linux/fuse.h | 18 +++++++++
> > >  5 files changed, 154 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> > > index 93b82660f0c8..8616fb0a6d61 100644
> > > --- a/fs/fuse/file.c
> > > +++ b/fs/fuse/file.c
> > > @@ -230,6 +230,77 @@ static void fuse_truncate_update_attr(struct inode *inode, struct file *file)
> > >         fuse_invalidate_attr_mask(inode, FUSE_STATX_MODSIZE);
> > >  }
> > >
> > > +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> > 
> > We generally try to avoid #ifdef blocks in c files
> > keep them mostly in h files and use in c files
> >    if (IS_ENABLED(CONFIG_FUSE_FAMFS_DAX))
> > 
> > also #if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> > it a bit strange for a bool Kconfig because it looks too
> > much like the c code, so I prefer
> > #ifdef CONFIG_FUSE_FAMFS_DAX
> > when you have to use it
> > 
> > If you need entire functions compiled out, why not put them in famfs.c?
> 
> Perhaps moving fuse_get_fmap() to famfs.c is the best approach. Will try that
> first.
> 
> Regarding '#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)', vs.
> '#ifdef CONFIG_FUSE_FAMFS_DAX' vs. '#if CONFIG_FUSE_FAMFS_DAX'...
> 
> I've learned to be cautious there because the latter two are undefined if
> CONFIG_FUSE_FAMFS_DAX=m. I've been burned by this.
> 
> My original thinking was that famfs made sense as a module, but I'm leaning
> the other way now - and in this series fs/fuse/Kconfig makes it a bool - 
> meaning all three macro tests will work because a bool can't be set to 'm'. 
> 
> So to the extent that I need conditional compilation macros I can switch
> to '#ifdef...'.

Doh. Spirit of full disclosure: this commit doesn't build if
CONFIG_FUSE_FAMFS_DAX is not set (!=y). So the conditionals are at
risk if getting worse, not better. Working on it...

<snip>

Thanks,
John


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC V2 18/18] famfs_fuse: Add documentation
  2025-07-04 23:43             ` Matthew Wilcox
@ 2025-07-05  1:11               ` Bagas Sanjaya
  0 siblings, 0 replies; 91+ messages in thread
From: Bagas Sanjaya @ 2025-07-05  1:11 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jonathan Corbet, John Groves, Dan Williams, Miklos Szeredi,
	Bernd Schubert, John Groves, Vishal Verma, Dave Jiang, Jan Kara,
	Alexander Viro, Christian Brauner, Darrick J . Wong, Randy Dunlap,
	Jeff Layton, Kent Overstreet, linux-doc, linux-kernel, nvdimm,
	linux-cxl, linux-fsdevel, Amir Goldstein, Jonathan Cameron,
	Stefan Hajnoczi, Joanne Koong, Josef Bacik, Aravind Ramesh,
	Ajay Joshi

[-- Attachment #1: Type: text/plain, Size: 576 bytes --]

On Sat, Jul 05, 2025 at 12:43:18AM +0100, Matthew Wilcox wrote:
> On Sat, Jul 05, 2025 at 06:29:03AM +0700, Bagas Sanjaya wrote:
> > I'm looking for any Sphinx warnings, but if there's none, I check for
> > better wording or improving the docs output.
> 
> That's appreciated.  Really.  But what you should be looking for is
> unclear or misleading wording.  Not "this should be 'may' instead of
> 'might'".  The review you give is often closer to nitpicking than
> serious review.

Thanks for the tip!

-- 
An old man doll... just what I always wanted! - Clara

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC V2 12/18] famfs_fuse: Plumb the GET_FMAP message/response
  2025-07-05  0:06       ` John Groves
@ 2025-07-05  7:58         ` Amir Goldstein
  2025-07-05 19:17           ` John Groves
  0 siblings, 1 reply; 91+ messages in thread
From: Amir Goldstein @ 2025-07-05  7:58 UTC (permalink / raw)
  To: John Groves
  Cc: Dan Williams, Miklos Szeredi, Bernd Schubert, John Groves,
	Jonathan Corbet, Vishal Verma, Dave Jiang, Matthew Wilcox,
	Jan Kara, Alexander Viro, Christian Brauner, Darrick J . Wong,
	Randy Dunlap, Jeff Layton, Kent Overstreet, linux-doc,
	linux-kernel, nvdimm, linux-cxl, linux-fsdevel, Jonathan Cameron,
	Stefan Hajnoczi, Joanne Koong, Josef Bacik, Aravind Ramesh,
	Ajay Joshi

On Sat, Jul 5, 2025 at 2:06 AM John Groves <John@groves.net> wrote:
>
> On 25/07/04 03:30PM, John Groves wrote:
> > On 25/07/04 10:54AM, Amir Goldstein wrote:
> > > On Thu, Jul 3, 2025 at 8:51 PM John Groves <John@groves.net> wrote:
> > > >
> > > > Upon completion of an OPEN, if we're in famfs-mode we do a GET_FMAP to
> > > > retrieve and cache up the file-to-dax map in the kernel. If this
> > > > succeeds, read/write/mmap are resolved direct-to-dax with no upcalls.
> > > >
> > > > GET_FMAP has a variable-size response payload, and the allocated size
> > > > is sent in the in_args[0].size field. If the fmap would overflow the
> > > > message, the fuse server sends a reply of size 'sizeof(uint32_t)' which
> > > > specifies the size of the fmap message. Then the kernel can realloc a
> > > > large enough buffer and try again.
> > > >
> > > > Signed-off-by: John Groves <john@groves.net>
> > > > ---
> > > >  fs/fuse/file.c            | 84 +++++++++++++++++++++++++++++++++++++++
> > > >  fs/fuse/fuse_i.h          | 36 ++++++++++++++++-
> > > >  fs/fuse/inode.c           | 19 +++++++--
> > > >  fs/fuse/iomode.c          |  2 +-
> > > >  include/uapi/linux/fuse.h | 18 +++++++++
> > > >  5 files changed, 154 insertions(+), 5 deletions(-)
> > > >
> > > > diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> > > > index 93b82660f0c8..8616fb0a6d61 100644
> > > > --- a/fs/fuse/file.c
> > > > +++ b/fs/fuse/file.c
> > > > @@ -230,6 +230,77 @@ static void fuse_truncate_update_attr(struct inode *inode, struct file *file)
> > > >         fuse_invalidate_attr_mask(inode, FUSE_STATX_MODSIZE);
> > > >  }
> > > >
> > > > +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> > >
> > > We generally try to avoid #ifdef blocks in c files
> > > keep them mostly in h files and use in c files
> > >    if (IS_ENABLED(CONFIG_FUSE_FAMFS_DAX))
> > >
> > > also #if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> > > it a bit strange for a bool Kconfig because it looks too
> > > much like the c code, so I prefer
> > > #ifdef CONFIG_FUSE_FAMFS_DAX
> > > when you have to use it
> > >
> > > If you need entire functions compiled out, why not put them in famfs.c?
> >
> > Perhaps moving fuse_get_fmap() to famfs.c is the best approach. Will try that
> > first.
> >
> > Regarding '#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)', vs.
> > '#ifdef CONFIG_FUSE_FAMFS_DAX' vs. '#if CONFIG_FUSE_FAMFS_DAX'...
> >
> > I've learned to be cautious there because the latter two are undefined if
> > CONFIG_FUSE_FAMFS_DAX=m. I've been burned by this.

Yes, that's a risk, but as the code is shaping up right now,
I do not foresee FAMFS becoming a module(?)

> >
> > My original thinking was that famfs made sense as a module, but I'm leaning
> > the other way now - and in this series fs/fuse/Kconfig makes it a bool -
> > meaning all three macro tests will work because a bool can't be set to 'm'.
> >
> > So to the extent that I need conditional compilation macros I can switch
> > to '#ifdef...'.
>
> Doh. Spirit of full disclosure: this commit doesn't build if
> CONFIG_FUSE_FAMFS_DAX is not set (!=y). So the conditionals are at
> risk if getting worse, not better. Working on it...
>

You're probably going to need to add stub inline functions
for all the functions from famfs.c and a few more wrappers
I guess.

The right amount of ifdefs in C files is really a matter of judgement,
but the fewer the better for code flow clarity.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC V2 12/18] famfs_fuse: Plumb the GET_FMAP message/response
  2025-07-05  7:58         ` Amir Goldstein
@ 2025-07-05 19:17           ` John Groves
  0 siblings, 0 replies; 91+ messages in thread
From: John Groves @ 2025-07-05 19:17 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Dan Williams, Miklos Szeredi, Bernd Schubert, John Groves,
	Jonathan Corbet, Vishal Verma, Dave Jiang, Matthew Wilcox,
	Jan Kara, Alexander Viro, Christian Brauner, Darrick J . Wong,
	Randy Dunlap, Jeff Layton, Kent Overstreet, linux-doc,
	linux-kernel, nvdimm, linux-cxl, linux-fsdevel, Jonathan Cameron,
	Stefan Hajnoczi, Joanne Koong, Josef Bacik, Aravind Ramesh,
	Ajay Joshi

On 25/07/05 09:58AM, Amir Goldstein wrote:
> On Sat, Jul 5, 2025 at 2:06 AM John Groves <John@groves.net> wrote:
> >
> > On 25/07/04 03:30PM, John Groves wrote:
> > > On 25/07/04 10:54AM, Amir Goldstein wrote:
> > > > On Thu, Jul 3, 2025 at 8:51 PM John Groves <John@groves.net> wrote:
> > > > >
> > > > > Upon completion of an OPEN, if we're in famfs-mode we do a GET_FMAP to
> > > > > retrieve and cache up the file-to-dax map in the kernel. If this
> > > > > succeeds, read/write/mmap are resolved direct-to-dax with no upcalls.
> > > > >
> > > > > GET_FMAP has a variable-size response payload, and the allocated size
> > > > > is sent in the in_args[0].size field. If the fmap would overflow the
> > > > > message, the fuse server sends a reply of size 'sizeof(uint32_t)' which
> > > > > specifies the size of the fmap message. Then the kernel can realloc a
> > > > > large enough buffer and try again.
> > > > >
> > > > > Signed-off-by: John Groves <john@groves.net>
> > > > > ---
> > > > >  fs/fuse/file.c            | 84 +++++++++++++++++++++++++++++++++++++++
> > > > >  fs/fuse/fuse_i.h          | 36 ++++++++++++++++-
> > > > >  fs/fuse/inode.c           | 19 +++++++--
> > > > >  fs/fuse/iomode.c          |  2 +-
> > > > >  include/uapi/linux/fuse.h | 18 +++++++++
> > > > >  5 files changed, 154 insertions(+), 5 deletions(-)
> > > > >
> > > > > diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> > > > > index 93b82660f0c8..8616fb0a6d61 100644
> > > > > --- a/fs/fuse/file.c
> > > > > +++ b/fs/fuse/file.c
> > > > > @@ -230,6 +230,77 @@ static void fuse_truncate_update_attr(struct inode *inode, struct file *file)
> > > > >         fuse_invalidate_attr_mask(inode, FUSE_STATX_MODSIZE);
> > > > >  }
> > > > >
> > > > > +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> > > >
> > > > We generally try to avoid #ifdef blocks in c files
> > > > keep them mostly in h files and use in c files
> > > >    if (IS_ENABLED(CONFIG_FUSE_FAMFS_DAX))
> > > >
> > > > also #if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> > > > it a bit strange for a bool Kconfig because it looks too
> > > > much like the c code, so I prefer
> > > > #ifdef CONFIG_FUSE_FAMFS_DAX
> > > > when you have to use it
> > > >
> > > > If you need entire functions compiled out, why not put them in famfs.c?
> > >
> > > Perhaps moving fuse_get_fmap() to famfs.c is the best approach. Will try that
> > > first.
> > >
> > > Regarding '#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)', vs.
> > > '#ifdef CONFIG_FUSE_FAMFS_DAX' vs. '#if CONFIG_FUSE_FAMFS_DAX'...
> > >
> > > I've learned to be cautious there because the latter two are undefined if
> > > CONFIG_FUSE_FAMFS_DAX=m. I've been burned by this.
> 
> Yes, that's a risk, but as the code is shaping up right now,
> I do not foresee FAMFS becoming a module(?)

Yeah, I can't think of a good reason to go that way at this point.

> 
> > >
> > > My original thinking was that famfs made sense as a module, but I'm leaning
> > > the other way now - and in this series fs/fuse/Kconfig makes it a bool -
> > > meaning all three macro tests will work because a bool can't be set to 'm'.
> > >
> > > So to the extent that I need conditional compilation macros I can switch
> > > to '#ifdef...'.
> >
> > Doh. Spirit of full disclosure: this commit doesn't build if
> > CONFIG_FUSE_FAMFS_DAX is not set (!=y). So the conditionals are at
> > risk if getting worse, not better. Working on it...
> >
> 
> You're probably going to need to add stub inline functions
> for all the functions from famfs.c and a few more wrappers
> I guess.
> 
> The right amount of ifdefs in C files is really a matter of judgement,
> but the fewer the better for code flow clarity.
> 
> Thanks,
> Amir.

Right - I've done that now, and it actually looks pretty clean to me.

Thanks!
John


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC V2 13/18] famfs_fuse: Create files with famfs fmaps
  2025-07-04  9:01   ` Amir Goldstein
@ 2025-07-05 19:27     ` John Groves
  0 siblings, 0 replies; 91+ messages in thread
From: John Groves @ 2025-07-05 19:27 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Dan Williams, Bernd Schubert, John Groves, Jonathan Corbet,
	Vishal Verma, Dave Jiang, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Darrick J . Wong, Randy Dunlap,
	Jeff Layton, Kent Overstreet, linux-doc, linux-kernel, nvdimm,
	linux-cxl, linux-fsdevel, Jonathan Cameron, Stefan Hajnoczi,
	Joanne Koong, Josef Bacik, Aravind Ramesh, Ajay Joshi,
	Miklos Szeredi

On 25/07/04 11:01AM, Amir Goldstein wrote:
> On Thu, Jul 3, 2025 at 8:51 PM John Groves <John@groves.net> wrote:
> >
> > On completion of GET_FMAP message/response, setup the full famfs
> > metadata such that it's possible to handle read/write/mmap directly to
> > dax. Note that the devdax_iomap plumbing is not in yet...
> >
> > Update MAINTAINERS for the new files.
> >
> > Signed-off-by: John Groves <john@groves.net>
> > ---
> >  MAINTAINERS               |   9 +
> >  fs/fuse/Makefile          |   2 +-
> >  fs/fuse/famfs.c           | 360 ++++++++++++++++++++++++++++++++++++++
> >  fs/fuse/famfs_kfmap.h     |  63 +++++++
> >  fs/fuse/file.c            |  15 +-
> >  fs/fuse/fuse_i.h          |  16 +-
> >  fs/fuse/inode.c           |   2 +-
> >  include/uapi/linux/fuse.h |  56 ++++++
> >  8 files changed, 518 insertions(+), 5 deletions(-)
> >  create mode 100644 fs/fuse/famfs.c
> >  create mode 100644 fs/fuse/famfs_kfmap.h
> >
> > diff --git a/MAINTAINERS b/MAINTAINERS
> > index c0d5232a473b..02688f27a4d0 100644
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -8808,6 +8808,15 @@ F:       Documentation/networking/failover.rst
> >  F:     include/net/failover.h
> >  F:     net/core/failover.c
> >
> > +FAMFS
> > +M:     John Groves <jgroves@micron.com>
> > +M:     John Groves <John@Groves.net>
> > +L:     linux-cxl@vger.kernel.org
> > +L:     linux-fsdevel@vger.kernel.org
> > +S:     Supported
> > +F:     fs/fuse/famfs.c
> > +F:     fs/fuse/famfs_kfmap.h
> > +
> 
> I suggest to follow the pattern of MAINTAINERS sub entries
> FILESYSTEMS [EXPORTFS]
> FILESYSTEMS [IOMAP]
> 
> and call this sub entry
> FUSE [FAMFS]
> 
> to order it following FUSE entry
> 
> Thanks,
> Amir.

Done, and queued to -next

Thanks,
John


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC V2 15/18] famfs_fuse: Plumb dax iomap and fuse read/write/mmap
  2025-07-04  9:13   ` Amir Goldstein
@ 2025-07-05 19:44     ` John Groves
  0 siblings, 0 replies; 91+ messages in thread
From: John Groves @ 2025-07-05 19:44 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Dan Williams, Bernd Schubert, John Groves, Jonathan Corbet,
	Vishal Verma, Dave Jiang, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Darrick J . Wong, Randy Dunlap,
	Jeff Layton, Kent Overstreet, linux-doc, linux-kernel, nvdimm,
	linux-cxl, linux-fsdevel, Jonathan Cameron, Stefan Hajnoczi,
	Joanne Koong, Josef Bacik, Aravind Ramesh, Ajay Joshi,
	Miklos Szeredi

On 25/07/04 11:13AM, Amir Goldstein wrote:
> On Thu, Jul 3, 2025 at 8:51 PM John Groves <John@groves.net> wrote:
> >
> > This commit fills in read/write/mmap handling for famfs files. The
> > dev_dax_iomap interface is used - just like xfs in fs-dax mode.
> >
> > Signed-off-by: John Groves <john@groves.net>
> > ---
> >  fs/fuse/famfs.c  | 436 +++++++++++++++++++++++++++++++++++++++++++++++
> >  fs/fuse/file.c   |  14 ++
> >  fs/fuse/fuse_i.h |   3 +
> >  3 files changed, 453 insertions(+)
> >
> > diff --git a/fs/fuse/famfs.c b/fs/fuse/famfs.c
> > index f5e01032b825..1973eb10b60b 100644
> > --- a/fs/fuse/famfs.c
> > +++ b/fs/fuse/famfs.c
> > @@ -585,3 +585,439 @@ famfs_file_init_dax(
> >         return rc;
> >  }
> >
> > +/*********************************************************************
> > + * iomap_operations
> > + *
> > + * This stuff uses the iomap (dax-related) helpers to resolve file offsets to
> > + * offsets within a dax device.
> > + */
> > +
> > +static ssize_t famfs_file_bad(struct inode *inode);
> > +
> > +static int
> > +famfs_interleave_fileofs_to_daxofs(struct inode *inode, struct iomap *iomap,
> > +                        loff_t file_offset, off_t len, unsigned int flags)
> > +{
> > +       struct fuse_inode *fi = get_fuse_inode(inode);
> > +       struct famfs_file_meta *meta = fi->famfs_meta;
> > +       struct fuse_conn *fc = get_fuse_conn(inode);
> > +       loff_t local_offset = file_offset;
> > +       int i;
> > +
> > +       /* This function is only for extent_type INTERLEAVED_EXTENT */
> > +       if (meta->fm_extent_type != INTERLEAVED_EXTENT) {
> > +               pr_err("%s: bad extent type\n", __func__);
> > +               goto err_out;
> > +       }
> > +
> > +       if (famfs_file_bad(inode))
> > +               goto err_out;
> > +
> > +       iomap->offset = file_offset;
> > +
> > +       for (i = 0; i < meta->fm_niext; i++) {
> > +               struct famfs_meta_interleaved_ext *fei = &meta->ie[i];
> > +               u64 chunk_size = fei->fie_chunk_size;
> > +               u64 nstrips = fei->fie_nstrips;
> > +               u64 ext_size = fei->fie_nbytes;
> > +
> > +               ext_size = min_t(u64, ext_size, meta->file_size);
> > +
> > +               if (ext_size == 0) {
> > +                       pr_err("%s: ext_size=%lld file_size=%ld\n",
> > +                              __func__, fei->fie_nbytes, meta->file_size);
> > +                       goto err_out;
> > +               }
> > +
> > +               /* Is the data is in this striped extent? */
> > +               if (local_offset < ext_size) {
> > +                       u64 chunk_num       = local_offset / chunk_size;
> > +                       u64 chunk_offset    = local_offset % chunk_size;
> > +                       u64 stripe_num      = chunk_num / nstrips;
> > +                       u64 strip_num       = chunk_num % nstrips;
> > +                       u64 chunk_remainder = chunk_size - chunk_offset;
> > +                       u64 strip_offset    = chunk_offset + (stripe_num * chunk_size);
> > +                       u64 strip_dax_ofs = fei->ie_strips[strip_num].ext_offset;
> > +                       u64 strip_devidx = fei->ie_strips[strip_num].dev_index;
> > +
> > +                       if (!fc->dax_devlist->devlist[strip_devidx].valid) {
> > +                               pr_err("%s: daxdev=%lld invalid\n", __func__,
> > +                                       strip_devidx);
> > +                               goto err_out;
> > +                       }
> > +                       iomap->addr    = strip_dax_ofs + strip_offset;
> > +                       iomap->offset  = file_offset;
> > +                       iomap->length  = min_t(loff_t, len, chunk_remainder);
> > +
> > +                       iomap->dax_dev = fc->dax_devlist->devlist[strip_devidx].devp;
> > +
> > +                       iomap->type    = IOMAP_MAPPED;
> > +                       iomap->flags   = flags;
> > +
> > +                       return 0;
> > +               }
> > +               local_offset -= ext_size; /* offset is beyond this striped extent */
> > +       }
> > +
> > + err_out:
> > +       pr_err("%s: err_out\n", __func__);
> > +
> > +       /* We fell out the end of the extent list.
> > +        * Set iomap to zero length in this case, and return 0
> > +        * This just means that the r/w is past EOF
> > +        */
> > +       iomap->addr    = 0; /* there is no valid dax device offset */
> > +       iomap->offset  = file_offset; /* file offset */
> > +       iomap->length  = 0; /* this had better result in no access to dax mem */
> > +       iomap->dax_dev = NULL;
> > +       iomap->type    = IOMAP_MAPPED;
> > +       iomap->flags   = flags;
> > +
> > +       return 0;
> > +}
> > +
> > +/**
> > + * famfs_fileofs_to_daxofs() - Resolve (file, offset, len) to (daxdev, offset, len)
> > + *
> > + * This function is called by famfs_fuse_iomap_begin() to resolve an offset in a
> > + * file to an offset in a dax device. This is upcalled from dax from calls to
> > + * both  * dax_iomap_fault() and dax_iomap_rw(). Dax finishes the job resolving
> > + * a fault to a specific physical page (the fault case) or doing a memcpy
> > + * variant (the rw case)
> > + *
> > + * Pages can be PTE (4k), PMD (2MiB) or (theoretically) PuD (1GiB)
> > + * (these sizes are for X86; may vary on other cpu architectures
> > + *
> > + * @inode:  The file where the fault occurred
> > + * @iomap:       To be filled in to indicate where to find the right memory,
> > + *               relative  to a dax device.
> > + * @file_offset: Within the file where the fault occurred (will be page boundary)
> > + * @len:         The length of the faulted mapping (will be a page multiple)
> > + *               (will be trimmed in *iomap if it's disjoint in the extent list)
> > + * @flags:
> > + *
> > + * Return values: 0. (info is returned in a modified @iomap struct)
> > + */
> > +static int
> > +famfs_fileofs_to_daxofs(struct inode *inode, struct iomap *iomap,
> > +                        loff_t file_offset, off_t len, unsigned int flags)
> > +{
> > +       struct fuse_inode *fi = get_fuse_inode(inode);
> > +       struct famfs_file_meta *meta = fi->famfs_meta;
> > +       struct fuse_conn *fc = get_fuse_conn(inode);
> > +       loff_t local_offset = file_offset;
> > +       int i;
> > +
> > +       if (!fc->dax_devlist) {
> > +               pr_err("%s: null dax_devlist\n", __func__);
> > +               goto err_out;
> > +       }
> > +
> > +       if (famfs_file_bad(inode))
> > +               goto err_out;
> > +
> > +       if (meta->fm_extent_type == INTERLEAVED_EXTENT)
> > +               return famfs_interleave_fileofs_to_daxofs(inode, iomap,
> > +                                                         file_offset,
> > +                                                         len, flags);
> > +
> > +       iomap->offset = file_offset;
> > +
> > +       for (i = 0; i < meta->fm_nextents; i++) {
> > +               /* TODO: check devindex too */
> > +               loff_t dax_ext_offset = meta->se[i].ext_offset;
> > +               loff_t dax_ext_len    = meta->se[i].ext_len;
> > +               u64 daxdev_idx = meta->se[i].dev_index;
> > +
> > +               if ((dax_ext_offset == 0) &&
> > +                   (meta->file_type != FAMFS_SUPERBLOCK))
> > +                       pr_warn("%s: zero offset on non-superblock file!!\n",
> > +                               __func__);
> > +
> > +               /* local_offset is the offset minus the size of extents skipped
> > +                * so far; If local_offset < dax_ext_len, the data of interest
> > +                * starts in this extent
> > +                */
> > +               if (local_offset < dax_ext_len) {
> > +                       loff_t ext_len_remainder = dax_ext_len - local_offset;
> > +                       struct famfs_daxdev *dd;
> > +
> > +                       dd = &fc->dax_devlist->devlist[daxdev_idx];
> > +
> > +                       if (!dd->valid || dd->error) {
> > +                               pr_err("%s: daxdev=%lld %s\n", __func__,
> > +                                      daxdev_idx,
> > +                                      dd->valid ? "error" : "invalid");
> > +                               goto err_out;
> > +                       }
> > +
> > +                       /*
> > +                        * OK, we found the file metadata extent where this
> > +                        * data begins
> > +                        * @local_offset      - The offset within the current
> > +                        *                      extent
> > +                        * @ext_len_remainder - Remaining length of ext after
> > +                        *                      skipping local_offset
> > +                        * Outputs:
> > +                        * iomap->addr:   the offset within the dax device where
> > +                        *                the  data starts
> > +                        * iomap->offset: the file offset
> > +                        * iomap->length: the valid length resolved here
> > +                        */
> > +                       iomap->addr    = dax_ext_offset + local_offset;
> > +                       iomap->offset  = file_offset;
> > +                       iomap->length  = min_t(loff_t, len, ext_len_remainder);
> > +
> > +                       iomap->dax_dev = fc->dax_devlist->devlist[daxdev_idx].devp;
> > +
> > +                       iomap->type    = IOMAP_MAPPED;
> > +                       iomap->flags   = flags;
> > +                       return 0;
> > +               }
> > +               local_offset -= dax_ext_len; /* Get ready for the next extent */
> > +       }
> > +
> > + err_out:
> > +       pr_err("%s: err_out\n", __func__);
> > +
> > +       /* We fell out the end of the extent list.
> > +        * Set iomap to zero length in this case, and return 0
> > +        * This just means that the r/w is past EOF
> > +        */
> > +       iomap->addr    = 0; /* there is no valid dax device offset */
> > +       iomap->offset  = file_offset; /* file offset */
> > +       iomap->length  = 0; /* this had better result in no access to dax mem */
> > +       iomap->dax_dev = NULL;
> > +       iomap->type    = IOMAP_MAPPED;
> > +       iomap->flags   = flags;
> > +
> > +       return 0;
> > +}
> > +
> > +/**
> > + * famfs_fuse_iomap_begin() - Handler for iomap_begin upcall from dax
> > + *
> > + * This function is pretty simple because files are
> > + * * never partially allocated
> > + * * never have holes (never sparse)
> > + * * never "allocate on write"
> > + *
> > + * @inode:  inode for the file being accessed
> > + * @offset: offset within the file
> > + * @length: Length being accessed at offset
> > + * @flags:
> > + * @iomap:  iomap struct to be filled in, resolving (offset, length) to
> > + *          (daxdev, offset, len)
> > + * @srcmap:
> > + */
> > +static int
> > +famfs_fuse_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> > +                 unsigned int flags, struct iomap *iomap, struct iomap *srcmap)
> > +{
> > +       struct fuse_inode *fi = get_fuse_inode(inode);
> > +       struct famfs_file_meta *meta = fi->famfs_meta;
> > +       size_t size;
> > +
> > +       size = i_size_read(inode);
> > +
> > +       WARN_ON(size != meta->file_size);
> > +
> > +       return famfs_fileofs_to_daxofs(inode, iomap, offset, length, flags);
> > +}
> > +
> > +/* Note: We never need a special set of write_iomap_ops because famfs never
> > + * performs allocation on write.
> > + */
> > +const struct iomap_ops famfs_iomap_ops = {
> > +       .iomap_begin            = famfs_fuse_iomap_begin,
> > +};
> > +
> > +/*********************************************************************
> > + * vm_operations
> > + */
> > +static vm_fault_t
> > +__famfs_fuse_filemap_fault(struct vm_fault *vmf, unsigned int pe_size,
> > +                     bool write_fault)
> > +{
> > +       struct inode *inode = file_inode(vmf->vma->vm_file);
> > +       vm_fault_t ret;
> > +       pfn_t pfn;
> > +
> > +       if (!IS_DAX(file_inode(vmf->vma->vm_file))) {
> > +               pr_err("%s: file not marked IS_DAX!!\n", __func__);
> > +               return VM_FAULT_SIGBUS;
> > +       }
> > +
> > +       if (write_fault) {
> > +               sb_start_pagefault(inode->i_sb);
> > +               file_update_time(vmf->vma->vm_file);
> > +       }
> > +
> > +       ret = dax_iomap_fault(vmf, pe_size, &pfn, NULL, &famfs_iomap_ops);
> > +       if (ret & VM_FAULT_NEEDDSYNC)
> > +               ret = dax_finish_sync_fault(vmf, pe_size, pfn);
> > +
> > +       if (write_fault)
> > +               sb_end_pagefault(inode->i_sb);
> > +
> > +       return ret;
> > +}
> > +
> > +static inline bool
> > +famfs_is_write_fault(struct vm_fault *vmf)
> > +{
> > +       return (vmf->flags & FAULT_FLAG_WRITE) &&
> > +              (vmf->vma->vm_flags & VM_SHARED);
> > +}
> > +
> > +static vm_fault_t
> > +famfs_filemap_fault(struct vm_fault *vmf)
> > +{
> > +       return __famfs_fuse_filemap_fault(vmf, 0, famfs_is_write_fault(vmf));
> > +}
> > +
> > +static vm_fault_t
> > +famfs_filemap_huge_fault(struct vm_fault *vmf, unsigned int pe_size)
> > +{
> > +       return __famfs_fuse_filemap_fault(vmf, pe_size, famfs_is_write_fault(vmf));
> > +}
> > +
> > +static vm_fault_t
> > +famfs_filemap_page_mkwrite(struct vm_fault *vmf)
> > +{
> > +       return __famfs_fuse_filemap_fault(vmf, 0, true);
> > +}
> > +
> > +static vm_fault_t
> > +famfs_filemap_pfn_mkwrite(struct vm_fault *vmf)
> > +{
> > +       return __famfs_fuse_filemap_fault(vmf, 0, true);
> > +}
> > +
> > +static vm_fault_t
> > +famfs_filemap_map_pages(struct vm_fault        *vmf, pgoff_t start_pgoff,
> > +                       pgoff_t end_pgoff)
> > +{
> > +       return filemap_map_pages(vmf, start_pgoff, end_pgoff);
> > +}
> > +
> > +const struct vm_operations_struct famfs_file_vm_ops = {
> > +       .fault          = famfs_filemap_fault,
> > +       .huge_fault     = famfs_filemap_huge_fault,
> > +       .map_pages      = famfs_filemap_map_pages,
> > +       .page_mkwrite   = famfs_filemap_page_mkwrite,
> > +       .pfn_mkwrite    = famfs_filemap_pfn_mkwrite,
> > +};
> > +
> > +/*********************************************************************
> > + * file_operations
> > + */
> > +
> > +/**
> > + * famfs_file_bad() - Check for files that aren't in a valid state
> > + *
> > + * @inode - inode
> > + *
> > + * Returns: 0=success
> > + *          -errno=failure
> > + */
> > +static ssize_t
> > +famfs_file_bad(struct inode *inode)
> > +{
> > +       struct fuse_inode *fi = get_fuse_inode(inode);
> > +       struct famfs_file_meta *meta = fi->famfs_meta;
> > +       size_t i_size = i_size_read(inode);
> > +
> > +       if (!meta) {
> > +               pr_err("%s: un-initialized famfs file\n", __func__);
> > +               return -EIO;
> > +       }
> > +       if (meta->error) {
> > +               pr_debug("%s: previously detected metadata errors\n", __func__);
> > +               return -EIO;
> > +       }
> > +       if (i_size != meta->file_size) {
> > +               pr_warn("%s: i_size overwritten from %ld to %ld\n",
> > +                      __func__, meta->file_size, i_size);
> > +               meta->error = true;
> > +               return -ENXIO;
> > +       }
> > +       if (!IS_DAX(inode)) {
> > +               pr_debug("%s: inode %llx IS_DAX is false\n", __func__, (u64)inode);
> > +               return -ENXIO;
> > +       }
> > +       return 0;
> > +}
> > +
> > +static ssize_t
> > +famfs_fuse_rw_prep(struct kiocb *iocb, struct iov_iter *ubuf)
> > +{
> > +       struct inode *inode = iocb->ki_filp->f_mapping->host;
> > +       size_t i_size = i_size_read(inode);
> > +       size_t count = iov_iter_count(ubuf);
> > +       size_t max_count;
> > +       ssize_t rc;
> > +
> > +       rc = famfs_file_bad(inode);
> > +       if (rc)
> > +               return rc;
> > +
> > +       max_count = max_t(size_t, 0, i_size - iocb->ki_pos);
> > +
> > +       if (count > max_count)
> > +               iov_iter_truncate(ubuf, max_count);
> > +
> > +       if (!iov_iter_count(ubuf))
> > +               return 0;
> > +
> > +       return rc;
> > +}
> > +
> > +ssize_t
> > +famfs_fuse_read_iter(struct kiocb *iocb, struct iov_iter       *to)
> > +{
> > +       ssize_t rc;
> > +
> > +       rc = famfs_fuse_rw_prep(iocb, to);
> > +       if (rc)
> > +               return rc;
> > +
> > +       if (!iov_iter_count(to))
> > +               return 0;
> > +
> > +       rc = dax_iomap_rw(iocb, to, &famfs_iomap_ops);
> > +
> > +       file_accessed(iocb->ki_filp);
> > +       return rc;
> > +}
> > +
> > +ssize_t
> > +famfs_fuse_write_iter(struct kiocb *iocb, struct iov_iter *from)
> > +{
> > +       ssize_t rc;
> > +
> > +       rc = famfs_fuse_rw_prep(iocb, from);
> > +       if (rc)
> > +               return rc;
> > +
> > +       if (!iov_iter_count(from))
> > +               return 0;
> > +
> > +       return dax_iomap_rw(iocb, from, &famfs_iomap_ops);
> > +}
> > +
> > +int
> > +famfs_fuse_mmap(struct file *file, struct vm_area_struct *vma)
> > +{
> > +       struct inode *inode = file_inode(file);
> > +       ssize_t rc;
> > +
> > +       rc = famfs_file_bad(inode);
> > +       if (rc)
> > +               return (int)rc;
> > +
> > +       file_accessed(file);
> > +       vma->vm_ops = &famfs_file_vm_ops;
> > +       vm_flags_set(vma, VM_HUGEPAGE);
> > +       return 0;
> > +}
> > diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> > index 5d205eadb48f..24a14b176510 100644
> > --- a/fs/fuse/file.c
> > +++ b/fs/fuse/file.c
> > @@ -1874,6 +1874,8 @@ static ssize_t fuse_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
> >
> >         if (FUSE_IS_VIRTIO_DAX(fi))
> >                 return fuse_dax_read_iter(iocb, to);
> > +       if (fuse_file_famfs(fi))
> > +               return famfs_fuse_read_iter(iocb, to);
> >
> >         /* FOPEN_DIRECT_IO overrides FOPEN_PASSTHROUGH */
> >         if (ff->open_flags & FOPEN_DIRECT_IO)
> > @@ -1896,6 +1898,8 @@ static ssize_t fuse_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
> >
> >         if (FUSE_IS_VIRTIO_DAX(fi))
> >                 return fuse_dax_write_iter(iocb, from);
> > +       if (fuse_file_famfs(fi))
> > +               return famfs_fuse_write_iter(iocb, from);
> >
> >         /* FOPEN_DIRECT_IO overrides FOPEN_PASSTHROUGH */
> >         if (ff->open_flags & FOPEN_DIRECT_IO)
> > @@ -1911,10 +1915,14 @@ static ssize_t fuse_splice_read(struct file *in, loff_t *ppos,
> >                                 unsigned int flags)
> >  {
> >         struct fuse_file *ff = in->private_data;
> > +       struct inode *inode = file_inode(in);
> > +       struct fuse_inode *fi = get_fuse_inode(inode);
> >
> >         /* FOPEN_DIRECT_IO overrides FOPEN_PASSTHROUGH */
> >         if (fuse_file_passthrough(ff) && !(ff->open_flags & FOPEN_DIRECT_IO))
> >                 return fuse_passthrough_splice_read(in, ppos, pipe, len, flags);
> > +       else if (fuse_file_famfs(fi))
> > +               return -EIO; /* direct I/O doesn't make sense in dax_iomap */
> >         else
> >                 return filemap_splice_read(in, ppos, pipe, len, flags);
> >  }
> > @@ -1923,10 +1931,14 @@ static ssize_t fuse_splice_write(struct pipe_inode_info *pipe, struct file *out,
> >                                  loff_t *ppos, size_t len, unsigned int flags)
> >  {
> >         struct fuse_file *ff = out->private_data;
> > +       struct inode *inode = file_inode(out);
> > +       struct fuse_inode *fi = get_fuse_inode(inode);
> >
> >         /* FOPEN_DIRECT_IO overrides FOPEN_PASSTHROUGH */
> >         if (fuse_file_passthrough(ff) && !(ff->open_flags & FOPEN_DIRECT_IO))
> >                 return fuse_passthrough_splice_write(pipe, out, ppos, len, flags);
> > +       else if (fuse_file_famfs(fi))
> > +               return -EIO; /* direct I/O doesn't make sense in dax_iomap */
> >         else
> >                 return iter_file_splice_write(pipe, out, ppos, len, flags);
> >  }
> 
> This looks odd.
> 
> Usually, the methods first check for FUSE_IS_VIRTIO_DAX() and
> fuse_file_famfs() to get this condition out of the way so I never needed
> to think about whether or not the code verifies that fuse_file_passthrough()
> and fuse_file_famfs() cannot co-exist.
> 
> Is there a reason why you did not follow the same pattern here?

I think I just got a little sloppy. I'll do the famfs test first. Unless we
can rule out this path for famfs. But unlike virtiofs, famfs doesn't have 
separate file_operations, so I suppose it must be checked.

> 
> Also, your comment makes no sense.
> splice is not the case of direct I/O - quite the contrary.

Sorry, brain fart on the comment. Splice doesn't make sense for famfs because
famfs doesn't use the page cache. Will fix the comment too.

> 
> Thanks,
> Amir.

Thank you Amir!
John


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC V2 04/18] dev_dax_iomap: Add dax_operations for use by fs-dax on devdax
  2025-07-04 12:47   ` Jonathan Cameron
@ 2025-07-05 22:56     ` John Groves
  0 siblings, 0 replies; 91+ messages in thread
From: John Groves @ 2025-07-05 22:56 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Dan Williams, Miklos Szeredi, Bernd Schubert, John Groves,
	Jonathan Corbet, Vishal Verma, Dave Jiang, Matthew Wilcox,
	Jan Kara, Alexander Viro, Christian Brauner, Darrick J . Wong,
	Randy Dunlap, Jeff Layton, Kent Overstreet, linux-doc,
	linux-kernel, nvdimm, linux-cxl, linux-fsdevel, Amir Goldstein,
	Stefan Hajnoczi, Joanne Koong, Josef Bacik, Aravind Ramesh,
	Ajay Joshi

On 25/07/04 01:47PM, Jonathan Cameron wrote:
> On Thu,  3 Jul 2025 13:50:18 -0500
> John Groves <John@Groves.net> wrote:
> 
> > Notes about this commit:
> > 
> > * These methods are based on pmem_dax_ops from drivers/nvdimm/pmem.c
> > 
> > * dev_dax_direct_access() is returns the hpa, pfn and kva. The kva was
> >   newly stored as dev_dax->virt_addr by dev_dax_probe().
> > 
> > * The hpa/pfn are used for mmap (dax_iomap_fault()), and the kva is used
> >   for read/write (dax_iomap_rw())
> > 
> > * dev_dax_recovery_write() and dev_dax_zero_page_range() have not been
> >   tested yet. I'm looking for suggestions as to how to test those.
> > 
> > Signed-off-by: John Groves <john@groves.net>
> A few trivial things noticed whilst reading through.

BTW thanks for looking at the dev_dax_iomap part of the series. These are
basically identical to the two standalone-famfs series' I put out last year,
but have IIRC not gotten review comments before this.

> 
> > ---
> >  drivers/dax/bus.c | 120 ++++++++++++++++++++++++++++++++++++++++++++--
> >  1 file changed, 115 insertions(+), 5 deletions(-)
> > 
> > diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> > index 9d9a4ae7bbc0..61a8d1b3c07a 100644
> > --- a/drivers/dax/bus.c
> > +++ b/drivers/dax/bus.c
> > @@ -7,6 +7,10 @@
> >  #include <linux/slab.h>
> >  #include <linux/dax.h>
> >  #include <linux/io.h>
> > +#include <linux/backing-dev.h>
> > +#include <linux/pfn_t.h>
> > +#include <linux/range.h>
> > +#include <linux/uio.h>
> >  #include "dax-private.h"
> >  #include "bus.h"
> >  
> > @@ -1441,6 +1445,105 @@ __weak phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff,
> >  }
> >  EXPORT_SYMBOL_GPL(dax_pgoff_to_phys);
> >  
> > +#if IS_ENABLED(CONFIG_DEV_DAX_IOMAP)
> > +
> > +static void write_dax(void *pmem_addr, struct page *page,
> > +		unsigned int off, unsigned int len)
> > +{
> > +	unsigned int chunk;
> > +	void *mem;
> 
> I'd move these two into the loop - similar to what you have
> in other cases with more local scope.

Done, thanks.

> 
> > +
> > +	while (len) {
> > +		mem = kmap_local_page(page);
> > +		chunk = min_t(unsigned int, len, PAGE_SIZE - off);
> > +		memcpy_flushcache(pmem_addr, mem + off, chunk);
> > +		kunmap_local(mem);
> > +		len -= chunk;
> > +		off = 0;
> > +		page++;
> > +		pmem_addr += chunk;
> > +	}
> > +}
> > +
> > +static long __dev_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
> > +			long nr_pages, enum dax_access_mode mode, void **kaddr,
> > +			pfn_t *pfn)
> > +{
> > +	struct dev_dax *dev_dax = dax_get_private(dax_dev);
> > +	size_t size = nr_pages << PAGE_SHIFT;
> > +	size_t offset = pgoff << PAGE_SHIFT;
> > +	void *virt_addr = dev_dax->virt_addr + offset;
> > +	u64 flags = PFN_DEV|PFN_MAP;
> 
> spaces around the |
> 
> Though given it's in just one place, just put these inline next
> to the question...

Done and done.

> 
> 
> > +	phys_addr_t phys;
> > +	pfn_t local_pfn;
> > +	size_t dax_size;
> > +
> > +	WARN_ON(!dev_dax->virt_addr);
> > +
> > +	if (down_read_interruptible(&dax_dev_rwsem))
> > +		return 0; /* no valid data since we were killed */
> > +	dax_size = dev_dax_size(dev_dax);
> > +	up_read(&dax_dev_rwsem);
> > +
> > +	phys = dax_pgoff_to_phys(dev_dax, pgoff, nr_pages << PAGE_SHIFT);
> > +
> > +	if (kaddr)
> > +		*kaddr = virt_addr;
> > +
> > +	local_pfn = phys_to_pfn_t(phys, flags); /* are flags correct? */
> > +	if (pfn)
> > +		*pfn = local_pfn;
> > +
> > +	/* This the valid size at the specified address */
> > +	return PHYS_PFN(min_t(size_t, size, dax_size - offset));
> > +}
> 
> > +static size_t dev_dax_recovery_write(struct dax_device *dax_dev, pgoff_t pgoff,
> > +		void *addr, size_t bytes, struct iov_iter *i)
> > +{
> > +	size_t off;
> > +
> > +	off = offset_in_page(addr);
> 
> Unused.

Righto. Thanks.

> > +
> > +	return _copy_from_iter_flushcache(addr, bytes, i);
> > +}
> 
> 

Thanks!
John


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC V2 14/18] famfs_fuse: GET_DAXDEV message and daxdev_table
  2025-07-04 13:20   ` Jonathan Cameron
@ 2025-07-06 17:07     ` John Groves
  0 siblings, 0 replies; 91+ messages in thread
From: John Groves @ 2025-07-06 17:07 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Dan Williams, Miklos Szeredi, Bernd Schubert, John Groves,
	Jonathan Corbet, Vishal Verma, Dave Jiang, Matthew Wilcox,
	Jan Kara, Alexander Viro, Christian Brauner, Darrick J . Wong,
	Randy Dunlap, Jeff Layton, Kent Overstreet, linux-doc,
	linux-kernel, nvdimm, linux-cxl, linux-fsdevel, Amir Goldstein,
	Stefan Hajnoczi, Joanne Koong, Josef Bacik, Aravind Ramesh,
	Ajay Joshi

On 25/07/04 02:20PM, Jonathan Cameron wrote:
> On Thu,  3 Jul 2025 13:50:28 -0500
> John Groves <John@Groves.net> wrote:
> 
> > * The new GET_DAXDEV message/response is enabled
> > * The command it triggered by the update_daxdev_table() call, if there
> >   are any daxdevs in the subject fmap that are not represented in the
> >   daxdev_dable yet.
> > 
> > Signed-off-by: John Groves <john@groves.net>
> 
> More drive by stuff you can ignore for now if you like.

Always appreciated...

> 
> > ---
> >  fs/fuse/famfs.c           | 227 ++++++++++++++++++++++++++++++++++++++
> >  fs/fuse/famfs_kfmap.h     |  26 +++++
> >  fs/fuse/fuse_i.h          |   1 +
> >  fs/fuse/inode.c           |   4 +-
> >  fs/namei.c                |   1 +
> >  include/uapi/linux/fuse.h |  18 +++
> >  6 files changed, 276 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/fuse/famfs.c b/fs/fuse/famfs.c
> > index 41c4d92f1451..f5e01032b825 100644
> > --- a/fs/fuse/famfs.c
> > +++ b/fs/fuse/famfs.c
> 
> > +/**
> > + * famfs_fuse_get_daxdev() - Retrieve info for a DAX device from fuse server
> > + *
> > + * Send a GET_DAXDEV message to the fuse server to retrieve info on a
> > + * dax device.
> > + *
> > + * @fm:     fuse_mount
> > + * @index:  the index of the dax device; daxdevs are referred to by index
> > + *          in fmaps, and the server resolves the index to a particular daxdev
> > + *
> > + * Returns: 0=success
> > + *          -errno=failure
> > + */
> > +static int
> > +famfs_fuse_get_daxdev(struct fuse_mount *fm, const u64 index)
> > +{
> > +	struct fuse_daxdev_out daxdev_out = { 0 };
> > +	struct fuse_conn *fc = fm->fc;
> > +	struct famfs_daxdev *daxdev;
> > +	int err = 0;
> > +
> > +	FUSE_ARGS(args);
> > +
> > +	/* Store the daxdev in our table */
> > +	if (index >= fc->dax_devlist->nslots) {
> > +		pr_err("%s: index(%lld) > nslots(%d)\n",
> > +		       __func__, index, fc->dax_devlist->nslots);
> > +		err = -EINVAL;
> > +		goto out;
> > +	}
> > +
> > +	args.opcode = FUSE_GET_DAXDEV;
> > +	args.nodeid = index;
> > +
> > +	args.in_numargs = 0;
> > +
> > +	args.out_numargs = 1;
> > +	args.out_args[0].size = sizeof(daxdev_out);
> > +	args.out_args[0].value = &daxdev_out;
> > +
> > +	/* Send GET_DAXDEV command */
> > +	err = fuse_simple_request(fm, &args);
> > +	if (err) {
> > +		pr_err("%s: err=%d from fuse_simple_request()\n",
> > +		       __func__, err);
> > +		/*
> > +		 * Error will be that the payload is smaller than FMAP_BUFSIZE,
> > +		 * which is the max we can handle. Empty payload handled below.
> > +		 */
> > +		goto out;
> > +	}
> > +
> > +	down_write(&fc->famfs_devlist_sem);
> 
> Worth thinking about guard() in this code in general.
> Simplify some of the error paths at least.

Thinking about it. Not sure I'll go there yet; I find the guard macros 
a bit confusing...

> 
> > +
> > +	daxdev = &fc->dax_devlist->devlist[index];
> > +
> > +	/* Abort if daxdev is now valid */
> > +	if (daxdev->valid) {
> > +		up_write(&fc->famfs_devlist_sem);
> > +		/* We already have a valid entry at this index */
> > +		err = -EALREADY;
> > +		goto out;
> > +	}
> > +
> > +	/* Verify that the dev is valid and can be opened and gets the devno */
> > +	err = famfs_verify_daxdev(daxdev_out.name, &daxdev->devno);
> > +	if (err) {
> > +		up_write(&fc->famfs_devlist_sem);
> > +		pr_err("%s: err=%d from famfs_verify_daxdev()\n", __func__, err);
> > +		goto out;
> > +	}
> > +
> > +	/* This will fail if it's not a dax device */
> > +	daxdev->devp = dax_dev_get(daxdev->devno);
> > +	if (!daxdev->devp) {
> > +		up_write(&fc->famfs_devlist_sem);
> > +		pr_warn("%s: device %s not found or not dax\n",
> > +			__func__, daxdev_out.name);
> > +		err = -ENODEV;
> > +		goto out;
> > +	}
> > +
> > +	daxdev->name = kstrdup(daxdev_out.name, GFP_KERNEL);
> > +	wmb(); /* all daxdev fields must be visible before marking it valid */
> > +	daxdev->valid = 1;
> > +
> > +	up_write(&fc->famfs_devlist_sem);
> > +
> > +out:
> > +	return err;
> > +}
> > +
> > +/**
> > + * famfs_update_daxdev_table() - Update the daxdev table
> > + * @fm   - fuse_mount
> > + * @meta - famfs_file_meta, in-memory format, built from a GET_FMAP response
> > + *
> > + * This function is called for each new file fmap, to verify whether all
> > + * referenced daxdevs are already known (i.e. in the table). Any daxdev
> > + * indices that referenced in @meta but not in the table will be retrieved via
> > + * famfs_fuse_get_daxdev() and added to the table
> > + *
> > + * Return: 0=success
> > + *         -errno=failure
> > + */
> > +static int
> > +famfs_update_daxdev_table(
> > +	struct fuse_mount *fm,
> > +	const struct famfs_file_meta *meta)
> > +{
> > +	struct famfs_dax_devlist *local_devlist;
> > +	struct fuse_conn *fc = fm->fc;
> > +	int err;
> > +	int i;
> > +
> > +	/* First time through we will need to allocate the dax_devlist */
> > +	if (!fc->dax_devlist) {
> > +		local_devlist = kcalloc(1, sizeof(*fc->dax_devlist), GFP_KERNEL);
> > +		if (!local_devlist)
> > +			return -ENOMEM;
> > +
> > +		local_devlist->nslots = MAX_DAXDEVS;
> > +
> > +		local_devlist->devlist = kcalloc(MAX_DAXDEVS,
> > +						 sizeof(struct famfs_daxdev),
> > +						 GFP_KERNEL);
> > +		if (!local_devlist->devlist) {
> > +			kfree(local_devlist);
> > +			return -ENOMEM;
> > +		}
> > +
> > +		/* We don't need the famfs_devlist_sem here because we use cmpxchg... */
> > +		if (cmpxchg(&fc->dax_devlist, NULL, local_devlist) != NULL) {
> > +			kfree(local_devlist->devlist);
> > +			kfree(local_devlist); /* another thread beat us to it */
> > +		}
> > +	}
> > +
> > +	down_read(&fc->famfs_devlist_sem);
> > +	for (i = 0; i < fc->dax_devlist->nslots; i++) {
> > +		if (meta->dev_bitmap & (1ULL << i)) {
> Flip for readability.
> 		if (!(meta->dev_bitmap & (1ULL << i))
> 			continue;

I like it - done..

> 
> Or can we use bitmap_from_arr64() and
> for_each_set_bit() to optimize this a little.

Could do, but I feel like that's a bit harder [for me] to read.

> 
> > +			/* This file meta struct references devindex i
> > +			 * if devindex i isn't in the table; get it...
> > +			 */
> > +			if (!(fc->dax_devlist->devlist[i].valid)) {
> > +				up_read(&fc->famfs_devlist_sem);
> > +
> > +				err = famfs_fuse_get_daxdev(fm, i);
> > +				if (err)
> > +					pr_err("%s: failed to get daxdev=%d\n",
> > +					       __func__, i);
> Don't want to surface that error?

I'm thinking on that. Failure to retrieve a dax device is currently
game over for the whole mount (because there is just one of them currently,
and it's retrieved to get access to the superblock and metadata log).
Once additional daxdevs are enabled there will be more nuance, but any
file that references a 'missing' dax device will be non-operative, so
putting something in the log makes sense to me.

I may surface it a bit differently, but I think it needs to surface.

> > +
> > +				down_read(&fc->famfs_devlist_sem);
> > +			}
> > +		}
> > +	}
> > +	up_read(&fc->famfs_devlist_sem);
> > +
> > +	return 0;
> > +}
> > +
> > +/***************************************************************************/
> 
> ?

One of my tics is divider comments. Will probably drop it though ;)

> 
> > diff --git a/fs/fuse/famfs_kfmap.h b/fs/fuse/famfs_kfmap.h
> > index ce785d76719c..f79707b9f761 100644
> > --- a/fs/fuse/famfs_kfmap.h
> > +++ b/fs/fuse/famfs_kfmap.h
> > @@ -60,4 +60,30 @@ struct famfs_file_meta {
> >  	};
> >  };
> >  
> > +/**
> > + * famfs_daxdev - tracking struct for a daxdev within a famfs file system
> > + *
> > + * This is the in-memory daxdev metadata that is populated by
> > + * the responses to GET_FMAP messages
> > + */
> > +struct famfs_daxdev {
> > +	/* Include dev uuid? */
> > +	bool valid;
> > +	bool error;
> > +	dev_t devno;
> > +	struct dax_device *devp;
> > +	char *name;
> > +};
> > +
> > +#define MAX_DAXDEVS 24
> > +
> > +/**
> > + * famfs_dax_devlist - list of famfs_daxdev's
> 
> Run kernel-doc script over these. It gets grumpy about partial
> documentation.

Thank you... I just did, and fixed a couple of issues it complained about.

> 
> > + */
> > +struct famfs_dax_devlist {
> > +	int nslots;
> > +	int ndevs;
> > +	struct famfs_daxdev *devlist; /* XXX: make this an xarray! */
> > +};
> > +
> >  #endif /* FAMFS_KFMAP_H */
> 
> > diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> > index ecaaa62910f0..8a81b6c334fe 100644
> > --- a/include/uapi/linux/fuse.h
> > +++ b/include/uapi/linux/fuse.h
> > @@ -235,6 +235,9 @@
> >   *      - struct fuse_famfs_simple_ext
> >   *      - struct fuse_famfs_iext
> >   *      - struct fuse_famfs_fmap_header
> > + *    - Add the following structs for the GET_DAXDEV message and reply
> > + *      - struct fuse_get_daxdev_in
> > + *      - struct fuse_get_daxdev_out
> >   *    - Add the following enumerated types
> >   *      - enum fuse_famfs_file_type
> >   *      - enum famfs_ext_type
> > @@ -1351,6 +1354,20 @@ struct fuse_famfs_fmap_header {
> >  	uint64_t reserved1;
> >  };
> >  
> > +struct fuse_get_daxdev_in {
> > +	uint32_t        daxdev_num;
> > +};
> > +
> > +#define DAXDEV_NAME_MAX 256
> > +struct fuse_daxdev_out {
> > +	uint16_t index;
> > +	uint16_t reserved;
> > +	uint32_t reserved2;
> > +	uint64_t reserved3; /* enough space for a uuid if we need it */
> 
> Odd place for the comment. If it just refers to reserved3 then nope
> not enough space.  If you mean that and reserved4 then fiar enough
> but that's not obvious as it stands.

Good point. Moved it above in -next

> 
> > +	uint64_t reserved4;
> > +	char name[DAXDEV_NAME_MAX];
> > +};
> > +
> >  static inline int32_t fmap_msg_min_size(void)
> >  {
> >  	/* Smallest fmap message is a header plus one simple extent */
> > @@ -1358,4 +1375,5 @@ static inline int32_t fmap_msg_min_size(void)
> >  		+ sizeof(struct fuse_famfs_simple_ext));
> >  }
> >  
> > +
> Stray change.  Worth a quick scrub to clean these out (even in an RFC) as they just add
> noise to the bits you want people to look at!

Yup, will fix.

BTW, public service announcement: I've discovered the awesomeness of jj
(aka ju jutsu, aka jj-vcs) as a wrapper for git that is great at the kind
of rebase problems that come with factoring and re-factoring patch set
branches. Without jj, more stuff like this would have slipped through ;)

<snip>

Thanks!
John

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC V2 10/18] famfs_fuse: Basic fuse kernel ABI enablement for famfs
  2025-07-03 22:45   ` John Groves
@ 2025-07-07 17:32     ` Darrick J. Wong
  0 siblings, 0 replies; 91+ messages in thread
From: Darrick J. Wong @ 2025-07-07 17:32 UTC (permalink / raw)
  To: John Groves
  Cc: Dan Williams, Miklos Szeredi, Bernd Schubert, John Groves,
	Jonathan Corbet, Vishal Verma, Dave Jiang, Matthew Wilcox,
	Jan Kara, Alexander Viro, Christian Brauner, Randy Dunlap,
	Jeff Layton, Kent Overstreet, linux-doc, linux-kernel, nvdimm,
	linux-cxl, linux-fsdevel, Amir Goldstein, Jonathan Cameron,
	Stefan Hajnoczi, Joanne Koong, Josef Bacik, Aravind Ramesh,
	Ajay Joshi

On Thu, Jul 03, 2025 at 05:45:48PM -0500, John Groves wrote:
> On 25/07/03 01:50PM, John Groves wrote:
> > * FUSE_DAX_FMAP flag in INIT request/reply
> > 
> > * fuse_conn->famfs_iomap (enable famfs-mapped files) to denote a
> >   famfs-enabled connection
> > 
> > Signed-off-by: John Groves <john@groves.net>
> > ---
> >  fs/fuse/fuse_i.h          |  3 +++
> >  fs/fuse/inode.c           | 14 ++++++++++++++
> >  include/uapi/linux/fuse.h |  4 ++++
> >  3 files changed, 21 insertions(+)
> > 
> > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> > index 9d87ac48d724..a592c1002861 100644
> > --- a/fs/fuse/fuse_i.h
> > +++ b/fs/fuse/fuse_i.h
> > @@ -873,6 +873,9 @@ struct fuse_conn {
> >  	/* Use io_uring for communication */
> >  	unsigned int io_uring;
> >  
> > +	/* dev_dax_iomap support for famfs */
> > +	unsigned int famfs_iomap:1;
> > +
> >  	/** Maximum stack depth for passthrough backing files */
> >  	int max_stack_depth;
> >  
> > diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> > index 29147657a99f..e48e11c3f9f3 100644
> > --- a/fs/fuse/inode.c
> > +++ b/fs/fuse/inode.c
> > @@ -1392,6 +1392,18 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
> >  			}
> >  			if (flags & FUSE_OVER_IO_URING && fuse_uring_enabled())
> >  				fc->io_uring = 1;
> > +			if (IS_ENABLED(CONFIG_FUSE_FAMFS_DAX) &&
> > +			    flags & FUSE_DAX_FMAP) {
> > +				/* XXX: Should also check that fuse server
> > +				 * has CAP_SYS_RAWIO and/or CAP_SYS_ADMIN,
> > +				 * since it is directing the kernel to access
> > +				 * dax memory directly - but this function
> > +				 * appears not to be called in fuse server
> > +				 * process context (b/c even if it drops
> > +				 * those capabilities, they are held here).
> > +				 */
> > +				fc->famfs_iomap = 1;
> 
> I think there should be a check here that the fuse server is 
> capable(CAP_SYS_RAWIO) (or maybe CAP_SYS_ADMIN), but this function doesn't 
> run in fuse server context. A famfs fuse server is providing fmaps, which 
> map files to devdax memory, which should not be an unprivileged operation.

I thought process_init_reply /does/ run in the fuse server's context.
It calls process_init_limits, which checks for capable(CAP_SYS_ADMIN)...

--D

> 1) Does fs/fuse already store the capabilities of the fuse server?
> 2) If not, where do you suggest I do that, and where do you suggest I store
> that info? The only dead-obvious place (to me) that fs/fuse runs in server
> context is in fuse_dev_open(), but it doesn't store anything...
> 
> @Miklos, I'd appreciate your advice here.
> 
> Thanks!
> John
> 
> 

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC V2 10/18] famfs_fuse: Basic fuse kernel ABI enablement for famfs
  2025-07-04 13:39     ` John Groves
@ 2025-07-07 17:39       ` Darrick J. Wong
  2025-07-08 12:02         ` John Groves
  0 siblings, 1 reply; 91+ messages in thread
From: Darrick J. Wong @ 2025-07-07 17:39 UTC (permalink / raw)
  To: John Groves
  Cc: Amir Goldstein, Dan Williams, Miklos Szeredi, Bernd Schubert,
	John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Randy Dunlap, Jeff Layton, Kent Overstreet, linux-doc,
	linux-kernel, nvdimm, linux-cxl, linux-fsdevel, Jonathan Cameron,
	Stefan Hajnoczi, Joanne Koong, Josef Bacik, Aravind Ramesh,
	Ajay Joshi

On Fri, Jul 04, 2025 at 08:39:59AM -0500, John Groves wrote:
> On 25/07/04 09:54AM, Amir Goldstein wrote:
> > On Thu, Jul 3, 2025 at 8:51 PM John Groves <John@groves.net> wrote:
> > >
> > > * FUSE_DAX_FMAP flag in INIT request/reply
> > >
> > > * fuse_conn->famfs_iomap (enable famfs-mapped files) to denote a
> > >   famfs-enabled connection
> > >
> > > Signed-off-by: John Groves <john@groves.net>
> > > ---
> > >  fs/fuse/fuse_i.h          |  3 +++
> > >  fs/fuse/inode.c           | 14 ++++++++++++++
> > >  include/uapi/linux/fuse.h |  4 ++++
> > >  3 files changed, 21 insertions(+)
> > >
> > > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> > > index 9d87ac48d724..a592c1002861 100644
> > > --- a/fs/fuse/fuse_i.h
> > > +++ b/fs/fuse/fuse_i.h
> > > @@ -873,6 +873,9 @@ struct fuse_conn {
> > >         /* Use io_uring for communication */
> > >         unsigned int io_uring;
> > >
> > > +       /* dev_dax_iomap support for famfs */
> > > +       unsigned int famfs_iomap:1;
> > > +
> > 
> > pls move up to the bit fields members.
> 
> Oops, done, thanks.
> 
> > 
> > >         /** Maximum stack depth for passthrough backing files */
> > >         int max_stack_depth;
> > >
> > > diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> > > index 29147657a99f..e48e11c3f9f3 100644
> > > --- a/fs/fuse/inode.c
> > > +++ b/fs/fuse/inode.c
> > > @@ -1392,6 +1392,18 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
> > >                         }
> > >                         if (flags & FUSE_OVER_IO_URING && fuse_uring_enabled())
> > >                                 fc->io_uring = 1;
> > > +                       if (IS_ENABLED(CONFIG_FUSE_FAMFS_DAX) &&
> > > +                           flags & FUSE_DAX_FMAP) {
> > > +                               /* XXX: Should also check that fuse server
> > > +                                * has CAP_SYS_RAWIO and/or CAP_SYS_ADMIN,
> > > +                                * since it is directing the kernel to access
> > > +                                * dax memory directly - but this function
> > > +                                * appears not to be called in fuse server
> > > +                                * process context (b/c even if it drops
> > > +                                * those capabilities, they are held here).
> > > +                                */
> > > +                               fc->famfs_iomap = 1;
> > > +                       }
> > 
> > 1. As long as the mapping requests are checking capabilities we should be ok
> >     Right?
> 
> It depends on the definition of "are", or maybe of "mapping requests" ;)
> 
> Forgive me if this *is* obvious, but the fuse server capabilities are what
> I think need to be checked here - not the app that it accessing a file.
> 
> An app accessing a regular file doesn't need permission to do raw access to
> the underlying block dev, but the fuse server does - becuase it is directing
> the kernel to access that for apps.
> 
> > 2. What's the deal with capable(CAP_SYS_ADMIN) in process_init_limits then?
> 
> I *think* that's checking the capabilities of the app that is accessing the
> file, and not the fuse server. But I might be wrong - I have not pulled very
> hard on that thread yet.

The init reply should be processed in the context of the fuse server.
At that point the kernel hasn't exposed the fs to user programs, so
(AFAICT) there won't be any other programs accessing that fuse mount.

> > 3. Darrick mentioned the need for a synchronic INIT variant for his work on
> >     blockdev iomap support [1]
> 
> I'm not sure that's the same thing (Darrick?), but I do think Darrick's
> use case probably needs to check capabilities for a server that is sending
> apps (via files) off to access extents of block devices.

I don't know either, Miklos hasn't responded to my questions.  I think
the motivation for a synchronous 

As for fuse/iomap, I just only need to ask the kernel if iomap support
is available before calling ext2fs_open2() because the iomap question
has some implications for how we open the ext4 filesystem.

> > I also wonder how much of your patches and Darrick's patches end up
> > being an overlap?
> 
> Darrick and I spent some time hashing through this, and came to the conclusion
> that the actual overlap is slim-to-none. 

Yeah.  The neat thing about FMAPs is that you can establish repeating
patterns, which is useful for interleaved DRAM/pmem devices.  Disk
filesystems don't do repeating patterns, so they'd much rather manage
non-repeating mappings.

--D

> > 
> > Thanks,
> > Amir.
> > 
> > [1] https://lore.kernel.org/linux-fsdevel/20250613174413.GM6138@frogsfrogsfrogs/
> 
> Thank you!
> John
> 

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC V2 10/18] famfs_fuse: Basic fuse kernel ABI enablement for famfs
  2025-07-07 17:39       ` Darrick J. Wong
@ 2025-07-08 12:02         ` John Groves
  2025-07-09  1:53           ` Darrick J. Wong
  0 siblings, 1 reply; 91+ messages in thread
From: John Groves @ 2025-07-08 12:02 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Amir Goldstein, Dan Williams, Miklos Szeredi, Bernd Schubert,
	John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Randy Dunlap, Jeff Layton, Kent Overstreet, linux-doc,
	linux-kernel, nvdimm, linux-cxl, linux-fsdevel, Jonathan Cameron,
	Stefan Hajnoczi, Joanne Koong, Josef Bacik, Aravind Ramesh,
	Ajay Joshi

On 25/07/07 10:39AM, Darrick J. Wong wrote:
> On Fri, Jul 04, 2025 at 08:39:59AM -0500, John Groves wrote:
> > On 25/07/04 09:54AM, Amir Goldstein wrote:
> > > On Thu, Jul 3, 2025 at 8:51 PM John Groves <John@groves.net> wrote:
> > > >
> > > > * FUSE_DAX_FMAP flag in INIT request/reply
> > > >
> > > > * fuse_conn->famfs_iomap (enable famfs-mapped files) to denote a
> > > >   famfs-enabled connection
> > > >
> > > > Signed-off-by: John Groves <john@groves.net>
> > > > ---
> > > >  fs/fuse/fuse_i.h          |  3 +++
> > > >  fs/fuse/inode.c           | 14 ++++++++++++++
> > > >  include/uapi/linux/fuse.h |  4 ++++
> > > >  3 files changed, 21 insertions(+)
> > > >
> > > > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> > > > index 9d87ac48d724..a592c1002861 100644
> > > > --- a/fs/fuse/fuse_i.h
> > > > +++ b/fs/fuse/fuse_i.h
> > > > @@ -873,6 +873,9 @@ struct fuse_conn {
> > > >         /* Use io_uring for communication */
> > > >         unsigned int io_uring;
> > > >
> > > > +       /* dev_dax_iomap support for famfs */
> > > > +       unsigned int famfs_iomap:1;
> > > > +
> > > 
> > > pls move up to the bit fields members.
> > 
> > Oops, done, thanks.
> > 
> > > 
> > > >         /** Maximum stack depth for passthrough backing files */
> > > >         int max_stack_depth;
> > > >
> > > > diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> > > > index 29147657a99f..e48e11c3f9f3 100644
> > > > --- a/fs/fuse/inode.c
> > > > +++ b/fs/fuse/inode.c
> > > > @@ -1392,6 +1392,18 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
> > > >                         }
> > > >                         if (flags & FUSE_OVER_IO_URING && fuse_uring_enabled())
> > > >                                 fc->io_uring = 1;
> > > > +                       if (IS_ENABLED(CONFIG_FUSE_FAMFS_DAX) &&
> > > > +                           flags & FUSE_DAX_FMAP) {
> > > > +                               /* XXX: Should also check that fuse server
> > > > +                                * has CAP_SYS_RAWIO and/or CAP_SYS_ADMIN,
> > > > +                                * since it is directing the kernel to access
> > > > +                                * dax memory directly - but this function
> > > > +                                * appears not to be called in fuse server
> > > > +                                * process context (b/c even if it drops
> > > > +                                * those capabilities, they are held here).
> > > > +                                */
> > > > +                               fc->famfs_iomap = 1;
> > > > +                       }
> > > 
> > > 1. As long as the mapping requests are checking capabilities we should be ok
> > >     Right?
> > 
> > It depends on the definition of "are", or maybe of "mapping requests" ;)
> > 
> > Forgive me if this *is* obvious, but the fuse server capabilities are what
> > I think need to be checked here - not the app that it accessing a file.
> > 
> > An app accessing a regular file doesn't need permission to do raw access to
> > the underlying block dev, but the fuse server does - becuase it is directing
> > the kernel to access that for apps.
> > 
> > > 2. What's the deal with capable(CAP_SYS_ADMIN) in process_init_limits then?
> > 
> > I *think* that's checking the capabilities of the app that is accessing the
> > file, and not the fuse server. But I might be wrong - I have not pulled very
> > hard on that thread yet.
> 
> The init reply should be processed in the context of the fuse server.
> At that point the kernel hasn't exposed the fs to user programs, so
> (AFAICT) there won't be any other programs accessing that fuse mount.

Hmm. It would be good if you're right about that. My fuse server *is* running
as root, and when I check those capabilities in process_init_reply(), I
find those capabilities. So far so good.

Then I added code to my fuse server to drop those capabilities prior to
starting the fuse session (prctl(PR_CAPBSET_DROP, CAP_SYS_RAWIO) and 
prctl(PR_CAPBSET_DROP, CAP_SYS_ADMIN). I expected (hoped?) to see those 
capabilities disappear in process_init_reply() - but they did not disappear.

I'm all ears if somebody can see a flaw in my logic here. Otherwise, the
capabilities need to be stashed away before the reply is processsed, when 
fs/fuse *is* running in fuse server context.

I'm somewhat surprised if that isn't already happening somewhere...

> 
> > > 3. Darrick mentioned the need for a synchronic INIT variant for his work on
> > >     blockdev iomap support [1]
> > 
> > I'm not sure that's the same thing (Darrick?), but I do think Darrick's
> > use case probably needs to check capabilities for a server that is sending
> > apps (via files) off to access extents of block devices.
> 
> I don't know either, Miklos hasn't responded to my questions.  I think
> the motivation for a synchronous 

?

> 
> As for fuse/iomap, I just only need to ask the kernel if iomap support
> is available before calling ext2fs_open2() because the iomap question
> has some implications for how we open the ext4 filesystem.
> 
> > > I also wonder how much of your patches and Darrick's patches end up
> > > being an overlap?
> > 
> > Darrick and I spent some time hashing through this, and came to the conclusion
> > that the actual overlap is slim-to-none. 
> 
> Yeah.  The neat thing about FMAPs is that you can establish repeating
> patterns, which is useful for interleaved DRAM/pmem devices.  Disk
> filesystems don't do repeating patterns, so they'd much rather manage
> non-repeating mappings.

Right. Interleaving is critical to how we use memory, so fmaps are designed
to support it.

Tangent: at some point a broader-than-just-me discussion of how block devices
have the device mapper, but memory has no such layout tools, might be good
to have. Without such a thing (which might or might not be possible/practical),
it's essential that famfs do the interleaving. Lacking a mapper layer also
means that we need dax to provide a clean "device abstraction" (meaning
a single CXL allocation [which has a uuid/tag] needs to appear as a single
dax device whether or not it's HPA-contiguous).

Cheers,
John


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC V2 10/18] famfs_fuse: Basic fuse kernel ABI enablement for famfs
  2025-07-08 12:02         ` John Groves
@ 2025-07-09  1:53           ` Darrick J. Wong
  2025-07-11  1:32             ` John Groves
  0 siblings, 1 reply; 91+ messages in thread
From: Darrick J. Wong @ 2025-07-09  1:53 UTC (permalink / raw)
  To: John Groves
  Cc: Amir Goldstein, Dan Williams, Miklos Szeredi, Bernd Schubert,
	John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Randy Dunlap, Jeff Layton, Kent Overstreet, linux-doc,
	linux-kernel, nvdimm, linux-cxl, linux-fsdevel, Jonathan Cameron,
	Stefan Hajnoczi, Joanne Koong, Josef Bacik, Aravind Ramesh,
	Ajay Joshi

On Tue, Jul 08, 2025 at 07:02:03AM -0500, John Groves wrote:
> On 25/07/07 10:39AM, Darrick J. Wong wrote:
> > On Fri, Jul 04, 2025 at 08:39:59AM -0500, John Groves wrote:
> > > On 25/07/04 09:54AM, Amir Goldstein wrote:
> > > > On Thu, Jul 3, 2025 at 8:51 PM John Groves <John@groves.net> wrote:
> > > > >
> > > > > * FUSE_DAX_FMAP flag in INIT request/reply
> > > > >
> > > > > * fuse_conn->famfs_iomap (enable famfs-mapped files) to denote a
> > > > >   famfs-enabled connection
> > > > >
> > > > > Signed-off-by: John Groves <john@groves.net>
> > > > > ---
> > > > >  fs/fuse/fuse_i.h          |  3 +++
> > > > >  fs/fuse/inode.c           | 14 ++++++++++++++
> > > > >  include/uapi/linux/fuse.h |  4 ++++
> > > > >  3 files changed, 21 insertions(+)
> > > > >
> > > > > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> > > > > index 9d87ac48d724..a592c1002861 100644
> > > > > --- a/fs/fuse/fuse_i.h
> > > > > +++ b/fs/fuse/fuse_i.h
> > > > > @@ -873,6 +873,9 @@ struct fuse_conn {
> > > > >         /* Use io_uring for communication */
> > > > >         unsigned int io_uring;
> > > > >
> > > > > +       /* dev_dax_iomap support for famfs */
> > > > > +       unsigned int famfs_iomap:1;
> > > > > +
> > > > 
> > > > pls move up to the bit fields members.
> > > 
> > > Oops, done, thanks.
> > > 
> > > > 
> > > > >         /** Maximum stack depth for passthrough backing files */
> > > > >         int max_stack_depth;
> > > > >
> > > > > diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> > > > > index 29147657a99f..e48e11c3f9f3 100644
> > > > > --- a/fs/fuse/inode.c
> > > > > +++ b/fs/fuse/inode.c
> > > > > @@ -1392,6 +1392,18 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
> > > > >                         }
> > > > >                         if (flags & FUSE_OVER_IO_URING && fuse_uring_enabled())
> > > > >                                 fc->io_uring = 1;
> > > > > +                       if (IS_ENABLED(CONFIG_FUSE_FAMFS_DAX) &&
> > > > > +                           flags & FUSE_DAX_FMAP) {
> > > > > +                               /* XXX: Should also check that fuse server
> > > > > +                                * has CAP_SYS_RAWIO and/or CAP_SYS_ADMIN,
> > > > > +                                * since it is directing the kernel to access
> > > > > +                                * dax memory directly - but this function
> > > > > +                                * appears not to be called in fuse server
> > > > > +                                * process context (b/c even if it drops
> > > > > +                                * those capabilities, they are held here).
> > > > > +                                */
> > > > > +                               fc->famfs_iomap = 1;
> > > > > +                       }
> > > > 
> > > > 1. As long as the mapping requests are checking capabilities we should be ok
> > > >     Right?
> > > 
> > > It depends on the definition of "are", or maybe of "mapping requests" ;)
> > > 
> > > Forgive me if this *is* obvious, but the fuse server capabilities are what
> > > I think need to be checked here - not the app that it accessing a file.
> > > 
> > > An app accessing a regular file doesn't need permission to do raw access to
> > > the underlying block dev, but the fuse server does - becuase it is directing
> > > the kernel to access that for apps.
> > > 
> > > > 2. What's the deal with capable(CAP_SYS_ADMIN) in process_init_limits then?
> > > 
> > > I *think* that's checking the capabilities of the app that is accessing the
> > > file, and not the fuse server. But I might be wrong - I have not pulled very
> > > hard on that thread yet.
> > 
> > The init reply should be processed in the context of the fuse server.
> > At that point the kernel hasn't exposed the fs to user programs, so
> > (AFAICT) there won't be any other programs accessing that fuse mount.
> 
> Hmm. It would be good if you're right about that. My fuse server *is* running
> as root, and when I check those capabilities in process_init_reply(), I
> find those capabilities. So far so good.
> 
> Then I added code to my fuse server to drop those capabilities prior to
> starting the fuse session (prctl(PR_CAPBSET_DROP, CAP_SYS_RAWIO) and 
> prctl(PR_CAPBSET_DROP, CAP_SYS_ADMIN). I expected (hoped?) to see those 
> capabilities disappear in process_init_reply() - but they did not disappear.
> 
> I'm all ears if somebody can see a flaw in my logic here. Otherwise, the
> capabilities need to be stashed away before the reply is processsed, when 
> fs/fuse *is* running in fuse server context.
> 
> I'm somewhat surprised if that isn't already happening somewhere...

Hrm.  I *thought* that since FUSE_INIT isn't queued as a background
command, it should still execute in the same process context as the fuse
server.

OTOH it also occurs to me that I have this code in fuse_send_init:

	if (has_capability_noaudit(current, CAP_SYS_RAWIO))
		flags |= FUSE_IOMAP | FUSE_IOMAP_DIRECTIO | FUSE_IOMAP_PAGECACHE;
	...
	ia->in.flags = flags;
	ia->in.flags2 = flags >> 32;

which means that we only advertise iomap support in FUSE_INIT if the
process running fuse_fill_super (which you hope is the fuse server)
actually has CAP_SYS_RAWIO.  Would that work for you?  Or are you
dropping privileges before you even open /dev/fuse?

Note: I might decide to relax that approach later on, since iomap
requires you to have opened a block device ... which implies that the
process had read/write access to start with; and maybe we're ok with
unprivileged fuse2fs servers running on a chmod 666 block device?

<shrug> always easier to /relax/ the privilege checks. :)

> > > > 3. Darrick mentioned the need for a synchronic INIT variant for his work on
> > > >     blockdev iomap support [1]
> > > 
> > > I'm not sure that's the same thing (Darrick?), but I do think Darrick's
> > > use case probably needs to check capabilities for a server that is sending
> > > apps (via files) off to access extents of block devices.
> > 
> > I don't know either, Miklos hasn't responded to my questions.  I think
> > the motivation for a synchronous 
> 
> ?

..."I don't know what his motivations for synchronous FUSE_INIT are."

I guess I fubard vim. :(

> > As for fuse/iomap, I just only need to ask the kernel if iomap support
> > is available before calling ext2fs_open2() because the iomap question
> > has some implications for how we open the ext4 filesystem.
> > 
> > > > I also wonder how much of your patches and Darrick's patches end up
> > > > being an overlap?
> > > 
> > > Darrick and I spent some time hashing through this, and came to the conclusion
> > > that the actual overlap is slim-to-none. 
> > 
> > Yeah.  The neat thing about FMAPs is that you can establish repeating
> > patterns, which is useful for interleaved DRAM/pmem devices.  Disk
> > filesystems don't do repeating patterns, so they'd much rather manage
> > non-repeating mappings.
> 
> Right. Interleaving is critical to how we use memory, so fmaps are designed
> to support it.
> 
> Tangent: at some point a broader-than-just-me discussion of how block devices
> have the device mapper, but memory has no such layout tools, might be good
> to have. Without such a thing (which might or might not be possible/practical),
> it's essential that famfs do the interleaving. Lacking a mapper layer also
> means that we need dax to provide a clean "device abstraction" (meaning
> a single CXL allocation [which has a uuid/tag] needs to appear as a single
> dax device whether or not it's HPA-contiguous).

Well it's not as simple as device-mapper, where we can intercept struct
bio and remap/split it to our heart's content.  I guess you could do
that with an iovec...?  Would be sorta amusing if you could software
RAID10 some DRAM. :P

--D

> Cheers,
> John
> 
> 

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC V2 00/18] famfs: port into fuse
  2025-07-03 18:56 ` [RFC V2 00/18] famfs: port into fuse John Groves
@ 2025-07-09  3:26   ` Miklos Szeredi
  2025-07-11  1:18     ` John Groves
  0 siblings, 1 reply; 91+ messages in thread
From: Miklos Szeredi @ 2025-07-09  3:26 UTC (permalink / raw)
  To: John Groves
  Cc: Dan Williams, Bernd Schubert, John Groves, Jonathan Corbet,
	Vishal Verma, Dave Jiang, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Darrick J . Wong, Randy Dunlap,
	Jeff Layton, Kent Overstreet, linux-doc, linux-kernel, nvdimm,
	linux-cxl, linux-fsdevel, Amir Goldstein, Jonathan Cameron,
	Stefan Hajnoczi, Joanne Koong, Josef Bacik, Aravind Ramesh,
	Ajay Joshi

On Thu, 3 Jul 2025 at 20:56, John Groves <John@groves.net> wrote:
>
> DERP: I did it again; Miklos' email is wrong in this series.

linux-fsdevel also lands in my inbox, so I don't even notice.

I won't get to review this until August, sorry about that.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC V2 11/18] famfs_fuse: Basic famfs mount opts
  2025-07-03 18:50 ` [RFC V2 11/18] famfs_fuse: Basic famfs mount opts John Groves
@ 2025-07-09  3:59   ` Darrick J. Wong
  2025-07-11 15:28     ` John Groves
  0 siblings, 1 reply; 91+ messages in thread
From: Darrick J. Wong @ 2025-07-09  3:59 UTC (permalink / raw)
  To: John Groves
  Cc: Dan Williams, Miklos Szeredi, Bernd Schubert, John Groves,
	Jonathan Corbet, Vishal Verma, Dave Jiang, Matthew Wilcox,
	Jan Kara, Alexander Viro, Christian Brauner, Randy Dunlap,
	Jeff Layton, Kent Overstreet, linux-doc, linux-kernel, nvdimm,
	linux-cxl, linux-fsdevel, Amir Goldstein, Jonathan Cameron,
	Stefan Hajnoczi, Joanne Koong, Josef Bacik, Aravind Ramesh,
	Ajay Joshi

On Thu, Jul 03, 2025 at 01:50:25PM -0500, John Groves wrote:
> * -o shadow=<shadowpath>

What is a shadow?

> * -o daxdev=<daxdev>

And, uh, if there's a FUSE_GET_DAXDEV command, then what does this mount
option do?  Pre-populate the first element of that set?

--D

> Signed-off-by: John Groves <john@groves.net>
> ---
>  fs/fuse/fuse_i.h |  8 +++++++-
>  fs/fuse/inode.c  | 28 +++++++++++++++++++++++++++-
>  2 files changed, 34 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index a592c1002861..f4ee61046578 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -583,9 +583,11 @@ struct fuse_fs_context {
>  	unsigned int blksize;
>  	const char *subtype;
>  
> -	/* DAX device, may be NULL */
> +	/* DAX device for virtiofs, may be NULL */
>  	struct dax_device *dax_dev;
>  
> +	const char *shadow; /* famfs - null if not famfs */
> +
>  	/* fuse_dev pointer to fill in, should contain NULL on entry */
>  	void **fudptr;
>  };
> @@ -941,6 +943,10 @@ struct fuse_conn {
>  	/**  uring connection information*/
>  	struct fuse_ring *ring;
>  #endif
> +
> +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> +	char *shadow;
> +#endif
>  };
>  
>  /*
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index e48e11c3f9f3..a7e1cf8257b0 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -766,6 +766,9 @@ enum {
>  	OPT_ALLOW_OTHER,
>  	OPT_MAX_READ,
>  	OPT_BLKSIZE,
> +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> +	OPT_SHADOW,
> +#endif
>  	OPT_ERR
>  };
>  
> @@ -780,6 +783,9 @@ static const struct fs_parameter_spec fuse_fs_parameters[] = {
>  	fsparam_u32	("max_read",		OPT_MAX_READ),
>  	fsparam_u32	("blksize",		OPT_BLKSIZE),
>  	fsparam_string	("subtype",		OPT_SUBTYPE),
> +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> +	fsparam_string("shadow",		OPT_SHADOW),
> +#endif
>  	{}
>  };
>  
> @@ -875,6 +881,15 @@ static int fuse_parse_param(struct fs_context *fsc, struct fs_parameter *param)
>  		ctx->blksize = result.uint_32;
>  		break;
>  
> +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> +	case OPT_SHADOW:
> +		if (ctx->shadow)
> +			return invalfc(fsc, "Multiple shadows specified");
> +		ctx->shadow = param->string;
> +		param->string = NULL;
> +		break;
> +#endif
> +
>  	default:
>  		return -EINVAL;
>  	}
> @@ -888,6 +903,7 @@ static void fuse_free_fsc(struct fs_context *fsc)
>  
>  	if (ctx) {
>  		kfree(ctx->subtype);
> +		kfree(ctx->shadow);
>  		kfree(ctx);
>  	}
>  }
> @@ -919,7 +935,10 @@ static int fuse_show_options(struct seq_file *m, struct dentry *root)
>  	else if (fc->dax_mode == FUSE_DAX_INODE_USER)
>  		seq_puts(m, ",dax=inode");
>  #endif
> -
> +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> +	if (fc->shadow)
> +		seq_printf(m, ",shadow=%s", fc->shadow);
> +#endif
>  	return 0;
>  }
>  
> @@ -1017,6 +1036,9 @@ void fuse_conn_put(struct fuse_conn *fc)
>  		}
>  		if (IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
>  			fuse_backing_files_free(fc);
> +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> +		kfree(fc->shadow);
> +#endif
>  		call_rcu(&fc->rcu, delayed_release);
>  	}
>  }
> @@ -1834,6 +1856,10 @@ int fuse_fill_super_common(struct super_block *sb, struct fuse_fs_context *ctx)
>  	sb->s_root = root_dentry;
>  	if (ctx->fudptr)
>  		*ctx->fudptr = fud;
> +
> +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> +	fc->shadow = kstrdup(ctx->shadow, GFP_KERNEL);
> +#endif
>  	mutex_unlock(&fuse_mutex);
>  	return 0;
>  
> -- 
> 2.49.0
> 
> 

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC V2 12/18] famfs_fuse: Plumb the GET_FMAP message/response
  2025-07-03 18:50 ` [RFC V2 12/18] famfs_fuse: Plumb the GET_FMAP message/response John Groves
  2025-07-04  8:54   ` Amir Goldstein
@ 2025-07-09  4:27   ` Darrick J. Wong
  2025-07-11 13:46     ` John Groves
  2025-08-14 13:36   ` Miklos Szeredi
  2 siblings, 1 reply; 91+ messages in thread
From: Darrick J. Wong @ 2025-07-09  4:27 UTC (permalink / raw)
  To: John Groves
  Cc: Dan Williams, Miklos Szeredi, Bernd Schubert, John Groves,
	Jonathan Corbet, Vishal Verma, Dave Jiang, Matthew Wilcox,
	Jan Kara, Alexander Viro, Christian Brauner, Randy Dunlap,
	Jeff Layton, Kent Overstreet, linux-doc, linux-kernel, nvdimm,
	linux-cxl, linux-fsdevel, Amir Goldstein, Jonathan Cameron,
	Stefan Hajnoczi, Joanne Koong, Josef Bacik, Aravind Ramesh,
	Ajay Joshi

On Thu, Jul 03, 2025 at 01:50:26PM -0500, John Groves wrote:
> Upon completion of an OPEN, if we're in famfs-mode we do a GET_FMAP to
> retrieve and cache up the file-to-dax map in the kernel. If this
> succeeds, read/write/mmap are resolved direct-to-dax with no upcalls.
> 
> GET_FMAP has a variable-size response payload, and the allocated size
> is sent in the in_args[0].size field. If the fmap would overflow the
> message, the fuse server sends a reply of size 'sizeof(uint32_t)' which
> specifies the size of the fmap message. Then the kernel can realloc a
> large enough buffer and try again.
> 
> Signed-off-by: John Groves <john@groves.net>
> ---
>  fs/fuse/file.c            | 84 +++++++++++++++++++++++++++++++++++++++
>  fs/fuse/fuse_i.h          | 36 ++++++++++++++++-
>  fs/fuse/inode.c           | 19 +++++++--
>  fs/fuse/iomode.c          |  2 +-
>  include/uapi/linux/fuse.h | 18 +++++++++
>  5 files changed, 154 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 93b82660f0c8..8616fb0a6d61 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -230,6 +230,77 @@ static void fuse_truncate_update_attr(struct inode *inode, struct file *file)
>  	fuse_invalidate_attr_mask(inode, FUSE_STATX_MODSIZE);
>  }
>  
> +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> +
> +#define FMAP_BUFSIZE 4096

PAGE_SIZE ?

> +
> +static int
> +fuse_get_fmap(struct fuse_mount *fm, struct inode *inode, u64 nodeid)
> +{
> +	struct fuse_get_fmap_in inarg = { 0 };
> +	size_t fmap_bufsize = FMAP_BUFSIZE;
> +	ssize_t fmap_size;
> +	int retries = 1;
> +	void *fmap_buf;
> +	int rc;
> +
> +	FUSE_ARGS(args);
> +
> +	fmap_buf = kcalloc(1, FMAP_BUFSIZE, GFP_KERNEL);
> +	if (!fmap_buf)
> +		return -EIO;
> +
> + retry_once:
> +	inarg.size = fmap_bufsize;
> +
> +	args.opcode = FUSE_GET_FMAP;
> +	args.nodeid = nodeid;
> +
> +	args.in_numargs = 1;
> +	args.in_args[0].size = sizeof(inarg);
> +	args.in_args[0].value = &inarg;
> +
> +	/* Variable-sized output buffer
> +	 * this causes fuse_simple_request() to return the size of the
> +	 * output payload
> +	 */
> +	args.out_argvar = true;
> +	args.out_numargs = 1;
> +	args.out_args[0].size = fmap_bufsize;
> +	args.out_args[0].value = fmap_buf;
> +
> +	/* Send GET_FMAP command */
> +	rc = fuse_simple_request(fm, &args);
> +	if (rc < 0) {
> +		pr_err("%s: err=%d from fuse_simple_request()\n",
> +		       __func__, rc);
> +		return rc;
> +	}
> +	fmap_size = rc;
> +
> +	if (retries && fmap_size == sizeof(uint32_t)) {
> +		/* fmap size exceeded fmap_bufsize;
> +		 * actual fmap size returned in fmap_buf;
> +		 * realloc and retry once
> +		 */
> +		fmap_bufsize = *((uint32_t *)fmap_buf);
> +
> +		--retries;
> +		kfree(fmap_buf);
> +		fmap_buf = kcalloc(1, fmap_bufsize, GFP_KERNEL);
> +		if (!fmap_buf)
> +			return -EIO;
> +
> +		goto retry_once;
> +	}
> +
> +	/* Will call famfs_file_init_dax() when that gets added */

Hard to say what this does without looking further down in the patchset.
:)

> +	kfree(fmap_buf);
> +	return 0;
> +}
> +#endif
> +
>  static int fuse_open(struct inode *inode, struct file *file)
>  {
>  	struct fuse_mount *fm = get_fuse_mount(inode);
> @@ -263,6 +334,19 @@ static int fuse_open(struct inode *inode, struct file *file)
>  
>  	err = fuse_do_open(fm, get_node_id(inode), file, false);
>  	if (!err) {
> +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> +		if (fm->fc->famfs_iomap) {
> +			if (S_ISREG(inode->i_mode)) {

/me wonders if you want to turn this into a dumb helper to reduce the
indenting levels?

#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
static inline bool fuse_is_famfs_file(struct inode *inode)
{
	return fm->fc->famfs_iomap && S_ISREG(inode->i_mode);
}
#else
# define fuse_is_famfs_file(...)	(false)
#endif

	if (!err) {
		if (fuse_is_famfs_file(inode)) {
			rc = fuse_get_fmap(fm, inode);
			...
		}
	}

> +				int rc;
> +				/* Get the famfs fmap */
> +				rc = fuse_get_fmap(fm, inode,
> +						   get_node_id(inode));

Just get_node_id inside fuse_get_fmap to reduce the parameter count.

> +				if (rc)
> +					pr_err("%s: fuse_get_fmap err=%d\n",
> +					       __func__, rc);
> +			}
> +		}
> +#endif
>  		ff = file->private_data;
>  		err = fuse_finish_open(inode, file);
>  		if (err)
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index f4ee61046578..e01d6e5c6e93 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -193,6 +193,10 @@ struct fuse_inode {
>  	/** Reference to backing file in passthrough mode */
>  	struct fuse_backing *fb;
>  #endif
> +
> +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> +	void *famfs_meta;
> +#endif

What gets stored in here?

>  };
>  
>  /** FUSE inode state bits */
> @@ -945,6 +949,8 @@ struct fuse_conn {
>  #endif
>  
>  #if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> +	struct rw_semaphore famfs_devlist_sem;
> +	struct famfs_dax_devlist *dax_devlist;
>  	char *shadow;
>  #endif
>  };
> @@ -1435,11 +1441,14 @@ void fuse_free_conn(struct fuse_conn *fc);
>  
>  /* dax.c */
>  
> +static inline int fuse_file_famfs(struct fuse_inode *fi); /* forward */
> +
>  /* This macro is used by virtio_fs, but now it also needs to filter for
>   * "not famfs"
>   */
>  #define FUSE_IS_VIRTIO_DAX(fuse_inode) (IS_ENABLED(CONFIG_FUSE_DAX)	\
> -					&& IS_DAX(&fuse_inode->inode))
> +					&& IS_DAX(&fuse_inode->inode)	\
> +					&& !fuse_file_famfs(fuse_inode))
>  
>  ssize_t fuse_dax_read_iter(struct kiocb *iocb, struct iov_iter *to);
>  ssize_t fuse_dax_write_iter(struct kiocb *iocb, struct iov_iter *from);
> @@ -1550,4 +1559,29 @@ extern void fuse_sysctl_unregister(void);
>  #define fuse_sysctl_unregister()	do { } while (0)
>  #endif /* CONFIG_SYSCTL */
>  
> +/* famfs.c */
> +static inline struct fuse_backing *famfs_meta_set(struct fuse_inode *fi,
> +						       void *meta)
> +{
> +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> +	return xchg(&fi->famfs_meta, meta);
> +#else
> +	return NULL;
> +#endif
> +}
> +
> +static inline void famfs_meta_free(struct fuse_inode *fi)
> +{
> +	/* Stub wil be connected in a subsequent commit */
> +}
> +
> +static inline int fuse_file_famfs(struct fuse_inode *fi)
> +{
> +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> +	return (READ_ONCE(fi->famfs_meta) != NULL);
> +#else
> +	return 0;
> +#endif
> +}

...or maybe this is the predicate you want to see if you really need to
fmapping related stuff?

> +
>  #endif /* _FS_FUSE_I_H */
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index a7e1cf8257b0..b071d16f7d04 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -117,6 +117,9 @@ static struct inode *fuse_alloc_inode(struct super_block *sb)
>  	if (IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
>  		fuse_inode_backing_set(fi, NULL);
>  
> +	if (IS_ENABLED(CONFIG_FUSE_FAMFS_DAX))
> +		famfs_meta_set(fi, NULL);
> +
>  	return &fi->inode;
>  
>  out_free_forget:
> @@ -138,6 +141,13 @@ static void fuse_free_inode(struct inode *inode)
>  	if (IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
>  		fuse_backing_put(fuse_inode_backing(fi));
>  
> +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> +	if (S_ISREG(inode->i_mode) && fi->famfs_meta) {
> +		famfs_meta_free(fi);
> +		famfs_meta_set(fi, NULL);

_free should null out the pointer, no?

--D

> +	}
> +#endif
> +
>  	kmem_cache_free(fuse_inode_cachep, fi);
>  }
>  
> @@ -1002,6 +1012,9 @@ void fuse_conn_init(struct fuse_conn *fc, struct fuse_mount *fm,
>  	if (IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
>  		fuse_backing_files_init(fc);
>  
> +	if (IS_ENABLED(CONFIG_FUSE_FAMFS_DAX))
> +		pr_notice("%s: Kernel is FUSE_FAMFS_DAX capable\n", __func__);
> +
>  	INIT_LIST_HEAD(&fc->mounts);
>  	list_add(&fm->fc_entry, &fc->mounts);
>  	fm->fc = fc;
> @@ -1036,9 +1049,8 @@ void fuse_conn_put(struct fuse_conn *fc)
>  		}
>  		if (IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
>  			fuse_backing_files_free(fc);
> -#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> -		kfree(fc->shadow);
> -#endif
> +		if (IS_ENABLED(CONFIG_FUSE_FAMFS_DAX))
> +			kfree(fc->shadow);
>  		call_rcu(&fc->rcu, delayed_release);
>  	}
>  }
> @@ -1425,6 +1437,7 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
>  				 * those capabilities, they are held here).
>  				 */
>  				fc->famfs_iomap = 1;
> +				init_rwsem(&fc->famfs_devlist_sem);
>  			}
>  		} else {
>  			ra_pages = fc->max_read / PAGE_SIZE;
> diff --git a/fs/fuse/iomode.c b/fs/fuse/iomode.c
> index aec4aecb5d79..443b337b0c05 100644
> --- a/fs/fuse/iomode.c
> +++ b/fs/fuse/iomode.c
> @@ -204,7 +204,7 @@ int fuse_file_io_open(struct file *file, struct inode *inode)
>  	 * io modes are not relevant with DAX and with server that does not
>  	 * implement open.
>  	 */
> -	if (FUSE_IS_VIRTIO_DAX(fi) || !ff->args)
> +	if (FUSE_IS_VIRTIO_DAX(fi) || fuse_file_famfs(fi) || !ff->args)
>  		return 0;
>  
>  	/*
> diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> index 6c384640c79b..dff5aa62543e 100644
> --- a/include/uapi/linux/fuse.h
> +++ b/include/uapi/linux/fuse.h
> @@ -654,6 +654,10 @@ enum fuse_opcode {
>  	FUSE_TMPFILE		= 51,
>  	FUSE_STATX		= 52,
>  
> +	/* Famfs / devdax opcodes */
> +	FUSE_GET_FMAP           = 53,
> +	FUSE_GET_DAXDEV         = 54,
> +
>  	/* CUSE specific operations */
>  	CUSE_INIT		= 4096,
>  
> @@ -888,6 +892,16 @@ struct fuse_access_in {
>  	uint32_t	padding;
>  };
>  
> +struct fuse_get_fmap_in {
> +	uint32_t	size;
> +	uint32_t	padding;
> +};
> +
> +struct fuse_get_fmap_out {
> +	uint32_t	size;
> +	uint32_t	padding;
> +};
> +
>  struct fuse_init_in {
>  	uint32_t	major;
>  	uint32_t	minor;
> @@ -1284,4 +1298,8 @@ struct fuse_uring_cmd_req {
>  	uint8_t padding[6];
>  };
>  
> +/* Famfs fmap message components */
> +
> +#define FAMFS_FMAP_MAX 32768 /* Largest supported fmap message */
> +
>  #endif /* _LINUX_FUSE_H */
> -- 
> 2.49.0
> 
> 

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC V2 00/18] famfs: port into fuse
  2025-07-09  3:26   ` Miklos Szeredi
@ 2025-07-11  1:18     ` John Groves
  0 siblings, 0 replies; 91+ messages in thread
From: John Groves @ 2025-07-11  1:18 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Dan Williams, Bernd Schubert, John Groves, Jonathan Corbet,
	Vishal Verma, Dave Jiang, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Darrick J . Wong, Randy Dunlap,
	Jeff Layton, Kent Overstreet, linux-doc, linux-kernel, nvdimm,
	linux-cxl, linux-fsdevel, Amir Goldstein, Jonathan Cameron,
	Stefan Hajnoczi, Joanne Koong, Josef Bacik, Aravind Ramesh,
	Ajay Joshi

On 25/07/09 05:26AM, Miklos Szeredi wrote:
> On Thu, 3 Jul 2025 at 20:56, John Groves <John@groves.net> wrote:
> >
> > DERP: I did it again; Miklos' email is wrong in this series.
> 
> linux-fsdevel also lands in my inbox, so I don't even notice.
> 
> I won't get to review this until August, sorry about that.
> 
> Thanks,
> Miklos

Thanks Miklos. I'll probably get one more update out to this series by
August. Best possible case, I will have fixed the poisoned page problem - 
but I haven't worked out what the fix is yet, so that's an aspiration.

Regards,
John


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC V2 10/18] famfs_fuse: Basic fuse kernel ABI enablement for famfs
  2025-07-09  1:53           ` Darrick J. Wong
@ 2025-07-11  1:32             ` John Groves
  2025-07-12  4:49               ` Darrick J. Wong
  2025-08-11 18:30               ` John Groves
  0 siblings, 2 replies; 91+ messages in thread
From: John Groves @ 2025-07-11  1:32 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Amir Goldstein, Dan Williams, Miklos Szeredi, Bernd Schubert,
	John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Randy Dunlap, Jeff Layton, Kent Overstreet, linux-doc,
	linux-kernel, nvdimm, linux-cxl, linux-fsdevel, Jonathan Cameron,
	Stefan Hajnoczi, Joanne Koong, Josef Bacik, Aravind Ramesh,
	Ajay Joshi

On 25/07/08 06:53PM, Darrick J. Wong wrote:
> On Tue, Jul 08, 2025 at 07:02:03AM -0500, John Groves wrote:
> > On 25/07/07 10:39AM, Darrick J. Wong wrote:
> > > On Fri, Jul 04, 2025 at 08:39:59AM -0500, John Groves wrote:
> > > > On 25/07/04 09:54AM, Amir Goldstein wrote:
> > > > > On Thu, Jul 3, 2025 at 8:51 PM John Groves <John@groves.net> wrote:
> > > > > >
> > > > > > * FUSE_DAX_FMAP flag in INIT request/reply
> > > > > >
> > > > > > * fuse_conn->famfs_iomap (enable famfs-mapped files) to denote a
> > > > > >   famfs-enabled connection
> > > > > >
> > > > > > Signed-off-by: John Groves <john@groves.net>
> > > > > > ---
> > > > > >  fs/fuse/fuse_i.h          |  3 +++
> > > > > >  fs/fuse/inode.c           | 14 ++++++++++++++
> > > > > >  include/uapi/linux/fuse.h |  4 ++++
> > > > > >  3 files changed, 21 insertions(+)
> > > > > >
> > > > > > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> > > > > > index 9d87ac48d724..a592c1002861 100644
> > > > > > --- a/fs/fuse/fuse_i.h
> > > > > > +++ b/fs/fuse/fuse_i.h
> > > > > > @@ -873,6 +873,9 @@ struct fuse_conn {
> > > > > >         /* Use io_uring for communication */
> > > > > >         unsigned int io_uring;
> > > > > >
> > > > > > +       /* dev_dax_iomap support for famfs */
> > > > > > +       unsigned int famfs_iomap:1;
> > > > > > +
> > > > > 
> > > > > pls move up to the bit fields members.
> > > > 
> > > > Oops, done, thanks.
> > > > 
> > > > > 
> > > > > >         /** Maximum stack depth for passthrough backing files */
> > > > > >         int max_stack_depth;
> > > > > >
> > > > > > diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> > > > > > index 29147657a99f..e48e11c3f9f3 100644
> > > > > > --- a/fs/fuse/inode.c
> > > > > > +++ b/fs/fuse/inode.c
> > > > > > @@ -1392,6 +1392,18 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
> > > > > >                         }
> > > > > >                         if (flags & FUSE_OVER_IO_URING && fuse_uring_enabled())
> > > > > >                                 fc->io_uring = 1;
> > > > > > +                       if (IS_ENABLED(CONFIG_FUSE_FAMFS_DAX) &&
> > > > > > +                           flags & FUSE_DAX_FMAP) {
> > > > > > +                               /* XXX: Should also check that fuse server
> > > > > > +                                * has CAP_SYS_RAWIO and/or CAP_SYS_ADMIN,
> > > > > > +                                * since it is directing the kernel to access
> > > > > > +                                * dax memory directly - but this function
> > > > > > +                                * appears not to be called in fuse server
> > > > > > +                                * process context (b/c even if it drops
> > > > > > +                                * those capabilities, they are held here).
> > > > > > +                                */
> > > > > > +                               fc->famfs_iomap = 1;
> > > > > > +                       }
> > > > > 
> > > > > 1. As long as the mapping requests are checking capabilities we should be ok
> > > > >     Right?
> > > > 
> > > > It depends on the definition of "are", or maybe of "mapping requests" ;)
> > > > 
> > > > Forgive me if this *is* obvious, but the fuse server capabilities are what
> > > > I think need to be checked here - not the app that it accessing a file.
> > > > 
> > > > An app accessing a regular file doesn't need permission to do raw access to
> > > > the underlying block dev, but the fuse server does - becuase it is directing
> > > > the kernel to access that for apps.
> > > > 
> > > > > 2. What's the deal with capable(CAP_SYS_ADMIN) in process_init_limits then?
> > > > 
> > > > I *think* that's checking the capabilities of the app that is accessing the
> > > > file, and not the fuse server. But I might be wrong - I have not pulled very
> > > > hard on that thread yet.
> > > 
> > > The init reply should be processed in the context of the fuse server.
> > > At that point the kernel hasn't exposed the fs to user programs, so
> > > (AFAICT) there won't be any other programs accessing that fuse mount.
> > 
> > Hmm. It would be good if you're right about that. My fuse server *is* running
> > as root, and when I check those capabilities in process_init_reply(), I
> > find those capabilities. So far so good.
> > 
> > Then I added code to my fuse server to drop those capabilities prior to
> > starting the fuse session (prctl(PR_CAPBSET_DROP, CAP_SYS_RAWIO) and 
> > prctl(PR_CAPBSET_DROP, CAP_SYS_ADMIN). I expected (hoped?) to see those 
> > capabilities disappear in process_init_reply() - but they did not disappear.
> > 
> > I'm all ears if somebody can see a flaw in my logic here. Otherwise, the
> > capabilities need to be stashed away before the reply is processsed, when 
> > fs/fuse *is* running in fuse server context.
> > 
> > I'm somewhat surprised if that isn't already happening somewhere...
> 
> Hrm.  I *thought* that since FUSE_INIT isn't queued as a background
> command, it should still execute in the same process context as the fuse
> server.
> 
> OTOH it also occurs to me that I have this code in fuse_send_init:
> 
> 	if (has_capability_noaudit(current, CAP_SYS_RAWIO))
> 		flags |= FUSE_IOMAP | FUSE_IOMAP_DIRECTIO | FUSE_IOMAP_PAGECACHE;
> 	...
> 	ia->in.flags = flags;
> 	ia->in.flags2 = flags >> 32;
> 
> which means that we only advertise iomap support in FUSE_INIT if the
> process running fuse_fill_super (which you hope is the fuse server)
> actually has CAP_SYS_RAWIO.  Would that work for you?  Or are you
> dropping privileges before you even open /dev/fuse?

Ah - that might be the answer. I will check if dropped capabilities 
disappear in fuse_send_init. If so, I can work with that - not advertising 
the famfs capability unless the capability is present at that point looks 
like a perfectly good option. Thanks for that idea!

> 
> Note: I might decide to relax that approach later on, since iomap
> requires you to have opened a block device ... which implies that the
> process had read/write access to start with; and maybe we're ok with
> unprivileged fuse2fs servers running on a chmod 666 block device?
> 
> <shrug> always easier to /relax/ the privilege checks. :)

My policy on security is that I'm against it...

> 
> > > > > 3. Darrick mentioned the need for a synchronic INIT variant for his work on
> > > > >     blockdev iomap support [1]
> > > > 
> > > > I'm not sure that's the same thing (Darrick?), but I do think Darrick's
> > > > use case probably needs to check capabilities for a server that is sending
> > > > apps (via files) off to access extents of block devices.
> > > 
> > > I don't know either, Miklos hasn't responded to my questions.  I think
> > > the motivation for a synchronous 
> > 
> > ?
> 
> ..."I don't know what his motivations for synchronous FUSE_INIT are."
> 
> I guess I fubard vim. :(

So I'm not alone...

> 
> > > As for fuse/iomap, I just only need to ask the kernel if iomap support
> > > is available before calling ext2fs_open2() because the iomap question
> > > has some implications for how we open the ext4 filesystem.
> > > 
> > > > > I also wonder how much of your patches and Darrick's patches end up
> > > > > being an overlap?
> > > > 
> > > > Darrick and I spent some time hashing through this, and came to the conclusion
> > > > that the actual overlap is slim-to-none. 
> > > 
> > > Yeah.  The neat thing about FMAPs is that you can establish repeating
> > > patterns, which is useful for interleaved DRAM/pmem devices.  Disk
> > > filesystems don't do repeating patterns, so they'd much rather manage
> > > non-repeating mappings.
> > 
> > Right. Interleaving is critical to how we use memory, so fmaps are designed
> > to support it.
> > 
> > Tangent: at some point a broader-than-just-me discussion of how block devices
> > have the device mapper, but memory has no such layout tools, might be good
> > to have. Without such a thing (which might or might not be possible/practical),
> > it's essential that famfs do the interleaving. Lacking a mapper layer also
> > means that we need dax to provide a clean "device abstraction" (meaning
> > a single CXL allocation [which has a uuid/tag] needs to appear as a single
> > dax device whether or not it's HPA-contiguous).
> 
> Well it's not as simple as device-mapper, where we can intercept struct
> bio and remap/split it to our heart's content.  I guess you could do
> that with an iovec...?  Would be sorta amusing if you could software
> RAID10 some DRAM. :P

SW RAID, and mapper in general, has a "store and forward" property (or maybe
"store, transmogrify, and forward") that doesn't really work for memory. 
It's vma's (and files) that can remap memory address regions. Layered vma's 
anyone? I need to think about whether that's utter nonsense, or just mostly 
nonsense.

Continuing to think about this...

Thanks!
John



^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC V2 12/18] famfs_fuse: Plumb the GET_FMAP message/response
  2025-07-09  4:27   ` Darrick J. Wong
@ 2025-07-11 13:46     ` John Groves
  0 siblings, 0 replies; 91+ messages in thread
From: John Groves @ 2025-07-11 13:46 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Dan Williams, Miklos Szeredi, Bernd Schubert, John Groves,
	Jonathan Corbet, Vishal Verma, Dave Jiang, Matthew Wilcox,
	Jan Kara, Alexander Viro, Christian Brauner, Randy Dunlap,
	Jeff Layton, Kent Overstreet, linux-doc, linux-kernel, nvdimm,
	linux-cxl, linux-fsdevel, Amir Goldstein, Jonathan Cameron,
	Stefan Hajnoczi, Joanne Koong, Josef Bacik, Aravind Ramesh,
	Ajay Joshi

On 25/07/08 09:27PM, Darrick J. Wong wrote:
> On Thu, Jul 03, 2025 at 01:50:26PM -0500, John Groves wrote:
> > Upon completion of an OPEN, if we're in famfs-mode we do a GET_FMAP to
> > retrieve and cache up the file-to-dax map in the kernel. If this
> > succeeds, read/write/mmap are resolved direct-to-dax with no upcalls.
> > 
> > GET_FMAP has a variable-size response payload, and the allocated size
> > is sent in the in_args[0].size field. If the fmap would overflow the
> > message, the fuse server sends a reply of size 'sizeof(uint32_t)' which
> > specifies the size of the fmap message. Then the kernel can realloc a
> > large enough buffer and try again.
> > 
> > Signed-off-by: John Groves <john@groves.net>
> > ---
> >  fs/fuse/file.c            | 84 +++++++++++++++++++++++++++++++++++++++
> >  fs/fuse/fuse_i.h          | 36 ++++++++++++++++-
> >  fs/fuse/inode.c           | 19 +++++++--
> >  fs/fuse/iomode.c          |  2 +-
> >  include/uapi/linux/fuse.h | 18 +++++++++
> >  5 files changed, 154 insertions(+), 5 deletions(-)
> > 
> > diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> > index 93b82660f0c8..8616fb0a6d61 100644
> > --- a/fs/fuse/file.c
> > +++ b/fs/fuse/file.c
> > @@ -230,6 +230,77 @@ static void fuse_truncate_update_attr(struct inode *inode, struct file *file)
> >  	fuse_invalidate_attr_mask(inode, FUSE_STATX_MODSIZE);
> >  }
> >  
> > +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> > +
> > +#define FMAP_BUFSIZE 4096
> 
> PAGE_SIZE ?

Like it. Queued to -next

> 
> > +
> > +static int
> > +fuse_get_fmap(struct fuse_mount *fm, struct inode *inode, u64 nodeid)
> > +{
> > +	struct fuse_get_fmap_in inarg = { 0 };
> > +	size_t fmap_bufsize = FMAP_BUFSIZE;
> > +	ssize_t fmap_size;
> > +	int retries = 1;
> > +	void *fmap_buf;
> > +	int rc;
> > +
> > +	FUSE_ARGS(args);
> > +
> > +	fmap_buf = kcalloc(1, FMAP_BUFSIZE, GFP_KERNEL);
> > +	if (!fmap_buf)
> > +		return -EIO;
> > +
> > + retry_once:
> > +	inarg.size = fmap_bufsize;
> > +
> > +	args.opcode = FUSE_GET_FMAP;
> > +	args.nodeid = nodeid;
> > +
> > +	args.in_numargs = 1;
> > +	args.in_args[0].size = sizeof(inarg);
> > +	args.in_args[0].value = &inarg;
> > +
> > +	/* Variable-sized output buffer
> > +	 * this causes fuse_simple_request() to return the size of the
> > +	 * output payload
> > +	 */
> > +	args.out_argvar = true;
> > +	args.out_numargs = 1;
> > +	args.out_args[0].size = fmap_bufsize;
> > +	args.out_args[0].value = fmap_buf;
> > +
> > +	/* Send GET_FMAP command */
> > +	rc = fuse_simple_request(fm, &args);
> > +	if (rc < 0) {
> > +		pr_err("%s: err=%d from fuse_simple_request()\n",
> > +		       __func__, rc);
> > +		return rc;
> > +	}
> > +	fmap_size = rc;
> > +
> > +	if (retries && fmap_size == sizeof(uint32_t)) {
> > +		/* fmap size exceeded fmap_bufsize;
> > +		 * actual fmap size returned in fmap_buf;
> > +		 * realloc and retry once
> > +		 */
> > +		fmap_bufsize = *((uint32_t *)fmap_buf);
> > +
> > +		--retries;
> > +		kfree(fmap_buf);
> > +		fmap_buf = kcalloc(1, fmap_bufsize, GFP_KERNEL);
> > +		if (!fmap_buf)
> > +			return -EIO;
> > +
> > +		goto retry_once;
> > +	}
> > +
> > +	/* Will call famfs_file_init_dax() when that gets added */
> 
> Hard to say what this does without looking further down in the patchset.
> :)

New comment:
	/* We retrieved the "fmap" (the file's map to memory), but
	 * we haven't used it yet. A call to famfs_file_init_dax() will be added
	 * here in a subsequent patch, when we add the ability to attach
	 * fmaps to files.
	 */

> 
> > +	kfree(fmap_buf);
> > +	return 0;
> > +}
> > +#endif
> > +
> >  static int fuse_open(struct inode *inode, struct file *file)
> >  {
> >  	struct fuse_mount *fm = get_fuse_mount(inode);
> > @@ -263,6 +334,19 @@ static int fuse_open(struct inode *inode, struct file *file)
> >  
> >  	err = fuse_do_open(fm, get_node_id(inode), file, false);
> >  	if (!err) {
> > +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> > +		if (fm->fc->famfs_iomap) {
> > +			if (S_ISREG(inode->i_mode)) {
> 
> /me wonders if you want to turn this into a dumb helper to reduce the
> indenting levels?
> 
> #if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> static inline bool fuse_is_famfs_file(struct inode *inode)
> {
> 	return fm->fc->famfs_iomap && S_ISREG(inode->i_mode);
> }
> #else
> # define fuse_is_famfs_file(...)	(false)
> #endif
> 
> 	if (!err) {
> 		if (fuse_is_famfs_file(inode)) {
> 			rc = fuse_get_fmap(fm, inode);
> 			...
> 		}
> 	}
> 

I've already refactored helpers and simplified this logic in the -next 
branch, including losing the conditrional code here in file.c:

	if (!err) {
		if ((fm->fc->famfs_iomap) && (S_ISREG(inode->i_mode))) {
			int rc;
			/* Get the famfs fmap */
			rc = fuse_get_fmap(fm, inode);
			...
		}
		...
	}

So I think it's quite a bit cleaner... will send out an updated patch
pretty soon (probably next week, without the poisoned page fixes yet).

> > +				int rc;
> > +				/* Get the famfs fmap */
> > +				rc = fuse_get_fmap(fm, inode,
> > +						   get_node_id(inode));
> 
> Just get_node_id inside fuse_get_fmap to reduce the parameter count.

Done, thanks

> 
> > +				if (rc)
> > +					pr_err("%s: fuse_get_fmap err=%d\n",
> > +					       __func__, rc);
> > +			}
> > +		}
> > +#endif
> >  		ff = file->private_data;
> >  		err = fuse_finish_open(inode, file);
> >  		if (err)
> > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> > index f4ee61046578..e01d6e5c6e93 100644
> > --- a/fs/fuse/fuse_i.h
> > +++ b/fs/fuse/fuse_i.h
> > @@ -193,6 +193,10 @@ struct fuse_inode {
> >  	/** Reference to backing file in passthrough mode */
> >  	struct fuse_backing *fb;
> >  #endif
> > +
> > +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> > +	void *famfs_meta;
> > +#endif
> 
> What gets stored in here?

Explanatory comment added:
	/* Pointer to the file's famfs metadata. Primary content is the
	 * in-memory version of the fmap - the map from file's offset range
	 * to DAX memory
	 */

> 
> >  };
> >  
> >  /** FUSE inode state bits */
> > @@ -945,6 +949,8 @@ struct fuse_conn {
> >  #endif
> >  
> >  #if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> > +	struct rw_semaphore famfs_devlist_sem;
> > +	struct famfs_dax_devlist *dax_devlist;
> >  	char *shadow;
> >  #endif
> >  };
> > @@ -1435,11 +1441,14 @@ void fuse_free_conn(struct fuse_conn *fc);
> >  
> >  /* dax.c */
> >  
> > +static inline int fuse_file_famfs(struct fuse_inode *fi); /* forward */
> > +
> >  /* This macro is used by virtio_fs, but now it also needs to filter for
> >   * "not famfs"
> >   */
> >  #define FUSE_IS_VIRTIO_DAX(fuse_inode) (IS_ENABLED(CONFIG_FUSE_DAX)	\
> > -					&& IS_DAX(&fuse_inode->inode))
> > +					&& IS_DAX(&fuse_inode->inode)	\
> > +					&& !fuse_file_famfs(fuse_inode))
> >  
> >  ssize_t fuse_dax_read_iter(struct kiocb *iocb, struct iov_iter *to);
> >  ssize_t fuse_dax_write_iter(struct kiocb *iocb, struct iov_iter *from);
> > @@ -1550,4 +1559,29 @@ extern void fuse_sysctl_unregister(void);
> >  #define fuse_sysctl_unregister()	do { } while (0)
> >  #endif /* CONFIG_SYSCTL */
> >  
> > +/* famfs.c */
> > +static inline struct fuse_backing *famfs_meta_set(struct fuse_inode *fi,
> > +						       void *meta)
> > +{
> > +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> > +	return xchg(&fi->famfs_meta, meta);
> > +#else
> > +	return NULL;
> > +#endif
> > +}
> > +
> > +static inline void famfs_meta_free(struct fuse_inode *fi)
> > +{
> > +	/* Stub wil be connected in a subsequent commit */
> > +}
> > +
> > +static inline int fuse_file_famfs(struct fuse_inode *fi)
> > +{
> > +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> > +	return (READ_ONCE(fi->famfs_meta) != NULL);
> > +#else
> > +	return 0;
> > +#endif
> > +}
> 
> ...or maybe this is the predicate you want to see if you really need to
> fmapping related stuff?
> 
> > +
> >  #endif /* _FS_FUSE_I_H */
> > diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> > index a7e1cf8257b0..b071d16f7d04 100644
> > --- a/fs/fuse/inode.c
> > +++ b/fs/fuse/inode.c
> > @@ -117,6 +117,9 @@ static struct inode *fuse_alloc_inode(struct super_block *sb)
> >  	if (IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
> >  		fuse_inode_backing_set(fi, NULL);
> >  
> > +	if (IS_ENABLED(CONFIG_FUSE_FAMFS_DAX))
> > +		famfs_meta_set(fi, NULL);
> > +
> >  	return &fi->inode;
> >  
> >  out_free_forget:
> > @@ -138,6 +141,13 @@ static void fuse_free_inode(struct inode *inode)
> >  	if (IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
> >  		fuse_backing_put(fuse_inode_backing(fi));
> >  
> > +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> > +	if (S_ISREG(inode->i_mode) && fi->famfs_meta) {
> > +		famfs_meta_free(fi);
> > +		famfs_meta_set(fi, NULL);
> 
> _free should null out the pointer, no?

Good point - will do

<snip>

Thanks Darrick!
John


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC V2 11/18] famfs_fuse: Basic famfs mount opts
  2025-07-09  3:59   ` Darrick J. Wong
@ 2025-07-11 15:28     ` John Groves
  2025-07-12  5:54       ` Darrick J. Wong
  0 siblings, 1 reply; 91+ messages in thread
From: John Groves @ 2025-07-11 15:28 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Dan Williams, Miklos Szeredi, Bernd Schubert, John Groves,
	Jonathan Corbet, Vishal Verma, Dave Jiang, Matthew Wilcox,
	Jan Kara, Alexander Viro, Christian Brauner, Randy Dunlap,
	Jeff Layton, Kent Overstreet, linux-doc, linux-kernel, nvdimm,
	linux-cxl, linux-fsdevel, Amir Goldstein, Jonathan Cameron,
	Stefan Hajnoczi, Joanne Koong, Josef Bacik, Aravind Ramesh,
	Ajay Joshi

On 25/07/08 08:59PM, Darrick J. Wong wrote:
> On Thu, Jul 03, 2025 at 01:50:25PM -0500, John Groves wrote:
> > * -o shadow=<shadowpath>
> 
> What is a shadow?
> 
> > * -o daxdev=<daxdev>

Derp - OK, that's a stale commit message. Here is the one for the -next
version of this patch:

    famfs_fuse: Basic famfs mount opt: -o shadow=<shadowpath>

    The shadow path is a (usually tmpfs) file system area used by the famfs 
    user space to commuicate with the famfs fuse server. There is a minor 
    dilemma that the user space tools must be able to resolve from a mount 
    point path to a shadow path. The shadow path is exposed via /proc/mounts, 
    but otherwise not used by the kernel. User space gets the shadow path 
    from /proc/mounts...


> 
> And, uh, if there's a FUSE_GET_DAXDEV command, then what does this mount
> option do?  Pre-populate the first element of that set?
> 
> --D
> 

I took out -o daxdev, but had failed to update the commit msg.

The logic is this: The general model requires the FUSE_GET_DAXDEV message /
response, so passing in the primary daxdev as a -o arg creates two ways to
do the same thing.

The only initial heartburn about this was one could imagine a case where a
mount happens, but no I/O happens for a while so the mount could "succeed",
only to fail later if the primary daxdev could not be accessed.

But this can't happen with famfs, because the mount procedure includes 
creating "meta files" - .meta/.superblock and .meta/.log and accessing them
immediately. So it is guaranteed that FUSE_GET_DAXDEV will be sent right away,
and if it fails, the mount will be unwound.

Thanks Darrick!
John

<snip>


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC V2 10/18] famfs_fuse: Basic fuse kernel ABI enablement for famfs
  2025-07-11  1:32             ` John Groves
@ 2025-07-12  4:49               ` Darrick J. Wong
  2025-08-11 18:30               ` John Groves
  1 sibling, 0 replies; 91+ messages in thread
From: Darrick J. Wong @ 2025-07-12  4:49 UTC (permalink / raw)
  To: John Groves
  Cc: Amir Goldstein, Dan Williams, Miklos Szeredi, Bernd Schubert,
	John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Randy Dunlap, Jeff Layton, Kent Overstreet, linux-doc,
	linux-kernel, nvdimm, linux-cxl, linux-fsdevel, Jonathan Cameron,
	Stefan Hajnoczi, Joanne Koong, Josef Bacik, Aravind Ramesh,
	Ajay Joshi

On Thu, Jul 10, 2025 at 08:32:13PM -0500, John Groves wrote:
> On 25/07/08 06:53PM, Darrick J. Wong wrote:
> > On Tue, Jul 08, 2025 at 07:02:03AM -0500, John Groves wrote:
> > > On 25/07/07 10:39AM, Darrick J. Wong wrote:
> > > > On Fri, Jul 04, 2025 at 08:39:59AM -0500, John Groves wrote:
> > > > > On 25/07/04 09:54AM, Amir Goldstein wrote:
> > > > > > On Thu, Jul 3, 2025 at 8:51 PM John Groves <John@groves.net> wrote:
> > > > > > >
> > > > > > > * FUSE_DAX_FMAP flag in INIT request/reply
> > > > > > >
> > > > > > > * fuse_conn->famfs_iomap (enable famfs-mapped files) to denote a
> > > > > > >   famfs-enabled connection
> > > > > > >
> > > > > > > Signed-off-by: John Groves <john@groves.net>
> > > > > > > ---
> > > > > > >  fs/fuse/fuse_i.h          |  3 +++
> > > > > > >  fs/fuse/inode.c           | 14 ++++++++++++++
> > > > > > >  include/uapi/linux/fuse.h |  4 ++++
> > > > > > >  3 files changed, 21 insertions(+)
> > > > > > >
> > > > > > > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> > > > > > > index 9d87ac48d724..a592c1002861 100644
> > > > > > > --- a/fs/fuse/fuse_i.h
> > > > > > > +++ b/fs/fuse/fuse_i.h
> > > > > > > @@ -873,6 +873,9 @@ struct fuse_conn {
> > > > > > >         /* Use io_uring for communication */
> > > > > > >         unsigned int io_uring;
> > > > > > >
> > > > > > > +       /* dev_dax_iomap support for famfs */
> > > > > > > +       unsigned int famfs_iomap:1;
> > > > > > > +
> > > > > > 
> > > > > > pls move up to the bit fields members.
> > > > > 
> > > > > Oops, done, thanks.
> > > > > 
> > > > > > 
> > > > > > >         /** Maximum stack depth for passthrough backing files */
> > > > > > >         int max_stack_depth;
> > > > > > >
> > > > > > > diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> > > > > > > index 29147657a99f..e48e11c3f9f3 100644
> > > > > > > --- a/fs/fuse/inode.c
> > > > > > > +++ b/fs/fuse/inode.c
> > > > > > > @@ -1392,6 +1392,18 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
> > > > > > >                         }
> > > > > > >                         if (flags & FUSE_OVER_IO_URING && fuse_uring_enabled())
> > > > > > >                                 fc->io_uring = 1;
> > > > > > > +                       if (IS_ENABLED(CONFIG_FUSE_FAMFS_DAX) &&
> > > > > > > +                           flags & FUSE_DAX_FMAP) {
> > > > > > > +                               /* XXX: Should also check that fuse server
> > > > > > > +                                * has CAP_SYS_RAWIO and/or CAP_SYS_ADMIN,
> > > > > > > +                                * since it is directing the kernel to access
> > > > > > > +                                * dax memory directly - but this function
> > > > > > > +                                * appears not to be called in fuse server
> > > > > > > +                                * process context (b/c even if it drops
> > > > > > > +                                * those capabilities, they are held here).
> > > > > > > +                                */
> > > > > > > +                               fc->famfs_iomap = 1;
> > > > > > > +                       }
> > > > > > 
> > > > > > 1. As long as the mapping requests are checking capabilities we should be ok
> > > > > >     Right?
> > > > > 
> > > > > It depends on the definition of "are", or maybe of "mapping requests" ;)
> > > > > 
> > > > > Forgive me if this *is* obvious, but the fuse server capabilities are what
> > > > > I think need to be checked here - not the app that it accessing a file.
> > > > > 
> > > > > An app accessing a regular file doesn't need permission to do raw access to
> > > > > the underlying block dev, but the fuse server does - becuase it is directing
> > > > > the kernel to access that for apps.
> > > > > 
> > > > > > 2. What's the deal with capable(CAP_SYS_ADMIN) in process_init_limits then?
> > > > > 
> > > > > I *think* that's checking the capabilities of the app that is accessing the
> > > > > file, and not the fuse server. But I might be wrong - I have not pulled very
> > > > > hard on that thread yet.
> > > > 
> > > > The init reply should be processed in the context of the fuse server.
> > > > At that point the kernel hasn't exposed the fs to user programs, so
> > > > (AFAICT) there won't be any other programs accessing that fuse mount.
> > > 
> > > Hmm. It would be good if you're right about that. My fuse server *is* running
> > > as root, and when I check those capabilities in process_init_reply(), I
> > > find those capabilities. So far so good.
> > > 
> > > Then I added code to my fuse server to drop those capabilities prior to
> > > starting the fuse session (prctl(PR_CAPBSET_DROP, CAP_SYS_RAWIO) and 
> > > prctl(PR_CAPBSET_DROP, CAP_SYS_ADMIN). I expected (hoped?) to see those 
> > > capabilities disappear in process_init_reply() - but they did not disappear.
> > > 
> > > I'm all ears if somebody can see a flaw in my logic here. Otherwise, the
> > > capabilities need to be stashed away before the reply is processsed, when 
> > > fs/fuse *is* running in fuse server context.
> > > 
> > > I'm somewhat surprised if that isn't already happening somewhere...
> > 
> > Hrm.  I *thought* that since FUSE_INIT isn't queued as a background
> > command, it should still execute in the same process context as the fuse
> > server.
> > 
> > OTOH it also occurs to me that I have this code in fuse_send_init:
> > 
> > 	if (has_capability_noaudit(current, CAP_SYS_RAWIO))
> > 		flags |= FUSE_IOMAP | FUSE_IOMAP_DIRECTIO | FUSE_IOMAP_PAGECACHE;
> > 	...
> > 	ia->in.flags = flags;
> > 	ia->in.flags2 = flags >> 32;
> > 
> > which means that we only advertise iomap support in FUSE_INIT if the
> > process running fuse_fill_super (which you hope is the fuse server)
> > actually has CAP_SYS_RAWIO.  Would that work for you?  Or are you
> > dropping privileges before you even open /dev/fuse?
> 
> Ah - that might be the answer. I will check if dropped capabilities 
> disappear in fuse_send_init. If so, I can work with that - not advertising 
> the famfs capability unless the capability is present at that point looks 
> like a perfectly good option. Thanks for that idea!

I thought of another twist -- what about a fuse server that runs with no
special privilege and is passed an open fd to a dax/block device?  Maybe
you're right that we need no explicit capability checks -- an open fd is
sufficient.

> > Note: I might decide to relax that approach later on, since iomap
> > requires you to have opened a block device ... which implies that the
> > process had read/write access to start with; and maybe we're ok with
> > unprivileged fuse2fs servers running on a chmod 666 block device?
> > 
> > <shrug> always easier to /relax/ the privilege checks. :)
> 
> My policy on security is that I'm against it...
> 
> > 
> > > > > > 3. Darrick mentioned the need for a synchronic INIT variant for his work on
> > > > > >     blockdev iomap support [1]
> > > > > 
> > > > > I'm not sure that's the same thing (Darrick?), but I do think Darrick's
> > > > > use case probably needs to check capabilities for a server that is sending
> > > > > apps (via files) off to access extents of block devices.
> > > > 
> > > > I don't know either, Miklos hasn't responded to my questions.  I think
> > > > the motivation for a synchronous 
> > > 
> > > ?
> > 
> > ..."I don't know what his motivations for synchronous FUSE_INIT are."
> > 
> > I guess I fubard vim. :(
> 
> So I'm not alone...
> 
> > 
> > > > As for fuse/iomap, I just only need to ask the kernel if iomap support
> > > > is available before calling ext2fs_open2() because the iomap question
> > > > has some implications for how we open the ext4 filesystem.
> > > > 
> > > > > > I also wonder how much of your patches and Darrick's patches end up
> > > > > > being an overlap?
> > > > > 
> > > > > Darrick and I spent some time hashing through this, and came to the conclusion
> > > > > that the actual overlap is slim-to-none. 
> > > > 
> > > > Yeah.  The neat thing about FMAPs is that you can establish repeating
> > > > patterns, which is useful for interleaved DRAM/pmem devices.  Disk
> > > > filesystems don't do repeating patterns, so they'd much rather manage
> > > > non-repeating mappings.
> > > 
> > > Right. Interleaving is critical to how we use memory, so fmaps are designed
> > > to support it.
> > > 
> > > Tangent: at some point a broader-than-just-me discussion of how block devices
> > > have the device mapper, but memory has no such layout tools, might be good
> > > to have. Without such a thing (which might or might not be possible/practical),
> > > it's essential that famfs do the interleaving. Lacking a mapper layer also
> > > means that we need dax to provide a clean "device abstraction" (meaning
> > > a single CXL allocation [which has a uuid/tag] needs to appear as a single
> > > dax device whether or not it's HPA-contiguous).
> > 
> > Well it's not as simple as device-mapper, where we can intercept struct
> > bio and remap/split it to our heart's content.  I guess you could do
> > that with an iovec...?  Would be sorta amusing if you could software
> > RAID10 some DRAM. :P
> 
> SW RAID, and mapper in general, has a "store and forward" property (or maybe
> "store, transmogrify, and forward") that doesn't really work for memory. 
> It's vma's (and files) that can remap memory address regions. Layered vma's 
> anyone? I need to think about whether that's utter nonsense, or just mostly 
> nonsense.

Oh but the ability to transmogrify is the key benefit of store and
forward!  Suppose you have to jack into some Klingon battle cruiser...

--D

> Continuing to think about this...
> 
> Thanks!
> John
> 
> 
> 

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC V2 11/18] famfs_fuse: Basic famfs mount opts
  2025-07-11 15:28     ` John Groves
@ 2025-07-12  5:54       ` Darrick J. Wong
  2025-08-14 10:37         ` Miklos Szeredi
  0 siblings, 1 reply; 91+ messages in thread
From: Darrick J. Wong @ 2025-07-12  5:54 UTC (permalink / raw)
  To: John Groves
  Cc: Dan Williams, Miklos Szeredi, Bernd Schubert, John Groves,
	Jonathan Corbet, Vishal Verma, Dave Jiang, Matthew Wilcox,
	Jan Kara, Alexander Viro, Christian Brauner, Randy Dunlap,
	Jeff Layton, Kent Overstreet, linux-doc, linux-kernel, nvdimm,
	linux-cxl, linux-fsdevel, Amir Goldstein, Jonathan Cameron,
	Stefan Hajnoczi, Joanne Koong, Josef Bacik, Aravind Ramesh,
	Ajay Joshi

On Fri, Jul 11, 2025 at 10:28:20AM -0500, John Groves wrote:
> On 25/07/08 08:59PM, Darrick J. Wong wrote:
> > On Thu, Jul 03, 2025 at 01:50:25PM -0500, John Groves wrote:
> > > * -o shadow=<shadowpath>
> > 
> > What is a shadow?
> > 
> > > * -o daxdev=<daxdev>
> 
> Derp - OK, that's a stale commit message. Here is the one for the -next
> version of this patch:
> 
>     famfs_fuse: Basic famfs mount opt: -o shadow=<shadowpath>
> 
>     The shadow path is a (usually tmpfs) file system area used by the famfs 
>     user space to commuicate with the famfs fuse server. There is a minor 
>     dilemma that the user space tools must be able to resolve from a mount 
>     point path to a shadow path. The shadow path is exposed via /proc/mounts, 
>     but otherwise not used by the kernel. User space gets the shadow path 
>     from /proc/mounts...

Ah.  A service directory, of sorts.

> > And, uh, if there's a FUSE_GET_DAXDEV command, then what does this mount
> > option do?  Pre-populate the first element of that set?
> > 
> > --D
> > 
> 
> I took out -o daxdev, but had failed to update the commit msg.
> 
> The logic is this: The general model requires the FUSE_GET_DAXDEV message /
> response, so passing in the primary daxdev as a -o arg creates two ways to
> do the same thing.
> 
> The only initial heartburn about this was one could imagine a case where a
> mount happens, but no I/O happens for a while so the mount could "succeed",
> only to fail later if the primary daxdev could not be accessed.
> 
> But this can't happen with famfs, because the mount procedure includes 
> creating "meta files" - .meta/.superblock and .meta/.log and accessing them
> immediately. So it is guaranteed that FUSE_GET_DAXDEV will be sent right away,
> and if it fails, the mount will be unwound.

<nod> 

--D

> Thanks Darrick!
> John
> 
> <snip>
> 
> 

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC V2 10/18] famfs_fuse: Basic fuse kernel ABI enablement for famfs
  2025-07-11  1:32             ` John Groves
  2025-07-12  4:49               ` Darrick J. Wong
@ 2025-08-11 18:30               ` John Groves
  2025-08-12 16:37                 ` Darrick J. Wong
  1 sibling, 1 reply; 91+ messages in thread
From: John Groves @ 2025-08-11 18:30 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Amir Goldstein, Dan Williams, Miklos Szeredi, Bernd Schubert,
	John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Randy Dunlap, Jeff Layton, Kent Overstreet, linux-doc,
	linux-kernel, nvdimm, linux-cxl, linux-fsdevel, Jonathan Cameron,
	Stefan Hajnoczi, Joanne Koong, Josef Bacik, Aravind Ramesh,
	Ajay Joshi, john

On 25/07/10 08:32PM, John Groves wrote:
> On 25/07/08 06:53PM, Darrick J. Wong wrote:
> > On Tue, Jul 08, 2025 at 07:02:03AM -0500, John Groves wrote:
> > > On 25/07/07 10:39AM, Darrick J. Wong wrote:
> > > > On Fri, Jul 04, 2025 at 08:39:59AM -0500, John Groves wrote:
> > > > > On 25/07/04 09:54AM, Amir Goldstein wrote:
> > > > > > On Thu, Jul 3, 2025 at 8:51 PM John Groves <John@groves.net> wrote:
> > > > > > >
> > > > > > > * FUSE_DAX_FMAP flag in INIT request/reply
> > > > > > >
> > > > > > > * fuse_conn->famfs_iomap (enable famfs-mapped files) to denote a
> > > > > > >   famfs-enabled connection
> > > > > > >
> > > > > > > Signed-off-by: John Groves <john@groves.net>
> > > > > > > ---
> > > > > > >  fs/fuse/fuse_i.h          |  3 +++
> > > > > > >  fs/fuse/inode.c           | 14 ++++++++++++++
> > > > > > >  include/uapi/linux/fuse.h |  4 ++++
> > > > > > >  3 files changed, 21 insertions(+)
> > > > > > >
> > > > > > > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> > > > > > > index 9d87ac48d724..a592c1002861 100644
> > > > > > > --- a/fs/fuse/fuse_i.h
> > > > > > > +++ b/fs/fuse/fuse_i.h
> > > > > > > @@ -873,6 +873,9 @@ struct fuse_conn {
> > > > > > >         /* Use io_uring for communication */
> > > > > > >         unsigned int io_uring;
> > > > > > >
> > > > > > > +       /* dev_dax_iomap support for famfs */
> > > > > > > +       unsigned int famfs_iomap:1;
> > > > > > > +
> > > > > > 
> > > > > > pls move up to the bit fields members.
> > > > > 
> > > > > Oops, done, thanks.
> > > > > 
> > > > > > 
> > > > > > >         /** Maximum stack depth for passthrough backing files */
> > > > > > >         int max_stack_depth;
> > > > > > >
> > > > > > > diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> > > > > > > index 29147657a99f..e48e11c3f9f3 100644
> > > > > > > --- a/fs/fuse/inode.c
> > > > > > > +++ b/fs/fuse/inode.c
> > > > > > > @@ -1392,6 +1392,18 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
> > > > > > >                         }
> > > > > > >                         if (flags & FUSE_OVER_IO_URING && fuse_uring_enabled())
> > > > > > >                                 fc->io_uring = 1;
> > > > > > > +                       if (IS_ENABLED(CONFIG_FUSE_FAMFS_DAX) &&
> > > > > > > +                           flags & FUSE_DAX_FMAP) {
> > > > > > > +                               /* XXX: Should also check that fuse server
> > > > > > > +                                * has CAP_SYS_RAWIO and/or CAP_SYS_ADMIN,
> > > > > > > +                                * since it is directing the kernel to access
> > > > > > > +                                * dax memory directly - but this function
> > > > > > > +                                * appears not to be called in fuse server
> > > > > > > +                                * process context (b/c even if it drops
> > > > > > > +                                * those capabilities, they are held here).
> > > > > > > +                                */
> > > > > > > +                               fc->famfs_iomap = 1;
> > > > > > > +                       }
> > > > > > 
> > > > > > 1. As long as the mapping requests are checking capabilities we should be ok
> > > > > >     Right?
> > > > > 
> > > > > It depends on the definition of "are", or maybe of "mapping requests" ;)
> > > > > 
> > > > > Forgive me if this *is* obvious, but the fuse server capabilities are what
> > > > > I think need to be checked here - not the app that it accessing a file.
> > > > > 
> > > > > An app accessing a regular file doesn't need permission to do raw access to
> > > > > the underlying block dev, but the fuse server does - becuase it is directing
> > > > > the kernel to access that for apps.
> > > > > 
> > > > > > 2. What's the deal with capable(CAP_SYS_ADMIN) in process_init_limits then?
> > > > > 
> > > > > I *think* that's checking the capabilities of the app that is accessing the
> > > > > file, and not the fuse server. But I might be wrong - I have not pulled very
> > > > > hard on that thread yet.
> > > > 
> > > > The init reply should be processed in the context of the fuse server.
> > > > At that point the kernel hasn't exposed the fs to user programs, so
> > > > (AFAICT) there won't be any other programs accessing that fuse mount.
> > > 
> > > Hmm. It would be good if you're right about that. My fuse server *is* running
> > > as root, and when I check those capabilities in process_init_reply(), I
> > > find those capabilities. So far so good.
> > > 
> > > Then I added code to my fuse server to drop those capabilities prior to
> > > starting the fuse session (prctl(PR_CAPBSET_DROP, CAP_SYS_RAWIO) and 
> > > prctl(PR_CAPBSET_DROP, CAP_SYS_ADMIN). I expected (hoped?) to see those 
> > > capabilities disappear in process_init_reply() - but they did not disappear.
> > > 
> > > I'm all ears if somebody can see a flaw in my logic here. Otherwise, the
> > > capabilities need to be stashed away before the reply is processsed, when 
> > > fs/fuse *is* running in fuse server context.
> > > 
> > > I'm somewhat surprised if that isn't already happening somewhere...
> > 
> > Hrm.  I *thought* that since FUSE_INIT isn't queued as a background
> > command, it should still execute in the same process context as the fuse
> > server.
> > 
> > OTOH it also occurs to me that I have this code in fuse_send_init:
> > 
> > 	if (has_capability_noaudit(current, CAP_SYS_RAWIO))
> > 		flags |= FUSE_IOMAP | FUSE_IOMAP_DIRECTIO | FUSE_IOMAP_PAGECACHE;
> > 	...
> > 	ia->in.flags = flags;
> > 	ia->in.flags2 = flags >> 32;
> > 
> > which means that we only advertise iomap support in FUSE_INIT if the
> > process running fuse_fill_super (which you hope is the fuse server)
> > actually has CAP_SYS_RAWIO.  Would that work for you?  Or are you
> > dropping privileges before you even open /dev/fuse?
> 
> Ah - that might be the answer. I will check if dropped capabilities 
> disappear in fuse_send_init. If so, I can work with that - not advertising 
> the famfs capability unless the capability is present at that point looks 
> like a perfectly good option. Thanks for that idea!

Review: the famfs fuse server directs the kernel to provide access to raw
(memory) devices, so it should should be required to have have the
CAP_SYS_RAWIO capability. fs/fuse needs to detect this at init time,
and fail the connection/mount if the capability is missing.

I initially attempted to do this verification in process_init_reply(), but
that doesn't run in the fuse server process context.

I am now checking the capability in fuse_send_init(), and not advertising
the FUSE_DAX_FMAP capability (in in_args->flags[2]) unless the server has 
CAP_SYS_RAWIO.

That requires that process_init_reply() reject FUSE_DAX_FMAP from a server
if FUSE_DAX_FMAP was not set in in_args->flags[2]. process_init_reply() was
not previously checking the in_args, but no big deal - this works.

This leads to an apparent dilemma in libfuse. In fuse_lowlevel_ops->init(),
I should check for (flags & FUSE_DAX_IOMAP), and fail the connection if
that capability is not on offer. But fuse_lowlevel_ops->init() doesn't
have an obvious way to fail the connection. 

How should I do that? Hoping Bernd, Amir or the other libfuse people may 
have "the answer" (tm).

And of course if any of this doesn't sound like the way to go, let me know...

Thanks!
John


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC V2 10/18] famfs_fuse: Basic fuse kernel ABI enablement for famfs
  2025-08-11 18:30               ` John Groves
@ 2025-08-12 16:37                 ` Darrick J. Wong
  2025-08-13 13:07                   ` John Groves
  0 siblings, 1 reply; 91+ messages in thread
From: Darrick J. Wong @ 2025-08-12 16:37 UTC (permalink / raw)
  To: John Groves
  Cc: Amir Goldstein, Dan Williams, Miklos Szeredi, Bernd Schubert,
	John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Randy Dunlap, Jeff Layton, Kent Overstreet, linux-doc,
	linux-kernel, nvdimm, linux-cxl, linux-fsdevel, Jonathan Cameron,
	Stefan Hajnoczi, Joanne Koong, Josef Bacik, Aravind Ramesh,
	Ajay Joshi

On Mon, Aug 11, 2025 at 01:30:53PM -0500, John Groves wrote:
> On 25/07/10 08:32PM, John Groves wrote:
> > On 25/07/08 06:53PM, Darrick J. Wong wrote:
> > > On Tue, Jul 08, 2025 at 07:02:03AM -0500, John Groves wrote:
> > > > On 25/07/07 10:39AM, Darrick J. Wong wrote:
> > > > > On Fri, Jul 04, 2025 at 08:39:59AM -0500, John Groves wrote:
> > > > > > On 25/07/04 09:54AM, Amir Goldstein wrote:
> > > > > > > On Thu, Jul 3, 2025 at 8:51 PM John Groves <John@groves.net> wrote:
> > > > > > > >
> > > > > > > > * FUSE_DAX_FMAP flag in INIT request/reply
> > > > > > > >
> > > > > > > > * fuse_conn->famfs_iomap (enable famfs-mapped files) to denote a
> > > > > > > >   famfs-enabled connection
> > > > > > > >
> > > > > > > > Signed-off-by: John Groves <john@groves.net>
> > > > > > > > ---
> > > > > > > >  fs/fuse/fuse_i.h          |  3 +++
> > > > > > > >  fs/fuse/inode.c           | 14 ++++++++++++++
> > > > > > > >  include/uapi/linux/fuse.h |  4 ++++
> > > > > > > >  3 files changed, 21 insertions(+)
> > > > > > > >
> > > > > > > > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> > > > > > > > index 9d87ac48d724..a592c1002861 100644
> > > > > > > > --- a/fs/fuse/fuse_i.h
> > > > > > > > +++ b/fs/fuse/fuse_i.h
> > > > > > > > @@ -873,6 +873,9 @@ struct fuse_conn {
> > > > > > > >         /* Use io_uring for communication */
> > > > > > > >         unsigned int io_uring;
> > > > > > > >
> > > > > > > > +       /* dev_dax_iomap support for famfs */
> > > > > > > > +       unsigned int famfs_iomap:1;
> > > > > > > > +
> > > > > > > 
> > > > > > > pls move up to the bit fields members.
> > > > > > 
> > > > > > Oops, done, thanks.
> > > > > > 
> > > > > > > 
> > > > > > > >         /** Maximum stack depth for passthrough backing files */
> > > > > > > >         int max_stack_depth;
> > > > > > > >
> > > > > > > > diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> > > > > > > > index 29147657a99f..e48e11c3f9f3 100644
> > > > > > > > --- a/fs/fuse/inode.c
> > > > > > > > +++ b/fs/fuse/inode.c
> > > > > > > > @@ -1392,6 +1392,18 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
> > > > > > > >                         }
> > > > > > > >                         if (flags & FUSE_OVER_IO_URING && fuse_uring_enabled())
> > > > > > > >                                 fc->io_uring = 1;
> > > > > > > > +                       if (IS_ENABLED(CONFIG_FUSE_FAMFS_DAX) &&
> > > > > > > > +                           flags & FUSE_DAX_FMAP) {
> > > > > > > > +                               /* XXX: Should also check that fuse server
> > > > > > > > +                                * has CAP_SYS_RAWIO and/or CAP_SYS_ADMIN,
> > > > > > > > +                                * since it is directing the kernel to access
> > > > > > > > +                                * dax memory directly - but this function
> > > > > > > > +                                * appears not to be called in fuse server
> > > > > > > > +                                * process context (b/c even if it drops
> > > > > > > > +                                * those capabilities, they are held here).
> > > > > > > > +                                */
> > > > > > > > +                               fc->famfs_iomap = 1;
> > > > > > > > +                       }
> > > > > > > 
> > > > > > > 1. As long as the mapping requests are checking capabilities we should be ok
> > > > > > >     Right?
> > > > > > 
> > > > > > It depends on the definition of "are", or maybe of "mapping requests" ;)
> > > > > > 
> > > > > > Forgive me if this *is* obvious, but the fuse server capabilities are what
> > > > > > I think need to be checked here - not the app that it accessing a file.
> > > > > > 
> > > > > > An app accessing a regular file doesn't need permission to do raw access to
> > > > > > the underlying block dev, but the fuse server does - becuase it is directing
> > > > > > the kernel to access that for apps.
> > > > > > 
> > > > > > > 2. What's the deal with capable(CAP_SYS_ADMIN) in process_init_limits then?
> > > > > > 
> > > > > > I *think* that's checking the capabilities of the app that is accessing the
> > > > > > file, and not the fuse server. But I might be wrong - I have not pulled very
> > > > > > hard on that thread yet.
> > > > > 
> > > > > The init reply should be processed in the context of the fuse server.
> > > > > At that point the kernel hasn't exposed the fs to user programs, so
> > > > > (AFAICT) there won't be any other programs accessing that fuse mount.
> > > > 
> > > > Hmm. It would be good if you're right about that. My fuse server *is* running
> > > > as root, and when I check those capabilities in process_init_reply(), I
> > > > find those capabilities. So far so good.
> > > > 
> > > > Then I added code to my fuse server to drop those capabilities prior to
> > > > starting the fuse session (prctl(PR_CAPBSET_DROP, CAP_SYS_RAWIO) and 
> > > > prctl(PR_CAPBSET_DROP, CAP_SYS_ADMIN). I expected (hoped?) to see those 
> > > > capabilities disappear in process_init_reply() - but they did not disappear.
> > > > 
> > > > I'm all ears if somebody can see a flaw in my logic here. Otherwise, the
> > > > capabilities need to be stashed away before the reply is processsed, when 
> > > > fs/fuse *is* running in fuse server context.
> > > > 
> > > > I'm somewhat surprised if that isn't already happening somewhere...
> > > 
> > > Hrm.  I *thought* that since FUSE_INIT isn't queued as a background
> > > command, it should still execute in the same process context as the fuse
> > > server.
> > > 
> > > OTOH it also occurs to me that I have this code in fuse_send_init:
> > > 
> > > 	if (has_capability_noaudit(current, CAP_SYS_RAWIO))
> > > 		flags |= FUSE_IOMAP | FUSE_IOMAP_DIRECTIO | FUSE_IOMAP_PAGECACHE;
> > > 	...
> > > 	ia->in.flags = flags;
> > > 	ia->in.flags2 = flags >> 32;
> > > 
> > > which means that we only advertise iomap support in FUSE_INIT if the
> > > process running fuse_fill_super (which you hope is the fuse server)
> > > actually has CAP_SYS_RAWIO.  Would that work for you?  Or are you
> > > dropping privileges before you even open /dev/fuse?
> > 
> > Ah - that might be the answer. I will check if dropped capabilities 
> > disappear in fuse_send_init. If so, I can work with that - not advertising 
> > the famfs capability unless the capability is present at that point looks 
> > like a perfectly good option. Thanks for that idea!
> 
> Review: the famfs fuse server directs the kernel to provide access to raw
> (memory) devices, so it should should be required to have have the
> CAP_SYS_RAWIO capability. fs/fuse needs to detect this at init time,
> and fail the connection/mount if the capability is missing.
> 
> I initially attempted to do this verification in process_init_reply(), but
> that doesn't run in the fuse server process context.
> 
> I am now checking the capability in fuse_send_init(), and not advertising
> the FUSE_DAX_FMAP capability (in in_args->flags[2]) unless the server has 
> CAP_SYS_RAWIO.
> 
> That requires that process_init_reply() reject FUSE_DAX_FMAP from a server
> if FUSE_DAX_FMAP was not set in in_args->flags[2]. process_init_reply() was
> not previously checking the in_args, but no big deal - this works.
> 
> This leads to an apparent dilemma in libfuse. In fuse_lowlevel_ops->init(),
> I should check for (flags & FUSE_DAX_IOMAP), and fail the connection if
> that capability is not on offer. But fuse_lowlevel_ops->init() doesn't
> have an obvious way to fail the connection. 

Yeah, I really wish it did.  I particularly wish that it had a way to
negotiate all the FUSE_INIT stuff before libfuse daemonizes and starts
up the event loop.  Well, not all of it -- by the time we get to
FUSE_INIT we've basically decided to commit to mounting.

For fuseblk servers this is horrible, because the kernel needs to be
able to open the block device with O_EXCL during the mount() process,
which means you actually have to be able to (re)open the block device
from op_init, which can fail.  Unless there's a way to drop O_EXCL from
an open fd?

The awful way that I handle failure in FUSE_INIT is to call
fuse_session_exit, but that grossly leaves a dead mount in its place.

Hey wait, is this what Mikulas was talking about when he mentioned
synchronous initialization?

For iomap I created a discovery ioctl so that you can open /dev/fuse and
ask the kernel about the iomap functionality that it supports, and you
can exit(1) without creating a fuse session.  The one goofy problem with
that is that there's a TOCTOU race if someone else does echo N >
/sys/module/fuse/parameters/enable_iomap, though fuse4fs can always
fall back to non-iomap mode.

--D

> How should I do that? Hoping Bernd, Amir or the other libfuse people may 
> have "the answer" (tm).
> 
> And of course if any of this doesn't sound like the way to go, let me know...
> 
> Thanks!
> John
> 
> 

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC V2 10/18] famfs_fuse: Basic fuse kernel ABI enablement for famfs
  2025-08-12 16:37                 ` Darrick J. Wong
@ 2025-08-13 13:07                   ` John Groves
  2025-08-14 17:16                     ` Darrick J. Wong
  0 siblings, 1 reply; 91+ messages in thread
From: John Groves @ 2025-08-13 13:07 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Amir Goldstein, Dan Williams, Miklos Szeredi, Bernd Schubert,
	John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Randy Dunlap, Jeff Layton, Kent Overstreet, linux-doc,
	linux-kernel, nvdimm, linux-cxl, linux-fsdevel, Jonathan Cameron,
	Stefan Hajnoczi, Joanne Koong, Josef Bacik, Aravind Ramesh,
	Ajay Joshi

On 25/08/12 09:37AM, Darrick J. Wong wrote:
> On Mon, Aug 11, 2025 at 01:30:53PM -0500, John Groves wrote:
> > On 25/07/10 08:32PM, John Groves wrote:
> > > On 25/07/08 06:53PM, Darrick J. Wong wrote:
> > > > On Tue, Jul 08, 2025 at 07:02:03AM -0500, John Groves wrote:
> > > > > On 25/07/07 10:39AM, Darrick J. Wong wrote:
> > > > > > On Fri, Jul 04, 2025 at 08:39:59AM -0500, John Groves wrote:
> > > > > > > On 25/07/04 09:54AM, Amir Goldstein wrote:
> > > > > > > > On Thu, Jul 3, 2025 at 8:51 PM John Groves <John@groves.net> wrote:
> > > > > > > > >
> > > > > > > > > * FUSE_DAX_FMAP flag in INIT request/reply
> > > > > > > > >
> > > > > > > > > * fuse_conn->famfs_iomap (enable famfs-mapped files) to denote a
> > > > > > > > >   famfs-enabled connection
> > > > > > > > >
> > > > > > > > > Signed-off-by: John Groves <john@groves.net>
> > > > > > > > > ---
> > > > > > > > >  fs/fuse/fuse_i.h          |  3 +++
> > > > > > > > >  fs/fuse/inode.c           | 14 ++++++++++++++
> > > > > > > > >  include/uapi/linux/fuse.h |  4 ++++
> > > > > > > > >  3 files changed, 21 insertions(+)
> > > > > > > > >
> > > > > > > > > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> > > > > > > > > index 9d87ac48d724..a592c1002861 100644
> > > > > > > > > --- a/fs/fuse/fuse_i.h
> > > > > > > > > +++ b/fs/fuse/fuse_i.h
> > > > > > > > > @@ -873,6 +873,9 @@ struct fuse_conn {
> > > > > > > > >         /* Use io_uring for communication */
> > > > > > > > >         unsigned int io_uring;
> > > > > > > > >
> > > > > > > > > +       /* dev_dax_iomap support for famfs */
> > > > > > > > > +       unsigned int famfs_iomap:1;
> > > > > > > > > +
> > > > > > > > 
> > > > > > > > pls move up to the bit fields members.
> > > > > > > 
> > > > > > > Oops, done, thanks.
> > > > > > > 
> > > > > > > > 
> > > > > > > > >         /** Maximum stack depth for passthrough backing files */
> > > > > > > > >         int max_stack_depth;
> > > > > > > > >
> > > > > > > > > diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> > > > > > > > > index 29147657a99f..e48e11c3f9f3 100644
> > > > > > > > > --- a/fs/fuse/inode.c
> > > > > > > > > +++ b/fs/fuse/inode.c
> > > > > > > > > @@ -1392,6 +1392,18 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
> > > > > > > > >                         }
> > > > > > > > >                         if (flags & FUSE_OVER_IO_URING && fuse_uring_enabled())
> > > > > > > > >                                 fc->io_uring = 1;
> > > > > > > > > +                       if (IS_ENABLED(CONFIG_FUSE_FAMFS_DAX) &&
> > > > > > > > > +                           flags & FUSE_DAX_FMAP) {
> > > > > > > > > +                               /* XXX: Should also check that fuse server
> > > > > > > > > +                                * has CAP_SYS_RAWIO and/or CAP_SYS_ADMIN,
> > > > > > > > > +                                * since it is directing the kernel to access
> > > > > > > > > +                                * dax memory directly - but this function
> > > > > > > > > +                                * appears not to be called in fuse server
> > > > > > > > > +                                * process context (b/c even if it drops
> > > > > > > > > +                                * those capabilities, they are held here).
> > > > > > > > > +                                */
> > > > > > > > > +                               fc->famfs_iomap = 1;
> > > > > > > > > +                       }
> > > > > > > > 
> > > > > > > > 1. As long as the mapping requests are checking capabilities we should be ok
> > > > > > > >     Right?
> > > > > > > 
> > > > > > > It depends on the definition of "are", or maybe of "mapping requests" ;)
> > > > > > > 
> > > > > > > Forgive me if this *is* obvious, but the fuse server capabilities are what
> > > > > > > I think need to be checked here - not the app that it accessing a file.
> > > > > > > 
> > > > > > > An app accessing a regular file doesn't need permission to do raw access to
> > > > > > > the underlying block dev, but the fuse server does - becuase it is directing
> > > > > > > the kernel to access that for apps.
> > > > > > > 
> > > > > > > > 2. What's the deal with capable(CAP_SYS_ADMIN) in process_init_limits then?
> > > > > > > 
> > > > > > > I *think* that's checking the capabilities of the app that is accessing the
> > > > > > > file, and not the fuse server. But I might be wrong - I have not pulled very
> > > > > > > hard on that thread yet.
> > > > > > 
> > > > > > The init reply should be processed in the context of the fuse server.
> > > > > > At that point the kernel hasn't exposed the fs to user programs, so
> > > > > > (AFAICT) there won't be any other programs accessing that fuse mount.
> > > > > 
> > > > > Hmm. It would be good if you're right about that. My fuse server *is* running
> > > > > as root, and when I check those capabilities in process_init_reply(), I
> > > > > find those capabilities. So far so good.
> > > > > 
> > > > > Then I added code to my fuse server to drop those capabilities prior to
> > > > > starting the fuse session (prctl(PR_CAPBSET_DROP, CAP_SYS_RAWIO) and 
> > > > > prctl(PR_CAPBSET_DROP, CAP_SYS_ADMIN). I expected (hoped?) to see those 
> > > > > capabilities disappear in process_init_reply() - but they did not disappear.
> > > > > 
> > > > > I'm all ears if somebody can see a flaw in my logic here. Otherwise, the
> > > > > capabilities need to be stashed away before the reply is processsed, when 
> > > > > fs/fuse *is* running in fuse server context.
> > > > > 
> > > > > I'm somewhat surprised if that isn't already happening somewhere...
> > > > 
> > > > Hrm.  I *thought* that since FUSE_INIT isn't queued as a background
> > > > command, it should still execute in the same process context as the fuse
> > > > server.
> > > > 
> > > > OTOH it also occurs to me that I have this code in fuse_send_init:
> > > > 
> > > > 	if (has_capability_noaudit(current, CAP_SYS_RAWIO))
> > > > 		flags |= FUSE_IOMAP | FUSE_IOMAP_DIRECTIO | FUSE_IOMAP_PAGECACHE;
> > > > 	...
> > > > 	ia->in.flags = flags;
> > > > 	ia->in.flags2 = flags >> 32;
> > > > 
> > > > which means that we only advertise iomap support in FUSE_INIT if the
> > > > process running fuse_fill_super (which you hope is the fuse server)
> > > > actually has CAP_SYS_RAWIO.  Would that work for you?  Or are you
> > > > dropping privileges before you even open /dev/fuse?
> > > 
> > > Ah - that might be the answer. I will check if dropped capabilities 
> > > disappear in fuse_send_init. If so, I can work with that - not advertising 
> > > the famfs capability unless the capability is present at that point looks 
> > > like a perfectly good option. Thanks for that idea!
> > 
> > Review: the famfs fuse server directs the kernel to provide access to raw
> > (memory) devices, so it should should be required to have have the
> > CAP_SYS_RAWIO capability. fs/fuse needs to detect this at init time,
> > and fail the connection/mount if the capability is missing.
> > 
> > I initially attempted to do this verification in process_init_reply(), but
> > that doesn't run in the fuse server process context.
> > 
> > I am now checking the capability in fuse_send_init(), and not advertising
> > the FUSE_DAX_FMAP capability (in in_args->flags[2]) unless the server has 
> > CAP_SYS_RAWIO.
> > 
> > That requires that process_init_reply() reject FUSE_DAX_FMAP from a server
> > if FUSE_DAX_FMAP was not set in in_args->flags[2]. process_init_reply() was
> > not previously checking the in_args, but no big deal - this works.
> > 
> > This leads to an apparent dilemma in libfuse. In fuse_lowlevel_ops->init(),
> > I should check for (flags & FUSE_DAX_IOMAP), and fail the connection if
> > that capability is not on offer. But fuse_lowlevel_ops->init() doesn't
> > have an obvious way to fail the connection. 
> 
> Yeah, I really wish it did.  I particularly wish that it had a way to
> negotiate all the FUSE_INIT stuff before libfuse daemonizes and starts
> up the event loop.  Well, not all of it -- by the time we get to
> FUSE_INIT we've basically decided to commit to mounting.
> 
> For fuseblk servers this is horrible, because the kernel needs to be
> able to open the block device with O_EXCL during the mount() process,
> which means you actually have to be able to (re)open the block device
> from op_init, which can fail.  Unless there's a way to drop O_EXCL from
> an open fd?
> 
> The awful way that I handle failure in FUSE_INIT is to call
> fuse_session_exit, but that grossly leaves a dead mount in its place.
> 
> Hey wait, is this what Mikulas was talking about when he mentioned
> synchronous initialization?
> 
> For iomap I created a discovery ioctl so that you can open /dev/fuse and
> ask the kernel about the iomap functionality that it supports, and you
> can exit(1) without creating a fuse session.  The one goofy problem with
> that is that there's a TOCTOU race if someone else does echo N >
> /sys/module/fuse/parameters/enable_iomap, though fuse4fs can always
> fall back to non-iomap mode.
> 
> --D

Thanks Darrick.

Hmm - synchronous init would be nice.

I tried calling fuse_session_exit(), but the broken mount was not an
improvement over a can't-do-I/O mount - which I get if the kernel rejects 
the capability currently known as FUSE_DAX_FMAP due to lack of CAP_SYS_RAWIO.

In my case, I think letting the mount complete with FUSE_DAX_FMAP rejected
is easier to detect and cleanup than a fuse_session_exit() aborted mount.

Famfs mount is a cli operation that does a sequence of stuff before and after
the fork/exec of the famfs fuse server. That fork/exec can't really return 
an error in the conventional sense, so I'm stuck diagnosing whether the 
mount is good (which I already do, but it's a WIP). 

I already have to poll for the .meta files to appear (superblock and log), 
and that can be adapted pretty easily to check whether they can be read 
correctly (which they can't if famfs doesn't have daxdev access).

If mount was synchronous, I'd still need to give the fork/exec enough time
to fail and then detect that. That would probably be cleaner, but not by
a huge amount.

Thanks,
John

<snip>


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC V2 11/18] famfs_fuse: Basic famfs mount opts
  2025-07-12  5:54       ` Darrick J. Wong
@ 2025-08-14 10:37         ` Miklos Szeredi
  2025-08-14 14:39           ` John Groves
  0 siblings, 1 reply; 91+ messages in thread
From: Miklos Szeredi @ 2025-08-14 10:37 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: John Groves, Dan Williams, Bernd Schubert, John Groves,
	Jonathan Corbet, Vishal Verma, Dave Jiang, Matthew Wilcox,
	Jan Kara, Alexander Viro, Christian Brauner, Randy Dunlap,
	Jeff Layton, Kent Overstreet, linux-doc, linux-kernel, nvdimm,
	linux-cxl, linux-fsdevel, Amir Goldstein, Jonathan Cameron,
	Stefan Hajnoczi, Joanne Koong, Josef Bacik, Aravind Ramesh,
	Ajay Joshi

On Sat, 12 Jul 2025 at 07:54, Darrick J. Wong <djwong@kernel.org> wrote:
>
> On Fri, Jul 11, 2025 at 10:28:20AM -0500, John Groves wrote:

> >     famfs_fuse: Basic famfs mount opt: -o shadow=<shadowpath>
> >
> >     The shadow path is a (usually tmpfs) file system area used by the famfs
> >     user space to commuicate with the famfs fuse server. There is a minor
> >     dilemma that the user space tools must be able to resolve from a mount
> >     point path to a shadow path. The shadow path is exposed via /proc/mounts,
> >     but otherwise not used by the kernel. User space gets the shadow path
> >     from /proc/mounts...

Don't know if we want to go that way.  Is there no other way?

But if we do, at least do it in a generic way.  I.e. fuse server can
tell the kernel to display options A, B and C in /proc/mounts.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC V2 12/18] famfs_fuse: Plumb the GET_FMAP message/response
  2025-07-03 18:50 ` [RFC V2 12/18] famfs_fuse: Plumb the GET_FMAP message/response John Groves
  2025-07-04  8:54   ` Amir Goldstein
  2025-07-09  4:27   ` Darrick J. Wong
@ 2025-08-14 13:36   ` Miklos Szeredi
  2025-08-14 14:36     ` Miklos Szeredi
                       ` (2 more replies)
  2 siblings, 3 replies; 91+ messages in thread
From: Miklos Szeredi @ 2025-08-14 13:36 UTC (permalink / raw)
  To: John Groves
  Cc: Dan Williams, Miklos Szeredi, Bernd Schubert, John Groves,
	Jonathan Corbet, Vishal Verma, Dave Jiang, Matthew Wilcox,
	Jan Kara, Alexander Viro, Christian Brauner, Darrick J . Wong,
	Randy Dunlap, Jeff Layton, Kent Overstreet, linux-doc,
	linux-kernel, nvdimm, linux-cxl, linux-fsdevel, Amir Goldstein,
	Jonathan Cameron, Stefan Hajnoczi, Joanne Koong, Josef Bacik,
	Aravind Ramesh, Ajay Joshi

On Thu, 3 Jul 2025 at 20:54, John Groves <John@groves.net> wrote:
>
> Upon completion of an OPEN, if we're in famfs-mode we do a GET_FMAP to
> retrieve and cache up the file-to-dax map in the kernel. If this
> succeeds, read/write/mmap are resolved direct-to-dax with no upcalls.

Nothing to do at this time unless you want a side project:  doing this
with compound requests would save a roundtrip (OPEN + GET_FMAP in one
go).

> GET_FMAP has a variable-size response payload, and the allocated size
> is sent in the in_args[0].size field. If the fmap would overflow the
> message, the fuse server sends a reply of size 'sizeof(uint32_t)' which
> specifies the size of the fmap message. Then the kernel can realloc a
> large enough buffer and try again.

There is a better way to do this: the allocation can happen when we
get the response.  Just need to add infrastructure to dev.c.

> diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> index 6c384640c79b..dff5aa62543e 100644
> --- a/include/uapi/linux/fuse.h
> +++ b/include/uapi/linux/fuse.h
> @@ -654,6 +654,10 @@ enum fuse_opcode {
>         FUSE_TMPFILE            = 51,
>         FUSE_STATX              = 52,
>
> +       /* Famfs / devdax opcodes */
> +       FUSE_GET_FMAP           = 53,
> +       FUSE_GET_DAXDEV         = 54,

Introduced too early.

> +
>         /* CUSE specific operations */
>         CUSE_INIT               = 4096,
>
> @@ -888,6 +892,16 @@ struct fuse_access_in {
>         uint32_t        padding;
>  };
>
> +struct fuse_get_fmap_in {
> +       uint32_t        size;
> +       uint32_t        padding;
> +};

As noted, passing size to server really makes no sense.  I'd just omit
fuse_get_fmap_in completely.

> +
> +struct fuse_get_fmap_out {
> +       uint32_t        size;
> +       uint32_t        padding;
> +};
> +
>  struct fuse_init_in {
>         uint32_t        major;
>         uint32_t        minor;
> @@ -1284,4 +1298,8 @@ struct fuse_uring_cmd_req {
>         uint8_t padding[6];
>  };
>
> +/* Famfs fmap message components */
> +
> +#define FAMFS_FMAP_MAX 32768 /* Largest supported fmap message */
> +

Hmm, Darrick's interface gets one extents at a time.   This one tries
to get the whole map in one go.

The single extent thing can be inefficient even for plain block fs, so
it would be nice to get multiple extents.  The whole map has an
artificial limit that currently may seem sufficient but down the line
could cause pain.

I'm still hoping some common ground would benefit both interfaces.
Just not sure what it should be.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC V2 14/18] famfs_fuse: GET_DAXDEV message and daxdev_table
  2025-07-03 18:50 ` [RFC V2 14/18] famfs_fuse: GET_DAXDEV message and daxdev_table John Groves
  2025-07-04 13:20   ` Jonathan Cameron
@ 2025-08-14 13:58   ` Miklos Szeredi
  2025-08-14 17:19     ` Darrick J. Wong
  2025-08-15 16:38     ` John Groves
  1 sibling, 2 replies; 91+ messages in thread
From: Miklos Szeredi @ 2025-08-14 13:58 UTC (permalink / raw)
  To: John Groves
  Cc: Dan Williams, Bernd Schubert, John Groves, Jonathan Corbet,
	Vishal Verma, Dave Jiang, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Darrick J . Wong, Randy Dunlap,
	Jeff Layton, Kent Overstreet, linux-doc, linux-kernel, nvdimm,
	linux-cxl, linux-fsdevel, Amir Goldstein, Jonathan Cameron,
	Stefan Hajnoczi, Joanne Koong, Josef Bacik, Aravind Ramesh,
	Ajay Joshi

On Thu, 3 Jul 2025 at 20:54, John Groves <John@groves.net> wrote:
>
> * The new GET_DAXDEV message/response is enabled
> * The command it triggered by the update_daxdev_table() call, if there
>   are any daxdevs in the subject fmap that are not represented in the
>   daxdev_dable yet.

This is rather convoluted, the server *should know* which dax devices
it has registered, hence it shouldn't need to be explicitly asked.

And there's already an API for registering file descriptors:
FUSE_DEV_IOC_BACKING_OPEN.  Is there a reason that interface couldn't
be used by famfs?

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC V2 12/18] famfs_fuse: Plumb the GET_FMAP message/response
  2025-08-14 13:36   ` Miklos Szeredi
@ 2025-08-14 14:36     ` Miklos Szeredi
  2025-08-14 18:20       ` Darrick J. Wong
  2025-08-15 16:53       ` John Groves
  2025-08-14 18:05     ` Darrick J. Wong
  2025-08-15  0:38     ` John Groves
  2 siblings, 2 replies; 91+ messages in thread
From: Miklos Szeredi @ 2025-08-14 14:36 UTC (permalink / raw)
  To: John Groves
  Cc: Dan Williams, Bernd Schubert, John Groves, Jonathan Corbet,
	Vishal Verma, Dave Jiang, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Darrick J . Wong, Randy Dunlap,
	Jeff Layton, Kent Overstreet, linux-doc, linux-kernel, nvdimm,
	linux-cxl, linux-fsdevel, Amir Goldstein, Jonathan Cameron,
	Stefan Hajnoczi, Joanne Koong, Josef Bacik, Aravind Ramesh,
	Ajay Joshi

On Thu, 14 Aug 2025 at 15:36, Miklos Szeredi <miklos@szeredi.hu> wrote:

> I'm still hoping some common ground would benefit both interfaces.
> Just not sure what it should be.

Something very high level:

 - allow several map formats: say a plain one with a list of extents
and a famfs one
 - allow several types of backing files: say regular and dax dev
 - querying maps has a common protocol, format of maps is opaque to this
 - maps are cached by a common facility
 - each type of mapping has a decoder module
 - each type of backing file has a module for handling I/O

Does this make sense?

This doesn't have to be implemented in one go, but for example
GET_FMAP could be renamed to GET_READ_MAP with an added offset and
size parameter.  For famfs the offset/size would be set to zero/inf.
I'd be content with that for now.

Thanks,
Miklos

>
> Thanks,
> Miklos

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC V2 11/18] famfs_fuse: Basic famfs mount opts
  2025-08-14 10:37         ` Miklos Szeredi
@ 2025-08-14 14:39           ` John Groves
  2025-08-14 15:19             ` Miklos Szeredi
  0 siblings, 1 reply; 91+ messages in thread
From: John Groves @ 2025-08-14 14:39 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Darrick J. Wong, Dan Williams, Bernd Schubert, John Groves,
	Jonathan Corbet, Vishal Verma, Dave Jiang, Matthew Wilcox,
	Jan Kara, Alexander Viro, Christian Brauner, Randy Dunlap,
	Jeff Layton, Kent Overstreet, linux-doc, linux-kernel, nvdimm,
	linux-cxl, linux-fsdevel, Amir Goldstein, Jonathan Cameron,
	Stefan Hajnoczi, Joanne Koong, Josef Bacik, Aravind Ramesh,
	Ajay Joshi

On 25/08/14 12:37PM, Miklos Szeredi wrote:
> On Sat, 12 Jul 2025 at 07:54, Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > On Fri, Jul 11, 2025 at 10:28:20AM -0500, John Groves wrote:
> 
> > >     famfs_fuse: Basic famfs mount opt: -o shadow=<shadowpath>
> > >
> > >     The shadow path is a (usually tmpfs) file system area used by the famfs
> > >     user space to commuicate with the famfs fuse server. There is a minor
> > >     dilemma that the user space tools must be able to resolve from a mount
> > >     point path to a shadow path. The shadow path is exposed via /proc/mounts,
> > >     but otherwise not used by the kernel. User space gets the shadow path
> > >     from /proc/mounts...
> 
> Don't know if we want to go that way.  Is there no other way?
> 
> But if we do, at least do it in a generic way.  I.e. fuse server can
> tell the kernel to display options A, B and C in /proc/mounts.
> 
> Thanks,
> Miklos

So far I haven't come up with an alternative, other than bad ones. 

Could parse the shadow path from the fuse server with the correct mount
point from 'ps -ef', but there are cases where a fuse server is killed and 
the kernel still thinks it's mounted (and we still might need to find the 
shadow path).

Could write the shadow path to a systemd log and parse it from there, but 
that would break if the log wasn't enabled, and would disappear if the log
was rotated during a long-running mount - and this resolution must happen
every time the famfs cli does most anything (cp, creat, fsck, etc.).

Could write it to a "secret file" somewhere, but that's kinda brittle.

Shadow paths are almost always tmpdir paths that are generated at mount time,
so there really isn't a good way to guess them, and it doesn't seem viable
to require them to be in (e.g.) /tmp in all cases.

Here is what it currently looks like on a running system:

$ grep famfs /proc/mounts
/dev/dax0.0 /mnt/famfs fuse rw,nosuid,nodev,relatime,user_id=0,group_id=0,shadow=/tmp/famfs_shadow_5m0dnH 0 0
$ ps -ef | grep /mnt/famfs | grep -v grep
root       12775       1  0 07:04 ?        00:00:00 /dev/dax0.0 -o daxdev=/dev/dax0.0,shadow=/tmp/famfs_shadow_5m0dnH,fsname=/dev/dax0.0,timeout=31536000.000000 /mnt/famfs

Having a generic approach rather than a '-o' option would be fine with me.
Also happy to entertain other ideas...

Thanks,
John


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC V2 11/18] famfs_fuse: Basic famfs mount opts
  2025-08-14 14:39           ` John Groves
@ 2025-08-14 15:19             ` Miklos Szeredi
  2025-08-14 23:52               ` John Groves
  0 siblings, 1 reply; 91+ messages in thread
From: Miklos Szeredi @ 2025-08-14 15:19 UTC (permalink / raw)
  To: John Groves
  Cc: Darrick J. Wong, Dan Williams, Bernd Schubert, John Groves,
	Jonathan Corbet, Vishal Verma, Dave Jiang, Matthew Wilcox,
	Jan Kara, Alexander Viro, Christian Brauner, Randy Dunlap,
	Jeff Layton, Kent Overstreet, linux-doc, linux-kernel, nvdimm,
	linux-cxl, linux-fsdevel, Amir Goldstein, Jonathan Cameron,
	Stefan Hajnoczi, Joanne Koong, Josef Bacik, Aravind Ramesh,
	Ajay Joshi

On Thu, 14 Aug 2025 at 16:39, John Groves <John@groves.net> wrote:

> Having a generic approach rather than a '-o' option would be fine with me.
> Also happy to entertain other ideas...

We could just allow arbitrary options to be set by the server.  It
might break cases where the server just passes unknown options down
into the kernel, which currently are rejected.  I don't think this is
common practice, but still it sounds a bit risky.

Alternatively allow INIT_REPLY to set up misc options, which can only
be done explicitly, so no risk there.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC V2 10/18] famfs_fuse: Basic fuse kernel ABI enablement for famfs
  2025-08-13 13:07                   ` John Groves
@ 2025-08-14 17:16                     ` Darrick J. Wong
  0 siblings, 0 replies; 91+ messages in thread
From: Darrick J. Wong @ 2025-08-14 17:16 UTC (permalink / raw)
  To: John Groves
  Cc: Amir Goldstein, Dan Williams, Miklos Szeredi, Bernd Schubert,
	John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Randy Dunlap, Jeff Layton, Kent Overstreet, linux-doc,
	linux-kernel, nvdimm, linux-cxl, linux-fsdevel, Jonathan Cameron,
	Stefan Hajnoczi, Joanne Koong, Josef Bacik, Aravind Ramesh,
	Ajay Joshi

On Wed, Aug 13, 2025 at 08:07:00AM -0500, John Groves wrote:
> On 25/08/12 09:37AM, Darrick J. Wong wrote:
> > On Mon, Aug 11, 2025 at 01:30:53PM -0500, John Groves wrote:
> > > On 25/07/10 08:32PM, John Groves wrote:
> > > > On 25/07/08 06:53PM, Darrick J. Wong wrote:
> > > > > On Tue, Jul 08, 2025 at 07:02:03AM -0500, John Groves wrote:
> > > > > > On 25/07/07 10:39AM, Darrick J. Wong wrote:
> > > > > > > On Fri, Jul 04, 2025 at 08:39:59AM -0500, John Groves wrote:
> > > > > > > > On 25/07/04 09:54AM, Amir Goldstein wrote:
> > > > > > > > > On Thu, Jul 3, 2025 at 8:51 PM John Groves <John@groves.net> wrote:
> > > > > > > > > >
> > > > > > > > > > * FUSE_DAX_FMAP flag in INIT request/reply
> > > > > > > > > >
> > > > > > > > > > * fuse_conn->famfs_iomap (enable famfs-mapped files) to denote a
> > > > > > > > > >   famfs-enabled connection
> > > > > > > > > >
> > > > > > > > > > Signed-off-by: John Groves <john@groves.net>
> > > > > > > > > > ---
> > > > > > > > > >  fs/fuse/fuse_i.h          |  3 +++
> > > > > > > > > >  fs/fuse/inode.c           | 14 ++++++++++++++
> > > > > > > > > >  include/uapi/linux/fuse.h |  4 ++++
> > > > > > > > > >  3 files changed, 21 insertions(+)
> > > > > > > > > >
> > > > > > > > > > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> > > > > > > > > > index 9d87ac48d724..a592c1002861 100644
> > > > > > > > > > --- a/fs/fuse/fuse_i.h
> > > > > > > > > > +++ b/fs/fuse/fuse_i.h
> > > > > > > > > > @@ -873,6 +873,9 @@ struct fuse_conn {
> > > > > > > > > >         /* Use io_uring for communication */
> > > > > > > > > >         unsigned int io_uring;
> > > > > > > > > >
> > > > > > > > > > +       /* dev_dax_iomap support for famfs */
> > > > > > > > > > +       unsigned int famfs_iomap:1;
> > > > > > > > > > +
> > > > > > > > > 
> > > > > > > > > pls move up to the bit fields members.
> > > > > > > > 
> > > > > > > > Oops, done, thanks.
> > > > > > > > 
> > > > > > > > > 
> > > > > > > > > >         /** Maximum stack depth for passthrough backing files */
> > > > > > > > > >         int max_stack_depth;
> > > > > > > > > >
> > > > > > > > > > diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> > > > > > > > > > index 29147657a99f..e48e11c3f9f3 100644
> > > > > > > > > > --- a/fs/fuse/inode.c
> > > > > > > > > > +++ b/fs/fuse/inode.c
> > > > > > > > > > @@ -1392,6 +1392,18 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
> > > > > > > > > >                         }
> > > > > > > > > >                         if (flags & FUSE_OVER_IO_URING && fuse_uring_enabled())
> > > > > > > > > >                                 fc->io_uring = 1;
> > > > > > > > > > +                       if (IS_ENABLED(CONFIG_FUSE_FAMFS_DAX) &&
> > > > > > > > > > +                           flags & FUSE_DAX_FMAP) {
> > > > > > > > > > +                               /* XXX: Should also check that fuse server
> > > > > > > > > > +                                * has CAP_SYS_RAWIO and/or CAP_SYS_ADMIN,
> > > > > > > > > > +                                * since it is directing the kernel to access
> > > > > > > > > > +                                * dax memory directly - but this function
> > > > > > > > > > +                                * appears not to be called in fuse server
> > > > > > > > > > +                                * process context (b/c even if it drops
> > > > > > > > > > +                                * those capabilities, they are held here).
> > > > > > > > > > +                                */
> > > > > > > > > > +                               fc->famfs_iomap = 1;
> > > > > > > > > > +                       }
> > > > > > > > > 
> > > > > > > > > 1. As long as the mapping requests are checking capabilities we should be ok
> > > > > > > > >     Right?
> > > > > > > > 
> > > > > > > > It depends on the definition of "are", or maybe of "mapping requests" ;)
> > > > > > > > 
> > > > > > > > Forgive me if this *is* obvious, but the fuse server capabilities are what
> > > > > > > > I think need to be checked here - not the app that it accessing a file.
> > > > > > > > 
> > > > > > > > An app accessing a regular file doesn't need permission to do raw access to
> > > > > > > > the underlying block dev, but the fuse server does - becuase it is directing
> > > > > > > > the kernel to access that for apps.
> > > > > > > > 
> > > > > > > > > 2. What's the deal with capable(CAP_SYS_ADMIN) in process_init_limits then?
> > > > > > > > 
> > > > > > > > I *think* that's checking the capabilities of the app that is accessing the
> > > > > > > > file, and not the fuse server. But I might be wrong - I have not pulled very
> > > > > > > > hard on that thread yet.
> > > > > > > 
> > > > > > > The init reply should be processed in the context of the fuse server.
> > > > > > > At that point the kernel hasn't exposed the fs to user programs, so
> > > > > > > (AFAICT) there won't be any other programs accessing that fuse mount.
> > > > > > 
> > > > > > Hmm. It would be good if you're right about that. My fuse server *is* running
> > > > > > as root, and when I check those capabilities in process_init_reply(), I
> > > > > > find those capabilities. So far so good.
> > > > > > 
> > > > > > Then I added code to my fuse server to drop those capabilities prior to
> > > > > > starting the fuse session (prctl(PR_CAPBSET_DROP, CAP_SYS_RAWIO) and 
> > > > > > prctl(PR_CAPBSET_DROP, CAP_SYS_ADMIN). I expected (hoped?) to see those 
> > > > > > capabilities disappear in process_init_reply() - but they did not disappear.
> > > > > > 
> > > > > > I'm all ears if somebody can see a flaw in my logic here. Otherwise, the
> > > > > > capabilities need to be stashed away before the reply is processsed, when 
> > > > > > fs/fuse *is* running in fuse server context.
> > > > > > 
> > > > > > I'm somewhat surprised if that isn't already happening somewhere...
> > > > > 
> > > > > Hrm.  I *thought* that since FUSE_INIT isn't queued as a background
> > > > > command, it should still execute in the same process context as the fuse
> > > > > server.
> > > > > 
> > > > > OTOH it also occurs to me that I have this code in fuse_send_init:
> > > > > 
> > > > > 	if (has_capability_noaudit(current, CAP_SYS_RAWIO))
> > > > > 		flags |= FUSE_IOMAP | FUSE_IOMAP_DIRECTIO | FUSE_IOMAP_PAGECACHE;
> > > > > 	...
> > > > > 	ia->in.flags = flags;
> > > > > 	ia->in.flags2 = flags >> 32;
> > > > > 
> > > > > which means that we only advertise iomap support in FUSE_INIT if the
> > > > > process running fuse_fill_super (which you hope is the fuse server)
> > > > > actually has CAP_SYS_RAWIO.  Would that work for you?  Or are you
> > > > > dropping privileges before you even open /dev/fuse?
> > > > 
> > > > Ah - that might be the answer. I will check if dropped capabilities 
> > > > disappear in fuse_send_init. If so, I can work with that - not advertising 
> > > > the famfs capability unless the capability is present at that point looks 
> > > > like a perfectly good option. Thanks for that idea!
> > > 
> > > Review: the famfs fuse server directs the kernel to provide access to raw
> > > (memory) devices, so it should should be required to have have the
> > > CAP_SYS_RAWIO capability. fs/fuse needs to detect this at init time,
> > > and fail the connection/mount if the capability is missing.
> > > 
> > > I initially attempted to do this verification in process_init_reply(), but
> > > that doesn't run in the fuse server process context.
> > > 
> > > I am now checking the capability in fuse_send_init(), and not advertising
> > > the FUSE_DAX_FMAP capability (in in_args->flags[2]) unless the server has 
> > > CAP_SYS_RAWIO.
> > > 
> > > That requires that process_init_reply() reject FUSE_DAX_FMAP from a server
> > > if FUSE_DAX_FMAP was not set in in_args->flags[2]. process_init_reply() was
> > > not previously checking the in_args, but no big deal - this works.
> > > 
> > > This leads to an apparent dilemma in libfuse. In fuse_lowlevel_ops->init(),
> > > I should check for (flags & FUSE_DAX_IOMAP), and fail the connection if
> > > that capability is not on offer. But fuse_lowlevel_ops->init() doesn't
> > > have an obvious way to fail the connection. 
> > 
> > Yeah, I really wish it did.  I particularly wish that it had a way to
> > negotiate all the FUSE_INIT stuff before libfuse daemonizes and starts
> > up the event loop.  Well, not all of it -- by the time we get to
> > FUSE_INIT we've basically decided to commit to mounting.
> > 
> > For fuseblk servers this is horrible, because the kernel needs to be
> > able to open the block device with O_EXCL during the mount() process,
> > which means you actually have to be able to (re)open the block device
> > from op_init, which can fail.  Unless there's a way to drop O_EXCL from
> > an open fd?
> > 
> > The awful way that I handle failure in FUSE_INIT is to call
> > fuse_session_exit, but that grossly leaves a dead mount in its place.
> > 
> > Hey wait, is this what Mikulas was talking about when he mentioned
> > synchronous initialization?
> > 
> > For iomap I created a discovery ioctl so that you can open /dev/fuse and
> > ask the kernel about the iomap functionality that it supports, and you
> > can exit(1) without creating a fuse session.  The one goofy problem with
> > that is that there's a TOCTOU race if someone else does echo N >
> > /sys/module/fuse/parameters/enable_iomap, though fuse4fs can always
> > fall back to non-iomap mode.
> > 
> > --D
> 
> Thanks Darrick.
> 
> Hmm - synchronous init would be nice.
> 
> I tried calling fuse_session_exit(), but the broken mount was not an
> improvement over a can't-do-I/O mount - which I get if the kernel rejects 
> the capability currently known as FUSE_DAX_FMAP due to lack of CAP_SYS_RAWIO.
> 
> In my case, I think letting the mount complete with FUSE_DAX_FMAP rejected
> is easier to detect and cleanup than a fuse_session_exit() aborted mount.

Yeah, you can always adjust the fuse server to react to an FMAP
rejection by returning EIO or something.  Though I guess it's nice to
have some detection that you can do prior to calling fuse_main so that
you can print complaints and exit(1) while the user is still paying
attention. :)

--D

> Famfs mount is a cli operation that does a sequence of stuff before and after
> the fork/exec of the famfs fuse server. That fork/exec can't really return 
> an error in the conventional sense, so I'm stuck diagnosing whether the 
> mount is good (which I already do, but it's a WIP). 
> 
> I already have to poll for the .meta files to appear (superblock and log), 
> and that can be adapted pretty easily to check whether they can be read 
> correctly (which they can't if famfs doesn't have daxdev access).
> 
> If mount was synchronous, I'd still need to give the fork/exec enough time
> to fail and then detect that. That would probably be cleaner, but not by
> a huge amount.
> 
> Thanks,
> John
> 
> <snip>
> 
> 

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC V2 14/18] famfs_fuse: GET_DAXDEV message and daxdev_table
  2025-08-14 13:58   ` Miklos Szeredi
@ 2025-08-14 17:19     ` Darrick J. Wong
  2025-08-14 18:25       ` Miklos Szeredi
  2025-08-15 16:38     ` John Groves
  1 sibling, 1 reply; 91+ messages in thread
From: Darrick J. Wong @ 2025-08-14 17:19 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: John Groves, Dan Williams, Bernd Schubert, John Groves,
	Jonathan Corbet, Vishal Verma, Dave Jiang, Matthew Wilcox,
	Jan Kara, Alexander Viro, Christian Brauner, Randy Dunlap,
	Jeff Layton, Kent Overstreet, linux-doc, linux-kernel, nvdimm,
	linux-cxl, linux-fsdevel, Amir Goldstein, Jonathan Cameron,
	Stefan Hajnoczi, Joanne Koong, Josef Bacik, Aravind Ramesh,
	Ajay Joshi

On Thu, Aug 14, 2025 at 03:58:58PM +0200, Miklos Szeredi wrote:
> On Thu, 3 Jul 2025 at 20:54, John Groves <John@groves.net> wrote:
> >
> > * The new GET_DAXDEV message/response is enabled
> > * The command it triggered by the update_daxdev_table() call, if there
> >   are any daxdevs in the subject fmap that are not represented in the
> >   daxdev_dable yet.
> 
> This is rather convoluted, the server *should know* which dax devices
> it has registered, hence it shouldn't need to be explicitly asked.
> 
> And there's already an API for registering file descriptors:
> FUSE_DEV_IOC_BACKING_OPEN.  Is there a reason that interface couldn't
> be used by famfs?

What happens if you want to have a fuse server that hosts both famfs
files /and/ backing files?  That'd be pretty crazy to mix both paths in
one filesystem, but it's in theory possible, particularly if the famfs
server wanted to export a pseudofile where everyone could find that
shadow file?

--D

> Thanks,
> Miklos
> 

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC V2 12/18] famfs_fuse: Plumb the GET_FMAP message/response
  2025-08-14 13:36   ` Miklos Szeredi
  2025-08-14 14:36     ` Miklos Szeredi
@ 2025-08-14 18:05     ` Darrick J. Wong
  2025-08-16 15:00       ` John Groves
  2025-08-15  0:38     ` John Groves
  2 siblings, 1 reply; 91+ messages in thread
From: Darrick J. Wong @ 2025-08-14 18:05 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: John Groves, Dan Williams, Miklos Szeredi, Bernd Schubert,
	John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Randy Dunlap, Jeff Layton, Kent Overstreet, linux-doc,
	linux-kernel, nvdimm, linux-cxl, linux-fsdevel, Amir Goldstein,
	Jonathan Cameron, Stefan Hajnoczi, Joanne Koong, Josef Bacik,
	Aravind Ramesh, Ajay Joshi

On Thu, Aug 14, 2025 at 03:36:26PM +0200, Miklos Szeredi wrote:
> On Thu, 3 Jul 2025 at 20:54, John Groves <John@groves.net> wrote:
> >
> > Upon completion of an OPEN, if we're in famfs-mode we do a GET_FMAP to
> > retrieve and cache up the file-to-dax map in the kernel. If this
> > succeeds, read/write/mmap are resolved direct-to-dax with no upcalls.
> 
> Nothing to do at this time unless you want a side project:  doing this
> with compound requests would save a roundtrip (OPEN + GET_FMAP in one
> go).
> 
> > GET_FMAP has a variable-size response payload, and the allocated size
> > is sent in the in_args[0].size field. If the fmap would overflow the
> > message, the fuse server sends a reply of size 'sizeof(uint32_t)' which
> > specifies the size of the fmap message. Then the kernel can realloc a
> > large enough buffer and try again.
> 
> There is a better way to do this: the allocation can happen when we
> get the response.  Just need to add infrastructure to dev.c.
> 
> > diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> > index 6c384640c79b..dff5aa62543e 100644
> > --- a/include/uapi/linux/fuse.h
> > +++ b/include/uapi/linux/fuse.h
> > @@ -654,6 +654,10 @@ enum fuse_opcode {
> >         FUSE_TMPFILE            = 51,
> >         FUSE_STATX              = 52,
> >
> > +       /* Famfs / devdax opcodes */
> > +       FUSE_GET_FMAP           = 53,
> > +       FUSE_GET_DAXDEV         = 54,
> 
> Introduced too early.
> 
> > +
> >         /* CUSE specific operations */
> >         CUSE_INIT               = 4096,
> >
> > @@ -888,6 +892,16 @@ struct fuse_access_in {
> >         uint32_t        padding;
> >  };
> >
> > +struct fuse_get_fmap_in {
> > +       uint32_t        size;
> > +       uint32_t        padding;
> > +};
> 
> As noted, passing size to server really makes no sense.  I'd just omit
> fuse_get_fmap_in completely.
> 
> > +
> > +struct fuse_get_fmap_out {
> > +       uint32_t        size;
> > +       uint32_t        padding;
> > +};
> > +
> >  struct fuse_init_in {
> >         uint32_t        major;
> >         uint32_t        minor;
> > @@ -1284,4 +1298,8 @@ struct fuse_uring_cmd_req {
> >         uint8_t padding[6];
> >  };
> >
> > +/* Famfs fmap message components */
> > +
> > +#define FAMFS_FMAP_MAX 32768 /* Largest supported fmap message */
> > +
> 
> Hmm, Darrick's interface gets one extents at a time.   This one tries
> to get the whole map in one go.
> 
> The single extent thing can be inefficient even for plain block fs, so
> it would be nice to get multiple extents.  The whole map has an

The fuse/iomap series adds a mapping upsertion "notification" that the
fuse server can use to prepopulate mappings in the kernel so that it
doesn't have to call back to the server for reads and pure overwrites.
Maybe it would be useful to be able to upsert mappings for more than a
single file range, but so far it hasn't been necessary.

> artificial limit that currently may seem sufficient but down the line
> could cause pain.
> 
> I'm still hoping some common ground would benefit both interfaces.
> Just not sure what it should be.

It's possible that famfs could use the mapping upsertion notification to
upload mappings into the kernel.  As far as I can tell, fuse servers can
send notifications even when they're in the middle of handling a fuse
request, so the famfs daemon's ->open function could upload mappings
before completing the open operation.

As for the other use of FMAP (uploading a description of striping data
from which one can construct mappings) ... I don't know what to say
about that.  That one isn't so useful for block filesystems. :)

(onto the next reply)

--D

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC V2 12/18] famfs_fuse: Plumb the GET_FMAP message/response
  2025-08-14 14:36     ` Miklos Szeredi
@ 2025-08-14 18:20       ` Darrick J. Wong
  2025-08-15 15:06         ` John Groves
  2025-08-15 16:53       ` John Groves
  1 sibling, 1 reply; 91+ messages in thread
From: Darrick J. Wong @ 2025-08-14 18:20 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: John Groves, Dan Williams, Bernd Schubert, John Groves,
	Jonathan Corbet, Vishal Verma, Dave Jiang, Matthew Wilcox,
	Jan Kara, Alexander Viro, Christian Brauner, Randy Dunlap,
	Jeff Layton, Kent Overstreet, linux-doc, linux-kernel, nvdimm,
	linux-cxl, linux-fsdevel, Amir Goldstein, Jonathan Cameron,
	Stefan Hajnoczi, Joanne Koong, Josef Bacik, Aravind Ramesh,
	Ajay Joshi

On Thu, Aug 14, 2025 at 04:36:57PM +0200, Miklos Szeredi wrote:
> On Thu, 14 Aug 2025 at 15:36, Miklos Szeredi <miklos@szeredi.hu> wrote:
> 
> > I'm still hoping some common ground would benefit both interfaces.
> > Just not sure what it should be.
> 
> Something very high level:
> 
>  - allow several map formats: say a plain one with a list of extents
> and a famfs one

Yes, I think that's needed.

>  - allow several types of backing files: say regular and dax dev

"block device", for iomap.

>  - querying maps has a common protocol, format of maps is opaque to this
>  - maps are cached by a common facility

I've written such a cache already. :)

>  - each type of mapping has a decoder module

I don't know that you need much "decoding" -- for famfs, the regular
mappings correspond to FUSE_IOMAP_TYPE_MAPPED.  The one goofy part is
the device cookie in each IO mapping: fuse-iomap maps each block device
you give it to a device cookie, so I guess famfs will have to do the
same.

OTOH you can then have a famfs backed by many persistent memory
devices.

>  - each type of backing file has a module for handling I/O
> 
> Does this make sense?

More or less.

> This doesn't have to be implemented in one go, but for example
> GET_FMAP could be renamed to GET_READ_MAP with an added offset and
> size parameter.  For famfs the offset/size would be set to zero/inf.
> I'd be content with that for now.

I'll try to cough up a RFC v4 next week.

--D

> Thanks,
> Miklos
> 
> >
> > Thanks,
> > Miklos
> 

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC V2 14/18] famfs_fuse: GET_DAXDEV message and daxdev_table
  2025-08-14 17:19     ` Darrick J. Wong
@ 2025-08-14 18:25       ` Miklos Szeredi
  2025-08-14 18:55         ` Darrick J. Wong
  2025-08-16 16:22         ` John Groves
  0 siblings, 2 replies; 91+ messages in thread
From: Miklos Szeredi @ 2025-08-14 18:25 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: John Groves, Dan Williams, Bernd Schubert, John Groves,
	Jonathan Corbet, Vishal Verma, Dave Jiang, Matthew Wilcox,
	Jan Kara, Alexander Viro, Christian Brauner, Randy Dunlap,
	Jeff Layton, Kent Overstreet, linux-doc, linux-kernel, nvdimm,
	linux-cxl, linux-fsdevel, Amir Goldstein, Jonathan Cameron,
	Stefan Hajnoczi, Joanne Koong, Josef Bacik, Aravind Ramesh,
	Ajay Joshi

On Thu, 14 Aug 2025 at 19:19, Darrick J. Wong <djwong@kernel.org> wrote:
> What happens if you want to have a fuse server that hosts both famfs
> files /and/ backing files?  That'd be pretty crazy to mix both paths in
> one filesystem, but it's in theory possible, particularly if the famfs
> server wanted to export a pseudofile where everyone could find that
> shadow file?

Either FUSE_DEV_IOC_BACKING_OPEN detects what kind of object it has
been handed, or we add a flag that explicitly says this is a dax dev
or a block dev or a regular file.  I'd prefer the latter.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC V2 14/18] famfs_fuse: GET_DAXDEV message and daxdev_table
  2025-08-14 18:25       ` Miklos Szeredi
@ 2025-08-14 18:55         ` Darrick J. Wong
  2025-08-14 19:19           ` Miklos Szeredi
  2025-08-16 16:22         ` John Groves
  1 sibling, 1 reply; 91+ messages in thread
From: Darrick J. Wong @ 2025-08-14 18:55 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: John Groves, Dan Williams, Bernd Schubert, John Groves,
	Jonathan Corbet, Vishal Verma, Dave Jiang, Matthew Wilcox,
	Jan Kara, Alexander Viro, Christian Brauner, Randy Dunlap,
	Jeff Layton, Kent Overstreet, linux-doc, linux-kernel, nvdimm,
	linux-cxl, linux-fsdevel, Amir Goldstein, Jonathan Cameron,
	Stefan Hajnoczi, Joanne Koong, Josef Bacik, Aravind Ramesh,
	Ajay Joshi

On Thu, Aug 14, 2025 at 08:25:23PM +0200, Miklos Szeredi wrote:
> On Thu, 14 Aug 2025 at 19:19, Darrick J. Wong <djwong@kernel.org> wrote:
> > What happens if you want to have a fuse server that hosts both famfs
> > files /and/ backing files?  That'd be pretty crazy to mix both paths in
> > one filesystem, but it's in theory possible, particularly if the famfs
> > server wanted to export a pseudofile where everyone could find that
> > shadow file?
> 
> Either FUSE_DEV_IOC_BACKING_OPEN detects what kind of object it has
> been handed, or we add a flag that explicitly says this is a dax dev
> or a block dev or a regular file.  I'd prefer the latter.

I don't think it's difficult to do something like:

	if (!fud)
		return -EPERM;

	if (copy_from_user(&map, argp, sizeof(map)))
		return -EFAULT;

	if (IS_ENABLED(CONFIG_FUSE_IOMAP)) {
		ret = fuse_iomap_dev_add(fud->fc, &map);
		if (ret)
			return ret;
	}

	if (IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
		return fuse_backing_open(fud->fc, &map);

	return 0;

I guess the hard part is -- how do we return /two/ device cookies?
Or do we move the backing_files_map out of CONFIG_FUSE_PASSTHROUGH and
then let fuse-iomap/famfs extract the block/dax device from that?
Then the backing_id/device cookie would be the same across a fuse mount.
iomap would have to check that it's being given block devices, but
that's easy.

--D

> Thanks,
> Miklos
> 

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC V2 14/18] famfs_fuse: GET_DAXDEV message and daxdev_table
  2025-08-14 18:55         ` Darrick J. Wong
@ 2025-08-14 19:19           ` Miklos Szeredi
  0 siblings, 0 replies; 91+ messages in thread
From: Miklos Szeredi @ 2025-08-14 19:19 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: John Groves, Dan Williams, Bernd Schubert, John Groves,
	Jonathan Corbet, Vishal Verma, Dave Jiang, Matthew Wilcox,
	Jan Kara, Alexander Viro, Christian Brauner, Randy Dunlap,
	Jeff Layton, Kent Overstreet, linux-doc, linux-kernel, nvdimm,
	linux-cxl, linux-fsdevel, Amir Goldstein, Jonathan Cameron,
	Stefan Hajnoczi, Joanne Koong, Josef Bacik, Aravind Ramesh,
	Ajay Joshi

On Thu, 14 Aug 2025 at 20:55, Darrick J. Wong <djwong@kernel.org> wrote:

> Or do we move the backing_files_map out of CONFIG_FUSE_PASSTHROUGH and
> then let fuse-iomap/famfs extract the block/dax device from that?
> Then the backing_id/device cookie would be the same across a fuse mount.

Yes.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC V2 11/18] famfs_fuse: Basic famfs mount opts
  2025-08-14 15:19             ` Miklos Szeredi
@ 2025-08-14 23:52               ` John Groves
  0 siblings, 0 replies; 91+ messages in thread
From: John Groves @ 2025-08-14 23:52 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Darrick J. Wong, Dan Williams, Bernd Schubert, John Groves,
	Jonathan Corbet, Vishal Verma, Dave Jiang, Matthew Wilcox,
	Jan Kara, Alexander Viro, Christian Brauner, Randy Dunlap,
	Jeff Layton, Kent Overstreet, linux-doc, linux-kernel, nvdimm,
	linux-cxl, linux-fsdevel, Amir Goldstein, Jonathan Cameron,
	Stefan Hajnoczi, Joanne Koong, Josef Bacik, Aravind Ramesh,
	Ajay Joshi

On 25/08/14 05:19PM, Miklos Szeredi wrote:
> On Thu, 14 Aug 2025 at 16:39, John Groves <John@groves.net> wrote:
> 
> > Having a generic approach rather than a '-o' option would be fine with me.
> > Also happy to entertain other ideas...
> 
> We could just allow arbitrary options to be set by the server.  It
> might break cases where the server just passes unknown options down
> into the kernel, which currently are rejected.  I don't think this is
> common practice, but still it sounds a bit risky.
> 
> Alternatively allow INIT_REPLY to set up misc options, which can only
> be done explicitly, so no risk there.
> 
> Thanks,
> Miklos

I'll take a look at INIT_REPLY; if I can make sense of it, I'll try something
based on that in V3. Or I may have questions...

Thanks,
John


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC V2 12/18] famfs_fuse: Plumb the GET_FMAP message/response
  2025-08-14 13:36   ` Miklos Szeredi
  2025-08-14 14:36     ` Miklos Szeredi
  2025-08-14 18:05     ` Darrick J. Wong
@ 2025-08-15  0:38     ` John Groves
  2 siblings, 0 replies; 91+ messages in thread
From: John Groves @ 2025-08-15  0:38 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Dan Williams, Miklos Szeredi, Bernd Schubert, John Groves,
	Jonathan Corbet, Vishal Verma, Dave Jiang, Matthew Wilcox,
	Jan Kara, Alexander Viro, Christian Brauner, Darrick J . Wong,
	Randy Dunlap, Jeff Layton, Kent Overstreet, linux-doc,
	linux-kernel, nvdimm, linux-cxl, linux-fsdevel, Amir Goldstein,
	Jonathan Cameron, Stefan Hajnoczi, Joanne Koong, Josef Bacik,
	Aravind Ramesh, Ajay Joshi

On 25/08/14 03:36PM, Miklos Szeredi wrote:
> On Thu, 3 Jul 2025 at 20:54, John Groves <John@groves.net> wrote:
> >
> > Upon completion of an OPEN, if we're in famfs-mode we do a GET_FMAP to
> > retrieve and cache up the file-to-dax map in the kernel. If this
> > succeeds, read/write/mmap are resolved direct-to-dax with no upcalls.
> 
> Nothing to do at this time unless you want a side project:  doing this
> with compound requests would save a roundtrip (OPEN + GET_FMAP in one
> go).

I'm thinking that's an opportunity for improvement after the basic mechanism
is in ;)

> 
> > GET_FMAP has a variable-size response payload, and the allocated size
> > is sent in the in_args[0].size field. If the fmap would overflow the
> > message, the fuse server sends a reply of size 'sizeof(uint32_t)' which
> > specifies the size of the fmap message. Then the kernel can realloc a
> > large enough buffer and try again.
> 
> There is a better way to do this: the allocation can happen when we
> get the response.  Just need to add infrastructure to dev.c.

OK, makes sense. Will take a run at this. Might drop back and go with a hard
limit and relax it later. Famfs fmaps won't grow unbounded near term...

> 
> > diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> > index 6c384640c79b..dff5aa62543e 100644
> > --- a/include/uapi/linux/fuse.h
> > +++ b/include/uapi/linux/fuse.h
> > @@ -654,6 +654,10 @@ enum fuse_opcode {
> >         FUSE_TMPFILE            = 51,
> >         FUSE_STATX              = 52,
> >
> > +       /* Famfs / devdax opcodes */
> > +       FUSE_GET_FMAP           = 53,
> > +       FUSE_GET_DAXDEV         = 54,
> 
> Introduced too early.

You mean FUSE_GET_DAXDEV I presume (which is not used until 2 patches later? 
Right, will fix.

> 
> > +
> >         /* CUSE specific operations */
> >         CUSE_INIT               = 4096,
> >
> > @@ -888,6 +892,16 @@ struct fuse_access_in {
> >         uint32_t        padding;
> >  };
> >
> > +struct fuse_get_fmap_in {
> > +       uint32_t        size;
> > +       uint32_t        padding;
> > +};
> 
> As noted, passing size to server really makes no sense.  I'd just omit
> fuse_get_fmap_in completely.

OK, I think I understand; Will rework in v3.

Same idea as "better way" above...

> 
> > +
> > +struct fuse_get_fmap_out {
> > +       uint32_t        size;
> > +       uint32_t        padding;
> > +};
> > +
> >  struct fuse_init_in {
> >         uint32_t        major;
> >         uint32_t        minor;
> > @@ -1284,4 +1298,8 @@ struct fuse_uring_cmd_req {
> >         uint8_t padding[6];
> >  };
> >
> > +/* Famfs fmap message components */
> > +
> > +#define FAMFS_FMAP_MAX 32768 /* Largest supported fmap message */
> > +
> 
> Hmm, Darrick's interface gets one extents at a time.   This one tries
> to get the whole map in one go.
> 
> The single extent thing can be inefficient even for plain block fs, so
> it would be nice to get multiple extents.  The whole map has an
> artificial limit that currently may seem sufficient but down the line
> could cause pain.
> 
> I'm still hoping some common ground would benefit both interfaces.
> Just not sure what it should be.
> 
> Thanks,
> Miklos

At one point Darrick and I discussed retrieving a [file: offset, length] range 
of extents (i.e. request describes what it wants, and reply describes what 
range of the file it covers). I'm not sure it will make sense for famfs to 
retrieve anything but the whole file's map, but I know it might in Darrick's 
case.

I could imagine an update to GET_FMAP (possibly with a differnet name) that 
requests an offset range, and then receives a (possibly different) range that 
is intended to match or exceed the requested range.

It seems like we might be able to share the same command to retrieve extents, 
provided the response starts with a header that allows us to have separate 
(and presumably extensible) payload formats. No doubt Darrick will have 
thoughts on this :D

I don't think we can merge our "fmap" formats; famfs uses either short
extent lists or a format that is efficient for repeating interleave patterns,
and wants to cache the entire fmap.  ...which is not likely to match Darrick's 
pattern, but we might be able to share the same retrieval message/response.

Thanks!
John


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC V2 12/18] famfs_fuse: Plumb the GET_FMAP message/response
  2025-08-14 18:20       ` Darrick J. Wong
@ 2025-08-15 15:06         ` John Groves
  2025-08-19 21:55           ` Darrick J. Wong
  0 siblings, 1 reply; 91+ messages in thread
From: John Groves @ 2025-08-15 15:06 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Miklos Szeredi, Dan Williams, Bernd Schubert, John Groves,
	Jonathan Corbet, Vishal Verma, Dave Jiang, Matthew Wilcox,
	Jan Kara, Alexander Viro, Christian Brauner, Randy Dunlap,
	Jeff Layton, Kent Overstreet, linux-doc, linux-kernel, nvdimm,
	linux-cxl, linux-fsdevel, Amir Goldstein, Jonathan Cameron,
	Stefan Hajnoczi, Joanne Koong, Josef Bacik, Aravind Ramesh,
	Ajay Joshi

On 25/08/14 11:20AM, Darrick J. Wong wrote:
> On Thu, Aug 14, 2025 at 04:36:57PM +0200, Miklos Szeredi wrote:
> > On Thu, 14 Aug 2025 at 15:36, Miklos Szeredi <miklos@szeredi.hu> wrote:
> > 
> > > I'm still hoping some common ground would benefit both interfaces.
> > > Just not sure what it should be.
> > 
> > Something very high level:
> > 
> >  - allow several map formats: say a plain one with a list of extents
> > and a famfs one
> 
> Yes, I think that's needed.

Agreed

> 
> >  - allow several types of backing files: say regular and dax dev
> 
> "block device", for iomap.
> 
> >  - querying maps has a common protocol, format of maps is opaque to this
> >  - maps are cached by a common facility
> 
> I've written such a cache already. :)

I guess I need to take a look at that. Can you point me to the right place?

> 
> >  - each type of mapping has a decoder module
> 
> I don't know that you need much "decoding" -- for famfs, the regular
> mappings correspond to FUSE_IOMAP_TYPE_MAPPED.  The one goofy part is
> the device cookie in each IO mapping: fuse-iomap maps each block device
> you give it to a device cookie, so I guess famfs will have to do the
> same.
> 
> OTOH you can then have a famfs backed by many persistent memory
> devices.

That's handled in the famfs fmaps already. When an fmap is ingested,
if it references any previously-unknown daxdevs, they get retrieved
(FUSE_GET_DAXDEV).

Oversimplifying a bit, I assume that famfs fmaps won't really change,
they'll just be retrieved by a more flexible method and be preceded
by a header that identifies the payload as a famfs fmap.

> 
> >  - each type of backing file has a module for handling I/O
> > 
> > Does this make sense?
> 
> More or less.

I'm nervous about going for too much generalization too soon here,
but otherwise yeah.

> 
> > This doesn't have to be implemented in one go, but for example
> > GET_FMAP could be renamed to GET_READ_MAP with an added offset and
> > size parameter.  For famfs the offset/size would be set to zero/inf.
> > I'd be content with that for now.
> 
> I'll try to cough up a RFC v4 next week.

Darrick, let's try to chat next week to compare notes.

Based on this thinking, I will keep my rework of GET_FMAP to a minimum
since that will likely be a new shared message/response. I think that
part can be merged later in the cycle...

John


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC V2 14/18] famfs_fuse: GET_DAXDEV message and daxdev_table
  2025-08-14 13:58   ` Miklos Szeredi
  2025-08-14 17:19     ` Darrick J. Wong
@ 2025-08-15 16:38     ` John Groves
  2025-08-19 22:34       ` Darrick J. Wong
  1 sibling, 1 reply; 91+ messages in thread
From: John Groves @ 2025-08-15 16:38 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Dan Williams, Bernd Schubert, John Groves, Jonathan Corbet,
	Vishal Verma, Dave Jiang, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Darrick J . Wong, Randy Dunlap,
	Jeff Layton, Kent Overstreet, linux-doc, linux-kernel, nvdimm,
	linux-cxl, linux-fsdevel, Amir Goldstein, Jonathan Cameron,
	Stefan Hajnoczi, Joanne Koong, Josef Bacik, Aravind Ramesh,
	Ajay Joshi, john

On 25/08/14 03:58PM, Miklos Szeredi wrote:
> On Thu, 3 Jul 2025 at 20:54, John Groves <John@groves.net> wrote:
> >
> > * The new GET_DAXDEV message/response is enabled
> > * The command it triggered by the update_daxdev_table() call, if there
> >   are any daxdevs in the subject fmap that are not represented in the
> >   daxdev_dable yet.
> 
> This is rather convoluted, the server *should know* which dax devices
> it has registered, hence it shouldn't need to be explicitly asked.

That's not impossible, but it's also a bit harder than the current
approach for the famfs user space - which I think would need to become
stateful as to which daxdevs had been pushed into the kernel. The
famfs user space is as unstateful as possible ;)

> 
> And there's already an API for registering file descriptors:
> FUSE_DEV_IOC_BACKING_OPEN.  Is there a reason that interface couldn't
> be used by famfs?

FUSE_DEV_IOC_BACKING_OPEN looks pretty specific to passthrough. The
procedure for opening a daxdev is stolen from the way xfs does it in 
fs-dax mode. It calls fs_dax_get() rather then open(), and passes in 
'famfs_fuse_dax_holder_ops' to 1) effect exclusivity, and 2) receive
callbacks in the event of memory errors. See famfs_fuse_get_daxdev()...

A *similar* ioctl could be added to push in a daxdev, but that would
still add statefulness that isn't required in the current implementation.
I didn't go there because there are so few IOCTLs currently (the overall 
model is more 'pull' than 'push').

Keep in mind that the baseline case with famfs will be files that are 
interleaved across strips from many daxdevs. We commonly create files 
that are striped across 16 daxdevs, selected at random from a
significantly larger pool. Because interleaving is essential to memory 
performance...

There is no "device mapper" analog for memory, so famfs really does 
have to span daxdevs. As Darrick commented somewhere, famfs fmaps do 
repeating patterns well (i.e. striping)...

I think there is a certain elegance to the current approach, but
if you feel strongly I will change it.

Thanks!
John


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC V2 12/18] famfs_fuse: Plumb the GET_FMAP message/response
  2025-08-14 14:36     ` Miklos Szeredi
  2025-08-14 18:20       ` Darrick J. Wong
@ 2025-08-15 16:53       ` John Groves
  2025-08-19 22:13         ` Darrick J. Wong
  1 sibling, 1 reply; 91+ messages in thread
From: John Groves @ 2025-08-15 16:53 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Dan Williams, Bernd Schubert, John Groves, Jonathan Corbet,
	Vishal Verma, Dave Jiang, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Darrick J . Wong, Randy Dunlap,
	Jeff Layton, Kent Overstreet, linux-doc, linux-kernel, nvdimm,
	linux-cxl, linux-fsdevel, Amir Goldstein, Jonathan Cameron,
	Stefan Hajnoczi, Joanne Koong, Josef Bacik, Aravind Ramesh,
	Ajay Joshi, john

On 25/08/14 04:36PM, Miklos Szeredi wrote:
> On Thu, 14 Aug 2025 at 15:36, Miklos Szeredi <miklos@szeredi.hu> wrote:
> 
> > I'm still hoping some common ground would benefit both interfaces.
> > Just not sure what it should be.
> 
> Something very high level:
> 
>  - allow several map formats: say a plain one with a list of extents
> and a famfs one
>  - allow several types of backing files: say regular and dax dev
>  - querying maps has a common protocol, format of maps is opaque to this
>  - maps are cached by a common facility
>  - each type of mapping has a decoder module
>  - each type of backing file has a module for handling I/O
> 
> Does this make sense?
> 
> This doesn't have to be implemented in one go, but for example
> GET_FMAP could be renamed to GET_READ_MAP with an added offset and
> size parameter.  For famfs the offset/size would be set to zero/inf.
> I'd be content with that for now.

Maybe GET_FILE_MAP or GET_FILE_IOMAP if we want to keep overloading 
the term iomap. Maps are to backing-dev for regular file systems,
and to device memory (devdax) for famfs - in all cases both read
and write (when write is allowed).

Thanks,
John


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC V2 12/18] famfs_fuse: Plumb the GET_FMAP message/response
  2025-08-14 18:05     ` Darrick J. Wong
@ 2025-08-16 15:00       ` John Groves
  2025-08-19 22:17         ` Darrick J. Wong
  0 siblings, 1 reply; 91+ messages in thread
From: John Groves @ 2025-08-16 15:00 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Miklos Szeredi, Dan Williams, Miklos Szeredi, Bernd Schubert,
	John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Randy Dunlap, Jeff Layton, Kent Overstreet, linux-doc,
	linux-kernel, nvdimm, linux-cxl, linux-fsdevel, Amir Goldstein,
	Jonathan Cameron, Stefan Hajnoczi, Joanne Koong, Josef Bacik,
	Aravind Ramesh, Ajay Joshi, john

On 25/08/14 11:05AM, Darrick J. Wong wrote:
<snip>
> It's possible that famfs could use the mapping upsertion notification to
> upload mappings into the kernel.  As far as I can tell, fuse servers can
> send notifications even when they're in the middle of handling a fuse
> request, so the famfs daemon's ->open function could upload mappings
> before completing the open operation.
> 

Famfs dax mappings don't change (and might or might not ever change).
Plus, famfs is exposing memory, so it must run at memory speed - which
is why it needs to cache the entire fmap for any active file. That way
mapping faults happen at lookup-in-fmap speed (which is order 1 for
interleaved fmaps, and order-small-n for non-interleaved.

I wouldn't rule out ever using upsert, but probably not before we
integrate famfs with PNFS, or some other major generalizing event.

Thanks,
John

<snip>


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC V2 14/18] famfs_fuse: GET_DAXDEV message and daxdev_table
  2025-08-14 18:25       ` Miklos Szeredi
  2025-08-14 18:55         ` Darrick J. Wong
@ 2025-08-16 16:22         ` John Groves
  2025-08-19 22:32           ` Darrick J. Wong
  1 sibling, 1 reply; 91+ messages in thread
From: John Groves @ 2025-08-16 16:22 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Darrick J. Wong, Dan Williams, Bernd Schubert, John Groves,
	Jonathan Corbet, Vishal Verma, Dave Jiang, Matthew Wilcox,
	Jan Kara, Alexander Viro, Christian Brauner, Randy Dunlap,
	Jeff Layton, Kent Overstreet, linux-doc, linux-kernel, nvdimm,
	linux-cxl, linux-fsdevel, Amir Goldstein, Jonathan Cameron,
	Stefan Hajnoczi, Joanne Koong, Josef Bacik, Aravind Ramesh,
	Ajay Joshi, john

On 25/08/14 08:25PM, Miklos Szeredi wrote:
> On Thu, 14 Aug 2025 at 19:19, Darrick J. Wong <djwong@kernel.org> wrote:
> > What happens if you want to have a fuse server that hosts both famfs
> > files /and/ backing files?  That'd be pretty crazy to mix both paths in
> > one filesystem, but it's in theory possible, particularly if the famfs
> > server wanted to export a pseudofile where everyone could find that
> > shadow file?
> 
> Either FUSE_DEV_IOC_BACKING_OPEN detects what kind of object it has
> been handed, or we add a flag that explicitly says this is a dax dev
> or a block dev or a regular file.  I'd prefer the latter.
> 
> Thanks,
> Miklos

I have future ideas of famfs supporting non-dax-memory files in a mixed
namespace with normal famfs dax files. This seems like the simplest way 
to relax the "files are strictly pre-allocated" rule. But I think this 
is orthogonal to how fmaps and backing devs are passed into the kernel. 

The way I'm thinking about it, the difference would be handled in
read/write/mmap. Taking fuse_file_read_iter as the example, the code 
currently looks like this:

	if (FUSE_IS_VIRTIO_DAX(fi))
		return fuse_dax_read_iter(iocb, to);
	if (fuse_file_famfs(fi))
		return famfs_fuse_read_iter(iocb, to);

	/* FOPEN_DIRECT_IO overrides FOPEN_PASSTHROUGH */
	if (ff->open_flags & FOPEN_DIRECT_IO)
		return fuse_direct_read_iter(iocb, to);
	else if (fuse_file_passthrough(ff))
		return fuse_passthrough_read_iter(iocb, to);
	else
		return fuse_cache_read_iter(iocb, to);

If the famfs fuse servert wants a particular file handled via another 
mechanism -- e.g. READ message to server or passthrough -- the famfs 
fuse server can just provide an fmap that indicates such.  Then 
fuse_file_famfs(fi) would return false for that file, and it would be 
handled through other existing mechanisms (which the famfs fuse 
server would have to handle correctly).

Famfs could, for example, allow files to be created as generic or
passthrough, and then have a "commit" step that allocated dax memory, 
moved the data from a non-dax into dax, and appended the file to the 
famfs metadata log - flipping the file to full-monty-famfs (tm). 
Prior to the "commit", performance is less but all manner of mutations 
could be allowed.

So I don't think this looks very be hard, and it's independent of the 
mechanism by which fmaps get into the kernel.

Regards,
John



^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC V2 12/18] famfs_fuse: Plumb the GET_FMAP message/response
  2025-08-15 15:06         ` John Groves
@ 2025-08-19 21:55           ` Darrick J. Wong
  0 siblings, 0 replies; 91+ messages in thread
From: Darrick J. Wong @ 2025-08-19 21:55 UTC (permalink / raw)
  To: John Groves
  Cc: Miklos Szeredi, Dan Williams, Bernd Schubert, John Groves,
	Jonathan Corbet, Vishal Verma, Dave Jiang, Matthew Wilcox,
	Jan Kara, Alexander Viro, Christian Brauner, Randy Dunlap,
	Jeff Layton, Kent Overstreet, linux-doc, linux-kernel, nvdimm,
	linux-cxl, linux-fsdevel, Amir Goldstein, Jonathan Cameron,
	Stefan Hajnoczi, Joanne Koong, Josef Bacik, Aravind Ramesh,
	Ajay Joshi

On Fri, Aug 15, 2025 at 10:06:01AM -0500, John Groves wrote:
> On 25/08/14 11:20AM, Darrick J. Wong wrote:
> > On Thu, Aug 14, 2025 at 04:36:57PM +0200, Miklos Szeredi wrote:
> > > On Thu, 14 Aug 2025 at 15:36, Miklos Szeredi <miklos@szeredi.hu> wrote:
> > > 
> > > > I'm still hoping some common ground would benefit both interfaces.
> > > > Just not sure what it should be.
> > > 
> > > Something very high level:
> > > 
> > >  - allow several map formats: say a plain one with a list of extents
> > > and a famfs one
> > 
> > Yes, I think that's needed.
> 
> Agreed
> 
> > 
> > >  - allow several types of backing files: say regular and dax dev
> > 
> > "block device", for iomap.
> > 
> > >  - querying maps has a common protocol, format of maps is opaque to this
> > >  - maps are cached by a common facility
> > 
> > I've written such a cache already. :)
> 
> I guess I need to take a look at that. Can you point me to the right place?
> 
> > 
> > >  - each type of mapping has a decoder module
> > 
> > I don't know that you need much "decoding" -- for famfs, the regular
> > mappings correspond to FUSE_IOMAP_TYPE_MAPPED.  The one goofy part is
> > the device cookie in each IO mapping: fuse-iomap maps each block device
> > you give it to a device cookie, so I guess famfs will have to do the
> > same.
> > 
> > OTOH you can then have a famfs backed by many persistent memory
> > devices.
> 
> That's handled in the famfs fmaps already. When an fmap is ingested,
> if it references any previously-unknown daxdevs, they get retrieved
> (FUSE_GET_DAXDEV).
> 
> Oversimplifying a bit, I assume that famfs fmaps won't really change,
> they'll just be retrieved by a more flexible method and be preceded
> by a header that identifies the payload as a famfs fmap.

<nod> Well, I suppose fmaps aren't supposed to change much, but I get
the strong sense that Miklos would rather we both use the
FUSE_DEV_IOC_BACKING_OPEN interface...

> > 
> > >  - each type of backing file has a module for handling I/O
> > > 
> > > Does this make sense?
> > 
> > More or less.
> 
> I'm nervous about going for too much generalization too soon here,
> but otherwise yeah.

...and I've tried to make it simple for famfs to pick up the interface.
From the new fuse_backing_open:

	/*
	 * Each _backing_open function should either:
	 *
	 * 1. Take a ref to fb if it wants the file and return 0.
	 * 2. Return 0 without taking a ref if the backing file isn't needed.
	 * 3. Return an errno explaining why it couldn't attach.
	 *
	 * If at least one subsystem bumps the reference count to open it,
	 * we'll install it into the index and return the index.  If nobody
	 * opens the file, the error code will be passed up.  EPERM is the
	 * default.
	 */
	passthrough_res = fuse_passthrough_backing_open(fc, fb);
	iomap_res = fuse_iomap_backing_open(fc, fb);

	if (refcount_read(&fb->count) < 2)
		/* drop the fuse_backing and return one of the res */

So all your famfs_backing_open function has to do is check that fb->file
points to a pmem device.  If so, it sets fb->famfs = 1 and bumps the
fb->count refcount.

Full code here:
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=fuse-iomap-attrs_2025-08-19

I'll let the build robots find any facepalm problems and post this whole
series tomorrow.

> > > This doesn't have to be implemented in one go, but for example
> > > GET_FMAP could be renamed to GET_READ_MAP with an added offset and
> > > size parameter.  For famfs the offset/size would be set to zero/inf.
> > > I'd be content with that for now.
> > 
> > I'll try to cough up a RFC v4 next week.
> 
> Darrick, let's try to chat next week to compare notes.
> 
> Based on this thinking, I will keep my rework of GET_FMAP to a minimum
> since that will likely be a new shared message/response. I think that
> part can be merged later in the cycle...

<nod>

--D

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC V2 12/18] famfs_fuse: Plumb the GET_FMAP message/response
  2025-08-15 16:53       ` John Groves
@ 2025-08-19 22:13         ` Darrick J. Wong
  0 siblings, 0 replies; 91+ messages in thread
From: Darrick J. Wong @ 2025-08-19 22:13 UTC (permalink / raw)
  To: John Groves
  Cc: Miklos Szeredi, Dan Williams, Bernd Schubert, John Groves,
	Jonathan Corbet, Vishal Verma, Dave Jiang, Matthew Wilcox,
	Jan Kara, Alexander Viro, Christian Brauner, Randy Dunlap,
	Jeff Layton, Kent Overstreet, linux-doc, linux-kernel, nvdimm,
	linux-cxl, linux-fsdevel, Amir Goldstein, Jonathan Cameron,
	Stefan Hajnoczi, Joanne Koong, Josef Bacik, Aravind Ramesh,
	Ajay Joshi

On Fri, Aug 15, 2025 at 11:53:16AM -0500, John Groves wrote:
> On 25/08/14 04:36PM, Miklos Szeredi wrote:
> > On Thu, 14 Aug 2025 at 15:36, Miklos Szeredi <miklos@szeredi.hu> wrote:
> > 
> > > I'm still hoping some common ground would benefit both interfaces.
> > > Just not sure what it should be.
> > 
> > Something very high level:
> > 
> >  - allow several map formats: say a plain one with a list of extents
> > and a famfs one
> >  - allow several types of backing files: say regular and dax dev
> >  - querying maps has a common protocol, format of maps is opaque to this
> >  - maps are cached by a common facility
> >  - each type of mapping has a decoder module
> >  - each type of backing file has a module for handling I/O
> > 
> > Does this make sense?
> > 
> > This doesn't have to be implemented in one go, but for example
> > GET_FMAP could be renamed to GET_READ_MAP with an added offset and
> > size parameter.  For famfs the offset/size would be set to zero/inf.
> > I'd be content with that for now.
> 
> Maybe GET_FILE_MAP or GET_FILE_IOMAP if we want to keep overloading 
> the term iomap. Maps are to backing-dev for regular file systems,
> and to device memory (devdax) for famfs - in all cases both read
> and write (when write is allowed).

The calling model for fuse-iomap is the same as fs/iomap -- there's an
IOMAP_BEGIN upcall to get a mapping from the filesystem, and an
IOMAP_END upcall to tell the fuse server whatever it did with the
mapping.  Some filesystems will reserve delayed allocation reservations
in iomap_begin for a pagecache write, and need to cancel those
reservations if the write fails.

For a pagecache write you need both a read and a write mapping because
the caller's file range isn't guaranteed to be fsblock-aligned.  famfs
mappings are a subcase of iomappings -- the read & write mappings are
the same, and they're always FUSE_IOMAP_TYPE_MAPPED.

IOWs, I don't want "GET_FILE_IOMAP" because that's not how iomap works.
(There's a separate FUSE_IOMAP_IOEND to pass along IO completions from
storage)

Given that famfs just calls dax_iomap_rw with an iomap_ops struct, I
seriously wonder if I should just wire up fsdax for RFC v5 and then
let's see how much code famfs actually needs on top of that.

--D

> Thanks,
> John
> 
> 

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC V2 12/18] famfs_fuse: Plumb the GET_FMAP message/response
  2025-08-16 15:00       ` John Groves
@ 2025-08-19 22:17         ` Darrick J. Wong
  0 siblings, 0 replies; 91+ messages in thread
From: Darrick J. Wong @ 2025-08-19 22:17 UTC (permalink / raw)
  To: John Groves
  Cc: Miklos Szeredi, Dan Williams, Miklos Szeredi, Bernd Schubert,
	John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Randy Dunlap, Jeff Layton, Kent Overstreet, linux-doc,
	linux-kernel, nvdimm, linux-cxl, linux-fsdevel, Amir Goldstein,
	Jonathan Cameron, Stefan Hajnoczi, Joanne Koong, Josef Bacik,
	Aravind Ramesh, Ajay Joshi

On Sat, Aug 16, 2025 at 10:00:23AM -0500, John Groves wrote:
> On 25/08/14 11:05AM, Darrick J. Wong wrote:
> <snip>
> > It's possible that famfs could use the mapping upsertion notification to
> > upload mappings into the kernel.  As far as I can tell, fuse servers can
> > send notifications even when they're in the middle of handling a fuse
> > request, so the famfs daemon's ->open function could upload mappings
> > before completing the open operation.
> > 
> 
> Famfs dax mappings don't change (and might or might not ever change).
> Plus, famfs is exposing memory, so it must run at memory speed - which
> is why it needs to cache the entire fmap for any active file. That way
> mapping faults happen at lookup-in-fmap speed (which is order 1 for
> interleaved fmaps, and order-small-n for non-interleaved.
> 
> I wouldn't rule out ever using upsert, but probably not before we
> integrate famfs with PNFS, or some other major generalizing event.

Hrm?  No, you'd just make the famfs ->open function upsert all the
relevant mappings.  Since the mappings are all fully written and
(presumably) within EOF, they'll stay in the cache forever and you never
have to upload them ever again.

Though it's probably smarter to wait for the first ->iomap_begin to do
the upserting because you wouldn't want waste kernel memory until
something actually wants to do IO to the file.

--D

> Thanks,
> John
> 
> <snip>
> 
> 

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC V2 14/18] famfs_fuse: GET_DAXDEV message and daxdev_table
  2025-08-16 16:22         ` John Groves
@ 2025-08-19 22:32           ` Darrick J. Wong
  0 siblings, 0 replies; 91+ messages in thread
From: Darrick J. Wong @ 2025-08-19 22:32 UTC (permalink / raw)
  To: John Groves
  Cc: Miklos Szeredi, Dan Williams, Bernd Schubert, John Groves,
	Jonathan Corbet, Vishal Verma, Dave Jiang, Matthew Wilcox,
	Jan Kara, Alexander Viro, Christian Brauner, Randy Dunlap,
	Jeff Layton, Kent Overstreet, linux-doc, linux-kernel, nvdimm,
	linux-cxl, linux-fsdevel, Amir Goldstein, Jonathan Cameron,
	Stefan Hajnoczi, Joanne Koong, Josef Bacik, Aravind Ramesh,
	Ajay Joshi

On Sat, Aug 16, 2025 at 11:22:49AM -0500, John Groves wrote:
> On 25/08/14 08:25PM, Miklos Szeredi wrote:
> > On Thu, 14 Aug 2025 at 19:19, Darrick J. Wong <djwong@kernel.org> wrote:
> > > What happens if you want to have a fuse server that hosts both famfs
> > > files /and/ backing files?  That'd be pretty crazy to mix both paths in
> > > one filesystem, but it's in theory possible, particularly if the famfs
> > > server wanted to export a pseudofile where everyone could find that
> > > shadow file?
> > 
> > Either FUSE_DEV_IOC_BACKING_OPEN detects what kind of object it has
> > been handed, or we add a flag that explicitly says this is a dax dev
> > or a block dev or a regular file.  I'd prefer the latter.
> > 
> > Thanks,
> > Miklos
> 
> I have future ideas of famfs supporting non-dax-memory files in a mixed
> namespace with normal famfs dax files. This seems like the simplest way 
> to relax the "files are strictly pre-allocated" rule. But I think this 
> is orthogonal to how fmaps and backing devs are passed into the kernel. 
> 
> The way I'm thinking about it, the difference would be handled in
> read/write/mmap. Taking fuse_file_read_iter as the example, the code 
> currently looks like this:
> 
> 	if (FUSE_IS_VIRTIO_DAX(fi))
> 		return fuse_dax_read_iter(iocb, to);
> 	if (fuse_file_famfs(fi))
> 		return famfs_fuse_read_iter(iocb, to);
> 
> 	/* FOPEN_DIRECT_IO overrides FOPEN_PASSTHROUGH */
> 	if (ff->open_flags & FOPEN_DIRECT_IO)
> 		return fuse_direct_read_iter(iocb, to);
> 	else if (fuse_file_passthrough(ff))
> 		return fuse_passthrough_read_iter(iocb, to);
> 	else
> 		return fuse_cache_read_iter(iocb, to);
> 
> If the famfs fuse servert wants a particular file handled via another 
> mechanism -- e.g. READ message to server or passthrough -- the famfs 
> fuse server can just provide an fmap that indicates such.  Then 
> fuse_file_famfs(fi) would return false for that file, and it would be 
> handled through other existing mechanisms (which the famfs fuse 
> server would have to handle correctly).
> 
> Famfs could, for example, allow files to be created as generic or
> passthrough, and then have a "commit" step that allocated dax memory, 
> moved the data from a non-dax into dax, and appended the file to the 
> famfs metadata log - flipping the file to full-monty-famfs (tm). 
> Prior to the "commit", performance is less but all manner of mutations 
> could be allowed.
> 
> So I don't think this looks very be hard, and it's independent of the 
> mechanism by which fmaps get into the kernel.

This is one thing I wasn't planning -- iomap files are always that, and
there's no fallback to any of the other IO strategies.  The pagecache
handling parts of iomap require things such as i_rwsem controlling
access to a file no matter how many places it's hardlinked, and
timestamp/mode/acl handling working more or less the same way they do in
xfs and ext4.  iomap isn't all that congruent with the way that the
other IO paths (passthrough, writeback_cache, and "directio" files)
work.

Though to undercut my own point partially, sending an "inline data"
mapping to the kernel causes it to call FUSE_READ/FUSE_WRITE and then
you can inject whatever IO path you want.  OTOH the iomap inlinedata
paths are ... not well tested for pos > 0.

--D

> Regards,
> John
> 
> 
> 

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC V2 14/18] famfs_fuse: GET_DAXDEV message and daxdev_table
  2025-08-15 16:38     ` John Groves
@ 2025-08-19 22:34       ` Darrick J. Wong
  0 siblings, 0 replies; 91+ messages in thread
From: Darrick J. Wong @ 2025-08-19 22:34 UTC (permalink / raw)
  To: John Groves
  Cc: Miklos Szeredi, Dan Williams, Bernd Schubert, John Groves,
	Jonathan Corbet, Vishal Verma, Dave Jiang, Matthew Wilcox,
	Jan Kara, Alexander Viro, Christian Brauner, Randy Dunlap,
	Jeff Layton, Kent Overstreet, linux-doc, linux-kernel, nvdimm,
	linux-cxl, linux-fsdevel, Amir Goldstein, Jonathan Cameron,
	Stefan Hajnoczi, Joanne Koong, Josef Bacik, Aravind Ramesh,
	Ajay Joshi

On Fri, Aug 15, 2025 at 11:38:02AM -0500, John Groves wrote:
> On 25/08/14 03:58PM, Miklos Szeredi wrote:
> > On Thu, 3 Jul 2025 at 20:54, John Groves <John@groves.net> wrote:
> > >
> > > * The new GET_DAXDEV message/response is enabled
> > > * The command it triggered by the update_daxdev_table() call, if there
> > >   are any daxdevs in the subject fmap that are not represented in the
> > >   daxdev_dable yet.
> > 
> > This is rather convoluted, the server *should know* which dax devices
> > it has registered, hence it shouldn't need to be explicitly asked.
> 
> That's not impossible, but it's also a bit harder than the current
> approach for the famfs user space - which I think would need to become
> stateful as to which daxdevs had been pushed into the kernel. The
> famfs user space is as unstateful as possible ;)
> 
> > 
> > And there's already an API for registering file descriptors:
> > FUSE_DEV_IOC_BACKING_OPEN.  Is there a reason that interface couldn't
> > be used by famfs?
> 
> FUSE_DEV_IOC_BACKING_OPEN looks pretty specific to passthrough. The
> procedure for opening a daxdev is stolen from the way xfs does it in 
> fs-dax mode. It calls fs_dax_get() rather then open(), and passes in 
> 'famfs_fuse_dax_holder_ops' to 1) effect exclusivity, and 2) receive
> callbacks in the event of memory errors. See famfs_fuse_get_daxdev()...

Yeah, that's about what I would do to wire up fsdax in fuse-iomap.

> A *similar* ioctl could be added to push in a daxdev, but that would
> still add statefulness that isn't required in the current implementation.
> I didn't go there because there are so few IOCTLs currently (the overall 
> model is more 'pull' than 'push').
> 
> Keep in mind that the baseline case with famfs will be files that are 
> interleaved across strips from many daxdevs. We commonly create files 
> that are striped across 16 daxdevs, selected at random from a
> significantly larger pool. Because interleaving is essential to memory 
> performance...
> 
> There is no "device mapper" analog for memory, so famfs really does 
> have to span daxdevs. As Darrick commented somewhere, famfs fmaps do 
> repeating patterns well (i.e. striping)...
> 
> I think there is a certain elegance to the current approach, but
> if you feel strongly I will change it.

I still kinda wonder if you actually want BPF for this sort of thing
(programmatically computed file IO mappings) since they'd give you more
flexibility than hardcoded C in the kernel.

--D

> Thanks!
> John
> 
> 

^ permalink raw reply	[flat|nested] 91+ messages in thread

end of thread, other threads:[~2025-08-19 22:34 UTC | newest]

Thread overview: 91+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-07-03 18:50 [RFC V2 00/18] famfs: port into fuse John Groves
2025-07-03 18:50 ` [RFC V2 01/18] dev_dax_iomap: Move dax_pgoff_to_phys() from device.c to bus.c John Groves
2025-07-03 18:50 ` [RFC V2 02/18] dev_dax_iomap: Add fs_dax_get() func to prepare dax for fs-dax usage John Groves
2025-07-04 10:39   ` Jonathan Cameron
2025-07-04 12:54     ` John Groves
2025-07-03 18:50 ` [RFC V2 03/18] dev_dax_iomap: Save the kva from memremap John Groves
2025-07-04 11:11   ` Jonathan Cameron
2025-07-03 18:50 ` [RFC V2 04/18] dev_dax_iomap: Add dax_operations for use by fs-dax on devdax John Groves
2025-07-04 12:47   ` Jonathan Cameron
2025-07-05 22:56     ` John Groves
2025-07-03 18:50 ` [RFC V2 05/18] dev_dax_iomap: export dax_dev_get() John Groves
2025-07-03 18:50 ` [RFC V2 06/18] dev_dax_iomap: (ignore!) Drop poisoned page warning in fs/dax.c John Groves
2025-07-03 18:50 ` [RFC V2 07/18] famfs_fuse: magic.h: Add famfs magic numbers John Groves
2025-07-03 18:50 ` [RFC V2 08/18] famfs_fuse: Kconfig John Groves
2025-07-03 18:50 ` [RFC V2 09/18] famfs_fuse: Update macro s/FUSE_IS_DAX/FUSE_IS_VIRTIO_DAX/ John Groves
2025-07-04  8:44   ` Amir Goldstein
2025-07-03 18:50 ` [RFC V2 10/18] famfs_fuse: Basic fuse kernel ABI enablement for famfs John Groves
2025-07-03 22:45   ` John Groves
2025-07-07 17:32     ` Darrick J. Wong
2025-07-04  7:54   ` Amir Goldstein
2025-07-04 13:39     ` John Groves
2025-07-07 17:39       ` Darrick J. Wong
2025-07-08 12:02         ` John Groves
2025-07-09  1:53           ` Darrick J. Wong
2025-07-11  1:32             ` John Groves
2025-07-12  4:49               ` Darrick J. Wong
2025-08-11 18:30               ` John Groves
2025-08-12 16:37                 ` Darrick J. Wong
2025-08-13 13:07                   ` John Groves
2025-08-14 17:16                     ` Darrick J. Wong
2025-07-03 18:50 ` [RFC V2 11/18] famfs_fuse: Basic famfs mount opts John Groves
2025-07-09  3:59   ` Darrick J. Wong
2025-07-11 15:28     ` John Groves
2025-07-12  5:54       ` Darrick J. Wong
2025-08-14 10:37         ` Miklos Szeredi
2025-08-14 14:39           ` John Groves
2025-08-14 15:19             ` Miklos Szeredi
2025-08-14 23:52               ` John Groves
2025-07-03 18:50 ` [RFC V2 12/18] famfs_fuse: Plumb the GET_FMAP message/response John Groves
2025-07-04  8:54   ` Amir Goldstein
2025-07-04 20:30     ` John Groves
2025-07-05  0:06       ` John Groves
2025-07-05  7:58         ` Amir Goldstein
2025-07-05 19:17           ` John Groves
2025-07-09  4:27   ` Darrick J. Wong
2025-07-11 13:46     ` John Groves
2025-08-14 13:36   ` Miklos Szeredi
2025-08-14 14:36     ` Miklos Szeredi
2025-08-14 18:20       ` Darrick J. Wong
2025-08-15 15:06         ` John Groves
2025-08-19 21:55           ` Darrick J. Wong
2025-08-15 16:53       ` John Groves
2025-08-19 22:13         ` Darrick J. Wong
2025-08-14 18:05     ` Darrick J. Wong
2025-08-16 15:00       ` John Groves
2025-08-19 22:17         ` Darrick J. Wong
2025-08-15  0:38     ` John Groves
2025-07-03 18:50 ` [RFC V2 13/18] famfs_fuse: Create files with famfs fmaps John Groves
2025-07-04  9:01   ` Amir Goldstein
2025-07-05 19:27     ` John Groves
2025-07-03 18:50 ` [RFC V2 14/18] famfs_fuse: GET_DAXDEV message and daxdev_table John Groves
2025-07-04 13:20   ` Jonathan Cameron
2025-07-06 17:07     ` John Groves
2025-08-14 13:58   ` Miklos Szeredi
2025-08-14 17:19     ` Darrick J. Wong
2025-08-14 18:25       ` Miklos Szeredi
2025-08-14 18:55         ` Darrick J. Wong
2025-08-14 19:19           ` Miklos Szeredi
2025-08-16 16:22         ` John Groves
2025-08-19 22:32           ` Darrick J. Wong
2025-08-15 16:38     ` John Groves
2025-08-19 22:34       ` Darrick J. Wong
2025-07-03 18:50 ` [RFC V2 15/18] famfs_fuse: Plumb dax iomap and fuse read/write/mmap John Groves
2025-07-04  9:13   ` Amir Goldstein
2025-07-05 19:44     ` John Groves
2025-07-03 18:50 ` [RFC V2 16/18] famfs_fuse: Add holder_operations for dax notify_failure() John Groves
2025-07-03 18:50 ` [RFC V2 17/18] famfs_fuse: Add famfs metadata documentation John Groves
2025-07-03 18:50 ` [RFC V2 18/18] famfs_fuse: Add documentation John Groves
2025-07-04  0:27   ` Bagas Sanjaya
2025-07-04  2:22     ` Jonathan Corbet
2025-07-04  3:53       ` Bagas Sanjaya
2025-07-04 18:58         ` Matthew Wilcox
2025-07-04 23:29           ` Bagas Sanjaya
2025-07-04 23:43             ` Matthew Wilcox
2025-07-05  1:11               ` Bagas Sanjaya
2025-07-04  6:09   ` Randy Dunlap
2025-07-04  8:27   ` Amir Goldstein
2025-07-04 23:36     ` Bagas Sanjaya
2025-07-03 18:56 ` [RFC V2 00/18] famfs: port into fuse John Groves
2025-07-09  3:26   ` Miklos Szeredi
2025-07-11  1:18     ` John Groves

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).