linux-fsdevel.vger.kernel.org archive mirror
* [RFC PATCH 00/19] famfs: port into fuse
@ 2025-04-21  1:33 John Groves
  2025-04-21  1:33 ` [RFC PATCH 01/19] dev_dax_iomap: Move dax_pgoff_to_phys() from device.c to bus.c John Groves
                   ` (21 more replies)
  0 siblings, 22 replies; 58+ messages in thread
From: John Groves @ 2025-04-21  1:33 UTC (permalink / raw)
  To: John Groves, Dan Williams, Miklos Szeredi, Bernd Schubert
  Cc: John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Darrick J . Wong, Luis Henriques, Randy Dunlap, Jeff Layton,
	Kent Overstreet, Petr Vorel, Brian Foster, linux-doc,
	linux-kernel, nvdimm, linux-cxl, linux-fsdevel, Amir Goldstein,
	Jonathan Cameron, Stefan Hajnoczi, Joanne Koong, Josef Bacik,
	Aravind Ramesh, Ajay Joshi, John Groves

Subject: famfs: port into fuse

This is the initial RFC for the fabric-attached memory file system (famfs)
integration into fuse. In order to function, this requires a related patch
to libfuse [1] and the famfs user space [2]. 

This RFC is mainly intended to socialize the approach and get feedback from
the fuse developers and maintainers. There is some dax work that needs to
be done before this should be merged (see the "poisoned page|folio problem"
below).

This patch set fully works with Linux 6.14 -- passing all existing famfs
smoke and unit tests -- and I encourage existing famfs users to test it.

This is really two patch sets mashed up:

* The patches with the dev_dax_iomap: prefix fill in missing functionality for
  devdax to host an fs-dax file system.
* The famfs_fuse: patches add famfs into fs/fuse/. These are effectively
  unchanged since last year.

Because this is not ready to merge yet, I have felt free to leave some debug
prints in place because we still find them useful; those will be cleaned up
in a subsequent revision.

Famfs Overview

Famfs exposes shared memory as a file system. Famfs consumes shared memory
from dax devices, and provides memory-mappable files that map directly to
the memory - no page cache involvement. Famfs differs from conventional
file systems in fs-dax mode, in that it handles in-memory metadata in a
sharable way (which begins with never caching dirty shared metadata).

Famfs started as a standalone file system [3,4], but the consensus at LSFMM
2024 [5] was that it should be ported into fuse - and this RFC is the first
public evidence that I've been working on that.

The key performance requirement is that famfs must resolve mapping faults
without upcalls. This is achieved by fully caching the file-to-devdax
metadata for all active files. This is done via two fuse client/server
message/response pairs: GET_FMAP and GET_DAXDEV.
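
As a rough illustration of why a fully cached map removes the need for
upcalls, here is a minimal user-space sketch of resolving a file offset
against a cached extent list. The struct and function names (sketch_extent,
sketch_fmap, sketch_resolve) are hypothetical and are not the layouts used
by these patches; the sketch also assumes simple extents laid end to end in
file order.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified extent; not the layout used by the patches. */
struct sketch_extent {
	uint32_t dev_index;  /* which daxdev backs this extent */
	uint64_t dev_offset; /* byte offset of the extent on that daxdev */
	uint64_t len;        /* extent length in bytes */
};

/* Hypothetical cached per-file map: extents laid end to end in file order. */
struct sketch_fmap {
	size_t nextents;
	const struct sketch_extent *ext;
};

/*
 * Resolve a file offset to (dev_index, dev_offset) using only cached data.
 * Returns the number of contiguous bytes valid at that address, or 0 if
 * the offset lies beyond the mapped extents.
 */
static uint64_t sketch_resolve(const struct sketch_fmap *fmap,
			       uint64_t file_off,
			       uint32_t *dev_index, uint64_t *dev_offset)
{
	uint64_t ext_start = 0;
	size_t i;

	for (i = 0; i < fmap->nextents; i++) {
		const struct sketch_extent *e = &fmap->ext[i];

		if (file_off < ext_start + e->len) {
			uint64_t delta = file_off - ext_start;

			*dev_index = e->dev_index;
			*dev_offset = e->dev_offset + delta;
			return e->len - delta;
		}
		ext_start += e->len;
	}
	return 0;
}
```

Every input to the lookup is already in memory, so a fault or read/write
handler built this way never has to wait on the fuse server.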

Famfs remains the first fs-dax file system that is backed by devdax rather
than pmem in fs-dax mode (hence the need for the dev_dax_iomap fixups).

Notes

* Once the dev_dax_iomap patches land, I suspect it may make sense for
  virtiofs to update to use the improved interface.

* I'm currently maintaining compatibility between the famfs user space and
  both the standalone famfs kernel file system and this new fuse
  implementation. In the near future I'll be running performance comparisons
  and sharing them - but there is no reason to expect significant degradation
  with fuse, since famfs caches entire "fmaps" in the kernel to resolve
  faults with no upcalls. This patch set has a bit too much debug turned on
  to do that testing quite yet. A branch

* Two new fuse messages / responses are added: GET_FMAP and GET_DAXDEV.

* When a file is looked up in a famfs mount, the LOOKUP is followed by a
  GET_FMAP message and response. The "fmap" is the full file-to-dax mapping,
  allowing the fuse/famfs kernel code to handle read/write/fault without any
  upcalls.

* After each GET_FMAP, the fmap is checked for extents that reference
  previously-unknown daxdevs. Each such occurrence is handled with a
  GET_DAXDEV message and response.

* Daxdevs are stored in a table (which might become an xarray at some point).
  When entries are added to the table, we acquire exclusive access to the
  daxdev via the fs_dax_get() call (modeled after how fs-dax handles this
  with pmem devices). famfs provides holder_operations to devdax, providing
  a notification path in the event of memory errors.

* If devdax notifies famfs of memory errors on a dax device, famfs currently
  blocks all subsequent accesses to data on that device. The recovery is to
  re-initialize the memory and file system. Famfs is memory, not storage...

* Because famfs uses backing (devdax) devices, only privileged mounts are
  supported.

* The famfs kernel code never accesses the memory directly - it only
  facilitates read, write and mmap on behalf of user processes. As such,
  the RAS of the shared memory affects applications, but not the kernel.

* Famfs has backing device(s), but they are devdax (char) rather than
  block. Right now there is no way to tell the vfs layer that famfs has a
  char backing device (unless we say it's block, but it's not). Currently
  we use the standard anonymous fuse fs_type - but I'm not sure that's
  ultimately optimal (thoughts?)
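
The daxdev bookkeeping described in the notes above can be modeled in user
space roughly as follows. This is only a sketch under stated assumptions:
the table size, the struct names and the helper functions are all
hypothetical, and the real code stores more per-device state than two
flags.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_DAXDEVS 8 /* hypothetical table size */

struct sketch_daxdev {
	bool valid; /* filled in by a (simulated) GET_DAXDEV response */
	bool error; /* set once a memory error is reported */
};

struct sketch_daxdev_table {
	struct sketch_daxdev dev[SKETCH_MAX_DAXDEVS];
};

/*
 * Given the device indices referenced by a freshly received fmap, collect
 * the ones not yet in the table (each would need one GET_DAXDEV round
 * trip). 'missing' must have room for n entries. Returns the number of
 * distinct unknown devices, or -1 if an index is out of range.
 */
static int sketch_find_unknown(const struct sketch_daxdev_table *t,
			       const uint32_t *idx, size_t n,
			       uint32_t *missing)
{
	int nmissing = 0;
	size_t i;
	int j;

	for (i = 0; i < n; i++) {
		bool seen = false;

		if (idx[i] >= SKETCH_MAX_DAXDEVS)
			return -1;
		if (t->dev[idx[i]].valid)
			continue;
		for (j = 0; j < nmissing; j++)
			if (missing[j] == idx[i])
				seen = true;
		if (!seen)
			missing[nmissing++] = idx[i];
	}
	return nmissing;
}

/* Record a device after its (simulated) GET_DAXDEV response arrives. */
static void sketch_add_daxdev(struct sketch_daxdev_table *t, uint32_t index)
{
	if (index < SKETCH_MAX_DAXDEVS)
		t->dev[index].valid = true;
}

/* Model of the memory-error notification path: mark the device bad. */
static void sketch_notify_failure(struct sketch_daxdev_table *t, uint32_t index)
{
	if (index < SKETCH_MAX_DAXDEVS)
		t->dev[index].error = true;
}

/* Accesses are refused for unknown devices and for devices with errors. */
static bool sketch_access_ok(const struct sketch_daxdev_table *t, uint32_t index)
{
	return index < SKETCH_MAX_DAXDEVS &&
	       t->dev[index].valid && !t->dev[index].error;
}
```

Once a device is marked as failed the sketch never clears the flag, which
mirrors the note that recovery means re-initializing the memory and the
file system.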

The "poisoned page|folio problem"

* Background: before doing a kernel mount, the famfs user space [2] validates
  the superblock and log. This is done via raw mmap of the primary devdax
  device. If valid, the file system is mounted, and the superblock and log
  get exposed through a pair of files (.meta/.superblock and .meta/.log) -
  because we can't be using raw device mmap when a file system is mounted
  on the device. But this exposes a devdax bug and warning...

* Pages that have been memory mapped via devdax are left in a permanently
  problematic state. Devdax sets page|folio->mapping when a page is accessed
  via raw devdax mmap (as famfs does before mount), but never cleans it up.
  When the pages of the famfs superblock and log are accessed via the "meta"
  files after mount, we see a WARN_ON_ONCE() in dax_associate_entry()
  (called from dax_insert_entry()), which notices that page|folio->mapping
  is still set. I intend to address this prior to asking for the famfs
  patches to be merged.

* Alistair Popple's recent dax patch series [6], which has been merged
  for 6.15, addresses some dax issues, but sadly does not fix the poisoned
  page|folio problem - its enhanced refcount checking turns the warning into
  an error.

* This 6.14 patch set disables the warning; a proper fix will be required for
  famfs to work at all in 6.15. Dan W. and I are actively discussing how to do
  this properly...

* In terms of the correct functionality of famfs, the warning can be ignored.

References

[1] - https://github.com/libfuse/libfuse/pull/1200
[2] - https://github.com/cxl-micron-reskit/famfs
[3] - https://lore.kernel.org/linux-cxl/cover.1708709155.git.john@groves.net/
[4] - https://lore.kernel.org/linux-cxl/cover.1714409084.git.john@groves.net/
[5] - https://lwn.net/Articles/983105/
[6] - https://lore.kernel.org/linux-cxl/cover.8068ad144a7eea4a813670301f4d2a86a8e68ec4.1740713401.git-series.apopple@nvidia.com/


John Groves (19):
  dev_dax_iomap: Move dax_pgoff_to_phys() from device.c to bus.c
  dev_dax_iomap: Add fs_dax_get() func to prepare dax for fs-dax usage
  dev_dax_iomap: Save the kva from memremap
  dev_dax_iomap: Add dax_operations for use by fs-dax on devdax
  dev_dax_iomap: export dax_dev_get()
  dev_dax_iomap: (ignore!) Drop poisoned page warning in fs/dax.c
  famfs_fuse: magic.h: Add famfs magic numbers
  famfs_fuse: Kconfig
  famfs_fuse: Update macro s/FUSE_IS_DAX/FUSE_IS_VIRTIO_DAX/
  famfs_fuse: Basic fuse kernel ABI enablement for famfs
  famfs_fuse: Basic famfs mount opts
  famfs_fuse: Plumb the GET_FMAP message/response
  famfs_fuse: Create files with famfs fmaps
  famfs_fuse: GET_DAXDEV message and daxdev_table
  famfs_fuse: Plumb dax iomap and fuse read/write/mmap
  famfs_fuse: Add holder_operations for dax notify_failure()
  famfs_fuse: Add famfs metadata documentation
  famfs_fuse: Add documentation
  famfs_fuse: (ignore) debug cruft

 Documentation/filesystems/famfs.rst |  142 ++++
 Documentation/filesystems/index.rst |    1 +
 MAINTAINERS                         |   10 +
 drivers/dax/Kconfig                 |    6 +
 drivers/dax/bus.c                   |  144 +++-
 drivers/dax/dax-private.h           |    1 +
 drivers/dax/device.c                |   38 +-
 drivers/dax/super.c                 |   33 +-
 fs/dax.c                            |    1 -
 fs/fuse/Kconfig                     |   13 +
 fs/fuse/Makefile                    |    4 +-
 fs/fuse/dev.c                       |   61 ++
 fs/fuse/dir.c                       |   74 +-
 fs/fuse/famfs.c                     | 1105 +++++++++++++++++++++++++++
 fs/fuse/famfs_kfmap.h               |  166 ++++
 fs/fuse/file.c                      |   27 +-
 fs/fuse/fuse_i.h                    |   67 +-
 fs/fuse/inode.c                     |   49 +-
 fs/fuse/iomode.c                    |    2 +-
 fs/namei.c                          |    1 +
 include/linux/dax.h                 |    6 +
 include/uapi/linux/fuse.h           |   63 ++
 include/uapi/linux/magic.h          |    2 +
 23 files changed, 1973 insertions(+), 43 deletions(-)
 create mode 100644 Documentation/filesystems/famfs.rst
 create mode 100644 fs/fuse/famfs.c
 create mode 100644 fs/fuse/famfs_kfmap.h


base-commit: 38fec10eb60d687e30c8c6b5420d86e8149f7557
-- 
2.49.0



* [RFC PATCH 01/19] dev_dax_iomap: Move dax_pgoff_to_phys() from device.c to bus.c
  2025-04-21  1:33 [RFC PATCH 00/19] famfs: port into fuse John Groves
@ 2025-04-21  1:33 ` John Groves
  2025-04-21  1:33 ` [RFC PATCH 02/19] dev_dax_iomap: Add fs_dax_get() func to prepare dax for fs-dax usage John Groves
                   ` (20 subsequent siblings)
  21 siblings, 0 replies; 58+ messages in thread
From: John Groves @ 2025-04-21  1:33 UTC (permalink / raw)
  To: John Groves, Dan Williams, Miklos Szeredi, Bernd Schubert
  Cc: John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Darrick J . Wong, Luis Henriques, Randy Dunlap, Jeff Layton,
	Kent Overstreet, Petr Vorel, Brian Foster, linux-doc,
	linux-kernel, nvdimm, linux-cxl, linux-fsdevel, Amir Goldstein,
	Jonathan Cameron, Stefan Hajnoczi, Joanne Koong, Josef Bacik,
	Aravind Ramesh, Ajay Joshi, John Groves

No changes to the function - just moved it.

dev_dax_iomap needs to call this function from
drivers/dax/bus.c.

drivers/dax/bus.c can't call functions in drivers/dax/device.c -
that creates a circular linkage dependency - but device.c can
call functions in bus.c. Also exports dax_pgoff_to_phys() since
both bus.c and device.c now call it.

Signed-off-by: John Groves <john@groves.net>
---
 drivers/dax/bus.c    | 24 ++++++++++++++++++++++++
 drivers/dax/device.c | 23 -----------------------
 2 files changed, 24 insertions(+), 23 deletions(-)

diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index fde29e0ad68b..9d9a4ae7bbc0 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -1417,6 +1417,30 @@ static const struct device_type dev_dax_type = {
 	.groups = dax_attribute_groups,
 };
 
+/* see "strong" declaration in tools/testing/nvdimm/dax-dev.c  */
+__weak phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff,
+			      unsigned long size)
+{
+	int i;
+
+	for (i = 0; i < dev_dax->nr_range; i++) {
+		struct dev_dax_range *dax_range = &dev_dax->ranges[i];
+		struct range *range = &dax_range->range;
+		unsigned long long pgoff_end;
+		phys_addr_t phys;
+
+		pgoff_end = dax_range->pgoff + PHYS_PFN(range_len(range)) - 1;
+		if (pgoff < dax_range->pgoff || pgoff > pgoff_end)
+			continue;
+		phys = PFN_PHYS(pgoff - dax_range->pgoff) + range->start;
+		if (phys + size - 1 <= range->end)
+			return phys;
+		break;
+	}
+	return -1;
+}
+EXPORT_SYMBOL_GPL(dax_pgoff_to_phys);
+
 static struct dev_dax *__devm_create_dev_dax(struct dev_dax_data *data)
 {
 	struct dax_region *dax_region = data->dax_region;
diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index 6d74e62bbee0..29f61771fef0 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -50,29 +50,6 @@ static int check_vma(struct dev_dax *dev_dax, struct vm_area_struct *vma,
 	return 0;
 }
 
-/* see "strong" declaration in tools/testing/nvdimm/dax-dev.c */
-__weak phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff,
-		unsigned long size)
-{
-	int i;
-
-	for (i = 0; i < dev_dax->nr_range; i++) {
-		struct dev_dax_range *dax_range = &dev_dax->ranges[i];
-		struct range *range = &dax_range->range;
-		unsigned long long pgoff_end;
-		phys_addr_t phys;
-
-		pgoff_end = dax_range->pgoff + PHYS_PFN(range_len(range)) - 1;
-		if (pgoff < dax_range->pgoff || pgoff > pgoff_end)
-			continue;
-		phys = PFN_PHYS(pgoff - dax_range->pgoff) + range->start;
-		if (phys + size - 1 <= range->end)
-			return phys;
-		break;
-	}
-	return -1;
-}
-
 static void dax_set_mapping(struct vm_fault *vmf, pfn_t pfn,
 			      unsigned long fault_size)
 {
-- 
2.49.0



* [RFC PATCH 02/19] dev_dax_iomap: Add fs_dax_get() func to prepare dax for fs-dax usage
  2025-04-21  1:33 [RFC PATCH 00/19] famfs: port into fuse John Groves
  2025-04-21  1:33 ` [RFC PATCH 01/19] dev_dax_iomap: Move dax_pgoff_to_phys() from device.c to bus.c John Groves
@ 2025-04-21  1:33 ` John Groves
  2025-04-21  1:33 ` [RFC PATCH 03/19] dev_dax_iomap: Save the kva from memremap John Groves
                   ` (19 subsequent siblings)
  21 siblings, 0 replies; 58+ messages in thread
From: John Groves @ 2025-04-21  1:33 UTC (permalink / raw)
  To: John Groves, Dan Williams, Miklos Szeredi, Bernd Schubert
  Cc: John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Darrick J . Wong, Luis Henriques, Randy Dunlap, Jeff Layton,
	Kent Overstreet, Petr Vorel, Brian Foster, linux-doc,
	linux-kernel, nvdimm, linux-cxl, linux-fsdevel, Amir Goldstein,
	Jonathan Cameron, Stefan Hajnoczi, Joanne Koong, Josef Bacik,
	Aravind Ramesh, Ajay Joshi, John Groves

This function should be called by fs-dax file systems after opening the
devdax device. This adds holder_operations, which effects exclusivity
between callers of fs_dax_get().

This function serves the same role as fs_dax_get_by_bdev(), which dax
file systems call after opening the pmem block device.
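
The exclusivity comes from a compare-and-exchange on the holder pointer:
the first caller swaps NULL for its holder, and later callers see a
non-NULL value and fail. A minimal user-space model of just that step,
using C11 atomics in place of the kernel's cmpxchg(), might look like the
following. The sketch_* names are hypothetical, and the liveness check and
inode reference taken by the real function are left out.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for the two dax_device fields involved. */
struct sketch_dax_dev {
	_Atomic(void *) holder_data; /* NULL means unclaimed */
	const void *holder_ops;
};

static void sketch_dax_dev_init(struct sketch_dax_dev *d)
{
	atomic_init(&d->holder_data, NULL);
	d->holder_ops = NULL;
}

/*
 * Claim the device for 'holder'. Only the caller that swaps NULL for its
 * own holder pointer succeeds; everyone else gets -EBUSY.
 */
static int sketch_fs_dax_get(struct sketch_dax_dev *d, void *holder,
			     const void *hops)
{
	void *expected = NULL;

	if (!d)
		return -ENODEV;
	if (!atomic_compare_exchange_strong(&d->holder_data, &expected, holder))
		return -EBUSY;
	d->holder_ops = hops;
	return 0;
}
```

In this model a second claim fails with -EBUSY until the holder pointer is
reset to NULL.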

This also adds the CONFIG_DEV_DAX_IOMAP Kconfig parameter.

Signed-off-by: John Groves <john@groves.net>
---
 drivers/dax/Kconfig |  6 ++++++
 drivers/dax/super.c | 30 ++++++++++++++++++++++++++++++
 include/linux/dax.h |  5 +++++
 3 files changed, 41 insertions(+)

diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
index d656e4c0eb84..ad19fa966b8b 100644
--- a/drivers/dax/Kconfig
+++ b/drivers/dax/Kconfig
@@ -78,4 +78,10 @@ config DEV_DAX_KMEM
 
 	  Say N if unsure.
 
+config DEV_DAX_IOMAP
+       depends on DEV_DAX && DAX
+       def_bool y
+       help
+         Support iomap mapping of devdax devices (for FS-DAX file
+         systems that reside on character /dev/dax devices)
 endif
diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index e16d1d40d773..48bab9b5f341 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -122,6 +122,36 @@ void fs_put_dax(struct dax_device *dax_dev, void *holder)
 EXPORT_SYMBOL_GPL(fs_put_dax);
 #endif /* CONFIG_BLOCK && CONFIG_FS_DAX */
 
+#if IS_ENABLED(CONFIG_DEV_DAX_IOMAP)
+/**
+ * fs_dax_get()
+ *
+ * fs-dax file systems call this function to prepare to use a devdax device for
+ * fsdax. This is like fs_dax_get_by_bdev(), but the caller already has struct
+ * dev_dax (and there is no bdev). The holder makes this exclusive.
+ *
+ * @dax_dev: dev to be prepared for fs-dax usage
+ * @holder: filesystem or mapped device inside the dax_device
+ * @hops: operations for the inner holder
+ *
+ * Returns: 0 on success, <0 on failure
+ */
+int fs_dax_get(struct dax_device *dax_dev, void *holder,
+	const struct dax_holder_operations *hops)
+{
+	if (!dax_dev || !dax_alive(dax_dev) || !igrab(&dax_dev->inode))
+		return -ENODEV;
+
+	if (cmpxchg(&dax_dev->holder_data, NULL, holder))
+		return -EBUSY;
+
+	dax_dev->holder_ops = hops;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(fs_dax_get);
+#endif /* DEV_DAX_IOMAP */
+
 enum dax_device_flags {
 	/* !alive + rcu grace period == no new operations / mappings */
 	DAXDEV_ALIVE,
diff --git a/include/linux/dax.h b/include/linux/dax.h
index df41a0017b31..86bf5922f1b0 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -51,6 +51,11 @@ struct dax_holder_operations {
 
 #if IS_ENABLED(CONFIG_DAX)
 struct dax_device *alloc_dax(void *private, const struct dax_operations *ops);
+
+#if IS_ENABLED(CONFIG_DEV_DAX_IOMAP)
+int fs_dax_get(struct dax_device *dax_dev, void *holder, const struct dax_holder_operations *hops);
+struct dax_device *inode_dax(struct inode *inode);
+#endif
 void *dax_holder(struct dax_device *dax_dev);
 void put_dax(struct dax_device *dax_dev);
 void kill_dax(struct dax_device *dax_dev);
-- 
2.49.0



* [RFC PATCH 03/19] dev_dax_iomap: Save the kva from memremap
  2025-04-21  1:33 [RFC PATCH 00/19] famfs: port into fuse John Groves
  2025-04-21  1:33 ` [RFC PATCH 01/19] dev_dax_iomap: Move dax_pgoff_to_phys() from device.c to bus.c John Groves
  2025-04-21  1:33 ` [RFC PATCH 02/19] dev_dax_iomap: Add fs_dax_get() func to prepare dax for fs-dax usage John Groves
@ 2025-04-21  1:33 ` John Groves
  2025-04-21  1:33 ` [RFC PATCH 04/19] dev_dax_iomap: Add dax_operations for use by fs-dax on devdax John Groves
                   ` (18 subsequent siblings)
  21 siblings, 0 replies; 58+ messages in thread
From: John Groves @ 2025-04-21  1:33 UTC (permalink / raw)
  To: John Groves, Dan Williams, Miklos Szeredi, Bernd Schubert
  Cc: John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Darrick J . Wong, Luis Henriques, Randy Dunlap, Jeff Layton,
	Kent Overstreet, Petr Vorel, Brian Foster, linux-doc,
	linux-kernel, nvdimm, linux-cxl, linux-fsdevel, Amir Goldstein,
	Jonathan Cameron, Stefan Hajnoczi, Joanne Koong, Josef Bacik,
	Aravind Ramesh, Ajay Joshi, John Groves

Save the kva from memremap because we need it for iomap rw support.

Prior to famfs, there were no iomap users of /dev/dax - so the virtual
address from memremap was not needed.

Also: in some cases dev_dax_probe() is called with the first
dev_dax->range offset past the start of pgmap[0].range. In those cases
we need to add the difference to virt_addr in order to have the physaddrs
in dev_dax->ranges match dev_dax->virt_addr.

This happens with devdax devices that started as pmem and got converted
to devdax. I'm not sure whether the offset is due to label storage, or
page tables, but this works in all known cases.
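
The adjustment itself is plain address arithmetic. The following user-space
sketch uses hypothetical sketch_* helpers to show it. One difference from
the patch is deliberate: where the patch warns and leaves the offset at
zero if the pgmap starts after the data range, this sketch reports an
error instead.

```c
#include <stdint.h>

/*
 * Compute how far the first data range starts past the beginning of the
 * remapped region. Returns 0 and stores the offset on success, or -1 if
 * the pgmap range starts after the data range (treated as an error here).
 */
static int sketch_data_offset(uint64_t pgmap_start, uint64_t range_start,
			      uint64_t *offset)
{
	if (pgmap_start > range_start)
		return -1;
	*offset = range_start - pgmap_start;
	return 0;
}

/* Virtual address that corresponds to the first byte of the data range. */
static void *sketch_virt_addr(void *memremap_base, uint64_t data_offset)
{
	return (char *)memremap_base + data_offset;
}
```

With the offset folded into the saved address, byte N of the first data
range sits at the saved address plus N.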

Signed-off-by: John Groves <john@groves.net>
---
 drivers/dax/dax-private.h |  1 +
 drivers/dax/device.c      | 15 +++++++++++++++
 2 files changed, 16 insertions(+)

diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
index 0867115aeef2..2a6b07813f9f 100644
--- a/drivers/dax/dax-private.h
+++ b/drivers/dax/dax-private.h
@@ -81,6 +81,7 @@ struct dev_dax_range {
 struct dev_dax {
 	struct dax_region *region;
 	struct dax_device *dax_dev;
+	void *virt_addr;
 	unsigned int align;
 	int target_node;
 	bool dyn_id;
diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index 29f61771fef0..583150478dcc 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -372,6 +372,7 @@ static int dev_dax_probe(struct dev_dax *dev_dax)
 	struct dax_device *dax_dev = dev_dax->dax_dev;
 	struct device *dev = &dev_dax->dev;
 	struct dev_pagemap *pgmap;
+	u64 data_offset = 0;
 	struct inode *inode;
 	struct cdev *cdev;
 	void *addr;
@@ -426,6 +427,20 @@ static int dev_dax_probe(struct dev_dax *dev_dax)
 	if (IS_ERR(addr))
 		return PTR_ERR(addr);
 
+	/* Detect whether the data is at a non-zero offset into the memory */
+	if (pgmap->range.start != dev_dax->ranges[0].range.start) {
+		u64 phys = dev_dax->ranges[0].range.start;
+		u64 pgmap_phys = dev_dax->pgmap[0].range.start;
+		u64 vmemmap_shift = dev_dax->pgmap[0].vmemmap_shift;
+
+		if (!WARN_ON(pgmap_phys > phys))
+			data_offset = phys - pgmap_phys;
+
+		pr_debug("%s: offset detected phys=%llx pgmap_phys=%llx offset=%llx shift=%llx\n",
+		       __func__, phys, pgmap_phys, data_offset, vmemmap_shift);
+	}
+	dev_dax->virt_addr = addr + data_offset;
+
 	inode = dax_inode(dax_dev);
 	cdev = inode->i_cdev;
 	cdev_init(cdev, &dax_fops);
-- 
2.49.0



* [RFC PATCH 04/19] dev_dax_iomap: Add dax_operations for use by fs-dax on devdax
  2025-04-21  1:33 [RFC PATCH 00/19] famfs: port into fuse John Groves
                   ` (2 preceding siblings ...)
  2025-04-21  1:33 ` [RFC PATCH 03/19] dev_dax_iomap: Save the kva from memremap John Groves
@ 2025-04-21  1:33 ` John Groves
  2025-04-21  1:33 ` [RFC PATCH 05/19] dev_dax_iomap: export dax_dev_get() John Groves
                   ` (17 subsequent siblings)
  21 siblings, 0 replies; 58+ messages in thread
From: John Groves @ 2025-04-21  1:33 UTC (permalink / raw)
  To: John Groves, Dan Williams, Miklos Szeredi, Bernd Schubert
  Cc: John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Darrick J . Wong, Luis Henriques, Randy Dunlap, Jeff Layton,
	Kent Overstreet, Petr Vorel, Brian Foster, linux-doc,
	linux-kernel, nvdimm, linux-cxl, linux-fsdevel, Amir Goldstein,
	Jonathan Cameron, Stefan Hajnoczi, Joanne Koong, Josef Bacik,
	Aravind Ramesh, Ajay Joshi, John Groves

Notes about this commit:

* These methods are based on pmem_dax_ops from drivers/nvdimm/pmem.c

* dev_dax_direct_access() returns the hpa, pfn and kva. The kva was
  newly stored as dev_dax->virt_addr by dev_dax_probe().

* The hpa/pfn are used for mmap (dax_iomap_fault()), and the kva is used
  for read/write (dax_iomap_rw())

* dev_dax_recovery_write() and dev_dax_zero_page_range() have not been
  tested yet. I'm looking for suggestions as to how to test those.
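
To make the direct_access contract concrete, here is a simplified
user-space model of the size calculation: translate a page offset into a
kernel virtual address and report how many pages are valid from there,
clamped to the end of the device. The names and the fixed 4 KiB page size
are assumptions for illustration. The explicit bounds check on the offset
is an addition in this sketch and is not something the patch does.

```c
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12 /* assumes 4 KiB pages */

/*
 * Returns the number of pages valid at 'pgoff' (at most nr_pages), and
 * stores the matching virtual address in *kaddr when kaddr is non-NULL.
 * Returns 0 if the request starts at or beyond the end of the device.
 */
static long sketch_direct_access(char *virt_base, uint64_t dev_size,
				 uint64_t pgoff, long nr_pages, void **kaddr)
{
	uint64_t offset = pgoff << SKETCH_PAGE_SHIFT;
	uint64_t size, avail;

	if (nr_pages <= 0 || offset >= dev_size)
		return 0;

	size = (uint64_t)nr_pages << SKETCH_PAGE_SHIFT;
	avail = dev_size - offset;
	if (size > avail)
		size = avail;

	if (kaddr)
		*kaddr = virt_base + offset;

	return (long)(size >> SKETCH_PAGE_SHIFT);
}
```

A caller asking for more pages than remain gets a short count back and is
expected to loop, much as dev_dax_zero_page_range() does in the patch.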

Signed-off-by: John Groves <john@groves.net>
---
 drivers/dax/bus.c | 120 ++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 115 insertions(+), 5 deletions(-)

diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 9d9a4ae7bbc0..61a8d1b3c07a 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -7,6 +7,10 @@
 #include <linux/slab.h>
 #include <linux/dax.h>
 #include <linux/io.h>
+#include <linux/backing-dev.h>
+#include <linux/pfn_t.h>
+#include <linux/range.h>
+#include <linux/uio.h>
 #include "dax-private.h"
 #include "bus.h"
 
@@ -1441,6 +1445,105 @@ __weak phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff,
 }
 EXPORT_SYMBOL_GPL(dax_pgoff_to_phys);
 
+#if IS_ENABLED(CONFIG_DEV_DAX_IOMAP)
+
+static void write_dax(void *pmem_addr, struct page *page,
+		unsigned int off, unsigned int len)
+{
+	unsigned int chunk;
+	void *mem;
+
+	while (len) {
+		mem = kmap_local_page(page);
+		chunk = min_t(unsigned int, len, PAGE_SIZE - off);
+		memcpy_flushcache(pmem_addr, mem + off, chunk);
+		kunmap_local(mem);
+		len -= chunk;
+		off = 0;
+		page++;
+		pmem_addr += chunk;
+	}
+}
+
+static long __dev_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
+			long nr_pages, enum dax_access_mode mode, void **kaddr,
+			pfn_t *pfn)
+{
+	struct dev_dax *dev_dax = dax_get_private(dax_dev);
+	size_t size = nr_pages << PAGE_SHIFT;
+	size_t offset = pgoff << PAGE_SHIFT;
+	void *virt_addr = dev_dax->virt_addr + offset;
+	u64 flags = PFN_DEV|PFN_MAP;
+	phys_addr_t phys;
+	pfn_t local_pfn;
+	size_t dax_size;
+
+	WARN_ON(!dev_dax->virt_addr);
+
+	if (down_read_interruptible(&dax_dev_rwsem))
+		return 0; /* no valid data since we were killed */
+	dax_size = dev_dax_size(dev_dax);
+	up_read(&dax_dev_rwsem);
+
+	phys = dax_pgoff_to_phys(dev_dax, pgoff, nr_pages << PAGE_SHIFT);
+
+	if (kaddr)
+		*kaddr = virt_addr;
+
+	local_pfn = phys_to_pfn_t(phys, flags); /* are flags correct? */
+	if (pfn)
+		*pfn = local_pfn;
+
+	/* This is the valid size at the specified address */
+	return PHYS_PFN(min_t(size_t, size, dax_size - offset));
+}
+
+static int dev_dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
+				    size_t nr_pages)
+{
+	long resid = nr_pages << PAGE_SHIFT;
+	long offset = pgoff << PAGE_SHIFT;
+
+	/* Break into one write per dax region */
+	while (resid > 0) {
+		void *kaddr;
+		pgoff_t poff = offset >> PAGE_SHIFT;
+		long len = __dev_dax_direct_access(dax_dev, poff,
+						   nr_pages, DAX_ACCESS, &kaddr, NULL);
+		len = min_t(long, len, PAGE_SIZE);
+		write_dax(kaddr, ZERO_PAGE(0), offset, len);
+
+		offset += len;
+		resid  -= len;
+	}
+	return 0;
+}
+
+static long dev_dax_direct_access(struct dax_device *dax_dev,
+		pgoff_t pgoff, long nr_pages, enum dax_access_mode mode,
+		void **kaddr, pfn_t *pfn)
+{
+	return __dev_dax_direct_access(dax_dev, pgoff, nr_pages, mode, kaddr, pfn);
+}
+
+static size_t dev_dax_recovery_write(struct dax_device *dax_dev, pgoff_t pgoff,
+		void *addr, size_t bytes, struct iov_iter *i)
+{
+	size_t off;
+
+	off = offset_in_page(addr);
+
+	return _copy_from_iter_flushcache(addr, bytes, i);
+}
+
+static const struct dax_operations dev_dax_ops = {
+	.direct_access = dev_dax_direct_access,
+	.zero_page_range = dev_dax_zero_page_range,
+	.recovery_write = dev_dax_recovery_write,
+};
+
+#endif /* IS_ENABLED(CONFIG_DEV_DAX_IOMAP) */
+
 static struct dev_dax *__devm_create_dev_dax(struct dev_dax_data *data)
 {
 	struct dax_region *dax_region = data->dax_region;
@@ -1496,11 +1599,18 @@ static struct dev_dax *__devm_create_dev_dax(struct dev_dax_data *data)
 		}
 	}
 
-	/*
-	 * No dax_operations since there is no access to this device outside of
-	 * mmap of the resulting character device.
-	 */
-	dax_dev = alloc_dax(dev_dax, NULL);
+	if (IS_ENABLED(CONFIG_DEV_DAX_IOMAP))
+		/* holder_ops currently populated separately in a slightly
+		 * hacky way
+		 */
+		dax_dev = alloc_dax(dev_dax, &dev_dax_ops);
+	else
+		/*
+		 * No dax_operations since there is no access to this device
+		 * outside of mmap of the resulting character device.
+		 */
+		dax_dev = alloc_dax(dev_dax, NULL);
+
 	if (IS_ERR(dax_dev)) {
 		rc = PTR_ERR(dax_dev);
 		goto err_alloc_dax;
-- 
2.49.0



* [RFC PATCH 05/19] dev_dax_iomap: export dax_dev_get()
  2025-04-21  1:33 [RFC PATCH 00/19] famfs: port into fuse John Groves
                   ` (3 preceding siblings ...)
  2025-04-21  1:33 ` [RFC PATCH 04/19] dev_dax_iomap: Add dax_operations for use by fs-dax on devdax John Groves
@ 2025-04-21  1:33 ` John Groves
  2025-04-21  1:33 ` [RFC PATCH 06/19] dev_dax_iomap: (ignore!) Drop poisoned page warning in fs/dax.c John Groves
                   ` (16 subsequent siblings)
  21 siblings, 0 replies; 58+ messages in thread
From: John Groves @ 2025-04-21  1:33 UTC (permalink / raw)
  To: John Groves, Dan Williams, Miklos Szeredi, Bernd Schubert
  Cc: John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Darrick J . Wong, Luis Henriques, Randy Dunlap, Jeff Layton,
	Kent Overstreet, Petr Vorel, Brian Foster, linux-doc,
	linux-kernel, nvdimm, linux-cxl, linux-fsdevel, Amir Goldstein,
	Jonathan Cameron, Stefan Hajnoczi, Joanne Koong, Josef Bacik,
	Aravind Ramesh, Ajay Joshi, John Groves

famfs needs access to dax_dev_get().

Signed-off-by: John Groves <john@groves.net>
---
 drivers/dax/super.c | 3 ++-
 include/linux/dax.h | 1 +
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 48bab9b5f341..033fd841c2bb 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -452,7 +452,7 @@ static int dax_set(struct inode *inode, void *data)
 	return 0;
 }
 
-static struct dax_device *dax_dev_get(dev_t devt)
+struct dax_device *dax_dev_get(dev_t devt)
 {
 	struct dax_device *dax_dev;
 	struct inode *inode;
@@ -475,6 +475,7 @@ static struct dax_device *dax_dev_get(dev_t devt)
 
 	return dax_dev;
 }
+EXPORT_SYMBOL_GPL(dax_dev_get);
 
 struct dax_device *alloc_dax(void *private, const struct dax_operations *ops)
 {
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 86bf5922f1b0..c7bf03535b52 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -55,6 +55,7 @@ struct dax_device *alloc_dax(void *private, const struct dax_operations *ops);
 #if IS_ENABLED(CONFIG_DEV_DAX_IOMAP)
 int fs_dax_get(struct dax_device *dax_dev, void *holder, const struct dax_holder_operations *hops);
 struct dax_device *inode_dax(struct inode *inode);
+struct dax_device *dax_dev_get(dev_t devt);
 #endif
 void *dax_holder(struct dax_device *dax_dev);
 void put_dax(struct dax_device *dax_dev);
-- 
2.49.0



* [RFC PATCH 06/19] dev_dax_iomap: (ignore!) Drop poisoned page warning in fs/dax.c
  2025-04-21  1:33 [RFC PATCH 00/19] famfs: port into fuse John Groves
                   ` (4 preceding siblings ...)
  2025-04-21  1:33 ` [RFC PATCH 05/19] dev_dax_iomap: export dax_dev_get() John Groves
@ 2025-04-21  1:33 ` John Groves
  2025-04-21  1:33 ` [RFC PATCH 07/19] famfs_fuse: magic.h: Add famfs magic numbers John Groves
                   ` (15 subsequent siblings)
  21 siblings, 0 replies; 58+ messages in thread
From: John Groves @ 2025-04-21  1:33 UTC (permalink / raw)
  To: John Groves, Dan Williams, Miklos Szeredi, Bernd Schubert
  Cc: John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Darrick J . Wong, Luis Henriques, Randy Dunlap, Jeff Layton,
	Kent Overstreet, Petr Vorel, Brian Foster, linux-doc,
	linux-kernel, nvdimm, linux-cxl, linux-fsdevel, Amir Goldstein,
	Jonathan Cameron, Stefan Hajnoczi, Joanne Koong, Josef Bacik,
	Aravind Ramesh, Ajay Joshi, John Groves

This just works around the "poisoned page" warning that will be
properly fixed in a future version of this patch set. Please ignore
for the moment.

Signed-off-by: John Groves <john@groves.net>
---
 fs/dax.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/dax.c b/fs/dax.c
index 21b47402b3dc..635937593d5e 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -369,7 +369,6 @@ static void dax_associate_entry(void *entry, struct address_space *mapping,
 		if (shared) {
 			dax_page_share_get(page);
 		} else {
-			WARN_ON_ONCE(page->mapping);
 			page->mapping = mapping;
 			page->index = index + i++;
 		}
-- 
2.49.0



* [RFC PATCH 07/19] famfs_fuse: magic.h: Add famfs magic numbers
  2025-04-21  1:33 [RFC PATCH 00/19] famfs: port into fuse John Groves
                   ` (5 preceding siblings ...)
  2025-04-21  1:33 ` [RFC PATCH 06/19] dev_dax_iomap: (ignore!) Drop poisoned page warning in fs/dax.c John Groves
@ 2025-04-21  1:33 ` John Groves
  2025-04-21  1:33 ` [RFC PATCH 08/19] famfs_fuse: Kconfig John Groves
                   ` (14 subsequent siblings)
  21 siblings, 0 replies; 58+ messages in thread
From: John Groves @ 2025-04-21  1:33 UTC (permalink / raw)
  To: John Groves, Dan Williams, Miklos Szeredi, Bernd Schubert
  Cc: John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Darrick J . Wong, Luis Henriques, Randy Dunlap, Jeff Layton,
	Kent Overstreet, Petr Vorel, Brian Foster, linux-doc,
	linux-kernel, nvdimm, linux-cxl, linux-fsdevel, Amir Goldstein,
	Jonathan Cameron, Stefan Hajnoczi, Joanne Koong, Josef Bacik,
	Aravind Ramesh, Ajay Joshi, John Groves

Famfs distinguishes between its on-media and in-memory superblocks

Signed-off-by: John Groves <john@groves.net>
---
 include/uapi/linux/magic.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index bb575f3ab45e..ee497665d8d7 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -38,6 +38,8 @@
 #define OVERLAYFS_SUPER_MAGIC	0x794c7630
 #define FUSE_SUPER_MAGIC	0x65735546
 #define BCACHEFS_SUPER_MAGIC	0xca451a4e
+#define FAMFS_SUPER_MAGIC	0x87b282ff
+#define FAMFS_STATFS_MAGIC      0x87b282fd
 
 #define MINIX_SUPER_MAGIC	0x137F		/* minix v1 fs, 14 char names */
 #define MINIX_SUPER_MAGIC2	0x138F		/* minix v1 fs, 30 char names */
-- 
2.49.0



* [RFC PATCH 08/19] famfs_fuse: Kconfig
  2025-04-21  1:33 [RFC PATCH 00/19] famfs: port into fuse John Groves
                   ` (6 preceding siblings ...)
  2025-04-21  1:33 ` [RFC PATCH 07/19] famfs_fuse: magic.h: Add famfs magic numbers John Groves
@ 2025-04-21  1:33 ` John Groves
  2025-04-21  1:33 ` [RFC PATCH 09/19] famfs_fuse: Update macro s/FUSE_IS_DAX/FUSE_IS_VIRTIO_DAX/ John Groves
                   ` (13 subsequent siblings)
  21 siblings, 0 replies; 58+ messages in thread
From: John Groves @ 2025-04-21  1:33 UTC (permalink / raw)
  To: John Groves, Dan Williams, Miklos Szeredi, Bernd Schubert
  Cc: John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Darrick J . Wong, Luis Henriques, Randy Dunlap, Jeff Layton,
	Kent Overstreet, Petr Vorel, Brian Foster, linux-doc,
	linux-kernel, nvdimm, linux-cxl, linux-fsdevel, Amir Goldstein,
	Jonathan Cameron, Stefan Hajnoczi, Joanne Koong, Josef Bacik,
	Aravind Ramesh, Ajay Joshi, John Groves

Add FUSE_FAMFS_DAX config parameter, to control compilation of famfs
within fuse.

Signed-off-by: John Groves <john@groves.net>
---
 fs/fuse/Kconfig | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/fs/fuse/Kconfig b/fs/fuse/Kconfig
index ca215a3cba3e..e6d554f2a21c 100644
--- a/fs/fuse/Kconfig
+++ b/fs/fuse/Kconfig
@@ -75,3 +75,16 @@ config FUSE_IO_URING
 
 	  If you want to allow fuse server/client communication through io-uring,
 	  answer Y
+
+config FUSE_FAMFS_DAX
+	bool "FUSE support for fs-dax filesystems backed by devdax"
+	depends on FUSE_FS
+	default FUSE_FS
+	select DEV_DAX_IOMAP
+	help
+	  This enables the fabric-attached memory file system (famfs),
+	  which enables formatting devdax memory as a file system. Famfs
+	  is primarily intended for scale-out shared access to
+	  disaggregated memory.
+
+	  To enable famfs or other fuse/fs-dax file systems, answer Y
-- 
2.49.0



* [RFC PATCH 09/19] famfs_fuse: Update macro s/FUSE_IS_DAX/FUSE_IS_VIRTIO_DAX/
  2025-04-21  1:33 [RFC PATCH 00/19] famfs: port into fuse John Groves
                   ` (7 preceding siblings ...)
  2025-04-21  1:33 ` [RFC PATCH 08/19] famfs_fuse: Kconfig John Groves
@ 2025-04-21  1:33 ` John Groves
  2025-04-21  1:33 ` [RFC PATCH 10/19] famfs_fuse: Basic fuse kernel ABI enablement for famfs John Groves
                   ` (12 subsequent siblings)
  21 siblings, 0 replies; 58+ messages in thread
From: John Groves @ 2025-04-21  1:33 UTC (permalink / raw)
  To: John Groves, Dan Williams, Miklos Szeredi, Bernd Schubert
  Cc: John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Darrick J . Wong, Luis Henriques, Randy Dunlap, Jeff Layton,
	Kent Overstreet, Petr Vorel, Brian Foster, linux-doc,
	linux-kernel, nvdimm, linux-cxl, linux-fsdevel, Amir Goldstein,
	Jonathan Cameron, Stefan Hajnoczi, Joanne Koong, Josef Bacik,
	Aravind Ramesh, Ajay Joshi, John Groves

Virtio_fs now needs to determine if an inode is DAX && not famfs.

Signed-off-by: John Groves <john@groves.net>
---
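Note (illustrative only, not part of the patch): the predicate described
above can be modeled in user space as below. The struct and function names
are hypothetical stand-ins, not the kernel definitions. In this patch the
macro is only renamed and switched to take a fuse_inode; the "not famfs"
term shown here is what patch 12 of this series adds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures; not the real layouts. */
struct model_inode {
	bool is_dax;		/* models IS_DAX(inode) */
};

struct model_fuse_inode {
	struct model_inode inode;
	void *famfs_meta;	/* non-NULL once a famfs fmap is attached */
};

/* Models fuse_file_famfs(): true when the inode carries famfs metadata. */
static bool model_file_famfs(const struct model_fuse_inode *fi)
{
	return fi->famfs_meta != NULL;
}

/* Models FUSE_IS_VIRTIO_DAX(): DAX-enabled, but not claimed by famfs. */
static bool model_is_virtio_dax(const struct model_fuse_inode *fi,
				bool fuse_dax_enabled)
{
	assert(fi != NULL);
	return fuse_dax_enabled && fi->inode.is_dax && !model_file_famfs(fi);
}
```

With this model, a DAX inode that has famfs metadata attached is not
treated as virtio DAX, so the virtiofs read/write/mmap paths are skipped
for it.
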
 fs/fuse/dir.c    |  2 +-
 fs/fuse/file.c   | 13 ++++++++-----
 fs/fuse/fuse_i.h |  6 +++++-
 fs/fuse/inode.c  |  2 +-
 fs/fuse/iomode.c |  2 +-
 5 files changed, 16 insertions(+), 9 deletions(-)

diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 3805f9b06c9d..bc29db0117f4 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -1937,7 +1937,7 @@ int fuse_do_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
 		is_truncate = true;
 	}
 
-	if (FUSE_IS_DAX(inode) && is_truncate) {
+	if (FUSE_IS_VIRTIO_DAX(fi) && is_truncate) {
 		filemap_invalidate_lock(mapping);
 		fault_blocked = true;
 		err = fuse_dax_break_layouts(inode, 0, 0);
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index d63e56fd3dd2..6f10ae54e710 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -239,7 +239,7 @@ static int fuse_open(struct inode *inode, struct file *file)
 	int err;
 	bool is_truncate = (file->f_flags & O_TRUNC) && fc->atomic_o_trunc;
 	bool is_wb_truncate = is_truncate && fc->writeback_cache;
-	bool dax_truncate = is_truncate && FUSE_IS_DAX(inode);
+	bool dax_truncate = is_truncate && FUSE_IS_VIRTIO_DAX(fi);
 
 	if (fuse_is_bad(inode))
 		return -EIO;
@@ -1770,11 +1770,12 @@ static ssize_t fuse_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	struct file *file = iocb->ki_filp;
 	struct fuse_file *ff = file->private_data;
 	struct inode *inode = file_inode(file);
+	struct fuse_inode *fi = get_fuse_inode(inode);
 
 	if (fuse_is_bad(inode))
 		return -EIO;
 
-	if (FUSE_IS_DAX(inode))
+	if (FUSE_IS_VIRTIO_DAX(fi))
 		return fuse_dax_read_iter(iocb, to);
 
 	/* FOPEN_DIRECT_IO overrides FOPEN_PASSTHROUGH */
@@ -1791,11 +1792,12 @@ static ssize_t fuse_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	struct file *file = iocb->ki_filp;
 	struct fuse_file *ff = file->private_data;
 	struct inode *inode = file_inode(file);
+	struct fuse_inode *fi = get_fuse_inode(inode);
 
 	if (fuse_is_bad(inode))
 		return -EIO;
 
-	if (FUSE_IS_DAX(inode))
+	if (FUSE_IS_VIRTIO_DAX(fi))
 		return fuse_dax_write_iter(iocb, from);
 
 	/* FOPEN_DIRECT_IO overrides FOPEN_PASSTHROUGH */
@@ -2627,10 +2629,11 @@ static int fuse_file_mmap(struct file *file, struct vm_area_struct *vma)
 	struct fuse_file *ff = file->private_data;
 	struct fuse_conn *fc = ff->fm->fc;
 	struct inode *inode = file_inode(file);
+	struct fuse_inode *fi = get_fuse_inode(inode);
 	int rc;
 
 	/* DAX mmap is superior to direct_io mmap */
-	if (FUSE_IS_DAX(inode))
+	if (FUSE_IS_VIRTIO_DAX(fi))
 		return fuse_dax_mmap(file, vma);
 
 	/*
@@ -3191,7 +3194,7 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
 		.mode = mode
 	};
 	int err;
-	bool block_faults = FUSE_IS_DAX(inode) &&
+	bool block_faults = FUSE_IS_VIRTIO_DAX(fi) &&
 		(!(mode & FALLOC_FL_KEEP_SIZE) ||
 		 (mode & (FALLOC_FL_PUNCH_HOLE | FALLOC_FL_ZERO_RANGE)));
 
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index fee96fe7887b..e04d160fa995 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -1423,7 +1423,11 @@ void fuse_free_conn(struct fuse_conn *fc);
 
 /* dax.c */
 
-#define FUSE_IS_DAX(inode) (IS_ENABLED(CONFIG_FUSE_DAX) && IS_DAX(inode))
+/* This macro is used by virtio_fs, but now it also needs to filter for
+ * "not famfs"
+ */
+#define FUSE_IS_VIRTIO_DAX(fuse_inode) (IS_ENABLED(CONFIG_FUSE_DAX)	\
+					&& IS_DAX(&fuse_inode->inode))
 
 ssize_t fuse_dax_read_iter(struct kiocb *iocb, struct iov_iter *to);
 ssize_t fuse_dax_write_iter(struct kiocb *iocb, struct iov_iter *from);
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index e9db2cb8c150..29147657a99f 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -164,7 +164,7 @@ static void fuse_evict_inode(struct inode *inode)
 	if (inode->i_sb->s_flags & SB_ACTIVE) {
 		struct fuse_conn *fc = get_fuse_conn(inode);
 
-		if (FUSE_IS_DAX(inode))
+		if (FUSE_IS_VIRTIO_DAX(fi))
 			fuse_dax_inode_cleanup(inode);
 		if (fi->nlookup) {
 			fuse_queue_forget(fc, fi->forget, fi->nodeid,
diff --git a/fs/fuse/iomode.c b/fs/fuse/iomode.c
index c99e285f3183..aec4aecb5d79 100644
--- a/fs/fuse/iomode.c
+++ b/fs/fuse/iomode.c
@@ -204,7 +204,7 @@ int fuse_file_io_open(struct file *file, struct inode *inode)
 	 * io modes are not relevant with DAX and with server that does not
 	 * implement open.
 	 */
-	if (FUSE_IS_DAX(inode) || !ff->args)
+	if (FUSE_IS_VIRTIO_DAX(fi) || !ff->args)
 		return 0;
 
 	/*
-- 
2.49.0



* [RFC PATCH 10/19] famfs_fuse: Basic fuse kernel ABI enablement for famfs
  2025-04-21  1:33 [RFC PATCH 00/19] famfs: port into fuse John Groves
                   ` (8 preceding siblings ...)
  2025-04-21  1:33 ` [RFC PATCH 09/19] famfs_fuse: Update macro s/FUSE_IS_DAX/FUSE_IS_VIRTIO_DAX/ John Groves
@ 2025-04-21  1:33 ` John Groves
  2025-04-23  1:36   ` Joanne Koong
  2025-04-21  1:33 ` [RFC PATCH 11/19] famfs_fuse: Basic famfs mount opts John Groves
                   ` (11 subsequent siblings)
  21 siblings, 1 reply; 58+ messages in thread
From: John Groves @ 2025-04-21  1:33 UTC (permalink / raw)
  To: John Groves, Dan Williams, Miklos Szeredi, Bernd Schubert
  Cc: John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Darrick J . Wong, Luis Henriques, Randy Dunlap, Jeff Layton,
	Kent Overstreet, Petr Vorel, Brian Foster, linux-doc,
	linux-kernel, nvdimm, linux-cxl, linux-fsdevel, Amir Goldstein,
	Jonathan Cameron, Stefan Hajnoczi, Joanne Koong, Josef Bacik,
	Aravind Ramesh, Ajay Joshi, John Groves

* FUSE_DAX_FMAP flag in INIT request/reply

* fuse_conn->famfs_iomap (enable famfs-mapped files) to denote a
  famfs-enabled connection

Signed-off-by: John Groves <john@groves.net>
---
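Note (illustrative only, not part of the patch): a user-space sketch of
the INIT handshake for the new flag. The FUSE_DAX_FMAP value is taken from
this patch; the struct and helper names are hypothetical. Because the flag
is bit 42, it travels in the upper 32-bit word of the INIT flags (flags2),
where it appears as bit 10.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FUSE_DAX_FMAP (1ULL << 42)	/* value taken from this patch */

/* Compile-time check: bit 42 of the 64-bit set is bit 10 of the high word. */
static_assert((FUSE_DAX_FMAP >> 32) == (1ULL << 10),
	      "FUSE_DAX_FMAP lands in flags2 bit 10");

/* Hypothetical model: INIT carries the 64-bit flag set as two 32-bit words. */
struct model_init_flags {
	uint32_t flags;		/* bits 0..31 */
	uint32_t flags2;	/* bits 32..63 */
};

/* Split a 64-bit flag set into the two on-the-wire words. */
static struct model_init_flags split_flags(uint64_t flags64)
{
	struct model_init_flags f = {
		.flags  = (uint32_t)flags64,
		.flags2 = (uint32_t)(flags64 >> 32),
	};
	return f;
}

/* Rejoin the two words into one 64-bit flag set. */
static uint64_t join_flags(struct model_init_flags f)
{
	return (uint64_t)f.flags | ((uint64_t)f.flags2 << 32);
}

/* famfs mode needs kernel support AND the server echoing the flag back. */
static bool famfs_iomap_enabled(bool kernel_has_famfs,
				struct model_init_flags reply)
{
	return kernel_has_famfs && (join_flags(reply) & FUSE_DAX_FMAP) != 0;
}
```

A server that does not know the flag simply leaves it clear in its reply,
in which case the connection stays in ordinary fuse mode.
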
 fs/fuse/fuse_i.h          | 3 +++
 fs/fuse/inode.c           | 5 +++++
 include/uapi/linux/fuse.h | 2 ++
 3 files changed, 10 insertions(+)

diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index e04d160fa995..b2c563b1a1c8 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -870,6 +870,9 @@ struct fuse_conn {
 	/* Use io_uring for communication */
 	unsigned int io_uring;
 
+	/* dev_dax_iomap support for famfs */
+	unsigned int famfs_iomap:1;
+
 	/** Maximum stack depth for passthrough backing files */
 	int max_stack_depth;
 
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 29147657a99f..5c6947b12503 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -1392,6 +1392,9 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
 			}
 			if (flags & FUSE_OVER_IO_URING && fuse_uring_enabled())
 				fc->io_uring = 1;
+			if (IS_ENABLED(CONFIG_FUSE_FAMFS_DAX) &&
+				       flags & FUSE_DAX_FMAP)
+				fc->famfs_iomap = 1;
 		} else {
 			ra_pages = fc->max_read / PAGE_SIZE;
 			fc->no_lock = 1;
@@ -1450,6 +1453,8 @@ void fuse_send_init(struct fuse_mount *fm)
 		flags |= FUSE_SUBMOUNTS;
 	if (IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
 		flags |= FUSE_PASSTHROUGH;
+	if (IS_ENABLED(CONFIG_FUSE_FAMFS_DAX))
+		flags |= FUSE_DAX_FMAP;
 
 	/*
 	 * This is just an information flag for fuse server. No need to check
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 5e0eb41d967e..f9e14180367a 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -435,6 +435,7 @@ struct fuse_file_lock {
  *		    of the request ID indicates resend requests
  * FUSE_ALLOW_IDMAP: allow creation of idmapped mounts
  * FUSE_OVER_IO_URING: Indicate that client supports io-uring
+ * FUSE_DAX_FMAP: kernel supports dev_dax_iomap (aka famfs) fmaps
  */
 #define FUSE_ASYNC_READ		(1 << 0)
 #define FUSE_POSIX_LOCKS	(1 << 1)
@@ -482,6 +483,7 @@ struct fuse_file_lock {
 #define FUSE_DIRECT_IO_RELAX	FUSE_DIRECT_IO_ALLOW_MMAP
 #define FUSE_ALLOW_IDMAP	(1ULL << 40)
 #define FUSE_OVER_IO_URING	(1ULL << 41)
+#define FUSE_DAX_FMAP		(1ULL << 42)
 
 /**
  * CUSE INIT request/reply flags
-- 
2.49.0



* [RFC PATCH 11/19] famfs_fuse: Basic famfs mount opts
  2025-04-21  1:33 [RFC PATCH 00/19] famfs: port into fuse John Groves
                   ` (9 preceding siblings ...)
  2025-04-21  1:33 ` [RFC PATCH 10/19] famfs_fuse: Basic fuse kernel ABI enablement for famfs John Groves
@ 2025-04-21  1:33 ` John Groves
  2025-04-23  1:51   ` Joanne Koong
  2025-04-21  1:33 ` [RFC PATCH 12/19] famfs_fuse: Plumb the GET_FMAP message/response John Groves
                   ` (10 subsequent siblings)
  21 siblings, 1 reply; 58+ messages in thread
From: John Groves @ 2025-04-21  1:33 UTC (permalink / raw)
  To: John Groves, Dan Williams, Miklos Szeredi, Bernd Schubert
  Cc: John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Darrick J . Wong, Luis Henriques, Randy Dunlap, Jeff Layton,
	Kent Overstreet, Petr Vorel, Brian Foster, linux-doc,
	linux-kernel, nvdimm, linux-cxl, linux-fsdevel, Amir Goldstein,
	Jonathan Cameron, Stefan Hajnoczi, Joanne Koong, Josef Bacik,
	Aravind Ramesh, Ajay Joshi, John Groves

* -o shadow=<shadowpath>
* -o daxdev=<daxdev>

Signed-off-by: John Groves <john@groves.net>
---
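Note (illustrative only, not part of the patch): only the shadow option
appears in the diff below. Its parsing follows the usual ownership-transfer
pattern, sketched here in user space with hypothetical struct and function
names: the context takes the string from the parameter and clears the
parameter's pointer so it is freed exactly once, and a second shadow option
is rejected.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct fs_parameter and struct fuse_fs_context. */
struct model_param {
	char *string;	/* heap string owned by the parameter */
};

struct model_ctx {
	char *shadow;	/* NULL until a shadow path has been given */
};

/* Models the OPT_SHADOW case: accept one shadow path, reject a second. */
static int model_parse_shadow(struct model_ctx *ctx, struct model_param *param)
{
	assert(ctx != NULL && param != NULL);

	if (ctx->shadow)
		return -EINVAL;		/* "Multiple shadows specified" */

	ctx->shadow = param->string;	/* context now owns the string */
	param->string = NULL;		/* parameter must not free it */
	return 0;
}
```

On the rejected path the parameter keeps its string, so the caller remains
responsible for freeing it.
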
 fs/fuse/fuse_i.h |  8 +++++++-
 fs/fuse/inode.c  | 25 ++++++++++++++++++++++++-
 2 files changed, 31 insertions(+), 2 deletions(-)

diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index b2c563b1a1c8..931613102d32 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -580,9 +580,11 @@ struct fuse_fs_context {
 	unsigned int blksize;
 	const char *subtype;
 
-	/* DAX device, may be NULL */
+	/* DAX device for virtiofs, may be NULL */
 	struct dax_device *dax_dev;
 
+	const char *shadow; /* famfs - null if not famfs */
+
 	/* fuse_dev pointer to fill in, should contain NULL on entry */
 	void **fudptr;
 };
@@ -938,6 +940,10 @@ struct fuse_conn {
 	/**  uring connection information*/
 	struct fuse_ring *ring;
 #endif
+
+#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
+	char *shadow;
+#endif
 };
 
 /*
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 5c6947b12503..7f4b73e739cb 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -766,6 +766,9 @@ enum {
 	OPT_ALLOW_OTHER,
 	OPT_MAX_READ,
 	OPT_BLKSIZE,
+#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
+	OPT_SHADOW,
+#endif
 	OPT_ERR
 };
 
@@ -780,6 +783,9 @@ static const struct fs_parameter_spec fuse_fs_parameters[] = {
 	fsparam_u32	("max_read",		OPT_MAX_READ),
 	fsparam_u32	("blksize",		OPT_BLKSIZE),
 	fsparam_string	("subtype",		OPT_SUBTYPE),
+#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
+	fsparam_string("shadow",		OPT_SHADOW),
+#endif
 	{}
 };
 
@@ -875,6 +881,15 @@ static int fuse_parse_param(struct fs_context *fsc, struct fs_parameter *param)
 		ctx->blksize = result.uint_32;
 		break;
 
+#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
+	case OPT_SHADOW:
+		if (ctx->shadow)
+			return invalfc(fsc, "Multiple shadows specified");
+		ctx->shadow = param->string;
+		param->string = NULL;
+		break;
+#endif
+
 	default:
 		return -EINVAL;
 	}
@@ -888,6 +903,7 @@ static void fuse_free_fsc(struct fs_context *fsc)
 
 	if (ctx) {
 		kfree(ctx->subtype);
+		kfree(ctx->shadow);
 		kfree(ctx);
 	}
 }
@@ -919,7 +935,10 @@ static int fuse_show_options(struct seq_file *m, struct dentry *root)
 	else if (fc->dax_mode == FUSE_DAX_INODE_USER)
 		seq_puts(m, ",dax=inode");
 #endif
-
+#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
+	if (fc->shadow)
+		seq_printf(m, ",shadow=%s", fc->shadow);
+#endif
 	return 0;
 }
 
@@ -1825,6 +1844,10 @@ int fuse_fill_super_common(struct super_block *sb, struct fuse_fs_context *ctx)
 	sb->s_root = root_dentry;
 	if (ctx->fudptr)
 		*ctx->fudptr = fud;
+
+#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
+	fc->shadow = kstrdup(ctx->shadow, GFP_KERNEL);
+#endif
 	mutex_unlock(&fuse_mutex);
 	return 0;
 
-- 
2.49.0



* [RFC PATCH 12/19] famfs_fuse: Plumb the GET_FMAP message/response
  2025-04-21  1:33 [RFC PATCH 00/19] famfs: port into fuse John Groves
                   ` (10 preceding siblings ...)
  2025-04-21  1:33 ` [RFC PATCH 11/19] famfs_fuse: Basic famfs mount opts John Groves
@ 2025-04-21  1:33 ` John Groves
  2025-05-02  5:48   ` Joanne Koong
  2025-04-21  1:33 ` [RFC PATCH 13/19] famfs_fuse: Create files with famfs fmaps John Groves
                   ` (9 subsequent siblings)
  21 siblings, 1 reply; 58+ messages in thread
From: John Groves @ 2025-04-21  1:33 UTC (permalink / raw)
  To: John Groves, Dan Williams, Miklos Szeredi, Bernd Schubert
  Cc: John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Darrick J . Wong, Luis Henriques, Randy Dunlap, Jeff Layton,
	Kent Overstreet, Petr Vorel, Brian Foster, linux-doc,
	linux-kernel, nvdimm, linux-cxl, linux-fsdevel, Amir Goldstein,
	Jonathan Cameron, Stefan Hajnoczi, Joanne Koong, Josef Bacik,
	Aravind Ramesh, Ajay Joshi, John Groves

Upon completion of a LOOKUP, if we're in famfs mode we do a GET_FMAP to
retrieve and cache the file-to-dax map in the kernel. If this
succeeds, read/write/mmap are resolved direct-to-dax with no upcalls.

Signed-off-by: John Groves <john@groves.net>
---
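Note (illustrative only, not part of the patch): the decision made after a
successful LOOKUP can be summarized as below. The enum and function names
are hypothetical; in the kernel the two inputs correspond to
fc->famfs_iomap and S_ISREG() on the looked-up inode's mode.

```c
#include <stdbool.h>

/* Hypothetical outcome of the post-LOOKUP check. */
enum model_fmap_action {
	MODEL_FMAP_SKIP_NOT_FAMFS,	/* connection did not negotiate famfs */
	MODEL_FMAP_SKIP_NOT_REGULAR,	/* directories etc. have no fmap */
	MODEL_FMAP_FETCH,		/* send GET_FMAP for this inode */
};

/* Decide whether a GET_FMAP request should follow a LOOKUP. */
static enum model_fmap_action
model_after_lookup(bool famfs_iomap, bool is_regular_file)
{
	if (!famfs_iomap)
		return MODEL_FMAP_SKIP_NOT_FAMFS;
	if (!is_regular_file)
		return MODEL_FMAP_SKIP_NOT_REGULAR;
	return MODEL_FMAP_FETCH;
}
```

Only the last case results in a request to the fuse server; the other two
leave the inode on the ordinary fuse paths.
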
 fs/fuse/dir.c             | 69 +++++++++++++++++++++++++++++++++++++++
 fs/fuse/fuse_i.h          | 36 +++++++++++++++++++-
 fs/fuse/inode.c           | 15 +++++++++
 include/uapi/linux/fuse.h |  4 +++
 4 files changed, 123 insertions(+), 1 deletion(-)

diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index bc29db0117f4..ae135c55b9f6 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -359,6 +359,56 @@ bool fuse_invalid_attr(struct fuse_attr *attr)
 	return !fuse_valid_type(attr->mode) || !fuse_valid_size(attr->size);
 }
 
+#define FMAP_BUFSIZE 4096
+
+#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
+static void
+fuse_get_fmap_init(
+	struct fuse_conn *fc,
+	struct fuse_args *args,
+	u64 nodeid,
+	void *outbuf,
+	size_t outbuf_size)
+{
+	memset(outbuf, 0, outbuf_size);
+	args->opcode = FUSE_GET_FMAP;
+	args->nodeid = nodeid;
+
+	args->in_numargs = 0;
+
+	args->out_numargs = 1;
+	args->out_args[0].size = FMAP_BUFSIZE;
+	args->out_args[0].value = outbuf;
+}
+
+static int
+fuse_get_fmap(struct fuse_mount *fm, struct inode *inode, u64 nodeid)
+{
+	size_t fmap_size;
+	void *fmap_buf;
+	int err;
+
+	pr_notice("%s: nodeid=%lld, inode=%llx\n", __func__,
+		  nodeid, (u64)inode);
+	fmap_buf = kcalloc(1, FMAP_BUFSIZE, GFP_KERNEL);
+	FUSE_ARGS(args);
+	fuse_get_fmap_init(fm->fc, &args, nodeid, fmap_buf, FMAP_BUFSIZE);
+
+	/* Send GET_FMAP command */
+	err = fuse_simple_request(fm, &args);
+	if (err) {
+		pr_err("%s: err=%d from fuse_simple_request()\n",
+		       __func__, err);
+		return err;
+	}
+
+	fmap_size = args.out_args[0].size;
+	pr_notice("%s: nodei=%lld fmap_size=%ld\n", __func__, nodeid, fmap_size);
+
+	return 0;
+}
+#endif
+
 int fuse_lookup_name(struct super_block *sb, u64 nodeid, const struct qstr *name,
 		     struct fuse_entry_out *outarg, struct inode **inode)
 {
@@ -404,6 +454,25 @@ int fuse_lookup_name(struct super_block *sb, u64 nodeid, const struct qstr *name
 		fuse_queue_forget(fm->fc, forget, outarg->nodeid, 1);
 		goto out;
 	}
+
+#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
+	if (fm->fc->famfs_iomap) {
+		if (S_ISREG((*inode)->i_mode)) {
+			/* Note Lookup returns the looked-up inode in the attr
+			 * struct, but not in outarg->nodeid !
+			 */
+			pr_notice("%s: outarg: size=%d nodeid=%lld attr.ino=%lld\n",
+				 __func__, args.out_args[0].size, outarg->nodeid,
+				 outarg->attr.ino);
+			/* Get the famfs fmap */
+			fuse_get_fmap(fm, *inode, outarg->attr.ino);
+		} else
+			pr_notice("%s: no get_fmap for non-regular file\n",
+				 __func__);
+	} else
+		pr_notice("%s: fc->dax_iomap is not set\n", __func__);
+#endif
+
 	err = 0;
 
  out_put_forget:
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 931613102d32..437177c2f092 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -193,6 +193,10 @@ struct fuse_inode {
 	/** Reference to backing file in passthrough mode */
 	struct fuse_backing *fb;
 #endif
+
+#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
+	void *famfs_meta;
+#endif
 };
 
 /** FUSE inode state bits */
@@ -942,6 +946,8 @@ struct fuse_conn {
 #endif
 
 #if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
+	struct rw_semaphore famfs_devlist_sem;
+	struct famfs_dax_devlist *dax_devlist;
 	char *shadow;
 #endif
 };
@@ -1432,11 +1438,14 @@ void fuse_free_conn(struct fuse_conn *fc);
 
 /* dax.c */
 
+static inline int fuse_file_famfs(struct fuse_inode *fi); /* forward */
+
 /* This macro is used by virtio_fs, but now it also needs to filter for
  * "not famfs"
  */
 #define FUSE_IS_VIRTIO_DAX(fuse_inode) (IS_ENABLED(CONFIG_FUSE_DAX)	\
-					&& IS_DAX(&fuse_inode->inode))
+					&& IS_DAX(&fuse_inode->inode)	\
+					&& !fuse_file_famfs(fuse_inode))
 
 ssize_t fuse_dax_read_iter(struct kiocb *iocb, struct iov_iter *to);
 ssize_t fuse_dax_write_iter(struct kiocb *iocb, struct iov_iter *from);
@@ -1547,4 +1556,29 @@ extern void fuse_sysctl_unregister(void);
 #define fuse_sysctl_unregister()	do { } while (0)
 #endif /* CONFIG_SYSCTL */
 
+/* famfs.c */
+static inline struct fuse_backing *famfs_meta_set(struct fuse_inode *fi,
+						       void *meta)
+{
+#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
+	return xchg(&fi->famfs_meta, meta);
+#else
+	return NULL;
+#endif
+}
+
+static inline void famfs_meta_free(struct fuse_inode *fi)
+{
+	/* Stub will be connected in a subsequent commit */
+}
+
+static inline int fuse_file_famfs(struct fuse_inode *fi)
+{
+#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
+	return (fi->famfs_meta != NULL);
+#else
+	return 0;
+#endif
+}
+
 #endif /* _FS_FUSE_I_H */
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 7f4b73e739cb..848c8818e6f7 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -117,6 +117,9 @@ static struct inode *fuse_alloc_inode(struct super_block *sb)
 	if (IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
 		fuse_inode_backing_set(fi, NULL);
 
+	if (IS_ENABLED(CONFIG_FUSE_FAMFS_DAX))
+		famfs_meta_set(fi, NULL);
+
 	return &fi->inode;
 
 out_free_forget:
@@ -138,6 +141,13 @@ static void fuse_free_inode(struct inode *inode)
 	if (IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
 		fuse_backing_put(fuse_inode_backing(fi));
 
+#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
+	if (S_ISREG(inode->i_mode) && fi->famfs_meta) {
+		famfs_meta_free(fi);
+		famfs_meta_set(fi, NULL);
+	}
+#endif
+
 	kmem_cache_free(fuse_inode_cachep, fi);
 }
 
@@ -1002,6 +1012,11 @@ void fuse_conn_init(struct fuse_conn *fc, struct fuse_mount *fm,
 	if (IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
 		fuse_backing_files_init(fc);
 
+	if (IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)) {
+		pr_notice("%s: Kernel is FUSE_FAMFS_DAX capable\n", __func__);
+		init_rwsem(&fc->famfs_devlist_sem);
+	}
+
 	INIT_LIST_HEAD(&fc->mounts);
 	list_add(&fm->fc_entry, &fc->mounts);
 	fm->fc = fc;
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index f9e14180367a..d85fb692cf3b 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -652,6 +652,10 @@ enum fuse_opcode {
 	FUSE_TMPFILE		= 51,
 	FUSE_STATX		= 52,
 
+	/* Famfs / devdax opcodes */
+	FUSE_GET_FMAP           = 53,
+	FUSE_GET_DAXDEV         = 54,
+
 	/* CUSE specific operations */
 	CUSE_INIT		= 4096,
 
-- 
2.49.0



* [RFC PATCH 13/19] famfs_fuse: Create files with famfs fmaps
  2025-04-21  1:33 [RFC PATCH 00/19] famfs: port into fuse John Groves
                   ` (11 preceding siblings ...)
  2025-04-21  1:33 ` [RFC PATCH 12/19] famfs_fuse: Plumb the GET_FMAP message/response John Groves
@ 2025-04-21  1:33 ` John Groves
  2025-04-21 21:57   ` Darrick J. Wong
  2025-04-24 13:43   ` John Groves
  2025-04-21  1:33 ` [RFC PATCH 14/19] famfs_fuse: GET_DAXDEV message and daxdev_table John Groves
                   ` (8 subsequent siblings)
  21 siblings, 2 replies; 58+ messages in thread
From: John Groves @ 2025-04-21  1:33 UTC (permalink / raw)
  To: John Groves, Dan Williams, Miklos Szeredi, Bernd Schubert
  Cc: John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Darrick J . Wong, Luis Henriques, Randy Dunlap, Jeff Layton,
	Kent Overstreet, Petr Vorel, Brian Foster, linux-doc,
	linux-kernel, nvdimm, linux-cxl, linux-fsdevel, Amir Goldstein,
	Jonathan Cameron, Stefan Hajnoczi, Joanne Koong, Josef Bacik,
	Aravind Ramesh, Ajay Joshi, John Groves

On completion of GET_FMAP message/response, set up the full famfs
metadata such that it's possible to handle read/write/mmap directly to
dax. Note that the devdax_iomap plumbing is not in yet...

Update MAINTAINERS for the new files.

Signed-off-by: John Groves <john@groves.net>
---
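Note (illustrative only, not part of the patch): a user-space sketch of
the validation that the simple-extent path performs on an fmap message.
The struct layouts, MODEL_MAX_EXTENTS, and the 2 MiB alignment constant are
assumptions for illustration; the real wire format is defined in
include/uapi/linux/fuse.h by this patch, and the kernel uses PMD_SIZE and
FUSE_FAMFS_MAX_EXTENTS. The checks mirror the ones below: the header and
extent list must fit in the buffer, the extent count must be in range,
each extent must reference daxdev 0 and be aligned, and the extents must
cover the file size.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MODEL_ALIGN		(2ULL * 1024 * 1024)	/* stands in for PMD_SIZE */
#define MODEL_MAX_EXTENTS	8			/* hypothetical limit */

/* Hypothetical wire layouts; not the structs from fuse.h. */
struct model_fmap_header {
	uint64_t file_size;
	uint32_t nextents;
	uint32_t reserved;
};

struct model_simple_ext {
	uint64_t devindex;
	uint64_t offset;
	uint64_t len;
};

/* Returns 0 if the message is well formed, or a negative errno. */
static int model_check_fmap(const void *buf, size_t buf_size)
{
	struct model_fmap_header hdr;
	struct model_simple_ext ext;
	size_t next = sizeof(hdr);
	uint64_t total = 0;
	uint32_t i;

	assert(buf != NULL);

	if (next > buf_size)
		return -EINVAL;		/* header does not fit */
	memcpy(&hdr, buf, sizeof(hdr));

	if (hdr.nextents < 1)
		return -EINVAL;
	if (hdr.nextents > MODEL_MAX_EXTENTS)
		return -E2BIG;

	/* Divide rather than multiply so the bound cannot overflow. */
	if ((buf_size - next) / sizeof(ext) < hdr.nextents)
		return -EINVAL;		/* extent list does not fit */

	for (i = 0; i < hdr.nextents; i++) {
		memcpy(&ext, (const char *)buf + next, sizeof(ext));
		next += sizeof(ext);

		if (ext.devindex != 0 ||
		    ext.offset % MODEL_ALIGN || ext.len % MODEL_ALIGN)
			return -EINVAL;	/* wrong device or misaligned */
		total += ext.len;
	}

	/* The extents must be able to back the whole file. */
	return total < hdr.file_size ? -EINVAL : 0;
}
```

One deliberate difference from the patch: the sketch bounds the extent
count by division before walking the list, whereas the patch advances the
offset by nextents * sizeof() and compares afterwards.
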
 MAINTAINERS               |   9 +
 fs/fuse/Makefile          |   2 +-
 fs/fuse/dir.c             |   3 +
 fs/fuse/famfs.c           | 344 ++++++++++++++++++++++++++++++++++++++
 fs/fuse/famfs_kfmap.h     |  63 +++++++
 fs/fuse/fuse_i.h          |  16 +-
 fs/fuse/inode.c           |   2 +-
 include/uapi/linux/fuse.h |  42 +++++
 8 files changed, 477 insertions(+), 4 deletions(-)
 create mode 100644 fs/fuse/famfs.c
 create mode 100644 fs/fuse/famfs_kfmap.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 00e94bec401e..2a5a7e0e8b28 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8808,6 +8808,15 @@ F:	Documentation/networking/failover.rst
 F:	include/net/failover.h
 F:	net/core/failover.c
 
+FAMFS
+M:	John Groves <jgroves@micron.com>
+M:	John Groves <John@Groves.net>
+L:	linux-cxl@vger.kernel.org
+L:	linux-fsdevel@vger.kernel.org
+S:	Supported
+F:	fs/fuse/famfs.c
+F:	fs/fuse/famfs_kfmap.h
+
 FANOTIFY
 M:	Jan Kara <jack@suse.cz>
 R:	Amir Goldstein <amir73il@gmail.com>
diff --git a/fs/fuse/Makefile b/fs/fuse/Makefile
index 3f0f312a31c1..65a12975d734 100644
--- a/fs/fuse/Makefile
+++ b/fs/fuse/Makefile
@@ -16,5 +16,5 @@ fuse-$(CONFIG_FUSE_DAX) += dax.o
 fuse-$(CONFIG_FUSE_PASSTHROUGH) += passthrough.o
 fuse-$(CONFIG_SYSCTL) += sysctl.o
 fuse-$(CONFIG_FUSE_IO_URING) += dev_uring.o
-
+fuse-$(CONFIG_FUSE_FAMFS_DAX) += famfs.o
 virtiofs-y := virtio_fs.o
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index ae135c55b9f6..b28a1e912d6b 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -405,6 +405,9 @@ fuse_get_fmap(struct fuse_mount *fm, struct inode *inode, u64 nodeid)
 	fmap_size = args.out_args[0].size;
 	pr_notice("%s: nodei=%lld fmap_size=%ld\n", __func__, nodeid, fmap_size);
 
+	/* Convert fmap into in-memory format and hang from inode */
+	famfs_file_init_dax(fm, inode, fmap_buf, fmap_size);
+
 	return 0;
 }
 #endif
diff --git a/fs/fuse/famfs.c b/fs/fuse/famfs.c
new file mode 100644
index 000000000000..e62c047d0950
--- /dev/null
+++ b/fs/fuse/famfs.c
@@ -0,0 +1,344 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * famfs - dax file system for shared fabric-attached memory
+ *
+ * Copyright 2023-2025 Micron Technology, Inc.
+ *
+ * This file system, originally based on ramfs and the dax support from xfs,
+ * is intended to allow multiple host systems to mount a common file system
+ * view of dax files that map to shared memory.
+ */
+
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/dax.h>
+#include <linux/iomap.h>
+#include <linux/path.h>
+#include <linux/namei.h>
+#include <linux/string.h>
+
+#include "famfs_kfmap.h"
+#include "fuse_i.h"
+
+
+void
+__famfs_meta_free(void *famfs_meta)
+{
+	struct famfs_file_meta *fmap = famfs_meta;
+
+	if (!fmap)
+		return;
+
+	if (fmap) {
+		switch (fmap->fm_extent_type) {
+		case SIMPLE_DAX_EXTENT:
+			kfree(fmap->se);
+			break;
+		case INTERLEAVED_EXTENT:
+			if (fmap->ie)
+				kfree(fmap->ie->ie_strips);
+
+			kfree(fmap->ie);
+			break;
+		default:
+			pr_err("%s: invalid fmap type\n", __func__);
+			break;
+		}
+	}
+	kfree(fmap);
+}
+
+static int
+famfs_check_ext_alignment(struct famfs_meta_simple_ext *se)
+{
+	int errs = 0;
+
+	if (se->dev_index != 0)
+		errs++;
+
+	/* TODO: pass in alignment so we can support the other page sizes */
+	if (!IS_ALIGNED(se->ext_offset, PMD_SIZE))
+		errs++;
+
+	if (!IS_ALIGNED(se->ext_len, PMD_SIZE))
+		errs++;
+
+	return errs;
+}
+
+/**
+ * famfs_meta_alloc_v3() - Allocate famfs file metadata
+ * @fmap_buf:   Buffer holding the fmap message (fmap_buf_size bytes)
+ * @metap:      Pointer to a famfs_file_meta pointer, set on success
+ */
+static int
+famfs_meta_alloc_v3(
+	void *fmap_buf,
+	size_t fmap_buf_size,
+	struct famfs_file_meta **metap)
+{
+	struct famfs_file_meta *meta = NULL;
+	struct fuse_famfs_fmap_header *fmh;
+	size_t extent_total = 0;
+	size_t next_offset = 0;
+	int errs = 0;
+	int i, j;
+	int rc;
+
+	fmh = (struct fuse_famfs_fmap_header *)fmap_buf;
+
+	/* Move past fmh in fmap_buf */
+	next_offset += sizeof(*fmh);
+	if (next_offset > fmap_buf_size) {
+		pr_err("%s:%d: fmap_buf underflow offset/size %ld/%ld\n",
+		       __func__, __LINE__, next_offset, fmap_buf_size);
+		rc = -EINVAL;
+		goto errout;
+	}
+
+	if (fmh->nextents < 1) {
+		pr_err("%s: nextents %d < 1\n", __func__, fmh->nextents);
+		rc = -EINVAL;
+		goto errout;
+	}
+
+	if (fmh->nextents > FUSE_FAMFS_MAX_EXTENTS) {
+		pr_err("%s: nextents %d > max (%d)\n",
+		       __func__, fmh->nextents, FUSE_FAMFS_MAX_EXTENTS);
+		rc = -E2BIG;
+		goto errout;
+	}
+
+	meta = kzalloc(sizeof(*meta), GFP_KERNEL);
+	if (!meta)
+		return -ENOMEM;
+	meta->error = false;
+
+	meta->file_type = fmh->file_type;
+	meta->file_size = fmh->file_size;
+	meta->fm_extent_type = fmh->ext_type;
+
+	switch (fmh->ext_type) {
+	case FUSE_FAMFS_EXT_SIMPLE: {
+		struct fuse_famfs_simple_ext *se_in;
+
+		se_in = (struct fuse_famfs_simple_ext *)(fmap_buf + next_offset);
+
+		/* Move past simple extents */
+		next_offset += fmh->nextents * sizeof(*se_in);
+		if (next_offset > fmap_buf_size) {
+			pr_err("%s:%d: fmap_buf underflow offset/size %ld/%ld\n",
+			       __func__, __LINE__, next_offset, fmap_buf_size);
+			rc = -EINVAL;
+			goto errout;
+		}
+
+		meta->fm_nextents = fmh->nextents;
+
+		meta->se = kcalloc(meta->fm_nextents, sizeof(*(meta->se)),
+				   GFP_KERNEL);
+		if (!meta->se) {
+			rc = -ENOMEM;
+			goto errout;
+		}
+
+		if ((meta->fm_nextents > FUSE_FAMFS_MAX_EXTENTS) ||
+		    (meta->fm_nextents < 1)) {
+			rc = -EINVAL;
+			goto errout;
+		}
+
+		for (i = 0; i < fmh->nextents; i++) {
+			meta->se[i].dev_index  = se_in[i].se_devindex;
+			meta->se[i].ext_offset = se_in[i].se_offset;
+			meta->se[i].ext_len    = se_in[i].se_len;
+
+			/* Record bitmap of referenced daxdev indices */
+			meta->dev_bitmap |= (1 << meta->se[i].dev_index);
+
+			errs += famfs_check_ext_alignment(&meta->se[i]);
+
+			extent_total += meta->se[i].ext_len;
+		}
+		break;
+	}
+
+	case FUSE_FAMFS_EXT_INTERLEAVE: {
+		s64 size_remainder = meta->file_size;
+		struct fuse_famfs_iext *ie_in;
+		int niext = fmh->nextents;
+
+		meta->fm_niext = niext;
+
+		/* Allocate interleaved extent */
+		meta->ie = kcalloc(niext, sizeof(*(meta->ie)), GFP_KERNEL);
+		if (!meta->ie) {
+			rc = -ENOMEM;
+			goto errout;
+		}
+
+		/*
+		 * Each interleaved extent has a simple extent list of strips.
+		 * Outer loop is over separate interleaved extents
+		 */
+		for (i = 0; i < niext; i++) {
+			u64 nstrips;
+			struct fuse_famfs_simple_ext *sie_in;
+
+			/* ie_in = one interleaved extent in fmap_buf */
+			ie_in = (struct fuse_famfs_iext *)
+				(fmap_buf + next_offset);
+
+			/* Move past one interleaved extent header in fmap_buf */
+			next_offset += sizeof(*ie_in);
+			if (next_offset > fmap_buf_size) {
+				pr_err("%s:%d: fmap_buf underflow offset/size %ld/%ld\n",
+				       __func__, __LINE__, next_offset, fmap_buf_size);
+				rc = -EINVAL;
+				goto errout;
+			}
+
+			nstrips = ie_in->ie_nstrips;
+			meta->ie[i].fie_chunk_size = ie_in->ie_chunk_size;
+			meta->ie[i].fie_nstrips    = ie_in->ie_nstrips;
+			meta->ie[i].fie_nbytes     = ie_in->ie_nbytes;
+
+			if (!meta->ie[i].fie_nbytes) {
+				pr_err("%s: zero-length interleave!\n",
+				       __func__);
+				rc = -EINVAL;
+				goto errout;
+			}
+
+			/* sie_in = the strip extents in fmap_buf */
+			sie_in = (struct fuse_famfs_simple_ext *)
+				(fmap_buf + next_offset);
+
+			/* Move past strip extents in fmap_buf */
+			next_offset += nstrips * sizeof(*sie_in);
+			if (next_offset > fmap_buf_size) {
+				pr_err("%s:%d: fmap_buf underflow offset/size %ld/%ld\n",
+				       __func__, __LINE__, next_offset, fmap_buf_size);
+				rc = -EINVAL;
+				goto errout;
+			}
+
+			if ((nstrips > FUSE_FAMFS_MAX_STRIPS) || (nstrips < 1)) {
+				pr_err("%s: invalid nstrips=%lld (max=%d)\n",
+				       __func__, nstrips,
+				       FUSE_FAMFS_MAX_STRIPS);
+				errs++;
+			}
+
+			/* Allocate strip extent array */
+			meta->ie[i].ie_strips = kcalloc(ie_in->ie_nstrips,
+					sizeof(meta->ie[i].ie_strips[0]),
+							GFP_KERNEL);
+			if (!meta->ie[i].ie_strips) {
+				rc = -ENOMEM;
+				goto errout;
+			}
+
+			/* Inner loop is over strips */
+			for (j = 0; j < nstrips; j++) {
+				struct famfs_meta_simple_ext *strips_out;
+				u64 devindex = sie_in[j].se_devindex;
+				u64 offset   = sie_in[j].se_offset;
+				u64 len      = sie_in[j].se_len;
+
+				strips_out = meta->ie[i].ie_strips;
+				strips_out[j].dev_index  = devindex;
+				strips_out[j].ext_offset = offset;
+				strips_out[j].ext_len    = len;
+
+				/* Record bitmap of referenced daxdev indices */
+				meta->dev_bitmap |= (1ULL << devindex);
+
+				extent_total += len;
+				errs += famfs_check_ext_alignment(&strips_out[j]);
+				size_remainder -= len;
+			}
+		}
+
+		if (size_remainder > 0) {
+			/* Sum of interleaved extent sizes is less than file size! */
+			pr_err("%s: size_remainder %lld (0x%llx)\n",
+			       __func__, size_remainder, size_remainder);
+			rc = -EINVAL;
+			goto errout;
+		}
+		break;
+	}
+
+	default:
+		pr_err("%s: invalid ext_type %d\n", __func__, fmh->ext_type);
+		rc = -EINVAL;
+		goto errout;
+	}
+
+	if (errs > 0) {
+		pr_err("%s: %d alignment errors found\n", __func__, errs);
+		rc = -EINVAL;
+		goto errout;
+	}
+
+	/* More sanity checks */
+	if (extent_total < meta->file_size) {
+		pr_err("%s: file size %ld larger than map size %ld\n",
+		       __func__, meta->file_size, extent_total);
+		rc = -EINVAL;
+		goto errout;
+	}
+
+	*metap = meta;
+
+	return 0;
+errout:
+	__famfs_meta_free(meta);
+	return rc;
+}
+
+int
+famfs_file_init_dax(
+	struct fuse_mount *fm,
+	struct inode *inode,
+	void *fmap_buf,
+	size_t fmap_size)
+{
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	struct famfs_file_meta *meta = NULL;
+	int rc;
+
+	if (fi->famfs_meta) {
+		pr_notice("%s: i_no=%ld fmap_size=%ld ALREADY INITIALIZED\n",
+			  __func__,
+			  inode->i_ino, fmap_size);
+		return -EEXIST;
+	}
+
+	rc = famfs_meta_alloc_v3(fmap_buf, fmap_size, &meta);
+	if (rc)
+		goto errout;
+
+	/* Publish the famfs metadata on fi->famfs_meta */
+	inode_lock(inode);
+	if (fi->famfs_meta) {
+		rc = -EEXIST; /* file already has famfs metadata */
+	} else {
+		if (famfs_meta_set(fi, meta) != NULL) {
+			pr_err("%s: file already had metadata\n", __func__);
+			rc = -EALREADY;
+			goto errout;
+		}
+		i_size_write(inode, meta->file_size);
+		inode->i_flags |= S_DAX;
+	}
+	inode_unlock(inode);
+
+ errout:
+	if (rc)
+		__famfs_meta_free(meta);
+
+	return rc;
+}
+
diff --git a/fs/fuse/famfs_kfmap.h b/fs/fuse/famfs_kfmap.h
new file mode 100644
index 000000000000..ce785d76719c
--- /dev/null
+++ b/fs/fuse/famfs_kfmap.h
@@ -0,0 +1,63 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * famfs - dax file system for shared fabric-attached memory
+ *
+ * Copyright 2023-2025 Micron Technology, Inc.
+ */
+#ifndef FAMFS_KFMAP_H
+#define FAMFS_KFMAP_H
+
+/*
+ * These structures are the in-memory metadata format for famfs files. Metadata
+ * retrieved via the GET_FMAP response is converted to this format for use in
+ * resolving file mapping faults.
+ */
+
+enum famfs_file_type {
+	FAMFS_REG,
+	FAMFS_SUPERBLOCK,
+	FAMFS_LOG,
+};
+
+/* We anticipate the possibility of supporting additional types of extents */
+enum famfs_extent_type {
+	SIMPLE_DAX_EXTENT,
+	INTERLEAVED_EXTENT,
+	INVALID_EXTENT_TYPE,
+};
+
+struct famfs_meta_simple_ext {
+	u64 dev_index;
+	u64 ext_offset;
+	u64 ext_len;
+};
+
+struct famfs_meta_interleaved_ext {
+	u64 fie_nstrips;
+	u64 fie_chunk_size;
+	u64 fie_nbytes;
+	struct famfs_meta_simple_ext *ie_strips;
+};
+
+/*
+ * Each famfs dax file has this hanging from its fuse_inode->famfs_meta
+ */
+struct famfs_file_meta {
+	bool                   error;
+	enum famfs_file_type   file_type;
+	size_t                 file_size;
+	enum famfs_extent_type fm_extent_type;
+	u64 dev_bitmap; /* bitmap of referenced daxdevs by index */
+	union { /* This will make code a bit more readable */
+		struct {
+			size_t         fm_nextents;
+			struct famfs_meta_simple_ext  *se;
+		};
+		struct {
+			size_t         fm_niext;
+			struct famfs_meta_interleaved_ext *ie;
+		};
+	};
+};
+
+#endif /* FAMFS_KFMAP_H */
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 437177c2f092..d8e0ac784224 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -1557,11 +1557,18 @@ extern void fuse_sysctl_unregister(void);
 #endif /* CONFIG_SYSCTL */
 
 /* famfs.c */
+#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
+int famfs_file_init_dax(struct fuse_mount *fm,
+			     struct inode *inode, void *fmap_buf,
+			     size_t fmap_size);
+void __famfs_meta_free(void *map);
+#endif
+
 static inline struct fuse_backing *famfs_meta_set(struct fuse_inode *fi,
 						       void *meta)
 {
 #if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
-	return xchg(&fi->famfs_meta, meta);
+	return cmpxchg(&fi->famfs_meta, NULL, meta);
 #else
 	return NULL;
 #endif
@@ -1569,7 +1576,12 @@ static inline struct fuse_backing *famfs_meta_set(struct fuse_inode *fi,
 
 static inline void famfs_meta_free(struct fuse_inode *fi)
 {
-	/* Stub wil be connected in a subsequent commit */
+#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
+	if (fi->famfs_meta != NULL) {
+		__famfs_meta_free(fi->famfs_meta);
+		famfs_meta_set(fi, NULL);
+	}
+#endif
 }
 
 static inline int fuse_file_famfs(struct fuse_inode *fi)
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 848c8818e6f7..e86bf330117f 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -118,7 +118,7 @@ static struct inode *fuse_alloc_inode(struct super_block *sb)
 		fuse_inode_backing_set(fi, NULL);
 
 	if (IS_ENABLED(CONFIG_FUSE_FAMFS_DAX))
-		famfs_meta_set(fi, NULL);
+		fi->famfs_meta = NULL; /* XXX new inodes currently not zeroed; why not? */
 
 	return &fi->inode;
 
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index d85fb692cf3b..0f6ff1ffb23d 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -1286,4 +1286,46 @@ struct fuse_uring_cmd_req {
 	uint8_t padding[6];
 };
 
+/* Famfs fmap message components */
+
+#define FAMFS_FMAP_VERSION 1
+
+#define FUSE_FAMFS_MAX_EXTENTS 2
+#define FUSE_FAMFS_MAX_STRIPS 16
+
+enum fuse_famfs_file_type {
+	FUSE_FAMFS_FILE_REG,
+	FUSE_FAMFS_FILE_SUPERBLOCK,
+	FUSE_FAMFS_FILE_LOG,
+};
+
+enum famfs_ext_type {
+	FUSE_FAMFS_EXT_SIMPLE = 0,
+	FUSE_FAMFS_EXT_INTERLEAVE = 1,
+};
+
+struct fuse_famfs_simple_ext {
+	uint32_t se_devindex;
+	uint32_t reserved;
+	uint64_t se_offset;
+	uint64_t se_len;
+};
+
+struct fuse_famfs_iext { /* Interleaved extent */
+	uint32_t ie_nstrips;
+	uint32_t ie_chunk_size;
+	uint64_t ie_nbytes; /* Total bytes for this interleaved_ext; sum of strips may be more */
+	uint64_t reserved;
+};
+
+struct fuse_famfs_fmap_header {
+	uint8_t file_type; /* enum fuse_famfs_file_type */
+	uint8_t reserved;
+	uint16_t fmap_version;
+	uint32_t ext_type; /* enum famfs_ext_type */
+	uint32_t nextents;
+	uint32_t reserved0;
+	uint64_t file_size;
+	uint64_t reserved1;
+};
 #endif /* _LINUX_FUSE_H */
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [RFC PATCH 14/19] famfs_fuse: GET_DAXDEV message and daxdev_table
  2025-04-21  1:33 [RFC PATCH 00/19] famfs: port into fuse John Groves
                   ` (12 preceding siblings ...)
  2025-04-21  1:33 ` [RFC PATCH 13/19] famfs_fuse: Create files with famfs fmaps John Groves
@ 2025-04-21  1:33 ` John Groves
  2025-04-21  3:43   ` Randy Dunlap
  2025-04-21  1:33 ` [RFC PATCH 15/19] famfs_fuse: Plumb dax iomap and fuse read/write/mmap John Groves
                   ` (7 subsequent siblings)
  21 siblings, 1 reply; 58+ messages in thread
From: John Groves @ 2025-04-21  1:33 UTC (permalink / raw)
  To: John Groves, Dan Williams, Miklos Szeredi, Bernd Schubert
  Cc: John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Darrick J . Wong, Luis Henriques, Randy Dunlap, Jeff Layton,
	Kent Overstreet, Petr Vorel, Brian Foster, linux-doc,
	linux-kernel, nvdimm, linux-cxl, linux-fsdevel, Amir Goldstein,
	Jonathan Cameron, Stefan Hajnoczi, Joanne Koong, Josef Bacik,
	Aravind Ramesh, Ajay Joshi, John Groves

* The new GET_DAXDEV message/response is enabled
* The command is triggered by the famfs_update_daxdev_table() call, if
  there are any daxdevs in the subject fmap that are not represented in
  the daxdev_table yet.
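
As a rough illustration of the lookup described above, here is a minimal
userspace sketch (not the kernel code). It assumes a fixed-size table and
replaces the fuse round trip with a stub; the sketch_* names are
hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_MAX_DAXDEVS 24	/* mirrors MAX_DAXDEVS in the patch */

/* Hypothetical stand-in for struct famfs_daxdev; only 'valid' matters here */
struct sketch_daxdev {
	bool valid;
};

/* Hypothetical stand-in for famfs_fuse_get_daxdev(): pretend the server answered */
static int sketch_get_daxdev(struct sketch_daxdev *table, unsigned int index)
{
	assert(index < SKETCH_MAX_DAXDEVS);
	table[index].valid = true;
	return 0;
}

/*
 * Walk the fmap's dev_bitmap and fetch every referenced daxdev that is
 * not yet valid in the table. Returns the number of lookups issued.
 */
static int sketch_update_daxdev_table(struct sketch_daxdev *table,
				      uint64_t dev_bitmap)
{
	int nfetched = 0;

	for (unsigned int i = 0; i < SKETCH_MAX_DAXDEVS; i++) {
		if (!(dev_bitmap & (1ULL << i)))
			continue;	/* fmap does not reference index i */
		if (table[i].valid)
			continue;	/* already known, no message needed */
		if (sketch_get_daxdev(table, i) == 0)
			nfetched++;
	}
	return nfetched;
}
```

With index 0 already valid and a bitmap of 0x5 (indices 0 and 2), only
index 2 triggers a lookup; a second pass over the same bitmap issues none.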

Signed-off-by: John Groves <john@groves.net>
---
 fs/fuse/famfs.c           | 281 ++++++++++++++++++++++++++++++++++++--
 fs/fuse/famfs_kfmap.h     |  23 ++++
 fs/fuse/fuse_i.h          |   4 +
 fs/fuse/inode.c           |   2 +
 fs/namei.c                |   1 +
 include/uapi/linux/fuse.h |  15 ++
 6 files changed, 316 insertions(+), 10 deletions(-)

diff --git a/fs/fuse/famfs.c b/fs/fuse/famfs.c
index e62c047d0950..2e182cb7d7c9 100644
--- a/fs/fuse/famfs.c
+++ b/fs/fuse/famfs.c
@@ -20,6 +20,250 @@
 #include "famfs_kfmap.h"
 #include "fuse_i.h"
 
+/*
+ * famfs_teardown()
+ *
+ * Deallocate famfs metadata for a fuse_conn
+ */
+void
+famfs_teardown(struct fuse_conn *fc)
+{
+	struct famfs_dax_devlist *devlist = fc->dax_devlist;
+	int i;
+
+	fc->dax_devlist = NULL;
+
+	if (!devlist)
+		return;
+
+	if (!devlist->devlist)
+		goto out;
+
+	/* Close & release all the daxdevs in our table */
+	for (i = 0; i < devlist->nslots; i++) {
+		if (devlist->devlist[i].valid && devlist->devlist[i].devp)
+			fs_put_dax(devlist->devlist[i].devp, fc);
+	}
+	kfree(devlist->devlist);
+
+out:
+	kfree(devlist);
+}
+
+static int
+famfs_verify_daxdev(const char *pathname, dev_t *devno)
+{
+	struct inode *inode;
+	struct path path;
+	int err;
+
+	if (!pathname || !*pathname)
+		return -EINVAL;
+
+	err = kern_path(pathname, LOOKUP_FOLLOW, &path);
+	if (err)
+		return err;
+
+	inode = d_backing_inode(path.dentry);
+	if (!S_ISCHR(inode->i_mode)) {
+		err = -EINVAL;
+		goto out_path_put;
+	}
+
+	if (!may_open_dev(&path)) { /* had to export this */
+		err = -EACCES;
+		goto out_path_put;
+	}
+
+	*devno = inode->i_rdev;
+
+out_path_put:
+	path_put(&path);
+	return err;
+}
+
+/**
+ * famfs_fuse_get_daxdev()
+ *
+ * Send a GET_DAXDEV message to the fuse server to retrieve info on a
+ * dax device.
+ *
+ * @fm    - fuse_mount
+ * @index - the index of the dax device; daxdevs are referred to by index
+ *          in fmaps, and the server resolves the index to a particular daxdev
+ *
+ * Returns: 0=success
+ *          -errno=failure
+ */
+static int
+famfs_fuse_get_daxdev(struct fuse_mount *fm, const u64 index)
+{
+	struct fuse_daxdev_out daxdev_out = { 0 };
+	struct fuse_conn *fc = fm->fc;
+	struct famfs_daxdev *daxdev;
+	int err = 0;
+
+	FUSE_ARGS(args);
+
+	pr_notice("%s: index=%lld\n", __func__, index);
+
+	/* Store the daxdev in our table */
+	if (index >= fc->dax_devlist->nslots) {
+		pr_err("%s: index(%lld) > nslots(%d)\n",
+		       __func__, index, fc->dax_devlist->nslots);
+		err = -EINVAL;
+		goto out;
+	}
+
+	args.opcode = FUSE_GET_DAXDEV;
+	args.nodeid = index;
+
+	args.in_numargs = 0;
+
+	args.out_numargs = 1;
+	args.out_args[0].size = sizeof(daxdev_out);
+	args.out_args[0].value = &daxdev_out;
+
+	/* Send GET_DAXDEV command */
+	err = fuse_simple_request(fm, &args);
+	if (err) {
+		pr_err("%s: err=%d from fuse_simple_request()\n",
+		       __func__, err);
+		/* Error will be that the payload is smaller than FMAP_BUFSIZE,
+		 * which is the max we can handle. Empty payload handled below.
+		 */
+		goto out;
+	}
+
+	down_write(&fc->famfs_devlist_sem);
+
+	daxdev = &fc->dax_devlist->devlist[index];
+	pr_debug("%s: dax_devlist %llx daxdev[%lld]=%llx\n", __func__,
+		 (u64)fc->dax_devlist, index, (u64)daxdev);
+
+	/* Abort if daxdev is now valid */
+	if (daxdev->valid) {
+		up_write(&fc->famfs_devlist_sem);
+		/* We already have a valid entry at this index */
+		err = -EALREADY;
+		goto out;
+	}
+
+	/* This verifies that the dev is valid and can be opened and gets the devno */
+	pr_debug("%s: famfs_verify_daxdev(%s)\n", __func__, daxdev_out.name);
+	err = famfs_verify_daxdev(daxdev_out.name, &daxdev->devno);
+	if (err) {
+		up_write(&fc->famfs_devlist_sem);
+		pr_err("%s: err=%d from famfs_verify_daxdev()\n", __func__, err);
+		goto out;
+	}
+
+	/* This will fail if it's not a dax device */
+	pr_debug("%s: dax_dev_get(%x)\n", __func__, daxdev->devno);
+	daxdev->devp = dax_dev_get(daxdev->devno);
+	if (!daxdev->devp) {
+		up_write(&fc->famfs_devlist_sem);
+		pr_warn("%s: device %s not found or not dax\n",
+			__func__, daxdev_out.name);
+		err = -ENODEV;
+		goto out;
+	}
+
+	daxdev->name = kstrdup(daxdev_out.name, GFP_KERNEL);
+	wmb(); /* all daxdev fields must be visible before marking it valid */
+	daxdev->valid = 1;
+
+	up_write(&fc->famfs_devlist_sem);
+
+	pr_debug("%s: daxdev(%lld, %s)=%llx opened and marked valid\n",
+		 __func__, index, daxdev->name, (u64)daxdev);
+
+out:
+	return err;
+}
+
+/**
+ * famfs_update_daxdev_table()
+ *
+ * This function is called for each new file fmap, to verify whether all
+ * referenced daxdevs are already known (i.e. in the table). Any daxdev
+ * indices that are not in the table will be retrieved via
+ * famfs_fuse_get_daxdev()
+ * @fm   - fuse_mount
+ * @meta - famfs_file_meta, in-memory format, built from a GET_FMAP response
+ *
+ * Returns: 0=success
+ *          -errno=failure
+ */
+static int
+famfs_update_daxdev_table(
+	struct fuse_mount *fm,
+	const struct famfs_file_meta *meta)
+{
+	struct famfs_dax_devlist *local_devlist;
+	struct fuse_conn *fc = fm->fc;
+	int err;
+	int i;
+
+	pr_debug("%s: dev_bitmap=0x%llx\n", __func__, meta->dev_bitmap);
+
+	/* First time through we will need to allocate the dax_devlist */
+	if (!fc->dax_devlist) {
+		local_devlist = kcalloc(1, sizeof(*fc->dax_devlist), GFP_KERNEL);
+		if (!local_devlist)
+			return -ENOMEM;
+
+		local_devlist->nslots = MAX_DAXDEVS;
+		pr_debug("%s: allocate dax_devlist=%llx\n", __func__,
+			 (u64)local_devlist);
+
+		local_devlist->devlist = kcalloc(MAX_DAXDEVS,
+						 sizeof(struct famfs_daxdev),
+						 GFP_KERNEL);
+		if (!local_devlist->devlist) {
+			kfree(local_devlist);
+			return -ENOMEM;
+		}
+
+		/* We don't need the famfs_devlist_sem here because we use cmpxchg... */
+		if (cmpxchg(&fc->dax_devlist, NULL, local_devlist) != NULL) {
+			pr_debug("%s: aborting new devlist\n", __func__);
+			kfree(local_devlist->devlist);
+			kfree(local_devlist); /* another thread beat us to it */
+		} else {
+			pr_debug("%s: published new dax_devlist %llx / %llx\n",
+				 __func__, (u64)local_devlist,
+				 (u64)local_devlist->devlist);
+		}
+	}
+
+	down_read(&fc->famfs_devlist_sem);
+	for (i = 0; i < fc->dax_devlist->nslots; i++) {
+		if (meta->dev_bitmap & (1ULL << i)) {
+			/* This file meta struct references devindex i;
+			 * if devindex i isn't in the table, get it...
+			 */
+			if (!(fc->dax_devlist->devlist[i].valid)) {
+				up_read(&fc->famfs_devlist_sem);
+
+				pr_notice("%s: daxdev=%d (%llx) invalid...getting\n",
+					  __func__, i,
+					  (u64)(&fc->dax_devlist->devlist[i]));
+				err = famfs_fuse_get_daxdev(fm, i);
+				if (err)
+					pr_err("%s: failed to get daxdev=%d\n",
+					       __func__, i);
+
+				down_read(&fc->famfs_devlist_sem);
+			}
+		}
+	}
+	up_read(&fc->famfs_devlist_sem);
+
+	return 0;
+}
+
+/***************************************************************************/
 
 void
 __famfs_meta_free(void *famfs_meta)
@@ -67,12 +311,15 @@ famfs_check_ext_alignment(struct famfs_meta_simple_ext *se)
 }
 
 /**
- * famfs_meta_alloc() - Allocate famfs file metadata
+ * famfs_fuse_meta_alloc() - Allocate famfs file metadata
  * @metap:       Pointer to an mcache_map_meta pointer
  * @ext_count:  The number of extents needed
+ *
+ * Returns: 0=success
+ *          -errno=failure
  */
 static int
-famfs_meta_alloc_v3(
+famfs_fuse_meta_alloc(
 	void *fmap_buf,
 	size_t fmap_buf_size,
 	struct famfs_file_meta **metap)
@@ -92,28 +339,25 @@ famfs_meta_alloc_v3(
 	if (next_offset > fmap_buf_size) {
 		pr_err("%s:%d: fmap_buf underflow offset/size %ld/%ld\n",
 		       __func__, __LINE__, next_offset, fmap_buf_size);
-		rc = -EINVAL;
-		goto errout;
+		return -EINVAL;
 	}
 
 	if (fmh->nextents < 1) {
 		pr_err("%s: nextents %d < 1\n", __func__, fmh->nextents);
-		rc = -EINVAL;
-		goto errout;
+		return -EINVAL;
 	}
 
 	if (fmh->nextents > FUSE_FAMFS_MAX_EXTENTS) {
 		pr_err("%s: nextents %d > max (%d) 1\n",
 		       __func__, fmh->nextents, FUSE_FAMFS_MAX_EXTENTS);
-		rc = -E2BIG;
-		goto errout;
+		return -E2BIG;
 	}
 
 	meta = kzalloc(sizeof(*meta), GFP_KERNEL);
 	if (!meta)
 		return -ENOMEM;
-	meta->error = false;
 
+	meta->error = false;
 	meta->file_type = fmh->file_type;
 	meta->file_size = fmh->file_size;
 	meta->fm_extent_type = fmh->ext_type;
@@ -298,6 +542,20 @@ famfs_meta_alloc_v3(
 	return rc;
 }
 
+/**
+ * famfs_file_init_dax()
+ *
+ * Initialize famfs metadata for a file, based on the contents of the GET_FMAP
+ * response
+ *
+ * @fm        - fuse_mount
+ * @inode     - the inode
+ * @fmap_buf  - fmap response message
+ * @fmap_size - Size of the fmap message
+ *
+ * Returns: 0=success
+ *          -errno=failure
+ */
 int
 famfs_file_init_dax(
 	struct fuse_mount *fm,
@@ -316,10 +574,13 @@ famfs_file_init_dax(
 		return -EEXIST;
 	}
 
-	rc = famfs_meta_alloc_v3(fmap_buf, fmap_size, &meta);
+	rc = famfs_fuse_meta_alloc(fmap_buf, fmap_size, &meta);
 	if (rc)
 		goto errout;
 
+	/* Make sure this fmap doesn't reference any unknown daxdevs */
+	famfs_update_daxdev_table(fm, meta);
+
 	/* Publish the famfs metadata on fi->famfs_meta */
 	inode_lock(inode);
 	if (fi->famfs_meta) {
diff --git a/fs/fuse/famfs_kfmap.h b/fs/fuse/famfs_kfmap.h
index ce785d76719c..325adb8b99c5 100644
--- a/fs/fuse/famfs_kfmap.h
+++ b/fs/fuse/famfs_kfmap.h
@@ -60,4 +60,27 @@ struct famfs_file_meta {
 	};
 };
 
+/*
+ * dax_devlist
+ *
+ * This is the in-memory daxdev metadata that is populated by
+ * the responses to GET_DAXDEV messages
+ */
+struct famfs_daxdev {
+	/* Include dev uuid? */
+	bool valid;
+	bool error;
+	dev_t devno;
+	struct dax_device *devp;
+	char *name;
+};
+
+#define MAX_DAXDEVS 24
+
+struct famfs_dax_devlist {
+	int nslots;
+	int ndevs;
+	struct famfs_daxdev *devlist; /* XXX: make this an xarray! */
+};
+
 #endif /* FAMFS_KFMAP_H */
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index d8e0ac784224..4c4c4f0ff280 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -1561,7 +1561,11 @@ extern void fuse_sysctl_unregister(void);
 int famfs_file_init_dax(struct fuse_mount *fm,
 			     struct inode *inode, void *fmap_buf,
 			     size_t fmap_size);
+ssize_t famfs_dax_write_iter(struct kiocb *iocb, struct iov_iter *from);
+ssize_t famfs_dax_read_iter(struct kiocb *iocb, struct iov_iter	*to);
+int famfs_file_mmap(struct file *file, struct vm_area_struct *vma);
 void __famfs_meta_free(void *map);
+void famfs_teardown(struct fuse_conn *fc);
 #endif
 
 static inline struct fuse_backing *famfs_meta_set(struct fuse_inode *fi,
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index e86bf330117f..af1629b07a30 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -1051,6 +1051,8 @@ void fuse_conn_put(struct fuse_conn *fc)
 		}
 		if (IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
 			fuse_backing_files_free(fc);
+		if (IS_ENABLED(CONFIG_FUSE_FAMFS_DAX))
+			famfs_teardown(fc);
 		call_rcu(&fc->rcu, delayed_release);
 	}
 }
diff --git a/fs/namei.c b/fs/namei.c
index ecb7b95c2ca3..75a1e1d46593 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3380,6 +3380,7 @@ bool may_open_dev(const struct path *path)
 	return !(path->mnt->mnt_flags & MNT_NODEV) &&
 		!(path->mnt->mnt_sb->s_iflags & SB_I_NODEV);
 }
+EXPORT_SYMBOL(may_open_dev);
 
 static int may_open(struct mnt_idmap *idmap, const struct path *path,
 		    int acc_mode, int flag)
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 0f6ff1ffb23d..982d4fc66ef8 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -1328,4 +1328,19 @@ struct fuse_famfs_fmap_header {
 	uint64_t file_size;
 	uint64_t reserved1;
 };
+
+struct fuse_get_daxdev_in {
+	uint32_t        daxdev_num;
+};
+
+#define DAXDEV_NAME_MAX 256
+struct fuse_daxdev_out {
+	uint16_t index;
+	uint16_t reserved;
+	uint32_t reserved2;
+	uint64_t reserved3; /* enough space for a uuid if we need it */
+	uint64_t reserved4;
+	char name[DAXDEV_NAME_MAX];
+};
+
 #endif /* _LINUX_FUSE_H */
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [RFC PATCH 15/19] famfs_fuse: Plumb dax iomap and fuse read/write/mmap
  2025-04-21  1:33 [RFC PATCH 00/19] famfs: port into fuse John Groves
                   ` (13 preceding siblings ...)
  2025-04-21  1:33 ` [RFC PATCH 14/19] famfs_fuse: GET_DAXDEV message and daxdev_table John Groves
@ 2025-04-21  1:33 ` John Groves
  2025-04-21  1:33 ` [RFC PATCH 16/19] famfs_fuse: Add holder_operations for dax notify_failure() John Groves
                   ` (6 subsequent siblings)
  21 siblings, 0 replies; 58+ messages in thread
From: John Groves @ 2025-04-21  1:33 UTC (permalink / raw)
  To: John Groves, Dan Williams, Miklos Szeredi, Bernd Schubert
  Cc: John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Darrick J . Wong, Luis Henriques, Randy Dunlap, Jeff Layton,
	Kent Overstreet, Petr Vorel, Brian Foster, linux-doc,
	linux-kernel, nvdimm, linux-cxl, linux-fsdevel, Amir Goldstein,
	Jonathan Cameron, Stefan Hajnoczi, Joanne Koong, Josef Bacik,
	Aravind Ramesh, Ajay Joshi, John Groves

This commit fills in read/write/mmap handling for famfs files. The
dev_dax_iomap interface is used - just like xfs in fs-dax mode.
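
For readers following the interleaved case, the offset arithmetic can be
sketched in plain userspace C as below. This is only an illustration under
the assumption of a single interleaved extent; the sketch_* names are
hypothetical, and the kernel code additionally adds the strip's ext_offset
to strip_offset to form iomap->addr.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type: which strip, and where within that strip */
struct sketch_strip_loc {
	uint64_t strip_num;	/* index into the strip array */
	uint64_t strip_offset;	/* byte offset relative to the strip's start */
	uint64_t max_len;	/* bytes left in the current chunk */
};

/* Resolve a file offset within one interleaved extent to a strip location */
static struct sketch_strip_loc
sketch_resolve_interleaved(uint64_t file_offset, uint64_t chunk_size,
			   uint64_t nstrips)
{
	struct sketch_strip_loc loc;
	uint64_t chunk_num, chunk_offset, stripe_num;

	assert(chunk_size > 0 && nstrips > 0);

	chunk_num    = file_offset / chunk_size;	/* which chunk overall */
	chunk_offset = file_offset % chunk_size;	/* offset inside that chunk */
	stripe_num   = chunk_num / nstrips;		/* full stripes before it */

	loc.strip_num    = chunk_num % nstrips;
	loc.strip_offset = chunk_offset + stripe_num * chunk_size;
	loc.max_len      = chunk_size - chunk_offset;
	return loc;
}
```

A mapping never crosses a chunk boundary, which is why the returned length
is capped at the remainder of the current chunk; dax calls back for the
next chunk as needed.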

Signed-off-by: John Groves <john@groves.net>
---
 fs/fuse/famfs.c | 432 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/fuse/file.c  |  14 ++
 2 files changed, 446 insertions(+)

diff --git a/fs/fuse/famfs.c b/fs/fuse/famfs.c
index 2e182cb7d7c9..8c12e8bd96b2 100644
--- a/fs/fuse/famfs.c
+++ b/fs/fuse/famfs.c
@@ -603,3 +603,435 @@ famfs_file_init_dax(
 	return rc;
 }
 
+/*********************************************************************
+ * iomap_operations
+ *
+ * This stuff uses the iomap (dax-related) helpers to resolve file offsets to
+ * offsets within a dax device.
+ */
+
+static ssize_t famfs_file_invalid(struct inode *inode);
+
+static int
+famfs_meta_to_dax_offset_v2(struct inode *inode, struct iomap *iomap,
+			 loff_t file_offset, off_t len, unsigned int flags)
+{
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	struct famfs_file_meta *meta = fi->famfs_meta;
+	struct fuse_conn *fc = get_fuse_conn(inode);
+	loff_t local_offset = file_offset;
+	int i;
+
+	/* This function is only for extent_type INTERLEAVED_EXTENT */
+	if (meta->fm_extent_type != INTERLEAVED_EXTENT) {
+		pr_err("%s: bad extent type\n", __func__);
+		goto err_out;
+	}
+
+	if (famfs_file_invalid(inode))
+		goto err_out;
+
+	iomap->offset = file_offset;
+
+	for (i = 0; i < meta->fm_niext; i++) {
+		struct famfs_meta_interleaved_ext *fei = &meta->ie[i];
+		u64 chunk_size = fei->fie_chunk_size;
+		u64 nstrips = fei->fie_nstrips;
+		u64 ext_size = fei->fie_nbytes;
+
+		ext_size = min_t(u64, ext_size, meta->file_size);
+
+		if (ext_size == 0) {
+			pr_err("%s: ext_size=%lld file_size=%ld\n",
+			       __func__, fei->fie_nbytes, meta->file_size);
+			goto err_out;
+		}
+
+		/* Is the data in this striped extent? */
+		if (local_offset < ext_size) {
+			u64 chunk_num       = local_offset / chunk_size;
+			u64 chunk_offset    = local_offset % chunk_size;
+			u64 stripe_num      = chunk_num / nstrips;
+			u64 strip_num       = chunk_num % nstrips;
+			u64 chunk_remainder = chunk_size - chunk_offset;
+			u64 strip_offset    = chunk_offset + (stripe_num * chunk_size);
+			u64 strip_dax_ofs = fei->ie_strips[strip_num].ext_offset;
+			u64 strip_devidx = fei->ie_strips[strip_num].dev_index;
+
+			if (!fc->dax_devlist->devlist[strip_devidx].valid) {
+				pr_err("%s: daxdev=%lld invalid\n", __func__,
+					strip_devidx);
+				goto err_out;
+			}
+			iomap->addr    = strip_dax_ofs + strip_offset;
+			iomap->offset  = file_offset;
+			iomap->length  = min_t(loff_t, len, chunk_remainder);
+
+			iomap->dax_dev = fc->dax_devlist->devlist[strip_devidx].devp;
+
+			iomap->type    = IOMAP_MAPPED;
+			iomap->flags   = flags;
+
+			return 0;
+		}
+		local_offset -= ext_size; /* offset is beyond this striped extent */
+	}
+
+ err_out:
+	pr_err("%s: err_out\n", __func__);
+
+	/* We fell out the end of the extent list.
+	 * Set iomap to zero length in this case, and return 0
+	 * This just means that the r/w is past EOF
+	 */
+	iomap->addr    = 0; /* there is no valid dax device offset */
+	iomap->offset  = file_offset; /* file offset */
+	iomap->length  = 0; /* this had better result in no access to dax mem */
+	iomap->dax_dev = NULL;
+	iomap->type    = IOMAP_MAPPED;
+	iomap->flags   = flags;
+
+	return 0;
+}
+
+/**
+ * famfs_meta_to_dax_offset() - Resolve (file, offset, len) to (daxdev, offset, len)
+ *
+ * This function is called by famfs_iomap_begin() to resolve an offset in a
+ * file to an offset in a dax device. This is upcalled from dax from calls to
+ * both dax_iomap_fault() and dax_iomap_rw(). Dax finishes the job resolving
+ * a fault to a specific physical page (the fault case) or doing a memcpy
+ * variant (the rw case).
+ *
+ * Pages can be PTE (4k), PMD (2MiB) or (theoretically) PUD (1GiB)
+ * (these sizes are for x86; they may vary on other CPU architectures).
+ *
+ * @inode:  The file where the fault occurred
+ * @iomap:       To be filled in to indicate where to find the right memory,
+ *               relative  to a dax device.
+ * @file_offset: Within the file where the fault occurred (will be page boundary)
+ * @len:         The length of the faulted mapping (will be a page multiple)
+ *               (will be trimmed in *iomap if it's disjoint in the extent list)
+ * @flags:
+ *
+ * Return values: 0. (info is returned in a modified @iomap struct)
+ */
+static int
+famfs_meta_to_dax_offset(struct inode *inode, struct iomap *iomap,
+			 loff_t file_offset, off_t len, unsigned int flags)
+{
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	struct famfs_file_meta *meta = fi->famfs_meta;
+	struct fuse_conn *fc = get_fuse_conn(inode);
+	loff_t local_offset = file_offset;
+	int i;
+
+	if (!fc->dax_devlist) {
+		pr_err("%s: null dax_devlist\n", __func__);
+		goto err_out;
+	}
+
+	if (famfs_file_invalid(inode))
+		goto err_out;
+
+	if (meta->fm_extent_type == INTERLEAVED_EXTENT)
+		return famfs_meta_to_dax_offset_v2(inode, iomap, file_offset,
+						   len, flags);
+
+	iomap->offset = file_offset;
+
+	for (i = 0; i < meta->fm_nextents; i++) {
+		/* TODO: check devindex too */
+		loff_t dax_ext_offset = meta->se[i].ext_offset;
+		loff_t dax_ext_len    = meta->se[i].ext_len;
+		u64 daxdev_idx = meta->se[i].dev_index;
+
+		if ((dax_ext_offset == 0) &&
+		    (meta->file_type != FAMFS_SUPERBLOCK))
+			pr_warn("%s: zero offset on non-superblock file!!\n",
+				__func__);
+
+		/* local_offset is the offset minus the size of extents skipped
+		 * so far; If local_offset < dax_ext_len, the data of interest
+		 * starts in this extent
+		 */
+		if (local_offset < dax_ext_len) {
+			loff_t ext_len_remainder = dax_ext_len - local_offset;
+
+			if (!fc->dax_devlist->devlist[daxdev_idx].valid) {
+				pr_err("%s: daxdev=%lld invalid\n", __func__,
+					daxdev_idx);
+				goto err_out;
+			}
+
+			/*
+			 * OK, we found the file metadata extent where this
+			 * data begins
+			 * @local_offset      - The offset within the current
+			 *                      extent
+			 * @ext_len_remainder - Remaining length of ext after
+			 *                      skipping local_offset
+			 * Outputs:
+			 * iomap->addr:   the offset within the dax device where
+			 *                the  data starts
+			 * iomap->offset: the file offset
+			 * iomap->length: the valid length resolved here
+			 */
+			iomap->addr    = dax_ext_offset + local_offset;
+			iomap->offset  = file_offset;
+			iomap->length  = min_t(loff_t, len, ext_len_remainder);
+
+			iomap->dax_dev = fc->dax_devlist->devlist[daxdev_idx].devp;
+
+			iomap->type    = IOMAP_MAPPED;
+			iomap->flags   = flags;
+			return 0;
+		}
+		local_offset -= dax_ext_len; /* Get ready for the next extent */
+	}
+
+ err_out:
+	pr_err("%s: err_out\n", __func__);
+
+	/* We fell out the end of the extent list.
+	 * Set iomap to zero length in this case, and return 0
+	 * This just means that the r/w is past EOF
+	 */
+	iomap->addr    = 0; /* there is no valid dax device offset */
+	iomap->offset  = file_offset; /* file offset */
+	iomap->length  = 0; /* this had better result in no access to dax mem */
+	iomap->dax_dev = NULL;
+	iomap->type    = IOMAP_MAPPED;
+	iomap->flags   = flags;
+
+	return 0;
+}
+
+/**
+ * famfs_iomap_begin() - Handler for iomap_begin upcall from dax
+ *
+ * This function is pretty simple because files are
+ * * never partially allocated
+ * * never have holes (never sparse)
+ * * never "allocate on write"
+ *
+ * @inode:  inode for the file being accessed
+ * @offset: offset within the file
+ * @length: Length being accessed at offset
+ * @flags:
+ * @iomap:  iomap struct to be filled in, resolving (offset, length) to
+ *          (daxdev, offset, len)
+ * @srcmap:
+ */
+static int
+famfs_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
+		  unsigned int flags, struct iomap *iomap, struct iomap *srcmap)
+{
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	struct famfs_file_meta *meta = fi->famfs_meta;
+	size_t size;
+
+	size = i_size_read(inode);
+
+	WARN_ON(size != meta->file_size);
+
+	return famfs_meta_to_dax_offset(inode, iomap, offset, length, flags);
+}
+
+/* Note: We never need a special set of write_iomap_ops because famfs never
+ * performs allocation on write.
+ */
+const struct iomap_ops famfs_iomap_ops = {
+	.iomap_begin		= famfs_iomap_begin,
+};
+
+/*********************************************************************
+ * vm_operations
+ */
+static vm_fault_t
+__famfs_filemap_fault(struct vm_fault *vmf, unsigned int pe_size,
+		      bool write_fault)
+{
+	struct inode *inode = file_inode(vmf->vma->vm_file);
+	vm_fault_t ret;
+	pfn_t pfn;
+
+	if (!IS_DAX(file_inode(vmf->vma->vm_file))) {
+		pr_err("%s: file not marked IS_DAX!!\n", __func__);
+		return VM_FAULT_SIGBUS;
+	}
+
+	if (write_fault) {
+		sb_start_pagefault(inode->i_sb);
+		file_update_time(vmf->vma->vm_file);
+	}
+
+	ret = dax_iomap_fault(vmf, pe_size, &pfn, NULL, &famfs_iomap_ops);
+	if (ret & VM_FAULT_NEEDDSYNC)
+		ret = dax_finish_sync_fault(vmf, pe_size, pfn);
+
+	if (write_fault)
+		sb_end_pagefault(inode->i_sb);
+
+	return ret;
+}
+
+static inline bool
+famfs_is_write_fault(struct vm_fault *vmf)
+{
+	return (vmf->flags & FAULT_FLAG_WRITE) &&
+	       (vmf->vma->vm_flags & VM_SHARED);
+}
+
+static vm_fault_t
+famfs_filemap_fault(struct vm_fault *vmf)
+{
+	return __famfs_filemap_fault(vmf, 0, famfs_is_write_fault(vmf));
+}
+
+static vm_fault_t
+famfs_filemap_huge_fault(struct vm_fault *vmf, unsigned int pe_size)
+{
+	return __famfs_filemap_fault(vmf, pe_size, famfs_is_write_fault(vmf));
+}
+
+static vm_fault_t
+famfs_filemap_page_mkwrite(struct vm_fault *vmf)
+{
+	return __famfs_filemap_fault(vmf, 0, true);
+}
+
+static vm_fault_t
+famfs_filemap_pfn_mkwrite(struct vm_fault *vmf)
+{
+	return __famfs_filemap_fault(vmf, 0, true);
+}
+
+static vm_fault_t
+famfs_filemap_map_pages(struct vm_fault	*vmf, pgoff_t start_pgoff,
+			pgoff_t	end_pgoff)
+{
+	return filemap_map_pages(vmf, start_pgoff, end_pgoff);
+}
+
+const struct vm_operations_struct famfs_file_vm_ops = {
+	.fault		= famfs_filemap_fault,
+	.huge_fault	= famfs_filemap_huge_fault,
+	.map_pages	= famfs_filemap_map_pages,
+	.page_mkwrite	= famfs_filemap_page_mkwrite,
+	.pfn_mkwrite	= famfs_filemap_pfn_mkwrite,
+};
+
+/*********************************************************************
+ * file_operations
+ */
+
+/* Reject I/O to files that aren't in a valid state */
+static ssize_t
+famfs_file_invalid(struct inode *inode)
+{
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	struct famfs_file_meta *meta = fi->famfs_meta;
+	size_t i_size = i_size_read(inode);
+
+	if (!meta) {
+		pr_debug("%s: un-initialized famfs file\n", __func__);
+		return -EIO;
+	}
+	if (meta->error) {
+		pr_debug("%s: previously detected metadata errors\n", __func__);
+		return -EIO;
+	}
+	if (i_size != meta->file_size) {
+		pr_warn("%s: i_size overwritten from %ld to %ld\n",
+		       __func__, meta->file_size, i_size);
+		meta->error = true;
+		return -ENXIO;
+	}
+	if (!IS_DAX(inode)) {
+		pr_debug("%s: inode %llx IS_DAX is false\n", __func__, (u64)inode);
+		return -ENXIO;
+	}
+	return 0;
+}
+
+static ssize_t
+famfs_rw_prep(struct kiocb *iocb, struct iov_iter *ubuf)
+{
+	struct inode *inode = iocb->ki_filp->f_mapping->host;
+	size_t i_size = i_size_read(inode);
+	size_t count = iov_iter_count(ubuf);
+	size_t max_count;
+	ssize_t rc;
+
+	rc = famfs_file_invalid(inode);
+	if (rc)
+		return rc;
+
+	max_count = max_t(loff_t, 0, (loff_t)i_size - iocb->ki_pos);
+
+	if (count > max_count)
+		iov_iter_truncate(ubuf, max_count);
+
+	if (!iov_iter_count(ubuf))
+		return 0;
+
+	return rc;
+}
+
+ssize_t
+famfs_dax_read_iter(struct kiocb *iocb, struct iov_iter	*to)
+{
+	ssize_t rc;
+
+	rc = famfs_rw_prep(iocb, to);
+	if (rc)
+		return rc;
+
+	if (!iov_iter_count(to))
+		return 0;
+
+	rc = dax_iomap_rw(iocb, to, &famfs_iomap_ops);
+
+	file_accessed(iocb->ki_filp);
+	return rc;
+}
+
+/**
+ * famfs_dax_write_iter()
+ *
+ * We need our own write-iter in order to prevent append
+ *
+ * @iocb:
+ * @from: iterator describing the user memory source for the write
+ */
+ssize_t
+famfs_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
+{
+	ssize_t rc;
+
+	rc = famfs_rw_prep(iocb, from);
+	if (rc)
+		return rc;
+
+	if (!iov_iter_count(from))
+		return 0;
+
+	return dax_iomap_rw(iocb, from, &famfs_iomap_ops);
+}
+
+int
+famfs_file_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	struct inode *inode = file_inode(file);
+	ssize_t rc;
+
+	rc = famfs_file_invalid(inode);
+	if (rc)
+		return (int)rc;
+
+	file_accessed(file);
+	vma->vm_ops = &famfs_file_vm_ops;
+	vm_flags_set(vma, VM_HUGEPAGE);
+	return 0;
+}
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 6f10ae54e710..11201195924d 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1777,6 +1777,8 @@ static ssize_t fuse_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 
 	if (FUSE_IS_VIRTIO_DAX(fi))
 		return fuse_dax_read_iter(iocb, to);
+	if (fuse_file_famfs(fi))
+		return famfs_dax_read_iter(iocb, to);
 
 	/* FOPEN_DIRECT_IO overrides FOPEN_PASSTHROUGH */
 	if (ff->open_flags & FOPEN_DIRECT_IO)
@@ -1799,6 +1801,8 @@ static ssize_t fuse_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 
 	if (FUSE_IS_VIRTIO_DAX(fi))
 		return fuse_dax_write_iter(iocb, from);
+	if (fuse_file_famfs(fi))
+		return famfs_dax_write_iter(iocb, from);
 
 	/* FOPEN_DIRECT_IO overrides FOPEN_PASSTHROUGH */
 	if (ff->open_flags & FOPEN_DIRECT_IO)
@@ -1814,10 +1818,14 @@ static ssize_t fuse_splice_read(struct file *in, loff_t *ppos,
 				unsigned int flags)
 {
 	struct fuse_file *ff = in->private_data;
+	struct inode *inode = file_inode(in);
+	struct fuse_inode *fi = get_fuse_inode(inode);
 
 	/* FOPEN_DIRECT_IO overrides FOPEN_PASSTHROUGH */
 	if (fuse_file_passthrough(ff) && !(ff->open_flags & FOPEN_DIRECT_IO))
 		return fuse_passthrough_splice_read(in, ppos, pipe, len, flags);
+	else if (fuse_file_famfs(fi))
+		return -EIO; /* direct I/O doesn't make sense in dax_iomap */
 	else
 		return filemap_splice_read(in, ppos, pipe, len, flags);
 }
@@ -1826,10 +1834,14 @@ static ssize_t fuse_splice_write(struct pipe_inode_info *pipe, struct file *out,
 				 loff_t *ppos, size_t len, unsigned int flags)
 {
 	struct fuse_file *ff = out->private_data;
+	struct inode *inode = file_inode(out);
+	struct fuse_inode *fi = get_fuse_inode(inode);
 
 	/* FOPEN_DIRECT_IO overrides FOPEN_PASSTHROUGH */
 	if (fuse_file_passthrough(ff) && !(ff->open_flags & FOPEN_DIRECT_IO))
 		return fuse_passthrough_splice_write(pipe, out, ppos, len, flags);
+	else if (fuse_file_famfs(fi))
+		return -EIO; /* direct I/O doesn't make sense in dax_iomap */
 	else
 		return iter_file_splice_write(pipe, out, ppos, len, flags);
 }
@@ -2635,6 +2647,8 @@ static int fuse_file_mmap(struct file *file, struct vm_area_struct *vma)
 	/* DAX mmap is superior to direct_io mmap */
 	if (FUSE_IS_VIRTIO_DAX(fi))
 		return fuse_dax_mmap(file, vma);
+	if (fuse_file_famfs(fi))
+		return famfs_file_mmap(file, vma);
 
 	/*
 	 * If inode is in passthrough io mode, because it has some file open
-- 
2.49.0



* [RFC PATCH 16/19] famfs_fuse: Add holder_operations for dax notify_failure()
  2025-04-21  1:33 [RFC PATCH 00/19] famfs: port into fuse John Groves
                   ` (14 preceding siblings ...)
  2025-04-21  1:33 ` [RFC PATCH 15/19] famfs_fuse: Plumb dax iomap and fuse read/write/mmap John Groves
@ 2025-04-21  1:33 ` John Groves
  2025-04-21  1:33 ` [RFC PATCH 17/19] famfs_fuse: Add famfs metadata documentation John Groves
                   ` (5 subsequent siblings)
  21 siblings, 0 replies; 58+ messages in thread
From: John Groves @ 2025-04-21  1:33 UTC (permalink / raw)
  To: John Groves, Dan Williams, Miklos Szeredi, Bernd Schubert
  Cc: John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Darrick J . Wong, Luis Henriques, Randy Dunlap, Jeff Layton,
	Kent Overstreet, Petr Vorel, Brian Foster, linux-doc,
	linux-kernel, nvdimm, linux-cxl, linux-fsdevel, Amir Goldstein,
	Jonathan Cameron, Stefan Hajnoczi, Joanne Koong, Josef Bacik,
	Aravind Ramesh, Ajay Joshi, John Groves

If we get a notify_failure() call on a daxdev, set its error flag and
prevent further access to that device.

Signed-off-by: John Groves <john@groves.net>
---
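Not part of the patch: a minimal user-space sketch of the error-flag idea
described above, under stated assumptions. The sketch_* types and functions
are hypothetical stand-ins for struct famfs_daxdev and the device list, and
the locking that the kernel code performs with famfs_devlist_sem is left out.
It only illustrates that a failure notification marks the matching slot, and
that later offset resolution refuses a slot that is either not valid or
marked as failed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one entry in the daxdev table. */
struct sketch_daxdev {
	const void *devp;	/* identity of the underlying dax device */
	bool valid;		/* slot has been populated */
	bool error;		/* a memory failure was reported */
};

/* Hypothetical stand-in for the per-connection daxdev table. */
struct sketch_devlist {
	struct sketch_daxdev *devs;
	size_t nslots;
};

/*
 * Mark the populated slot whose device matches devp as failed.
 * Returns the slot index, or -1 if the device is not in the table.
 */
static int sketch_set_daxdev_err(struct sketch_devlist *dl, const void *devp)
{
	assert(dl != NULL);

	for (size_t i = 0; i < dl->nslots; i++) {
		if (dl->devs[i].valid && dl->devs[i].devp == devp) {
			dl->devs[i].error = true;
			return (int)i;
		}
	}
	return -1;
}

/* Mapping-time check: a slot is usable only if populated and not failed. */
static bool sketch_daxdev_usable(const struct sketch_devlist *dl, size_t idx)
{
	assert(dl != NULL);

	return idx < dl->nslots && dl->devs[idx].valid && !dl->devs[idx].error;
}
```

In this sketch the flag is sticky: nothing clears it, which matches the
intent of preventing further access to the device once a failure is seen.
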
 fs/fuse/famfs.c  | 154 ++++++++++++++++++++++++++++++++++-------------
 fs/fuse/file.c   |   6 +-
 fs/fuse/fuse_i.h |   6 +-
 3 files changed, 117 insertions(+), 49 deletions(-)

diff --git a/fs/fuse/famfs.c b/fs/fuse/famfs.c
index 8c12e8bd96b2..363031704c8d 100644
--- a/fs/fuse/famfs.c
+++ b/fs/fuse/famfs.c
@@ -20,6 +20,26 @@
 #include "famfs_kfmap.h"
 #include "fuse_i.h"
 
+static void famfs_set_daxdev_err(
+	struct fuse_conn *fc, struct dax_device *dax_devp);
+
+static int
+famfs_dax_notify_failure(struct dax_device *dax_devp, u64 offset,
+			u64 len, int mf_flags)
+{
+	struct fuse_conn *fc = dax_holder(dax_devp);
+
+	famfs_set_daxdev_err(fc, dax_devp);
+
+	return 0;
+}
+
+static const struct dax_holder_operations famfs_fuse_dax_holder_ops = {
+	.notify_failure		= famfs_dax_notify_failure,
+};
+
+/*****************************************************************************/
+
 /*
  * famfs_teardown()
  *
@@ -169,6 +189,15 @@ famfs_fuse_get_daxdev(struct fuse_mount *fm, const u64 index)
 		goto out;
 	}
 
+	err = fs_dax_get(daxdev->devp, fc, &famfs_fuse_dax_holder_ops);
+	if (err) {
+		up_write(&fc->famfs_devlist_sem);
+		pr_err("%s: fs_dax_get(%lld) failed\n",
+		       __func__, (u64)daxdev->devno);
+		err = -EBUSY;
+		goto out;
+	}
+
 	daxdev->name = kstrdup(daxdev_out.name, GFP_KERNEL);
 	wmb(); /* all daxdev fields must be visible before marking it valid */
 	daxdev->valid = 1;
@@ -263,6 +292,38 @@ famfs_update_daxdev_table(
 	return 0;
 }
 
+static void
+famfs_set_daxdev_err(
+	struct fuse_conn *fc,
+	struct dax_device *dax_devp)
+{
+	int i;
+
+	/* Gotta search the list by dax_devp;
+	 * read lock because we're not adding or removing daxdev entries
+	 */
+	down_read(&fc->famfs_devlist_sem);
+	for (i = 0; i < fc->dax_devlist->nslots; i++) {
+		if (fc->dax_devlist->devlist[i].valid) {
+			struct famfs_daxdev *dd = &fc->dax_devlist->devlist[i];
+
+			if (dd->devp != dax_devp)
+				continue;
+
+			dd->error = true;
+			up_read(&fc->famfs_devlist_sem);
+
+			pr_err("%s: memory error on daxdev %s (%d)\n",
+			       __func__, dd->name, i);
+			goto done;
+		}
+	}
+	up_read(&fc->famfs_devlist_sem);
+	pr_err("%s: memory err on unrecognized daxdev\n", __func__);
+done:
+	return;
+}
+
 /***************************************************************************/
 
 void
@@ -610,10 +671,10 @@ famfs_file_init_dax(
  * offsets within a dax device.
  */
 
-static ssize_t famfs_file_invalid(struct inode *inode);
+static ssize_t famfs_file_bad(struct inode *inode);
 
 static int
-famfs_meta_to_dax_offset_v2(struct inode *inode, struct iomap *iomap,
+famfs_interleave_fileofs_to_daxofs(struct inode *inode, struct iomap *iomap,
 			 loff_t file_offset, off_t len, unsigned int flags)
 {
 	struct fuse_inode *fi = get_fuse_inode(inode);
@@ -628,7 +689,7 @@ famfs_meta_to_dax_offset_v2(struct inode *inode, struct iomap *iomap,
 		goto err_out;
 	}
 
-	if (famfs_file_invalid(inode))
+	if (famfs_file_bad(inode))
 		goto err_out;
 
 	iomap->offset = file_offset;
@@ -649,6 +710,7 @@ famfs_meta_to_dax_offset_v2(struct inode *inode, struct iomap *iomap,
 
 		/* Is the data is in this striped extent? */
 		if (local_offset < ext_size) {
+			struct famfs_daxdev *dd;
 			u64 chunk_num       = local_offset / chunk_size;
 			u64 chunk_offset    = local_offset % chunk_size;
 			u64 stripe_num      = chunk_num / nstrips;
@@ -658,9 +720,11 @@ famfs_meta_to_dax_offset_v2(struct inode *inode, struct iomap *iomap,
 			u64 strip_dax_ofs = fei->ie_strips[strip_num].ext_offset;
 			u64 strip_devidx = fei->ie_strips[strip_num].dev_index;
 
-			if (!fc->dax_devlist->devlist[strip_devidx].valid) {
-				pr_err("%s: daxdev=%lld invalid\n", __func__,
-					strip_devidx);
+			dd = &fc->dax_devlist->devlist[strip_devidx];
+			if (!dd->valid || dd->error) {
+				pr_err("%s: daxdev=%lld %s\n", __func__,
+				       strip_devidx,
+				       dd->valid ? "error" : "invalid");
 				goto err_out;
 			}
 			iomap->addr    = strip_dax_ofs + strip_offset;
@@ -695,9 +759,9 @@ famfs_meta_to_dax_offset_v2(struct inode *inode, struct iomap *iomap,
 }
 
 /**
- * famfs_meta_to_dax_offset() - Resolve (file, offset, len) to (daxdev, offset, len)
+ * famfs_fileofs_to_daxofs() - Resolve (file, offset, len) to (daxdev, offset, len)
  *
- * This function is called by famfs_iomap_begin() to resolve an offset in a
+ * This function is called by famfs_fuse_iomap_begin() to resolve an offset in a
  * file to an offset in a dax device. This is upcalled from dax from calls to
  * both  * dax_iomap_fault() and dax_iomap_rw(). Dax finishes the job resolving
  * a fault to a specific physical page (the fault case) or doing a memcpy
@@ -717,7 +781,7 @@ famfs_meta_to_dax_offset_v2(struct inode *inode, struct iomap *iomap,
  * Return values: 0. (info is returned in a modified @iomap struct)
  */
 static int
-famfs_meta_to_dax_offset(struct inode *inode, struct iomap *iomap,
+famfs_fileofs_to_daxofs(struct inode *inode, struct iomap *iomap,
 			 loff_t file_offset, off_t len, unsigned int flags)
 {
 	struct fuse_inode *fi = get_fuse_inode(inode);
@@ -731,12 +795,13 @@ famfs_meta_to_dax_offset(struct inode *inode, struct iomap *iomap,
 		goto err_out;
 	}
 
-	if (famfs_file_invalid(inode))
+	if (famfs_file_bad(inode))
 		goto err_out;
 
 	if (meta->fm_extent_type == INTERLEAVED_EXTENT)
-		return famfs_meta_to_dax_offset_v2(inode, iomap, file_offset,
-						   len, flags);
+		return famfs_interleave_fileofs_to_daxofs(inode, iomap,
+							  file_offset,
+							  len, flags);
 
 	iomap->offset = file_offset;
 
@@ -757,10 +822,14 @@ famfs_meta_to_dax_offset(struct inode *inode, struct iomap *iomap,
 		 */
 		if (local_offset < dax_ext_len) {
 			loff_t ext_len_remainder = dax_ext_len - local_offset;
+			struct famfs_daxdev *dd;
+
+			dd = &fc->dax_devlist->devlist[daxdev_idx];
 
-			if (!fc->dax_devlist->devlist[daxdev_idx].valid) {
-				pr_err("%s: daxdev=%lld invalid\n", __func__,
-					daxdev_idx);
+			if (!dd->valid || dd->error) {
+				pr_err("%s: daxdev=%lld %s\n", __func__,
+				       daxdev_idx,
+				       dd->valid ? "error" : "invalid");
 				goto err_out;
 			}
 
@@ -808,7 +877,7 @@ famfs_meta_to_dax_offset(struct inode *inode, struct iomap *iomap,
 }
 
 /**
- * famfs_iomap_begin() - Handler for iomap_begin upcall from dax
+ * famfs_fuse_iomap_begin() - Handler for iomap_begin upcall from dax
  *
  * This function is pretty simple because files are
  * * never partially allocated
@@ -824,7 +893,7 @@ famfs_meta_to_dax_offset(struct inode *inode, struct iomap *iomap,
  * @srcmap:
  */
 static int
-famfs_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
+famfs_fuse_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
 		  unsigned int flags, struct iomap *iomap, struct iomap *srcmap)
 {
 	struct fuse_inode *fi = get_fuse_inode(inode);
@@ -835,21 +904,21 @@ famfs_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
 
 	WARN_ON(size != meta->file_size);
 
-	return famfs_meta_to_dax_offset(inode, iomap, offset, length, flags);
+	return famfs_fileofs_to_daxofs(inode, iomap, offset, length, flags);
 }
 
 /* Note: We never need a special set of write_iomap_ops because famfs never
  * performs allocation on write.
  */
 const struct iomap_ops famfs_iomap_ops = {
-	.iomap_begin		= famfs_iomap_begin,
+	.iomap_begin		= famfs_fuse_iomap_begin,
 };
 
 /*********************************************************************
  * vm_operations
  */
 static vm_fault_t
-__famfs_filemap_fault(struct vm_fault *vmf, unsigned int pe_size,
+__famfs_fuse_filemap_fault(struct vm_fault *vmf, unsigned int pe_size,
 		      bool write_fault)
 {
 	struct inode *inode = file_inode(vmf->vma->vm_file);
@@ -886,25 +955,25 @@ famfs_is_write_fault(struct vm_fault *vmf)
 static vm_fault_t
 famfs_filemap_fault(struct vm_fault *vmf)
 {
-	return __famfs_filemap_fault(vmf, 0, famfs_is_write_fault(vmf));
+	return __famfs_fuse_filemap_fault(vmf, 0, famfs_is_write_fault(vmf));
 }
 
 static vm_fault_t
 famfs_filemap_huge_fault(struct vm_fault *vmf, unsigned int pe_size)
 {
-	return __famfs_filemap_fault(vmf, pe_size, famfs_is_write_fault(vmf));
+	return __famfs_fuse_filemap_fault(vmf, pe_size, famfs_is_write_fault(vmf));
 }
 
 static vm_fault_t
 famfs_filemap_page_mkwrite(struct vm_fault *vmf)
 {
-	return __famfs_filemap_fault(vmf, 0, true);
+	return __famfs_fuse_filemap_fault(vmf, 0, true);
 }
 
 static vm_fault_t
 famfs_filemap_pfn_mkwrite(struct vm_fault *vmf)
 {
-	return __famfs_filemap_fault(vmf, 0, true);
+	return __famfs_fuse_filemap_fault(vmf, 0, true);
 }
 
 static vm_fault_t
@@ -926,16 +995,23 @@ const struct vm_operations_struct famfs_file_vm_ops = {
  * file_operations
  */
 
-/* Reject I/O to files that aren't in a valid state */
+/**
+ * famfs_file_bad() - Check for files that aren't in a valid state
+ *
+ * @inode: inode to check
+ *
+ * Returns: 0=success
+ *          -errno=failure
+ */
 static ssize_t
-famfs_file_invalid(struct inode *inode)
+famfs_file_bad(struct inode *inode)
 {
 	struct fuse_inode *fi = get_fuse_inode(inode);
 	struct famfs_file_meta *meta = fi->famfs_meta;
 	size_t i_size = i_size_read(inode);
 
 	if (!meta) {
-		pr_debug("%s: un-initialized famfs file\n", __func__);
+		pr_err("%s: un-initialized famfs file\n", __func__);
 		return -EIO;
 	}
 	if (meta->error) {
@@ -956,7 +1032,7 @@ famfs_file_invalid(struct inode *inode)
 }
 
 static ssize_t
-famfs_rw_prep(struct kiocb *iocb, struct iov_iter *ubuf)
+famfs_fuse_rw_prep(struct kiocb *iocb, struct iov_iter *ubuf)
 {
 	struct inode *inode = iocb->ki_filp->f_mapping->host;
 	size_t i_size = i_size_read(inode);
@@ -964,7 +1040,7 @@ famfs_rw_prep(struct kiocb *iocb, struct iov_iter *ubuf)
 	size_t max_count;
 	ssize_t rc;
 
-	rc = famfs_file_invalid(inode);
+	rc = famfs_file_bad(inode);
 	if (rc)
 		return rc;
 
@@ -980,11 +1056,11 @@ famfs_rw_prep(struct kiocb *iocb, struct iov_iter *ubuf)
 }
 
 ssize_t
-famfs_dax_read_iter(struct kiocb *iocb, struct iov_iter	*to)
+famfs_fuse_read_iter(struct kiocb *iocb, struct iov_iter	*to)
 {
 	ssize_t rc;
 
-	rc = famfs_rw_prep(iocb, to);
+	rc = famfs_fuse_rw_prep(iocb, to);
 	if (rc)
 		return rc;
 
@@ -997,20 +1073,12 @@ famfs_dax_read_iter(struct kiocb *iocb, struct iov_iter	*to)
 	return rc;
 }
 
-/**
- * famfs_dax_write_iter()
- *
- * We need our own write-iter in order to prevent append
- *
- * @iocb:
- * @from: iterator describing the user memory source for the write
- */
 ssize_t
-famfs_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
+famfs_fuse_write_iter(struct kiocb *iocb, struct iov_iter *from)
 {
 	ssize_t rc;
 
-	rc = famfs_rw_prep(iocb, from);
+	rc = famfs_fuse_rw_prep(iocb, from);
 	if (rc)
 		return rc;
 
@@ -1021,12 +1089,12 @@ famfs_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
 }
 
 int
-famfs_file_mmap(struct file *file, struct vm_area_struct *vma)
+famfs_fuse_mmap(struct file *file, struct vm_area_struct *vma)
 {
 	struct inode *inode = file_inode(file);
 	ssize_t rc;
 
-	rc = famfs_file_invalid(inode);
+	rc = famfs_file_bad(inode);
 	if (rc)
 		return (int)rc;
 
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 11201195924d..47b3d76acb38 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1778,7 +1778,7 @@ static ssize_t fuse_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	if (FUSE_IS_VIRTIO_DAX(fi))
 		return fuse_dax_read_iter(iocb, to);
 	if (fuse_file_famfs(fi))
-		return famfs_dax_read_iter(iocb, to);
+		return famfs_fuse_read_iter(iocb, to);
 
 	/* FOPEN_DIRECT_IO overrides FOPEN_PASSTHROUGH */
 	if (ff->open_flags & FOPEN_DIRECT_IO)
@@ -1802,7 +1802,7 @@ static ssize_t fuse_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	if (FUSE_IS_VIRTIO_DAX(fi))
 		return fuse_dax_write_iter(iocb, from);
 	if (fuse_file_famfs(fi))
-		return famfs_dax_write_iter(iocb, from);
+		return famfs_fuse_write_iter(iocb, from);
 
 	/* FOPEN_DIRECT_IO overrides FOPEN_PASSTHROUGH */
 	if (ff->open_flags & FOPEN_DIRECT_IO)
@@ -2648,7 +2648,7 @@ static int fuse_file_mmap(struct file *file, struct vm_area_struct *vma)
 	if (FUSE_IS_VIRTIO_DAX(fi))
 		return fuse_dax_mmap(file, vma);
 	if (fuse_file_famfs(fi))
-		return famfs_file_mmap(file, vma);
+		return famfs_fuse_mmap(file, vma);
 
 	/*
 	 * If inode is in passthrough io mode, because it has some file open
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 4c4c4f0ff280..702c1849720c 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -1561,9 +1561,9 @@ extern void fuse_sysctl_unregister(void);
 int famfs_file_init_dax(struct fuse_mount *fm,
 			     struct inode *inode, void *fmap_buf,
 			     size_t fmap_size);
-ssize_t famfs_dax_write_iter(struct kiocb *iocb, struct iov_iter *from);
-ssize_t famfs_dax_read_iter(struct kiocb *iocb, struct iov_iter	*to);
-int famfs_file_mmap(struct file *file, struct vm_area_struct *vma);
+ssize_t famfs_fuse_write_iter(struct kiocb *iocb, struct iov_iter *from);
+ssize_t famfs_fuse_read_iter(struct kiocb *iocb, struct iov_iter	*to);
+int famfs_fuse_mmap(struct file *file, struct vm_area_struct *vma);
 void __famfs_meta_free(void *map);
 void famfs_teardown(struct fuse_conn *fc);
 #endif
-- 
2.49.0



* [RFC PATCH 17/19] famfs_fuse: Add famfs metadata documentation
  2025-04-21  1:33 [RFC PATCH 00/19] famfs: port into fuse John Groves
                   ` (15 preceding siblings ...)
  2025-04-21  1:33 ` [RFC PATCH 16/19] famfs_fuse: Add holder_operations for dax notify_failure() John Groves
@ 2025-04-21  1:33 ` John Groves
  2025-04-21  3:51   ` Randy Dunlap
  2025-04-21  1:33 ` [RFC PATCH 18/19] famfs_fuse: Add documentation John Groves
                   ` (4 subsequent siblings)
  21 siblings, 1 reply; 58+ messages in thread
From: John Groves @ 2025-04-21  1:33 UTC (permalink / raw)
  To: John Groves, Dan Williams, Miklos Szeredi, Bernd Schubert
  Cc: John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Darrick J . Wong, Luis Henriques, Randy Dunlap, Jeff Layton,
	Kent Overstreet, Petr Vorel, Brian Foster, linux-doc,
	linux-kernel, nvdimm, linux-cxl, linux-fsdevel, Amir Goldstein,
	Jonathan Cameron, Stefan Hajnoczi, Joanne Koong, Josef Bacik,
	Aravind Ramesh, Ajay Joshi, John Groves

From: John Groves <John@Groves.net>

This describes the fmap metadata - both simple and interleaved

Signed-off-by: John Groves <john@groves.net>
---
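Not part of the patch: a small user-space sketch of the interleaved offset
arithmetic that the new comment describes, assuming the relations stated
there (chunk_num = offset / chunk_size, strip_num = chunk_num % nstrips) and
assuming the offset within a strip is stripe_num * chunk_size plus the offset
within the chunk, which agrees with the worked examples in the comment. The
struct strip_loc type and resolve_interleaved() are hypothetical names used
only for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define MiB (1024ULL * 1024ULL)

/* Hypothetical result type for this sketch; not a kernel structure. */
struct strip_loc {
	uint64_t chunk_num;	/* chunk index within the interleaved extent */
	uint64_t stripe_num;	/* row in the layout diagram */
	uint64_t strip_num;	/* column in the layout diagram */
	uint64_t strip_offset;	/* byte offset within that strip */
};

/*
 * Resolve a byte offset within one interleaved extent to a strip and a
 * byte offset within that strip. Constant time, since nstrips and
 * chunk_size are fixed for the extent.
 */
static struct strip_loc
resolve_interleaved(uint64_t offset, uint64_t nstrips, uint64_t chunk_size)
{
	struct strip_loc loc;
	uint64_t chunk_offset;

	assert(nstrips > 0 && chunk_size > 0);

	loc.chunk_num = offset / chunk_size;
	chunk_offset = offset % chunk_size;
	loc.stripe_num = loc.chunk_num / nstrips;
	loc.strip_num = loc.chunk_num % nstrips;
	loc.strip_offset = loc.stripe_num * chunk_size + chunk_offset;
	return loc;
}
```

With ie_nstrips = 4 and ie_chunk_size = 2MiB, resolve_interleaved(15 * MiB,
4, 2 * MiB) lands in chunk 7, stripe 1, strip 3, at offset 3MiB within the
strip, matching the third example in the comment.
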
 fs/fuse/famfs_kfmap.h | 90 ++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 85 insertions(+), 5 deletions(-)

diff --git a/fs/fuse/famfs_kfmap.h b/fs/fuse/famfs_kfmap.h
index 325adb8b99c5..7c8d57b52e64 100644
--- a/fs/fuse/famfs_kfmap.h
+++ b/fs/fuse/famfs_kfmap.h
@@ -7,10 +7,90 @@
 #ifndef FAMFS_KFMAP_H
 #define FAMFS_KFMAP_H
 
+
+/* KABI version 43 (aka v2) fmap structures
+ *
+ * The location of the memory backing for a famfs file is described by
+ * the response to the GET_FMAP fuse message (defined in
+ * include/uapi/linux/fuse.h).
+ *
+ * There are currently two extent formats: Simple and Interleaved.
+ *
+ * Simple extents are just (devindex, offset, length) tuples, where devindex
+ * references a devdax device that must be retrievable via the GET_DAXDEV
+ * message/response.
+ *
+ * The extent list size must be >= file_size.
+ *
+ * Interleaved extents merit some additional explanation. Interleaved
+ * extents stripe data across a collection of strips. Each strip is a
+ * contiguous allocation from a single devdax device - and is described by
+ * a simple_extent structure.
+ *
+ * Interleaved_extent example:
+ *   ie_nstrips = 4
+ *   ie_chunk_size = 2MiB
+ *   ie_nbytes = 24MiB
+ *
+ * ┌────────────┐────────────┐────────────┐────────────┐
+ * │Chunk = 0   │Chunk = 1   │Chunk = 2   │Chunk = 3   │
+ * │Strip = 0   │Strip = 1   │Strip = 2   │Strip = 3   │
+ * │Stripe = 0  │Stripe = 0  │Stripe = 0  │Stripe = 0  │
+ * │            │            │            │            │
+ * └────────────┘────────────┘────────────┘────────────┘
+ * │Chunk = 4   │Chunk = 5   │Chunk = 6   │Chunk = 7   │
+ * │Strip = 0   │Strip = 1   │Strip = 2   │Strip = 3   │
+ * │Stripe = 1  │Stripe = 1  │Stripe = 1  │Stripe = 1  │
+ * │            │            │            │            │
+ * └────────────┘────────────┘────────────┘────────────┘
+ * │Chunk = 8   │Chunk = 9   │Chunk = 10  │Chunk = 11  │
+ * │Strip = 0   │Strip = 1   │Strip = 2   │Strip = 3   │
+ * │Stripe = 2  │Stripe = 2  │Stripe = 2  │Stripe = 2  │
+ * │            │            │            │            │
+ * └────────────┘────────────┘────────────┘────────────┘
+ *
+ * * Data is laid out across chunks in chunk # order
+ * * Columns are strips
+ * * Strips are contiguous devdax extents, normally each coming from a
+ *   different
+ *   memory device
+ * * Rows are stripes
+ * * The number of chunks is (int)((file_size + chunk_size - 1) / chunk_size)
+ *   (and obviously the last chunk could be partial)
+ * * The stripe_size = (nstrips * chunk_size)
+ * * chunk_num(offset) = offset / chunk_size    //integer division
+ * * strip_num(offset) = chunk_num(offset) % nstrips
+ * * stripe_num(offset) = offset / stripe_size  //integer division
+ * * ...You get the idea - see the code for more details...
+ *
+ * Some concrete examples from the layout above:
+ * * Offset 0 in the file is offset 0 in chunk 0, which is offset 0 in
+ *   strip 0
+ * * Offset 4MiB in the file is offset 0 in chunk 2, which is offset 0 in
+ *   strip 2
+ * * Offset 15MiB in the file is offset 1MiB in chunk 7, which is offset
+ *   3MiB in strip 3
+ *
+ * Notes about this metadata format:
+ *
+ * * For various reasons, chunk_size must be a multiple of the applicable
+ *   PAGE_SIZE
+ * * Since chunk_size and nstrips are constant within an interleaved_extent,
+ *   resolving a file offset to a strip offset within a single
+ *   interleaved_ext is order 1.
+ * * If nstrips==1, a list of interleaved_ext structures degenerates to a
+ *   regular extent list (albeit with some wasted struct space).
+ */
+
+
 /*
- * These structures are the in-memory metadata format for famfs files. Metadata
- * retrieved via the GET_FMAP response is converted to this format for use in
- * resolving file mapping faults.
+ * The structures below are the in-memory metadata format for famfs files.
+ * Metadata retrieved via the GET_FMAP response is converted to this format
+ * for use in resolving file mapping faults.
+ *
+ * The GET_FMAP response contains the same information, but in a more
+ * message-and-versioning-friendly format. Those structs can be found in the
+ * famfs section of include/uapi/linux/fuse.h (aka fuse_kernel.h in libfuse)
  */
 
 enum famfs_file_type {
@@ -19,7 +99,7 @@ enum famfs_file_type {
 	FAMFS_LOG,
 };
 
-/* We anticipate the possiblity of supporting additional types of extents */
+/* We anticipate the possibility of supporting additional types of extents */
 enum famfs_extent_type {
 	SIMPLE_DAX_EXTENT,
 	INTERLEAVED_EXTENT,
@@ -63,7 +143,7 @@ struct famfs_file_meta {
 /*
  * dax_devlist
  *
- * This is the in-memory daxdev metadata that is populated by
+ * This is the in-memory daxdev metadata that is populated by parsing
  * the responses to GET_FMAP messages
  */
 struct famfs_daxdev {
-- 
2.49.0



* [RFC PATCH 18/19] famfs_fuse: Add documentation
  2025-04-21  1:33 [RFC PATCH 00/19] famfs: port into fuse John Groves
                   ` (16 preceding siblings ...)
  2025-04-21  1:33 ` [RFC PATCH 17/19] famfs_fuse: Add famfs metadata documentation John Groves
@ 2025-04-21  1:33 ` John Groves
  2025-04-22  2:10   ` Randy Dunlap
  2025-04-21  1:33 ` [RFC PATCH 19/19] famfs_fuse: (ignore) debug cruft John Groves
                   ` (3 subsequent siblings)
  21 siblings, 1 reply; 58+ messages in thread
From: John Groves @ 2025-04-21  1:33 UTC (permalink / raw)
  To: John Groves, Dan Williams, Miklos Szeredi, Bernd Schubert
  Cc: John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Darrick J . Wong, Luis Henriques, Randy Dunlap, Jeff Layton,
	Kent Overstreet, Petr Vorel, Brian Foster, linux-doc,
	linux-kernel, nvdimm, linux-cxl, linux-fsdevel, Amir Goldstein,
	Jonathan Cameron, Stefan Hajnoczi, Joanne Koong, Josef Bacik,
	Aravind Ramesh, Ajay Joshi, John Groves

Add Documentation/filesystems/famfs.rst and update MAINTAINERS

Signed-off-by: John Groves <john@groves.net>
---
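Not part of the patch: a minimal user-space sketch of the "no allocation on
write" rule that the new document states. It assumes, as the read/write path
in this series does, that a transfer is simply truncated so it never extends
past the fixed file size. sketch_clamp_count() is a hypothetical helper, not
a kernel function.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Clamp a requested transfer so it never extends past the fixed file size.
 * Returns the number of bytes that may be transferred, possibly 0.
 */
static uint64_t
sketch_clamp_count(uint64_t file_size, int64_t pos, uint64_t count)
{
	uint64_t remaining;

	assert(pos >= 0);

	/* At or past EOF: nothing to read, and a write cannot append. */
	if ((uint64_t)pos >= file_size)
		return 0;

	remaining = file_size - (uint64_t)pos;
	return count < remaining ? count : remaining;
}
```

Comparing the position against the size before subtracting avoids an
unsigned wrap-around when the position is already beyond the end of the
file.
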
 Documentation/filesystems/famfs.rst | 142 ++++++++++++++++++++++++++++
 Documentation/filesystems/index.rst |   1 +
 MAINTAINERS                         |   1 +
 3 files changed, 144 insertions(+)
 create mode 100644 Documentation/filesystems/famfs.rst

diff --git a/Documentation/filesystems/famfs.rst b/Documentation/filesystems/famfs.rst
new file mode 100644
index 000000000000..b6b3500b6905
--- /dev/null
+++ b/Documentation/filesystems/famfs.rst
@@ -0,0 +1,142 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+.. _famfs_index:
+
+==================================================================
+famfs: The fabric-attached memory file system
+==================================================================
+
+- Copyright (C) 2024-2025 Micron Technology, Inc.
+
+Introduction
+============
+Compute Express Link (CXL) provides a mechanism for disaggregated or
+fabric-attached memory (FAM). This creates opportunities for data sharing;
+clustered apps that would otherwise have to shard or replicate data can
+share one copy in disaggregated memory.
+
+Famfs, which is not CXL-specific in any way, provides a mechanism for
+multiple hosts to concurrently access data in shared memory, by giving it
+a file system interface. With famfs, any app that understands files can
+access data sets in shared memory. Although famfs supports read and write,
+the real point is to support mmap, which provides direct (dax) access to
+the memory - either writable or read-only.
+
+Shared memory can pose complex coherency and synchronization issues, but
+there are also simple cases. Two simple and eminently useful patterns that
+occur frequently in data analytics and AI are:
+
+* Serial Sharing - Only one host or process at a time has access to a file
+* Read-only Sharing - Multiple hosts or processes share read-only access
+  to a file
+
+The famfs fuse file system is part of the famfs framework; user space
+components [1] handle metadata allocation and distribution, and provide a
+low-level fuse server to expose files that map directly to [presumably
+shared] memory.
+
+The famfs framework manages coherency of its own metadata and structures,
+but does not attempt to manage coherency for applications.
+
+Famfs also provides data isolation between files. That is, even though
+the host has access to an entire memory "device" (as a devdax device), apps
+cannot write to memory for which the file is read-only, and mapping one
+file provides isolation from the memory of all other files. This is pretty
+basic, but some experimental shared memory usage patterns provide no such
+isolation.
+
+Principles of Operation
+=======================
+
+Famfs is a file system that uses one or more devdax devices as first-class
+backing devices. Metadata maintenance and query operations happen
+entirely in user space.
+
+The famfs low-level fuse server daemon provides file maps (fmaps) and
+devdax device info to the fuse/famfs kernel component so that
+read/write/mapping faults can be handled without up-calls for all active
+files.
+
+The famfs user space is responsible for maintaining and distributing
+consistent metadata. This is currently handled via an append-only
+metadata log within the memory, but this is orthogonal to the fuse/famfs
+kernel code.
+
+Once instantiated, "the same file" on each host points to the same shared
+memory, but in-memory metadata (inodes, etc.) is ephemeral on each host
+that has a famfs instance mounted. Use cases are free to allow or not
+allow mutations to data on a file-by-file basis.
+
+When an app accesses a data object in a famfs file, there is no page cache
+involvement. The CPU cache is loaded directly from the shared memory. In
+some use cases, this is an enormous reduction in read amplification compared
+to loading an entire page into the page cache.
+
+
+Famfs is Not a Conventional File System
+---------------------------------------
+
+Famfs files can be accessed by conventional means, but there are
+limitations. The kernel component of fuse/famfs is not involved in the
+allocation of backing memory for files at all; the famfs user space
+creates files and responds as a low-level fuse server with fmaps and
+devdax device info upon request.
+
+Famfs differs in some important ways from conventional file systems:
+
+* Files must be pre-allocated by the famfs framework; allocation is never
+  performed on (or after) write.
+* Any operation that changes a file's size is considered to put the file
+  in an invalid state, disabling access to the data. It may be possible to
+  revisit this in the future. (Typically the famfs user space can restore
+  files to a valid state by replaying the famfs metadata log.)
+
+Famfs exists to apply the existing file system abstractions to shared
+memory so applications and workflows can more easily adapt to an
+environment with disaggregated shared memory.
+
+Memory Error Handling
+=====================
+
+Possible memory errors include timeouts, poison and unexpected
+reconfiguration of an underlying dax device. In all of these cases, famfs
+receives a call from the devdax layer via the notify_failure() function in
+its dax_holder_operations. When a memory error is reported, access to the
+affected daxdev is disabled to avoid further errors or corruption.
+
+In all known cases, famfs can be unmounted cleanly. In most cases errors
+can be cleared by re-initializing the memory - at which point a new famfs
+file system can be created.
+
+Key Requirements
+================
+
+The primary requirements for famfs are:
+
+1. Must support a file system abstraction backed by sharable devdax memory
+2. Files must efficiently handle VMA faults
+3. Must support metadata distribution in a sharable way
+4. Must handle clients with a stale copy of metadata
+
+The famfs kernel component takes care of 1-2 above by caching each file's
+mapping metadata in the kernel.
+
+Requirements 3 and 4 are handled by the user space components, and are
+largely orthogonal to the functionality of the famfs kernel module.
+
+Requirements 3 and 4 cannot be met by conventional fs-dax file systems
+(e.g. xfs) because they use write-back metadata; it is not valid to mount
+such a file system on two hosts from the same in-memory image.
+
+
+Famfs Usage
+===========
+
+Famfs usage is documented at [1].
+
+
+References
+==========
+
+- [1] Famfs user space repository and documentation
+      https://github.com/cxl-micron-reskit/famfs
diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst
index 2636f2a41bd3..5aad315206ee 100644
--- a/Documentation/filesystems/index.rst
+++ b/Documentation/filesystems/index.rst
@@ -90,6 +90,7 @@ Documentation for filesystem implementations.
    ext3
    ext4/index
    f2fs
+   famfs
    gfs2
    gfs2-uevents
    gfs2-glocks
diff --git a/MAINTAINERS b/MAINTAINERS
index 2a5a7e0e8b28..46744be9e6d1 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8814,6 +8814,7 @@ M:	John Groves <John@Groves.net>
 L:	linux-cxl@vger.kernel.org
 L:	linux-fsdevel@vger.kernel.org
 S:	Supported
+F:	Documentation/filesystems/famfs.rst
 F:	fs/fuse/famfs.c
 F:	fs/fuse/famfs_kfmap.h
 
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [RFC PATCH 19/19] famfs_fuse: (ignore) debug cruft
  2025-04-21  1:33 [RFC PATCH 00/19] famfs: port into fuse John Groves
                   ` (17 preceding siblings ...)
  2025-04-21  1:33 ` [RFC PATCH 18/19] famfs_fuse: Add documentation John Groves
@ 2025-04-21  1:33 ` John Groves
  2025-04-21 18:27 ` [RFC PATCH 00/19] famfs: port into fuse Darrick J. Wong
                   ` (2 subsequent siblings)
  21 siblings, 0 replies; 58+ messages in thread
From: John Groves @ 2025-04-21  1:33 UTC (permalink / raw)
  To: John Groves, Dan Williams, Miklos Szeredi, Bernd Schubert
  Cc: John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Darrick J . Wong, Luis Henriques, Randy Dunlap, Jeff Layton,
	Kent Overstreet, Petr Vorel, Brian Foster, linux-doc,
	linux-kernel, nvdimm, linux-cxl, linux-fsdevel, Amir Goldstein,
	Jonathan Cameron, Stefan Hajnoczi, Joanne Koong, Josef Bacik,
	Aravind Ramesh, Ajay Joshi, John Groves

This debug cruft will be dropped from the "real" patch set

Signed-off-by: John Groves <john@groves.net>
---
 fs/fuse/Makefile |  2 +-
 fs/fuse/dev.c    | 61 ++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 62 insertions(+), 1 deletion(-)

diff --git a/fs/fuse/Makefile b/fs/fuse/Makefile
index 65a12975d734..ad3e06a9a809 100644
--- a/fs/fuse/Makefile
+++ b/fs/fuse/Makefile
@@ -4,7 +4,7 @@
 #
 
 # Needed for trace events
-ccflags-y = -I$(src)
+ccflags-y = -I$(src) -g -DDEBUG -fno-inline -fno-omit-frame-pointer
 
 obj-$(CONFIG_FUSE_FS) += fuse.o
 obj-$(CONFIG_CUSE) += cuse.o
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 51e31df4c546..ba947511a379 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -30,6 +30,60 @@
 MODULE_ALIAS_MISCDEV(FUSE_MINOR);
 MODULE_ALIAS("devname:fuse");
 
+static char *opname[] = {
+	[FUSE_LOOKUP]	   =   "LOOKUP",
+	[FUSE_FORGET]	   =   "FORGET",
+	[FUSE_GETATTR]	   =   "GETATTR",
+	[FUSE_SETATTR]	   =   "SETATTR",
+	[FUSE_READLINK]	   =   "READLINK",
+	[FUSE_SYMLINK]	   =   "SYMLINK",
+	[FUSE_MKNOD]	   =   "MKNOD",
+	[FUSE_MKDIR]	   =   "MKDIR",
+	[FUSE_UNLINK]	   =   "UNLINK",
+	[FUSE_RMDIR]	   =   "RMDIR",
+	[FUSE_RENAME]	   =   "RENAME",
+	[FUSE_LINK]	   =   "LINK",
+	[FUSE_OPEN]	   =   "OPEN",
+	[FUSE_READ]	   =   "READ",
+	[FUSE_WRITE]	   =   "WRITE",
+	[FUSE_STATFS]	   =   "STATFS",
+	[FUSE_STATX]       =   "STATX",
+	[FUSE_RELEASE]	   =   "RELEASE",
+	[FUSE_FSYNC]	   =   "FSYNC",
+	[FUSE_SETXATTR]	   =   "SETXATTR",
+	[FUSE_GETXATTR]	   =   "GETXATTR",
+	[FUSE_LISTXATTR]   =   "LISTXATTR",
+	[FUSE_REMOVEXATTR] =   "REMOVEXATTR",
+	[FUSE_FLUSH]	   =   "FLUSH",
+	[FUSE_INIT]	   =   "INIT",
+	[FUSE_OPENDIR]	   =   "OPENDIR",
+	[FUSE_READDIR]	   =   "READDIR",
+	[FUSE_RELEASEDIR]  =   "RELEASEDIR",
+	[FUSE_FSYNCDIR]	   =   "FSYNCDIR",
+	[FUSE_GETLK]	   =   "GETLK",
+	[FUSE_SETLK]	   =   "SETLK",
+	[FUSE_SETLKW]	   =   "SETLKW",
+	[FUSE_ACCESS]	   =  "ACCESS",
+	[FUSE_CREATE]	   =  "CREATE",
+	[FUSE_INTERRUPT]   =  "INTERRUPT",
+	[FUSE_BMAP]	   =  "BMAP",
+	[FUSE_IOCTL]	   =  "IOCTL",
+	[FUSE_POLL]	   =  "POLL",
+	[FUSE_FALLOCATE]   =  "FALLOCATE",
+	[FUSE_DESTROY]	   =  "DESTROY",
+	[FUSE_NOTIFY_REPLY] = "NOTIFY_REPLY",
+	[FUSE_BATCH_FORGET] = "BATCH_FORGET",
+	[FUSE_READDIRPLUS] = "READDIRPLUS",
+	[FUSE_RENAME2]     =  "RENAME2",
+	[FUSE_COPY_FILE_RANGE] = "COPY_FILE_RANGE",
+	[FUSE_LSEEK]	   = "LSEEK",
+	[CUSE_INIT]	   = "CUSE_INIT",
+	[FUSE_TMPFILE]     = "TMPFILE",
+	[FUSE_SYNCFS]      = "SYNCFS",
+	[FUSE_GET_FMAP]    = "GET_FMAP",
+	[FUSE_GET_DAXDEV]  = "GET_DAXDEV",
+};
+
 static struct kmem_cache *fuse_req_cachep;
 
 static void fuse_request_init(struct fuse_mount *fm, struct fuse_req *req)
@@ -566,6 +620,13 @@ ssize_t __fuse_simple_request(struct mnt_idmap *idmap,
 	}
 	fuse_put_request(req);
 
+	pr_debug("%s: opcode=%s (%d) nodeid=%lld out_numargs=%d len[0]=%d len[1]=%d\n",
+		  __func__, opname[args->opcode], args->opcode,
+		  args->nodeid,
+		  args->out_numargs,
+		  args->out_args[0].size,
+		  (args->out_numargs > 1) ? args->out_args[1].size : 0);
+
 	return ret;
 }
 
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 14/19] famfs_fuse: GET_DAXDEV message and daxdev_table
  2025-04-21  1:33 ` [RFC PATCH 14/19] famfs_fuse: GET_DAXDEV message and daxdev_table John Groves
@ 2025-04-21  3:43   ` Randy Dunlap
  2025-04-21 20:57     ` John Groves
  0 siblings, 1 reply; 58+ messages in thread
From: Randy Dunlap @ 2025-04-21  3:43 UTC (permalink / raw)
  To: John Groves, Dan Williams, Miklos Szeredi, Bernd Schubert
  Cc: John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Darrick J . Wong, Luis Henriques, Jeff Layton, Kent Overstreet,
	Petr Vorel, Brian Foster, linux-doc, linux-kernel, nvdimm,
	linux-cxl, linux-fsdevel, Amir Goldstein, Jonathan Cameron,
	Stefan Hajnoczi, Joanne Koong, Josef Bacik, Aravind Ramesh,
	Ajay Joshi

Hi,

On 4/20/25 6:33 PM, John Groves wrote:
> * The new GET_DAXDEV message/response is enabled
> * The command is triggered by the update_daxdev_table() call, if there
>   are any daxdevs in the subject fmap that are not represented in the
>   daxdev_table yet.
> 
> Signed-off-by: John Groves <john@groves.net>
> ---
>  fs/fuse/famfs.c           | 281 ++++++++++++++++++++++++++++++++++++--
>  fs/fuse/famfs_kfmap.h     |  23 ++++
>  fs/fuse/fuse_i.h          |   4 +
>  fs/fuse/inode.c           |   2 +
>  fs/namei.c                |   1 +
>  include/uapi/linux/fuse.h |  15 ++
>  6 files changed, 316 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/fuse/famfs.c b/fs/fuse/famfs.c
> index e62c047d0950..2e182cb7d7c9 100644
> --- a/fs/fuse/famfs.c
> +++ b/fs/fuse/famfs.c
> @@ -20,6 +20,250 @@
>  #include "famfs_kfmap.h"
>  #include "fuse_i.h"
>  
> +/*
> + * famfs_teardown()
> + *
> + * Deallocate famfs metadata for a fuse_conn
> + */
> +void
> +famfs_teardown(struct fuse_conn *fc)

Is this function formatting prevalent in fuse?
It's a bit different from most Linux.
(many locations throughout the patch set)

> +{
> +	struct famfs_dax_devlist *devlist = fc->dax_devlist;
> +	int i;
> +
> +	fc->dax_devlist = NULL;
> +
> +	if (!devlist)
> +		return;
> +
> +	if (!devlist->devlist)
> +		goto out;
> +
> +	/* Close & release all the daxdevs in our table */
> +	for (i = 0; i < devlist->nslots; i++) {
> +		if (devlist->devlist[i].valid && devlist->devlist[i].devp)
> +			fs_put_dax(devlist->devlist[i].devp, fc);
> +	}
> +	kfree(devlist->devlist);
> +
> +out:
> +	kfree(devlist);
> +}
> +
> +static int
> +famfs_verify_daxdev(const char *pathname, dev_t *devno)
> +{
> +	struct inode *inode;
> +	struct path path;
> +	int err;
> +
> +	if (!pathname || !*pathname)
> +		return -EINVAL;
> +
> +	err = kern_path(pathname, LOOKUP_FOLLOW, &path);
> +	if (err)
> +		return err;
> +
> +	inode = d_backing_inode(path.dentry);
> +	if (!S_ISCHR(inode->i_mode)) {
> +		err = -EINVAL;
> +		goto out_path_put;
> +	}
> +
> +	if (!may_open_dev(&path)) { /* had to export this */
> +		err = -EACCES;
> +		goto out_path_put;
> +	}
> +
> +	*devno = inode->i_rdev;
> +
> +out_path_put:
> +	path_put(&path);
> +	return err;
> +}
> +
> +/**
> + * famfs_fuse_get_daxdev()

Missing " - <short function description>"
but then it's a static function, so kernel-doc is not required.
It's up to you, but please use full kernel-doc notation if using kernel-doc.

> + *
> + * Send a GET_DAXDEV message to the fuse server to retrieve info on a
> + * dax device.
> + *
> + * @fm    - fuse_mount
> + * @index - the index of the dax device; daxdevs are referred to by index
> + *          in fmaps, and the server resolves the index to a particular daxdev

Parameter names in kernel-doc notation should be followed by a ':', not '-'.

> + *
> + * Returns: 0=success
> + *          -errno=failure
> + */
> +static int
> +famfs_fuse_get_daxdev(struct fuse_mount *fm, const u64 index)
> +{
> +	struct fuse_daxdev_out daxdev_out = { 0 };
> +	struct fuse_conn *fc = fm->fc;
> +	struct famfs_daxdev *daxdev;
> +	int err = 0;
> +
> +	FUSE_ARGS(args);
> +
> +	pr_notice("%s: index=%lld\n", __func__, index);
> +
> +	/* Store the daxdev in our table */
> +	if (index >= fc->dax_devlist->nslots) {
> +		pr_err("%s: index(%lld) > nslots(%d)\n",
> +		       __func__, index, fc->dax_devlist->nslots);
> +		err = -EINVAL;
> +		goto out;
> +	}
> +
> +	args.opcode = FUSE_GET_DAXDEV;
> +	args.nodeid = index;
> +
> +	args.in_numargs = 0;
> +
> +	args.out_numargs = 1;
> +	args.out_args[0].size = sizeof(daxdev_out);
> +	args.out_args[0].value = &daxdev_out;
> +
> +	/* Send GET_DAXDEV command */
> +	err = fuse_simple_request(fm, &args);
> +	if (err) {
> +		pr_err("%s: err=%d from fuse_simple_request()\n",
> +		       __func__, err);
> +		/* Error will be that the payload is smaller than FMAP_BUFSIZE,
> +		 * which is the max we can handle. Empty payload handled below.
> +		 */

Usual multi-line comment format is
		/*
		 * line1
		 * line2
		 */
unless fuse is all different (like netdev is).

> +		goto out;
> +	}
> +
> +	down_write(&fc->famfs_devlist_sem);
> +
> +	daxdev = &fc->dax_devlist->devlist[index];
> +	pr_debug("%s: dax_devlist %llx daxdev[%lld]=%llx\n", __func__,
> +		 (u64)fc->dax_devlist, index, (u64)daxdev);
> +
> +	/* Abort if daxdev is now valid */
> +	if (daxdev->valid) {
> +		up_write(&fc->famfs_devlist_sem);
> +		/* We already have a valid entry at this index */
> +		err = -EALREADY;
> +		goto out;
> +	}
> +
> +	/* This verifies that the dev is valid and can be opened and gets the devno */
> +	pr_debug("%s: famfs_verify_daxdev(%s)\n", __func__, daxdev_out.name);
> +	err = famfs_verify_daxdev(daxdev_out.name, &daxdev->devno);
> +	if (err) {
> +		up_write(&fc->famfs_devlist_sem);
> +		pr_err("%s: err=%d from famfs_verify_daxdev()\n", __func__, err);
> +		goto out;
> +	}
> +
> +	/* This will fail if it's not a dax device */
> +	pr_debug("%s: dax_dev_get(%x)\n", __func__, daxdev->devno);
> +	daxdev->devp = dax_dev_get(daxdev->devno);
> +	if (!daxdev->devp) {
> +		up_write(&fc->famfs_devlist_sem);
> +		pr_warn("%s: device %s not found or not dax\n",
> +			__func__, daxdev_out.name);
> +		err = -ENODEV;
> +		goto out;
> +	}
> +
> +	daxdev->name = kstrdup(daxdev_out.name, GFP_KERNEL);
> +	wmb(); /* all daxdev fields must be visible before marking it valid */
> +	daxdev->valid = 1;
> +
> +	up_write(&fc->famfs_devlist_sem);
> +
> +	pr_debug("%s: daxdev(%lld, %s)=%llx opened and marked valid\n",
> +		 __func__, index, daxdev->name, (u64)daxdev);
> +
> +out:
> +	return err;
> +}
> +
> +/**
> + * famfs_update_daxdev_table()

Missing short function description above or don't use kernel-doc notation.

> + *
> + * This function is called for each new file fmap, to verify whether all
> + * referenced daxdevs are already known (i.e. in the table). Any daxdev
> + * indices that are not in the table will be retrieved via
> + * famfs_fuse_get_daxdev()
> + * @fm   - fuse_mount
> + * @meta - famfs_file_meta, in-memory format, built from a GET_FMAP response
> + *
> + * Returns: 0=success
> + *          -errno=failure
> + */
> +static int
> +famfs_update_daxdev_table(
> +	struct fuse_mount *fm,
> +	const struct famfs_file_meta *meta)
> +{
> +	struct famfs_dax_devlist *local_devlist;
> +	struct fuse_conn *fc = fm->fc;
> +	int err;
> +	int i;
> +
> +	pr_debug("%s: dev_bitmap=0x%llx\n", __func__, meta->dev_bitmap);
> +
> +	/* First time through we will need to allocate the dax_devlist */
> +	if (!fc->dax_devlist) {
> +		local_devlist = kcalloc(1, sizeof(*fc->dax_devlist), GFP_KERNEL);
> +		if (!local_devlist)
> +			return -ENOMEM;
> +
> +		local_devlist->nslots = MAX_DAXDEVS;
> +		pr_debug("%s: allocate dax_devlist=%llx\n", __func__,
> +			 (u64)local_devlist);
> +
> +		local_devlist->devlist = kcalloc(MAX_DAXDEVS,
> +						 sizeof(struct famfs_daxdev),
> +						 GFP_KERNEL);
> +		if (!local_devlist->devlist) {
> +			kfree(local_devlist);
> +			return -ENOMEM;
> +		}
> +
> +		/* We don't need the famfs_devlist_sem here because we use cmpxchg... */
> +		if (cmpxchg(&fc->dax_devlist, NULL, local_devlist) != NULL) {
> +			pr_debug("%s: aborting new devlist\n", __func__);
> +			kfree(local_devlist->devlist);
> +			kfree(local_devlist); /* another thread beat us to it */
> +		} else {
> +			pr_debug("%s: published new dax_devlist %llx / %llx\n",
> +				 __func__, (u64)local_devlist,
> +				 (u64)local_devlist->devlist);
> +		}
> +	}
> +
> +	down_read(&fc->famfs_devlist_sem);
> +	for (i = 0; i < fc->dax_devlist->nslots; i++) {
> +		if (meta->dev_bitmap & (1ULL << i)) {
> +			/* This file meta struct references devindex i
> +			 * if devindex i isn't in the table; get it...
> +			 */
> +			if (!(fc->dax_devlist->devlist[i].valid)) {
> +				up_read(&fc->famfs_devlist_sem);
> +
> +				pr_notice("%s: daxdev=%d (%llx) invalid...getting\n",
> +					  __func__, i,
> +					  (u64)(&fc->dax_devlist->devlist[i]));
> +				err = famfs_fuse_get_daxdev(fm, i);
> +				if (err)
> +					pr_err("%s: failed to get daxdev=%d\n",
> +					       __func__, i);
> +
> +				down_read(&fc->famfs_devlist_sem);
> +			}
> +		}
> +	}
> +	up_read(&fc->famfs_devlist_sem);
> +
> +	return 0;
> +}
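
For readers following the lock-free allocation above: the "allocate
locally, publish with compare-and-swap, free on a lost race" pattern can
be sketched in user-space C11. This is only an illustration, not the
kernel code; struct devlist, shared_devlist and devlist_get_or_create()
are hypothetical names, and C11 atomic_compare_exchange_strong() stands in
for the kernel's cmpxchg().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct famfs_dax_devlist */
struct devlist {
	int nslots;
};

/* Shared pointer; NULL until some thread publishes a list */
static _Atomic(struct devlist *) shared_devlist;

/* Return the published list, creating and publishing one if none exists */
static inline struct devlist *devlist_get_or_create(int nslots)
{
	struct devlist *cur = atomic_load(&shared_devlist);
	struct devlist *expected = NULL;
	struct devlist *mine;

	if (cur)
		return cur;

	mine = calloc(1, sizeof(*mine));
	if (!mine)
		return NULL;
	mine->nslots = nslots;

	/* Publish only if the shared pointer is still NULL */
	if (atomic_compare_exchange_strong(&shared_devlist, &expected, mine))
		return mine;

	/* Another thread won the race; discard our copy and use theirs */
	free(mine);
	return expected;
}
```

Only one caller ever succeeds in publishing; every other caller frees its
private copy and continues with the published one, which is why no lock is
needed around the allocation itself.
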
> +
> +/***************************************************************************/
>  
>  void
>  __famfs_meta_free(void *famfs_meta)
> @@ -67,12 +311,15 @@ famfs_check_ext_alignment(struct famfs_meta_simple_ext *se)
>  }
>  
>  /**
> - * famfs_meta_alloc() - Allocate famfs file metadata
> + * famfs_fuse_meta_alloc() - Allocate famfs file metadata
>   * @metap:       Pointer to an mcache_map_meta pointer
>   * @ext_count:  The number of extents needed
> + *
> + * Returns: 0=success
> + *          -errno=failure
>   */
>  static int
> -famfs_meta_alloc_v3(
> +famfs_fuse_meta_alloc(
>  	void *fmap_buf,
>  	size_t fmap_buf_size,
>  	struct famfs_file_meta **metap)
> @@ -92,28 +339,25 @@ famfs_meta_alloc_v3(
>  	if (next_offset > fmap_buf_size) {
>  		pr_err("%s:%d: fmap_buf underflow offset/size %ld/%ld\n",
>  		       __func__, __LINE__, next_offset, fmap_buf_size);
> -		rc = -EINVAL;
> -		goto errout;
> +		return -EINVAL;
>  	}
>  
>  	if (fmh->nextents < 1) {
>  		pr_err("%s: nextents %d < 1\n", __func__, fmh->nextents);
> -		rc = -EINVAL;
> -		goto errout;
> +		return -EINVAL;
>  	}
>  
>  	if (fmh->nextents > FUSE_FAMFS_MAX_EXTENTS) {
>  		pr_err("%s: nextents %d > max (%d) 1\n",
>  		       __func__, fmh->nextents, FUSE_FAMFS_MAX_EXTENTS);
> -		rc = -E2BIG;
> -		goto errout;
> +		return -E2BIG;
>  	}
>  
>  	meta = kzalloc(sizeof(*meta), GFP_KERNEL);
>  	if (!meta)
>  		return -ENOMEM;
> -	meta->error = false;
>  
> +	meta->error = false;
>  	meta->file_type = fmh->file_type;
>  	meta->file_size = fmh->file_size;
>  	meta->fm_extent_type = fmh->ext_type;
> @@ -298,6 +542,20 @@ famfs_meta_alloc_v3(
>  	return rc;
>  }
>  
> +/**
> + * famfs_file_init_dax()

Missing kernel-doc notation above.

> + *
> + * Initialize famfs metadata for a file, based on the contents of the GET_FMAP
> + * response
> + *
> + * @fm        - fuse_mount
> + * @inode     - the inode
> + * @fmap_buf  - fmap response message
> + * @fmap_size - Size of the fmap message

Use
 * @parameter: description
instead of '-'.

> + *
> + * Returns: 0=success
> + *          -errno=failure
> + */
>  int
>  famfs_file_init_dax(
>  	struct fuse_mount *fm,
> @@ -316,10 +574,13 @@ famfs_file_init_dax(
>  		return -EEXIST;
>  	}
>  
> -	rc = famfs_meta_alloc_v3(fmap_buf, fmap_size, &meta);
> +	rc = famfs_fuse_meta_alloc(fmap_buf, fmap_size, &meta);
>  	if (rc)
>  		goto errout;
>  
> +	/* Make sure this fmap doesn't reference any unknown daxdevs */
> +	famfs_update_daxdev_table(fm, meta);
> +
>  	/* Publish the famfs metadata on fi->famfs_meta */
>  	inode_lock(inode);
>  	if (fi->famfs_meta) {
> diff --git a/fs/fuse/famfs_kfmap.h b/fs/fuse/famfs_kfmap.h
> index ce785d76719c..325adb8b99c5 100644
> --- a/fs/fuse/famfs_kfmap.h
> +++ b/fs/fuse/famfs_kfmap.h
> @@ -60,4 +60,27 @@ struct famfs_file_meta {
>  	};
>  };
>  
> +/*
> + * dax_devlist

Missing struct short description above?
It apparently should be

/*
 * struct famfs_daxdev - <short description>
instead of dax_devlist.

> + *
> + * This is the in-memory daxdev metadata that is populated by
> + * the responses to GET_FMAP messages
> + */
> +struct famfs_daxdev {
> +	/* Include dev uuid? */
> +	bool valid;
> +	bool error;
> +	dev_t devno;
> +	struct dax_device *devp;
> +	char *name;
> +};
> +
> +#define MAX_DAXDEVS 24
> +
> +struct famfs_dax_devlist {
> +	int nslots;
> +	int ndevs;
> +	struct famfs_daxdev *devlist; /* XXX: make this an xarray! */
> +};
> +
>  #endif /* FAMFS_KFMAP_H */



-- 
~Randy


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 17/19] famfs_fuse: Add famfs metadata documentation
  2025-04-21  1:33 ` [RFC PATCH 17/19] famfs_fuse: Add famfs metadata documentation John Groves
@ 2025-04-21  3:51   ` Randy Dunlap
  2025-04-21 21:00     ` John Groves
  0 siblings, 1 reply; 58+ messages in thread
From: Randy Dunlap @ 2025-04-21  3:51 UTC (permalink / raw)
  To: John Groves, Dan Williams, Miklos Szeredi, Bernd Schubert
  Cc: John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Darrick J . Wong, Luis Henriques, Jeff Layton, Kent Overstreet,
	Petr Vorel, Brian Foster, linux-doc, linux-kernel, nvdimm,
	linux-cxl, linux-fsdevel, Amir Goldstein, Jonathan Cameron,
	Stefan Hajnoczi, Joanne Koong, Josef Bacik, Aravind Ramesh,
	Ajay Joshi



On 4/20/25 6:33 PM, John Groves wrote:
> From: John Groves <John@Groves.net>
> 
> This describes the fmap metadata - both simple and interleaved
> 
> Signed-off-by: John Groves <john@groves.net>
> ---
>  fs/fuse/famfs_kfmap.h | 90 ++++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 85 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/fuse/famfs_kfmap.h b/fs/fuse/famfs_kfmap.h
> index 325adb8b99c5..7c8d57b52e64 100644
> --- a/fs/fuse/famfs_kfmap.h
> +++ b/fs/fuse/famfs_kfmap.h
> @@ -7,10 +7,90 @@
>  #ifndef FAMFS_KFMAP_H
>  #define FAMFS_KFMAP_H
>  
> +
> +/* KABI version 43 (aka v2) fmap structures
> + *
> + * The location of the memory backing for a famfs file is described by
> + * the response to the GET_FMAP fuse message (devined in

                                                 divined

> + * include/uapi/linux/fuse.h
> + *
> + * There are currently two extent formats: Simple and Interleaved.
> + *
> + * Simple extents are just (devindex, offset, length) tuples, where devindex
> + * references a devdax device that must retrievable via the GET_DAXDEV

                                      must be

> + * message/response.
> + *
> + * The extent list size must be >= file_size.
> + *
> + * Interleaved extents merit some additional explanation. Interleaved
> + * extents stripe data across a collection of strips. Each strip is a
> + * contiguous allocation from a single devdax device - and is described by
> + * a simple_extent structure.
> + *
> + * Interleaved_extent example:
> + *   ie_nstrips = 4
> + *   ie_chunk_size = 2MiB
> + *   ie_nbytes = 24MiB
> + *
> + * ┌────────────┐────────────┐────────────┐────────────┐
> + * │Chunk = 0   │Chunk = 1   │Chunk = 2   │Chunk = 3   │
> + * │Strip = 0   │Strip = 1   │Strip = 2   │Strip = 3   │
> + * │Stripe = 0  │Stripe = 0  │Stripe = 0  │Stripe = 0  │
> + * │            │            │            │            │
> + * └────────────┘────────────┘────────────┘────────────┘
> + * │Chunk = 4   │Chunk = 5   │Chunk = 6   │Chunk = 7   │
> + * │Strip = 0   │Strip = 1   │Strip = 2   │Strip = 3   │
> + * │Stripe = 1  │Stripe = 1  │Stripe = 1  │Stripe = 1  │
> + * │            │            │            │            │
> + * └────────────┘────────────┘────────────┘────────────┘
> + * │Chunk = 8   │Chunk = 9   │Chunk = 10  │Chunk = 11  │
> + * │Strip = 0   │Strip = 1   │Strip = 2   │Strip = 3   │
> + * │Stripe = 2  │Stripe = 2  │Stripe = 2  │Stripe = 2  │
> + * │            │            │            │            │
> + * └────────────┘────────────┘────────────┘────────────┘
> + *
> + * * Data is laid out across chunks in chunk # order
> + * * Columns are strips
> + * * Strips are contiguous devdax extents, normally each coming from a
> + *   different
> + *   memory device

Combine 2 lines above.

> + * * Rows are stripes
> + * * The number of chunks is (int)((file_size + chunk_size - 1) / chunk_size)
> + *   (and obviously the last chunk could be partial)
> + * * The stripe_size = (nstrips * chunk_size)
> + * * chunk_num(offset) = offset / chunk_size    //integer division
> + * * strip_num(offset) = chunk_num(offset) % nstrips
> + * * stripe_num(offset) = offset / stripe_size  //integer division
> + * * ...You get the idea - see the code for more details...
> + *
> + * Some concrete examples from the layout above:
> + * * Offset 0 in the file is offset 0 in chunk 0, which is offset 0 in
> + *   strip 0
> + * * Offset 4MiB in the file is offset 0 in chunk 2, which is offset 0 in
> + *   strip 2
> + * * Offset 15MiB in the file is offset 1MiB in chunk 7, which is offset
> + *   3MiB in strip 3
> + *
> + * Notes about this metadata format:
> + *
> + * * For various reasons, chunk_size must be a multiple of the applicable
> + *   PAGE_SIZE
> + * * Since chunk_size and nstrips are constant within an interleaved_extent,
> + *   resolving a file offset to a strip offset within a single
> + *   interleaved_ext is order 1.
> + * * If nstrips==1, a list of interleaved_ext structures degenerates to a
> + *   regular extent list (albeit with some wasted struct space).
> + */
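
The offset arithmetic described in the comment above can be checked with a
small user-space sketch. This is an illustration only, under the layout
shown (nstrips = 4, chunk_size = 2MiB); struct strip_loc and
resolve_interleaved() are hypothetical names that do not appear in the
patch, and the sketch assumes the strip number is the chunk number modulo
the number of strips, which is what the concrete examples imply.

```c
#include <assert.h>
#include <stdint.h>

#define MiB (1024ULL * 1024ULL)

/* Hypothetical result type, for illustration only (not a famfs struct) */
struct strip_loc {
	uint64_t chunk_num;    /* chunk index within the file */
	uint64_t strip_num;    /* strip (column) that holds the chunk */
	uint64_t stripe_num;   /* stripe (row) that holds the chunk */
	uint64_t strip_offset; /* byte offset within that strip */
};

/* Resolve a file offset within one interleaved extent to a strip offset */
static inline struct strip_loc
resolve_interleaved(uint64_t offset, uint64_t nstrips, uint64_t chunk_size)
{
	struct strip_loc loc;

	assert(nstrips > 0 && chunk_size > 0);

	loc.chunk_num = offset / chunk_size;
	loc.strip_num = loc.chunk_num % nstrips;
	/* Same value as offset / (nstrips * chunk_size) */
	loc.stripe_num = loc.chunk_num / nstrips;
	/* Each earlier stripe contributes one full chunk to this strip */
	loc.strip_offset = loc.stripe_num * chunk_size + offset % chunk_size;

	return loc;
}
```

With these inputs, resolve_interleaved(15 * MiB, 4, 2 * MiB) yields chunk
7, strip 3, stripe 1 and a strip offset of 3MiB, matching the third
example in the comment. The work is constant per lookup, independent of
the file size.
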
> +
> +
>  /*
> - * These structures are the in-memory metadata format for famfs files. Metadata
> - * retrieved via the GET_FMAP response is converted to this format for use in
> - * resolving file mapping faults.
> + * The structures below are the in-memory metadata format for famfs files.
> + * Metadata retrieved via the GET_FMAP response is converted to this format
> + * for use in  * resolving file mapping faults.

                  ^drop

> + *
> + * The GET_FMAP response contains the same information, but in a more
> + * message-and-versioning-friendly format. Those structs can be found in the
> + * famfs section of include/uapi/linux/fuse.h (aka fuse_kernel.h in libfuse)
>   */
>  
>  enum famfs_file_type {
> @@ -19,7 +99,7 @@ enum famfs_file_type {
>  	FAMFS_LOG,
>  };
>  
> -/* We anticipate the possiblity of supporting additional types of extents */
> +/* We anticipate the possibility of supporting additional types of extents */
>  enum famfs_extent_type {
>  	SIMPLE_DAX_EXTENT,
>  	INTERLEAVED_EXTENT,
> @@ -63,7 +143,7 @@ struct famfs_file_meta {
>  /*
>   * dax_devlist
>   *
> - * This is the in-memory daxdev metadata that is populated by
> + * This is the in-memory daxdev metadata that is populated by parsing
>   * the responses to GET_FMAP messages
>   */
>  struct famfs_daxdev {

-- 
~Randy


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 00/19] famfs: port into fuse
  2025-04-21  1:33 [RFC PATCH 00/19] famfs: port into fuse John Groves
                   ` (18 preceding siblings ...)
  2025-04-21  1:33 ` [RFC PATCH 19/19] famfs_fuse: (ignore) debug cruft John Groves
@ 2025-04-21 18:27 ` Darrick J. Wong
  2025-04-21 22:00   ` John Groves
  2025-04-30 14:42 ` Alireza Sanaee
  2025-05-21 22:30 ` John Groves
  21 siblings, 1 reply; 58+ messages in thread
From: Darrick J. Wong @ 2025-04-21 18:27 UTC (permalink / raw)
  To: John Groves
  Cc: Dan Williams, Miklos Szeredi, Bernd Schubert, John Groves,
	Jonathan Corbet, Vishal Verma, Dave Jiang, Matthew Wilcox,
	Jan Kara, Alexander Viro, Christian Brauner, Luis Henriques,
	Randy Dunlap, Jeff Layton, Kent Overstreet, Petr Vorel,
	Brian Foster, linux-doc, linux-kernel, nvdimm, linux-cxl,
	linux-fsdevel, Amir Goldstein, Jonathan Cameron, Stefan Hajnoczi,
	Joanne Koong, Josef Bacik, Aravind Ramesh, Ajay Joshi

On Sun, Apr 20, 2025 at 08:33:27PM -0500, John Groves wrote:
> Subject: famfs: port into fuse
> 
> This is the initial RFC for the fabric-attached memory file system (famfs)
> integration into fuse. In order to function, this requires a related patch
> to libfuse [1] and the famfs user space [2]. 
> 
> This RFC is mainly intended to socialize the approach and get feedback from
> the fuse developers and maintainers. There is some dax work that needs to
> be done before this should be merged (see the "poisoned page|folio problem"
> below).

Note that I'm only looking at the fuse and iomap aspects of this
patchset.  I don't know the devdax code at all.

> This patch set fully works with Linux 6.14 -- passing all existing famfs
> smoke and unit tests -- and I encourage existing famfs users to test it.
> 
> This is really two patch sets mashed up:
> 
> * The patches with the dev_dax_iomap: prefix fill in missing functionality for
>   devdax to host an fs-dax file system.
> * The famfs_fuse: patches add famfs into fs/fuse/. These are effectively
>   unchanged since last year.
> 
> Because this is not ready to merge yet, I have felt free to leave some debug
> prints in place because we still find them useful; those will be cleaned up
> in a subsequent revision.
> 
> Famfs Overview
> 
> Famfs exposes shared memory as a file system. Famfs consumes shared memory
> from dax devices, and provides memory-mappable files that map directly to
> the memory - no page cache involvement. Famfs differs from conventional
> file systems in fs-dax mode, in that it handles in-memory metadata in a
> sharable way (which begins with never caching dirty shared metadata).
> 
> Famfs started as a standalone file system [3,4], but the consensus at LSFMM
> 2024 [5] was that it should be ported into fuse - and this RFC is the first
> public evidence that I've been working on that.

This is very timely, as I just started looking into how I might connect
iomap to fuse so that most of the hot IO path continues to run in the
kernel, and userspace block device filesystem drivers merely supply the
file mappings to the kernel.  In other words, we kick the metadata
parsing craziness out of the kernel.

> The key performance requirement is that famfs must resolve mapping faults
> without upcalls. This is achieved by fully caching the file-to-devdax
> metadata for all active files. This is done via two fuse client/server
> message/response pairs: GET_FMAP and GET_DAXDEV.

Heh, just last week I finally got around to laying out how I think I'd
want to expose iomap through fuse to allow ->iomap_begin/->iomap_end
upcalls to a fuse server.  Note that I've done zero prototyping but
"upload all the mappings at open time" seems like a reasonable place for
me to start looking, especially for a filesystem with static mappings.

I think what I want to try to build is an in-kernel mapping cache (sort
of like the one you built), only with upcalls to the fuse server when
there is no mapping information for a given IO.  I'd probably want to
have a means for the fuse server to put new mappings into the cache, or
invalidate existing mappings.

(famfs obviously is a simple corner-case of that grandiose vision, but I
still have a long way to get to my larger vision so don't take my words
as any kind of requirement.)
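
As a rough illustration of what resolving an access from a fully cached
mapping can look like: the sketch below walks a list of simple
(devindex, offset, length) extents to translate a file offset into a
device offset, with no call out to a server. It is a user-space sketch
under stated assumptions, not code from the patch set; struct simple_ext
and resolve_simple() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical simplified extent, for illustration only */
struct simple_ext {
	uint64_t devindex; /* index of the backing dax device */
	uint64_t offset;   /* start offset within that device */
	uint64_t len;      /* extent length in bytes */
};

/*
 * Translate file_offset using a cached extent list.
 * Returns 0 and fills *devindex / *dev_offset on success,
 * or -1 if file_offset lies beyond the last extent.
 */
static inline int resolve_simple(const struct simple_ext *ext, size_t next,
				 uint64_t file_offset,
				 uint64_t *devindex, uint64_t *dev_offset)
{
	for (size_t i = 0; i < next; i++) {
		if (file_offset < ext[i].len) {
			*devindex = ext[i].devindex;
			*dev_offset = ext[i].offset + file_offset;
			return 0;
		}
		/* Skip this extent and keep looking in the next one */
		file_offset -= ext[i].len;
	}
	return -1;
}
```

A cache that can miss would replace the final "return -1" with a request
to the server for the missing range; with static, fully preloaded mappings
that path is never needed.
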

> Famfs remains the first fs-dax file system that is backed by devdax rather
> than pmem in fs-dax mode (hence the need for the dev_dax_iomap fixups).
> 
> Notes
> 
> * Once the dev_dax_iomap patches land, I suspect it may make sense for
>   virtiofs to update to use the improved interface.
> 
> * I'm currently maintaining compatibility between the famfs user space and
>   both the standalone famfs kernel file system and this new fuse
>   implementation. In the near future I'll be running performance comparisons
>   and sharing them - but there is no reason to expect significant degradation
>   with fuse, since famfs caches entire "fmaps" in the kernel to resolve

I'm curious to hear what you find, performance-wise. :)

>   faults with no upcalls. This patch has a bit too much debug turned on to
>   do that testing quite yet. A branch 

A branch ... what?

> * Two new fuse messages / responses are added: GET_FMAP and GET_DAXDEV.
> 
> * When a file is looked up in a famfs mount, the LOOKUP is followed by a
>   GET_FMAP message and response. The "fmap" is the full file-to-dax mapping,
>   allowing the fuse/famfs kernel code to handle read/write/fault without any
>   upcalls.

Huh, I'd have thought you'd wait until FUSE_OPEN to start preloading
mappings into the kernel.

> * After each GET_FMAP, the fmap is checked for extents that reference
>   previously-unknown daxdevs. Each such occurrence is handled with a
>   GET_DAXDEV message and response.

I hadn't figured out how this part would work for my silly prototype.
Just out of curiosity, does the famfs fuse server hold an open fd to the
storage, in which case the fmap(ping) could just contain the open fd?

Where are the mappings that are sent from the fuse server?  Is that
struct fuse_famfs_simple_ext?

> * Daxdevs are stored in a table (which might become an xarray at some point).
>   When entries are added to the table, we acquire exclusive access to the
>   daxdev via the fs_dax_get() call (modeled after how fs-dax handles this
>   with pmem devices). famfs provides holder_operations to devdax, providing
>   a notification path in the event of memory errors.
> 
> * If devdax notifies famfs of memory errors on a dax device, famfs currently
>   blocks all subsequent accesses to data on that device. The recovery is to
>   re-initialize the memory and file system. Famfs is memory, not storage...

Ouch. :)

> * Because famfs uses backing (devdax) devices, only privileged mounts are
>   supported.
> 
> * The famfs kernel code never accesses the memory directly - it only
>   facilitates read, write and mmap on behalf of user processes. As such,
>   the RAS of the shared memory affects applications, but not the kernel.
> 
> * Famfs has backing device(s), but they are devdax (char) rather than
>   block. Right now there is no way to tell the vfs layer that famfs has a
>   char backing device (unless we say it's block, but it's not). Currently
>   we use the standard anonymous fuse fs_type - but I'm not sure that's
>   ultimately optimal (thoughts?)

Does it work if the fusefs server adds "-o fsname=<devdax cdev>" to the
fuse_args object?  fuse2fs does that, though I don't recall if that's a
reasonable thing to do.

> The "poisoned page|folio problem"
> 
> * Background: before doing a kernel mount, the famfs user space [2] validates
>   the superblock and log. This is done via raw mmap of the primary devdax
>   device. If valid, the file system is mounted, and the superblock and log
>   get exposed through a pair of files (.meta/.superblock and .meta/.log) -
>   because we can't be using raw device mmap when a file system is mounted
>   on the device. But this exposes a devdax bug and warning...
> 
> * Pages that have been memory mapped via devdax are left in a permanently
>   problematic state. Devdax sets page|folio->mapping when a page is accessed
>   via raw devdax mmap (as famfs does before mount), but never cleans it up.
>   When the pages of the famfs superblock and log are accessed via the "meta"
>   files after mount, we see a WARN_ONCE() in dax_insert_entry(), which
>   notices that page|folio->mapping is still set. I intend to address this
>   prior to asking for the famfs patches to be merged.
> 
> * Alistair Popple's recent dax patch series [6], which has been merged
>   for 6.15, addresses some dax issues, but sadly does not fix the poisoned
>   page|folio problem - its enhanced refcount checking turns the warning into
>   an error.
> 
> * This 6.14 patch set disables the warning; a proper fix will be required for
>   famfs to work at all in 6.15. Dan W. and I are actively discussing how to do
>   this properly...
> 
> * In terms of the correct functionality of famfs, the warning can be ignored.
> 
> References
> 
> [1] - https://github.com/libfuse/libfuse/pull/1200
> [2] - https://github.com/cxl-micron-reskit/famfs

Thanks for posting links, I'll have a look there too.

--D

> [3] - https://lore.kernel.org/linux-cxl/cover.1708709155.git.john@groves.net/
> [4] - https://lore.kernel.org/linux-cxl/cover.1714409084.git.john@groves.net/
> [5] - https://lwn.net/Articles/983105/
> [6] - https://lore.kernel.org/linux-cxl/cover.8068ad144a7eea4a813670301f4d2a86a8e68ec4.1740713401.git-series.apopple@nvidia.com/
> 
> 
> John Groves (19):
>   dev_dax_iomap: Move dax_pgoff_to_phys() from device.c to bus.c
>   dev_dax_iomap: Add fs_dax_get() func to prepare dax for fs-dax usage
>   dev_dax_iomap: Save the kva from memremap
>   dev_dax_iomap: Add dax_operations for use by fs-dax on devdax
>   dev_dax_iomap: export dax_dev_get()
>   dev_dax_iomap: (ignore!) Drop poisoned page warning in fs/dax.c
>   famfs_fuse: magic.h: Add famfs magic numbers
>   famfs_fuse: Kconfig
>   famfs_fuse: Update macro s/FUSE_IS_DAX/FUSE_IS_VIRTIO_DAX/
>   famfs_fuse: Basic fuse kernel ABI enablement for famfs
>   famfs_fuse: Basic famfs mount opts
>   famfs_fuse: Plumb the GET_FMAP message/response
>   famfs_fuse: Create files with famfs fmaps
>   famfs_fuse: GET_DAXDEV message and daxdev_table
>   famfs_fuse: Plumb dax iomap and fuse read/write/mmap
>   famfs_fuse: Add holder_operations for dax notify_failure()
>   famfs_fuse: Add famfs metadata documentation
>   famfs_fuse: Add documentation
>   famfs_fuse: (ignore) debug cruft
> 
>  Documentation/filesystems/famfs.rst |  142 ++++
>  Documentation/filesystems/index.rst |    1 +
>  MAINTAINERS                         |   10 +
>  drivers/dax/Kconfig                 |    6 +
>  drivers/dax/bus.c                   |  144 +++-
>  drivers/dax/dax-private.h           |    1 +
>  drivers/dax/device.c                |   38 +-
>  drivers/dax/super.c                 |   33 +-
>  fs/dax.c                            |    1 -
>  fs/fuse/Kconfig                     |   13 +
>  fs/fuse/Makefile                    |    4 +-
>  fs/fuse/dev.c                       |   61 ++
>  fs/fuse/dir.c                       |   74 +-
>  fs/fuse/famfs.c                     | 1105 +++++++++++++++++++++++++++
>  fs/fuse/famfs_kfmap.h               |  166 ++++
>  fs/fuse/file.c                      |   27 +-
>  fs/fuse/fuse_i.h                    |   67 +-
>  fs/fuse/inode.c                     |   49 +-
>  fs/fuse/iomode.c                    |    2 +-
>  fs/namei.c                          |    1 +
>  include/linux/dax.h                 |    6 +
>  include/uapi/linux/fuse.h           |   63 ++
>  include/uapi/linux/magic.h          |    2 +
>  23 files changed, 1973 insertions(+), 43 deletions(-)
>  create mode 100644 Documentation/filesystems/famfs.rst
>  create mode 100644 fs/fuse/famfs.c
>  create mode 100644 fs/fuse/famfs_kfmap.h
> 
> 
> base-commit: 38fec10eb60d687e30c8c6b5420d86e8149f7557
> -- 
> 2.49.0
> 
> 

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 14/19] famfs_fuse: GET_DAXDEV message and daxdev_table
  2025-04-21  3:43   ` Randy Dunlap
@ 2025-04-21 20:57     ` John Groves
  0 siblings, 0 replies; 58+ messages in thread
From: John Groves @ 2025-04-21 20:57 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: Dan Williams, Miklos Szeredi, Bernd Schubert, John Groves,
	Jonathan Corbet, Vishal Verma, Dave Jiang, Matthew Wilcox,
	Jan Kara, Alexander Viro, Christian Brauner, Darrick J . Wong,
	Luis Henriques, Jeff Layton, Kent Overstreet, Petr Vorel,
	Brian Foster, linux-doc, linux-kernel, nvdimm, linux-cxl,
	linux-fsdevel, Amir Goldstein, Jonathan Cameron, Stefan Hajnoczi,
	Joanne Koong, Josef Bacik, Aravind Ramesh, Ajay Joshi

On 25/04/20 08:43PM, Randy Dunlap wrote:
> Hi,

Hi Randy - thanks for the review!

> 
> On 4/20/25 6:33 PM, John Groves wrote:
> > * The new GET_DAXDEV message/response is enabled
> > * The command is triggered by the update_daxdev_table() call, if there
> >   are any daxdevs in the subject fmap that are not represented in the
> >   daxdev_table yet.
> > 
> > Signed-off-by: John Groves <john@groves.net>
> > ---
> >  fs/fuse/famfs.c           | 281 ++++++++++++++++++++++++++++++++++++--
> >  fs/fuse/famfs_kfmap.h     |  23 ++++
> >  fs/fuse/fuse_i.h          |   4 +
> >  fs/fuse/inode.c           |   2 +
> >  fs/namei.c                |   1 +
> >  include/uapi/linux/fuse.h |  15 ++
> >  6 files changed, 316 insertions(+), 10 deletions(-)
> > 
> > diff --git a/fs/fuse/famfs.c b/fs/fuse/famfs.c
> > index e62c047d0950..2e182cb7d7c9 100644
> > --- a/fs/fuse/famfs.c
> > +++ b/fs/fuse/famfs.c
> > @@ -20,6 +20,250 @@
> >  #include "famfs_kfmap.h"
> >  #include "fuse_i.h"
> >  
> > +/*
> > + * famfs_teardown()
> > + *
> > + * Deallocate famfs metadata for a fuse_conn
> > + */
> > +void
> > +famfs_teardown(struct fuse_conn *fc)
> 
> Is this function formatting prevalent in fuse?
> It's a bit different from most Linux.
> (many locations throughout the patch set)

I'll check and clean it up if not; function names beginning in column 1 is a
"thing", but I'll normalize to nearby standards.

> 
> > +{
> > +	struct famfs_dax_devlist *devlist = fc->dax_devlist;
> > +	int i;
> > +
> > +	fc->dax_devlist = NULL;
> > +
> > +	if (!devlist)
> > +		return;
> > +
> > +	if (!devlist->devlist)
> > +		goto out;
> > +
> > +	/* Close & release all the daxdevs in our table */
> > +	for (i = 0; i < devlist->nslots; i++) {
> > +		if (devlist->devlist[i].valid && devlist->devlist[i].devp)
> > +			fs_put_dax(devlist->devlist[i].devp, fc);
> > +	}
> > +	kfree(devlist->devlist);
> > +
> > +out:
> > +	kfree(devlist);
> > +}
> > +
> > +static int
> > +famfs_verify_daxdev(const char *pathname, dev_t *devno)
> > +{
> > +	struct inode *inode;
> > +	struct path path;
> > +	int err;
> > +
> > +	if (!pathname || !*pathname)
> > +		return -EINVAL;
> > +
> > +	err = kern_path(pathname, LOOKUP_FOLLOW, &path);
> > +	if (err)
> > +		return err;
> > +
> > +	inode = d_backing_inode(path.dentry);
> > +	if (!S_ISCHR(inode->i_mode)) {
> > +		err = -EINVAL;
> > +		goto out_path_put;
> > +	}
> > +
> > +	if (!may_open_dev(&path)) { /* had to export this */
> > +		err = -EACCES;
> > +		goto out_path_put;
> > +	}
> > +
> > +	*devno = inode->i_rdev;
> > +
> > +out_path_put:
> > +	path_put(&path);
> > +	return err;
> > +}
> > +
> > +/**
> > + * famfs_fuse_get_daxdev()
> 
> Missing " - <short function description>"
> but then it's a static function, so kernel-doc is not required.
> It's up to you, but please use full kernel-doc notation if using kernel-doc.

Thank you - and sorry for being a bit sloppy on this stuff. I'm caching fixes
for all your comments along these lines into a branch for the next version of
the series.

Snipping the rest, but will address it all.

Thanks,
John


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 17/19] famfs_fuse: Add famfs metadata documentation
  2025-04-21  3:51   ` Randy Dunlap
@ 2025-04-21 21:00     ` John Groves
  0 siblings, 0 replies; 58+ messages in thread
From: John Groves @ 2025-04-21 21:00 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: Dan Williams, Miklos Szeredi, Bernd Schubert, John Groves,
	Jonathan Corbet, Vishal Verma, Dave Jiang, Matthew Wilcox,
	Jan Kara, Alexander Viro, Christian Brauner, Darrick J . Wong,
	Luis Henriques, Jeff Layton, Kent Overstreet, Petr Vorel,
	Brian Foster, linux-doc, linux-kernel, nvdimm, linux-cxl,
	linux-fsdevel, Amir Goldstein, Jonathan Cameron, Stefan Hajnoczi,
	Joanne Koong, Josef Bacik, Aravind Ramesh, Ajay Joshi

On 25/04/20 08:51PM, Randy Dunlap wrote:
> 
> 

Good edits...

Caching them into a branch for the next version.

Thank you!

John


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 13/19] famfs_fuse: Create files with famfs fmaps
  2025-04-21  1:33 ` [RFC PATCH 13/19] famfs_fuse: Create files with famfs fmaps John Groves
@ 2025-04-21 21:57   ` Darrick J. Wong
  2025-04-21 22:31     ` John Groves
  2025-04-24 13:43   ` John Groves
  1 sibling, 1 reply; 58+ messages in thread
From: Darrick J. Wong @ 2025-04-21 21:57 UTC (permalink / raw)
  To: John Groves
  Cc: Dan Williams, Miklos Szeredi, Bernd Schubert, John Groves,
	Jonathan Corbet, Vishal Verma, Dave Jiang, Matthew Wilcox,
	Jan Kara, Alexander Viro, Christian Brauner, Luis Henriques,
	Randy Dunlap, Jeff Layton, Kent Overstreet, Petr Vorel,
	Brian Foster, linux-doc, linux-kernel, nvdimm, linux-cxl,
	linux-fsdevel, Amir Goldstein, Jonathan Cameron, Stefan Hajnoczi,
	Joanne Koong, Josef Bacik, Aravind Ramesh, Ajay Joshi

On Sun, Apr 20, 2025 at 08:33:40PM -0500, John Groves wrote:
> On completion of GET_FMAP message/response, setup the full famfs
> metadata such that it's possible to handle read/write/mmap directly to
> dax. Note that the devdax_iomap plumbing is not in yet...
> 
> Update MAINTAINERS for the new files.
> 
> Signed-off-by: John Groves <john@groves.net>
> ---
>  MAINTAINERS               |   9 +
>  fs/fuse/Makefile          |   2 +-
>  fs/fuse/dir.c             |   3 +
>  fs/fuse/famfs.c           | 344 ++++++++++++++++++++++++++++++++++++++
>  fs/fuse/famfs_kfmap.h     |  63 +++++++
>  fs/fuse/fuse_i.h          |  16 +-
>  fs/fuse/inode.c           |   2 +-
>  include/uapi/linux/fuse.h |  42 +++++
>  8 files changed, 477 insertions(+), 4 deletions(-)
>  create mode 100644 fs/fuse/famfs.c
>  create mode 100644 fs/fuse/famfs_kfmap.h
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 00e94bec401e..2a5a7e0e8b28 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -8808,6 +8808,15 @@ F:	Documentation/networking/failover.rst
>  F:	include/net/failover.h
>  F:	net/core/failover.c
>  
> +FAMFS
> +M:	John Groves <jgroves@micron.com>
> +M:	John Groves <John@Groves.net>
> +L:	linux-cxl@vger.kernel.org
> +L:	linux-fsdevel@vger.kernel.org
> +S:	Supported
> +F:	fs/fuse/famfs.c
> +F:	fs/fuse/famfs_kfmap.h
> +
>  FANOTIFY
>  M:	Jan Kara <jack@suse.cz>
>  R:	Amir Goldstein <amir73il@gmail.com>
> diff --git a/fs/fuse/Makefile b/fs/fuse/Makefile
> index 3f0f312a31c1..65a12975d734 100644
> --- a/fs/fuse/Makefile
> +++ b/fs/fuse/Makefile
> @@ -16,5 +16,5 @@ fuse-$(CONFIG_FUSE_DAX) += dax.o
>  fuse-$(CONFIG_FUSE_PASSTHROUGH) += passthrough.o
>  fuse-$(CONFIG_SYSCTL) += sysctl.o
>  fuse-$(CONFIG_FUSE_IO_URING) += dev_uring.o
> -
> +fuse-$(CONFIG_FUSE_FAMFS_DAX) += famfs.o
>  virtiofs-y := virtio_fs.o
> diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
> index ae135c55b9f6..b28a1e912d6b 100644
> --- a/fs/fuse/dir.c
> +++ b/fs/fuse/dir.c
> @@ -405,6 +405,9 @@ fuse_get_fmap(struct fuse_mount *fm, struct inode *inode, u64 nodeid)
>  	fmap_size = args.out_args[0].size;
>  	pr_notice("%s: nodei=%lld fmap_size=%ld\n", __func__, nodeid, fmap_size);
>  
> +	/* Convert fmap into in-memory format and hang from inode */
> +	famfs_file_init_dax(fm, inode, fmap_buf, fmap_size);
> +
>  	return 0;
>  }
>  #endif
> diff --git a/fs/fuse/famfs.c b/fs/fuse/famfs.c
> new file mode 100644
> index 000000000000..e62c047d0950
> --- /dev/null
> +++ b/fs/fuse/famfs.c
> @@ -0,0 +1,344 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * famfs - dax file system for shared fabric-attached memory
> + *
> + * Copyright 2023-2025 Micron Technology, Inc.
> + *
> + * This file system, originally based on ramfs and the dax support from xfs,
> + * is intended to allow multiple host systems to mount a common file system
> + * view of dax files that map to shared memory.
> + */
> +
> +#include <linux/fs.h>
> +#include <linux/mm.h>
> +#include <linux/dax.h>
> +#include <linux/iomap.h>
> +#include <linux/path.h>
> +#include <linux/namei.h>
> +#include <linux/string.h>
> +
> +#include "famfs_kfmap.h"
> +#include "fuse_i.h"
> +
> +
> +void
> +__famfs_meta_free(void *famfs_meta)
> +{
> +	struct famfs_file_meta *fmap = famfs_meta;
> +
> +	if (!fmap)
> +		return;
> +
> +	if (fmap) {
> +		switch (fmap->fm_extent_type) {
> +		case SIMPLE_DAX_EXTENT:
> +			kfree(fmap->se);
> +			break;
> +		case INTERLEAVED_EXTENT:

Are interleaved extents not DAX extents?  Why does one constant refer to
DAX but the other does not?

> +			if (fmap->ie)
> +				kfree(fmap->ie->ie_strips);
> +
> +			kfree(fmap->ie);
> +			break;
> +		default:
> +			pr_err("%s: invalid fmap type\n", __func__);
> +			break;
> +		}
> +	}
> +	kfree(fmap);
> +}
> +
> +static int
> +famfs_check_ext_alignment(struct famfs_meta_simple_ext *se)
> +{
> +	int errs = 0;
> +
> +	if (se->dev_index != 0)
> +		errs++;
> +
> +	/* TODO: pass in alignment so we can support the other page sizes */
> +	if (!IS_ALIGNED(se->ext_offset, PMD_SIZE))
> +		errs++;
> +
> +	if (!IS_ALIGNED(se->ext_len, PMD_SIZE))
> +		errs++;
> +
> +	return errs;
> +}
> +
> +/**
> + * famfs_meta_alloc() - Allocate famfs file metadata
> + * @metap:       Pointer to an mcache_map_meta pointer
> + * @ext_count:  The number of extents needed
> + */
> +static int
> +famfs_meta_alloc_v3(

Err, what's with "v3"?  This is a new fs, right?

> +	void *fmap_buf,
> +	size_t fmap_buf_size,
> +	struct famfs_file_meta **metap)
> +{
> +	struct famfs_file_meta *meta = NULL;
> +	struct fuse_famfs_fmap_header *fmh;
> +	size_t extent_total = 0;
> +	size_t next_offset = 0;
> +	int errs = 0;
> +	int i, j;
> +	int rc;
> +
> +	fmh = (struct fuse_famfs_fmap_header *)fmap_buf;
> +
> +	/* Move past fmh in fmap_buf */
> +	next_offset += sizeof(*fmh);
> +	if (next_offset > fmap_buf_size) {
> +		pr_err("%s:%d: fmap_buf underflow offset/size %ld/%ld\n",
> +		       __func__, __LINE__, next_offset, fmap_buf_size);
> +		rc = -EINVAL;
> +		goto errout;
> +	}
> +
> +	if (fmh->nextents < 1) {
> +		pr_err("%s: nextents %d < 1\n", __func__, fmh->nextents);
> +		rc = -EINVAL;
> +		goto errout;
> +	}
> +
> +	if (fmh->nextents > FUSE_FAMFS_MAX_EXTENTS) {
> +		pr_err("%s: nextents %d > max (%d) 1\n",
> +		       __func__, fmh->nextents, FUSE_FAMFS_MAX_EXTENTS);
> +		rc = -E2BIG;
> +		goto errout;
> +	}
> +
> +	meta = kzalloc(sizeof(*meta), GFP_KERNEL);
> +	if (!meta)
> +		return -ENOMEM;
> +	meta->error = false;
> +
> +	meta->file_type = fmh->file_type;
> +	meta->file_size = fmh->file_size;
> +	meta->fm_extent_type = fmh->ext_type;
> +
> +	switch (fmh->ext_type) {
> +	case FUSE_FAMFS_EXT_SIMPLE: {
> +		struct fuse_famfs_simple_ext *se_in;
> +
> +		se_in = (struct fuse_famfs_simple_ext *)(fmap_buf + next_offset);
> +
> +		/* Move past simple extents */
> +		next_offset += fmh->nextents * sizeof(*se_in);
> +		if (next_offset > fmap_buf_size) {
> +			pr_err("%s:%d: fmap_buf underflow offset/size %ld/%ld\n",
> +			       __func__, __LINE__, next_offset, fmap_buf_size);
> +			rc = -EINVAL;
> +			goto errout;
> +		}
> +
> +		meta->fm_nextents = fmh->nextents;
> +
> +		meta->se = kcalloc(meta->fm_nextents, sizeof(*(meta->se)),
> +				   GFP_KERNEL);
> +		if (!meta->se) {
> +			rc = -ENOMEM;
> +			goto errout;
> +		}
> +
> +		if ((meta->fm_nextents > FUSE_FAMFS_MAX_EXTENTS) ||

FUSE_FAMFS_MAX_EXTENTS is 2?  I gather that simple files in famfs refer
to contiguous regions, but why two mappings?

> +		    (meta->fm_nextents < 1)) {
> +			rc = -EINVAL;
> +			goto errout;
> +		}
> +
> +		for (i = 0; i < fmh->nextents; i++) {
> +			meta->se[i].dev_index  = se_in[i].se_devindex;
> +			meta->se[i].ext_offset = se_in[i].se_offset;
> +			meta->se[i].ext_len    = se_in[i].se_len;
> +
> +			/* Record bitmap of referenced daxdev indices */
> +			meta->dev_bitmap |= (1 << meta->se[i].dev_index);
> +
> +			errs += famfs_check_ext_alignment(&meta->se[i]);

Shouldn't you bail out at the first bad mapping?
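
To make the suggestion concrete, here is a minimal userspace sketch of
fail-fast validation. It is not the famfs code: struct simple_ext,
EXT_ALIGN, ext_is_valid() and validate_extents() are hypothetical
stand-ins, and the 2 MiB alignment is an assumption (PMD_SIZE on
x86-64). The checks mirror the three tests in
famfs_check_ext_alignment() quoted above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct famfs_meta_simple_ext. */
struct simple_ext {
	uint64_t dev_index;
	uint64_t ext_offset;
	uint64_t ext_len;
};

/* Assumed alignment: 2 MiB, the usual PMD_SIZE on x86-64. */
#define EXT_ALIGN (2ULL << 20)

/* Same three checks as the quoted helper, folded into one predicate. */
bool ext_is_valid(const struct simple_ext *se)
{
	return se->dev_index == 0 &&
	       (se->ext_offset % EXT_ALIGN) == 0 &&
	       (se->ext_len % EXT_ALIGN) == 0;
}

/*
 * Walk the extents in order and stop at the first bad one.
 * Returns 0 if all are valid; otherwise returns -EINVAL and stores
 * the index of the offending extent in *bad.
 */
int validate_extents(const struct simple_ext *ext, size_t n, size_t *bad)
{
	assert(bad != NULL);

	for (size_t i = 0; i < n; i++) {
		if (!ext_is_valid(&ext[i])) {
			*bad = i;
			return -EINVAL;
		}
	}
	return 0;
}
```

Returning the index of the first bad extent would also let the error
message name the mapping that failed, instead of reporting a count.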

> +			extent_total += meta->se[i].ext_len;
> +		}

I took a look at what's already in uapi/linux/fuse.h and saw that
there are two operations -- FUSE_{SETUP,REMOVE}MAPPING.  Those two fuse
upcalls seem to manage an interval tree in struct fuse_inode_dax, which
is used to feed fuse_iomap_begin.  Can you reuse this existing uapi
instead of defining a new one that's already pretty similar?

I'm wondering why create all this new code when fuse/dax.c already seems
to have the ability to cache mappings and pass them to dax_iomap_rw
without restrictions on the number of mappings and all that?

Maybe you're trying to avoid runtime upcalls, but then I would think
that you could teach the fuse/dax.c mapping code to pin the mappings
if there aren't that many of them in the first place, rather than
reinventing mappings?

It occurred to me (perhaps naively) that maybe you created FUSE_GETFMAP
because of this interleaving thing because it's probably faster to
upload a template for that than it would be to upload a large number of
mappings.  But I don't really grok why the interleaving exists, though I
guess it's for memory controllers interleaving memory devices or
something for better throughput?
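
For reference, this is how I would expect a striped layout to resolve
a file offset, written as a userspace sketch. It is generic
round-robin striping arithmetic, not taken from the patch: struct
strip_addr and interleave_resolve() are hypothetical names, and I am
assuming that chunk_size and nstrips play the roles of ie_chunk_size
and ie_nstrips.

```c
#include <assert.h>
#include <stdint.h>

/* Result of resolving a file offset in a striped layout. */
struct strip_addr {
	uint64_t strip;   /* index of the strip that holds the byte */
	uint64_t offset;  /* byte offset within that strip */
};

/*
 * Round-robin striping: consecutive chunks of chunk_size bytes are
 * placed on strips 0, 1, ..., nstrips - 1, then wrap around.
 */
struct strip_addr interleave_resolve(uint64_t file_off, uint64_t chunk_size,
				     uint64_t nstrips)
{
	struct strip_addr a;
	uint64_t chunk, stripe;

	assert(chunk_size > 0 && nstrips > 0);

	chunk = file_off / chunk_size;   /* chunk number within the file */
	stripe = chunk / nstrips;        /* full rows of chunks before it */

	a.strip = chunk % nstrips;
	a.offset = stripe * chunk_size + file_off % chunk_size;
	return a;
}
```

If that is the scheme, one small template (chunk size plus a strip
list) describes an arbitrarily large file, which would explain
uploading it in one message rather than as a long list of mappings.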

I also see that famfs_meta_to_dax_offset does a linear walk of the
mapping array, which does not seem like it will be efficient when
there are many mappings.
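
A rough sketch of the alternative I have in mind, in userspace C.
Nothing here comes from the patch: extent_lookup() and the ends[]
array are hypothetical, and the sketch assumes the extents are kept in
file order with their cumulative end offsets precomputed when the fmap
is parsed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * ends[i] is the file offset one past the last byte of extent i, so
 * the array is strictly increasing. Returns the index of the extent
 * containing file_off, or -1 if file_off is beyond the last extent.
 */
long extent_lookup(const uint64_t *ends, size_t n, uint64_t file_off)
{
	size_t lo = 0, hi = n;

	assert(ends != NULL || n == 0);

	/* Binary search for the first extent whose end is past file_off. */
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (ends[mid] <= file_off)
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo < n ? (long)lo : -1;
}
```

With FUSE_FAMFS_MAX_EXTENTS at 2 a linear walk costs nothing, so this
only matters if the extent limit is raised later.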

> +		break;
> +	}
> +
> +	case FUSE_FAMFS_EXT_INTERLEAVE: {
> +		s64 size_remainder = meta->file_size;
> +		struct fuse_famfs_iext *ie_in;
> +		int niext = fmh->nextents;
> +
> +		meta->fm_niext = niext;
> +
> +		/* Allocate interleaved extent */
> +		meta->ie = kcalloc(niext, sizeof(*(meta->ie)), GFP_KERNEL);
> +		if (!meta->ie) {
> +			rc = -ENOMEM;
> +			goto errout;
> +		}
> +
> +		/*
> +		 * Each interleaved extent has a simple extent list of strips.
> +		 * Outer loop is over separate interleaved extents

Hmm, so there's no checking on fmh->nextents here, so I guess we can
have as many sets of interleaved extents as we want?  Each with up to 16
simple mappings?
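
In the hunk below, the nstrips range check runs after next_offset has
already been advanced by nstrips * sizeof(*sie_in). One way to make
the bound explicit is to validate the count before it is used for any
arithmetic. A minimal userspace sketch follows; consume_records() and
the MAX_* constants are hypothetical, with MAX_STRIPS mirroring
FUSE_FAMFS_MAX_STRIPS and MAX_IEXT assumed to mirror
FUSE_FAMFS_MAX_EXTENTS.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_IEXT   2   /* assumed cap, mirroring FUSE_FAMFS_MAX_EXTENTS */
#define MAX_STRIPS 16  /* mirrors FUSE_FAMFS_MAX_STRIPS */

/*
 * Advance *offset past count records of rec_size bytes each.
 * Rejects counts outside [1, max] and any range that would run past
 * buf_size. On failure *offset is left unchanged.
 */
int consume_records(size_t *offset, size_t buf_size, uint64_t count,
		    uint64_t max, size_t rec_size)
{
	assert(offset != NULL && rec_size > 0);

	if (count < 1 || count > max)
		return -EINVAL;
	if (*offset > buf_size)
		return -EINVAL;
	/* Compare against the space remaining, so no multiply can overflow. */
	if (count > (buf_size - *offset) / rec_size)
		return -EINVAL;

	*offset += (size_t)count * rec_size;
	return 0;
}
```

The same helper could cover the simple-extent array, the interleaved
extent headers and the strip arrays, each with its own cap.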

--D

> +		 */
> +		for (i = 0; i < niext; i++) {
> +			u64 nstrips;
> +			struct fuse_famfs_simple_ext *sie_in;
> +
> +			/* ie_in = one interleaved extent in fmap_buf */
> +			ie_in = (struct fuse_famfs_iext *)
> +				(fmap_buf + next_offset);
> +
> +			/* Move past one interleaved extent header in fmap_buf */
> +			next_offset += sizeof(*ie_in);
> +			if (next_offset > fmap_buf_size) {
> +				pr_err("%s:%d: fmap_buf underflow offset/size %ld/%ld\n",
> +				       __func__, __LINE__, next_offset, fmap_buf_size);
> +				rc = -EINVAL;
> +				goto errout;
> +			}
> +
> +			nstrips = ie_in->ie_nstrips;
> +			meta->ie[i].fie_chunk_size = ie_in->ie_chunk_size;
> +			meta->ie[i].fie_nstrips    = ie_in->ie_nstrips;
> +			meta->ie[i].fie_nbytes     = ie_in->ie_nbytes;
> +
> +			if (!meta->ie[i].fie_nbytes) {
> +				pr_err("%s: zero-length interleave!\n",
> +				       __func__);
> +				rc = -EINVAL;
> +				goto errout;
> +			}
> +
> +			/* sie_in = the strip extents in fmap_buf */
> +			sie_in = (struct fuse_famfs_simple_ext *)
> +				(fmap_buf + next_offset);
> +
> +			/* Move past strip extents in fmap_buf */
> +			next_offset += nstrips * sizeof(*sie_in);
> +			if (next_offset > fmap_buf_size) {
> +				pr_err("%s:%d: fmap_buf underflow offset/size %ld/%ld\n",
> +				       __func__, __LINE__, next_offset, fmap_buf_size);
> +				rc = -EINVAL;
> +				goto errout;
> +			}
> +
> +			if ((nstrips > FUSE_FAMFS_MAX_STRIPS) || (nstrips < 1)) {
> +				pr_err("%s: invalid nstrips=%lld (max=%d)\n",
> +				       __func__, nstrips,
> +				       FUSE_FAMFS_MAX_STRIPS);
> +				errs++;
> +			}
> +
> +			/* Allocate strip extent array */
> +			meta->ie[i].ie_strips = kcalloc(ie_in->ie_nstrips,
> +					sizeof(meta->ie[i].ie_strips[0]),
> +							GFP_KERNEL);
> +			if (!meta->ie[i].ie_strips) {
> +				rc = -ENOMEM;
> +				goto errout;
> +			}
> +
> +			/* Inner loop is over strips */
> +			for (j = 0; j < nstrips; j++) {
> +				struct famfs_meta_simple_ext *strips_out;
> +				u64 devindex = sie_in[j].se_devindex;
> +				u64 offset   = sie_in[j].se_offset;
> +				u64 len      = sie_in[j].se_len;
> +
> +				strips_out = meta->ie[i].ie_strips;
> +				strips_out[j].dev_index  = devindex;
> +				strips_out[j].ext_offset = offset;
> +				strips_out[j].ext_len    = len;
> +
> +				/* Record bitmap of referenced daxdev indices */
> +				meta->dev_bitmap |= (1 << devindex);
> +
> +				extent_total += len;
> +				errs += famfs_check_ext_alignment(&strips_out[j]);
> +				size_remainder -= len;
> +			}
> +		}
> +
> +		if (size_remainder > 0) {
> +			/* Sum of interleaved extent sizes is less than file size! */
> +			pr_err("%s: size_remainder %lld (0x%llx)\n",
> +			       __func__, size_remainder, size_remainder);
> +			rc = -EINVAL;
> +			goto errout;
> +		}
> +		break;
> +	}
> +
> +	default:
> +		pr_err("%s: invalid ext_type %d\n", __func__, fmh->ext_type);
> +		rc = -EINVAL;
> +		goto errout;
> +	}
> +
> +	if (errs > 0) {
> +		pr_err("%s: %d alignment errors found\n", __func__, errs);
> +		rc = -EINVAL;
> +		goto errout;
> +	}
> +
> +	/* More sanity checks */
> +	if (extent_total < meta->file_size) {
> +		pr_err("%s: file size %ld larger than map size %ld\n",
> +		       __func__, meta->file_size, extent_total);
> +		rc = -EINVAL;
> +		goto errout;
> +	}
> +
> +	*metap = meta;
> +
> +	return 0;
> +errout:
> +	__famfs_meta_free(meta);
> +	return rc;
> +}
> +
> +int
> +famfs_file_init_dax(
> +	struct fuse_mount *fm,
> +	struct inode *inode,
> +	void *fmap_buf,
> +	size_t fmap_size)
> +{
> +	struct fuse_inode *fi = get_fuse_inode(inode);
> +	struct famfs_file_meta *meta = NULL;
> +	int rc;
> +
> +	if (fi->famfs_meta) {
> +		pr_notice("%s: i_no=%ld fmap_size=%ld ALREADY INITIALIZED\n",
> +			  __func__,
> +			  inode->i_ino, fmap_size);
> +		return -EEXIST;
> +	}
> +
> +	rc = famfs_meta_alloc_v3(fmap_buf, fmap_size, &meta);
> +	if (rc)
> +		goto errout;
> +
> +	/* Publish the famfs metadata on fi->famfs_meta */
> +	inode_lock(inode);
> +	if (fi->famfs_meta) {
> +		rc = -EEXIST; /* file already has famfs metadata */
> +	} else {
> +		if (famfs_meta_set(fi, meta) != NULL) {
> +			pr_err("%s: file already had metadata\n", __func__);
> +			rc = -EALREADY;
> +			goto errout;
> +		}
> +		i_size_write(inode, meta->file_size);
> +		inode->i_flags |= S_DAX;
> +	}
> +	inode_unlock(inode);
> +
> + errout:
> +	if (rc)
> +		__famfs_meta_free(meta);
> +
> +	return rc;
> +}
> +
> diff --git a/fs/fuse/famfs_kfmap.h b/fs/fuse/famfs_kfmap.h
> new file mode 100644
> index 000000000000..ce785d76719c
> --- /dev/null
> +++ b/fs/fuse/famfs_kfmap.h
> @@ -0,0 +1,63 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * famfs - dax file system for shared fabric-attached memory
> + *
> + * Copyright 2023-2025 Micron Technology, Inc.
> + */
> +#ifndef FAMFS_KFMAP_H
> +#define FAMFS_KFMAP_H
> +
> +/*
> + * These structures are the in-memory metadata format for famfs files. Metadata
> + * retrieved via the GET_FMAP response is converted to this format for use in
> + * resolving file mapping faults.
> + */
> +
> +enum famfs_file_type {
> +	FAMFS_REG,
> +	FAMFS_SUPERBLOCK,
> +	FAMFS_LOG,
> +};
> +
> +/* We anticipate the possibility of supporting additional types of extents */
> +enum famfs_extent_type {
> +	SIMPLE_DAX_EXTENT,
> +	INTERLEAVED_EXTENT,
> +	INVALID_EXTENT_TYPE,
> +};
> +
> +struct famfs_meta_simple_ext {
> +	u64 dev_index;
> +	u64 ext_offset;
> +	u64 ext_len;
> +};
> +
> +struct famfs_meta_interleaved_ext {
> +	u64 fie_nstrips;
> +	u64 fie_chunk_size;
> +	u64 fie_nbytes;
> +	struct famfs_meta_simple_ext *ie_strips;
> +};
> +
> +/*
> + * Each famfs dax file has this hanging from its fuse_inode->famfs_meta
> + */
> +struct famfs_file_meta {
> +	bool                   error;
> +	enum famfs_file_type   file_type;
> +	size_t                 file_size;
> +	enum famfs_extent_type fm_extent_type;
> +	u64 dev_bitmap; /* bitmap of referenced daxdevs by index */
> +	union { /* This will make code a bit more readable */
> +		struct {
> +			size_t         fm_nextents;
> +			struct famfs_meta_simple_ext  *se;
> +		};
> +		struct {
> +			size_t         fm_niext;
> +			struct famfs_meta_interleaved_ext *ie;
> +		};
> +	};
> +};
> +
> +#endif /* FAMFS_KFMAP_H */
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index 437177c2f092..d8e0ac784224 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -1557,11 +1557,18 @@ extern void fuse_sysctl_unregister(void);
>  #endif /* CONFIG_SYSCTL */
>  
>  /* famfs.c */
> +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> +int famfs_file_init_dax(struct fuse_mount *fm,
> +			     struct inode *inode, void *fmap_buf,
> +			     size_t fmap_size);
> +void __famfs_meta_free(void *map);
> +#endif
> +
>  static inline struct fuse_backing *famfs_meta_set(struct fuse_inode *fi,
>  						       void *meta)
>  {
>  #if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> -	return xchg(&fi->famfs_meta, meta);
> +	return cmpxchg(&fi->famfs_meta, NULL, meta);
>  #else
>  	return NULL;
>  #endif
> @@ -1569,7 +1576,12 @@ static inline struct fuse_backing *famfs_meta_set(struct fuse_inode *fi,
>  
>  static inline void famfs_meta_free(struct fuse_inode *fi)
>  {
> -	/* Stub wil be connected in a subsequent commit */
> +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> +	if (fi->famfs_meta != NULL) {
> +		__famfs_meta_free(fi->famfs_meta);
> +		famfs_meta_set(fi, NULL);
> +	}
> +#endif
>  }
>  
>  static inline int fuse_file_famfs(struct fuse_inode *fi)
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index 848c8818e6f7..e86bf330117f 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -118,7 +118,7 @@ static struct inode *fuse_alloc_inode(struct super_block *sb)
>  		fuse_inode_backing_set(fi, NULL);
>  
>  	if (IS_ENABLED(CONFIG_FUSE_FAMFS_DAX))
> -		famfs_meta_set(fi, NULL);
> +		fi->famfs_meta = NULL; /* XXX new inodes currently not zeroed; why not? */
>  
>  	return &fi->inode;
>  
> diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> index d85fb692cf3b..0f6ff1ffb23d 100644
> --- a/include/uapi/linux/fuse.h
> +++ b/include/uapi/linux/fuse.h
> @@ -1286,4 +1286,46 @@ struct fuse_uring_cmd_req {
>  	uint8_t padding[6];
>  };
>  
> +/* Famfs fmap message components */
> +
> +#define FAMFS_FMAP_VERSION 1
> +
> +#define FUSE_FAMFS_MAX_EXTENTS 2
> +#define FUSE_FAMFS_MAX_STRIPS 16
> +
> +enum fuse_famfs_file_type {
> +	FUSE_FAMFS_FILE_REG,
> +	FUSE_FAMFS_FILE_SUPERBLOCK,
> +	FUSE_FAMFS_FILE_LOG,
> +};
> +
> +enum famfs_ext_type {
> +	FUSE_FAMFS_EXT_SIMPLE = 0,
> +	FUSE_FAMFS_EXT_INTERLEAVE = 1,
> +};
> +
> +struct fuse_famfs_simple_ext {
> +	uint32_t se_devindex;
> +	uint32_t reserved;
> +	uint64_t se_offset;
> +	uint64_t se_len;
> +};
> +
> +struct fuse_famfs_iext { /* Interleaved extent */
> +	uint32_t ie_nstrips;
> +	uint32_t ie_chunk_size;
> +	uint64_t ie_nbytes; /* Total bytes for this interleaved_ext; sum of strips may be more */
> +	uint64_t reserved;
> +};
> +
> +struct fuse_famfs_fmap_header {
> +	uint8_t file_type; /* enum famfs_file_type */
> +	uint8_t reserved;
> +	uint16_t fmap_version;
> +	uint32_t ext_type; /* enum famfs_log_ext_type */
> +	uint32_t nextents;
> +	uint32_t reserved0;
> +	uint64_t file_size;
> +	uint64_t reserved1;
> +};
>  #endif /* _LINUX_FUSE_H */
> -- 
> 2.49.0
> 
> 

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 00/19] famfs: port into fuse
  2025-04-21 18:27 ` [RFC PATCH 00/19] famfs: port into fuse Darrick J. Wong
@ 2025-04-21 22:00   ` John Groves
  2025-04-22  1:25     ` Darrick J. Wong
  0 siblings, 1 reply; 58+ messages in thread
From: John Groves @ 2025-04-21 22:00 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Dan Williams, Miklos Szeredi, Bernd Schubert, John Groves,
	Jonathan Corbet, Vishal Verma, Dave Jiang, Matthew Wilcox,
	Jan Kara, Alexander Viro, Christian Brauner, Luis Henriques,
	Randy Dunlap, Jeff Layton, Kent Overstreet, Petr Vorel,
	Brian Foster, linux-doc, linux-kernel, nvdimm, linux-cxl,
	linux-fsdevel, Amir Goldstein, Jonathan Cameron, Stefan Hajnoczi,
	Joanne Koong, Josef Bacik, Aravind Ramesh, Ajay Joshi

On 25/04/21 11:27AM, Darrick J. Wong wrote:
> On Sun, Apr 20, 2025 at 08:33:27PM -0500, John Groves wrote:
> > Subject: famfs: port into fuse
> > 
> > This is the initial RFC for the fabric-attached memory file system (famfs)
> > integration into fuse. In order to function, this requires a related patch
> > to libfuse [1] and the famfs user space [2]. 
> > 
> > This RFC is mainly intended to socialize the approach and get feedback from
> > the fuse developers and maintainers. There is some dax work that needs to
> > be done before this should be merged (see the "poisoned page|folio problem"
> > below).
> 
> Note that I'm only looking at the fuse and iomap aspects of this
> patchset.  I don't know the devdax code at all.
> 
> > This patch set fully works with Linux 6.14 -- passing all existing famfs
> > smoke and unit tests -- and I encourage existing famfs users to test it.
> > 
> > This is really two patch sets mashed up:
> > 
> > * The patches with the dev_dax_iomap: prefix fill in missing functionality for
> >   devdax to host an fs-dax file system.
> > * The famfs_fuse: patches add famfs into fs/fuse/. These are effectively
> >   unchanged since last year.
> > 
> > Because this is not ready to merge yet, I have felt free to leave some debug
> > prints in place because we still find them useful; those will be cleaned up
> > in a subsequent revision.
> > 
> > Famfs Overview
> > 
> > Famfs exposes shared memory as a file system. Famfs consumes shared memory
> > from dax devices, and provides memory-mappable files that map directly to
> > the memory - no page cache involvement. Famfs differs from conventional
> > file systems in fs-dax mode, in that it handles in-memory metadata in a
> > sharable way (which begins with never caching dirty shared metadata).
> > 
> > Famfs started as a standalone file system [3,4], but the consensus at LSFMM
> > 2024 [5] was that it should be ported into fuse - and this RFC is the first
> > public evidence that I've been working on that.
> 
> This is very timely, as I just started looking into how I might connect
> iomap to fuse so that most of the hot IO path continues to run in the
> kernel, and userspace block device filesystem drivers merely supply the
> file mappings to the kernel.  In other words, we kick the metadata
> parsing craziness out of the kernel.

Coool!

> 
> > The key performance requirement is that famfs must resolve mapping faults
> > without upcalls. This is achieved by fully caching the file-to-devdax
> > metadata for all active files. This is done via two fuse client/server
> > message/response pairs: GET_FMAP and GET_DAXDEV.
> 
> Heh, just last week I finally got around to laying out how I think I'd
> want to expose iomap through fuse to allow ->iomap_begin/->iomap_end
> upcalls to a fuse server.  Note that I've done zero prototyping but
> "upload all the mappings at open time" seems like a reasonable place for
> me to start looking, especially for a filesystem with static mappings.
> 
> I think what I want to try to build is an in-kernel mapping cache (sort
> of like the one you built), only with upcalls to the fuse server when
> there is no mapping information for a given IO.  I'd probably want to
> have a means for the fuse server to put new mappings into the cache, or
> invalidate existing mappings.
> 
> (famfs obviously is a simple corner-case of that grandiose vision, but I
> still have a long way to get to my larger vision so don't take my words
> as any kind of requirement.)
> 
> > Famfs remains the first fs-dax file system that is backed by devdax rather
> > than pmem in fs-dax mode (hence the need for the dev_dax_iomap fixups).
> > 
> > Notes
> > 
> > * Once the dev_dax_iomap patches land, I suspect it may make sense for
> >   virtiofs to update to use the improved interface.
> > 
> > * I'm currently maintaining compatibility between the famfs user space and
> >   both the standalone famfs kernel file system and this new fuse
> >   implementation. In the near future I'll be running performance comparisons
> >   and sharing them - but there is no reason to expect significant degradation
> >   with fuse, since famfs caches entire "fmaps" in the kernel to resolve
> 
> I'm curious to hear what you find, performance-wise. :)
> 
> >   faults with no upcalls. This patch has a bit too much debug turned on to
> > >   do that testing quite yet. A branch 
> 
> A branch ... what?

I trail off sometimes... ;)

> 
> > * Two new fuse messages / responses are added: GET_FMAP and GET_DAXDEV.
> > 
> > * When a file is looked up in a famfs mount, the LOOKUP is followed by a
> >   GET_FMAP message and response. The "fmap" is the full file-to-dax mapping,
> >   allowing the fuse/famfs kernel code to handle read/write/fault without any
> >   upcalls.
> 
> Huh, I'd have thought you'd wait until FUSE_OPEN to start preloading
> mappings into the kernel.

That may be a better approach. Miklos and I discussed it during LPC last year,
and thought both were options. Having implemented it at LOOKUP time, I think
moving it to open might avoid my READDIRPLUS problem: RDP is a mashup of
READDIR and LOOKUP, and therefore might need to carry the GET_FMAP payload
too. Moving GET_FMAP to open time would break that connection in a good way,
I think.

> 
> > * After each GET_FMAP, the fmap is checked for extents that reference
> > >   previously-unknown daxdevs. Each such occurrence is handled with a
> >   GET_DAXDEV message and response.
> 
> I hadn't figured out how this part would work for my silly prototype.
> Just out of curiosity, does the famfs fuse server hold an open fd to the
> storage, in which case the fmap(ping) could just contain the open fd?
> 
> Where are the mappings that are sent from the fuse server?  Is that
> struct fuse_famfs_simple_ext?

See patch 17 or fs/fuse/famfs_kfmap.h for the fmap metadata explanation. 
Famfs currently supports either simple extents (daxdev, offset, length) or 
interleaved ones (which describe each "strip" as a simple extent). I think 
the explanation in famfs_kfmap.h is pretty clear.

A key question is whether any additional basic metadata abstractions would
be needed - because the kernel needs to understand the full scheme.

With disaggregated memory, the interleave approach is nice because it gets
aggregated performance and resolving a file offset to daxdev offset is order
1.
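
For illustration, here is a minimal user-space sketch of why that resolution
is constant-time. It assumes a fixed chunk size and round-robin placement
across the strips; the names (strip, interleaved_ext, dax_loc,
resolve_interleaved) are hypothetical and are not the definitions in
famfs_kfmap.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical types for illustration only; not the famfs_kfmap.h structs. */
struct strip {
	uint32_t dev_index;	/* which daxdev backs this strip */
	uint64_t dev_offset;	/* where the strip starts on that daxdev */
};

struct interleaved_ext {
	uint64_t chunk_size;	/* assumed: bytes placed on one strip per turn */
	uint64_t nstrips;	/* number of strips in the extent */
	const struct strip *strips;
};

struct dax_loc {
	uint32_t dev_index;
	uint64_t dev_offset;
};

/* Resolve a file offset with a fixed amount of arithmetic; no list walk. */
struct dax_loc
resolve_interleaved(const struct interleaved_ext *ie, uint64_t file_offset)
{
	struct dax_loc loc;
	uint64_t chunk_num, chunk_off, strip_num, stripe_num;

	assert(ie->chunk_size > 0 && ie->nstrips > 0);

	chunk_num  = file_offset / ie->chunk_size;	/* global chunk index */
	chunk_off  = file_offset % ie->chunk_size;	/* offset inside chunk */
	strip_num  = chunk_num % ie->nstrips;		/* strip holding it */
	stripe_num = chunk_num / ie->nstrips;		/* full rounds before it */

	loc.dev_index  = ie->strips[strip_num].dev_index;
	loc.dev_offset = ie->strips[strip_num].dev_offset +
			 stripe_num * ie->chunk_size + chunk_off;
	return loc;
}
```

The cost does not depend on the number of strips or on the file size, which
is the property that matters on the fault path.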

Oh, and there are two fmap formats (ok, more, but the others are legacy ;).
The fmaps-in-messages structs are currently in the famfs section of
include/uapi/linux/fuse.h. And the in-memory version is in 
fs/fuse/famfs_kfmap.h. The former will need to be a versioned interface.
(ugh...)

> 
> > * Daxdevs are stored in a table (which might become an xarray at some point).
> >   When entries are added to the table, we acquire exclusive access to the
> >   daxdev via the fs_dax_get() call (modeled after how fs-dax handles this
> >   with pmem devices). famfs provides holder_operations to devdax, providing
> >   a notification path in the event of memory errors.
> > 
> > * If devdax notifies famfs of memory errors on a dax device, famfs currently
> > >   blocks all subsequent accesses to data on that device. The recovery is to
> >   re-initialize the memory and file system. Famfs is memory, not storage...
> 
> Ouch. :)

Cautious initial approach (i.e. I'm trying not to scare people too much ;) 

> 
> > * Because famfs uses backing (devdax) devices, only privileged mounts are
> >   supported.
> > 
> > * The famfs kernel code never accesses the memory directly - it only
> >   facilitates read, write and mmap on behalf of user processes. As such,
> >   the RAS of the shared memory affects applications, but not the kernel.
> > 
> > * Famfs has backing device(s), but they are devdax (char) rather than
> >   block. Right now there is no way to tell the vfs layer that famfs has a
> >   char backing device (unless we say it's block, but it's not). Currently
> >   we use the standard anonymous fuse fs_type - but I'm not sure that's
> >   ultimately optimal (thoughts?)
> 
> Does it work if the fusefs server adds "-o fsname=<devdax cdev>" to the
> fuse_args object?  fuse2fs does that, though I don't recall if that's a
> reasonable thing to do.

The kernel needs to "own" the dax devices. fs-dax on pmem/block calls
fs_dax_get_by_bdev() and passes in holder_operations - which are used for
error upcalls, but also effect exclusive ownership. 

I added fs_dax_get() since the bdev version wasn't really right for char
devdax. But same holder_operations.

I had originally intended to pass in "-o daxdev=<cdev>", but famfs needs to
span multiple daxdevs, in order to interleave for performance. The approach
of retrieving them with GET_DAXDEV handles the generalized case, so "-o"
just amounts to a second way to do the same thing.

"But wait"... I thought. Doesn't the "-o" approach get the primary daxdev
locked up sooner, which might be good? Well, no, because famfs creates a
couple of meta files during mount (.meta/.superblock and .meta/.log) - and 
those are guaranteed to reference the primary daxdev. So I concluded the -o
approach wasn't worth the trouble (though it's not *much* trouble).

> 
> > The "poisoned page|folio problem"
> > 
> > * Background: before doing a kernel mount, the famfs user space [2] validates
> >   the superblock and log. This is done via raw mmap of the primary devdax
> >   device. If valid, the file system is mounted, and the superblock and log
> >   get exposed through a pair of files (.meta/.superblock and .meta/.log) -
> >   because we can't be using raw device mmap when a file system is mounted
> >   on the device. But this exposes a devdax bug and warning...
> > 
> > * Pages that have been memory mapped via devdax are left in a permanently
> >   problematic state. Devdax sets page|folio->mapping when a page is accessed
> >   via raw devdax mmap (as famfs does before mount), but never cleans it up.
> >   When the pages of the famfs superblock and log are accessed via the "meta"
> >   files after mount, we see a WARN_ONCE() in dax_insert_entry(), which
> >   notices that page|folio->mapping is still set. I intend to address this
> >   prior to asking for the famfs patches to be merged.
> > 
> > * Alistair Popple's recent dax patch series [6], which has been merged
> >   for 6.15, addresses some dax issues, but sadly does not fix the poisoned
> >   page|folio problem - its enhanced refcount checking turns the warning into
> >   an error.
> > 
> > * This 6.14 patch set disables the warning; a proper fix will be required for
> >   famfs to work at all in 6.15. Dan W. and I are actively discussing how to do
> >   this properly...
> > 
> > * In terms of the correct functionality of famfs, the warning can be ignored.
> > 
> > References
> > 
> > [1] - https://github.com/libfuse/libfuse/pull/1200
> > [2] - https://github.com/cxl-micron-reskit/famfs
> 
> Thanks for posting links, I'll have a look there too.
> 
> --D
> 

I'm happy to talk if you wanna kick ideas around.

Cheers,
John


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 13/19] famfs_fuse: Create files with famfs fmaps
  2025-04-21 21:57   ` Darrick J. Wong
@ 2025-04-21 22:31     ` John Groves
  0 siblings, 0 replies; 58+ messages in thread
From: John Groves @ 2025-04-21 22:31 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Dan Williams, Miklos Szeredi, Bernd Schubert, John Groves,
	Jonathan Corbet, Vishal Verma, Dave Jiang, Matthew Wilcox,
	Jan Kara, Alexander Viro, Christian Brauner, Luis Henriques,
	Randy Dunlap, Jeff Layton, Kent Overstreet, Petr Vorel,
	Brian Foster, linux-doc, linux-kernel, nvdimm, linux-cxl,
	linux-fsdevel, Amir Goldstein, Jonathan Cameron, Stefan Hajnoczi,
	Joanne Koong, Josef Bacik, Aravind Ramesh, Ajay Joshi

On 25/04/21 02:57PM, Darrick J. Wong wrote:
> On Sun, Apr 20, 2025 at 08:33:40PM -0500, John Groves wrote:
> > On completion of GET_FMAP message/response, setup the full famfs
> > metadata such that it's possible to handle read/write/mmap directly to
> > dax. Note that the devdax_iomap plumbing is not in yet...
> > 
> > Update MAINTAINERS for the new files.
> > 
> > Signed-off-by: John Groves <john@groves.net>
> > ---
> >  MAINTAINERS               |   9 +
> >  fs/fuse/Makefile          |   2 +-
> >  fs/fuse/dir.c             |   3 +
> >  fs/fuse/famfs.c           | 344 ++++++++++++++++++++++++++++++++++++++
> >  fs/fuse/famfs_kfmap.h     |  63 +++++++
> >  fs/fuse/fuse_i.h          |  16 +-
> >  fs/fuse/inode.c           |   2 +-
> >  include/uapi/linux/fuse.h |  42 +++++
> >  8 files changed, 477 insertions(+), 4 deletions(-)
> >  create mode 100644 fs/fuse/famfs.c
> >  create mode 100644 fs/fuse/famfs_kfmap.h
> > 
> > diff --git a/MAINTAINERS b/MAINTAINERS
> > index 00e94bec401e..2a5a7e0e8b28 100644
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -8808,6 +8808,15 @@ F:	Documentation/networking/failover.rst
> >  F:	include/net/failover.h
> >  F:	net/core/failover.c
> >  
> > +FAMFS
> > +M:	John Groves <jgroves@micron.com>
> > +M:	John Groves <John@Groves.net>
> > +L:	linux-cxl@vger.kernel.org
> > +L:	linux-fsdevel@vger.kernel.org
> > +S:	Supported
> > +F:	fs/fuse/famfs.c
> > +F:	fs/fuse/famfs_kfmap.h
> > +
> >  FANOTIFY
> >  M:	Jan Kara <jack@suse.cz>
> >  R:	Amir Goldstein <amir73il@gmail.com>
> > diff --git a/fs/fuse/Makefile b/fs/fuse/Makefile
> > index 3f0f312a31c1..65a12975d734 100644
> > --- a/fs/fuse/Makefile
> > +++ b/fs/fuse/Makefile
> > @@ -16,5 +16,5 @@ fuse-$(CONFIG_FUSE_DAX) += dax.o
> >  fuse-$(CONFIG_FUSE_PASSTHROUGH) += passthrough.o
> >  fuse-$(CONFIG_SYSCTL) += sysctl.o
> >  fuse-$(CONFIG_FUSE_IO_URING) += dev_uring.o
> > -
> > +fuse-$(CONFIG_FUSE_FAMFS_DAX) += famfs.o
> >  virtiofs-y := virtio_fs.o
> > diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
> > index ae135c55b9f6..b28a1e912d6b 100644
> > --- a/fs/fuse/dir.c
> > +++ b/fs/fuse/dir.c
> > @@ -405,6 +405,9 @@ fuse_get_fmap(struct fuse_mount *fm, struct inode *inode, u64 nodeid)
> >  	fmap_size = args.out_args[0].size;
> >  	pr_notice("%s: nodei=%lld fmap_size=%ld\n", __func__, nodeid, fmap_size);
> >  
> > +	/* Convert fmap into in-memory format and hang from inode */
> > +	famfs_file_init_dax(fm, inode, fmap_buf, fmap_size);
> > +
> >  	return 0;
> >  }
> >  #endif
> > diff --git a/fs/fuse/famfs.c b/fs/fuse/famfs.c
> > new file mode 100644
> > index 000000000000..e62c047d0950
> > --- /dev/null
> > +++ b/fs/fuse/famfs.c
> > @@ -0,0 +1,344 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * famfs - dax file system for shared fabric-attached memory
> > + *
> > + * Copyright 2023-2025 Micron Technology, Inc.
> > + *
> > + * This file system, originally based on ramfs and the dax support from xfs,
> > + * is intended to allow multiple host systems to mount a common file system
> > + * view of dax files that map to shared memory.
> > + */
> > +
> > +#include <linux/fs.h>
> > +#include <linux/mm.h>
> > +#include <linux/dax.h>
> > +#include <linux/iomap.h>
> > +#include <linux/path.h>
> > +#include <linux/namei.h>
> > +#include <linux/string.h>
> > +
> > +#include "famfs_kfmap.h"
> > +#include "fuse_i.h"
> > +
> > +
> > +void
> > +__famfs_meta_free(void *famfs_meta)
> > +{
> > +	struct famfs_file_meta *fmap = famfs_meta;
> > +
> > +	if (!fmap)
> > +		return;
> > +
> > +	if (fmap) {
> > +		switch (fmap->fm_extent_type) {
> > +		case SIMPLE_DAX_EXTENT:
> > +			kfree(fmap->se);
> > +			break;
> > +		case INTERLEAVED_EXTENT:
> 
> Are interleaved extents not DAX extents?  Why does one constant refer to
> DAX but the other does not?

All extents are DAX. Naming evolved over 2+ years, and could be cleaned up.

> 
> > +			if (fmap->ie)
> > +				kfree(fmap->ie->ie_strips);
> > +
> > +			kfree(fmap->ie);
> > +			break;
> > +		default:
> > +			pr_err("%s: invalid fmap type\n", __func__);
> > +			break;
> > +		}
> > +	}
> > +	kfree(fmap);
> > +}
> > +
> > +static int
> > +famfs_check_ext_alignment(struct famfs_meta_simple_ext *se)
> > +{
> > +	int errs = 0;
> > +
> > +	if (se->dev_index != 0)
> > +		errs++;
> > +
> > +	/* TODO: pass in alignment so we can support the other page sizes */
> > +	if (!IS_ALIGNED(se->ext_offset, PMD_SIZE))
> > +		errs++;
> > +
> > +	if (!IS_ALIGNED(se->ext_len, PMD_SIZE))
> > +		errs++;
> > +
> > +	return errs;
> > +}
> > +
> > +/**
> > + * famfs_meta_alloc() - Allocate famfs file metadata
> > + * @metap:       Pointer to an mcache_map_meta pointer
> > + * @ext_count:  The number of extents needed
> > + */
> > +static int
> > +famfs_meta_alloc_v3(
> 
> Err, what's with "v3"?  This is a new fs, right?


Um, been working on this for 2+ years so there's a not-very-public legacy.
But I agree naming should be cleaned up.

> 
> > +	void *fmap_buf,
> > +	size_t fmap_buf_size,
> > +	struct famfs_file_meta **metap)
> > +{
> > +	struct famfs_file_meta *meta = NULL;
> > +	struct fuse_famfs_fmap_header *fmh;
> > +	size_t extent_total = 0;
> > +	size_t next_offset = 0;
> > +	int errs = 0;
> > +	int i, j;
> > +	int rc;
> > +
> > +	fmh = (struct fuse_famfs_fmap_header *)fmap_buf;
> > +
> > +	/* Move past fmh in fmap_buf */
> > +	next_offset += sizeof(*fmh);
> > +	if (next_offset > fmap_buf_size) {
> > +		pr_err("%s:%d: fmap_buf underflow offset/size %ld/%ld\n",
> > +		       __func__, __LINE__, next_offset, fmap_buf_size);
> > +		rc = -EINVAL;
> > +		goto errout;
> > +	}
> > +
> > +	if (fmh->nextents < 1) {
> > +		pr_err("%s: nextents %d < 1\n", __func__, fmh->nextents);
> > +		rc = -EINVAL;
> > +		goto errout;
> > +	}
> > +
> > +	if (fmh->nextents > FUSE_FAMFS_MAX_EXTENTS) {
> > +		pr_err("%s: nextents %d > max (%d) 1\n",
> > +		       __func__, fmh->nextents, FUSE_FAMFS_MAX_EXTENTS);
> > +		rc = -E2BIG;
> > +		goto errout;
> > +	}
> > +
> > +	meta = kzalloc(sizeof(*meta), GFP_KERNEL);
> > +	if (!meta)
> > +		return -ENOMEM;
> > +	meta->error = false;
> > +
> > +	meta->file_type = fmh->file_type;
> > +	meta->file_size = fmh->file_size;
> > +	meta->fm_extent_type = fmh->ext_type;
> > +
> > +	switch (fmh->ext_type) {
> > +	case FUSE_FAMFS_EXT_SIMPLE: {
> > +		struct fuse_famfs_simple_ext *se_in;
> > +
> > +		se_in = (struct fuse_famfs_simple_ext *)(fmap_buf + next_offset);
> > +
> > +		/* Move past simple extents */
> > +		next_offset += fmh->nextents * sizeof(*se_in);
> > +		if (next_offset > fmap_buf_size) {
> > +			pr_err("%s:%d: fmap_buf underflow offset/size %ld/%ld\n",
> > +			       __func__, __LINE__, next_offset, fmap_buf_size);
> > +			rc = -EINVAL;
> > +			goto errout;
> > +		}
> > +
> > +		meta->fm_nextents = fmh->nextents;
> > +
> > +		meta->se = kcalloc(meta->fm_nextents, sizeof(*(meta->se)),
> > +				   GFP_KERNEL);
> > +		if (!meta->se) {
> > +			rc = -ENOMEM;
> > +			goto errout;
> > +		}
> > +
> > +		if ((meta->fm_nextents > FUSE_FAMFS_MAX_EXTENTS) ||
> 
> FUSE_FAMFS_MAX_EXTENTS is 2?  I gather that simple files in famfs refer
> to contiguous regions, but why two mappings?

There is no forward-looking, or even current-term, reason why it should be
limited to 2. But famfs files are strictly pre-allocated, so it takes some
special code to test the multi-extent code paths. We do that internally,
hence 2 (rather than 1).

We do exercise much bigger lists of the same extents in interleaved
setups, where the limit is higher.

But dialing it up, or even removing the limit, should be fine provided the
GET_FMAP message validates.

> 
> > +		    (meta->fm_nextents < 1)) {
> > +			rc = -EINVAL;
> > +			goto errout;
> > +		}
> > +
> > +		for (i = 0; i < fmh->nextents; i++) {
> > +			meta->se[i].dev_index  = se_in[i].se_devindex;
> > +			meta->se[i].ext_offset = se_in[i].se_offset;
> > +			meta->se[i].ext_len    = se_in[i].se_len;
> > +
> > +			/* Record bitmap of referenced daxdev indices */
> > +			meta->dev_bitmap |= (1 << meta->se[i].dev_index);
> > +
> > +			errs += famfs_check_ext_alignment(&meta->se[i]);
> 
> Shouldn't you bail out at the first bad mapping?

Probably yes; need to dredge old memory about this...
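
A minimal user-space sketch of the bail-out-early variant, for discussion
only. EXT_ALIGN stands in for PMD_SIZE (assumed 2 MiB here), and the struct
and check_extents() are hypothetical stand-ins, not the patch code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define EXT_ALIGN (2ULL * 1024 * 1024)	/* assumption: stands in for PMD_SIZE */

/* Hypothetical stand-in for the in-memory simple extent. */
struct simple_ext {
	uint32_t dev_index;
	uint64_t ext_offset;
	uint64_t ext_len;
};

/*
 * Validate every extent, stopping at the first bad one instead of
 * counting errors. On success, *total_len receives the summed length.
 */
int
check_extents(const struct simple_ext *se, size_t nextents,
	      uint64_t *total_len)
{
	uint64_t total = 0;
	size_t i;

	assert(total_len != NULL);

	for (i = 0; i < nextents; i++) {
		/* Same three conditions the counting version tests */
		if (se[i].dev_index != 0)
			return -EINVAL;
		if (se[i].ext_offset % EXT_ALIGN)
			return -EINVAL;
		if (se[i].ext_len % EXT_ALIGN)
			return -EINVAL;

		total += se[i].ext_len;
	}

	*total_len = total;
	return 0;
}
```

Returning on the first failure means the caller never sees a partially
summed length from a list that is going to be rejected anyway.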

> 
> > +			extent_total += meta->se[i].ext_len;
> > +		}
> 
> I took a look at what's already in uapi/linux/fuse.h and saw that
> there are two operations -- FUSE_{SETUP,REMOVE}MAPPING.  Those two fuse
> upcalls seem to manage an interval tree in struct fuse_inode_dax, which
> is used to feed fuse_iomap_begin.  Can you reuse this existing uapi
> instead of defining a new one that's already pretty similar?

OK, so the pre-existing DAX stuff in fuse is for virtiofs, which is doing
a very narrow thing (which I don't understand completely, but Stefan is
on this thread - though if I were him I might not be paying attention :)
My net assessment: the pre-existing fuse dax stuff was not a viable platform
for a file system with many files.

I initially implemented famfs as a standalone file system (patches easy
to find, and there are branches in my github kernel repos - including one
called famfs_dual that has BOTH). The existing DAX stuff in fuse is quite
different from the fs-dax interface that xfs uses - and has no notify_failure
etc.

> 
> I'm wondering why create all this new code when fuse/dax.c already seems
> to have the ability to cache mappings and pass them to dax_iomap_rw
> without restrictions on the number of mappings and all that?
> 
> Maybe you're trying to avoid runtime upcalls, but then I would think
> that you could teach the fuse/dax.c mapping code to pin the mappings
> if there aren't that many of them in the first place, rather than
> reinventing mappings?
> 
> It occurred to me (perhaps naively) that maybe you created FUSE_GETFMAP
> because of this interleaving thing because it's probably faster to
> upload a template for that than it would be to upload a large number of
> mappings.  But I don't really grok why the interleaving exists, though I
> guess it's for memory controllers interleaving memory devices or
> something for better throughput?

In famfsv1 (the standalone version), user space "pushed" mappings into
the kernel, but fuse doesn't do it that way. It wants to do readdir, lookup,
etc. So GET_FMAP was the answer I came up with - and so far it works fine.

> 
> I also see that famfs_meta_to_dax_offset does a linear walk of the
> mapping array, which does not seem like it will be inefficient when
> there are many mappings.

Right, that's no big deal. And if there's only one extent (or if the extents
are fixed-size), it's order 1.
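
As a rough user-space sketch of that walk (hypothetical names; this is not
the famfs_meta_to_dax_offset code), resolving through a list of simple
extents looks something like this:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the in-memory simple extent. */
struct simple_ext {
	uint32_t dev_index;
	uint64_t ext_offset;
	uint64_t ext_len;
};

/*
 * Walk the extent list, subtracting each extent's length until the
 * remaining offset falls inside one. Returns 0 on success, -1 if the
 * file offset lies beyond the last extent.
 */
int
resolve_simple(const struct simple_ext *se, size_t nextents,
	       uint64_t file_offset, uint32_t *dev_index,
	       uint64_t *dev_offset)
{
	uint64_t local = file_offset;
	size_t i;

	assert(dev_index != NULL && dev_offset != NULL);

	for (i = 0; i < nextents; i++) {
		if (local < se[i].ext_len) {
			*dev_index  = se[i].dev_index;
			*dev_offset = se[i].ext_offset + local;
			return 0;
		}
		local -= se[i].ext_len;
	}
	return -1;
}
```

With one extent the loop body runs once. With fixed-size extents the loop
could be replaced by a divide to pick the index directly.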

> 
> > +		break;
> > +	}
> > +
> > +	case FUSE_FAMFS_EXT_INTERLEAVE: {
> > +		s64 size_remainder = meta->file_size;
> > +		struct fuse_famfs_iext *ie_in;
> > +		int niext = fmh->nextents;
> > +
> > +		meta->fm_niext = niext;
> > +
> > +		/* Allocate interleaved extent */
> > +		meta->ie = kcalloc(niext, sizeof(*(meta->ie)), GFP_KERNEL);
> > +		if (!meta->ie) {
> > +			rc = -ENOMEM;
> > +			goto errout;
> > +		}
> > +
> > +		/*
> > +		 * Each interleaved extent has a simple extent list of strips.
> > +		 * Outer loop is over separate interleaved extents
> 
> Hmm, so there's no checking on fmh->nextents here, so I guess we can
> have as many sets of interleaved extents as we want?  Each with up to 16
> simple mappings?
> 
> --D

OK, so I'm remembering a bit more about the legacy around extent limits. 
There are some MVP simplifications in the famfs metadata log format 
(which is orthogonal to the message and in-memory metadata formats here). 
An fmap in the log (a third format, but there is at least one more :-/) 
is a fully dimensioned compound structure that you can call sizeof on. 
So that is the second reason (in addition to preallocation) why we didn't 
need many extents.

Also, when we resolve file offsets to dax offsets, limit and validity
checking was already done when the GET_FMAP message was ingested.
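
To make the ingest-time check concrete, here is a small user-space sketch
under stated assumptions: the header size, per-extent size and extent limit
are passed in as plain numbers, and fmap_payload_ok() is a hypothetical
helper, not a function in the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Check that a message of buf_size bytes can hold a header of hdr_size
 * bytes followed by nextents records of ext_size bytes each.
 * Returns 0 if the payload is plausible, a negative errno otherwise.
 */
int
fmap_payload_ok(size_t buf_size, size_t hdr_size, size_t ext_size,
		uint32_t nextents, uint32_t max_extents)
{
	assert(ext_size > 0);	/* a zero record size is a caller bug */

	if (hdr_size > buf_size)
		return -EINVAL;
	if (nextents < 1)
		return -EINVAL;
	if (nextents > max_extents)
		return -E2BIG;

	/* Divide rather than multiply so nextents * ext_size cannot wrap */
	if (nextents > (buf_size - hdr_size) / ext_size)
		return -EINVAL;

	return 0;
}
```

Once a message passes a check of this shape, the per-fault resolution code
can index the extent array without re-validating it.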

I think for fuse famfs, that can be relaxed and ignored - especially if 
you're gonna test it :D.

Thanks for the review eyeballs, and let me know if you wanna talk through
some of this stuff.

Regards,
John




^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 00/19] famfs: port into fuse
  2025-04-21 22:00   ` John Groves
@ 2025-04-22  1:25     ` Darrick J. Wong
  2025-04-22 11:50       ` John Groves
  0 siblings, 1 reply; 58+ messages in thread
From: Darrick J. Wong @ 2025-04-22  1:25 UTC (permalink / raw)
  To: John Groves
  Cc: Dan Williams, Miklos Szeredi, Bernd Schubert, John Groves,
	Jonathan Corbet, Vishal Verma, Dave Jiang, Matthew Wilcox,
	Jan Kara, Alexander Viro, Christian Brauner, Luis Henriques,
	Randy Dunlap, Jeff Layton, Kent Overstreet, Petr Vorel,
	Brian Foster, linux-doc, linux-kernel, nvdimm, linux-cxl,
	linux-fsdevel, Amir Goldstein, Jonathan Cameron, Stefan Hajnoczi,
	Joanne Koong, Josef Bacik, Aravind Ramesh, Ajay Joshi

On Mon, Apr 21, 2025 at 05:00:35PM -0500, John Groves wrote:
> On 25/04/21 11:27AM, Darrick J. Wong wrote:
> > On Sun, Apr 20, 2025 at 08:33:27PM -0500, John Groves wrote:
> > > Subject: famfs: port into fuse
> > > 
> > > This is the initial RFC for the fabric-attached memory file system (famfs)
> > > integration into fuse. In order to function, this requires a related patch
> > > to libfuse [1] and the famfs user space [2]. 
> > > 
> > > This RFC is mainly intended to socialize the approach and get feedback from
> > > the fuse developers and maintainers. There is some dax work that needs to
> > > be done before this should be merged (see the "poisoned page|folio problem"
> > > below).
> > 
> > Note that I'm only looking at the fuse and iomap aspects of this
> > patchset.  I don't know the devdax code at all.
> > 
> > > This patch set fully works with Linux 6.14 -- passing all existing famfs
> > > smoke and unit tests -- and I encourage existing famfs users to test it.
> > > 
> > > This is really two patch sets mashed up:
> > > 
> > > * The patches with the dev_dax_iomap: prefix fill in missing functionality for
> > >   devdax to host an fs-dax file system.
> > > * The famfs_fuse: patches add famfs into fs/fuse/. These are effectively
> > >   unchanged since last year.
> > > 
> > > Because this is not ready to merge yet, I have felt free to leave some debug
> > > prints in place because we still find them useful; those will be cleaned up
> > > in a subsequent revision.
> > > 
> > > Famfs Overview
> > > 
> > > Famfs exposes shared memory as a file system. Famfs consumes shared memory
> > > from dax devices, and provides memory-mappable files that map directly to
> > > the memory - no page cache involvement. Famfs differs from conventional
> > > file systems in fs-dax mode, in that it handles in-memory metadata in a
> > > sharable way (which begins with never caching dirty shared metadata).
> > > 
> > > Famfs started as a standalone file system [3,4], but the consensus at LSFMM
> > > 2024 [5] was that it should be ported into fuse - and this RFC is the first
> > > public evidence that I've been working on that.
> > 
> > This is very timely, as I just started looking into how I might connect
> > iomap to fuse so that most of the hot IO path continues to run in the
> > kernel, and userspace block device filesystem drivers merely supply the
> > file mappings to the kernel.  In other words, we kick the metadata
> > parsing craziness out of the kernel.
> 
> Coool!
> 
> > 
> > > The key performance requirement is that famfs must resolve mapping faults
> > > without upcalls. This is achieved by fully caching the file-to-devdax
> > > metadata for all active files. This is done via two fuse client/server
> > > message/response pairs: GET_FMAP and GET_DAXDEV.
> > 
> > Heh, just last week I finally got around to laying out how I think I'd
> > want to expose iomap through fuse to allow ->iomap_begin/->iomap_end
> > upcalls to a fuse server.  Note that I've done zero prototyping but
> > "upload all the mappings at open time" seems like a reasonable place for
> > me to start looking, especially for a filesystem with static mappings.
> > 
> > I think what I want to try to build is an in-kernel mapping cache (sort
> > of like the one you built), only with upcalls to the fuse server when
> > there is no mapping information for a given IO.  I'd probably want to
> > have a means for the fuse server to put new mappings into the cache, or
> > invalidate existing mappings.
> > 
> > (famfs obviously is a simple corner-case of that grandiose vision, but I
> > still have a long way to get to my larger vision so don't take my words
> > as any kind of requirement.)
> > 
> > > Famfs remains the first fs-dax file system that is backed by devdax rather
> > > than pmem in fs-dax mode (hence the need for the dev_dax_iomap fixups).
> > > 
> > > Notes
> > > 
> > > * Once the dev_dax_iomap patches land, I suspect it may make sense for
> > >   virtiofs to update to use the improved interface.
> > > 
> > > * I'm currently maintaining compatibility between the famfs user space and
> > >   both the standalone famfs kernel file system and this new fuse
> > >   implementation. In the near future I'll be running performance comparisons
> > >   and sharing them - but there is no reason to expect significant degradation
> > >   with fuse, since famfs caches entire "fmaps" in the kernel to resolve
> > 
> > I'm curious to hear what you find, performance-wise. :)
> > 
> > >   faults with no upcalls. This patch has a bit too much debug turned on to
> > > >   do that testing quite yet. A branch 
> > 
> > A branch ... what?
> 
> I trail off sometimes... ;)
> 
> > 
> > > * Two new fuse messages / responses are added: GET_FMAP and GET_DAXDEV.
> > > 
> > > * When a file is looked up in a famfs mount, the LOOKUP is followed by a
> > >   GET_FMAP message and response. The "fmap" is the full file-to-dax mapping,
> > >   allowing the fuse/famfs kernel code to handle read/write/fault without any
> > >   upcalls.
> > 
> > Huh, I'd have thought you'd wait until FUSE_OPEN to start preloading
> > mappings into the kernel.
> 
> That may be a better approach. Miklos and I discussed it during LPC last year,
> and thought both were options. Having implemented it at LOOKUP time, I think
> moving it to open might avoid my READDIRPLUS problem: RDP is a mashup of
> READDIR and LOOKUP, and therefore might need to carry the GET_FMAP payload
> too. Moving GET_FMAP to open time would break that connection in a good way,
> I think.

I wonder if we could just add a couple new "notification" types so that
the fuse server can initiate uploads of mappings whenever it feels like
it.  For your usage model I don't think it'll make much difference since
they seem pretty static, but the ability to do that would open up some
flexibility for famfs.  The more general filesystems will need it
anyway, and someone's going to want to truncate a famfs file.  They
always do. ;)

> > 
> > > * After each GET_FMAP, the fmap is checked for extents that reference
> > > >   previously-unknown daxdevs. Each such occurrence is handled with a
> > >   GET_DAXDEV message and response.
> > 
> > I hadn't figured out how this part would work for my silly prototype.
> > Just out of curiosity, does the famfs fuse server hold an open fd to the
> > storage, in which case the fmap(ping) could just contain the open fd?
> > 
> > Where are the mappings that are sent from the fuse server?  Is that
> > struct fuse_famfs_simple_ext?
> 
> See patch 17 or fs/fuse/famfs_kfmap.h for the fmap metadata explanation. 
> Famfs currently supports either simple extents (daxdev, offset, length) or 
> interleaved ones (which describe each "strip" as a simple extent). I think 
> the explanation in famfs_kfmap.h is pretty clear.
> 
> A key question is whether any additional basic metadata abstractions would
> be needed - because the kernel needs to understand the full scheme.
> 
> With disaggregated memory, the interleave approach is nice because it gets
> aggregated performance and resolving a file offset to daxdev offset is order
> 1.
> 
> Oh, and there are two fmap formats (ok, more, but the others are legacy ;).
> The fmaps-in-messages structs are currently in the famfs section of
> include/uapi/linux/fuse.h. And the in-memory version is in 
> fs/fuse/famfs_kfmap.h. The former will need to be a versioned interface.
> (ugh...)

Ok, will take a look tomorrow morning.

> > 
> > > * Daxdevs are stored in a table (which might become an xarray at some point).
> > >   When entries are added to the table, we acquire exclusive access to the
> > >   daxdev via the fs_dax_get() call (modeled after how fs-dax handles this
> > >   with pmem devices). famfs provides holder_operations to devdax, providing
> > >   a notification path in the event of memory errors.
> > > 
> > > * If devdax notifies famfs of memory errors on a dax device, famfs currently
> > > >   blocks all subsequent accesses to data on that device. The recovery is to
> > >   re-initialize the memory and file system. Famfs is memory, not storage...
> > 
> > Ouch. :)
> 
> Cautious initial approach (i.e. I'm trying not to scare people too much ;) 
> 
> > 
> > > * Because famfs uses backing (devdax) devices, only privileged mounts are
> > >   supported.
> > > 
> > > * The famfs kernel code never accesses the memory directly - it only
> > >   facilitates read, write and mmap on behalf of user processes. As such,
> > >   the RAS of the shared memory affects applications, but not the kernel.
> > > 
> > > * Famfs has backing device(s), but they are devdax (char) rather than
> > >   block. Right now there is no way to tell the vfs layer that famfs has a
> > >   char backing device (unless we say it's block, but it's not). Currently
> > >   we use the standard anonymous fuse fs_type - but I'm not sure that's
> > >   ultimately optimal (thoughts?)
> > 
> > Does it work if the fusefs server adds "-o fsname=<devdax cdev>" to the
> > fuse_args object?  fuse2fs does that, though I don't recall if that's a
> > reasonable thing to do.
> 
> The kernel needs to "own" the dax devices. fs-dax on pmem/block calls
> fs_dax_get_by_bdev() and passes in holder_operations - which are used for
> error upcalls, but also effect exclusive ownership. 
> 
> I added fs_dax_get() since the bdev version wasn't really right for char
> devdax. But same holder_operations.
> 
> I had originally intended to pass in "-o daxdev=<cdev>", but famfs needs to
> span multiple daxdevs, in order to interleave for performance. The approach
> of retrieving them with GET_DAXDEV handles the generalized case, so "-o"
> just amounts to a second way to do the same thing.

Oh, hah, it's a multi-device filesystem.  Hee hee hee...

> "But wait"... I thought. Doesn't the "-o" approach get the primary daxdev
> locked up sooner, which might be good? Well, no, because famfs creates a
> couple of meta files during mount (.meta/.superblock and .meta/.log) - and 
> those are guaranteed to reference the primary daxdev. So I concluded the -o
> approach wasn't worth the trouble (though it's not *much* trouble).

<nod> For block devices, someone needs to own the bdev O_EXCL, but it
doesn't have to be the kernel.  Though ... I wonder what *does* happen
when something tries to invoke the bdev holder_ops?  Maybe it would
be nice to freeze the fs, but I don't know if fuse already does that.

> > 
> > > The "poisoned page|folio problem"
> > > 
> > > * Background: before doing a kernel mount, the famfs user space [2] validates
> > >   the superblock and log. This is done via raw mmap of the primary devdax
> > >   device. If valid, the file system is mounted, and the superblock and log
> > >   get exposed through a pair of files (.meta/.superblock and .meta/.log) -
> > >   because we can't be using raw device mmap when a file system is mounted
> > >   on the device. But this exposes a devdax bug and warning...
> > > 
> > > * Pages that have been memory mapped via devdax are left in a permanently
> > >   problematic state. Devdax sets page|folio->mapping when a page is accessed
> > >   via raw devdax mmap (as famfs does before mount), but never cleans it up.
> > >   When the pages of the famfs superblock and log are accessed via the "meta"
> > >   files after mount, we see a WARN_ONCE() in dax_insert_entry(), which
> > >   notices that page|folio->mapping is still set. I intend to address this
> > >   prior to asking for the famfs patches to be merged.
> > > 
> > > * Alistair Popple's recent dax patch series [6], which has been merged
> > >   for 6.15, addresses some dax issues, but sadly does not fix the poisoned
> > >   page|folio problem - its enhanced refcount checking turns the warning into
> > >   an error.
> > > 
> > > * This 6.14 patch set disables the warning; a proper fix will be required for
> > >   famfs to work at all in 6.15. Dan W. and I are actively discussing how to do
> > >   this properly...
> > > 
> > > * In terms of the correct functionality of famfs, the warning can be ignored.
> > > 
> > > References
> > > 
> > > [1] - https://github.com/libfuse/libfuse/pull/1200
> > > [2] - https://github.com/cxl-micron-reskit/famfs
> > 
> > Thanks for posting links, I'll have a look there too.
> > 
> > --D
> > 
> 
> I'm happy to talk if you wanna kick ideas around.

Heheh I will, but give me a day or two to wander through the rest of the
patches, or maybe just decide to pull the branch and look at one huge
diff.

--D

> Cheers,
> John
> 
> 


* Re: [RFC PATCH 18/19] famfs_fuse: Add documentation
  2025-04-21  1:33 ` [RFC PATCH 18/19] famfs_fuse: Add documentation John Groves
@ 2025-04-22  2:10   ` Randy Dunlap
  2025-04-28  1:50     ` John Groves
  0 siblings, 1 reply; 58+ messages in thread
From: Randy Dunlap @ 2025-04-22  2:10 UTC (permalink / raw)
  To: John Groves, Dan Williams, Miklos Szeredi, Bernd Schubert
  Cc: John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Darrick J . Wong, Luis Henriques, Jeff Layton, Kent Overstreet,
	Petr Vorel, Brian Foster, linux-doc, linux-kernel, nvdimm,
	linux-cxl, linux-fsdevel, Amir Goldstein, Jonathan Cameron,
	Stefan Hajnoczi, Joanne Koong, Josef Bacik, Aravind Ramesh,
	Ajay Joshi



On 4/20/25 6:33 PM, John Groves wrote:
> Add Documentation/filesystems/famfs.rst and update MAINTAINERS
> 
> Signed-off-by: John Groves <john@groves.net>
> ---
>  Documentation/filesystems/famfs.rst | 142 ++++++++++++++++++++++++++++
>  Documentation/filesystems/index.rst |   1 +
>  MAINTAINERS                         |   1 +
>  3 files changed, 144 insertions(+)
>  create mode 100644 Documentation/filesystems/famfs.rst
> 
> diff --git a/Documentation/filesystems/famfs.rst b/Documentation/filesystems/famfs.rst
> new file mode 100644
> index 000000000000..b6b3500b6905
> --- /dev/null
> +++ b/Documentation/filesystems/famfs.rst
> @@ -0,0 +1,142 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +.. _famfs_index:
> +
> +==================================================================
> +famfs: The fabric-attached memory file system
> +==================================================================
> +
> +- Copyright (C) 2024-2025 Micron Technology, Inc.
> +
> +Introduction
> +============
> +Compute Express Link (CXL) provides a mechanism for disaggregated or
> +fabric-attached memory (FAM). This creates opportunities for data sharing;
> +clustered apps that would otherwise have to shard or replicate data can
> +share one copy in disaggregated memory.
> +
> +Famfs, which is not CXL-specific in any way, provides a mechanism for
> +multiple hosts to concurrently access data in shared memory, by giving it
> +a file system interface. With famfs, any app that understands files can
> +access data sets in shared memory. Although famfs supports read and write,
> +the real point is to support mmap, which provides direct (dax) access to
> +the memory - either writable or read-only.
> +
> +Shared memory can pose complex coherency and synchronization issues, but
> +there are also simple cases. Two simple and eminently useful patterns that
> +occur frequently in data analytics and AI are:
> +
> +* Serial Sharing - Only one host or process at a time has access to a file
> +* Read-only Sharing - Multiple hosts or processes share read-only access
> +  to a file
> +
> +The famfs fuse file system is part of the famfs framework; User space

                                                              user

> +components [1] handle metadata allocation and distribution, and provide a
> +low-level fuse server to expose files that map directly to [presumably
> +shared] memory.
> +
> +The famfs framework manages coherency of its own metadata and structures,
> +but does not attempt to manage coherency for applications.
> +
> +Famfs also provides data isolation between files. That is, even though
> +the host has access to an entire memory "device" (as a devdax device), apps
> +cannot write to memory for which the file is read-only, and mapping one
> +file provides isolation from the memory of all other files. This is pretty
> +basic, but some experimental shared memory usage patterns provide no such
> +isolation.
> +
> +Principles of Operation
> +=======================
> +
> +Famfs is a file system with one or more devdax devices as a first-class
> +backing device(s). Metadata maintenance and query operations happen
> +entirely in user space.
> +
> +The famfs low-level fuse server daemon provides file maps (fmaps) and
> +devdax device info to the fuse/famfs kernel component so that
> +read/write/mapping faults can be handled without up-calls for all active
> +files.
> +
> +The famfs user space is responsible for maintaining and distributing
> +consistent metadata. This is currently handled via an append-only
> +metadata log within the memory, but this is orthogonal to the fuse/famfs
> +kernel code.
> +
> +Once instantiated, "the same file" on each host points to the same shared
> +memory, but in-memory metadata (inodes, etc.) is ephemeral on each host
> +that has a famfs instance mounted. Use cases are free to allow or not
> +allow mutations to data on a file-by-file basis.
> +
> +When an app accesses a data object in a famfs file, there is no page cache
> +involvement. The CPU cache is loaded directly from the shared memory. In
> +some use cases, this is an enormous reduction in read amplification compared
> +to loading an entire page into the page cache.
> +
> +
> +Famfs is Not a Conventional File System
> +---------------------------------------
> +
> +Famfs files can be accessed by conventional means, but there are
> +limitations. The kernel component of fuse/famfs is not involved in the
> +allocation of backing memory for files at all; the famfs user space
> +creates files and responds as a low-level fuse server with fmaps and
> +devdax device info upon request.
> +
> +Famfs differs in some important ways from conventional file systems:
> +
> +* Files must be pre-allocated by the famfs framework; Allocation is never

                                                         allocation

> +  performed on (or after) write.
> +* Any operation that changes a file's size is considered to put the file
> +  in an invalid state, disabling access to the data. It may be possible to
> +  revisit this in the future. (Typically the famfs user space can restore
> +  files to a valid state by replaying the famfs metadata log.)
> +
> +Famfs exists to apply the existing file system abstractions to shared
> +memory so applications and workflows can more easily adapt to an
> +environment with disaggregated shared memory.


-- 
~Randy



* Re: [RFC PATCH 00/19] famfs: port into fuse
  2025-04-22  1:25     ` Darrick J. Wong
@ 2025-04-22 11:50       ` John Groves
  0 siblings, 0 replies; 58+ messages in thread
From: John Groves @ 2025-04-22 11:50 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Dan Williams, Miklos Szeredi, Bernd Schubert, John Groves,
	Jonathan Corbet, Vishal Verma, Dave Jiang, Matthew Wilcox,
	Jan Kara, Alexander Viro, Christian Brauner, Luis Henriques,
	Randy Dunlap, Jeff Layton, Kent Overstreet, Petr Vorel,
	Brian Foster, linux-doc, linux-kernel, nvdimm, linux-cxl,
	linux-fsdevel, Amir Goldstein, Jonathan Cameron, Stefan Hajnoczi,
	Joanne Koong, Josef Bacik, Aravind Ramesh, Ajay Joshi

On 25/04/21 06:25PM, Darrick J. Wong wrote:
> On Mon, Apr 21, 2025 at 05:00:35PM -0500, John Groves wrote:
> > On 25/04/21 11:27AM, Darrick J. Wong wrote:
> > > On Sun, Apr 20, 2025 at 08:33:27PM -0500, John Groves wrote:
> > > > Subject: famfs: port into fuse
> > > > 
> > > > This is the initial RFC for the fabric-attached memory file system (famfs)
> > > > integration into fuse. In order to function, this requires a related patch
> > > > to libfuse [1] and the famfs user space [2]. 
> > > > 
> > > > This RFC is mainly intended to socialize the approach and get feedback from
> > > > the fuse developers and maintainers. There is some dax work that needs to
> > > > be done before this should be merged (see the "poisoned page|folio problem"
> > > > below).
> > > 
> > > Note that I'm only looking at the fuse and iomap aspects of this
> > > patchset.  I don't know the devdax code at all.
> > > 
> > > > This patch set fully works with Linux 6.14 -- passing all existing famfs
> > > > smoke and unit tests -- and I encourage existing famfs users to test it.
> > > > 
> > > > This is really two patch sets mashed up:
> > > > 
> > > > * The patches with the dev_dax_iomap: prefix fill in missing functionality for
> > > >   devdax to host an fs-dax file system.
> > > > * The famfs_fuse: patches add famfs into fs/fuse/. These are effectively
> > > >   unchanged since last year.
> > > > 
> > > > Because this is not ready to merge yet, I have felt free to leave some debug
> > > > prints in place because we still find them useful; those will be cleaned up
> > > > in a subsequent revision.
> > > > 
> > > > Famfs Overview
> > > > 
> > > > Famfs exposes shared memory as a file system. Famfs consumes shared memory
> > > > from dax devices, and provides memory-mappable files that map directly to
> > > > the memory - no page cache involvement. Famfs differs from conventional
> > > > file systems in fs-dax mode, in that it handles in-memory metadata in a
> > > > sharable way (which begins with never caching dirty shared metadata).
> > > > 
> > > > Famfs started as a standalone file system [3,4], but the consensus at LSFMM
> > > > 2024 [5] was that it should be ported into fuse - and this RFC is the first
> > > > public evidence that I've been working on that.
> > > 
> > > This is very timely, as I just started looking into how I might connect
> > > iomap to fuse so that most of the hot IO path continues to run in the
> > > kernel, and userspace block device filesystem drivers merely supply the
> > > file mappings to the kernel.  In other words, we kick the metadata
> > > parsing craziness out of the kernel.
> > 
> > Coool!
> > 
> > > 
> > > > The key performance requirement is that famfs must resolve mapping faults
> > > > without upcalls. This is achieved by fully caching the file-to-devdax
> > > > metadata for all active files. This is done via two fuse client/server
> > > > message/response pairs: GET_FMAP and GET_DAXDEV.
> > > 
> > > Heh, just last week I finally got around to laying out how I think I'd
> > > want to expose iomap through fuse to allow ->iomap_begin/->iomap_end
> > > upcalls to a fuse server.  Note that I've done zero prototyping but
> > > "upload all the mappings at open time" seems like a reasonable place for
> > > me to start looking, especially for a filesystem with static mappings.
> > > 
> > > I think what I want to try to build is an in-kernel mapping cache (sort
> > > of like the one you built), only with upcalls to the fuse server when
> > > there is no mapping information for a given IO.  I'd probably want to
> > > have a means for the fuse server to put new mappings into the cache, or
> > > invalidate existing mappings.
> > > 
> > > (famfs obviously is a simple corner-case of that grandiose vision, but I
> > > still have a long way to get to my larger vision so don't take my words
> > > as any kind of requirement.)
> > > 
> > > > Famfs remains the first fs-dax file system that is backed by devdax rather
> > > > than pmem in fs-dax mode (hence the need for the dev_dax_iomap fixups).
> > > > 
> > > > Notes
> > > > 
> > > > * Once the dev_dax_iomap patches land, I suspect it may make sense for
> > > >   virtiofs to update to use the improved interface.
> > > > 
> > > > * I'm currently maintaining compatibility between the famfs user space and
> > > >   both the standalone famfs kernel file system and this new fuse
> > > >   implementation. In the near future I'll be running performance comparisons
> > > >   and sharing them - but there is no reason to expect significant degradation
> > > >   with fuse, since famfs caches entire "fmaps" in the kernel to resolve
> > > 
> > > I'm curious to hear what you find, performance-wise. :)
> > > 
> > > >   faults with no upcalls. This patch has a bit too much debug turned on to
> > > > >   do that testing quite yet. A branch 
> > > 
> > > A branch ... what?
> > 
> > I trail off sometimes... ;)
> > 
> > > 
> > > > * Two new fuse messages / responses are added: GET_FMAP and GET_DAXDEV.
> > > > 
> > > > * When a file is looked up in a famfs mount, the LOOKUP is followed by a
> > > >   GET_FMAP message and response. The "fmap" is the full file-to-dax mapping,
> > > >   allowing the fuse/famfs kernel code to handle read/write/fault without any
> > > >   upcalls.
> > > 
> > > Huh, I'd have thought you'd wait until FUSE_OPEN to start preloading
> > > mappings into the kernel.
> > 
> > That may be a better approach. Miklos and I discussed it during LPC last year, 
> > and thought both were options. Having implemented it at LOOKUP time, I think
> > > moving it to open might avoid my READDIRPLUS problem (which is that RDP is a
> > > mashup of READDIR and LOOKUP, and therefore might need to carry the GET_FMAP
> > > payload). Moving GET_FMAP to open time would break that connection in a good
> > > way, I think.
> 
> I wonder if we could just add a couple new "notification" types so that
> the fuse server can initiate uploads of mappings whenever it feels like
> it.  For your usage model I don't think it'll make much difference since
> they seem pretty static, but the ability to do that would open up some
> flexibility for famfs.  The more general filesystems will need it
> anyway, and someone's going to want to truncate a famfs file.  They
> always do. ;)
> 
> > > 
> > > > * After each GET_FMAP, the fmap is checked for extents that reference
> > > > >   previously-unknown daxdevs. Each such occurrence is handled with a
> > > >   GET_DAXDEV message and response.
> > > 
> > > I hadn't figured out how this part would work for my silly prototype.
> > > Just out of curiosity, does the famfs fuse server hold an open fd to the
> > > storage, in which case the fmap(ping) could just contain the open fd?
> > > 
> > > Where are the mappings that are sent from the fuse server?  Is that
> > > struct fuse_famfs_simple_ext?
> > 
> > See patch 17 or fs/fuse/famfs_kfmap.h for the fmap metadata explanation. 
> > Famfs currently supports either simple extents (daxdev, offset, length) or 
> > interleaved ones (which describe each "strip" as a simple extent). I think 
> > the explanation in famfs_kfmap.h is pretty clear.
> > 
> > A key question is whether any additional basic metadata abstractions would
> > be needed - because the kernel needs to understand the full scheme.
> > 
> > With disaggregated memory, the interleave approach is nice because it gets
> > aggregated performance and resolving a file offset to daxdev offset is order
> > 1.
> > 
> > Oh, and there are two fmap formats (ok, more, but the others are legacy ;).
> > The fmaps-in-messages structs are currently in the famfs section of
> > include/uapi/linux/fuse.h. And the in-memory version is in 
> > fs/fuse/famfs_kfmap.h. The former will need to be a versioned interface.
> > (ugh...)
> 
> Ok, will take a look tomorrow morning.
> 
> > > 
> > > > * Daxdevs are stored in a table (which might become an xarray at some point).
> > > >   When entries are added to the table, we acquire exclusive access to the
> > > >   daxdev via the fs_dax_get() call (modeled after how fs-dax handles this
> > > >   with pmem devices). famfs provides holder_operations to devdax, providing
> > > >   a notification path in the event of memory errors.
> > > > 
> > > > * If devdax notifies famfs of memory errors on a dax device, famfs currently
> > > > >   blocks all subsequent accesses to data on that device. The recovery is to
> > > >   re-initialize the memory and file system. Famfs is memory, not storage...
> > > 
> > > Ouch. :)
> > 
> > Cautious initial approach (i.e. I'm trying not to scare people too much ;) 
> > 
> > > 
> > > > * Because famfs uses backing (devdax) devices, only privileged mounts are
> > > >   supported.
> > > > 
> > > > * The famfs kernel code never accesses the memory directly - it only
> > > >   facilitates read, write and mmap on behalf of user processes. As such,
> > > >   the RAS of the shared memory affects applications, but not the kernel.
> > > > 
> > > > * Famfs has backing device(s), but they are devdax (char) rather than
> > > >   block. Right now there is no way to tell the vfs layer that famfs has a
> > > >   char backing device (unless we say it's block, but it's not). Currently
> > > >   we use the standard anonymous fuse fs_type - but I'm not sure that's
> > > >   ultimately optimal (thoughts?)
> > > 
> > > Does it work if the fusefs server adds "-o fsname=<devdax cdev>" to the
> > > fuse_args object?  fuse2fs does that, though I don't recall if that's a
> > > reasonable thing to do.
> > 
> > The kernel needs to "own" the dax devices. fs-dax on pmem/block calls
> > fs_dax_get_by_bdev() and passes in holder_operations - which are used for
> > error upcalls, but also effect exclusive ownership. 
> > 
> > > I added fs_dax_get() since the bdev version wasn't really right for char
> > > devdax. But same holder_operations.
> > 
> > I had originally intended to pass in "-o daxdev=<cdev>", but famfs needs to
> > span multiple daxdevs, in order to interleave for performance. The approach
> > of retrieving them with GET_DAXDEV handles the generalized case, so "-o"
> > just amounts to a second way to do the same thing.
> 
> Oh, hah, it's a multi-device filesystem.  Hee hee hee...

Hee hee indeed. The thing about memory, and dax devices, is that there
isn't anything like device mapper that can make compound or interleaved
devices. There's not a "stop while dma happens" point for swizzling 
addresses. I'm down for a discussion about whether there is a viable way 
to have a mapper layer, but I also think constructing interleaved objects 
as files is quite good - and might be the best solution.

Interleaving is essential to memory performance in general. System-ram is
pretty much never not interleaved. And there are some reasons why programming
the hardware to do the interleaving is gonna be a problem for non-static 
setups. I'll save going down that rathole for a different time...
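
For readers who want the interleave arithmetic spelled out: the sketch below
resolves a file offset to a (daxdev, device offset) pair in constant time,
treating each strip as a simple extent as described earlier in the thread.
It is only an illustration. The struct and function names are hypothetical
and are not the definitions in fs/fuse/famfs_kfmap.h, and it assumes one
interleaved extent with a fixed chunk size and equally sized strips.

```c
#include <stdint.h>

/* Hypothetical layout for illustration; NOT the famfs_kfmap.h structs. */
struct simple_ext {
	int      dev_index;   /* index into the daxdev table */
	uint64_t dev_offset;  /* where this strip starts on the daxdev */
	uint64_t len;         /* length of this strip in bytes */
};

struct interleaved_ext {
	uint64_t chunk_size;              /* bytes written to a strip per turn */
	uint64_t nstrips;                 /* number of strips in the stripe */
	const struct simple_ext *strips;  /* one simple extent per strip */
};

struct resolved {
	int      dev_index;
	uint64_t dev_offset;
};

/*
 * Resolve file_off with a fixed number of divisions: no search, no loop.
 * Returns 0 on success, -1 if the offset falls outside the extent.
 */
static int resolve_offset(const struct interleaved_ext *ie, uint64_t file_off,
			  struct resolved *out)
{
	uint64_t chunk, in_chunk, strip, stripe, strip_off;
	const struct simple_ext *se;

	if (ie->chunk_size == 0 || ie->nstrips == 0)
		return -1;

	chunk     = file_off / ie->chunk_size;   /* global chunk number */
	in_chunk  = file_off % ie->chunk_size;   /* offset inside that chunk */
	strip     = chunk % ie->nstrips;         /* which strip holds it */
	stripe    = chunk / ie->nstrips;         /* which turn of the rotation */
	strip_off = stripe * ie->chunk_size + in_chunk;

	se = &ie->strips[strip];
	if (strip_off >= se->len)
		return -1;

	out->dev_index  = se->dev_index;
	out->dev_offset = se->dev_offset + strip_off;
	return 0;
}
```

Because consecutive chunks land on different strips, a large sequential
access is spread across all of the daxdevs, which is where the aggregated
bandwidth comes from.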

John


* Re: [RFC PATCH 10/19] famfs_fuse: Basic fuse kernel ABI enablement for famfs
  2025-04-21  1:33 ` [RFC PATCH 10/19] famfs_fuse: Basic fuse kernel ABI enablement for famfs John Groves
@ 2025-04-23  1:36   ` Joanne Koong
  2025-04-23 20:23     ` John Groves
  0 siblings, 1 reply; 58+ messages in thread
From: Joanne Koong @ 2025-04-23  1:36 UTC (permalink / raw)
  To: John Groves
  Cc: Dan Williams, Miklos Szeredi, Bernd Schubert, John Groves,
	Jonathan Corbet, Vishal Verma, Dave Jiang, Matthew Wilcox,
	Jan Kara, Alexander Viro, Christian Brauner, Darrick J . Wong,
	Luis Henriques, Randy Dunlap, Jeff Layton, Kent Overstreet,
	Petr Vorel, Brian Foster, linux-doc, linux-kernel, nvdimm,
	linux-cxl, linux-fsdevel, Amir Goldstein, Jonathan Cameron,
	Stefan Hajnoczi, Josef Bacik, Aravind Ramesh, Ajay Joshi

On Sun, Apr 20, 2025 at 6:34 PM John Groves <John@groves.net> wrote:
>
> * FUSE_DAX_FMAP flag in INIT request/reply
>
> * fuse_conn->famfs_iomap (enable famfs-mapped files) to denote a
>   famfs-enabled connection
>
> Signed-off-by: John Groves <john@groves.net>
> ---
>  fs/fuse/fuse_i.h          | 3 +++
>  fs/fuse/inode.c           | 5 +++++
>  include/uapi/linux/fuse.h | 2 ++
>  3 files changed, 10 insertions(+)
>
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index e04d160fa995..b2c563b1a1c8 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -870,6 +870,9 @@ struct fuse_conn {
>         /* Use io_uring for communication */
>         unsigned int io_uring;
>
> +       /* dev_dax_iomap support for famfs */
> +       unsigned int famfs_iomap:1;
> +
>         /** Maximum stack depth for passthrough backing files */
>         int max_stack_depth;
>
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index 29147657a99f..5c6947b12503 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -1392,6 +1392,9 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
>                         }
>                         if (flags & FUSE_OVER_IO_URING && fuse_uring_enabled())
>                                 fc->io_uring = 1;
> +                       if (IS_ENABLED(CONFIG_FUSE_FAMFS_DAX) &&
> +                                      flags & FUSE_DAX_FMAP)
> +                               fc->famfs_iomap = 1;
>                 } else {
>                         ra_pages = fc->max_read / PAGE_SIZE;
>                         fc->no_lock = 1;
> @@ -1450,6 +1453,8 @@ void fuse_send_init(struct fuse_mount *fm)
>                 flags |= FUSE_SUBMOUNTS;
>         if (IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
>                 flags |= FUSE_PASSTHROUGH;
> +       if (IS_ENABLED(CONFIG_FUSE_FAMFS_DAX))
> +               flags |= FUSE_DAX_FMAP;
>
>         /*
>          * This is just an information flag for fuse server. No need to check
> diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> index 5e0eb41d967e..f9e14180367a 100644
> --- a/include/uapi/linux/fuse.h
> +++ b/include/uapi/linux/fuse.h
> @@ -435,6 +435,7 @@ struct fuse_file_lock {
>   *                 of the request ID indicates resend requests
>   * FUSE_ALLOW_IDMAP: allow creation of idmapped mounts
>   * FUSE_OVER_IO_URING: Indicate that client supports io-uring
> + * FUSE_DAX_FMAP: kernel supports dev_dax_iomap (aka famfs) fmaps
>   */
>  #define FUSE_ASYNC_READ                (1 << 0)
>  #define FUSE_POSIX_LOCKS       (1 << 1)
> @@ -482,6 +483,7 @@ struct fuse_file_lock {
>  #define FUSE_DIRECT_IO_RELAX   FUSE_DIRECT_IO_ALLOW_MMAP
>  #define FUSE_ALLOW_IDMAP       (1ULL << 40)
>  #define FUSE_OVER_IO_URING     (1ULL << 41)
> +#define FUSE_DAX_FMAP          (1ULL << 42)

There's also a protocol changelog at the top of this file that tracks
any updates made to the uapi. We should probably also update that to
include this?
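
For readers following the flag itself, here is a small user-space sketch of
the negotiation the hunks above implement: the kernel advertises
FUSE_DAX_FMAP in the INIT request only when famfs support is built in, and
marks the connection famfs-enabled only when the server echoes the flag
back. The bit value is taken from the patch; the struct and function names
are hypothetical stand-ins, and the flags are modeled as a single 64-bit
word for simplicity.

```c
#include <stdbool.h>
#include <stdint.h>

#define FUSE_DAX_FMAP (1ULL << 42)  /* bit value as proposed in the patch */

/* Hypothetical stand-in for the one fuse_conn field the patch adds. */
struct conn_sketch {
	unsigned int famfs_iomap:1;
};

/* Kernel side of INIT: advertise the feature only if it is built in. */
static uint64_t init_request_flags(bool famfs_built_in)
{
	uint64_t flags = 0;

	if (famfs_built_in)
		flags |= FUSE_DAX_FMAP;
	return flags;
}

/* Kernel side of the INIT reply: enable only if both sides agree. */
static void process_init_reply_sketch(struct conn_sketch *fc,
				      bool famfs_built_in,
				      uint64_t reply_flags)
{
	fc->famfs_iomap = (famfs_built_in && (reply_flags & FUSE_DAX_FMAP)) ? 1 : 0;
}
```

A server that does not know about the flag simply never sets it in its
reply, so the connection behaves like any other fuse mount.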


Thanks,
Joanne
>
>  /**
>   * CUSE INIT request/reply flags
> --
> 2.49.0
>


* Re: [RFC PATCH 11/19] famfs_fuse: Basic famfs mount opts
  2025-04-21  1:33 ` [RFC PATCH 11/19] famfs_fuse: Basic famfs mount opts John Groves
@ 2025-04-23  1:51   ` Joanne Koong
  2025-04-23 20:19     ` John Groves
  0 siblings, 1 reply; 58+ messages in thread
From: Joanne Koong @ 2025-04-23  1:51 UTC (permalink / raw)
  To: John Groves
  Cc: Dan Williams, Miklos Szeredi, Bernd Schubert, John Groves,
	Jonathan Corbet, Vishal Verma, Dave Jiang, Matthew Wilcox,
	Jan Kara, Alexander Viro, Christian Brauner, Darrick J . Wong,
	Luis Henriques, Randy Dunlap, Jeff Layton, Kent Overstreet,
	Petr Vorel, Brian Foster, linux-doc, linux-kernel, nvdimm,
	linux-cxl, linux-fsdevel, Amir Goldstein, Jonathan Cameron,
	Stefan Hajnoczi, Josef Bacik, Aravind Ramesh, Ajay Joshi

On Sun, Apr 20, 2025 at 6:34 PM John Groves <John@groves.net> wrote:
>
> * -o shadow=<shadowpath>
> * -o daxdev=<daxdev>
>
> Signed-off-by: John Groves <john@groves.net>
> ---
>  fs/fuse/fuse_i.h |  8 +++++++-
>  fs/fuse/inode.c  | 25 ++++++++++++++++++++++++-
>  2 files changed, 31 insertions(+), 2 deletions(-)
>
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index b2c563b1a1c8..931613102d32 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -580,9 +580,11 @@ struct fuse_fs_context {
>         unsigned int blksize;
>         const char *subtype;
>
> -       /* DAX device, may be NULL */
> +       /* DAX device for virtiofs, may be NULL */
>         struct dax_device *dax_dev;
>
> +       const char *shadow; /* famfs - null if not famfs */
> +
>         /* fuse_dev pointer to fill in, should contain NULL on entry */
>         void **fudptr;
>  };
> @@ -938,6 +940,10 @@ struct fuse_conn {
>         /**  uring connection information*/
>         struct fuse_ring *ring;
>  #endif
> +
> +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> +       char *shadow;
> +#endif
>  };
>
>  /*
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index 5c6947b12503..7f4b73e739cb 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -766,6 +766,9 @@ enum {
>         OPT_ALLOW_OTHER,
>         OPT_MAX_READ,
>         OPT_BLKSIZE,
> +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> +       OPT_SHADOW,
> +#endif
>         OPT_ERR
>  };
>
> @@ -780,6 +783,9 @@ static const struct fs_parameter_spec fuse_fs_parameters[] = {
>         fsparam_u32     ("max_read",            OPT_MAX_READ),
>         fsparam_u32     ("blksize",             OPT_BLKSIZE),
>         fsparam_string  ("subtype",             OPT_SUBTYPE),
> +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> +       fsparam_string("shadow",                OPT_SHADOW),
> +#endif
>         {}
>  };
>
> @@ -875,6 +881,15 @@ static int fuse_parse_param(struct fs_context *fsc, struct fs_parameter *param)
>                 ctx->blksize = result.uint_32;
>                 break;
>
> +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> +       case OPT_SHADOW:
> +               if (ctx->shadow)
> +                       return invalfc(fsc, "Multiple shadows specified");
> +               ctx->shadow = param->string;
> +               param->string = NULL;
> +               break;
> +#endif
> +
>         default:
>                 return -EINVAL;
>         }
> @@ -888,6 +903,7 @@ static void fuse_free_fsc(struct fs_context *fsc)
>
>         if (ctx) {
>                 kfree(ctx->subtype);
> +               kfree(ctx->shadow);
>                 kfree(ctx);
>         }
>  }
> @@ -919,7 +935,10 @@ static int fuse_show_options(struct seq_file *m, struct dentry *root)
>         else if (fc->dax_mode == FUSE_DAX_INODE_USER)
>                 seq_puts(m, ",dax=inode");
>  #endif
> -
> +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> +       if (fc->shadow)
> +               seq_printf(m, ",shadow=%s", fc->shadow);
> +#endif
>         return 0;
>  }
>
> @@ -1825,6 +1844,10 @@ int fuse_fill_super_common(struct super_block *sb, struct fuse_fs_context *ctx)
>         sb->s_root = root_dentry;
>         if (ctx->fudptr)
>                 *ctx->fudptr = fud;
> +
> +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> +       fc->shadow = kstrdup(ctx->shadow, GFP_KERNEL);
> +#endif

Since this is kstrdup-ed, I think you meant to also kfree this in
fuse_conn_put() when the last refcount on fc gets dropped?
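
To illustrate the ownership rule being asked for, here is a user-space
analogue: a refcounted object that owns a duplicated string must free that
string when the last reference is dropped. This is a sketch only. The names
are hypothetical, malloc/free stand in for kstrdup/kfree, and the plain
counter is not the kernel's refcount_t.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the relevant part of struct fuse_conn. */
struct conn_sketch {
	int   refcount;
	char *shadow;   /* owned copy of the mount option, may be NULL */
};

static int conns_released;  /* illustration only: counts final teardowns */

/* Duplicate a string the way kstrdup() would; NULL in, NULL out. */
static char *dup_string(const char *s)
{
	size_t n;
	char *copy;

	if (!s)
		return NULL;
	n = strlen(s) + 1;
	copy = malloc(n);
	if (copy)
		memcpy(copy, s, n);
	return copy;
}

static struct conn_sketch *conn_new(const char *shadow)
{
	struct conn_sketch *fc = calloc(1, sizeof(*fc));

	if (!fc)
		return NULL;
	fc->refcount = 1;
	fc->shadow = dup_string(shadow);
	return fc;
}

static void conn_get(struct conn_sketch *fc)
{
	fc->refcount++;
}

/* Dropping the last reference must release everything the object owns. */
static void conn_put(struct conn_sketch *fc)
{
	if (--fc->refcount > 0)
		return;
	free(fc->shadow);   /* the step the review comment asks for */
	free(fc);
	conns_released++;
}
```

Without the free of the duplicated string in the final put, every mount that
passes a shadow option would leak that string at unmount.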


Thanks,
Joanne

>         mutex_unlock(&fuse_mutex);
>         return 0;
>
> --
> 2.49.0
>


* Re: [RFC PATCH 11/19] famfs_fuse: Basic famfs mount opts
  2025-04-23  1:51   ` Joanne Koong
@ 2025-04-23 20:19     ` John Groves
  0 siblings, 0 replies; 58+ messages in thread
From: John Groves @ 2025-04-23 20:19 UTC (permalink / raw)
  To: Joanne Koong
  Cc: Dan Williams, Miklos Szeredi, Bernd Schubert, John Groves,
	Jonathan Corbet, Vishal Verma, Dave Jiang, Matthew Wilcox,
	Jan Kara, Alexander Viro, Christian Brauner, Darrick J . Wong,
	Luis Henriques, Randy Dunlap, Jeff Layton, Kent Overstreet,
	Petr Vorel, Brian Foster, linux-doc, linux-kernel, nvdimm,
	linux-cxl, linux-fsdevel, Amir Goldstein, Jonathan Cameron,
	Stefan Hajnoczi, Josef Bacik, Aravind Ramesh, Ajay Joshi

On 25/04/22 06:51PM, Joanne Koong wrote:
> On Sun, Apr 20, 2025 at 6:34 PM John Groves <John@groves.net> wrote:
> >
> > * -o shadow=<shadowpath>
> > * -o daxdev=<daxdev>
> >
> > Signed-off-by: John Groves <john@groves.net>
> > ---
> >  fs/fuse/fuse_i.h |  8 +++++++-
> >  fs/fuse/inode.c  | 25 ++++++++++++++++++++++++-
> >  2 files changed, 31 insertions(+), 2 deletions(-)
> >
> > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> > index b2c563b1a1c8..931613102d32 100644
> > --- a/fs/fuse/fuse_i.h
> > +++ b/fs/fuse/fuse_i.h
> > @@ -580,9 +580,11 @@ struct fuse_fs_context {
> >         unsigned int blksize;
> >         const char *subtype;
> >
> > -       /* DAX device, may be NULL */
> > +       /* DAX device for virtiofs, may be NULL */
> >         struct dax_device *dax_dev;
> >
> > +       const char *shadow; /* famfs - null if not famfs */
> > +
> >         /* fuse_dev pointer to fill in, should contain NULL on entry */
> >         void **fudptr;
> >  };
> > @@ -938,6 +940,10 @@ struct fuse_conn {
> >         /**  uring connection information*/
> >         struct fuse_ring *ring;
> >  #endif
> > +
> > +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> > +       char *shadow;
> > +#endif
> >  };
> >
> >  /*
> > diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> > index 5c6947b12503..7f4b73e739cb 100644
> > --- a/fs/fuse/inode.c
> > +++ b/fs/fuse/inode.c
> > @@ -766,6 +766,9 @@ enum {
> >         OPT_ALLOW_OTHER,
> >         OPT_MAX_READ,
> >         OPT_BLKSIZE,
> > +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> > +       OPT_SHADOW,
> > +#endif
> >         OPT_ERR
> >  };
> >
> > @@ -780,6 +783,9 @@ static const struct fs_parameter_spec fuse_fs_parameters[] = {
> >         fsparam_u32     ("max_read",            OPT_MAX_READ),
> >         fsparam_u32     ("blksize",             OPT_BLKSIZE),
> >         fsparam_string  ("subtype",             OPT_SUBTYPE),
> > +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> > +       fsparam_string("shadow",                OPT_SHADOW),
> > +#endif
> >         {}
> >  };
> >
> > @@ -875,6 +881,15 @@ static int fuse_parse_param(struct fs_context *fsc, struct fs_parameter *param)
> >                 ctx->blksize = result.uint_32;
> >                 break;
> >
> > +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> > +       case OPT_SHADOW:
> > +               if (ctx->shadow)
> > +                       return invalfc(fsc, "Multiple shadows specified");
> > +               ctx->shadow = param->string;
> > +               param->string = NULL;
> > +               break;
> > +#endif
> > +
> >         default:
> >                 return -EINVAL;
> >         }
> > @@ -888,6 +903,7 @@ static void fuse_free_fsc(struct fs_context *fsc)
> >
> >         if (ctx) {
> >                 kfree(ctx->subtype);
> > +               kfree(ctx->shadow);
> >                 kfree(ctx);
> >         }
> >  }
> > @@ -919,7 +935,10 @@ static int fuse_show_options(struct seq_file *m, struct dentry *root)
> >         else if (fc->dax_mode == FUSE_DAX_INODE_USER)
> >                 seq_puts(m, ",dax=inode");
> >  #endif
> > -
> > +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> > +       if (fc->shadow)
> > +               seq_printf(m, ",shadow=%s", fc->shadow);
> > +#endif
> >         return 0;
> >  }
> >
> > @@ -1825,6 +1844,10 @@ int fuse_fill_super_common(struct super_block *sb, struct fuse_fs_context *ctx)
> >         sb->s_root = root_dentry;
> >         if (ctx->fudptr)
> >                 *ctx->fudptr = fud;
> > +
> > +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> > +       fc->shadow = kstrdup(ctx->shadow, GFP_KERNEL);
> > +#endif
> 
> Since this is kstrdup-ed, I think you meant to also kfree this in
> fuse_conn_put() when the last refcount on fc gets dropped?

Good catch Joanne! That's queued in my "-next" branch.

Thanks,
John
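
For reference, the ownership rule Joanne describes can be modeled in plain
userspace C. This is only a sketch: ctx_model, conn_model, dup_or_null(),
conn_init() and conn_put() are hypothetical stand-ins for fuse_fs_context,
fuse_conn, kstrdup(), fuse_fill_super_common() and fuse_conn_put(), and the
allocation-failure check in conn_init() is an addition that the quoted hunk
does not have.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for fuse_fs_context and fuse_conn. */
struct ctx_model  { char *shadow; };
struct conn_model { int refcount; char *shadow; };

/* Duplicate a possibly-NULL string; NULL in gives NULL out, as with kstrdup(). */
static char *dup_or_null(const char *s)
{
	char *copy;

	if (!s)
		return NULL;
	copy = malloc(strlen(s) + 1);
	if (copy)
		strcpy(copy, s);
	return copy;
}

/* Models the fill_super step: the connection takes its own private copy. */
static int conn_init(struct conn_model *fc, const struct ctx_model *ctx)
{
	fc->refcount = 1;
	fc->shadow = dup_or_null(ctx->shadow);
	if (ctx->shadow && !fc->shadow)
		return -1;	/* allocation failed */
	return 0;
}

/* Models the final put: the copy is freed only when the last ref drops. */
static int conn_put(struct conn_model *fc)
{
	if (--fc->refcount > 0)
		return 0;
	free(fc->shadow);
	fc->shadow = NULL;
	return 1;	/* connection state released */
}
```

The context string and the connection string have separate lifetimes, so
each needs its own free: the first in the fs_context teardown, the second
when the connection's last reference goes away.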


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 10/19] famfs_fuse: Basic fuse kernel ABI enablement for famfs
  2025-04-23  1:36   ` Joanne Koong
@ 2025-04-23 20:23     ` John Groves
  0 siblings, 0 replies; 58+ messages in thread
From: John Groves @ 2025-04-23 20:23 UTC (permalink / raw)
  To: Joanne Koong
  Cc: Dan Williams, Miklos Szeredi, Bernd Schubert, John Groves,
	Jonathan Corbet, Vishal Verma, Dave Jiang, Matthew Wilcox,
	Jan Kara, Alexander Viro, Christian Brauner, Darrick J . Wong,
	Luis Henriques, Randy Dunlap, Jeff Layton, Kent Overstreet,
	Petr Vorel, Brian Foster, linux-doc, linux-kernel, nvdimm,
	linux-cxl, linux-fsdevel, Amir Goldstein, Jonathan Cameron,
	Stefan Hajnoczi, Josef Bacik, Aravind Ramesh, Ajay Joshi

On 25/04/22 06:36PM, Joanne Koong wrote:
> On Sun, Apr 20, 2025 at 6:34 PM John Groves <John@groves.net> wrote:
> >
> > * FUSE_DAX_FMAP flag in INIT request/reply
> >
> > * fuse_conn->famfs_iomap (enable famfs-mapped files) to denote a
> >   famfs-enabled connection
> >
> > Signed-off-by: John Groves <john@groves.net>
> > ---
> >  fs/fuse/fuse_i.h          | 3 +++
> >  fs/fuse/inode.c           | 5 +++++
> >  include/uapi/linux/fuse.h | 2 ++
> >  3 files changed, 10 insertions(+)
> >
> > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> > index e04d160fa995..b2c563b1a1c8 100644
> > --- a/fs/fuse/fuse_i.h
> > +++ b/fs/fuse/fuse_i.h
> > @@ -870,6 +870,9 @@ struct fuse_conn {
> >         /* Use io_uring for communication */
> >         unsigned int io_uring;
> >
> > +       /* dev_dax_iomap support for famfs */
> > +       unsigned int famfs_iomap:1;
> > +
> >         /** Maximum stack depth for passthrough backing files */
> >         int max_stack_depth;
> >
> > diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> > index 29147657a99f..5c6947b12503 100644
> > --- a/fs/fuse/inode.c
> > +++ b/fs/fuse/inode.c
> > @@ -1392,6 +1392,9 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
> >                         }
> >                         if (flags & FUSE_OVER_IO_URING && fuse_uring_enabled())
> >                                 fc->io_uring = 1;
> > +                       if (IS_ENABLED(CONFIG_FUSE_FAMFS_DAX) &&
> > +                                      flags & FUSE_DAX_FMAP)
> > +                               fc->famfs_iomap = 1;
> >                 } else {
> >                         ra_pages = fc->max_read / PAGE_SIZE;
> >                         fc->no_lock = 1;
> > @@ -1450,6 +1453,8 @@ void fuse_send_init(struct fuse_mount *fm)
> >                 flags |= FUSE_SUBMOUNTS;
> >         if (IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
> >                 flags |= FUSE_PASSTHROUGH;
> > +       if (IS_ENABLED(CONFIG_FUSE_FAMFS_DAX))
> > +               flags |= FUSE_DAX_FMAP;
> >
> >         /*
> >          * This is just an information flag for fuse server. No need to check
> > diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> > index 5e0eb41d967e..f9e14180367a 100644
> > --- a/include/uapi/linux/fuse.h
> > +++ b/include/uapi/linux/fuse.h
> > @@ -435,6 +435,7 @@ struct fuse_file_lock {
> >   *                 of the request ID indicates resend requests
> >   * FUSE_ALLOW_IDMAP: allow creation of idmapped mounts
> >   * FUSE_OVER_IO_URING: Indicate that client supports io-uring
> > + * FUSE_DAX_FMAP: kernel supports dev_dax_iomap (aka famfs) fmaps
> >   */
> >  #define FUSE_ASYNC_READ                (1 << 0)
> >  #define FUSE_POSIX_LOCKS       (1 << 1)
> > @@ -482,6 +483,7 @@ struct fuse_file_lock {
> >  #define FUSE_DIRECT_IO_RELAX   FUSE_DIRECT_IO_ALLOW_MMAP
> >  #define FUSE_ALLOW_IDMAP       (1ULL << 40)
> >  #define FUSE_OVER_IO_URING     (1ULL << 41)
> > +#define FUSE_DAX_FMAP          (1ULL << 42)
> 
> There's also a protocol changelog at the top of this file that tracks
> any updates made to the uapi. We should probably also update that to
> include this?

Another good catch, thanks Joanne! Adding that in -next

John


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 13/19] famfs_fuse: Create files with famfs fmaps
  2025-04-21  1:33 ` [RFC PATCH 13/19] famfs_fuse: Create files with famfs fmaps John Groves
  2025-04-21 21:57   ` Darrick J. Wong
@ 2025-04-24 13:43   ` John Groves
  2025-04-24 14:38     ` Darrick J. Wong
  1 sibling, 1 reply; 58+ messages in thread
From: John Groves @ 2025-04-24 13:43 UTC (permalink / raw)
  To: Dan Williams, Miklos Szeredi, Bernd Schubert
  Cc: John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Darrick J . Wong, Luis Henriques, Randy Dunlap, Jeff Layton,
	Kent Overstreet, Petr Vorel, Brian Foster, linux-doc,
	linux-kernel, nvdimm, linux-cxl, linux-fsdevel, Amir Goldstein,
	Jonathan Cameron, Stefan Hajnoczi, Joanne Koong, Josef Bacik,
	Aravind Ramesh, Ajay Joshi

On 25/04/20 08:33PM, John Groves wrote:
> On completion of GET_FMAP message/response, setup the full famfs
> metadata such that it's possible to handle read/write/mmap directly to
> dax. Note that the devdax_iomap plumbing is not in yet...
> 
> Update MAINTAINERS for the new files.
> 
> Signed-off-by: John Groves <john@groves.net>
> ---
>  MAINTAINERS               |   9 +
>  fs/fuse/Makefile          |   2 +-
>  fs/fuse/dir.c             |   3 +
>  fs/fuse/famfs.c           | 344 ++++++++++++++++++++++++++++++++++++++
>  fs/fuse/famfs_kfmap.h     |  63 +++++++
>  fs/fuse/fuse_i.h          |  16 +-
>  fs/fuse/inode.c           |   2 +-
>  include/uapi/linux/fuse.h |  42 +++++
>  8 files changed, 477 insertions(+), 4 deletions(-)
>  create mode 100644 fs/fuse/famfs.c
>  create mode 100644 fs/fuse/famfs_kfmap.h
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 00e94bec401e..2a5a7e0e8b28 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -8808,6 +8808,15 @@ F:	Documentation/networking/failover.rst
>  F:	include/net/failover.h
>  F:	net/core/failover.c
>  
> +FAMFS
> +M:	John Groves <jgroves@micron.com>
> +M:	John Groves <John@Groves.net>
> +L:	linux-cxl@vger.kernel.org
> +L:	linux-fsdevel@vger.kernel.org
> +S:	Supported
> +F:	fs/fuse/famfs.c
> +F:	fs/fuse/famfs_kfmap.h
> +
>  FANOTIFY
>  M:	Jan Kara <jack@suse.cz>
>  R:	Amir Goldstein <amir73il@gmail.com>
> diff --git a/fs/fuse/Makefile b/fs/fuse/Makefile
> index 3f0f312a31c1..65a12975d734 100644
> --- a/fs/fuse/Makefile
> +++ b/fs/fuse/Makefile
> @@ -16,5 +16,5 @@ fuse-$(CONFIG_FUSE_DAX) += dax.o
>  fuse-$(CONFIG_FUSE_PASSTHROUGH) += passthrough.o
>  fuse-$(CONFIG_SYSCTL) += sysctl.o
>  fuse-$(CONFIG_FUSE_IO_URING) += dev_uring.o
> -
> +fuse-$(CONFIG_FUSE_FAMFS_DAX) += famfs.o
>  virtiofs-y := virtio_fs.o
> diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
> index ae135c55b9f6..b28a1e912d6b 100644
> --- a/fs/fuse/dir.c
> +++ b/fs/fuse/dir.c
> @@ -405,6 +405,9 @@ fuse_get_fmap(struct fuse_mount *fm, struct inode *inode, u64 nodeid)
>  	fmap_size = args.out_args[0].size;
>  	pr_notice("%s: nodei=%lld fmap_size=%ld\n", __func__, nodeid, fmap_size);
>  
> +	/* Convert fmap into in-memory format and hang from inode */
> +	famfs_file_init_dax(fm, inode, fmap_buf, fmap_size);
> +
>  	return 0;
>  }
>  #endif
> diff --git a/fs/fuse/famfs.c b/fs/fuse/famfs.c
> new file mode 100644
> index 000000000000..e62c047d0950
> --- /dev/null
> +++ b/fs/fuse/famfs.c
> @@ -0,0 +1,344 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * famfs - dax file system for shared fabric-attached memory
> + *
> + * Copyright 2023-2025 Micron Technology, Inc.
> + *
> + * This file system, originally based on ramfs and the dax support from xfs,
> + * is intended to allow multiple host systems to mount a common file system
> + * view of dax files that map to shared memory.
> + */
> +
> +#include <linux/fs.h>
> +#include <linux/mm.h>
> +#include <linux/dax.h>
> +#include <linux/iomap.h>
> +#include <linux/path.h>
> +#include <linux/namei.h>
> +#include <linux/string.h>
> +
> +#include "famfs_kfmap.h"
> +#include "fuse_i.h"
> +
> +
> +void
> +__famfs_meta_free(void *famfs_meta)
> +{
> +	struct famfs_file_meta *fmap = famfs_meta;
> +
> +	if (!fmap)
> +		return;
> +
> +	if (fmap) {
> +		switch (fmap->fm_extent_type) {
> +		case SIMPLE_DAX_EXTENT:
> +			kfree(fmap->se);
> +			break;
> +		case INTERLEAVED_EXTENT:
> +			if (fmap->ie)
> +				kfree(fmap->ie->ie_strips);
> +
> +			kfree(fmap->ie);
> +			break;
> +		default:
> +			pr_err("%s: invalid fmap type\n", __func__);
> +			break;
> +		}
> +	}
> +	kfree(fmap);
> +}
> +
> +static int
> +famfs_check_ext_alignment(struct famfs_meta_simple_ext *se)
> +{
> +	int errs = 0;
> +
> +	if (se->dev_index != 0)
> +		errs++;
> +
> +	/* TODO: pass in alignment so we can support the other page sizes */
> +	if (!IS_ALIGNED(se->ext_offset, PMD_SIZE))
> +		errs++;
> +
> +	if (!IS_ALIGNED(se->ext_len, PMD_SIZE))
> +		errs++;
> +
> +	return errs;
> +}
> +
> +/**
> + * famfs_meta_alloc() - Allocate famfs file metadata
> + * @metap:       Pointer to an mcache_map_meta pointer
> + * @ext_count:  The number of extents needed
> + */
> +static int
> +famfs_meta_alloc_v3(
> +	void *fmap_buf,
> +	size_t fmap_buf_size,
> +	struct famfs_file_meta **metap)
> +{
> +	struct famfs_file_meta *meta = NULL;
> +	struct fuse_famfs_fmap_header *fmh;
> +	size_t extent_total = 0;
> +	size_t next_offset = 0;
> +	int errs = 0;
> +	int i, j;
> +	int rc;
> +
> +	fmh = (struct fuse_famfs_fmap_header *)fmap_buf;
> +
> +	/* Move past fmh in fmap_buf */
> +	next_offset += sizeof(*fmh);
> +	if (next_offset > fmap_buf_size) {
> +		pr_err("%s:%d: fmap_buf underflow offset/size %ld/%ld\n",
> +		       __func__, __LINE__, next_offset, fmap_buf_size);
> +		rc = -EINVAL;
> +		goto errout;
> +	}
> +
> +	if (fmh->nextents < 1) {
> +		pr_err("%s: nextents %d < 1\n", __func__, fmh->nextents);
> +		rc = -EINVAL;
> +		goto errout;
> +	}
> +
> +	if (fmh->nextents > FUSE_FAMFS_MAX_EXTENTS) {
> +		pr_err("%s: nextents %d > max (%d) 1\n",
> +		       __func__, fmh->nextents, FUSE_FAMFS_MAX_EXTENTS);
> +		rc = -E2BIG;
> +		goto errout;
> +	}
> +
> +	meta = kzalloc(sizeof(*meta), GFP_KERNEL);
> +	if (!meta)
> +		return -ENOMEM;
> +	meta->error = false;
> +
> +	meta->file_type = fmh->file_type;
> +	meta->file_size = fmh->file_size;
> +	meta->fm_extent_type = fmh->ext_type;
> +
> +	switch (fmh->ext_type) {
> +	case FUSE_FAMFS_EXT_SIMPLE: {
> +		struct fuse_famfs_simple_ext *se_in;
> +
> +		se_in = (struct fuse_famfs_simple_ext *)(fmap_buf + next_offset);
> +
> +		/* Move past simple extents */
> +		next_offset += fmh->nextents * sizeof(*se_in);
> +		if (next_offset > fmap_buf_size) {
> +			pr_err("%s:%d: fmap_buf underflow offset/size %ld/%ld\n",
> +			       __func__, __LINE__, next_offset, fmap_buf_size);
> +			rc = -EINVAL;
> +			goto errout;
> +		}
> +
> +		meta->fm_nextents = fmh->nextents;
> +
> +		meta->se = kcalloc(meta->fm_nextents, sizeof(*(meta->se)),
> +				   GFP_KERNEL);
> +		if (!meta->se) {
> +			rc = -ENOMEM;
> +			goto errout;
> +		}
> +
> +		if ((meta->fm_nextents > FUSE_FAMFS_MAX_EXTENTS) ||
> +		    (meta->fm_nextents < 1)) {
> +			rc = -EINVAL;
> +			goto errout;
> +		}
> +
> +		for (i = 0; i < fmh->nextents; i++) {
> +			meta->se[i].dev_index  = se_in[i].se_devindex;
> +			meta->se[i].ext_offset = se_in[i].se_offset;
> +			meta->se[i].ext_len    = se_in[i].se_len;
> +
> +			/* Record bitmap of referenced daxdev indices */
> +			meta->dev_bitmap |= (1 << meta->se[i].dev_index);
> +
> +			errs += famfs_check_ext_alignment(&meta->se[i]);
> +
> +			extent_total += meta->se[i].ext_len;
> +		}
> +		break;
> +	}
> +
> +	case FUSE_FAMFS_EXT_INTERLEAVE: {
> +		s64 size_remainder = meta->file_size;
> +		struct fuse_famfs_iext *ie_in;
> +		int niext = fmh->nextents;
> +
> +		meta->fm_niext = niext;
> +
> +		/* Allocate interleaved extent */
> +		meta->ie = kcalloc(niext, sizeof(*(meta->ie)), GFP_KERNEL);
> +		if (!meta->ie) {
> +			rc = -ENOMEM;
> +			goto errout;
> +		}
> +
> +		/*
> +		 * Each interleaved extent has a simple extent list of strips.
> +		 * Outer loop is over separate interleaved extents
> +		 */
> +		for (i = 0; i < niext; i++) {
> +			u64 nstrips;
> +			struct fuse_famfs_simple_ext *sie_in;
> +
> +			/* ie_in = one interleaved extent in fmap_buf */
> +			ie_in = (struct fuse_famfs_iext *)
> +				(fmap_buf + next_offset);
> +
> +			/* Move past one interleaved extent header in fmap_buf */
> +			next_offset += sizeof(*ie_in);
> +			if (next_offset > fmap_buf_size) {
> +				pr_err("%s:%d: fmap_buf underflow offset/size %ld/%ld\n",
> +				       __func__, __LINE__, next_offset, fmap_buf_size);
> +				rc = -EINVAL;
> +				goto errout;
> +			}
> +
> +			nstrips = ie_in->ie_nstrips;
> +			meta->ie[i].fie_chunk_size = ie_in->ie_chunk_size;
> +			meta->ie[i].fie_nstrips    = ie_in->ie_nstrips;
> +			meta->ie[i].fie_nbytes     = ie_in->ie_nbytes;
> +
> +			if (!meta->ie[i].fie_nbytes) {
> +				pr_err("%s: zero-length interleave!\n",
> +				       __func__);
> +				rc = -EINVAL;
> +				goto errout;
> +			}
> +
> +			/* sie_in = the strip extents in fmap_buf */
> +			sie_in = (struct fuse_famfs_simple_ext *)
> +				(fmap_buf + next_offset);
> +
> +			/* Move past strip extents in fmap_buf */
> +			next_offset += nstrips * sizeof(*sie_in);
> +			if (next_offset > fmap_buf_size) {
> +				pr_err("%s:%d: fmap_buf underflow offset/size %ld/%ld\n",
> +				       __func__, __LINE__, next_offset, fmap_buf_size);
> +				rc = -EINVAL;
> +				goto errout;
> +			}
> +
> +			if ((nstrips > FUSE_FAMFS_MAX_STRIPS) || (nstrips < 1)) {
> +				pr_err("%s: invalid nstrips=%lld (max=%d)\n",
> +				       __func__, nstrips,
> +				       FUSE_FAMFS_MAX_STRIPS);
> +				errs++;
> +			}
> +
> +			/* Allocate strip extent array */
> +			meta->ie[i].ie_strips = kcalloc(ie_in->ie_nstrips,
> +					sizeof(meta->ie[i].ie_strips[0]),
> +							GFP_KERNEL);
> +			if (!meta->ie[i].ie_strips) {
> +				rc = -ENOMEM;
> +				goto errout;
> +			}
> +
> +			/* Inner loop is over strips */
> +			for (j = 0; j < nstrips; j++) {
> +				struct famfs_meta_simple_ext *strips_out;
> +				u64 devindex = sie_in[j].se_devindex;
> +				u64 offset   = sie_in[j].se_offset;
> +				u64 len      = sie_in[j].se_len;
> +
> +				strips_out = meta->ie[i].ie_strips;
> +				strips_out[j].dev_index  = devindex;
> +				strips_out[j].ext_offset = offset;
> +				strips_out[j].ext_len    = len;
> +
> +				/* Record bitmap of referenced daxdev indices */
> +				meta->dev_bitmap |= (1 << devindex);
> +
> +				extent_total += len;
> +				errs += famfs_check_ext_alignment(&strips_out[j]);
> +				size_remainder -= len;
> +			}
> +		}
> +
> +		if (size_remainder > 0) {
> +			/* Sum of interleaved extent sizes is less than file size! */
> +			pr_err("%s: size_remainder %lld (0x%llx)\n",
> +			       __func__, size_remainder, size_remainder);
> +			rc = -EINVAL;
> +			goto errout;
> +		}
> +		break;
> +	}
> +
> +	default:
> +		pr_err("%s: invalid ext_type %d\n", __func__, fmh->ext_type);
> +		rc = -EINVAL;
> +		goto errout;
> +	}
> +
> +	if (errs > 0) {
> +		pr_err("%s: %d alignment errors found\n", __func__, errs);
> +		rc = -EINVAL;
> +		goto errout;
> +	}
> +
> +	/* More sanity checks */
> +	if (extent_total < meta->file_size) {
> +		pr_err("%s: file size %ld larger than map size %ld\n",
> +		       __func__, meta->file_size, extent_total);
> +		rc = -EINVAL;
> +		goto errout;
> +	}
> +
> +	*metap = meta;
> +
> +	return 0;
> +errout:
> +	__famfs_meta_free(meta);
> +	return rc;
> +}
> +
> +int
> +famfs_file_init_dax(
> +	struct fuse_mount *fm,
> +	struct inode *inode,
> +	void *fmap_buf,
> +	size_t fmap_size)
> +{
> +	struct fuse_inode *fi = get_fuse_inode(inode);
> +	struct famfs_file_meta *meta = NULL;
> +	int rc;
> +
> +	if (fi->famfs_meta) {
> +		pr_notice("%s: i_no=%ld fmap_size=%ld ALREADY INITIALIZED\n",
> +			  __func__,
> +			  inode->i_ino, fmap_size);
> +		return -EEXIST;
> +	}
> +
> +	rc = famfs_meta_alloc_v3(fmap_buf, fmap_size, &meta);
> +	if (rc)
> +		goto errout;
> +
> +	/* Publish the famfs metadata on fi->famfs_meta */
> +	inode_lock(inode);
> +	if (fi->famfs_meta) {
> +		rc = -EEXIST; /* file already has famfs metadata */
> +	} else {
> +		if (famfs_meta_set(fi, meta) != NULL) {
> +			pr_err("%s: file already had metadata\n", __func__);
> +			rc = -EALREADY;
> +			goto errout;
> +		}
> +		i_size_write(inode, meta->file_size);
> +		inode->i_flags |= S_DAX;
> +	}
> +	inode_unlock(inode);
> +
> + errout:
> +	if (rc)
> +		__famfs_meta_free(meta);
> +
> +	return rc;
> +}
> +
> diff --git a/fs/fuse/famfs_kfmap.h b/fs/fuse/famfs_kfmap.h
> new file mode 100644
> index 000000000000..ce785d76719c
> --- /dev/null
> +++ b/fs/fuse/famfs_kfmap.h
> @@ -0,0 +1,63 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * famfs - dax file system for shared fabric-attached memory
> + *
> + * Copyright 2023-2025 Micron Technology, Inc.
> + */
> +#ifndef FAMFS_KFMAP_H
> +#define FAMFS_KFMAP_H
> +
> +/*
> + * These structures are the in-memory metadata format for famfs files. Metadata
> + * retrieved via the GET_FMAP response is converted to this format for use in
> + * resolving file mapping faults.
> + */
> +
> +enum famfs_file_type {
> +	FAMFS_REG,
> +	FAMFS_SUPERBLOCK,
> +	FAMFS_LOG,
> +};
> +
> +/* We anticipate the possibility of supporting additional types of extents */
> +enum famfs_extent_type {
> +	SIMPLE_DAX_EXTENT,
> +	INTERLEAVED_EXTENT,
> +	INVALID_EXTENT_TYPE,
> +};
> +
> +struct famfs_meta_simple_ext {
> +	u64 dev_index;
> +	u64 ext_offset;
> +	u64 ext_len;
> +};
> +
> +struct famfs_meta_interleaved_ext {
> +	u64 fie_nstrips;
> +	u64 fie_chunk_size;
> +	u64 fie_nbytes;
> +	struct famfs_meta_simple_ext *ie_strips;
> +};
> +
> +/*
> + * Each famfs dax file has this hanging from its fuse_inode->famfs_meta
> + */
> +struct famfs_file_meta {
> +	bool                   error;
> +	enum famfs_file_type   file_type;
> +	size_t                 file_size;
> +	enum famfs_extent_type fm_extent_type;
> +	u64 dev_bitmap; /* bitmap of referenced daxdevs by index */
> +	union { /* This will make code a bit more readable */
> +		struct {
> +			size_t         fm_nextents;
> +			struct famfs_meta_simple_ext  *se;
> +		};
> +		struct {
> +			size_t         fm_niext;
> +			struct famfs_meta_interleaved_ext *ie;
> +		};
> +	};
> +};
> +
> +#endif /* FAMFS_KFMAP_H */
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index 437177c2f092..d8e0ac784224 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -1557,11 +1557,18 @@ extern void fuse_sysctl_unregister(void);
>  #endif /* CONFIG_SYSCTL */
>  
>  /* famfs.c */
> +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> +int famfs_file_init_dax(struct fuse_mount *fm,
> +			     struct inode *inode, void *fmap_buf,
> +			     size_t fmap_size);
> +void __famfs_meta_free(void *map);
> +#endif
> +
>  static inline struct fuse_backing *famfs_meta_set(struct fuse_inode *fi,
>  						       void *meta)
>  {
>  #if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> -	return xchg(&fi->famfs_meta, meta);
> +	return cmpxchg(&fi->famfs_meta, NULL, meta);
>  #else
>  	return NULL;
>  #endif
> @@ -1569,7 +1576,12 @@ static inline struct fuse_backing *famfs_meta_set(struct fuse_inode *fi,
>  
>  static inline void famfs_meta_free(struct fuse_inode *fi)
>  {
> -	/* Stub wil be connected in a subsequent commit */
> +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> +	if (fi->famfs_meta != NULL) {
> +		__famfs_meta_free(fi->famfs_meta);
> +		famfs_meta_set(fi, NULL);
> +	}
> +#endif
>  }
>  
>  static inline int fuse_file_famfs(struct fuse_inode *fi)
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index 848c8818e6f7..e86bf330117f 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -118,7 +118,7 @@ static struct inode *fuse_alloc_inode(struct super_block *sb)
>  		fuse_inode_backing_set(fi, NULL);
>  
>  	if (IS_ENABLED(CONFIG_FUSE_FAMFS_DAX))
> -		famfs_meta_set(fi, NULL);
> +		fi->famfs_meta = NULL; /* XXX new inodes currently not zeroed; why not? */
>  
>  	return &fi->inode;
>  
> diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> index d85fb692cf3b..0f6ff1ffb23d 100644
> --- a/include/uapi/linux/fuse.h
> +++ b/include/uapi/linux/fuse.h
> @@ -1286,4 +1286,46 @@ struct fuse_uring_cmd_req {
>  	uint8_t padding[6];
>  };
>  
> +/* Famfs fmap message components */
> +
> +#define FAMFS_FMAP_VERSION 1
> +
> +#define FUSE_FAMFS_MAX_EXTENTS 2
> +#define FUSE_FAMFS_MAX_STRIPS 16

FYI, after thinking through the conversation with Darrick, I'm planning
to drop FUSE_FAMFS_MAX_(EXTENTS|STRIPS) in the next version. The response
to GET_FMAP is just the structures below serialized into a message
buffer. If it fits, it's good - and if not, it's invalid. When the
in-memory metadata (defined in famfs_kfmap.h) gets assembled, limits can
be applied if there is a reason to - but I don't currently see a reason
to do that (so if I'm currently enforcing limits there, I'll probably drop
them).
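
The "if it fits, it's good" rule can be sketched as a single size check on
the reply buffer. This is a userspace illustration under assumptions, not
the kernel code: the two structs are copied from the uapi hunk in this
patch, fmap_simple_fits() is a hypothetical helper name, and only the
simple-extent layout (ext_type 0) is handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Wire structures as they appear in the patch's uapi hunk. */
struct fuse_famfs_simple_ext {
	uint32_t se_devindex;
	uint32_t reserved;
	uint64_t se_offset;
	uint64_t se_len;
};

struct fuse_famfs_fmap_header {
	uint8_t file_type;
	uint8_t reserved;
	uint16_t fmap_version;
	uint32_t ext_type;
	uint32_t nextents;
	uint32_t reserved0;
	uint64_t file_size;
	uint64_t reserved1;
};

/*
 * Hypothetical helper: return 0 if a simple-extent fmap with the advertised
 * extent count fits entirely inside the reply buffer, -1 otherwise.
 */
static int fmap_simple_fits(const void *buf, size_t buf_size)
{
	struct fuse_famfs_fmap_header fmh;
	size_t avail;

	if (buf_size < sizeof(fmh))
		return -1;
	memcpy(&fmh, buf, sizeof(fmh));	/* copy out; buf may be unaligned */

	if (fmh.ext_type != 0 || fmh.nextents == 0)
		return -1;

	/* Compare by division so a huge nextents cannot overflow a product. */
	avail = buf_size - sizeof(fmh);
	if (fmh.nextents > avail / sizeof(struct fuse_famfs_simple_ext))
		return -1;

	return 0;
}
```

With a check of this shape the buffer size itself is the limit, so no
separate extent-count constant is needed on the wire.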


> +
> +enum fuse_famfs_file_type {
> +	FUSE_FAMFS_FILE_REG,
> +	FUSE_FAMFS_FILE_SUPERBLOCK,
> +	FUSE_FAMFS_FILE_LOG,
> +};
> +
> +enum famfs_ext_type {
> +	FUSE_FAMFS_EXT_SIMPLE = 0,
> +	FUSE_FAMFS_EXT_INTERLEAVE = 1,
> +};
> +
> +struct fuse_famfs_simple_ext {
> +	uint32_t se_devindex;
> +	uint32_t reserved;
> +	uint64_t se_offset;
> +	uint64_t se_len;
> +};
> +
> +struct fuse_famfs_iext { /* Interleaved extent */
> +	uint32_t ie_nstrips;
> +	uint32_t ie_chunk_size;
> +	uint64_t ie_nbytes; /* Total bytes for this interleaved_ext; sum of strips may be more */
> +	uint64_t reserved;
> +};
> +
> +struct fuse_famfs_fmap_header {
> +	uint8_t file_type; /* enum famfs_file_type */
> +	uint8_t reserved;
> +	uint16_t fmap_version;
> +	uint32_t ext_type; /* enum famfs_log_ext_type */
> +	uint32_t nextents;
> +	uint32_t reserved0;
> +	uint64_t file_size;
> +	uint64_t reserved1;
> +};
>  #endif /* _LINUX_FUSE_H */
> -- 
> 2.49.0
> 

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 13/19] famfs_fuse: Create files with famfs fmaps
  2025-04-24 13:43   ` John Groves
@ 2025-04-24 14:38     ` Darrick J. Wong
  2025-04-28  1:48       ` John Groves
  0 siblings, 1 reply; 58+ messages in thread
From: Darrick J. Wong @ 2025-04-24 14:38 UTC (permalink / raw)
  To: John Groves
  Cc: Dan Williams, Miklos Szeredi, Bernd Schubert, John Groves,
	Jonathan Corbet, Vishal Verma, Dave Jiang, Matthew Wilcox,
	Jan Kara, Alexander Viro, Christian Brauner, Luis Henriques,
	Randy Dunlap, Jeff Layton, Kent Overstreet, Petr Vorel,
	Brian Foster, linux-doc, linux-kernel, nvdimm, linux-cxl,
	linux-fsdevel, Amir Goldstein, Jonathan Cameron, Stefan Hajnoczi,
	Joanne Koong, Josef Bacik, Aravind Ramesh, Ajay Joshi

On Thu, Apr 24, 2025 at 08:43:33AM -0500, John Groves wrote:
> On 25/04/20 08:33PM, John Groves wrote:
> > On completion of GET_FMAP message/response, setup the full famfs
> > metadata such that it's possible to handle read/write/mmap directly to
> > dax. Note that the devdax_iomap plumbing is not in yet...
> > 
> > Update MAINTAINERS for the new files.
> > 
> > Signed-off-by: John Groves <john@groves.net>
> > ---
> >  MAINTAINERS               |   9 +
> >  fs/fuse/Makefile          |   2 +-
> >  fs/fuse/dir.c             |   3 +
> >  fs/fuse/famfs.c           | 344 ++++++++++++++++++++++++++++++++++++++
> >  fs/fuse/famfs_kfmap.h     |  63 +++++++
> >  fs/fuse/fuse_i.h          |  16 +-
> >  fs/fuse/inode.c           |   2 +-
> >  include/uapi/linux/fuse.h |  42 +++++
> >  8 files changed, 477 insertions(+), 4 deletions(-)
> >  create mode 100644 fs/fuse/famfs.c
> >  create mode 100644 fs/fuse/famfs_kfmap.h
> > 

<snip>

> > diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> > index d85fb692cf3b..0f6ff1ffb23d 100644
> > --- a/include/uapi/linux/fuse.h
> > +++ b/include/uapi/linux/fuse.h
> > @@ -1286,4 +1286,46 @@ struct fuse_uring_cmd_req {
> >  	uint8_t padding[6];
> >  };
> >  
> > +/* Famfs fmap message components */
> > +
> > +#define FAMFS_FMAP_VERSION 1
> > +
> > +#define FUSE_FAMFS_MAX_EXTENTS 2
> > +#define FUSE_FAMFS_MAX_STRIPS 16
> 
> FYI, after thinking through the conversation with Darrick, I'm planning
> to drop FUSE_FAMFS_MAX_(EXTENTS|STRIPS) in the next version. The response
> to GET_FMAP is just the structures below serialized into a message
> buffer. If it fits, it's good - and if not, it's invalid. When the
> in-memory metadata (defined in famfs_kfmap.h) gets assembled, limits can
> be applied if there is a reason to - but I don't currently see a reason
> to do that (so if I'm currently enforcing limits there, I'll probably drop
> them).

You could also define GET_FMAP to have an offset in the request buffer,
and have the famfs daemon send back the next offset at the end of its
reply (or -1ULL to stop).  Then the kernel can call GET_FMAP again with
that new offset to get more mappings.
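
A rough userspace model of that continuation scheme, to make the control
flow concrete. Everything here is hypothetical: struct fmap_reply, the
two-entry batch size, server_get_fmap() and fetch_all() are invented for
illustration and are not part of the fuse protocol or of this patch set.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FMAP_DONE UINT64_MAX	/* the "-1ULL to stop" sentinel */
#define BATCH 2			/* deliberately tiny reply capacity */

/* Hypothetical reply: a batch of extent lengths plus a continuation cursor. */
struct fmap_reply {
	uint64_t next_offset;
	size_t count;
	uint64_t ext_len[BATCH];
};

/* Hypothetical server side: return up to BATCH entries starting at offset. */
static void server_get_fmap(const uint64_t *all, size_t total,
			    uint64_t offset, struct fmap_reply *r)
{
	uint64_t i;

	r->count = 0;
	for (i = offset; i < total && r->count < BATCH; i++)
		r->ext_len[r->count++] = all[i];
	r->next_offset = (i < total) ? i : FMAP_DONE;
}

/*
 * Requester side: re-issue the request with the returned offset until the
 * server says stop. Returns the number of requests made, or -1 on error.
 */
static int fetch_all(const uint64_t *all, size_t total,
		     uint64_t *out, size_t out_cap, size_t *nout)
{
	uint64_t offset = 0;
	int requests = 0;

	*nout = 0;
	while (offset != FMAP_DONE) {
		struct fmap_reply r;
		size_t i;

		server_get_fmap(all, total, offset, &r);
		requests++;

		for (i = 0; i < r.count; i++) {
			if (*nout == out_cap)
				return -1;	/* caller's array is full */
			out[(*nout)++] = r.ext_len[i];
		}

		/* A cursor that does not advance would loop forever. */
		if (r.next_offset != FMAP_DONE && r.next_offset <= offset)
			return -1;
		offset = r.next_offset;
	}
	return requests;
}
```

The forward-progress check matters because the cursor comes from the
server; the requester should not trust it to terminate the loop.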

Though at this point maybe it should go the /other/ way, where the fuse
server can send a "notification" to the kernel to populate its mapping
data?  fuse already defines a handful of notifications for invalidating
pagecache and directory links.

(Ugly wart: notifications aren't yet implemented for the io_uring channel)

--D

> 
> > +
> > +enum fuse_famfs_file_type {
> > +	FUSE_FAMFS_FILE_REG,
> > +	FUSE_FAMFS_FILE_SUPERBLOCK,
> > +	FUSE_FAMFS_FILE_LOG,
> > +};
> > +
> > +enum famfs_ext_type {
> > +	FUSE_FAMFS_EXT_SIMPLE = 0,
> > +	FUSE_FAMFS_EXT_INTERLEAVE = 1,
> > +};
> > +
> > +struct fuse_famfs_simple_ext {
> > +	uint32_t se_devindex;
> > +	uint32_t reserved;
> > +	uint64_t se_offset;
> > +	uint64_t se_len;
> > +};
> > +
> > +struct fuse_famfs_iext { /* Interleaved extent */
> > +	uint32_t ie_nstrips;
> > +	uint32_t ie_chunk_size;
> > +	uint64_t ie_nbytes; /* Total bytes for this interleaved_ext; sum of strips may be more */
> > +	uint64_t reserved;
> > +};
> > +
> > +struct fuse_famfs_fmap_header {
> > +	uint8_t file_type; /* enum famfs_file_type */
> > +	uint8_t reserved;
> > +	uint16_t fmap_version;
> > +	uint32_t ext_type; /* enum famfs_log_ext_type */
> > +	uint32_t nextents;
> > +	uint32_t reserved0;
> > +	uint64_t file_size;
> > +	uint64_t reserved1;
> > +};
> >  #endif /* _LINUX_FUSE_H */
> > -- 
> > 2.49.0
> > 

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 13/19] famfs_fuse: Create files with famfs fmaps
  2025-04-24 14:38     ` Darrick J. Wong
@ 2025-04-28  1:48       ` John Groves
  2025-04-28 19:00         ` Darrick J. Wong
  0 siblings, 1 reply; 58+ messages in thread
From: John Groves @ 2025-04-28  1:48 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Dan Williams, Miklos Szeredi, Bernd Schubert, John Groves,
	Jonathan Corbet, Vishal Verma, Dave Jiang, Matthew Wilcox,
	Jan Kara, Alexander Viro, Christian Brauner, Luis Henriques,
	Randy Dunlap, Jeff Layton, Kent Overstreet, Petr Vorel,
	Brian Foster, linux-doc, linux-kernel, nvdimm, linux-cxl,
	linux-fsdevel, Amir Goldstein, Jonathan Cameron, Stefan Hajnoczi,
	Joanne Koong, Josef Bacik, Aravind Ramesh, Ajay Joshi, 0

On 25/04/24 07:38AM, Darrick J. Wong wrote:
> On Thu, Apr 24, 2025 at 08:43:33AM -0500, John Groves wrote:
> > On 25/04/20 08:33PM, John Groves wrote:
> > > On completion of GET_FMAP message/response, setup the full famfs
> > > metadata such that it's possible to handle read/write/mmap directly to
> > > dax. Note that the devdax_iomap plumbing is not in yet...
> > > 
> > > Update MAINTAINERS for the new files.
> > > 
> > > Signed-off-by: John Groves <john@groves.net>
> > > ---
> > >  MAINTAINERS               |   9 +
> > >  fs/fuse/Makefile          |   2 +-
> > >  fs/fuse/dir.c             |   3 +
> > >  fs/fuse/famfs.c           | 344 ++++++++++++++++++++++++++++++++++++++
> > >  fs/fuse/famfs_kfmap.h     |  63 +++++++
> > >  fs/fuse/fuse_i.h          |  16 +-
> > >  fs/fuse/inode.c           |   2 +-
> > >  include/uapi/linux/fuse.h |  42 +++++
> > >  8 files changed, 477 insertions(+), 4 deletions(-)
> > >  create mode 100644 fs/fuse/famfs.c
> > >  create mode 100644 fs/fuse/famfs_kfmap.h
> > > 
> 
> <snip>
> 
> > > diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> > > index d85fb692cf3b..0f6ff1ffb23d 100644
> > > --- a/include/uapi/linux/fuse.h
> > > +++ b/include/uapi/linux/fuse.h
> > > @@ -1286,4 +1286,46 @@ struct fuse_uring_cmd_req {
> > >  	uint8_t padding[6];
> > >  };
> > >  
> > > +/* Famfs fmap message components */
> > > +
> > > +#define FAMFS_FMAP_VERSION 1
> > > +
> > > +#define FUSE_FAMFS_MAX_EXTENTS 2
> > > +#define FUSE_FAMFS_MAX_STRIPS 16
> > 
> > FYI, after thinking through the conversation with Darrick, I'm planning
> > to drop FUSE_FAMFS_MAX_(EXTENTS|STRIPS) in the next version. The response
> > to GET_FMAP is just the structures below serialized into a message
> > buffer. If it fits, it's good - and if not, it's invalid. When the
> > in-memory metadata (defined in famfs_kfmap.h) gets assembled, limits can
> > be applied if there is a reason to - but I don't currently see a reason
> > to do that (so if I'm currently enforcing limits there, I'll probably drop
> > them).
> 
> You could also define GET_FMAP to have an offset in the request buffer,
> and have the famfs daemon send back the next offset at the end of its
> reply (or -1ULL to stop).  Then the kernel can call GET_FMAP again with
> that new offset to get more mappings.
> 
> Though at this point maybe it should go the /other/ way, where the fuse
> server can send a "notification" to the kernel to populate its mapping
> data?  fuse already defines a handful of notifications for invalidating
> pagecache and directory links.
> 
> (Ugly wart: notifications aren't yet implemented for the io_uring channel)

I don't have fully-formed thoughts about notifications yet; thinking...

If the fmap stuff may be shared by more than one use case (as has always
seemed possible), it's a good idea to think through a couple of things: 
1) is there anything important missing from this general approach, and 
2) do you need to *partially* cache fmaps? (or is the "offset" idea above 
just to deal with an fmap that might otherwise overflow a response size?)

The current approach lets the kernel retrieve and cache simple and 
interleaved fmaps (and BTW interleaved can be multi-dev or single-dev - 
there are currently odd cases where that's useful). Also, FWIW, everything
that can be done with simple ext list fmaps can be done with a collection
of interleaved extents, each with strip count = 1. But I think there is a
worthwhile clarity to having both.
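
The strip-count-1 equivalence can be checked with a small userspace sketch.
It assumes plain round-robin striping (chunk N lands on strip N % nstrips),
which is the usual interleave layout but is not confirmed by this patch,
since the resolution code is not in yet. struct strip mirrors
famfs_meta_simple_ext; struct target and interleave_resolve() are
hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Same fields as famfs_meta_simple_ext in famfs_kfmap.h. */
struct strip {
	uint64_t dev_index;
	uint64_t ext_offset;
	uint64_t ext_len;
};

/* Hypothetical result type: which device, and where on it. */
struct target {
	uint64_t dev_index;
	uint64_t dev_offset;
};

/*
 * Resolve a file offset within one interleaved extent, assuming round-robin
 * striping. Returns 0 on success, -1 if the offset falls outside the strip.
 */
static int interleave_resolve(const struct strip *strips, uint64_t nstrips,
			      uint64_t chunk_size, uint64_t file_off,
			      struct target *t)
{
	uint64_t chunk, strip_idx, stripe, strip_off;

	if (nstrips == 0 || chunk_size == 0)
		return -1;

	chunk     = file_off / chunk_size;	/* which chunk of the file */
	strip_idx = chunk % nstrips;		/* which strip holds it */
	stripe    = chunk / nstrips;		/* full stripes before it */
	strip_off = stripe * chunk_size + file_off % chunk_size;

	if (strip_off >= strips[strip_idx].ext_len)
		return -1;

	t->dev_index  = strips[strip_idx].dev_index;
	t->dev_offset = strips[strip_idx].ext_offset + strip_off;
	return 0;
}
```

With nstrips == 1 the strip index is always 0 and strip_off equals
file_off, so the result is ext_offset + file_off on one device, which is
exactly what a simple extent gives.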

But the current implementation does not contemplate partially cached fmaps.

Adding notification could address revoking them post-haste (is that why
you're thinking about notifications? And if not can you elaborate on what
you're after there?).

> 
> --D

Cheers,
John


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 18/19] famfs_fuse: Add documentation
  2025-04-22  2:10   ` Randy Dunlap
@ 2025-04-28  1:50     ` John Groves
  0 siblings, 0 replies; 58+ messages in thread
From: John Groves @ 2025-04-28  1:50 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: Dan Williams, Miklos Szeredi, Bernd Schubert, John Groves,
	Jonathan Corbet, Vishal Verma, Dave Jiang, Matthew Wilcox,
	Jan Kara, Alexander Viro, Christian Brauner, Darrick J . Wong,
	Luis Henriques, Jeff Layton, Kent Overstreet, Petr Vorel,
	Brian Foster, linux-doc, linux-kernel, nvdimm, linux-cxl,
	linux-fsdevel, Amir Goldstein, Jonathan Cameron, Stefan Hajnoczi,
	Joanne Koong, Josef Bacik, Aravind Ramesh, Ajay Joshi

On 25/04/21 07:10PM, Randy Dunlap wrote:
> 
> 
> On 4/20/25 6:33 PM, John Groves wrote:
> > Add Documentation/filesystems/famfs.rst and update MAINTAINERS
> > 
> > Signed-off-by: John Groves <john@groves.net>
> > ---
> >  Documentation/filesystems/famfs.rst | 142 ++++++++++++++++++++++++++++
> >  Documentation/filesystems/index.rst |   1 +
> >  MAINTAINERS                         |   1 +
> >  3 files changed, 144 insertions(+)
> >  create mode 100644 Documentation/filesystems/famfs.rst
> > 
> > diff --git a/Documentation/filesystems/famfs.rst b/Documentation/filesystems/famfs.rst
> > new file mode 100644
> > index 000000000000..b6b3500b6905
> > --- /dev/null
> > +++ b/Documentation/filesystems/famfs.rst
> > @@ -0,0 +1,142 @@
> > +.. SPDX-License-Identifier: GPL-2.0
> > +
> > +.. _famfs_index:
> > +
> > +==================================================================
> > +famfs: The fabric-attached memory file system
> > +==================================================================
> > +
> > +- Copyright (C) 2024-2025 Micron Technology, Inc.
> > +
> > +Introduction
> > +============
> > +Compute Express Link (CXL) provides a mechanism for disaggregated or
> > +fabric-attached memory (FAM). This creates opportunities for data sharing;
> > +clustered apps that would otherwise have to shard or replicate data can
> > +share one copy in disaggregated memory.
> > +
> > +Famfs, which is not CXL-specific in any way, provides a mechanism for
> > +multiple hosts to concurrently access data in shared memory, by giving it
> > +a file system interface. With famfs, any app that understands files can
> > +access data sets in shared memory. Although famfs supports read and write,
> > +the real point is to support mmap, which provides direct (dax) access to
> > +the memory - either writable or read-only.
> > +
> > +Shared memory can pose complex coherency and synchronization issues, but
> > +there are also simple cases. Two simple and eminently useful patterns that
> > +occur frequently in data analytics and AI are:
> > +
> > +* Serial Sharing - Only one host or process at a time has access to a file
> > +* Read-only Sharing - Multiple hosts or processes share read-only access
> > +  to a file
> > +
> > +The famfs fuse file system is part of the famfs framework; User space
> 
>                                                               user
> 
> > +components [1] handle metadata allocation and distribution, and provide a
> > +low-level fuse server to expose files that map directly to [presumably
> > +shared] memory.
> > +
> > +The famfs framework manages coherency of its own metadata and structures,
> > +but does not attempt to manage coherency for applications.
> > +
> > +Famfs also provides data isolation between files. That is, even though
> > +the host has access to an entire memory "device" (as a devdax device), apps
> > +cannot write to memory for which the file is read-only, and mapping one
> > +file provides isolation from the memory of all other files. This is pretty
> > +basic, but some experimental shared memory usage patterns provide no such
> > +isolation.
> > +
> > +Principles of Operation
> > +=======================
> > +
> > +Famfs is a file system with one or more devdax devices as a first-class
> > +backing device(s). Metadata maintenance and query operations happen
> > +entirely in user space.
> > +
> > +The famfs low-level fuse server daemon provides file maps (fmaps) and
> > +devdax device info to the fuse/famfs kernel component so that
> > +read/write/mapping faults can be handled without up-calls for all active
> > +files.
> > +
> > +The famfs user space is responsible for maintaining and distributing
> > +consistent metadata. This is currently handled via an append-only
> > +metadata log within the memory, but this is orthogonal to the fuse/famfs
> > +kernel code.
> > +
> > +Once instantiated, "the same file" on each host points to the same shared
> > +memory, but in-memory metadata (inodes, etc.) is ephemeral on each host
> > +that has a famfs instance mounted. Use cases are free to allow or not
> > +allow mutations to data on a file-by-file basis.
> > +
> > +When an app accesses a data object in a famfs file, there is no page cache
> > +involvement. The CPU cache is loaded directly from the shared memory. In
> > +some use cases, this is an enormous reduction in read amplification compared
> > +to loading an entire page into the page cache.
> > +
> > +
> > +Famfs is Not a Conventional File System
> > +---------------------------------------
> > +
> > +Famfs files can be accessed by conventional means, but there are
> > +limitations. The kernel component of fuse/famfs is not involved in the
> > +allocation of backing memory for files at all; the famfs user space
> > +creates files and responds as a low-level fuse server with fmaps and
> > +devdax device info upon request.
> > +
> > +Famfs differs in some important ways from conventional file systems:
> > +
> > +* Files must be pre-allocated by the famfs framework; Allocation is never
> 
>                                                          allocation
> 
> > +  performed on (or after) write.
> > +* Any operation that changes a file's size is considered to put the file
> > +  in an invalid state, disabling access to the data. It may be possible to
> > +  revisit this in the future. (Typically the famfs user space can restore
> > +  files to a valid state by replaying the famfs metadata log.)
> > +
> > +Famfs exists to apply the existing file system abstractions to shared
> > +memory so applications and workflows can more easily adapt to an
> > +environment with disaggregated shared memory.
> 
> 
> -- 
> ~Randy
> 

Both edits applied to the -next branch for the patch set. Thanks!


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 13/19] famfs_fuse: Create files with famfs fmaps
  2025-04-28  1:48       ` John Groves
@ 2025-04-28 19:00         ` Darrick J. Wong
  2025-05-06 16:56           ` Miklos Szeredi
  0 siblings, 1 reply; 58+ messages in thread
From: Darrick J. Wong @ 2025-04-28 19:00 UTC (permalink / raw)
  To: John Groves
  Cc: Dan Williams, Miklos Szeredi, Bernd Schubert, John Groves,
	Jonathan Corbet, Vishal Verma, Dave Jiang, Matthew Wilcox,
	Jan Kara, Alexander Viro, Christian Brauner, Luis Henriques,
	Randy Dunlap, Jeff Layton, Kent Overstreet, Petr Vorel,
	Brian Foster, linux-doc, linux-kernel, nvdimm, linux-cxl,
	linux-fsdevel, Amir Goldstein, Jonathan Cameron, Stefan Hajnoczi,
	Joanne Koong, Josef Bacik, Aravind Ramesh, Ajay Joshi

On Sun, Apr 27, 2025 at 08:48:30PM -0500, John Groves wrote:
> On 25/04/24 07:38AM, Darrick J. Wong wrote:
> > On Thu, Apr 24, 2025 at 08:43:33AM -0500, John Groves wrote:
> > > On 25/04/20 08:33PM, John Groves wrote:
> > > > On completion of GET_FMAP message/response, setup the full famfs
> > > > metadata such that it's possible to handle read/write/mmap directly to
> > > > dax. Note that the devdax_iomap plumbing is not in yet...
> > > > 
> > > > Update MAINTAINERS for the new files.
> > > > 
> > > > Signed-off-by: John Groves <john@groves.net>
> > > > ---
> > > >  MAINTAINERS               |   9 +
> > > >  fs/fuse/Makefile          |   2 +-
> > > >  fs/fuse/dir.c             |   3 +
> > > >  fs/fuse/famfs.c           | 344 ++++++++++++++++++++++++++++++++++++++
> > > >  fs/fuse/famfs_kfmap.h     |  63 +++++++
> > > >  fs/fuse/fuse_i.h          |  16 +-
> > > >  fs/fuse/inode.c           |   2 +-
> > > >  include/uapi/linux/fuse.h |  42 +++++
> > > >  8 files changed, 477 insertions(+), 4 deletions(-)
> > > >  create mode 100644 fs/fuse/famfs.c
> > > >  create mode 100644 fs/fuse/famfs_kfmap.h
> > > > 
> > 
> > <snip>
> > 
> > > > diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> > > > index d85fb692cf3b..0f6ff1ffb23d 100644
> > > > --- a/include/uapi/linux/fuse.h
> > > > +++ b/include/uapi/linux/fuse.h
> > > > @@ -1286,4 +1286,46 @@ struct fuse_uring_cmd_req {
> > > >  	uint8_t padding[6];
> > > >  };
> > > >  
> > > > +/* Famfs fmap message components */
> > > > +
> > > > +#define FAMFS_FMAP_VERSION 1
> > > > +
> > > > +#define FUSE_FAMFS_MAX_EXTENTS 2
> > > > +#define FUSE_FAMFS_MAX_STRIPS 16
> > > 
> > > FYI, after thinking through the conversation with Darrick,  I'm planning 
> > > to drop FUSE_FAMFS_MAX_(EXTENTS|STRIPS) in the next version.  In the 
> > > response to GET_FMAP, it's the structures below serialized into a message 
> > > buffer. If it fits, it's good - and if not it's invalid. When the
> > > in-memory metadata (defined in famfs_kfmap.h) gets assembled, if there is
> > > a reason to apply limits it can be done - but I don't currently see a reason
> > > > to do that (so if I'm currently enforcing limits there, I'll probably drop
> > > > that).
> > 
> > You could also define GET_FMAP to have an offset in the request buffer,
> > and have the famfs daemon send back the next offset at the end of its
> > reply (or -1ULL to stop).  Then the kernel can call GET_FMAP again with
> > that new offset to get more mappings.
> > 
> > Though at this point maybe it should go the /other/ way, where the fuse
> > > server can send a "notification" to the kernel to populate its mapping
> > data?  fuse already defines a handful of notifications for invalidating
> > pagecache and directory links.
> > 
> > (Ugly wart: notifications aren't yet implemented for the iouring channel)
> 
> I don't have fully-formed thoughts about notifications yet; thinking...

Me neither.  The existing ones seem like they /could/ be useful for 

> If the fmap stuff may be shared by more than one use case (as has always
> seemed possible), it's a good idea to think through a couple of things: 
> 1) is there anything important missing from this general approach, and 

Well for general iomap caching, I think we'd need to pull in a lot more
of the iomap fields:

struct fuse_iomap {
	u64		addr;	/* disk offset of mapping, bytes */
	loff_t		offset;	/* file offset of mapping, bytes */
	u64		length;	/* length of mapping, bytes */
	u16		type;	/* type of mapping */
	u16		flags;	/* flags for mapping */
	u32		devindex;
	u64		validity_cookie; /* used with .iomap_valid() */
};

fuse would use devindex to find the block_device/dax_device, but
otherwise the fields are exactly the same as struct iomap.  Given that
this is exposed to userspace we'd probably want to add some padding.

The validity cookie I'm not 100% sure about -- buffered IO uses it to
detect stale iomappings after we've locked a folio for write, having
dropped whatever locks protect the iomappings.  The ->iomap_valid
function compares the iomap::validity_cookie against some internal magic
value (this would have to be the iomap cache) to decide if revalidation
is needed.

One way to make this work is to implement the cookie entirely within the
fuse-iomap cache itself -- every time a new mapping comes in (or a range
gets invalidated) the cache bumps its cookie.  The fuse server doesn't
have to implement the cookie itself, but it will have to push a new
mapping or invalidate something every time the mappings change.

Another way would be to have the fuse server implement the cookie
itself, but now we have to find a way to have the kernel and userspace
share a piece of memory where the cookie lives.  I don't like this
option, but it does give the fuse server direct control over when the
cookie value changes.
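
To make the first option concrete, here is a minimal sketch of a cookie
kept entirely inside the cache. It is an illustration under assumptions,
not proposed kernel code: demo_iomap_cache, demo_iomap and the three
helpers are hypothetical names, and locking is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-inode mapping cache; only the cookie is modelled. */
struct demo_iomap_cache {
	uint64_t cookie;	/* bumped whenever the cached mappings change */
};

/* Hypothetical mapping handed out to an I/O path. */
struct demo_iomap {
	uint64_t addr;
	uint64_t offset;
	uint64_t length;
	uint64_t validity_cookie;	/* cache cookie at sampling time */
};

/* Called when a new mapping is pushed or a range is invalidated. */
static void demo_cache_changed(struct demo_iomap_cache *c)
{
	assert(c);
	c->cookie++;
}

/* Called when a mapping is looked up: remember the current cookie. */
static void demo_cache_sample(const struct demo_iomap_cache *c,
			      struct demo_iomap *m)
{
	assert(c && m);
	m->validity_cookie = c->cookie;
}

/* Plays the role of ->iomap_valid(): is the sampled mapping still current? */
static bool demo_iomap_valid(const struct demo_iomap_cache *c,
			     const struct demo_iomap *m)
{
	assert(c && m);
	return m->validity_cookie == c->cookie;
}
```

A writer would sample the mapping, drop the mapping locks, lock the
folio, and then call the validity check; a false result means the
mapping has to be looked up again.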

> 2) do you need to *partially* cache fmaps? (or is the "offset" idea above 
> just to deal with an fmap that might otherwise overflow a response size?)

It's mostly to cap the amount of mapping data being copied into the
kernel in a specific GET_FMAP call.  For famfs I don't think you have
that many mappings, but for (say) an XFS filesystem there could be
billions of them.

Though at that point it might make more sense to populate the cache
piecemeal as file IO actually happens.

I wouldn't split an existing mapping, FWIW.  Think "I have 1,000,000
mappings and I'm only going to upload them 1,000 at a time", not "I'm
going to upload mappings for 100MB worth of file range at a time".
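
A rough sketch of that batching, modelled in plain C rather than as fuse
code. Assumptions: the cursor counts whole mappings (so no mapping is
ever split), the batch size is 4 instead of 1,000, and all demo_* names
and the reply layout are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_FMAP_END	UINT64_MAX	/* the "-1ULL" stop marker */
#define DEMO_BATCH	4		/* mappings per reply in this sketch */

/* Hypothetical mapping record; not the famfs wire format. */
struct demo_extent {
	uint64_t file_offset;
	uint64_t dev_offset;
	uint64_t len;
};

/* Hypothetical reply: a batch of whole mappings plus the next cursor. */
struct demo_fmap_reply {
	size_t nr;
	uint64_t next;		/* next cursor, or DEMO_FMAP_END when done */
	struct demo_extent ext[DEMO_BATCH];
};

/* Stand-in for the fuse server: serve one batch starting at cursor. */
static void demo_server_get_fmap(const struct demo_extent *all, size_t total,
				 uint64_t cursor, struct demo_fmap_reply *out)
{
	size_t i = 0;

	while (i < DEMO_BATCH && cursor + i < total) {
		out->ext[i] = all[cursor + i];
		i++;
	}
	out->nr = i;
	out->next = (cursor + i < total) ? cursor + i : DEMO_FMAP_END;
}

/* Stand-in for the kernel side: repeat the request until told to stop. */
static size_t demo_fetch_all(const struct demo_extent *all, size_t total,
			     struct demo_extent *cache, size_t cache_cap)
{
	struct demo_fmap_reply reply;
	uint64_t cursor = 0;
	size_t cached = 0;

	while (cursor != DEMO_FMAP_END) {
		demo_server_get_fmap(all, total, cursor, &reply);
		for (size_t i = 0; i < reply.nr; i++) {
			assert(cached < cache_cap);
			cache[cached++] = reply.ext[i];
		}
		/* A server that does not advance the cursor would loop forever. */
		if (reply.next != DEMO_FMAP_END && reply.next <= cursor)
			break;
		cursor = reply.next;
	}
	return cached;
}
```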

> The current approach lets the kernel retrieve and cache simple and 
> interleaved fmaps (and BTW interleaved can be multi-dev or single-dev - 
> there are current weird cases where that's useful). Also too, FWIW everything
> that can be done with simple ext list fmaps can be done with a collection
> of interleaved extents, each with strip count = 1. But I think there is a
> worthwhile clarity to having both.

<nod> I don't know what Miklos' opinion is about having multiple
fusecmds that do similar things -- on the one hand keeping yours and my
efforts separate explodes the amount of userspace abi that everyone must
maintain, but on the other hand it then doesn't couple our projects
together, which might be a good thing if it turns out that our domain
models are /really/ actually quite different.

(Especially because I suspect that interleaving is the norm for memory,
whereas we try to avoid that for disk filesystems.)

> But the current implementation does not contemplate partially cached fmaps.
> 
> Adding notification could address revoking them post-haste (is that why
> you're thinking about notifications? And if not can you elaborate on what
> you're after there?).

Yeah, invalidating the mapping cache at random places.  If, say, you
implement a clustered filesystem with iomap, the metadata server could
inform the fuse server on the local node that a certain range of inode X
has been written to, at which point you need to revoke any local leases,
invalidate the pagecache, and invalidate the iomapping cache to force
the client to requery the server.

Or if your fuse server wants to implement its own weird operations (e.g.
XFS EXCHANGE-RANGE) this would make that possible without needing to
add a bunch of code to fs/fuse/ for the benefit of a single fuse driver.

--D

> 
> > 
> > --D
> 
> Cheers,
> John
> 
> 

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 00/19] famfs: port into fuse
  2025-04-21  1:33 [RFC PATCH 00/19] famfs: port into fuse John Groves
                   ` (19 preceding siblings ...)
  2025-04-21 18:27 ` [RFC PATCH 00/19] famfs: port into fuse Darrick J. Wong
@ 2025-04-30 14:42 ` Alireza Sanaee
  2025-05-01  2:13   ` John Groves
  2025-05-21 22:30 ` John Groves
  21 siblings, 1 reply; 58+ messages in thread
From: Alireza Sanaee @ 2025-04-30 14:42 UTC (permalink / raw)
  To: John Groves
  Cc: Dan Williams, Miklos Szeredi, Bernd Schubert, John Groves,
	Jonathan Corbet, Vishal Verma, Dave Jiang, Matthew Wilcox,
	Jan Kara, Alexander Viro, Christian Brauner, Darrick J . Wong,
	Luis Henriques, Randy Dunlap, Jeff Layton, Kent Overstreet,
	Petr Vorel, Brian Foster, linux-doc, linux-kernel, nvdimm,
	linux-cxl, linux-fsdevel, Amir Goldstein, Jonathan Cameron,
	Stefan Hajnoczi, Joanne Koong, Josef Bacik, Aravind Ramesh,
	Ajay Joshi

On Sun, 20 Apr 2025 20:33:27 -0500
John Groves <John@Groves.net> wrote:

> Subject: famfs: port into fuse
> 
> This is the initial RFC for the fabric-attached memory file system
> (famfs) integration into fuse. In order to function, this requires a
> related patch to libfuse [1] and the famfs user space [2]. 
> 
> This RFC is mainly intended to socialize the approach and get
> feedback from the fuse developers and maintainers. There is some dax
> work that needs to be done before this should be merged (see the
> "poisoned page|folio problem" below).
> 
> This patch set fully works with Linux 6.14 -- passing all existing
> famfs smoke and unit tests -- and I encourage existing famfs users to
> test it.
> 
> This is really two patch sets mashed up:
> 
> * The patches with the dev_dax_iomap: prefix fill in missing
> functionality for devdax to host an fs-dax file system.
> * The famfs_fuse: patches add famfs into fs/fuse/. These are
> effectively unchanged since last year.
> 
> Because this is not ready to merge yet, I have felt free to leave
> some debug prints in place because we still find them useful; those
> will be cleaned up in a subsequent revision.
> 
> Famfs Overview
> 
> Famfs exposes shared memory as a file system. Famfs consumes shared
> memory from dax devices, and provides memory-mappable files that map
> directly to the memory - no page cache involvement. Famfs differs
> from conventional file systems in fs-dax mode, in that it handles
> in-memory metadata in a sharable way (which begins with never caching
> dirty shared metadata).
> 
> Famfs started as a standalone file system [3,4], but the consensus at
> LSFMM 2024 [5] was that it should be ported into fuse - and this RFC
> is the first public evidence that I've been working on that.
> 
> The key performance requirement is that famfs must resolve mapping
> faults without upcalls. This is achieved by fully caching the
> file-to-devdax metadata for all active files. This is done via two
> fuse client/server message/response pairs: GET_FMAP and GET_DAXDEV.
> 
> Famfs remains the first fs-dax file system that is backed by devdax
> rather than pmem in fs-dax mode (hence the need for the dev_dax_iomap
> fixups).
> 
> Notes
> 
> * Once the dev_dax_iomap patches land, I suspect it may make sense for
>   virtiofs to update to use the improved interface.
> 
> * I'm currently maintaining compatibility between the famfs user
> space and both the standalone famfs kernel file system and this new
> fuse implementation. In the near future I'll be running performance
> comparisons and sharing them - but there is no reason to expect
> significant degradation with fuse, since famfs caches entire "fmaps"
> in the kernel to resolve faults with no upcalls. This patch has a bit
> too much debug turned on to do that testing quite yet. A branch 
> 
> * Two new fuse messages / responses are added: GET_FMAP and
> GET_DAXDEV.
> 
> * When a file is looked up in a famfs mount, the LOOKUP is followed
> by a GET_FMAP message and response. The "fmap" is the full
> file-to-dax mapping, allowing the fuse/famfs kernel code to handle
> read/write/fault without any upcalls.
> 
> * After each GET_FMAP, the fmap is checked for extents that reference
>   previously-unknown daxdevs. Each such occurrence is handled with a
>   GET_DAXDEV message and response.
> 
> * Daxdevs are stored in a table (which might become an xarray at some
> point). When entries are added to the table, we acquire exclusive
> access to the daxdev via the fs_dax_get() call (modeled after how
> fs-dax handles this with pmem devices). famfs provides
> holder_operations to devdax, providing a notification path in the
> event of memory errors.
> 
> * If devdax notifies famfs of memory errors on a dax device, famfs
> currently blocks all subsequent accesses to data on that device. The
> recovery is to re-initialize the memory and file system. Famfs is
> memory, not storage...
> 
> * Because famfs uses backing (devdax) devices, only privileged mounts
> are supported.
> 
> * The famfs kernel code never accesses the memory directly - it only
>   facilitates read, write and mmap on behalf of user processes. As
> such, the RAS of the shared memory affects applications, but not the
> kernel.
> 
> * Famfs has backing device(s), but they are devdax (char) rather than
>   block. Right now there is no way to tell the vfs layer that famfs
> has a char backing device (unless we say it's block, but it's not).
> Currently we use the standard anonymous fuse fs_type - but I'm not
> sure that's ultimately optimal (thoughts?)
> 
> The "poisoned page|folio problem"
> 
> * Background: before doing a kernel mount, the famfs user space [2]
> validates the superblock and log. This is done via raw mmap of the
> primary devdax device. If valid, the file system is mounted, and the
> superblock and log get exposed through a pair of files
> (.meta/.superblock and .meta/.log) - because we can't be using raw
> device mmap when a file system is mounted on the device. But this
> exposes a devdax bug and warning...
> 
> * Pages that have been memory mapped via devdax are left in a
> permanently problematic state. Devdax sets page|folio->mapping when a
> page is accessed via raw devdax mmap (as famfs does before mount),
> but never cleans it up. When the pages of the famfs superblock and
> log are accessed via the "meta" files after mount, we see a
> WARN_ONCE() in dax_insert_entry(), which notices that
> page|folio->mapping is still set. I intend to address this prior to
> asking for the famfs patches to be merged.
> 
> * Alistair Popple's recent dax patch series [6], which has been merged
>   for 6.15, addresses some dax issues, but sadly does not fix the
> poisoned page|folio problem - its enhanced refcount checking turns
> the warning into an error.
> 
> * This 6.14 patch set disables the warning; a proper fix will be
> required for famfs to work at all in 6.15. Dan W. and I are actively
> discussing how to do this properly...
> 
> * In terms of the correct functionality of famfs, the warning can be
> ignored.
> 
> References
> 
> [1] - https://github.com/libfuse/libfuse/pull/1200
> [2] - https://github.com/cxl-micron-reskit/famfs
> [3] - https://lore.kernel.org/linux-cxl/cover.1708709155.git.john@groves.net/
> [4] - https://lore.kernel.org/linux-cxl/cover.1714409084.git.john@groves.net/
> [5] - https://lwn.net/Articles/983105/
> [6] - https://lore.kernel.org/linux-cxl/cover.8068ad144a7eea4a813670301f4d2a86a8e68ec4.1740713401.git-series.apopple@nvidia.com/
> 
> 
> John Groves (19):
>   dev_dax_iomap: Move dax_pgoff_to_phys() from device.c to bus.c
>   dev_dax_iomap: Add fs_dax_get() func to prepare dax for fs-dax usage
>   dev_dax_iomap: Save the kva from memremap
>   dev_dax_iomap: Add dax_operations for use by fs-dax on devdax
>   dev_dax_iomap: export dax_dev_get()
>   dev_dax_iomap: (ignore!) Drop poisoned page warning in fs/dax.c
>   famfs_fuse: magic.h: Add famfs magic numbers
>   famfs_fuse: Kconfig
>   famfs_fuse: Update macro s/FUSE_IS_DAX/FUSE_IS_VIRTIO_DAX/
>   famfs_fuse: Basic fuse kernel ABI enablement for famfs
>   famfs_fuse: Basic famfs mount opts
>   famfs_fuse: Plumb the GET_FMAP message/response
>   famfs_fuse: Create files with famfs fmaps
>   famfs_fuse: GET_DAXDEV message and daxdev_table
>   famfs_fuse: Plumb dax iomap and fuse read/write/mmap
>   famfs_fuse: Add holder_operations for dax notify_failure()
>   famfs_fuse: Add famfs metadata documentation
>   famfs_fuse: Add documentation
>   famfs_fuse: (ignore) debug cruft
> 
>  Documentation/filesystems/famfs.rst |  142 ++++
>  Documentation/filesystems/index.rst |    1 +
>  MAINTAINERS                         |   10 +
>  drivers/dax/Kconfig                 |    6 +
>  drivers/dax/bus.c                   |  144 +++-
>  drivers/dax/dax-private.h           |    1 +
>  drivers/dax/device.c                |   38 +-
>  drivers/dax/super.c                 |   33 +-
>  fs/dax.c                            |    1 -
>  fs/fuse/Kconfig                     |   13 +
>  fs/fuse/Makefile                    |    4 +-
>  fs/fuse/dev.c                       |   61 ++
>  fs/fuse/dir.c                       |   74 +-
>  fs/fuse/famfs.c                     | 1105 +++++++++++++++++++++++++++
>  fs/fuse/famfs_kfmap.h               |  166 ++++
>  fs/fuse/file.c                      |   27 +-
>  fs/fuse/fuse_i.h                    |   67 +-
>  fs/fuse/inode.c                     |   49 +-
>  fs/fuse/iomode.c                    |    2 +-
>  fs/namei.c                          |    1 +
>  include/linux/dax.h                 |    6 +
>  include/uapi/linux/fuse.h           |   63 ++
>  include/uapi/linux/magic.h          |    2 +
>  23 files changed, 1973 insertions(+), 43 deletions(-)
>  create mode 100644 Documentation/filesystems/famfs.rst
>  create mode 100644 fs/fuse/famfs.c
>  create mode 100644 fs/fuse/famfs_kfmap.h
> 
> 
> base-commit: 38fec10eb60d687e30c8c6b5420d86e8149f7557

Hi John,

Apologies if the question is far off or irrelevant.

I am trying to understand FAMFS, and I am wondering where FAMFS
stands when compared to OpenSHMEM PGAS. Can't we have an OpenSHMEM-based
shared memory implementation over CXL that serves as FAMFS?

Maybe FAMFS does more than that!?!

Thanks,
Alireza


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 00/19] famfs: port into fuse
  2025-04-30 14:42 ` Alireza Sanaee
@ 2025-05-01  2:13   ` John Groves
  0 siblings, 0 replies; 58+ messages in thread
From: John Groves @ 2025-05-01  2:13 UTC (permalink / raw)
  To: Alireza Sanaee
  Cc: Dan Williams, Miklos Szeredi, Bernd Schubert, John Groves,
	Jonathan Corbet, Vishal Verma, Dave Jiang, Matthew Wilcox,
	Jan Kara, Alexander Viro, Christian Brauner, Darrick J . Wong,
	Luis Henriques, Randy Dunlap, Jeff Layton, Kent Overstreet,
	Petr Vorel, Brian Foster, linux-doc, linux-kernel, nvdimm,
	linux-cxl, linux-fsdevel, Amir Goldstein, Jonathan Cameron,
	Stefan Hajnoczi, Joanne Koong, Josef Bacik, Aravind Ramesh,
	Ajay Joshi

On 25/04/30 03:42PM, Alireza Sanaee wrote:
> On Sun, 20 Apr 2025 20:33:27 -0500
> John Groves <John@Groves.net> wrote:
> 
>> <snip>
> 
> Hi John,
> 
> Apologies if the question is far off or irrelevant.
> 
> I am trying to understand FAMFS, and I am thinking where does FAMFS
> stand when compared to OpenSHMEM PGAS. Can't we have a OpenSHMEM-based
> shared memory implementation over CXL that serves as FAMFS?
> 
> Maybe FAMFS does more than that!?!
> 
> Thanks,
> Alireza
>

Continuation of this conversation likely belongs in the discussion section
at [1], but a couple of thoughts.

Famfs provides scale-out filesystem mounts where the files map to the
same disaggregated shared memory. If you mmap a famfs file, you are accessing
the memory directly. Since shmem is file-backed (usually tmpfs or
its ilk), shmem is a higher-level and more specialized abstraction, and
OpenSHMEM may be able to run atop famfs. It looks like OpenSHMEM and PGAS
cover the possibility that "shared memory" might require grabbing a copy via
[r]dma - which famfs will probably never do. Famfs only handles cases where
the memory is actually shared. (hey, I work for a memory company.)

Since famfs provides memory-mappable files, almost all apps can access them
(no requirement to write to the shmem, or other related but more esoteric
interfaces). Apps are responsible for not doing "nonsense" access WRT cache
coherency, but famfs manages cache coherency for its metadata.

The video at [2] may be useful to get up to speed.

[1] http://github.com/cxl-micron-reskit/famfs
[2] https://www.youtube.com/watch?v=L1QNpb-8VgM&t=1680


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 12/19] famfs_fuse: Plumb the GET_FMAP message/response
  2025-04-21  1:33 ` [RFC PATCH 12/19] famfs_fuse: Plumb the GET_FMAP message/response John Groves
@ 2025-05-02  5:48   ` Joanne Koong
  2025-05-02 20:35     ` Darrick J. Wong
  2025-05-12 16:28     ` John Groves
  0 siblings, 2 replies; 58+ messages in thread
From: Joanne Koong @ 2025-05-02  5:48 UTC (permalink / raw)
  To: John Groves
  Cc: Dan Williams, Miklos Szeredi, Bernd Schubert, John Groves,
	Jonathan Corbet, Vishal Verma, Dave Jiang, Matthew Wilcox,
	Jan Kara, Alexander Viro, Christian Brauner, Darrick J . Wong,
	Luis Henriques, Randy Dunlap, Jeff Layton, Kent Overstreet,
	Petr Vorel, Brian Foster, linux-doc, linux-kernel, nvdimm,
	linux-cxl, linux-fsdevel, Amir Goldstein, Jonathan Cameron,
	Stefan Hajnoczi, Josef Bacik, Aravind Ramesh, Ajay Joshi

On Sun, Apr 20, 2025 at 6:34 PM John Groves <John@groves.net> wrote:
>
> Upon completion of a LOOKUP, if we're in famfs-mode we do a GET_FMAP to
> retrieve and cache up the file-to-dax map in the kernel. If this
> succeeds, read/write/mmap are resolved direct-to-dax with no upcalls.
>
> Signed-off-by: John Groves <john@groves.net>
> ---
>  fs/fuse/dir.c             | 69 +++++++++++++++++++++++++++++++++++++++
>  fs/fuse/fuse_i.h          | 36 +++++++++++++++++++-
>  fs/fuse/inode.c           | 15 +++++++++
>  include/uapi/linux/fuse.h |  4 +++
>  4 files changed, 123 insertions(+), 1 deletion(-)
>
> diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
> index bc29db0117f4..ae135c55b9f6 100644
> --- a/fs/fuse/dir.c
> +++ b/fs/fuse/dir.c
> @@ -359,6 +359,56 @@ bool fuse_invalid_attr(struct fuse_attr *attr)
>         return !fuse_valid_type(attr->mode) || !fuse_valid_size(attr->size);
>  }
>
> +#define FMAP_BUFSIZE 4096
> +
> +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> +static void
> +fuse_get_fmap_init(
> +       struct fuse_conn *fc,
> +       struct fuse_args *args,
> +       u64 nodeid,
> +       void *outbuf,
> +       size_t outbuf_size)
> +{
> +       memset(outbuf, 0, outbuf_size);

I think we can skip the memset here since kcalloc will zero out the
memory automatically when the fmap_buf gets allocated

> +       args->opcode = FUSE_GET_FMAP;
> +       args->nodeid = nodeid;
> +
> +       args->in_numargs = 0;
> +
> +       args->out_numargs = 1;
> +       args->out_args[0].size = FMAP_BUFSIZE;
> +       args->out_args[0].value = outbuf;
> +}
> +
> +static int
> +fuse_get_fmap(struct fuse_mount *fm, struct inode *inode, u64 nodeid)
> +{
> +       size_t fmap_size;
> +       void *fmap_buf;
> +       int err;
> +
> +       pr_notice("%s: nodeid=%lld, inode=%llx\n", __func__,
> +                 nodeid, (u64)inode);
> +       fmap_buf = kcalloc(1, FMAP_BUFSIZE, GFP_KERNEL);
> +       FUSE_ARGS(args);
> +       fuse_get_fmap_init(fm->fc, &args, nodeid, fmap_buf, FMAP_BUFSIZE);
> +
> +       /* Send GET_FMAP command */
> +       err = fuse_simple_request(fm, &args);

I'm assuming the fmap_buf gets freed in a later patch, but for this
one we'll probably need a kfree(fmap_buf) here in the meantime?

> +       if (err) {
> +               pr_err("%s: err=%d from fuse_simple_request()\n",
> +                      __func__, err);
> +               return err;
> +       }
> +
> +       fmap_size = args.out_args[0].size;
> +       pr_notice("%s: nodei=%lld fmap_size=%ld\n", __func__, nodeid, fmap_size);
> +
> +       return 0;
> +}
> +#endif
> +
>  int fuse_lookup_name(struct super_block *sb, u64 nodeid, const struct qstr *name,
>                      struct fuse_entry_out *outarg, struct inode **inode)
>  {
> @@ -404,6 +454,25 @@ int fuse_lookup_name(struct super_block *sb, u64 nodeid, const struct qstr *name
>                 fuse_queue_forget(fm->fc, forget, outarg->nodeid, 1);
>                 goto out;
>         }
> +
> +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> +       if (fm->fc->famfs_iomap) {
> +               if (S_ISREG((*inode)->i_mode)) {
> +                       /* Note Lookup returns the looked-up inode in the attr
> +                        * struct, but not in outarg->nodeid !
> +                        */
> +                       pr_notice("%s: outarg: size=%d nodeid=%lld attr.ino=%lld\n",
> +                                __func__, args.out_args[0].size, outarg->nodeid,
> +                                outarg->attr.ino);
> +                       /* Get the famfs fmap */
> +                       fuse_get_fmap(fm, *inode, outarg->attr.ino);

I agree with Darrick's comment about fetching the mappings only if the
file gets opened. I wonder though if we could bundle the open with the
get_fmap so that we don't have to do an additional request / incur 2
extra context switches. This seems feasible to me. When we send the
open request, we could check if fc->famfs_iomap is set and if so, set
inarg.open_flags to include FUSE_OPEN_GET_FMAP and set outarg.value to
an allocated buffer that holds both struct fuse_open_out and the
fmap_buf and adjust outarg.size accordingly. Then the server could
send both the open and corresponding fmap data in the reply.
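
To make the buffer layout concrete, here is a minimal userspace sketch of
the combined reply (not kernel code). Everything in it is an assumption
made for illustration: the numeric value of FUSE_OPEN_GET_FMAP, the
open_out_model struct (a local stand-in for struct fuse_open_out), and
both helper names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag value; the thread only proposes the name. */
#define FUSE_OPEN_GET_FMAP	(1u << 31)
#define FMAP_BUFSIZE		4096

/* Local stand-in for struct fuse_open_out (illustration only). */
struct open_out_model {
	uint64_t fh;
	uint32_t open_flags;
	uint32_t padding;
};

/* One allocation holding the open reply followed by the fmap bytes. */
struct open_fmap_reply {
	struct open_out_model open;
	unsigned char fmap[FMAP_BUFSIZE];
};

/* Ask for the fmap only when the connection is in famfs mode. */
static uint32_t open_flags_for(int famfs_iomap, uint32_t flags)
{
	return famfs_iomap ? (flags | FUSE_OPEN_GET_FMAP) : flags;
}

/*
 * Split a reply of reply_len bytes into its two parts.  Returns the
 * number of fmap bytes the server supplied, or -1 if the reply is too
 * short to hold the open header or longer than the buffer.
 */
static long split_open_fmap_reply(const struct open_fmap_reply *r,
				  size_t reply_len,
				  const unsigned char **fmap_out)
{
	if (reply_len < sizeof(r->open) || reply_len > sizeof(*r))
		return -1;
	*fmap_out = r->fmap;
	return (long)(reply_len - sizeof(r->open));
}
```

A reply that carries only the open header yields zero fmap bytes, so a
server that ignores the flag would still produce a well-formed answer
under this layout.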

> +               } else
> +                       pr_notice("%s: no get_fmap for non-regular file\n",
> +                                __func__);
> +       } else
> +               pr_notice("%s: fc->dax_iomap is not set\n", __func__);
> +#endif
> +
>         err = 0;
>
>   out_put_forget:
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index 931613102d32..437177c2f092 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -193,6 +193,10 @@ struct fuse_inode {
>         /** Reference to backing file in passthrough mode */
>         struct fuse_backing *fb;
>  #endif
> +
> +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> +       void *famfs_meta;
> +#endif
>  };
>
>  /** FUSE inode state bits */
> @@ -942,6 +946,8 @@ struct fuse_conn {
>  #endif
>
>  #if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> +       struct rw_semaphore famfs_devlist_sem;
> +       struct famfs_dax_devlist *dax_devlist;
>         char *shadow;
>  #endif
>  };
> @@ -1432,11 +1438,14 @@ void fuse_free_conn(struct fuse_conn *fc);
>
>  /* dax.c */
>
> +static inline int fuse_file_famfs(struct fuse_inode *fi); /* forward */
> +
>  /* This macro is used by virtio_fs, but now it also needs to filter for
>   * "not famfs"
>   */
>  #define FUSE_IS_VIRTIO_DAX(fuse_inode) (IS_ENABLED(CONFIG_FUSE_DAX)    \
> -                                       && IS_DAX(&fuse_inode->inode))
> +                                       && IS_DAX(&fuse_inode->inode)   \
> +                                       && !fuse_file_famfs(fuse_inode))
>
>  ssize_t fuse_dax_read_iter(struct kiocb *iocb, struct iov_iter *to);
>  ssize_t fuse_dax_write_iter(struct kiocb *iocb, struct iov_iter *from);
> @@ -1547,4 +1556,29 @@ extern void fuse_sysctl_unregister(void);
>  #define fuse_sysctl_unregister()       do { } while (0)
>  #endif /* CONFIG_SYSCTL */
>
> +/* famfs.c */
> +static inline struct fuse_backing *famfs_meta_set(struct fuse_inode *fi,
> +                                                      void *meta)
> +{
> +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> +       return xchg(&fi->famfs_meta, meta);
> +#else
> +       return NULL;
> +#endif
> +}
> +
> +static inline void famfs_meta_free(struct fuse_inode *fi)
> +{
> +       /* Stub wil be connected in a subsequent commit */
> +}
> +
> +static inline int fuse_file_famfs(struct fuse_inode *fi)
> +{
> +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> +       return (fi->famfs_meta != NULL);

Does this need to be "return READ_ONCE(fi->famfs_meta) != NULL"?

> +#else
> +       return 0;
> +#endif
> +}
> +
>  #endif /* _FS_FUSE_I_H */
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index 7f4b73e739cb..848c8818e6f7 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -117,6 +117,9 @@ static struct inode *fuse_alloc_inode(struct super_block *sb)
>         if (IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
>                 fuse_inode_backing_set(fi, NULL);
>
> +       if (IS_ENABLED(CONFIG_FUSE_FAMFS_DAX))
> +               famfs_meta_set(fi, NULL);

"fi->famfs_meta = NULL;" looks simpler here

> +
>         return &fi->inode;
>
>  out_free_forget:
> @@ -138,6 +141,13 @@ static void fuse_free_inode(struct inode *inode)
>         if (IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
>                 fuse_backing_put(fuse_inode_backing(fi));
>
> +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> +       if (S_ISREG(inode->i_mode) && fi->famfs_meta) {
> +               famfs_meta_free(fi);
> +               famfs_meta_set(fi, NULL);
> +       }
> +#endif
> +
>         kmem_cache_free(fuse_inode_cachep, fi);
>  }
>
> @@ -1002,6 +1012,11 @@ void fuse_conn_init(struct fuse_conn *fc, struct fuse_mount *fm,
>         if (IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
>                 fuse_backing_files_init(fc);
>
> +       if (IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)) {
> +               pr_notice("%s: Kernel is FUSE_FAMFS_DAX capable\n", __func__);
> +               init_rwsem(&fc->famfs_devlist_sem);
> +       }

Should we only init this if the server chooses to opt into famfs (eg
if their init reply sets the FUSE_DAX_FMAP flag)? This imo seems to
belong more in process_init_reply().


Thanks,
Joanne
> +
>         INIT_LIST_HEAD(&fc->mounts);
>         list_add(&fm->fc_entry, &fc->mounts);
>         fm->fc = fc;
> diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> index f9e14180367a..d85fb692cf3b 100644
> --- a/include/uapi/linux/fuse.h
> +++ b/include/uapi/linux/fuse.h
> @@ -652,6 +652,10 @@ enum fuse_opcode {
>         FUSE_TMPFILE            = 51,
>         FUSE_STATX              = 52,
>
> +       /* Famfs / devdax opcodes */
> +       FUSE_GET_FMAP           = 53,
> +       FUSE_GET_DAXDEV         = 54,
> +
>         /* CUSE specific operations */
>         CUSE_INIT               = 4096,
>
> --
> 2.49.0
>


* Re: [RFC PATCH 12/19] famfs_fuse: Plumb the GET_FMAP message/response
  2025-05-02  5:48   ` Joanne Koong
@ 2025-05-02 20:35     ` Darrick J. Wong
  2025-05-12 16:28     ` John Groves
  1 sibling, 0 replies; 58+ messages in thread
From: Darrick J. Wong @ 2025-05-02 20:35 UTC (permalink / raw)
  To: Joanne Koong
  Cc: John Groves, Dan Williams, Miklos Szeredi, Bernd Schubert,
	John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Luis Henriques, Randy Dunlap, Jeff Layton, Kent Overstreet,
	Petr Vorel, Brian Foster, linux-doc, linux-kernel, nvdimm,
	linux-cxl, linux-fsdevel, Amir Goldstein, Jonathan Cameron,
	Stefan Hajnoczi, Josef Bacik, Aravind Ramesh, Ajay Joshi

On Thu, May 01, 2025 at 10:48:15PM -0700, Joanne Koong wrote:
> On Sun, Apr 20, 2025 at 6:34 PM John Groves <John@groves.net> wrote:
> >
> > Upon completion of a LOOKUP, if we're in famfs-mode we do a GET_FMAP to
> > retrieve and cache up the file-to-dax map in the kernel. If this
> > succeeds, read/write/mmap are resolved direct-to-dax with no upcalls.
> >
> > Signed-off-by: John Groves <john@groves.net>
> > ---
> >  fs/fuse/dir.c             | 69 +++++++++++++++++++++++++++++++++++++++
> >  fs/fuse/fuse_i.h          | 36 +++++++++++++++++++-
> >  fs/fuse/inode.c           | 15 +++++++++
> >  include/uapi/linux/fuse.h |  4 +++
> >  4 files changed, 123 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
> > index bc29db0117f4..ae135c55b9f6 100644
> > --- a/fs/fuse/dir.c
> > +++ b/fs/fuse/dir.c
> > @@ -359,6 +359,56 @@ bool fuse_invalid_attr(struct fuse_attr *attr)
> >         return !fuse_valid_type(attr->mode) || !fuse_valid_size(attr->size);
> >  }
> >
> > +#define FMAP_BUFSIZE 4096
> > +
> > +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> > +static void
> > +fuse_get_fmap_init(
> > +       struct fuse_conn *fc,
> > +       struct fuse_args *args,
> > +       u64 nodeid,
> > +       void *outbuf,
> > +       size_t outbuf_size)
> > +{
> > +       memset(outbuf, 0, outbuf_size);
> 
> I think we can skip the memset here since kcalloc will zero out the
> memory automatically when the fmap_buf gets allocated
> 
> > +       args->opcode = FUSE_GET_FMAP;
> > +       args->nodeid = nodeid;
> > +
> > +       args->in_numargs = 0;
> > +
> > +       args->out_numargs = 1;
> > +       args->out_args[0].size = FMAP_BUFSIZE;
> > +       args->out_args[0].value = outbuf;
> > +}
> > +
> > +static int
> > +fuse_get_fmap(struct fuse_mount *fm, struct inode *inode, u64 nodeid)
> > +{
> > +       size_t fmap_size;
> > +       void *fmap_buf;
> > +       int err;
> > +
> > +       pr_notice("%s: nodeid=%lld, inode=%llx\n", __func__,
> > +                 nodeid, (u64)inode);
> > +       fmap_buf = kcalloc(1, FMAP_BUFSIZE, GFP_KERNEL);
> > +       FUSE_ARGS(args);
> > +       fuse_get_fmap_init(fm->fc, &args, nodeid, fmap_buf, FMAP_BUFSIZE);
> > +
> > +       /* Send GET_FMAP command */
> > +       err = fuse_simple_request(fm, &args);
> 
> I'm assuming the fmap_buf gets freed in a later patch, but for this
> one we'll probably need a kfree(fmap_buf) here in the meantime?
> 
> > +       if (err) {
> > +               pr_err("%s: err=%d from fuse_simple_request()\n",
> > +                      __func__, err);
> > +               return err;
> > +       }
> > +
> > +       fmap_size = args.out_args[0].size;
> > +       pr_notice("%s: nodei=%lld fmap_size=%ld\n", __func__, nodeid, fmap_size);
> > +
> > +       return 0;
> > +}
> > +#endif
> > +
> >  int fuse_lookup_name(struct super_block *sb, u64 nodeid, const struct qstr *name,
> >                      struct fuse_entry_out *outarg, struct inode **inode)
> >  {
> > @@ -404,6 +454,25 @@ int fuse_lookup_name(struct super_block *sb, u64 nodeid, const struct qstr *name
> >                 fuse_queue_forget(fm->fc, forget, outarg->nodeid, 1);
> >                 goto out;
> >         }
> > +
> > +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> > +       if (fm->fc->famfs_iomap) {
> > +               if (S_ISREG((*inode)->i_mode)) {
> > +                       /* Note Lookup returns the looked-up inode in the attr
> > +                        * struct, but not in outarg->nodeid !
> > +                        */
> > +                       pr_notice("%s: outarg: size=%d nodeid=%lld attr.ino=%lld\n",
> > +                                __func__, args.out_args[0].size, outarg->nodeid,
> > +                                outarg->attr.ino);
> > +                       /* Get the famfs fmap */
> > +                       fuse_get_fmap(fm, *inode, outarg->attr.ino);
> 
> I agree with Darrick's comment about fetching the mappings only if the
> file gets opened. I wonder though if we could bundle the open with the
> get_fmap so that we don't have to do an additional request / incur 2
> extra context switches.

What's the intended lifetime of these files?  If we only have to do this
once per file lifetime then perhaps that amortizes towards zero?  That
said, I don't know how aggressively fuse reclaims the inode structures,
so maybe the need to GET_FMAP is more frequent?  AFAICT the fuse code
seems to use the regular lru so maybe that's not so bad.  But maybe the
kernel shouldn't ask for mappings until someone tries a file IO
operation so that walking the directory tree doesn't bog down in a bunch
of GET_FMAP operations.

It also occurred to me just now that this is a LOOKUP operation, which
makes me wonder what happens for other operations like creat().  But maybe the
order of operations for creat() is that you tell the famfs management
layer to create a file, it preallocates space and the directory tree,
and later you can just resolve the pathname to open it, at which point
fuse+famfs creates the in-kernel abstractions?

> This seems feasible to me. When we send the
> open request, we could check if fc->famfs_iomap is set and if so, set
> inarg.open_flags to include FUSE_OPEN_GET_FMAP and set outarg.value to
> an allocated buffer that holds both struct fuse_open_out and the
> fmap_buf and adjust outarg.size accordingly. Then the server could
> send both the open and corresponding fmap data in the reply.

Alternately, we could create a fuse "notification" that really is just a
means for the famfs open function to send mappings to the kernel.  But I
don't know if it makes much difference for the kernel to demand-page in
mapping information as needed vs just using GET_FMAP here.

> 
> > +               } else
> > +                       pr_notice("%s: no get_fmap for non-regular file\n",
> > +                                __func__);
> > +       } else
> > +               pr_notice("%s: fc->dax_iomap is not set\n", __func__);
> > +#endif
> > +
> >         err = 0;
> >
> >   out_put_forget:
> > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> > index 931613102d32..437177c2f092 100644
> > --- a/fs/fuse/fuse_i.h
> > +++ b/fs/fuse/fuse_i.h
> > @@ -193,6 +193,10 @@ struct fuse_inode {
> >         /** Reference to backing file in passthrough mode */
> >         struct fuse_backing *fb;
> >  #endif
> > +
> > +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> > +       void *famfs_meta;
> > +#endif
> >  };
> >
> >  /** FUSE inode state bits */
> > @@ -942,6 +946,8 @@ struct fuse_conn {
> >  #endif
> >
> >  #if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> > +       struct rw_semaphore famfs_devlist_sem;
> > +       struct famfs_dax_devlist *dax_devlist;
> >         char *shadow;
> >  #endif
> >  };
> > @@ -1432,11 +1438,14 @@ void fuse_free_conn(struct fuse_conn *fc);
> >
> >  /* dax.c */
> >
> > +static inline int fuse_file_famfs(struct fuse_inode *fi); /* forward */
> > +
> >  /* This macro is used by virtio_fs, but now it also needs to filter for
> >   * "not famfs"
> >   */
> >  #define FUSE_IS_VIRTIO_DAX(fuse_inode) (IS_ENABLED(CONFIG_FUSE_DAX)    \
> > -                                       && IS_DAX(&fuse_inode->inode))
> > +                                       && IS_DAX(&fuse_inode->inode)   \
> > +                                       && !fuse_file_famfs(fuse_inode))
> >
> >  ssize_t fuse_dax_read_iter(struct kiocb *iocb, struct iov_iter *to);
> >  ssize_t fuse_dax_write_iter(struct kiocb *iocb, struct iov_iter *from);
> > @@ -1547,4 +1556,29 @@ extern void fuse_sysctl_unregister(void);
> >  #define fuse_sysctl_unregister()       do { } while (0)
> >  #endif /* CONFIG_SYSCTL */
> >
> > +/* famfs.c */
> > +static inline struct fuse_backing *famfs_meta_set(struct fuse_inode *fi,
> > +                                                      void *meta)
> > +{
> > +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> > +       return xchg(&fi->famfs_meta, meta);
> > +#else
> > +       return NULL;
> > +#endif
> > +}
> > +
> > +static inline void famfs_meta_free(struct fuse_inode *fi)
> > +{
> > +       /* Stub wil be connected in a subsequent commit */
> > +}
> > +
> > +static inline int fuse_file_famfs(struct fuse_inode *fi)
> > +{
> > +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> > +       return (fi->famfs_meta != NULL);
> 
> Does this need to be "return READ_ONCE(fi->famfs_meta) != NULL"?
> 
> > +#else
> > +       return 0;
> > +#endif
> > +}
> > +
> >  #endif /* _FS_FUSE_I_H */
> > diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> > index 7f4b73e739cb..848c8818e6f7 100644
> > --- a/fs/fuse/inode.c
> > +++ b/fs/fuse/inode.c
> > @@ -117,6 +117,9 @@ static struct inode *fuse_alloc_inode(struct super_block *sb)
> >         if (IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
> >                 fuse_inode_backing_set(fi, NULL);
> >
> > +       if (IS_ENABLED(CONFIG_FUSE_FAMFS_DAX))
> > +               famfs_meta_set(fi, NULL);
> 
> "fi->famfs_meta = NULL;" looks simpler here
> 
> > +
> >         return &fi->inode;
> >
> >  out_free_forget:
> > @@ -138,6 +141,13 @@ static void fuse_free_inode(struct inode *inode)
> >         if (IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
> >                 fuse_backing_put(fuse_inode_backing(fi));
> >
> > +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> > +       if (S_ISREG(inode->i_mode) && fi->famfs_meta) {
> > +               famfs_meta_free(fi);
> > +               famfs_meta_set(fi, NULL);
> > +       }
> > +#endif
> > +
> >         kmem_cache_free(fuse_inode_cachep, fi);
> >  }
> >
> > @@ -1002,6 +1012,11 @@ void fuse_conn_init(struct fuse_conn *fc, struct fuse_mount *fm,
> >         if (IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
> >                 fuse_backing_files_init(fc);
> >
> > +       if (IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)) {
> > +               pr_notice("%s: Kernel is FUSE_FAMFS_DAX capable\n", __func__);
> > +               init_rwsem(&fc->famfs_devlist_sem);
> > +       }
> 
> Should we only init this if the server chooses to opt into famfs (eg
> if their init reply sets the FUSE_DAX_FMAP flag)? This imo seems to
> belong more in process_init_reply().

/me has no idea what this means, but has a sneaking suspicion it'll
become important for his science projects. ;)

Though I guess for general purpose files maybe we want to allow opting
into iomapping on a per-inode basis like the existing iomap allows.

--D

> 
> Thanks,
> Joanne
> > +
> >         INIT_LIST_HEAD(&fc->mounts);
> >         list_add(&fm->fc_entry, &fc->mounts);
> >         fm->fc = fc;
> > diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> > index f9e14180367a..d85fb692cf3b 100644
> > --- a/include/uapi/linux/fuse.h
> > +++ b/include/uapi/linux/fuse.h
> > @@ -652,6 +652,10 @@ enum fuse_opcode {
> >         FUSE_TMPFILE            = 51,
> >         FUSE_STATX              = 52,
> >
> > +       /* Famfs / devdax opcodes */
> > +       FUSE_GET_FMAP           = 53,
> > +       FUSE_GET_DAXDEV         = 54,
> > +
> >         /* CUSE specific operations */
> >         CUSE_INIT               = 4096,
> >
> > --
> > 2.49.0
> >
> 


* Re: [RFC PATCH 13/19] famfs_fuse: Create files with famfs fmaps
  2025-04-28 19:00         ` Darrick J. Wong
@ 2025-05-06 16:56           ` Miklos Szeredi
  2025-05-08 15:56             ` Darrick J. Wong
  2025-05-12 19:51             ` John Groves
  0 siblings, 2 replies; 58+ messages in thread
From: Miklos Szeredi @ 2025-05-06 16:56 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: John Groves, Dan Williams, Bernd Schubert, John Groves,
	Jonathan Corbet, Vishal Verma, Dave Jiang, Matthew Wilcox,
	Jan Kara, Alexander Viro, Christian Brauner, Luis Henriques,
	Randy Dunlap, Jeff Layton, Kent Overstreet, Petr Vorel,
	Brian Foster, linux-doc, linux-kernel, nvdimm, linux-cxl,
	linux-fsdevel, Amir Goldstein, Jonathan Cameron, Stefan Hajnoczi,
	Joanne Koong, Josef Bacik, Aravind Ramesh, Ajay Joshi

On Mon, 28 Apr 2025 at 21:00, Darrick J. Wong <djwong@kernel.org> wrote:

> <nod> I don't know what Miklos' opinion is about having multiple
> fusecmds that do similar things -- on the one hand keeping yours and my
> efforts separate explodes the amount of userspace abi that everyone must
> maintain, but on the other hand it then doesn't couple our projects
> together, which might be a good thing if it turns out that our domain
> models are /really/ actually quite different.

Sharing the interface at least would definitely be worthwhile, as
there does not seem to be a great deal of difference between the
generic one and the famfs specific one.  Only implementing part of the
functionality that the generic one provides would be fine.

> (Especially because I suspect that interleaving is the norm for memory,
> whereas we try to avoid that for disk filesystems.)

So interleaved extents are just like normal ones except they repeat,
right?  What about adding a special "repeat last N extent
descriptions" type of extent?
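
One possible reading of that idea, as a userspace sketch rather than a
proposed ABI. The ext_desc layout, the EXT_REPEAT type and ext_resolve()
are all hypothetical, and the sketch assumes that each repetition
advances every replayed extent's device offset by that extent's length,
which is how a simple strided interleave would behave.

```c
#include <stddef.h>
#include <stdint.h>

enum ext_type { EXT_PLAIN, EXT_REPEAT };

struct ext_desc {
	enum ext_type type;
	/* EXT_PLAIN: len bytes at dev_off on device dev */
	uint32_t dev;
	uint64_t dev_off;
	uint64_t len;
	/* EXT_REPEAT: replay the previous n_prev entries count more times */
	uint32_t n_prev;
	uint64_t count;
};

struct ext_target {
	uint32_t dev;
	uint64_t dev_off;
};

/* Resolve file offset pos to a device and device offset; -1 if unmapped. */
static int ext_resolve(const struct ext_desc *d, size_t nr, uint64_t pos,
		       struct ext_target *out)
{
	for (size_t i = 0; i < nr; i++) {
		if (d[i].type == EXT_PLAIN) {
			if (pos < d[i].len) {
				out->dev = d[i].dev;
				out->dev_off = d[i].dev_off + pos;
				return 0;
			}
			pos -= d[i].len;
			continue;
		}

		/* EXT_REPEAT must refer back to n_prev plain extents. */
		if (d[i].n_prev == 0 || d[i].n_prev > i)
			return -1;
		const struct ext_desc *grp = &d[i - d[i].n_prev];
		uint64_t cycle = 0;

		for (uint32_t j = 0; j < d[i].n_prev; j++) {
			if (grp[j].type != EXT_PLAIN)
				return -1;
			cycle += grp[j].len;
		}
		if (cycle == 0)
			return -1;
		if (pos >= cycle * d[i].count) {
			pos -= cycle * d[i].count;
			continue;
		}

		/* The literal extents were pass 0, so replays start at 1. */
		uint64_t rep = pos / cycle + 1;

		pos %= cycle;
		for (uint32_t j = 0; j < d[i].n_prev; j++) {
			if (pos < grp[j].len) {
				out->dev = grp[j].dev;
				out->dev_off = grp[j].dev_off +
					       rep * grp[j].len + pos;
				return 0;
			}
			pos -= grp[j].len;
		}
	}
	return -1;
}
```

With two 2 MiB plain extents on devices 0 and 1 followed by a repeat
entry with n_prev = 2 and count = 3, the three descriptors cover 16 MiB
of file space striped across both devices.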

> > But the current implementation does not contemplate partially cached fmaps.
> >
> > Adding notification could address revoking them post-haste (is that why
> > you're thinking about notifications? And if not can you elaborate on what
> > you're after there?).
>
> Yeah, invalidating the mapping cache at random places.  If, say, you
> implement a clustered filesystem with iomap, the metadata server could
> inform the fuse server on the local node that a certain range of inode X
> has been written to, at which point you need to revoke any local leases,
> invalidate the pagecache, and invalidate the iomapping cache to force
> the client to requery the server.
>
> Or if your fuse server wants to implement its own weird operations (e.g.
> XFS EXCHANGE-RANGE) this would make that possible without needing to
> add a bunch of code to fs/fuse/ for the benefit of a single fuse driver.

Wouldn't the existing invalidation framework be sufficient?

Thanks,
Miklos


* Re: [RFC PATCH 13/19] famfs_fuse: Create files with famfs fmaps
  2025-05-06 16:56           ` Miklos Szeredi
@ 2025-05-08 15:56             ` Darrick J. Wong
  2025-05-13  9:14               ` Miklos Szeredi
  2025-05-12 19:51             ` John Groves
  1 sibling, 1 reply; 58+ messages in thread
From: Darrick J. Wong @ 2025-05-08 15:56 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: John Groves, Dan Williams, Bernd Schubert, John Groves,
	Jonathan Corbet, Vishal Verma, Dave Jiang, Matthew Wilcox,
	Jan Kara, Alexander Viro, Christian Brauner, Luis Henriques,
	Randy Dunlap, Jeff Layton, Kent Overstreet, Petr Vorel,
	Brian Foster, linux-doc, linux-kernel, nvdimm, linux-cxl,
	linux-fsdevel, Amir Goldstein, Jonathan Cameron, Stefan Hajnoczi,
	Joanne Koong, Josef Bacik, Aravind Ramesh, Ajay Joshi

On Tue, May 06, 2025 at 06:56:29PM +0200, Miklos Szeredi wrote:
> On Mon, 28 Apr 2025 at 21:00, Darrick J. Wong <djwong@kernel.org> wrote:
> 
> > <nod> I don't know what Miklos' opinion is about having multiple
> > fusecmds that do similar things -- on the one hand keeping yours and my
> > efforts separate explodes the amount of userspace abi that everyone must
> > maintain, but on the other hand it then doesn't couple our projects
> > together, which might be a good thing if it turns out that our domain
> > models are /really/ actually quite different.
> 
> Sharing the interface at least would definitely be worthwhile, as
> there does not seem to be a great deal of difference between the
> generic one and the famfs specific one.  Only implementing part of the
> functionality that the generic one provides would be fine.

Well right now my barely functional prototype exposes this interface
for communicating mappings to the kernel.  I've only gotten as far as
exposing the ->iomap_{begin,end} and ->iomap_ioend calls to the fuse
server with no caching, because the only functions I've implemented so
far are FIEMAP, SEEK_{DATA,HOLE}, and directio.

So basically the kernel sends a FUSE_IOMAP_BEGIN command with the
desired (pos, count) file range to the fuse server, which responds with
a struct fuse_iomap_begin_out object that is translated into a struct
iomap.

The fuse server then responds with a read mapping and a write mapping,
which tell the kernel from where to read data, and where to write data.
As a shortcut, the write mapping can be of type
FUSE_IOMAP_TYPE_PURE_OVERWRITE to avoid having to fill out fields twice.

iomap_end is only called if there were errors while processing the
mapping, or if the fuse server sets FUSE_IOMAP_F_WANT_IOMAP_END.

iomap_ioend is called after read or write IOs complete, so that the
filesystem can update mapping metadata (e.g. unwritten extent
conversion, remapping after an out of place write, ondisk isize update).

Some of the flags here might not be needed or workable; I was merely
cutting and pasting the #defines from iomap.h.

#define FUSE_IOMAP_TYPE_PURE_OVERWRITE	(0xFFFF) /* use read mapping data */
#define FUSE_IOMAP_TYPE_HOLE		0	/* no blocks allocated, need allocation */
#define FUSE_IOMAP_TYPE_DELALLOC	1	/* delayed allocation blocks */
#define FUSE_IOMAP_TYPE_MAPPED		2	/* blocks allocated at @addr */
#define FUSE_IOMAP_TYPE_UNWRITTEN	3	/* blocks allocated at @addr in unwritten state */
#define FUSE_IOMAP_TYPE_INLINE		4	/* data inline in the inode */

#define FUSE_IOMAP_DEV_SBDEV		(0)	/* use superblock bdev */

#define FUSE_IOMAP_F_NEW		(1U << 0)
#define FUSE_IOMAP_F_DIRTY		(1U << 1)
#define FUSE_IOMAP_F_SHARED		(1U << 2)
#define FUSE_IOMAP_F_MERGED		(1U << 3)
#define FUSE_IOMAP_F_XATTR		(1U << 5)
#define FUSE_IOMAP_F_BOUNDARY		(1U << 6)
#define FUSE_IOMAP_F_ANON_WRITE		(1U << 7)

#define FUSE_IOMAP_F_WANT_IOMAP_END	(1U << 15) /* want ->iomap_end call */

#define FUSE_IOMAP_OP_WRITE		(1 << 0) /* writing, must allocate blocks */
#define FUSE_IOMAP_OP_ZERO		(1 << 1) /* zeroing operation, may skip holes */
#define FUSE_IOMAP_OP_REPORT		(1 << 2) /* report extent status, e.g. FIEMAP */
#define FUSE_IOMAP_OP_FAULT		(1 << 3) /* mapping for page fault */
#define FUSE_IOMAP_OP_DIRECT		(1 << 4) /* direct I/O */
#define FUSE_IOMAP_OP_NOWAIT		(1 << 5) /* do not block */
#define FUSE_IOMAP_OP_OVERWRITE_ONLY	(1 << 6) /* only pure overwrites allowed */
#define FUSE_IOMAP_OP_UNSHARE		(1 << 7) /* unshare_file_range */
#define FUSE_IOMAP_OP_ATOMIC		(1 << 9) /* torn-write protection */
#define FUSE_IOMAP_OP_DONTCACHE		(1 << 10) /* don't retain pagecache */

#define FUSE_IOMAP_NULL_ADDR		-1ULL	/* addr is not valid */

struct fuse_iomap_begin_in {
	uint32_t opflags;	/* FUSE_IOMAP_OP_* */
	uint32_t reserved;
	uint64_t ino;		/* matches st_ino provided by getattr/open */
	uint64_t pos;		/* file position, in bytes */
	uint64_t count;		/* operation length, in bytes */
};

struct fuse_iomap_begin_out {
	uint64_t offset;	/* file offset of mapping, bytes */
	uint64_t length;	/* length of both mappings, bytes */

	uint64_t read_addr;	/* disk offset of mapping, bytes */
	uint16_t read_type;	/* FUSE_IOMAP_TYPE_* */
	uint16_t read_flags;	/* FUSE_IOMAP_F_* */
	uint32_t read_dev;	/* FUSE_IOMAP_DEV_* */

	uint64_t write_addr;	/* disk offset of mapping, bytes */
	uint16_t write_type;	/* FUSE_IOMAP_TYPE_* */
	uint16_t write_flags;	/* FUSE_IOMAP_F_* */
	uint32_t write_dev;	/* FUSE_IOMAP_DEV_* */
};

struct fuse_iomap_end_in {
	uint32_t opflags;	/* FUSE_IOMAP_OP_* */
	uint32_t reserved;
	uint64_t ino;		/* matches st_ino provided to iomap_begin */
	uint64_t pos;		/* file position, in bytes */
	uint64_t count;		/* operation length, in bytes */
	int64_t written;	/* bytes processed */

	uint64_t map_length;	/* length of mapping, bytes */
	uint64_t map_addr;	/* disk offset of mapping, bytes */
	uint16_t map_type;	/* FUSE_IOMAP_TYPE_* */
	uint16_t map_flags;	/* FUSE_IOMAP_F_* */
	uint32_t map_dev;	/* FUSE_IOMAP_DEV_* */
};

/* out of place write extent */
#define FUSE_IOMAP_IOEND_SHARED		(1U << 0)
/* unwritten extent */
#define FUSE_IOMAP_IOEND_UNWRITTEN	(1U << 1)
/* don't merge into previous ioend */
#define FUSE_IOMAP_IOEND_BOUNDARY	(1U << 2)
/* is direct I/O */
#define FUSE_IOMAP_IOEND_DIRECT		(1U << 3)

/* is append ioend */
#define FUSE_IOMAP_IOEND_APPEND		(1U << 15)

struct fuse_iomap_ioend_in {
	uint16_t ioendflags;	/* FUSE_IOMAP_IOEND_* */
	uint16_t reserved;
	int32_t error;		/* negative errno or 0 */
	uint64_t ino;		/* matches st_ino provided to iomap_begin */
	uint64_t pos;		/* file position, in bytes */
	uint64_t addr;		/* disk offset of new mapping, in bytes */
	uint32_t written;	/* bytes processed */
	uint32_t reserved1;
};
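
As a rough illustration of how a fuse server might answer
FUSE_IOMAP_BEGIN with the structures above, here is a userspace sketch
for a file backed by one contiguous extent. The toy_file type and
toy_iomap_begin() are hypothetical, only the definitions the sketch
needs are repeated, and leaving the remaining write_* fields zeroed for
a PURE_OVERWRITE reply is an assumption about how the kernel side would
treat them.

```c
#include <stdint.h>
#include <string.h>

#define FUSE_IOMAP_TYPE_PURE_OVERWRITE	(0xFFFF)
#define FUSE_IOMAP_TYPE_MAPPED		2
#define FUSE_IOMAP_DEV_SBDEV		(0)

struct fuse_iomap_begin_in {
	uint32_t opflags;
	uint32_t reserved;
	uint64_t ino;
	uint64_t pos;
	uint64_t count;
};

struct fuse_iomap_begin_out {
	uint64_t offset;
	uint64_t length;

	uint64_t read_addr;
	uint16_t read_type;
	uint16_t read_flags;
	uint32_t read_dev;

	uint64_t write_addr;
	uint16_t write_type;
	uint16_t write_flags;
	uint32_t write_dev;
};

/* Hypothetical server-side view: one extent of size bytes at disk_addr. */
struct toy_file {
	uint64_t disk_addr;
	uint64_t size;
};

/* Returns 0 and fills *out, or -1 if pos lies beyond the extent. */
static int toy_iomap_begin(const struct toy_file *f,
			   const struct fuse_iomap_begin_in *in,
			   struct fuse_iomap_begin_out *out)
{
	uint64_t avail;

	if (in->pos >= f->size)
		return -1;	/* the toy server never allocates */

	memset(out, 0, sizeof(*out));
	avail = f->size - in->pos;

	out->offset = in->pos;
	out->length = in->count < avail ? in->count : avail;

	out->read_addr = f->disk_addr + in->pos;
	out->read_type = FUSE_IOMAP_TYPE_MAPPED;
	out->read_dev = FUSE_IOMAP_DEV_SBDEV;

	/* Shortcut: writes land on the same blocks the read mapping names. */
	out->write_type = FUSE_IOMAP_TYPE_PURE_OVERWRITE;
	return 0;
}
```

The reply is clamped to the end of the extent, so a request that runs
past it gets a shorter mapping and the kernel would ask again for the
remainder.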

> > (Especially because I suspect that interleaving is the norm for memory,
> > whereas we try to avoid that for disk filesystems.)
> 
> So interleaved extents are just like normal ones except they repeat,
> right?  What about adding a special "repeat last N extent
> descriptions" type of extent?

Yeah, I suppose a mapping cache could do that.  From talking to John
last week, it sounds like the mappings are supposed to be static for the
life of the file, as opposed to ext* where truncates and fallocate can
appear at any time.

One thing I forgot to ask John -- can there be multiple sets of
interleaved mappings per file?  e.g. the first 32g of a file are split
between 4 memory controllers, whereas the next 64g are split between 4
different domains?

> > > But the current implementation does not contemplate partially cached fmaps.
> > >
> > > Adding notification could address revoking them post-haste (is that why
> > > you're thinking about notifications? And if not can you elaborate on what
> > > you're after there?).
> >
> > Yeah, invalidating the mapping cache at random places.  If, say, you
> > implement a clustered filesystem with iomap, the metadata server could
> > inform the fuse server on the local node that a certain range of inode X
> > has been written to, at which point you need to revoke any local leases,
> > invalidate the pagecache, and invalidate the iomapping cache to force
> > the client to requery the server.
> >
> > Or if your fuse server wants to implement its own weird operations (e.g.
> > XFS EXCHANGE-RANGE) this would make that possible without needing to
> > add a bunch of code to fs/fuse/ for the benefit of a single fuse driver.
> 
> > Wouldn't the existing invalidation framework be sufficient?

I'm a little confused, are you talking about FUSE_NOTIFY_INVAL_INODE?
If so, then I think that's the wrong layer -- INVAL_INODE invalidates
the page cache, whereas I'm talking about caching the file space
mappings that iomap uses to construct bios for disk IO, and possibly
wanting to invalidate parts of that cache to force the kernel to upcall
the fuse server for a new mapping.

(Obviously this only applies to fuse servers for ondisk filesystems.)
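
To show what invalidating part of a mapping cache could look like, as
distinct from dropping page cache, here is a small userspace sketch. The
cached_mapping struct and mapping_cache_invalidate() are hypothetical
and do not correspond to any existing fuse or iomap interface.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One cached file-offset-to-device-address mapping. */
struct cached_mapping {
	uint64_t offset;	/* file offset, bytes */
	uint64_t length;	/* mapping length, bytes */
	uint64_t addr;		/* device address, bytes */
	bool valid;
};

/*
 * Drop every valid cached mapping that overlaps [pos, pos + len).
 * Returns how many entries were dropped; the next lookup in that range
 * would then have to upcall the server for a fresh mapping.
 */
static size_t mapping_cache_invalidate(struct cached_mapping *c, size_t nr,
				       uint64_t pos, uint64_t len)
{
	size_t dropped = 0;

	for (size_t i = 0; i < nr; i++) {
		if (!c[i].valid)
			continue;
		/* Overlap test on half-open byte ranges. */
		if (c[i].offset < pos + len &&
		    pos < c[i].offset + c[i].length) {
			c[i].valid = false;
			dropped++;
		}
	}
	return dropped;
}
```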

--D

> Thanks,
> Miklos
> 


* Re: [RFC PATCH 12/19] famfs_fuse: Plumb the GET_FMAP message/response
  2025-05-02  5:48   ` Joanne Koong
  2025-05-02 20:35     ` Darrick J. Wong
@ 2025-05-12 16:28     ` John Groves
  2025-05-22 15:45       ` Amir Goldstein
  1 sibling, 1 reply; 58+ messages in thread
From: John Groves @ 2025-05-12 16:28 UTC (permalink / raw)
  To: Joanne Koong
  Cc: Dan Williams, Miklos Szeredi, Bernd Schubert, John Groves,
	Jonathan Corbet, Vishal Verma, Dave Jiang, Matthew Wilcox,
	Jan Kara, Alexander Viro, Christian Brauner, Darrick J . Wong,
	Luis Henriques, Randy Dunlap, Jeff Layton, Kent Overstreet,
	Petr Vorel, Brian Foster, linux-doc, linux-kernel, nvdimm,
	linux-cxl, linux-fsdevel, Amir Goldstein, Jonathan Cameron,
	Stefan Hajnoczi, Josef Bacik, Aravind Ramesh, Ajay Joshi

On 25/05/01 10:48PM, Joanne Koong wrote:
> On Sun, Apr 20, 2025 at 6:34 PM John Groves <John@groves.net> wrote:
> >
> > Upon completion of a LOOKUP, if we're in famfs-mode we do a GET_FMAP to
> > retrieve and cache up the file-to-dax map in the kernel. If this
> > succeeds, read/write/mmap are resolved direct-to-dax with no upcalls.
> >
> > Signed-off-by: John Groves <john@groves.net>
> > ---
> >  fs/fuse/dir.c             | 69 +++++++++++++++++++++++++++++++++++++++
> >  fs/fuse/fuse_i.h          | 36 +++++++++++++++++++-
> >  fs/fuse/inode.c           | 15 +++++++++
> >  include/uapi/linux/fuse.h |  4 +++
> >  4 files changed, 123 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
> > index bc29db0117f4..ae135c55b9f6 100644
> > --- a/fs/fuse/dir.c
> > +++ b/fs/fuse/dir.c
> > @@ -359,6 +359,56 @@ bool fuse_invalid_attr(struct fuse_attr *attr)
> >         return !fuse_valid_type(attr->mode) || !fuse_valid_size(attr->size);
> >  }
> >
> > +#define FMAP_BUFSIZE 4096
> > +
> > +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> > +static void
> > +fuse_get_fmap_init(
> > +       struct fuse_conn *fc,
> > +       struct fuse_args *args,
> > +       u64 nodeid,
> > +       void *outbuf,
> > +       size_t outbuf_size)
> > +{
> > +       memset(outbuf, 0, outbuf_size);
> 
> I think we can skip the memset here since kcalloc will zero out the
> memory automatically when the fmap_buf gets allocated

Good catch, thanks. Queued to -next.

> 
> > +       args->opcode = FUSE_GET_FMAP;
> > +       args->nodeid = nodeid;
> > +
> > +       args->in_numargs = 0;
> > +
> > +       args->out_numargs = 1;
> > +       args->out_args[0].size = FMAP_BUFSIZE;
> > +       args->out_args[0].value = outbuf;
> > +}
> > +
> > +static int
> > +fuse_get_fmap(struct fuse_mount *fm, struct inode *inode, u64 nodeid)
> > +{
> > +       size_t fmap_size;
> > +       void *fmap_buf;
> > +       int err;
> > +
> > +       pr_notice("%s: nodeid=%lld, inode=%llx\n", __func__,
> > +                 nodeid, (u64)inode);
> > +       fmap_buf = kcalloc(1, FMAP_BUFSIZE, GFP_KERNEL);
> > +       FUSE_ARGS(args);
> > +       fuse_get_fmap_init(fm->fc, &args, nodeid, fmap_buf, FMAP_BUFSIZE);
> > +
> > +       /* Send GET_FMAP command */
> > +       err = fuse_simple_request(fm, &args);
> 
> I'm assuming the fmap_buf gets freed in a later patch, but for this
> one we'll probably need a kfree(fmap_buf) here in the meantime?

Nice of you to give me the benefit of the doubt there ;)

At this commit nothing is done with fmap_buf; a subsequent commit adds
a call to famfs_file_init_dax(...fmap_buf...). Either way, fmap_buf
was leaked.

I'm adding a kfree(fmap_buf) to this commit, which will come after the
call to famfs_file_init_dax() when that's added in a subsequent
commit.
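
The intended shape, modelled in userspace so it can be compiled on its
own. This is a sketch and not the actual follow-up patch: buf_alloc(),
buf_free(), the request_fn callback and the live_bufs counter are
hypothetical stand-ins for kcalloc(), kfree() and
fuse_simple_request(). It also checks the allocation result, which the
quoted hunk does not.

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

#define FMAP_BUFSIZE 4096

/* Counts buffers that are allocated but not yet freed (model only). */
static int live_bufs;

static void *buf_alloc(size_t n)
{
	void *p = calloc(1, n);

	if (p)
		live_bufs++;
	return p;
}

static void buf_free(void *p)
{
	if (p)
		live_bufs--;
	free(p);
}

/* Stand-in for sending GET_FMAP; returns 0 or a negative errno. */
typedef int (*request_fn)(void *buf, size_t len);

static int get_fmap_model(request_fn send_request)
{
	void *fmap_buf = buf_alloc(FMAP_BUFSIZE);
	int err;

	if (!fmap_buf)
		return -ENOMEM;

	err = send_request(fmap_buf, FMAP_BUFSIZE);
	if (err)
		goto out;

	/* A later step would consume fmap_buf here, before it is freed. */
out:
	buf_free(fmap_buf);
	return err;
}

static int request_ok(void *buf, size_t len)
{
	(void)buf;
	(void)len;
	return 0;
}

static int request_fail(void *buf, size_t len)
{
	(void)buf;
	(void)len;
	return -EIO;
}
```

Routing both the error and the success path through one label keeps the
free in a single place, so adding the consumer later does not reopen
the leak.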

Thanks!

> 
> > +       if (err) {
> > +               pr_err("%s: err=%d from fuse_simple_request()\n",
> > +                      __func__, err);
> > +               return err;
> > +       }
> > +
> > +       fmap_size = args.out_args[0].size;
> > +       pr_notice("%s: nodei=%lld fmap_size=%ld\n", __func__, nodeid, fmap_size);
> > +
> > +       return 0;
> > +}
> > +#endif
> > +
> >  int fuse_lookup_name(struct super_block *sb, u64 nodeid, const struct qstr *name,
> >                      struct fuse_entry_out *outarg, struct inode **inode)
> >  {
> > @@ -404,6 +454,25 @@ int fuse_lookup_name(struct super_block *sb, u64 nodeid, const struct qstr *name
> >                 fuse_queue_forget(fm->fc, forget, outarg->nodeid, 1);
> >                 goto out;
> >         }
> > +
> > +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> > +       if (fm->fc->famfs_iomap) {
> > +               if (S_ISREG((*inode)->i_mode)) {
> > +                       /* Note Lookup returns the looked-up inode in the attr
> > +                        * struct, but not in outarg->nodeid !
> > +                        */
> > +                       pr_notice("%s: outarg: size=%d nodeid=%lld attr.ino=%lld\n",
> > +                                __func__, args.out_args[0].size, outarg->nodeid,
> > +                                outarg->attr.ino);
> > +                       /* Get the famfs fmap */
> > +                       fuse_get_fmap(fm, *inode, outarg->attr.ino);
> 
> I agree with Darrick's comment about fetching the mappings only if the
> file gets opened. I wonder though if we could bundle the open with the
> get_fmap so that we don't have to do an additional request / incur 2
> extra context switches. This seems feasible to me. When we send the
> open request, we could check if fc->famfs_iomap is set and if so, set
> inarg.open_flags to include FUSE_OPEN_GET_FMAP and set outarg.value to
> an allocated buffer that holds both struct fuse_open_out and the
> fmap_buf and adjust outarg.size accordingly. Then the server could
> send both the open and corresponding fmap data in the reply.

I agree about moving GET_FMAP to open, but I want to be cautious about 
moving it *into* open. Right now fitting an entire fmap into a single
message response looks like a totally acceptable requirement for famfs -
but it might not survive as a permanent requirement, and it seems likely 
not to work out for Darrick's use cases - which I think would lead us back 
to needing GET_FMAP.

Elsewhere in this thread, and also 1:1, Darrick and I have discussed the
possibility of partial retrieval of fmaps (in part due to the possibility
that they might not always fit in a single message). If these responses
can get arbitrarily large, this would become a requirement. GET_FMAP could
specify an offset, and the reply could also specify its starting offset;
I think it has to be in both places because the current "elegantly simple"
fmap format doesn't always split easily at arbitrary offsets.

Also, with famfs I think fmaps can be retained in-kernel past close,
making the retrieval-on-open only needed if the fmap isn't already
present. Famfs doesn't currently allow fmaps to change, although there
are reasons we might relax that later.

This can be revisited down the road.

Unless I run into a blocker, the next rev of the series will call
GET_FMAP on open...

BTW I think moving GET_FMAP to open will remove the reasons why famfs
currently needs to avoid READDIRPLUS.

> 
> > +               } else
> > +                       pr_notice("%s: no get_fmap for non-regular file\n",
> > +                                __func__);
> > +       } else
> > +               pr_notice("%s: fc->dax_iomap is not set\n", __func__);
> > +#endif
> > +
> >         err = 0;
> >
> >   out_put_forget:
> > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> > index 931613102d32..437177c2f092 100644
> > --- a/fs/fuse/fuse_i.h
> > +++ b/fs/fuse/fuse_i.h
> > @@ -193,6 +193,10 @@ struct fuse_inode {
> >         /** Reference to backing file in passthrough mode */
> >         struct fuse_backing *fb;
> >  #endif
> > +
> > +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> > +       void *famfs_meta;
> > +#endif
> >  };
> >
> >  /** FUSE inode state bits */
> > @@ -942,6 +946,8 @@ struct fuse_conn {
> >  #endif
> >
> >  #if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> > +       struct rw_semaphore famfs_devlist_sem;
> > +       struct famfs_dax_devlist *dax_devlist;
> >         char *shadow;
> >  #endif
> >  };
> > @@ -1432,11 +1438,14 @@ void fuse_free_conn(struct fuse_conn *fc);
> >
> >  /* dax.c */
> >
> > +static inline int fuse_file_famfs(struct fuse_inode *fi); /* forward */
> > +
> >  /* This macro is used by virtio_fs, but now it also needs to filter for
> >   * "not famfs"
> >   */
> >  #define FUSE_IS_VIRTIO_DAX(fuse_inode) (IS_ENABLED(CONFIG_FUSE_DAX)    \
> > -                                       && IS_DAX(&fuse_inode->inode))
> > +                                       && IS_DAX(&fuse_inode->inode)   \
> > +                                       && !fuse_file_famfs(fuse_inode))
> >
> >  ssize_t fuse_dax_read_iter(struct kiocb *iocb, struct iov_iter *to);
> >  ssize_t fuse_dax_write_iter(struct kiocb *iocb, struct iov_iter *from);
> > @@ -1547,4 +1556,29 @@ extern void fuse_sysctl_unregister(void);
> >  #define fuse_sysctl_unregister()       do { } while (0)
> >  #endif /* CONFIG_SYSCTL */
> >
> > +/* famfs.c */
> > +static inline struct fuse_backing *famfs_meta_set(struct fuse_inode *fi,
> > +                                                      void *meta)
> > +{
> > +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> > +       return xchg(&fi->famfs_meta, meta);
> > +#else
> > +       return NULL;
> > +#endif
> > +}
> > +
> > +static inline void famfs_meta_free(struct fuse_inode *fi)
> > +{
> > +       /* Stub will be connected in a subsequent commit */
> > +}
> > +
> > +static inline int fuse_file_famfs(struct fuse_inode *fi)
> > +{
> > +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> > +       return (fi->famfs_meta != NULL);
> 
> Does this need to be "return READ_ONCE(fi->famfs_meta) != NULL"?

I'm not sure, but it can't hurt. Queued...

> 
> > +#else
> > +       return 0;
> > +#endif
> > +}
> > +
> >  #endif /* _FS_FUSE_I_H */
> > diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> > index 7f4b73e739cb..848c8818e6f7 100644
> > --- a/fs/fuse/inode.c
> > +++ b/fs/fuse/inode.c
> > @@ -117,6 +117,9 @@ static struct inode *fuse_alloc_inode(struct super_block *sb)
> >         if (IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
> >                 fuse_inode_backing_set(fi, NULL);
> >
> > +       if (IS_ENABLED(CONFIG_FUSE_FAMFS_DAX))
> > +               famfs_meta_set(fi, NULL);
> 
> "fi->famfs_meta = NULL;" looks simpler here

I toootally agree here, but I was following the passthrough pattern 
just above.  @miklos or @Amir, got a preference here?

Furthermore, initially I didn't init fi->famfs_meta at all because I 
*assumed* fi (the fuse_inode) would be zeroed upon allocation - but it's 
currently not. @miklos, would you object to zeroing fuse_inodes on 
allocation?  Clearly it's working without that, but it seems like a 
"normal" thing to do, that might someday pre-empt a problem.

> 
> > +
> >         return &fi->inode;
> >
> >  out_free_forget:
> > @@ -138,6 +141,13 @@ static void fuse_free_inode(struct inode *inode)
> >         if (IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
> >                 fuse_backing_put(fuse_inode_backing(fi));
> >
> > +#if IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)
> > +       if (S_ISREG(inode->i_mode) && fi->famfs_meta) {
> > +               famfs_meta_free(fi);
> > +               famfs_meta_set(fi, NULL);
> > +       }
> > +#endif
> > +
> >         kmem_cache_free(fuse_inode_cachep, fi);
> >  }
> >
> > @@ -1002,6 +1012,11 @@ void fuse_conn_init(struct fuse_conn *fc, struct fuse_mount *fm,
> >         if (IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
> >                 fuse_backing_files_init(fc);
> >
> > +       if (IS_ENABLED(CONFIG_FUSE_FAMFS_DAX)) {
> > +               pr_notice("%s: Kernel is FUSE_FAMFS_DAX capable\n", __func__);
> > +               init_rwsem(&fc->famfs_devlist_sem);
> > +       }
> 
> Should we only init this if the server chooses to opt into famfs (eg
> if their init reply sets the FUSE_DAX_FMAP flag)? This imo seems to
> belong more in process_init_reply().

Another good catch. Queued - thanks!

> 
> 
> Thanks,
> Joanne

Thank you!



* Re: [RFC PATCH 13/19] famfs_fuse: Create files with famfs fmaps
  2025-05-06 16:56           ` Miklos Szeredi
  2025-05-08 15:56             ` Darrick J. Wong
@ 2025-05-12 19:51             ` John Groves
  2025-05-13  4:03               ` Darrick J. Wong
  1 sibling, 1 reply; 58+ messages in thread
From: John Groves @ 2025-05-12 19:51 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Darrick J. Wong, Dan Williams, Bernd Schubert, John Groves,
	Jonathan Corbet, Vishal Verma, Dave Jiang, Matthew Wilcox,
	Jan Kara, Alexander Viro, Christian Brauner, Luis Henriques,
	Randy Dunlap, Jeff Layton, Kent Overstreet, Petr Vorel,
	Brian Foster, linux-doc, linux-kernel, nvdimm, linux-cxl,
	linux-fsdevel, Amir Goldstein, Jonathan Cameron, Stefan Hajnoczi,
	Joanne Koong, Josef Bacik, Aravind Ramesh, Ajay Joshi, john

On 25/05/06 06:56PM, Miklos Szeredi wrote:
> On Mon, 28 Apr 2025 at 21:00, Darrick J. Wong <djwong@kernel.org> wrote:
> 
> > <nod> I don't know what Miklos' opinion is about having multiple
> > fusecmds that do similar things -- on the one hand keeping yours and my
> > efforts separate explodes the amount of userspace abi that everyone must
> > maintain, but on the other hand it then doesn't couple our projects
> > together, which might be a good thing if it turns out that our domain
> > models are /really/ actually quite different.
> 
> Sharing the interface at least would definitely be worthwhile, as
> there does not seem to be a great deal of difference between the
> generic one and the famfs specific one.  Only implementing part of the
> functionality that the generic one provides would be fine.

Agreed. I'm coming around to thinking the most practical approach would be
to share the GET_FMAP message/response, but to add a separate response
format for Darrick's use case - when the time comes. In this patch set, 
that starts with 'struct fuse_famfs_fmap_header' and is followed by the 
appropriate extent structures, serialized in the message. Collectively
that's an fmap in message format.

Side note: the current patch set sends back the logically-variable-sized 
fmap in a fixed-size message, but V2 of the series will address that; 
I got some help from Bernd there, but haven't finished it yet.

So the next version of the patch set would, say, add a more generic first
'struct fmap_header' that would indicate whether the next item would be
'struct fuse_famfs_fmap_header' (i.e. my/famfs metadata) or some other
to-be-codified metadata format. I'm going here because I'm dubious that
we even *can* do grand-unified-fmap-metadata (or that we should try).

This will require versioning the affected structures, unless we think
the fmap-in-message structure can be opaque to the rest of fuse. @miklos,
is there an example to follow regarding struct versioning in 
already-existing fuse structures?

> 
> > (Especially because I suspect that interleaving is the norm for memory,
> > whereas we try to avoid that for disk filesystems.)
> 
> So interleaved extents are just like normal ones except they repeat,
> right?  What about adding a special "repeat last N extent
> descriptions" type of extent?

It's a bit more than that. The comment at [1] makes it possible to understand
the scheme, but I'd be happy to talk through it with you on a call if that
seems helpful.

An interleaved extent stripes data spread across N memory devices in raid 0
format; the space from each device is described by a single simple extent 
(so it's contiguous), but it's not consumed contiguously - it's consumed in
fixed-sized chunks that precess across the devices. Notwithstanding that I 
couldn't explain it very well when we talked about it at LPC, I think I 
could make it pretty clear in a pretty brief call now.

In any case, you have my word that it's actually quite elegant :D
(seriously, but also with a smile...)
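
As a rough illustration of the striping described above, here is a minimal
sketch in C of how a logical offset could be resolved to a device and a
device offset. The names (simple_ext, interleaved_ext, resolve_offset) are
hypothetical and are not the definitions in famfs_kfmap.h; the sketch also
assumes every strip holds a whole number of chunks.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: one contiguous range on one memory device (a "strip"). */
struct simple_ext {
	uint64_t dev_index;   /* which backing device */
	uint64_t dev_offset;  /* start of the strip on that device */
	uint64_t len;         /* strip length in bytes */
};

/* Hypothetical: N strips consumed round-robin in fixed-size chunks. */
struct interleaved_ext {
	uint64_t chunk_size;
	size_t nstrips;
	const struct simple_ext *strips;
};

/*
 * Translate a logical offset within the interleaved extent into a
 * (device, device offset) pair. Returns 0 on success, -1 if the offset
 * falls outside the extent or the extent is malformed.
 */
static int resolve_offset(const struct interleaved_ext *ie, uint64_t off,
			  uint64_t *dev_index, uint64_t *dev_offset)
{
	uint64_t chunk_num, off_in_chunk, strip_num, stripe_num, strip_off;

	if (!ie->chunk_size || !ie->nstrips)
		return -1;

	chunk_num = off / ie->chunk_size;      /* global chunk number */
	off_in_chunk = off % ie->chunk_size;
	strip_num = chunk_num % ie->nstrips;   /* device this chunk lands on */
	stripe_num = chunk_num / ie->nstrips;  /* full stripes preceding it */
	strip_off = stripe_num * ie->chunk_size + off_in_chunk;

	if (strip_off >= ie->strips[strip_num].len)
		return -1;

	*dev_index = ie->strips[strip_num].dev_index;
	*dev_offset = ie->strips[strip_num].dev_offset + strip_off;
	return 0;
}
```

Each strip is itself contiguous, so the per-device arithmetic is only a
base plus an offset; the interleaving lives entirely in the chunk-number
division and modulo.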

> 
> > > But the current implementation does not contemplate partially cached fmaps.
> > >
> > > Adding notification could address revoking them post-haste (is that why
> > > you're thinking about notifications? And if not can you elaborate on what
> > > you're after there?).
> >
> > Yeah, invalidating the mapping cache at random places.  If, say, you
> > implement a clustered filesystem with iomap, the metadata server could
> > inform the fuse server on the local node that a certain range of inode X
> > has been written to, at which point you need to revoke any local leases,
> > invalidate the pagecache, and invalidate the iomapping cache to force
> > the client to requery the server.
> >
> > Or if your fuse server wants to implement its own weird operations (e.g.
> > XFS EXCHANGE-RANGE) this would make that possible without needing to
> > add a bunch of code to fs/fuse/ for the benefit of a single fuse driver.
> 
> Wouldn't existing invalidation framework be sufficient?
> 
> Thanks,
> Miklos

My current thinking is that Darrick's use case doesn't need GET_DAXDEV, but
famfs does. I think Darrick's use case has one backing device, and that should
be passed in at mount time. Correct me if you think that might be wrong.

Famfs doesn't necessarily have just one backing dev, which means that famfs
could pass in the *primary* backing dev at mount time, but it would still
need GET_DAXDEV to get the rest. But if I just use GET_FMAP every time, I
only need one way to do this.

I'll add a few more responses to Darrick's reply...

Thanks,
John

[1] https://github.com/cxl-micron-reskit/famfs-linux/blob/c57553c4ca91f0634f137285840ab25be8a87c30/fs/fuse/famfs_kfmap.h#L13



* Re: [RFC PATCH 13/19] famfs_fuse: Create files with famfs fmaps
  2025-05-12 19:51             ` John Groves
@ 2025-05-13  4:03               ` Darrick J. Wong
  0 siblings, 0 replies; 58+ messages in thread
From: Darrick J. Wong @ 2025-05-13  4:03 UTC (permalink / raw)
  To: John Groves
  Cc: Miklos Szeredi, Dan Williams, Bernd Schubert, John Groves,
	Jonathan Corbet, Vishal Verma, Dave Jiang, Matthew Wilcox,
	Jan Kara, Alexander Viro, Christian Brauner, Luis Henriques,
	Randy Dunlap, Jeff Layton, Kent Overstreet, Petr Vorel,
	Brian Foster, linux-doc, linux-kernel, nvdimm, linux-cxl,
	linux-fsdevel, Amir Goldstein, Jonathan Cameron, Stefan Hajnoczi,
	Joanne Koong, Josef Bacik, Aravind Ramesh, Ajay Joshi

On Mon, May 12, 2025 at 02:51:45PM -0500, John Groves wrote:
> On 25/05/06 06:56PM, Miklos Szeredi wrote:
> > On Mon, 28 Apr 2025 at 21:00, Darrick J. Wong <djwong@kernel.org> wrote:
> > 
> > > <nod> I don't know what Miklos' opinion is about having multiple
> > > fusecmds that do similar things -- on the one hand keeping yours and my
> > > efforts separate explodes the amount of userspace abi that everyone must
> > > maintain, but on the other hand it then doesn't couple our projects
> > > together, which might be a good thing if it turns out that our domain
> > > models are /really/ actually quite different.
> > 
> > Sharing the interface at least would definitely be worthwhile, as
> > there does not seem to be a great deal of difference between the
> > generic one and the famfs specific one.  Only implementing part of the
> > functionality that the generic one provides would be fine.
> 
> Agreed. I'm coming around to thinking the most practical approach would be
> to share the GET_FMAP message/response, but to add a separate response
> format for Darrick's use case - when the time comes. In this patch set, 
> that starts with 'struct fuse_famfs_fmap_header' and is followed by the 
> appropriate extent structures, serialized in the message. Collectively
> that's an fmap in message format.

Well in that case I might as well just plumb in the pieces I need as
separate fuse commands.  fuse_args::opcode is u32, there's plenty of
space left.

> Side note: the current patch set sends back the logically-variable-sized 
> fmap in a fixed-size message, but V2 of the series will address that; 
> I got some help from Bernd there, but haven't finished it yet.
> 
> So the next version of the patch set would, say, add a more generic first
> 'struct fmap_header' that would indicate whether the next item would be
> 'struct fuse_famfs_fmap_header' (i.e. my/famfs metadata) or some other
> to-be-codified metadata format. I'm going here because I'm dubious that
> we even *can* do grand-unified-fmap-metadata (or that we should try).
> 
> This will require versioning the affected structures, unless we think
> the fmap-in-message structure can be opaque to the rest of fuse. @miklos,
> is there an example to follow regarding struct versioning in 
> already-existing fuse structures?

/me is a n00b, but isn't that a simple matter of making sure that new
revisions change the structure size, and then you can key off of that?
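
A minimal sketch of that size-keyed idea, assuming two hypothetical
revisions of a header (fmap_hdr_v1 and fmap_hdr_v2 are invented here and
are not structures from this patch set). The receiver picks the revision
purely from the length of what it was handed and zero-fills fields the
older revision lacks:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical first revision of a header. */
struct fmap_hdr_v1 {
	uint32_t nextents;
	uint32_t flags;
};

/* Hypothetical second revision: same prefix, one new field, new size. */
struct fmap_hdr_v2 {
	uint32_t nextents;
	uint32_t flags;
	uint64_t start_offset;
};

/*
 * Decode a header of either revision into the newest layout.
 * Returns the revision number (1 or 2), or -1 for an unknown size.
 */
static int fmap_hdr_parse(const void *buf, size_t len,
			  struct fmap_hdr_v2 *out)
{
	memset(out, 0, sizeof(*out));

	if (len == sizeof(struct fmap_hdr_v1)) {
		/* v1 is a strict prefix of v2, so copy only the shared fields */
		memcpy(out, buf, sizeof(struct fmap_hdr_v1));
		return 1;
	}
	if (len == sizeof(struct fmap_hdr_v2)) {
		memcpy(out, buf, sizeof(struct fmap_hdr_v2));
		return 2;
	}
	return -1;
}
```

This only works while every revision has a distinct size and older layouts
stay a prefix of newer ones, which is the constraint the paragraph above
describes.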

> > > (Especially because I suspect that interleaving is the norm for memory,
> > > whereas we try to avoid that for disk filesystems.)
> > 
> > So interleaved extents are just like normal ones except they repeat,
> > right?  What about adding a special "repeat last N extent
> > descriptions" type of extent?
> 
> It's a bit more than that. The comment at [1] makes it possible to understand
> the scheme, but I'd be happy to talk through it with you on a call if that
> seems helpful.
> 
> An interleaved extent stripes data spread across N memory devices in raid 0
> format; the space from each device is described by a single simple extent 
> (so it's contiguous), but it's not consumed contiguously - it's consumed in
> fixed-sized chunks that precess across the devices. Notwithstanding that I 
> couldn't explain it very well when we talked about it at LPC, I think I 
> could make it pretty clear in a pretty brief call now.
> 
> In any case, you have my word that it's actually quite elegant :D
> (seriously, but also with a smile...)

Admittedly the more I think about the interleaving in famfs vs straight
block mappings for disk filesystems, the more I think they ought to be
separate interfaces for code that solves different problems.  Then both
our codebases will remain relatively cohesive.

> > > > But the current implementation does not contemplate partially cached fmaps.
> > > >
> > > > Adding notification could address revoking them post-haste (is that why
> > > > you're thinking about notifications? And if not can you elaborate on what
> > > > you're after there?).
> > >
> > > Yeah, invalidating the mapping cache at random places.  If, say, you
> > > implement a clustered filesystem with iomap, the metadata server could
> > > inform the fuse server on the local node that a certain range of inode X
> > > has been written to, at which point you need to revoke any local leases,
> > > invalidate the pagecache, and invalidate the iomapping cache to force
> > > the client to requery the server.
> > >
> > > Or if your fuse server wants to implement its own weird operations (e.g.
> > > XFS EXCHANGE-RANGE) this would make that possible without needing to
> > > add a bunch of code to fs/fuse/ for the benefit of a single fuse driver.
> > 
> > Wouldn't existing invalidation framework be sufficient?
> > 
> > Thanks,
> > Miklos
> 
> My current thinking is that Darrick's use case doesn't need GET_DAXDEV, but
> famfs does. I think Darrick's use case has one backing device, and that should
> be passed in at mount time. Correct me if you think that might be wrong.

Technically speaking iomap can operate on /any/ block or dax device as
long as you have a reference to them.  Once I get more of the plumbing
sorted out I'll start thinking about how to handle multi-device
filesystems like XFS which can put file data on more than 1 block
device.

I was thinking that the fuse server could just send a REGISTER_DEVICE
notification to the fuse driver (I know, again with the notifications
:)), the kernel replies with a magic cookie, and that's what gets passed
in the {read,write,map}_dev field.

Right now I reconfigured fuse2fs to present itself as a "fuseblk" driver
so that at least we know that inode->i_sb->s_bdev is a valid pointer.
It turns out to be useful because the kernel sends FUSE_DESTROY commands
synchronously during unmount, which avoids the situation where umount
exits but the block device still can't be opened O_EXCL because the fuse
server program is still exiting.  It may be useful for some day wiring
up some of the block device ops to fuse servers.  Though I think it
might conflict with CONFIG_BLK_DEV_WRITE_MOUNTED=y

I just barely got directio writes and pagecache read/write working
through iomap today, though I'm still getting used to the fuse inode
locking model and sorting through the bugs. :)

(I wonder how nasty would it be to pass fds to the fuse kernel driver
from fuseblk servers?)

> Famfs doesn't necessarily have just one backing dev, which means that famfs
> could pass in the *primary* backing dev at mount time, but it would still
> need GET_DAXDEV to get the rest. But if I just use GET_FMAP every time, I
> only need one way to do this.
> 
> I'll add a few more responses to Darrick's reply...

Hehhe onto that message go I.

--D

> 
> Thanks,
> John
> 
> [1] https://github.com/cxl-micron-reskit/famfs-linux/blob/c57553c4ca91f0634f137285840ab25be8a87c30/fs/fuse/famfs_kfmap.h#L13
> 
> 


* Re: [RFC PATCH 13/19] famfs_fuse: Create files with famfs fmaps
  2025-05-08 15:56             ` Darrick J. Wong
@ 2025-05-13  9:14               ` Miklos Szeredi
  2025-05-15  2:06                 ` Darrick J. Wong
  0 siblings, 1 reply; 58+ messages in thread
From: Miklos Szeredi @ 2025-05-13  9:14 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: John Groves, Dan Williams, Bernd Schubert, John Groves,
	Jonathan Corbet, Vishal Verma, Dave Jiang, Matthew Wilcox,
	Jan Kara, Alexander Viro, Christian Brauner, Luis Henriques,
	Randy Dunlap, Jeff Layton, Kent Overstreet, Petr Vorel,
	Brian Foster, linux-doc, linux-kernel, nvdimm, linux-cxl,
	linux-fsdevel, Amir Goldstein, Jonathan Cameron, Stefan Hajnoczi,
	Joanne Koong, Josef Bacik, Aravind Ramesh, Ajay Joshi

On Thu, 8 May 2025 at 17:56, Darrick J. Wong <djwong@kernel.org> wrote:

> Well right now my barely functional prototype exposes this interface
> for communicating mappings to the kernel.  I've only gotten as far as
> exposing the ->iomap_{begin,end} and ->iomap_ioend calls to the fuse
> server with no caching, because the only functions I've implemented so
> far are FIEMAP, SEEK_{DATA,HOLE}, and directio.
>
> So basically the kernel sends a FUSE_IOMAP_BEGIN command with the
> desired (pos, count) file range to the fuse server, which responds with
> a struct fuse_iomap_begin_out object that is translated into a struct
> iomap.
>
> The fuse server then responds with a read mapping and a write mapping,
> which tell the kernel from where to read data, and where to write data.

So far so good.

The iomap layer is non-caching, right?   This means that e.g. a
direct_io request spanning two extents will result in two separate
requests, since one FUSE_IOMAP_BEGIN can only return one extent.

And the next direct_io request may need to repeat the query for the
same extent as the previous one if the I/O boundary wasn't on the
extent boundary (which is likely).

So some sort of caching would make sense, but seeing the multitude of
FUSE_IOMAP_OP_ types I'm not clearly seeing how that would look.

> I'm a little confused, are you talking about FUSE_NOTIFY_INVAL_INODE?
> If so, then I think that's the wrong layer -- INVAL_INODE invalidates
> the page cache, whereas I'm talking about caching the file space
> mappings that iomap uses to construct bios for disk IO, and possibly
> wanting to invalidate parts of that cache to force the kernel to upcall
> the fuse server for a new mapping.

Maybe I'm confused, as the layering is not very clear in my head yet.

But in your example you did say that the data as well as the mapping
needs to be invalidated, so I thought that the simplest thing to do is
to just invalidate the cached mapping from FUSE_NOTIFY_INVAL_INODE as
well.

Thanks,
Miklos


* Re: [RFC PATCH 13/19] famfs_fuse: Create files with famfs fmaps
  2025-05-13  9:14               ` Miklos Szeredi
@ 2025-05-15  2:06                 ` Darrick J. Wong
  2025-05-16 10:06                   ` Miklos Szeredi
  0 siblings, 1 reply; 58+ messages in thread
From: Darrick J. Wong @ 2025-05-15  2:06 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: John Groves, Dan Williams, Bernd Schubert, John Groves,
	Jonathan Corbet, Vishal Verma, Dave Jiang, Matthew Wilcox,
	Jan Kara, Alexander Viro, Christian Brauner, Luis Henriques,
	Randy Dunlap, Jeff Layton, Kent Overstreet, Petr Vorel,
	Brian Foster, linux-doc, linux-kernel, nvdimm, linux-cxl,
	linux-fsdevel, Amir Goldstein, Jonathan Cameron, Stefan Hajnoczi,
	Joanne Koong, Josef Bacik, Aravind Ramesh, Ajay Joshi

On Tue, May 13, 2025 at 11:14:55AM +0200, Miklos Szeredi wrote:
> On Thu, 8 May 2025 at 17:56, Darrick J. Wong <djwong@kernel.org> wrote:
> 
> > Well right now my barely functional prototype exposes this interface
> > for communicating mappings to the kernel.  I've only gotten as far as
> > exposing the ->iomap_{begin,end} and ->iomap_ioend calls to the fuse
> > server with no caching, because the only functions I've implemented so
> > far are FIEMAP, SEEK_{DATA,HOLE}, and directio.
> >
> > So basically the kernel sends a FUSE_IOMAP_BEGIN command with the
> > desired (pos, count) file range to the fuse server, which responds with
> > a struct fuse_iomap_begin_out object that is translated into a struct
> > iomap.
> >
> > The fuse server then responds with a read mapping and a write mapping,
> > which tell the kernel from where to read data, and where to write data.
> 
> So far so good.
> 
> The iomap layer is non-caching, right?   This means that e.g. a
> direct_io request spanning two extents will result in two separate
> requests, since one FUSE_IOMAP_BEGIN can only return one extent.

Originally it wasn't supposed to be cached at all.  Then history taught
us a lesson. :P

In hindsight, there needs to be coordination of the space mapping
manipulations that go on between pagecache writes and reclaim writeback.
Pagecache write can get an unwritten iomap, then go to sleep while it
tries to get a folio.  In the meantime, writeback can find the folio for
that range, write it back to the disk (which converts unwritten to
written) and reclaim the folio.  Now the first process wakes up and
grabs a new folio.  Because its unwritten mapping is now stale, it must
not start zeroing that folio; it needs to go get a new mapping.

So iomap still doesn't need caching per se, but it needs writer threads
to revalidate the mapping after locking a folio.  The reason for caching
iomaps under the fuse_inode somewhere is that I don't want the
revalidations to have to jump all the way out to userspace with a folio
lock held.
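
The revalidation step can be sketched as a sequence-counter check. This is
a single-threaded userspace illustration with hypothetical names
(inode_map_cache, sampled_map, map_still_valid), not code from fuse or
iomap; it is similar in spirit to the validity cookie that struct iomap
carries, and a real implementation would need locking around the counter.

```c
#include <stdbool.h>
#include <stdint.h>

enum map_state { MAP_UNWRITTEN, MAP_WRITTEN };

/* Hypothetical per-inode mapping cache, reduced to a single extent. */
struct inode_map_cache {
	uint64_t seq;           /* bumped on every mapping change */
	enum map_state state;
};

/* What a writer holds on to while it sleeps waiting for a folio. */
struct sampled_map {
	enum map_state state;
	uint64_t cookie;        /* value of seq when the mapping was sampled */
};

static struct sampled_map map_sample(const struct inode_map_cache *c)
{
	struct sampled_map m = { c->state, c->seq };

	return m;
}

/* Writeback converts unwritten to written, invalidating older samples. */
static void writeback_convert(struct inode_map_cache *c)
{
	c->state = MAP_WRITTEN;
	c->seq++;
}

/* Called with the folio locked: is the sampled mapping still current? */
static bool map_still_valid(const struct inode_map_cache *c,
			    const struct sampled_map *m)
{
	return m->cookie == c->seq;
}
```

A writer that finds its sample stale drops it and fetches a new mapping
instead of zeroing the folio. Because the check is a single comparison
against in-kernel state, it never has to call out to userspace while the
folio lock is held.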

That said, on a VM on this 12 year old workstation, I can get about
2.0GB/s direct writes in fuse2fs and 2.2GB/s in kernel ext4, and that's
with initiating iomap_begin/end/ioends with no caching of the mappings.
Pagecache writes run at about 1.9GB/s through fuse2fs and 1.5GB/s
through the kernel, but only if I tweak fuse to use large folios and a
relatively unconstrained bdi.  2GB/s might be enough IO for anyone. ;)

> And the next direct_io request may need to repeat the query for the
> same extent as the previous one if the I/O boundary wasn't on the
> extent boundary (which is likely).
> 
> So some sort of caching would make sense, but seeing the multitude of
> FUSE_IOMAP_OP_ types I'm not clearly seeing how that would look.

Yeah, it's confusing.  The design doc tries to clarify this, but this is
roughly what we need for fuse:

FUSE_IOMAP_OP_WRITE being set means we're writing to the file.
FUSE_IOMAP_OP_ZERO being set means we're zeroing the file.
Neither of those being set means we're reading the file.

(3 different operations)

FUSE_IOMAP_OP_DIRECT being set means directio, and it not being set
means pagecache.

(and one flag, for 6 different types of IO)

FUSE_IOMAP_OP_REPORT is set all by itself for things like FIEMAP and
SEEK_DATA/HOLE.
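
To make the combinations concrete, here is a small sketch that classifies
a flags word along the lines above. The flag names come from this message,
but the bit values are placeholders chosen for the example, and treating
WRITE and ZERO as mutually exclusive is an assumption drawn from the
"3 different operations" wording; io_kind, io_class and classify_iomap_op
are hypothetical.

```c
#include <stdint.h>

/* Placeholder bit values; the real ones belong to the prototype. */
#define FUSE_IOMAP_OP_WRITE   (1u << 0)
#define FUSE_IOMAP_OP_ZERO    (1u << 1)
#define FUSE_IOMAP_OP_REPORT  (1u << 2)
#define FUSE_IOMAP_OP_DIRECT  (1u << 3)

enum io_kind { IO_READ, IO_WRITE, IO_ZERO, IO_REPORT, IO_INVALID };

struct io_class {
	enum io_kind kind;
	int direct;     /* 1 = directio, 0 = pagecache */
};

static struct io_class classify_iomap_op(uint32_t flags)
{
	struct io_class c = { IO_INVALID, 0 };

	if (flags & FUSE_IOMAP_OP_REPORT) {
		/* REPORT is only meaningful when set all by itself */
		if (flags == FUSE_IOMAP_OP_REPORT)
			c.kind = IO_REPORT;
		return c;
	}

	/* Assumption: write and zero never appear together */
	if ((flags & FUSE_IOMAP_OP_WRITE) && (flags & FUSE_IOMAP_OP_ZERO))
		return c;

	if (flags & FUSE_IOMAP_OP_WRITE)
		c.kind = IO_WRITE;
	else if (flags & FUSE_IOMAP_OP_ZERO)
		c.kind = IO_ZERO;
	else
		c.kind = IO_READ;       /* neither bit set means a read */

	c.direct = !!(flags & FUSE_IOMAP_OP_DIRECT);
	return c;
}
```

Three operations times the DIRECT bit gives the six IO types mentioned
above, with REPORT as a seventh standalone case.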

> > I'm a little confused, are you talking about FUSE_NOTIFY_INVAL_INODE?
> > If so, then I think that's the wrong layer -- INVAL_INODE invalidates
> > the page cache, whereas I'm talking about caching the file space
> > mappings that iomap uses to construct bios for disk IO, and possibly
> > wanting to invalidate parts of that cache to force the kernel to upcall
> > the fuse server for a new mapping.
> 
> Maybe I'm confused, as the layering is not very clear in my head yet.
> 
> But in your example you did say that the data as well as the mapping
> needs to be invalidated, so I thought that the simplest thing to do is
> to just invalidate the cached mapping from FUSE_NOTIFY_INVAL_INODE as
> well.

For now I want to keep the two invalidation types separate while I build
out more of the prototype so that I can be more sure that I haven't
broken any existing code. :)

The mapping invalidation might be more useful for things like FICLONE on
weird filesystems where the file allocation unit size is larger than the
block size and we actually need to invalidate more mappings than the vfs
knows about.

But I'm only 80% sure of that, as I'm still figuring out how to create a
notification and send it from fuse2fs and haven't gotten to the caching
layer yet.

--D

> Thanks,
> Miklos
> 


* Re: [RFC PATCH 13/19] famfs_fuse: Create files with famfs fmaps
  2025-05-15  2:06                 ` Darrick J. Wong
@ 2025-05-16 10:06                   ` Miklos Szeredi
  2025-05-16 23:17                     ` Darrick J. Wong
  0 siblings, 1 reply; 58+ messages in thread
From: Miklos Szeredi @ 2025-05-16 10:06 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: John Groves, Dan Williams, Bernd Schubert, John Groves,
	Jonathan Corbet, Vishal Verma, Dave Jiang, Matthew Wilcox,
	Jan Kara, Alexander Viro, Christian Brauner, Luis Henriques,
	Randy Dunlap, Jeff Layton, Kent Overstreet, Petr Vorel,
	Brian Foster, linux-doc, linux-kernel, nvdimm, linux-cxl,
	linux-fsdevel, Amir Goldstein, Jonathan Cameron, Stefan Hajnoczi,
	Joanne Koong, Josef Bacik, Aravind Ramesh, Ajay Joshi

On Thu, 15 May 2025 at 04:06, Darrick J. Wong <djwong@kernel.org> wrote:

> Yeah, it's confusing.  The design doc tries to clarify this, but this is
> roughly what we need for fuse:
>
> FUSE_IOMAP_OP_WRITE being set means we're writing to the file.
> FUSE_IOMAP_OP_ZERO being set means we're zeroing the file.
> Neither of those being set means we're reading the file.
>
> (3 different operations)

Okay, I get why these need to be distinct cases.

Am I right that only the read case is sanely cacheable?

> FUSE_IOMAP_OP_DIRECT being set means directio, and it not being set
> means pagecache.
>
> (and one flag, for 6 different types of IO)

Why does this make a difference?

Okay, maybe I can imagine different allocation strategies.  Which
means that it only matters for the write case?

> FUSE_IOMAP_OP_REPORT is set all by itself for things like FIEMAP and
> SEEK_DATA/HOLE.

Which should again always be the same as the read case, no?

Thanks,
Miklos


* Re: [RFC PATCH 13/19] famfs_fuse: Create files with famfs fmaps
  2025-05-16 10:06                   ` Miklos Szeredi
@ 2025-05-16 23:17                     ` Darrick J. Wong
  0 siblings, 0 replies; 58+ messages in thread
From: Darrick J. Wong @ 2025-05-16 23:17 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: John Groves, Dan Williams, Bernd Schubert, John Groves,
	Jonathan Corbet, Vishal Verma, Dave Jiang, Matthew Wilcox,
	Jan Kara, Alexander Viro, Christian Brauner, Luis Henriques,
	Randy Dunlap, Jeff Layton, Kent Overstreet, Petr Vorel,
	Brian Foster, linux-doc, linux-kernel, nvdimm, linux-cxl,
	linux-fsdevel, Amir Goldstein, Jonathan Cameron, Stefan Hajnoczi,
	Joanne Koong, Josef Bacik, Aravind Ramesh, Ajay Joshi

On Fri, May 16, 2025 at 12:06:44PM +0200, Miklos Szeredi wrote:
> On Thu, 15 May 2025 at 04:06, Darrick J. Wong <djwong@kernel.org> wrote:
> 
> > Yeah, it's confusing.  The design doc tries to clarify this, but this is
> > roughly what we need for fuse:
> >
> > FUSE_IOMAP_OP_WRITE being set means we're writing to the file.
> > FUSE_IOMAP_OP_ZERO being set means we're zeroing the file.
> > Neither of those being set means we're reading the file.
> >
> > (3 different operations)
> 
> Okay, I get why these need to be distinct cases.
> 
> > Am I right that only the read is sanely cacheable?

That depends on the filesystem.  Old filesystems (e.g. the ones that
don't support out of place writes or unwritten extents) most likely can
cache mappings for writes and zeroing.  Filesystems with static mappings
(like zonefs which are convenient wrappers around hardware) can cache
most everything too.

My next step for this prototype is to go build a real cache and make
fuse2fs manage the cache, which puts the filesystem in charge of
maintaining the cache however is appropriate for the design.

> > FUSE_IOMAP_OP_DIRECT being set means directio, and it not being set
> > means pagecache.
> >
> > (and one flag, for 6 different types of IO)
> 
> Why does this make a difference?

Different allocation strategies -- we can use delayed allocation for
pagecache writes, whereas with direct writes we must have real disk
space.

> Okay, maybe I can imagine different allocation strategies.  Which
> means that it only matters for the write case?

Probably.  I don't see why a directio read would be any different from a
pagecache read(ahead), but the distinction exists for the in-kernel
iomap callers.

> > FUSE_IOMAP_OP_REPORT is set all by itself for things like FIEMAP and
> > SEEK_DATA/HOLE.
> 
> Which should again always be the same as the read case, no?

Not entirely -- if the fuse driver is doing weird caching things with
file data blocks, a read requires it to invalidate its own cache,
whereas it needn't do anything for a mapping report.  fuse2fs is guilty
of this, because it does ... crazy things.

Also for now I don't support read/write to inline data files, though I
think it would be possible to use the FUSE_READ/FUSE_WRITE for that...
as soon as I find a filesystem where inline data for regular files isn't
a giant trash fire and can be QAd properly.

--D
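
The flag combinations discussed above can be sketched as a small
classifier. This is only an illustration: the thread names the
FUSE_IOMAP_OP_* flags but does not give their encoding, so the bit values
below are assumptions, and the enum and iomap_op_classify() are
hypothetical names that appear nowhere in the patches.

```c
/* Assumed bit values; the thread names the flags but not their encoding. */
#define FUSE_IOMAP_OP_WRITE	(1u << 0)
#define FUSE_IOMAP_OP_ZERO	(1u << 1)
#define FUSE_IOMAP_OP_REPORT	(1u << 2)
#define FUSE_IOMAP_OP_DIRECT	(1u << 3)

/* Hypothetical names for the six IO types plus the mapping report. */
enum iomap_op_kind {
	OP_PAGECACHE_READ,
	OP_DIRECT_READ,
	OP_PAGECACHE_WRITE,
	OP_DIRECT_WRITE,
	OP_PAGECACHE_ZERO,
	OP_DIRECT_ZERO,
	OP_REPORT,
};

static enum iomap_op_kind iomap_op_classify(unsigned int flags)
{
	int direct = (flags & FUSE_IOMAP_OP_DIRECT) != 0;

	/* REPORT is set all by itself (FIEMAP, SEEK_DATA/HOLE). */
	if (flags == FUSE_IOMAP_OP_REPORT)
		return OP_REPORT;
	if (flags & FUSE_IOMAP_OP_WRITE)
		return direct ? OP_DIRECT_WRITE : OP_PAGECACHE_WRITE;
	if (flags & FUSE_IOMAP_OP_ZERO)
		return direct ? OP_DIRECT_ZERO : OP_PAGECACHE_ZERO;
	/* Neither WRITE nor ZERO set: this is a read. */
	return direct ? OP_DIRECT_READ : OP_PAGECACHE_READ;
}
```

With this shape, a server that only cares about the allocation strategy
would branch on the two write results and could treat everything else
alike.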

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 00/19] famfs: port into fuse
  2025-04-21  1:33 [RFC PATCH 00/19] famfs: port into fuse John Groves
                   ` (20 preceding siblings ...)
  2025-04-30 14:42 ` Alireza Sanaee
@ 2025-05-21 22:30 ` John Groves
  2025-05-21 23:11   ` Darrick J. Wong
  2025-05-22 15:55   ` Amir Goldstein
  21 siblings, 2 replies; 58+ messages in thread
From: John Groves @ 2025-05-21 22:30 UTC (permalink / raw)
  To: Dan Williams, Miklos Szeredi, Bernd Schubert
  Cc: John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Darrick J . Wong, Luis Henriques, Randy Dunlap, Jeff Layton,
	Kent Overstreet, Petr Vorel, Brian Foster, linux-doc,
	linux-kernel, nvdimm, linux-cxl, linux-fsdevel, Amir Goldstein,
	Jonathan Cameron, Stefan Hajnoczi, Joanne Koong, Josef Bacik,
	Aravind Ramesh, Ajay Joshi, Alistair Popple, john

On 25/04/20 08:33PM, John Groves wrote:
> Subject: famfs: port into fuse
>
> <snip>

I'm planning to apply the review comments and send v2 of
this patch series soon - hopefully next week.

I asked a couple of specific questions for Miklos and
Amir at [1] that I hope they will answer in the next few
days. Do you object to zeroing fuse_inodes when they're
allocated, and do I really need an xchg() to set the
fi->famfs_meta pointer during fuse_alloc_inode()? cmpxchg
would be good for avoiding stepping on an "already set"
pointer, but not useful if fi->famfs_meta has random
contents (which it does when allocated).
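
The xchg()/cmpxchg distinction in the question above can be shown in
user space with C11 atomics standing in for the kernel primitives. This is
a sketch under stated assumptions: struct demo_inode and both helpers are
hypothetical, and the "stale" pointer simulates the random contents of a
freshly allocated, non-zeroed inode.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the inode field under discussion. */
struct demo_inode {
	_Atomic(void *) meta;
};

/*
 * cmpxchg-style "set once": succeeds only if the slot currently holds
 * NULL. If the slot holds leftover garbage, this refuses to store.
 */
static bool demo_meta_set_once(struct demo_inode *di, void *meta)
{
	void *expected = NULL;

	return atomic_compare_exchange_strong(&di->meta, &expected, meta);
}

/* xchg-style unconditional store; returns whatever was there before. */
static void *demo_meta_replace(struct demo_inode *di, void *meta)
{
	return atomic_exchange(&di->meta, meta);
}
```

In this model the set-once helper is only meaningful after the slot has
been put into a known state, for example by an unconditional store of NULL
at allocation time.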

I plan to move the GET_FMAP message to OPEN time rather than
LOOKUP - unless that leads to problems that I don't
currently foresee. The GET_FMAP response will also get a
variable-sized payload.

Darrick and I have met and discussed commonality between our
use cases, and the only thing from famfs that he will be able
to directly use is the GET_FMAP message/response - but likely
with a different response payload. The file operations in
famfs.c are not applicable for Darrick, as they only handle
mapping file offsets to devdax offsets (i.e. fs-dax over
devdax).

Darrick is primarily exploring adapting block-backed file
systems to use fuse. These are conventional page-cache-backed
files that will primarily be read and written between
blockdev and page cache.

(Darrick, correct me if I got anything important wrong there.)

In prep for Darrick, I'll add an offset and length to the
GET_FMAP message, to specify what range of the file map is
being requested. I'll also add a new "first header" struct
in the GET_FMAP response that can accommodate additional fmap
types, and will specify the file size as well as the offset
and length of the fmap portion described in the response
(allowing for GET_FMAP responses that contain an incomplete
file map).
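
One possible shape for the ranged request and the "first header" described
above is sketched here. Every struct and field name is hypothetical; the
actual layout is not specified in this message, and this is not the famfs
or fuse ABI.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical request body: which part of the file map is wanted. */
struct demo_fmap_request {
	uint64_t offset;	/* start of requested range, in bytes */
	uint64_t length;	/* length of requested range, in bytes */
};

/* Hypothetical "first header" of the response. */
struct demo_fmap_header {
	uint32_t fmap_type;	/* selects the payload format that follows */
	uint32_t reserved;	/* keeps the 64-bit fields aligned */
	uint64_t file_size;	/* size of the whole file */
	uint64_t fmap_offset;	/* start of the range this response covers */
	uint64_t fmap_length;	/* length of the range this response covers */
};

/* True if the response describes the entire file, not a partial map. */
static bool demo_fmap_is_complete(const struct demo_fmap_header *h)
{
	return h->fmap_offset == 0 && h->fmap_length >= h->file_size;
}
```

A caller that receives an incomplete map would have to issue further
requests for the uncovered ranges.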

If there is desire to give GET_FMAP a different name, like
GET_IOMAP, I don't much care - although the term "iomap" may
be a bit overloaded already (e.g. the dax_iomap_rw()/
dax_iomap_fault() functions debatably didn't need "iomap"
in their names since they're about converting a file offset
range to daxdev ranges, and they don't handle anything
specifically iomap-related). At least "FMAP" is more narrowly
descriptive of what it is.

I don't think Darrick needs GET_DAXDEV (or anything
analogous), because passing in the backing dev at mount time
seems entirely sufficient - so I assume that at least for now
GET_DAXDEV won't be shared. But famfs definitely needs
GET_DAXDEV, because files must be able to interleave across
memory devices.

The one issue that I will kick down the road until v3 is
fixing the "poisoned page|folio" problem. Because of that,
v2 of this series will still be against a 6.14 kernel. Not
solving that problem means this series won't be merge-able
until v3.

I hope this is all clear and obvious. Let me know if not (or
if so).

Thanks,
John


[1] https://lore.kernel.org/linux-fsdevel/20250421013346.32530-1-john@groves.net/T/#me47467b781d6c637899a38b898c27afb619206e0


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 00/19] famfs: port into fuse
  2025-05-21 22:30 ` John Groves
@ 2025-05-21 23:11   ` Darrick J. Wong
  2025-05-22 15:55   ` Amir Goldstein
  1 sibling, 0 replies; 58+ messages in thread
From: Darrick J. Wong @ 2025-05-21 23:11 UTC (permalink / raw)
  To: John Groves
  Cc: Dan Williams, Miklos Szeredi, Bernd Schubert, John Groves,
	Jonathan Corbet, Vishal Verma, Dave Jiang, Matthew Wilcox,
	Jan Kara, Alexander Viro, Christian Brauner, Luis Henriques,
	Randy Dunlap, Jeff Layton, Kent Overstreet, Petr Vorel,
	Brian Foster, linux-doc, linux-kernel, nvdimm, linux-cxl,
	linux-fsdevel, Amir Goldstein, Jonathan Cameron, Stefan Hajnoczi,
	Joanne Koong, Josef Bacik, Aravind Ramesh, Ajay Joshi,
	Alistair Popple

On Wed, May 21, 2025 at 05:30:12PM -0500, John Groves wrote:
> On 25/04/20 08:33PM, John Groves wrote:
> > Subject: famfs: port into fuse
> >
> > <snip>
> 
> I'm planning to apply the review comments and send v2 of
> this patch series soon - hopefully next week.

Heh, I'm just about to push go on an RFC patchbomb for the entirety of
fuse + iomap + ext4-fuse2fs.

> I asked a couple of specific questions for Miklos and
> Amir at [1] that I hope they will answer in the next few
> days. Do you object to zeroing fuse_inodes when they're
> allocated, and do I really need an xchg() to set the
> fi->famfs_meta pointer during fuse_alloc_inode()? cmpxchg
> would be good for avoiding stepping on an "already set"
> pointer, but not useful if fi->famfs_meta has random
> contents (which it does when allocated).

I guess you could always null it out in fuse_inode_init_once and again
when you free the inode...

> I plan to move the GET_FMAP message to OPEN time rather than
> LOOKUP - unless that leads to problems that I don't
> currently foresee. The GET_FMAP response will also get a
> variable-sized payload.
> 
> Darrick and I have met and discussed commonality between our
> use cases, and the only thing from famfs that he will be able
> to directly use is the GET_FMAP message/response - but likely
> with a different response payload. The file operations in
> famfs.c are not applicable for Darrick, as they only handle
> mapping file offsets to devdax offsets (i.e. fs-dax over
> devdax).
> 
> Darrick is primarily exploring adapting block-backed file
> systems to use fuse. These are conventional page-cache-backed
> files that will primarily be read and written between
> blockdev and page cache.

Yeah, I really do need to get moving on sending out the RFC.

Everyone: patchbomb incoming!

> (Darrick, correct me if I got anything important wrong there.)
> 
> In prep for Darrick, I'll add an offset and length to the
> GET_FMAP message, to specify what range of the file map is
> being requested. I'll also add a new "first header" struct
> in the GET_FMAP response that can accommodate additional fmap
> types, and will specify the file size as well as the offset
> and length of the fmap portion described in the response
> (allowing for GET_FMAP responses that contain an incomplete
> file map).

Hrrmrmrmm.  I don't think there's much use in trying to share a fuse
command but then have to parse through the size of the response to
figure out what the server actually sent back.  It's less confusing to
have just one response type per fuse command.

I also don't think that FUSE_IOMAP_BEGIN is all that good of an
interface for what John is trying to do.  A regular filesystem creates
whatever mappings it likes in response to the far too many file IO APIs
in Linux, and needs to throw them at the kernel.  OTOH, famfs'
management daemon creates a static mapping with repeating elements and
that gets uploaded in one go via FUSE_GET_FMAP.  Yes, we could mash them
into a single uncohesive mess of an interface, but why would we torture
ourselves so?

(For me it's the "repeating sequences" aspect of GET_FMAP that makes me
think it should be a separate interface.  OTOH I haven't thought much
about how to support filesystems that implement RAID.)

> If there is desire to give GET_FMAP a different name, like
> GET_IOMAP, I don't much care - although the term "iomap" may
> be a bit overloaded already (e.g. the dax_iomap_rw()/
> dax_iomap_fault() functions debatably didn't need "iomap"
> in their names since they're about converting a file offset
> range to daxdev ranges, and they don't handle anything
> specifically iomap-related). At least "FMAP" is more narrowly
> descriptive of what it is.
> 
> I don't think Darrick needs GET_DAXDEV (or anything
> analogous), because passing in the backing dev at mount time
> seems entirely sufficient - so I assume that at least for now
> GET_DAXDEV won't be shared. But famfs definitely needs
> GET_DAXDEV, because files must be able to interleave across
> memory devices.

I actually /did/ add a notification so that the fuse server can tell the
kernel that they'd like to use a particular fd with iomap.  It doesn't
support dax devices by virtue of gatekeeping on S_ISBLK, but it wouldn't
be hard to do that.

> The one issue that I will kick down the road until v3 is
> fixing the "poisoned page|folio" problem. Because of that,
> v2 of this series will still be against a 6.14 kernel. Not
> solving that problem means this series won't be merge-able
> until v3.
> 
> I hope this is all clear and obvious. Let me know if not (or
> if so).

Hee hee.

--D

> 
> Thanks,
> John
> 
> 
> [1] https://lore.kernel.org/linux-fsdevel/20250421013346.32530-1-john@groves.net/T/#me47467b781d6c637899a38b898c27afb619206e0
> 
> 

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 12/19] famfs_fuse: Plumb the GET_FMAP message/response
  2025-05-12 16:28     ` John Groves
@ 2025-05-22 15:45       ` Amir Goldstein
  2025-05-23  0:30         ` John Groves
  0 siblings, 1 reply; 58+ messages in thread
From: Amir Goldstein @ 2025-05-22 15:45 UTC (permalink / raw)
  To: John Groves
  Cc: Joanne Koong, Dan Williams, Miklos Szeredi, Bernd Schubert,
	John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Darrick J . Wong, Luis Henriques, Randy Dunlap, Jeff Layton,
	Kent Overstreet, Petr Vorel, Brian Foster, linux-doc,
	linux-kernel, nvdimm, linux-cxl, linux-fsdevel, Jonathan Cameron,
	Stefan Hajnoczi, Josef Bacik, Aravind Ramesh, Ajay Joshi

On Mon, May 12, 2025 at 6:28 PM John Groves <John@groves.net> wrote:
>
> On 25/05/01 10:48PM, Joanne Koong wrote:
> > On Sun, Apr 20, 2025 at 6:34 PM John Groves <John@groves.net> wrote:
> > >
> > > Upon completion of a LOOKUP, if we're in famfs-mode we do a GET_FMAP to
> > > retrieve and cache up the file-to-dax map in the kernel. If this
> > > succeeds, read/write/mmap are resolved direct-to-dax with no upcalls.
> > >
> > > Signed-off-by: John Groves <john@groves.net>
> > > ---
...
> > > diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> > > index 7f4b73e739cb..848c8818e6f7 100644
> > > --- a/fs/fuse/inode.c
> > > +++ b/fs/fuse/inode.c
> > > @@ -117,6 +117,9 @@ static struct inode *fuse_alloc_inode(struct super_block *sb)
> > >         if (IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
> > >                 fuse_inode_backing_set(fi, NULL);
> > >
> > > +       if (IS_ENABLED(CONFIG_FUSE_FAMFS_DAX))
> > > +               famfs_meta_set(fi, NULL);
> >
> > "fi->famfs_meta = NULL;" looks simpler here
>
> I toootally agree here, but I was following the passthrough pattern
> just above.  @miklos or @Amir, got a preference here?
>

It's not about preference; this fails the build.
Even though the compiler (or preprocessor, whatever) should be able to skip
"fi->famfs_meta = NULL;" if CONFIG_FUSE_FAMFS_DAX is not defined,
IIRC it does not. Feel free to try. Maybe I do not remember correctly.

Thanks,
Amir.
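
The build failure described above follows from how C treats a dead
branch: a block guarded by a constant-false condition is still parsed and
type-checked, so it cannot name a struct member that was compiled out.
The user-space sketch below assumes the field is declared only when the
option is enabled; CONFIG_DEMO_FAMFS_DAX, struct demo_inode, and the
helpers are hypothetical stand-ins, not the kernel code.

```c
#include <stddef.h>

/* Stand-in for the Kconfig symbol; build with -DCONFIG_DEMO_FAMFS_DAX=1 to enable. */
#ifndef CONFIG_DEMO_FAMFS_DAX
#define CONFIG_DEMO_FAMFS_DAX 0
#endif

struct demo_inode {
	int nodeid;
#if CONFIG_DEMO_FAMFS_DAX
	void *famfs_meta;	/* exists only when the option is on */
#endif
};

/* Header-style helper: the #if lives here, not at the call site. */
#if CONFIG_DEMO_FAMFS_DAX
static inline void demo_famfs_meta_set(struct demo_inode *di, void *meta)
{
	di->famfs_meta = meta;
}
#else
static inline void demo_famfs_meta_set(struct demo_inode *di, void *meta)
{
	(void)di;	/* no field to set when the option is off */
	(void)meta;
}
#endif

static void demo_alloc_inode_init(struct demo_inode *di)
{
	di->nodeid = 0;

	/*
	 * Writing "di->famfs_meta = NULL;" inside this branch would not
	 * compile with the option off, even though the branch is dead:
	 * the member does not exist. The helper call always compiles.
	 */
	if (CONFIG_DEMO_FAMFS_DAX)
		demo_famfs_meta_set(di, NULL);
}
```

The helper keeps the conditional compilation in one place, so the
allocation path needs no #ifdef of its own.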

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 00/19] famfs: port into fuse
  2025-05-21 22:30 ` John Groves
  2025-05-21 23:11   ` Darrick J. Wong
@ 2025-05-22 15:55   ` Amir Goldstein
  1 sibling, 0 replies; 58+ messages in thread
From: Amir Goldstein @ 2025-05-22 15:55 UTC (permalink / raw)
  To: John Groves
  Cc: Dan Williams, Miklos Szeredi, Bernd Schubert, John Groves,
	Jonathan Corbet, Vishal Verma, Dave Jiang, Matthew Wilcox,
	Jan Kara, Alexander Viro, Christian Brauner, Darrick J . Wong,
	Luis Henriques, Randy Dunlap, Jeff Layton, Kent Overstreet,
	Petr Vorel, Brian Foster, linux-doc, linux-kernel, nvdimm,
	linux-cxl, linux-fsdevel, Jonathan Cameron, Stefan Hajnoczi,
	Joanne Koong, Josef Bacik, Aravind Ramesh, Ajay Joshi,
	Alistair Popple

On Thu, May 22, 2025 at 12:30 AM John Groves <John@groves.net> wrote:
>
> On 25/04/20 08:33PM, John Groves wrote:
> > Subject: famfs: port into fuse
> >
> > <snip>
>
> I'm planning to apply the review comments and send v2 of
> this patch series soon - hopefully next week.
>
> I asked a couple of specific questions for Miklos and
> Amir at [1] that I hope they will answer in the next few
> days.

I missed this question.
Feel free to ping me next time if I am not answering.

> Do you object to zeroing fuse_inodes when they're
> allocated, and do I really need an xchg() to set the
> fi->famfs_meta pointer during fuse_alloc_inode()? cmpxchg
> would be good for avoiding stepping on an "already set"
> pointer, but not useful if fi->famfs_meta has random
> contents (which it does when allocated).
>

I don't have anything against zeroing the fuse inode fields
but be careful not to step over fuse_inode_init_once().

The answer to the xchg() question is technically quite boring.
At least in my case it was done to avoid an #ifdef in a C file.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 12/19] famfs_fuse: Plumb the GET_FMAP message/response
  2025-05-22 15:45       ` Amir Goldstein
@ 2025-05-23  0:30         ` John Groves
  0 siblings, 0 replies; 58+ messages in thread
From: John Groves @ 2025-05-23  0:30 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Joanne Koong, Dan Williams, Miklos Szeredi, Bernd Schubert,
	John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Darrick J . Wong, Luis Henriques, Randy Dunlap, Jeff Layton,
	Kent Overstreet, Petr Vorel, Brian Foster, linux-doc,
	linux-kernel, nvdimm, linux-cxl, linux-fsdevel, Jonathan Cameron,
	Stefan Hajnoczi, Josef Bacik, Aravind Ramesh, Ajay Joshi

On 25/05/22 05:45PM, Amir Goldstein wrote:
> On Mon, May 12, 2025 at 6:28 PM John Groves <John@groves.net> wrote:
> >
> > On 25/05/01 10:48PM, Joanne Koong wrote:
> > > On Sun, Apr 20, 2025 at 6:34 PM John Groves <John@groves.net> wrote:
> > > >
> > > > Upon completion of a LOOKUP, if we're in famfs-mode we do a GET_FMAP to
> > > > retrieve and cache up the file-to-dax map in the kernel. If this
> > > > succeeds, read/write/mmap are resolved direct-to-dax with no upcalls.
> > > >
> > > > Signed-off-by: John Groves <john@groves.net>
> > > > ---
> ...
> > > > diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> > > > index 7f4b73e739cb..848c8818e6f7 100644
> > > > --- a/fs/fuse/inode.c
> > > > +++ b/fs/fuse/inode.c
> > > > @@ -117,6 +117,9 @@ static struct inode *fuse_alloc_inode(struct super_block *sb)
> > > >         if (IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
> > > >                 fuse_inode_backing_set(fi, NULL);
> > > >
> > > > +       if (IS_ENABLED(CONFIG_FUSE_FAMFS_DAX))
> > > > +               famfs_meta_set(fi, NULL);
> > >
> > > "fi->famfs_meta = NULL;" looks simpler here
> >
> > I toootally agree here, but I was following the passthrough pattern
> > just above.  @miklos or @Amir, got a preference here?
> >
> 
> It's not about preference; this fails the build.
> Even though the compiler (or preprocessor, whatever) should be able to skip
> "fi->famfs_meta = NULL;" if CONFIG_FUSE_FAMFS_DAX is not defined,
> IIRC it does not. Feel free to try. Maybe I do not remember correctly.
> 
> Thanks,
> Amir.

Thanks Amir. Will fiddle with this when cleaning up V2.

John


^ permalink raw reply	[flat|nested] 58+ messages in thread

end of thread, other threads:[~2025-05-23  0:30 UTC | newest]

Thread overview: 58+ messages
2025-04-21  1:33 [RFC PATCH 00/19] famfs: port into fuse John Groves
2025-04-21  1:33 ` [RFC PATCH 01/19] dev_dax_iomap: Move dax_pgoff_to_phys() from device.c to bus.c John Groves
2025-04-21  1:33 ` [RFC PATCH 02/19] dev_dax_iomap: Add fs_dax_get() func to prepare dax for fs-dax usage John Groves
2025-04-21  1:33 ` [RFC PATCH 03/19] dev_dax_iomap: Save the kva from memremap John Groves
2025-04-21  1:33 ` [RFC PATCH 04/19] dev_dax_iomap: Add dax_operations for use by fs-dax on devdax John Groves
2025-04-21  1:33 ` [RFC PATCH 05/19] dev_dax_iomap: export dax_dev_get() John Groves
2025-04-21  1:33 ` [RFC PATCH 06/19] dev_dax_iomap: (ignore!) Drop poisoned page warning in fs/dax.c John Groves
2025-04-21  1:33 ` [RFC PATCH 07/19] famfs_fuse: magic.h: Add famfs magic numbers John Groves
2025-04-21  1:33 ` [RFC PATCH 08/19] famfs_fuse: Kconfig John Groves
2025-04-21  1:33 ` [RFC PATCH 09/19] famfs_fuse: Update macro s/FUSE_IS_DAX/FUSE_IS_VIRTIO_DAX/ John Groves
2025-04-21  1:33 ` [RFC PATCH 10/19] famfs_fuse: Basic fuse kernel ABI enablement for famfs John Groves
2025-04-23  1:36   ` Joanne Koong
2025-04-23 20:23     ` John Groves
2025-04-21  1:33 ` [RFC PATCH 11/19] famfs_fuse: Basic famfs mount opts John Groves
2025-04-23  1:51   ` Joanne Koong
2025-04-23 20:19     ` John Groves
2025-04-21  1:33 ` [RFC PATCH 12/19] famfs_fuse: Plumb the GET_FMAP message/response John Groves
2025-05-02  5:48   ` Joanne Koong
2025-05-02 20:35     ` Darrick J. Wong
2025-05-12 16:28     ` John Groves
2025-05-22 15:45       ` Amir Goldstein
2025-05-23  0:30         ` John Groves
2025-04-21  1:33 ` [RFC PATCH 13/19] famfs_fuse: Create files with famfs fmaps John Groves
2025-04-21 21:57   ` Darrick J. Wong
2025-04-21 22:31     ` John Groves
2025-04-24 13:43   ` John Groves
2025-04-24 14:38     ` Darrick J. Wong
2025-04-28  1:48       ` John Groves
2025-04-28 19:00         ` Darrick J. Wong
2025-05-06 16:56           ` Miklos Szeredi
2025-05-08 15:56             ` Darrick J. Wong
2025-05-13  9:14               ` Miklos Szeredi
2025-05-15  2:06                 ` Darrick J. Wong
2025-05-16 10:06                   ` Miklos Szeredi
2025-05-16 23:17                     ` Darrick J. Wong
2025-05-12 19:51             ` John Groves
2025-05-13  4:03               ` Darrick J. Wong
2025-04-21  1:33 ` [RFC PATCH 14/19] famfs_fuse: GET_DAXDEV message and daxdev_table John Groves
2025-04-21  3:43   ` Randy Dunlap
2025-04-21 20:57     ` John Groves
2025-04-21  1:33 ` [RFC PATCH 15/19] famfs_fuse: Plumb dax iomap and fuse read/write/mmap John Groves
2025-04-21  1:33 ` [RFC PATCH 16/19] famfs_fuse: Add holder_operations for dax notify_failure() John Groves
2025-04-21  1:33 ` [RFC PATCH 17/19] famfs_fuse: Add famfs metadata documentation John Groves
2025-04-21  3:51   ` Randy Dunlap
2025-04-21 21:00     ` John Groves
2025-04-21  1:33 ` [RFC PATCH 18/19] famfs_fuse: Add documentation John Groves
2025-04-22  2:10   ` Randy Dunlap
2025-04-28  1:50     ` John Groves
2025-04-21  1:33 ` [RFC PATCH 19/19] famfs_fuse: (ignore) debug cruft John Groves
2025-04-21 18:27 ` [RFC PATCH 00/19] famfs: port into fuse Darrick J. Wong
2025-04-21 22:00   ` John Groves
2025-04-22  1:25     ` Darrick J. Wong
2025-04-22 11:50       ` John Groves
2025-04-30 14:42 ` Alireza Sanaee
2025-05-01  2:13   ` John Groves
2025-05-21 22:30 ` John Groves
2025-05-21 23:11   ` Darrick J. Wong
2025-05-22 15:55   ` Amir Goldstein
