linux-cxl.vger.kernel.org archive mirror
* [RFC V2 00/18] famfs: port into fuse
@ 2025-07-03 18:50 John Groves
From: John Groves @ 2025-07-03 18:50 UTC
  To: John Groves, Dan Williams, Miklos Szeredi, Bernd Schubert
  Cc: John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Darrick J . Wong, Randy Dunlap, Jeff Layton, Kent Overstreet,
	linux-doc, linux-kernel, nvdimm, linux-cxl, linux-fsdevel,
	Amir Goldstein, Jonathan Cameron, Stefan Hajnoczi, Joanne Koong,
	Josef Bacik, Aravind Ramesh, Ajay Joshi, John Groves

Changes since v1:

- The GET_FMAP message/response has been moved from LOOKUP to OPEN, per the
  near-unanimous consensus on V1.
- Made the response payload to GET_FMAP variable sized (patch 12)
- Dodgy kerneldoc comments cleaned up or removed.
- Fixed memory leak of fc->shadow in patch 11 (thanks Joanne)
- Dropped many pr_debug and pr_notice calls

Open Issues:

- This is still marked RFC because I have not tackled the "poisoned page
  problem" yet (see the original description below). That's next on my
  agenda for this patch set; I'm planning to address that in V3, and to drop
  RFC and make V3 mergeable.
- Note: this patch is still against 6.14 because of the interaction of the
  poisoned page issue with Alistair Popple's multitudinous recent DAX
  patches. ;) I have some work to do to move forward, but the next rev will
  do that.
- Because I haven't moved forward past 6.14, the related libfuse patch [2.1]
  is out of sync with the libfuse master branch. This will be addressed in the
  next version.

Other Notes:
- This patch is available as a git branch at [2.2]

References for V2:
[2.1] - https://github.com/libfuse/libfuse/pull/1271
[2.2] - https://github.com/cxl-micron-reskit/famfs-linux/tree/famfs-fuse-v2


Original Description:

This is the initial RFC for the fabric-attached memory file system (famfs)
integration into fuse. In order to function, this requires a related patch
to libfuse [1] and the famfs user space [2]. 

This RFC is mainly intended to socialize the approach and get feedback from
the fuse developers and maintainers. There is some dax work that needs to
be done before this should be merged (see the "poisoned page|folio problem"
below).

This patch set fully works with Linux 6.14 -- passing all existing famfs
smoke and unit tests -- and I encourage existing famfs users to test it.

This is really two patch sets mashed up:

* The patches with the dev_dax_iomap: prefix fill in missing functionality for
  devdax to host an fs-dax file system.
* The famfs_fuse: patches add famfs into fs/fuse/. These are effectively
  unchanged since last year.

Because this is not ready to merge yet, I have felt free to leave some debug
prints in place because we still find them useful; those will be cleaned up
in a subsequent revision.

Famfs Overview

Famfs exposes shared memory as a file system. Famfs consumes shared memory
from dax devices, and provides memory-mappable files that map directly to
the memory - no page cache involvement. Famfs differs from conventional
file systems in fs-dax mode, in that it handles in-memory metadata in a
sharable way (which begins with never caching dirty shared metadata).

Famfs started as a standalone file system [3,4], but the consensus at LSFMM
2024 [5] was that it should be ported into fuse - and this RFC is the first
public evidence that I've been working on that.

The key performance requirement is that famfs must resolve mapping faults
without upcalls. This is achieved by fully caching the file-to-devdax
metadata for all active files. This is done via two fuse client/server
message/response pairs: GET_FMAP and GET_DAXDEV.

Famfs remains the first fs-dax file system that is backed by devdax rather
than pmem in fs-dax mode (hence the need for the dev_dax_iomap fixups).
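
To illustrate why a fully cached fmap lets faults be resolved without
upcalls, here is a minimal user-space sketch. It is not the code from this
series: the sketch_* types and names are hypothetical, and it assumes only
simple linear extents stored in file order. Once the extent list is cached,
turning a file offset into a (daxdev, device offset) pair is a local walk:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified types; NOT the structures used by the patches. */
struct sketch_extent {
	uint32_t dev_index;   /* which daxdev backs this extent */
	uint64_t dev_offset;  /* byte offset of the extent on that daxdev */
	uint64_t len;         /* extent length in bytes */
};

struct sketch_fmap {
	size_t nextents;
	const struct sketch_extent *ext;  /* extents in file order */
};

/*
 * Resolve a file offset to (daxdev index, device offset) using only the
 * cached extent list, i.e. without asking the fuse server anything.
 * Returns 0 on success, -1 if the offset is beyond the mapped range.
 */
static int sketch_resolve(const struct sketch_fmap *fm, uint64_t file_off,
			  uint32_t *dev_index, uint64_t *dev_off)
{
	uint64_t start = 0;  /* file offset where the current extent begins */

	assert(fm && dev_index && dev_off);

	for (size_t i = 0; i < fm->nextents; i++) {
		const struct sketch_extent *e = &fm->ext[i];

		/* file_off >= start always holds here, so no underflow */
		if (file_off - start < e->len) {
			*dev_index = e->dev_index;
			*dev_off = e->dev_offset + (file_off - start);
			return 0;
		}
		start += e->len;
	}
	return -1;
}
```

The design point is that the cost of a fault depends only on the cached
extent list, not on a round trip to user space.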

Notes

* Once the dev_dax_iomap patches land, I suspect it may make sense for
  virtiofs to update to use the improved interface.

* I'm currently maintaining compatibility between the famfs user space and
  both the standalone famfs kernel file system and this new fuse
  implementation. In the near future I'll be running performance comparisons
  and sharing them - but there is no reason to expect significant degradation
  with fuse, since famfs caches entire "fmaps" in the kernel to resolve
  faults with no upcalls. This patch has a bit too much debug turned on to
  do that testing quite yet. A branch

* Two new fuse messages / responses are added: GET_FMAP and GET_DAXDEV.

* When a file is looked up in a famfs mount, the LOOKUP is followed by a
  GET_FMAP message and response. The "fmap" is the full file-to-dax mapping,
  allowing the fuse/famfs kernel code to handle read/write/fault without any
  upcalls. (As of V2, GET_FMAP is sent at OPEN instead; see "Changes since
  v1" above.)

* After each GET_FMAP, the fmap is checked for extents that reference
  previously-unknown daxdevs. Each such occurrence is handled with a
  GET_DAXDEV message and response.

* Daxdevs are stored in a table (which might become an xarray at some point).
  When entries are added to the table, we acquire exclusive access to the
  daxdev via the fs_dax_get() call (modeled after how fs-dax handles this
  with pmem devices). famfs provides holder_operations to devdax, providing
  a notification path in the event of memory errors.

* If devdax notifies famfs of memory errors on a dax device, famfs currently
  blocks all subsequent accesses to data on that device. The recovery is to
  re-initialize the memory and file system. Famfs is memory, not storage...

* Because famfs uses backing (devdax) devices, only privileged mounts are
  supported.

* The famfs kernel code never accesses the memory directly - it only
  facilitates read, write and mmap on behalf of user processes. As such,
  the RAS of the shared memory affects applications, but not the kernel.

* Famfs has backing device(s), but they are devdax (char) rather than
  block. Right now there is no way to tell the vfs layer that famfs has a
  char backing device (unless we say it's block, but it's not). Currently
  we use the standard anonymous fuse fs_type - but I'm not sure that's
  ultimately optimal (thoughts?)
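
The daxdev handling described in the notes above can be sketched in user
space as follows. This is a toy model under stated assumptions, not the
kernel implementation: the sketch_* names, the fixed-size table, and the
"lookups" counter are all hypothetical, and sketch_get_daxdev() merely
stands in for a GET_DAXDEV round trip. It shows the three behaviors named
above: unknown daxdevs are fetched once, known ones are not re-fetched, and
a memory-error notification blocks later accesses to that device.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_DAXDEVS 8  /* hypothetical table size */

struct sketch_daxdev {
	bool valid;  /* entry populated (simulated GET_DAXDEV completed) */
	bool error;  /* memory error reported; accesses are refused */
};

struct sketch_daxdev_table {
	struct sketch_daxdev dev[SKETCH_MAX_DAXDEVS];
	unsigned int lookups;  /* counts simulated GET_DAXDEV round trips */
};

/* Stand-in for one GET_DAXDEV message/response. */
static int sketch_get_daxdev(struct sketch_daxdev_table *t, uint32_t idx)
{
	if (idx >= SKETCH_MAX_DAXDEVS)
		return -1;
	t->dev[idx].valid = true;
	t->dev[idx].error = false;
	t->lookups++;
	return 0;
}

/* After a new fmap arrives, fetch every daxdev it references that is unknown. */
static int sketch_check_fmap_devs(struct sketch_daxdev_table *t,
				  const uint32_t *dev_indices, size_t n)
{
	assert(t && (dev_indices || n == 0));

	for (size_t i = 0; i < n; i++) {
		uint32_t idx = dev_indices[i];

		if (idx >= SKETCH_MAX_DAXDEVS)
			return -1;
		if (!t->dev[idx].valid && sketch_get_daxdev(t, idx) != 0)
			return -1;
	}
	return 0;
}

/* Models the memory-error notification path: mark the device as failed. */
static void sketch_notify_failure(struct sketch_daxdev_table *t, uint32_t idx)
{
	if (idx < SKETCH_MAX_DAXDEVS && t->dev[idx].valid)
		t->dev[idx].error = true;
}

/* Accesses are allowed only on known devices with no reported error. */
static bool sketch_access_ok(const struct sketch_daxdev_table *t, uint32_t idx)
{
	return idx < SKETCH_MAX_DAXDEVS && t->dev[idx].valid &&
	       !t->dev[idx].error;
}
```

In this model an error on one device leaves the others usable; whether the
real code scopes the block per device in exactly this way is an assumption
drawn from the wording "data on that device" above.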

The "poisoned page|folio problem"

* Background: before doing a kernel mount, the famfs user space [2] validates
  the superblock and log. This is done via raw mmap of the primary devdax
  device. If valid, the file system is mounted, and the superblock and log
  get exposed through a pair of files (.meta/.superblock and .meta/.log) -
  because we can't be using raw device mmap when a file system is mounted
  on the device. But this exposes a devdax bug and warning...

* Pages that have been memory mapped via devdax are left in a permanently
  problematic state. Devdax sets page|folio->mapping when a page is accessed
  via raw devdax mmap (as famfs does before mount), but never cleans it up.
  When the pages of the famfs superblock and log are accessed via the "meta"
  files after mount, we see a WARN_ONCE() in dax_insert_entry(), which
  notices that page|folio->mapping is still set. I intend to address this
  prior to asking for the famfs patches to be merged.

* Alistair Popple's recent dax patch series [6], which has been merged
  for 6.15, addresses some dax issues, but sadly does not fix the poisoned
  page|folio problem - its enhanced refcount checking turns the warning into
  an error.

* This 6.14 patch set disables the warning; a proper fix will be required for
  famfs to work at all in 6.15. Dan W. and I are actively discussing how to do
  this properly...

* In terms of the correct functionality of famfs, the warning can be ignored.
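
For readers unfamiliar with the issue, the sequence described above can be
modeled with a deliberately tiny user-space sketch. Everything here is
hypothetical (the sketch_* names, the single "mapping" field); it is not
kernel code and does not reproduce what dax_insert_entry() actually does.
It only shows the shape of the problem: a raw devdax mapping leaves the
field set, nothing clears it, and a later fs-dax style check sees the stale
value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a page/folio; only the field relevant to the issue. */
struct sketch_page {
	const void *mapping;
};

/* Models a fault through a raw devdax mmap: the mapping gets set. */
static void sketch_devdax_fault(struct sketch_page *pg,
				const void *devdax_mapping)
{
	assert(pg);
	pg->mapping = devdax_mapping;
}

/* Models unmapping the raw device: the mapping is NOT cleared (the bug). */
static void sketch_devdax_unmap(struct sketch_page *pg)
{
	assert(pg);
	/* intentionally leaves pg->mapping untouched */
}

/*
 * Models the check made when the same page is later inserted on behalf of
 * a file: returns true if a stale mapping would trigger the warning.
 */
static bool sketch_fsdax_insert_would_warn(const struct sketch_page *pg,
					   const void *file_mapping)
{
	assert(pg);
	return pg->mapping != NULL && pg->mapping != file_mapping;
}
```

In this model, a page that was never touched through raw devdax passes the
check, while the superblock and log pages (mapped raw before mount) do not.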

References

[1] - https://github.com/libfuse/libfuse/pull/1200
[2] - https://github.com/cxl-micron-reskit/famfs
[3] - https://lore.kernel.org/linux-cxl/cover.1708709155.git.john@groves.net/
[4] - https://lore.kernel.org/linux-cxl/cover.1714409084.git.john@groves.net/
[5] - https://lwn.net/Articles/983105/
[6] - https://lore.kernel.org/linux-cxl/cover.8068ad144a7eea4a813670301f4d2a86a8e68ec4.1740713401.git-series.apopple@nvidia.com/


John Groves (18):
  dev_dax_iomap: Move dax_pgoff_to_phys() from device.c to bus.c
  dev_dax_iomap: Add fs_dax_get() func to prepare dax for fs-dax usage
  dev_dax_iomap: Save the kva from memremap
  dev_dax_iomap: Add dax_operations for use by fs-dax on devdax
  dev_dax_iomap: export dax_dev_get()
  dev_dax_iomap: (ignore!) Drop poisoned page warning in fs/dax.c
  famfs_fuse: magic.h: Add famfs magic numbers
  famfs_fuse: Kconfig
  famfs_fuse: Update macro s/FUSE_IS_DAX/FUSE_IS_VIRTIO_DAX/
  famfs_fuse: Basic fuse kernel ABI enablement for famfs
  famfs_fuse: Basic famfs mount opts
  famfs_fuse: Plumb the GET_FMAP message/response
  famfs_fuse: Create files with famfs fmaps
  famfs_fuse: GET_DAXDEV message and daxdev_table
  famfs_fuse: Plumb dax iomap and fuse read/write/mmap
  famfs_fuse: Add holder_operations for dax notify_failure()
  famfs_fuse: Add famfs metadata documentation
  famfs_fuse: Add documentation

 Documentation/filesystems/famfs.rst |  142 ++++
 Documentation/filesystems/index.rst |    1 +
 MAINTAINERS                         |   10 +
 drivers/dax/Kconfig                 |    6 +
 drivers/dax/bus.c                   |  144 +++-
 drivers/dax/dax-private.h           |    1 +
 drivers/dax/device.c                |   38 +-
 drivers/dax/super.c                 |   33 +-
 fs/dax.c                            |    1 -
 fs/fuse/Kconfig                     |   13 +
 fs/fuse/Makefile                    |    2 +-
 fs/fuse/dir.c                       |    2 +-
 fs/fuse/famfs.c                     | 1087 +++++++++++++++++++++++++++
 fs/fuse/famfs_kfmap.h               |  166 ++++
 fs/fuse/file.c                      |  124 ++-
 fs/fuse/fuse_i.h                    |   67 +-
 fs/fuse/inode.c                     |   59 +-
 fs/fuse/iomode.c                    |    2 +-
 fs/namei.c                          |    1 +
 include/linux/dax.h                 |    6 +
 include/uapi/linux/fuse.h           |   96 +++
 include/uapi/linux/magic.h          |    2 +
 22 files changed, 1961 insertions(+), 42 deletions(-)
 create mode 100644 Documentation/filesystems/famfs.rst
 create mode 100644 fs/fuse/famfs.c
 create mode 100644 fs/fuse/famfs_kfmap.h


base-commit: b9d5d463c216763cec719c04536ea9e14512cad4
-- 
2.49.0


