From: Alireza Sanaee <alireza.sanaee@huawei.com>
To: John Groves <John@Groves.net>
Cc: Dan Williams <dan.j.williams@intel.com>,
	Miklos Szeredi <miklos@szeredb.hu>,
	Bernd Schubert <bschubert@ddn.com>,
	John Groves <jgroves@micron.com>,
	Jonathan Corbet <corbet@lwn.net>,
	Vishal Verma <vishal.l.verma@intel.com>,
	Dave Jiang <dave.jiang@intel.com>,
	"Matthew Wilcox" <willy@infradead.org>, Jan Kara <jack@suse.cz>,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	Christian Brauner <brauner@kernel.org>,
	"Darrick J . Wong" <djwong@kernel.org>,
	Luis Henriques <luis@igalia.com>,
	"Randy Dunlap" <rdunlap@infradead.org>,
	Jeff Layton <jlayton@kernel.org>,
	"Kent Overstreet" <kent.overstreet@linux.dev>,
	Petr Vorel <pvorel@suse.cz>, "Brian Foster" <bfoster@redhat.com>,
	<linux-doc@vger.kernel.org>, <linux-kernel@vger.kernel.org>,
	<nvdimm@lists.linux.dev>, <linux-cxl@vger.kernel.org>,
	<linux-fsdevel@vger.kernel.org>,
	Amir Goldstein <amir73il@gmail.com>,
	Jonathan Cameron <Jonathan.Cameron@huawei.com>,
	Stefan Hajnoczi <shajnocz@redhat.com>,
	Joanne Koong <joannelkoong@gmail.com>,
	Josef Bacik <josef@toxicpanda.com>,
	"Aravind Ramesh" <arramesh@micron.com>,
	Ajay Joshi <ajayjoshi@micron.com>
Subject: Re: [RFC PATCH 00/19] famfs: port into fuse
Date: Wed, 30 Apr 2025 15:42:32 +0100
Message-ID: <20250430154232.000045dd.alireza.sanaee@huawei.com>
In-Reply-To: <20250421013346.32530-1-john@groves.net>

On Sun, 20 Apr 2025 20:33:27 -0500
John Groves <John@Groves.net> wrote:

> Subject: famfs: port into fuse
> 
> This is the initial RFC for the fabric-attached memory file system
> (famfs) integration into fuse. In order to function, this requires a
> related patch to libfuse [1] and the famfs user space [2]. 
> 
> This RFC is mainly intended to socialize the approach and get
> feedback from the fuse developers and maintainers. There is some dax
> work that needs to be done before this should be merged (see the
> "poisoned page|folio problem" below).
> 
> This patch set fully works with Linux 6.14 -- passing all existing
> famfs smoke and unit tests -- and I encourage existing famfs users to
> test it.
> 
> This is really two patch sets mashed up:
> 
> * The patches with the dev_dax_iomap: prefix fill in missing
> functionality for devdax to host an fs-dax file system.
> * The famfs_fuse: patches add famfs into fs/fuse/. These are
> effectively unchanged since last year.
> 
> Because this is not ready to merge yet, I have felt free to leave
> some debug prints in place because we still find them useful; those
> will be cleaned up in a subsequent revision.
> 
> Famfs Overview
> 
> Famfs exposes shared memory as a file system. Famfs consumes shared
> memory from dax devices, and provides memory-mappable files that map
> directly to the memory - no page cache involvement. Famfs differs
> from conventional file systems in fs-dax mode, in that it handles
> in-memory metadata in a sharable way (which begins with never caching
> dirty shared metadata).
> 
> Famfs started as a standalone file system [3,4], but the consensus at
> LSFMM 2024 [5] was that it should be ported into fuse - and this RFC
> is the first public evidence that I've been working on that.
> 
> The key performance requirement is that famfs must resolve mapping
> faults without upcalls. This is achieved by fully caching the
> file-to-devdax metadata for all active files. This is done via two
> fuse client/server message/response pairs: GET_FMAP and GET_DAXDEV.
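As a rough userspace model of the idea in the paragraph above (not code from the patch set): once the extent list for a file is cached, translating a file offset to a device offset is a local lookup that needs no round trip to the fuse server. The names Extent, Fmap and resolve() are hypothetical, as is the extent layout; the real fmap format is defined by the patches.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Extent:
    file_offset: int  # where this extent starts within the file
    dev_index: int    # which dax device backs it (hypothetical field)
    dev_offset: int   # where the extent starts on that device
    length: int       # extent length in bytes


class Fmap:
    """Cached file-to-device mapping; every lookup is purely local."""

    def __init__(self, extents):
        self.extents = sorted(extents, key=lambda e: e.file_offset)
        self.starts = [e.file_offset for e in self.extents]

    def resolve(self, offset):
        # Find the last extent that starts at or before `offset`.
        i = bisect_right(self.starts, offset) - 1
        if i < 0:
            raise ValueError("offset is not mapped")
        ext = self.extents[i]
        delta = offset - ext.file_offset
        if delta >= ext.length:
            raise ValueError("offset is not mapped")
        return ext.dev_index, ext.dev_offset + delta


MiB = 1 << 20
fmap = Fmap([
    Extent(file_offset=0, dev_index=0, dev_offset=16 * MiB, length=2 * MiB),
    Extent(file_offset=2 * MiB, dev_index=1, dev_offset=0, length=2 * MiB),
])
print(fmap.resolve(3 * MiB))
```

The sorted start offsets make each lookup a binary search; a real implementation might pick a different structure, but the point is only that nothing on this path needs an upcall.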
> 
> Famfs remains the first fs-dax file system that is backed by devdax
> rather than pmem in fs-dax mode (hence the need for the dev_dax_iomap
> fixups).
> 
> Notes
> 
> * Once the dev_dax_iomap patches land, I suspect it may make sense for
>   virtiofs to update to use the improved interface.
> 
> * I'm currently maintaining compatibility between the famfs user
> space and both the standalone famfs kernel file system and this new
> fuse implementation. In the near future I'll be running performance
> comparisons and sharing them - but there is no reason to expect
> significant degradation with fuse, since famfs caches entire "fmaps"
> in the kernel to resolve faults with no upcalls. This patch has a bit
> too much debug turned on to do that testing quite yet. A branch
> 
> * Two new fuse messages / responses are added: GET_FMAP and
> GET_DAXDEV.
> 
> * When a file is looked up in a famfs mount, the LOOKUP is followed
> by a GET_FMAP message and response. The "fmap" is the full
> file-to-dax mapping, allowing the fuse/famfs kernel code to handle
> read/write/fault without any upcalls.
> 
> * After each GET_FMAP, the fmap is checked for extents that reference
>   previously-unknown daxdevs. Each such occurrence is handled with a
>   GET_DAXDEV message and response.
> 
> * Daxdevs are stored in a table (which might become an xarray at some
> point). When entries are added to the table, we acquire exclusive
> access to the daxdev via the fs_dax_get() call (modeled after how
> fs-dax handles this with pmem devices). famfs provides
> holder_operations to devdax, providing a notification path in the
> event of memory errors.
> 
> * If devdax notifies famfs of memory errors on a dax device, famfs
> currently blocks all subsequent accesses to data on that device. The
> recovery is to re-initialize the memory and file system. Famfs is
> memory, not storage...
> 
> * Because famfs uses backing (devdax) devices, only privileged mounts
> are supported.
> 
> * The famfs kernel code never accesses the memory directly - it only
>   facilitates read, write and mmap on behalf of user processes. As
> such, the RAS of the shared memory affects applications, but not the
> kernel.
> 
> * Famfs has backing device(s), but they are devdax (char) rather than
>   block. Right now there is no way to tell the vfs layer that famfs
> has a char backing device (unless we say it's block, but it's not).
> Currently we use the standard anonymous fuse fs_type - but I'm not
> sure that's ultimately optimal (thoughts?)
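The message flow in the notes above can be sketched as a small userspace model, under loud assumptions: FakeServer and KernelModel are hypothetical stand-ins, the extent tuples and device names are invented for illustration, and none of this mirrors the actual fuse ABI. It only shows the ordering: a lookup triggers one GET_FMAP, each previously-unknown daxdev triggers one GET_DAXDEV, later accesses generate no messages, and a reported memory error blocks further access to that device.

```python
class FakeServer:
    """Stands in for the famfs fuse server; it records each message it gets."""

    def __init__(self, fmaps, daxdevs):
        self.fmaps = fmaps      # file name -> [(dev_index, dev_offset, length)]
        self.daxdevs = daxdevs  # dev_index -> device name
        self.messages = []

    def get_fmap(self, name):
        self.messages.append(("GET_FMAP", name))
        return self.fmaps[name]

    def get_daxdev(self, dev_index):
        self.messages.append(("GET_DAXDEV", dev_index))
        return self.daxdevs[dev_index]


class KernelModel:
    """Toy model of the kernel side: cached fmaps plus a daxdev table."""

    def __init__(self, server):
        self.server = server
        self.fmap_cache = {}
        self.daxdev_table = {}
        self.failed = set()

    def lookup(self, name):
        fmap = self.server.get_fmap(name)
        # Only devices we have not seen before need a GET_DAXDEV.
        for dev_index, _dev_offset, _length in fmap:
            if dev_index not in self.daxdev_table:
                self.daxdev_table[dev_index] = self.server.get_daxdev(dev_index)
        self.fmap_cache[name] = fmap

    def notify_failure(self, dev_index):
        # Models the memory-error notification path from devdax.
        self.failed.add(dev_index)

    def access(self, name, offset):
        # Resolved entirely from cached state: no server messages here.
        pos = 0
        for dev_index, dev_offset, length in self.fmap_cache[name]:
            if pos <= offset < pos + length:
                if dev_index in self.failed:
                    raise OSError("dax device has reported memory errors")
                return self.daxdev_table[dev_index], dev_offset + (offset - pos)
            pos += length
        raise ValueError("offset is beyond the end of the file")


server = FakeServer(
    fmaps={"a": [(0, 0, 4096)], "b": [(0, 4096, 4096), (1, 0, 4096)]},
    daxdevs={0: "/dev/dax0.0", 1: "/dev/dax1.0"},  # hypothetical device names
)
kernel = KernelModel(server)
kernel.lookup("a")
kernel.lookup("b")
print(server.messages)
print(kernel.access("b", 5000))
```

In this model the second lookup only asks about device 1, because device 0 is already in the table, and access() never appends to server.messages.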
> 
> The "poisoned page|folio problem"
> 
> * Background: before doing a kernel mount, the famfs user space [2]
> validates the superblock and log. This is done via raw mmap of the
> primary devdax device. If valid, the file system is mounted, and the
> superblock and log get exposed through a pair of files
> (.meta/.superblock and .meta/.log) - because we can't be using raw
> device mmap when a file system is mounted on the device. But this
> exposes a devdax bug and warning...
> 
> * Pages that have been memory mapped via devdax are left in a
> permanently problematic state. Devdax sets page|folio->mapping when a
> page is accessed via raw devdax mmap (as famfs does before mount),
> but never cleans it up. When the pages of the famfs superblock and
> log are accessed via the "meta" files after mount, we see a
> WARN_ONCE() in dax_insert_entry(), which notices that
> page|folio->mapping is still set. I intend to address this prior to
> asking for the famfs patches to be merged.
> 
> * Alistair Popple's recent dax patch series [6], which has been merged
>   for 6.15, addresses some dax issues, but sadly does not fix the
> poisoned page|folio problem - its enhanced refcount checking turns
> the warning into an error.
> 
> * This 6.14 patch set disables the warning; a proper fix will be
> required for famfs to work at all in 6.15. Dan W. and I are actively
> discussing how to do this properly...
> 
> * In terms of the correct functionality of famfs, the warning can be
> ignored.
> 
> References
> 
> [1] - https://github.com/libfuse/libfuse/pull/1200
> [2] - https://github.com/cxl-micron-reskit/famfs
> [3] - https://lore.kernel.org/linux-cxl/cover.1708709155.git.john@groves.net/
> [4] - https://lore.kernel.org/linux-cxl/cover.1714409084.git.john@groves.net/
> [5] - https://lwn.net/Articles/983105/
> [6] - https://lore.kernel.org/linux-cxl/cover.8068ad144a7eea4a813670301f4d2a86a8e68ec4.1740713401.git-series.apopple@nvidia.com/
> 
> 
> John Groves (19):
>   dev_dax_iomap: Move dax_pgoff_to_phys() from device.c to bus.c
>   dev_dax_iomap: Add fs_dax_get() func to prepare dax for fs-dax usage
>   dev_dax_iomap: Save the kva from memremap
>   dev_dax_iomap: Add dax_operations for use by fs-dax on devdax
>   dev_dax_iomap: export dax_dev_get()
>   dev_dax_iomap: (ignore!) Drop poisoned page warning in fs/dax.c
>   famfs_fuse: magic.h: Add famfs magic numbers
>   famfs_fuse: Kconfig
>   famfs_fuse: Update macro s/FUSE_IS_DAX/FUSE_IS_VIRTIO_DAX/
>   famfs_fuse: Basic fuse kernel ABI enablement for famfs
>   famfs_fuse: Basic famfs mount opts
>   famfs_fuse: Plumb the GET_FMAP message/response
>   famfs_fuse: Create files with famfs fmaps
>   famfs_fuse: GET_DAXDEV message and daxdev_table
>   famfs_fuse: Plumb dax iomap and fuse read/write/mmap
>   famfs_fuse: Add holder_operations for dax notify_failure()
>   famfs_fuse: Add famfs metadata documentation
>   famfs_fuse: Add documentation
>   famfs_fuse: (ignore) debug cruft
> 
>  Documentation/filesystems/famfs.rst |  142 ++++
>  Documentation/filesystems/index.rst |    1 +
>  MAINTAINERS                         |   10 +
>  drivers/dax/Kconfig                 |    6 +
>  drivers/dax/bus.c                   |  144 +++-
>  drivers/dax/dax-private.h           |    1 +
>  drivers/dax/device.c                |   38 +-
>  drivers/dax/super.c                 |   33 +-
>  fs/dax.c                            |    1 -
>  fs/fuse/Kconfig                     |   13 +
>  fs/fuse/Makefile                    |    4 +-
>  fs/fuse/dev.c                       |   61 ++
>  fs/fuse/dir.c                       |   74 +-
>  fs/fuse/famfs.c                     | 1105 +++++++++++++++++++++++++++
>  fs/fuse/famfs_kfmap.h               |  166 ++++
>  fs/fuse/file.c                      |   27 +-
>  fs/fuse/fuse_i.h                    |   67 +-
>  fs/fuse/inode.c                     |   49 +-
>  fs/fuse/iomode.c                    |    2 +-
>  fs/namei.c                          |    1 +
>  include/linux/dax.h                 |    6 +
>  include/uapi/linux/fuse.h           |   63 ++
>  include/uapi/linux/magic.h          |    2 +
>  23 files changed, 1973 insertions(+), 43 deletions(-)
>  create mode 100644 Documentation/filesystems/famfs.rst
>  create mode 100644 fs/fuse/famfs.c
>  create mode 100644 fs/fuse/famfs_kfmap.h
> 
> 
> base-commit: 38fec10eb60d687e30c8c6b5420d86e8149f7557

Hi John,

Apologies if the question is far off or irrelevant.

I am trying to understand FAMFS, and I am wondering where FAMFS stands
compared to OpenSHMEM PGAS. Couldn't an OpenSHMEM-based shared memory
implementation over CXL serve the same purpose as FAMFS?

Or does FAMFS do more than that?

Thanks,
Alireza

