* [PATCH 01/10] guestmemfs: Introduce filesystem skeleton
2024-08-05 9:32 [PATCH 00/10] Introduce guestmemfs: persistent in-memory filesystem James Gowans
@ 2024-08-05 9:32 ` James Gowans
2024-08-05 10:20 ` Christian Brauner
2024-08-05 9:32 ` [PATCH 02/10] guestmemfs: add inode store, files and dirs James Gowans
From: James Gowans @ 2024-08-05 9:32 UTC (permalink / raw)
To: linux-kernel
Cc: James Gowans, Sean Christopherson, Paolo Bonzini, Alexander Viro,
Steve Sistare, Christian Brauner, Jan Kara, Anthony Yznaga,
Mike Rapoport, Andrew Morton, linux-mm, Jason Gunthorpe,
linux-fsdevel, Usama Arif, kvm, Alexander Graf, David Woodhouse,
Paul Durrant, Nicolas Saenz Julienne
Add an in-memory filesystem: guestmemfs. Memory is donated to guestmemfs
by carving it out of the normal System RAM range with the memmap= cmdline
parameter and then giving that same physical range to guestmemfs with the
guestmemfs= cmdline parameter.
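As an illustrative example (values made up, not taken from this series), donating
4 GiB of RAM starting at physical address 0x100000000 would look something like
this on the kernel command line, using the guestmemfs=<size>:<base> format
documented by the parser below:
  memmap=4G!0x100000000 guestmemfs=4G:0x100000000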
A new filesystem is added; so far it doesn't do much except persist a
super block at the start of the donated memory and allow itself to be
mounted.
A hook to x86 mm init is added to reserve the memory really early on via
the memblock allocator. There is probably a better arch-independent place
to do this...
Signed-off-by: James Gowans <jgowans@amazon.com>
---
arch/x86/mm/init_64.c | 2 +
fs/Kconfig | 1 +
fs/Makefile | 1 +
fs/guestmemfs/Kconfig | 11 ++++
fs/guestmemfs/Makefile | 6 ++
fs/guestmemfs/guestmemfs.c | 116 +++++++++++++++++++++++++++++++++++++
fs/guestmemfs/guestmemfs.h | 9 +++
include/linux/guestmemfs.h | 16 +++++
8 files changed, 162 insertions(+)
create mode 100644 fs/guestmemfs/Kconfig
create mode 100644 fs/guestmemfs/Makefile
create mode 100644 fs/guestmemfs/guestmemfs.c
create mode 100644 fs/guestmemfs/guestmemfs.h
create mode 100644 include/linux/guestmemfs.h
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 8932ba8f5cdd..39fcf017c90c 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -18,6 +18,7 @@
#include <linux/mm.h>
#include <linux/swap.h>
#include <linux/smp.h>
+#include <linux/guestmemfs.h>
#include <linux/init.h>
#include <linux/initrd.h>
#include <linux/kexec.h>
@@ -1331,6 +1332,7 @@ static void __init preallocate_vmalloc_pages(void)
void __init mem_init(void)
{
+ guestmemfs_reserve_mem();
pci_iommu_alloc();
/* clear_bss() already clear the empty_zero_page */
diff --git a/fs/Kconfig b/fs/Kconfig
index a46b0cbc4d8f..727359901da8 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -321,6 +321,7 @@ source "fs/befs/Kconfig"
source "fs/bfs/Kconfig"
source "fs/efs/Kconfig"
source "fs/jffs2/Kconfig"
+source "fs/guestmemfs/Kconfig"
# UBIFS File system configuration
source "fs/ubifs/Kconfig"
source "fs/cramfs/Kconfig"
diff --git a/fs/Makefile b/fs/Makefile
index 6ecc9b0a53f2..044524b17d63 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -129,3 +129,4 @@ obj-$(CONFIG_EFIVAR_FS) += efivarfs/
obj-$(CONFIG_EROFS_FS) += erofs/
obj-$(CONFIG_VBOXSF_FS) += vboxsf/
obj-$(CONFIG_ZONEFS_FS) += zonefs/
+obj-$(CONFIG_GUESTMEMFS_FS) += guestmemfs/
diff --git a/fs/guestmemfs/Kconfig b/fs/guestmemfs/Kconfig
new file mode 100644
index 000000000000..d87fca4822cb
--- /dev/null
+++ b/fs/guestmemfs/Kconfig
@@ -0,0 +1,11 @@
+# SPDX-License-Identifier: GPL-2.0-only
+
+config GUESTMEMFS_FS
+ bool "Persistent Guest memory filesystem (guestmemfs)"
+ help
+ An in-memory filesystem on top of reserved memory specified via
+ guestmemfs= cmdline argument. Used for storing kernel state and
+ userspace memory which is preserved across kexec to support
+ live update of a hypervisor when running guest virtual machines.
+ Select this if you need the ability to persist memory for guest VMs
+ across kexec to do live update.
diff --git a/fs/guestmemfs/Makefile b/fs/guestmemfs/Makefile
new file mode 100644
index 000000000000..6dc820a9d4fe
--- /dev/null
+++ b/fs/guestmemfs/Makefile
@@ -0,0 +1,6 @@
+# SPDX-License-Identifier: GPL-2.0-only
+#
+# Makefile for persistent kernel filesystem
+#
+
+obj-y += guestmemfs.o
diff --git a/fs/guestmemfs/guestmemfs.c b/fs/guestmemfs/guestmemfs.c
new file mode 100644
index 000000000000..3aaada1b8df6
--- /dev/null
+++ b/fs/guestmemfs/guestmemfs.c
@@ -0,0 +1,116 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include "guestmemfs.h"
+#include <linux/dcache.h>
+#include <linux/fs.h>
+#include <linux/module.h>
+#include <linux/fs_context.h>
+#include <linux/io.h>
+#include <linux/memblock.h>
+#include <linux/statfs.h>
+
+static phys_addr_t guestmemfs_base, guestmemfs_size;
+struct guestmemfs_sb *psb;
+
+static int statfs(struct dentry *root, struct kstatfs *buf)
+{
+ simple_statfs(root, buf);
+ buf->f_bsize = PMD_SIZE;
+ buf->f_blocks = guestmemfs_size / PMD_SIZE;
+ buf->f_bfree = buf->f_bavail = buf->f_blocks;
+ return 0;
+}
+
+static const struct super_operations guestmemfs_super_ops = {
+ .statfs = statfs,
+};
+
+static int guestmemfs_fill_super(struct super_block *sb, struct fs_context *fc)
+{
+ struct inode *inode;
+ struct dentry *dentry;
+
+ psb = kzalloc(sizeof(*psb), GFP_KERNEL);
+ /*
+ * Keep a reference to the persistent super block in the
+ * ephemeral super block.
+ */
+ sb->s_fs_info = psb;
+ sb->s_op = &guestmemfs_super_ops;
+
+ inode = new_inode(sb);
+ if (!inode)
+ return -ENOMEM;
+
+ inode->i_ino = 1;
+ inode->i_mode = S_IFDIR;
+ inode->i_op = &simple_dir_inode_operations;
+ inode->i_fop = &simple_dir_operations;
+ simple_inode_init_ts(inode);
+ /* directory inodes start off with i_nlink == 2 (for "." entry) */
+ inc_nlink(inode);
+
+ dentry = d_make_root(inode);
+ if (!dentry)
+ return -ENOMEM;
+ sb->s_root = dentry;
+
+ return 0;
+}
+
+static int guestmemfs_get_tree(struct fs_context *fc)
+{
+ return get_tree_nodev(fc, guestmemfs_fill_super);
+}
+
+static const struct fs_context_operations guestmemfs_context_ops = {
+ .get_tree = guestmemfs_get_tree,
+};
+
+static int guestmemfs_init_fs_context(struct fs_context *const fc)
+{
+ fc->ops = &guestmemfs_context_ops;
+ return 0;
+}
+
+static struct file_system_type guestmemfs_fs_type = {
+ .owner = THIS_MODULE,
+ .name = "guestmemfs",
+ .init_fs_context = guestmemfs_init_fs_context,
+ .kill_sb = kill_litter_super,
+ .fs_flags = FS_USERNS_MOUNT,
+};
+
+static int __init guestmemfs_init(void)
+{
+ int ret;
+
+ ret = register_filesystem(&guestmemfs_fs_type);
+ return ret;
+}
+
+/**
+ * Format: guestmemfs=<size>:<base>
+ * Just like: memmap=nn[KMG]!ss[KMG]
+ */
+static int __init parse_guestmemfs_extents(char *p)
+{
+ guestmemfs_size = memparse(p, &p);
+ return 0;
+}
+
+early_param("guestmemfs", parse_guestmemfs_extents);
+
+void __init guestmemfs_reserve_mem(void)
+{
+ guestmemfs_base = memblock_phys_alloc(guestmemfs_size, 4 << 10);
+ if (guestmemfs_base) {
+ memblock_reserved_mark_noinit(guestmemfs_base, guestmemfs_size);
+ memblock_mark_nomap(guestmemfs_base, guestmemfs_size);
+ } else {
+ pr_warn("Failed to alloc %llu bytes for guestmemfs\n", guestmemfs_size);
+ }
+}
+
+MODULE_ALIAS_FS("guestmemfs");
+module_init(guestmemfs_init);
diff --git a/fs/guestmemfs/guestmemfs.h b/fs/guestmemfs/guestmemfs.h
new file mode 100644
index 000000000000..37d8cf630e0a
--- /dev/null
+++ b/fs/guestmemfs/guestmemfs.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+
+#define pr_fmt(fmt) "guestmemfs: " KBUILD_MODNAME ": " fmt
+
+#include <linux/guestmemfs.h>
+
+struct guestmemfs_sb {
+ /* Will be populated soon... */
+};
diff --git a/include/linux/guestmemfs.h b/include/linux/guestmemfs.h
new file mode 100644
index 000000000000..60e769c8e533
--- /dev/null
+++ b/include/linux/guestmemfs.h
@@ -0,0 +1,16 @@
+/* SPDX-License-Identifier: MIT */
+
+#ifndef _LINUX_GUESTMEMFS_H
+#define _LINUX_GUESTMEMFS_H
+
+/*
+ * Carves out chunks of memory from memblocks for guestmemfs.
+ * Must be called in early boot before memblocks are freed.
+ */
+# ifdef CONFIG_GUESTMEMFS_FS
+void guestmemfs_reserve_mem(void);
+#else
+void guestmemfs_reserve_mem(void) { }
+#endif
+
+#endif
--
2.34.1
* Re: [PATCH 01/10] guestmemfs: Introduce filesystem skeleton
2024-08-05 9:32 ` [PATCH 01/10] guestmemfs: Introduce filesystem skeleton James Gowans
@ 2024-08-05 10:20 ` Christian Brauner
0 siblings, 0 replies; 35+ messages in thread
From: Christian Brauner @ 2024-08-05 10:20 UTC (permalink / raw)
To: James Gowans
Cc: linux-kernel, Sean Christopherson, Paolo Bonzini, Alexander Viro,
Steve Sistare, Jan Kara, Anthony Yznaga, Mike Rapoport,
Andrew Morton, linux-mm, Jason Gunthorpe, linux-fsdevel,
Usama Arif, kvm, Alexander Graf, David Woodhouse, Paul Durrant,
Nicolas Saenz Julienne
(I'm just going to point at a few things but it's by no means a
comprehensive review.)
On Mon, Aug 05, 2024 at 11:32:36AM GMT, James Gowans wrote:
> Add an in-memory filesystem: guestmemfs. Memory is donated to guestmemfs
> by carving it out of the normal System RAM range with the memmap= cmdline
> parameter and then giving that same physical range to guestmemfs with the
> guestmemfs= cmdline parameter.
>
> A new filesystem is added; so far it doesn't do much except persist a
> super block at the start of the donated memory and allows itself to be
> mounted.
>
> A hook to x86 mm init is added to reserve the memory really early on via
> memblock allocator. There is probably a better arch-independent place to
> do this...
>
> Signed-off-by: James Gowans <jgowans@amazon.com>
> ---
> arch/x86/mm/init_64.c | 2 +
> fs/Kconfig | 1 +
> fs/Makefile | 1 +
> fs/guestmemfs/Kconfig | 11 ++++
> fs/guestmemfs/Makefile | 6 ++
> fs/guestmemfs/guestmemfs.c | 116 +++++++++++++++++++++++++++++++++++++
> fs/guestmemfs/guestmemfs.h | 9 +++
> include/linux/guestmemfs.h | 16 +++++
> 8 files changed, 162 insertions(+)
> create mode 100644 fs/guestmemfs/Kconfig
> create mode 100644 fs/guestmemfs/Makefile
> create mode 100644 fs/guestmemfs/guestmemfs.c
> create mode 100644 fs/guestmemfs/guestmemfs.h
> create mode 100644 include/linux/guestmemfs.h
>
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index 8932ba8f5cdd..39fcf017c90c 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -18,6 +18,7 @@
> #include <linux/mm.h>
> #include <linux/swap.h>
> #include <linux/smp.h>
> +#include <linux/guestmemfs.h>
> #include <linux/init.h>
> #include <linux/initrd.h>
> #include <linux/kexec.h>
> @@ -1331,6 +1332,7 @@ static void __init preallocate_vmalloc_pages(void)
>
> void __init mem_init(void)
> {
> + guestmemfs_reserve_mem();
> pci_iommu_alloc();
>
> /* clear_bss() already clear the empty_zero_page */
> diff --git a/fs/Kconfig b/fs/Kconfig
> index a46b0cbc4d8f..727359901da8 100644
> --- a/fs/Kconfig
> +++ b/fs/Kconfig
> @@ -321,6 +321,7 @@ source "fs/befs/Kconfig"
> source "fs/bfs/Kconfig"
> source "fs/efs/Kconfig"
> source "fs/jffs2/Kconfig"
> +source "fs/guestmemfs/Kconfig"
> # UBIFS File system configuration
> source "fs/ubifs/Kconfig"
> source "fs/cramfs/Kconfig"
> diff --git a/fs/Makefile b/fs/Makefile
> index 6ecc9b0a53f2..044524b17d63 100644
> --- a/fs/Makefile
> +++ b/fs/Makefile
> @@ -129,3 +129,4 @@ obj-$(CONFIG_EFIVAR_FS) += efivarfs/
> obj-$(CONFIG_EROFS_FS) += erofs/
> obj-$(CONFIG_VBOXSF_FS) += vboxsf/
> obj-$(CONFIG_ZONEFS_FS) += zonefs/
> +obj-$(CONFIG_GUESTMEMFS_FS) += guestmemfs/
> diff --git a/fs/guestmemfs/Kconfig b/fs/guestmemfs/Kconfig
> new file mode 100644
> index 000000000000..d87fca4822cb
> --- /dev/null
> +++ b/fs/guestmemfs/Kconfig
> @@ -0,0 +1,11 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +
> +config GUESTMEMFS_FS
> + bool "Persistent Guest memory filesystem (guestmemfs)"
> + help
> + An in-memory filesystem on top of reserved memory specified via
> + guestmemfs= cmdline argument. Used for storing kernel state and
> + userspace memory which is preserved across kexec to support
> + live update of a hypervisor when running guest virtual machines.
> + Select this if you need the ability to persist memory for guest VMs
> + across kexec to do live update.
> diff --git a/fs/guestmemfs/Makefile b/fs/guestmemfs/Makefile
> new file mode 100644
> index 000000000000..6dc820a9d4fe
> --- /dev/null
> +++ b/fs/guestmemfs/Makefile
> @@ -0,0 +1,6 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +#
> +# Makefile for persistent kernel filesystem
> +#
> +
> +obj-y += guestmemfs.o
> diff --git a/fs/guestmemfs/guestmemfs.c b/fs/guestmemfs/guestmemfs.c
> new file mode 100644
> index 000000000000..3aaada1b8df6
> --- /dev/null
> +++ b/fs/guestmemfs/guestmemfs.c
> @@ -0,0 +1,116 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +
> +#include "guestmemfs.h"
> +#include <linux/dcache.h>
> +#include <linux/fs.h>
> +#include <linux/module.h>
> +#include <linux/fs_context.h>
> +#include <linux/io.h>
> +#include <linux/memblock.h>
> +#include <linux/statfs.h>
> +
> +static phys_addr_t guestmemfs_base, guestmemfs_size;
> +struct guestmemfs_sb *psb;
> +
> +static int statfs(struct dentry *root, struct kstatfs *buf)
> +{
> + simple_statfs(root, buf);
> + buf->f_bsize = PMD_SIZE;
> + buf->f_blocks = guestmemfs_size / PMD_SIZE;
> + buf->f_bfree = buf->f_bavail = buf->f_blocks;
> + return 0;
> +}
> +
> +static const struct super_operations guestmemfs_super_ops = {
> + .statfs = statfs,
(Please make it a habit to name these things with a consistent prefix.
Doesn't matter if it's wubalubadubdub_statfs() or guestmemfs_statfs() as
far as I'm concerned but just something that is grep-able and local to
your fs.)
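(For illustration only, the rename would be something along these lines, with
the same body as above and just a greppable name:
	static int guestmemfs_statfs(struct dentry *root, struct kstatfs *buf)
	{
		simple_statfs(root, buf);
		buf->f_bsize = PMD_SIZE;
		buf->f_blocks = guestmemfs_size / PMD_SIZE;
		buf->f_bfree = buf->f_bavail = buf->f_blocks;
		return 0;
	}

	static const struct super_operations guestmemfs_super_ops = {
		.statfs = guestmemfs_statfs,
	};
)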
> +};
> +
> +static int guestmemfs_fill_super(struct super_block *sb, struct fs_context *fc)
> +{
> + struct inode *inode;
> + struct dentry *dentry;
> +
> + psb = kzalloc(sizeof(*psb), GFP_KERNEL);
> + /*
> + * Keep a reference to the persistent super block in the
> + * ephemeral super block.
> + */
> + sb->s_fs_info = psb;
> + sb->s_op = &guestmemfs_super_ops;
> +
> + inode = new_inode(sb);
> + if (!inode)
> + return -ENOMEM;
> +
> + inode->i_ino = 1;
> + inode->i_mode = S_IFDIR;
> + inode->i_op = &simple_dir_inode_operations;
> + inode->i_fop = &simple_dir_operations;
> + simple_inode_init_ts(inode);
> + /* directory inodes start off with i_nlink == 2 (for "." entry) */
> + inc_nlink(inode);
> +
> + dentry = d_make_root(inode);
> + if (!dentry)
> + return -ENOMEM;
> + sb->s_root = dentry;
> +
> + return 0;
> +}
> +
> +static int guestmemfs_get_tree(struct fs_context *fc)
> +{
> + return get_tree_nodev(fc, guestmemfs_fill_super);
That makes the filesystem multi-instance so
mount -t guestmemfs guestmemfs /mnt
mount -t guestmemfs guestmemfs /opt
would mount two separate instances of guestmemfs. That is intentional,
right, given that multiple instances would draw memory from the same
reserved memblock?
> +}
> +
> +static const struct fs_context_operations guestmemfs_context_ops = {
> + .get_tree = guestmemfs_get_tree,
> +};
> +
> +static int guestmemfs_init_fs_context(struct fs_context *const fc)
> +{
> + fc->ops = &guestmemfs_context_ops;
> + return 0;
> +}
> +
> +static struct file_system_type guestmemfs_fs_type = {
> + .owner = THIS_MODULE,
> + .name = "guestmemfs",
> + .init_fs_context = guestmemfs_init_fs_context,
> + .kill_sb = kill_litter_super,
> + .fs_flags = FS_USERNS_MOUNT,
This makes the filesystem mountable by unprivileged containers and
therefore unprivileged users. Iiuc, you need a mechanism to prevent a
container from just taking over the whole reserved memory block. Afaict
memblock isn't accounted for in cgroups at all so it'd be good to know
how that would be done. And that should be explained somewhere in the
documentation patch, please.
> +};
> +
> +static int __init guestmemfs_init(void)
> +{
> + int ret;
> +
> + ret = register_filesystem(&guestmemfs_fs_type);
> + return ret;
> +}
> +
> +/**
> + * Format: guestmemfs=<size>:<base>
> + * Just like: memmap=nn[KMG]!ss[KMG]
> + */
> +static int __init parse_guestmemfs_extents(char *p)
> +{
> + guestmemfs_size = memparse(p, &p);
> + return 0;
> +}
> +
> +early_param("guestmemfs", parse_guestmemfs_extents);
> +
> +void __init guestmemfs_reserve_mem(void)
> +{
> + guestmemfs_base = memblock_phys_alloc(guestmemfs_size, 4 << 10);
> + if (guestmemfs_base) {
> + memblock_reserved_mark_noinit(guestmemfs_base, guestmemfs_size);
> + memblock_mark_nomap(guestmemfs_base, guestmemfs_size);
> + } else {
> + pr_warn("Failed to alloc %llu bytes for guestmemfs\n", guestmemfs_size);
> + }
> +}
> +
> +MODULE_ALIAS_FS("guestmemfs");
> +module_init(guestmemfs_init);
> diff --git a/fs/guestmemfs/guestmemfs.h b/fs/guestmemfs/guestmemfs.h
> new file mode 100644
> index 000000000000..37d8cf630e0a
> --- /dev/null
> +++ b/fs/guestmemfs/guestmemfs.h
> @@ -0,0 +1,9 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +
> +#define pr_fmt(fmt) "guestmemfs: " KBUILD_MODNAME ": " fmt
> +
> +#include <linux/guestmemfs.h>
> +
> +struct guestmemfs_sb {
> + /* Will be populated soon... */
> +};
> diff --git a/include/linux/guestmemfs.h b/include/linux/guestmemfs.h
> new file mode 100644
> index 000000000000..60e769c8e533
> --- /dev/null
> +++ b/include/linux/guestmemfs.h
> @@ -0,0 +1,16 @@
> +/* SPDX-License-Identifier: MIT */
> +
> +#ifndef _LINUX_GUESTMEMFS_H
> +#define _LINUX_GUESTMEMFS_H
> +
> +/*
> + * Carves out chunks of memory from memblocks for guestmemfs.
> + * Must be called in early boot before memblocks are freed.
> + */
> +# ifdef CONFIG_GUESTMEMFS_FS
> +void guestmemfs_reserve_mem(void);
> +#else
> +void guestmemfs_reserve_mem(void) { }
> +#endif
> +
> +#endif
> --
> 2.34.1
>
* [PATCH 02/10] guestmemfs: add inode store, files and dirs
2024-08-05 9:32 [PATCH 00/10] Introduce guestmemfs: persistent in-memory filesystem James Gowans
2024-08-05 9:32 ` [PATCH 01/10] guestmemfs: Introduce filesystem skeleton James Gowans
@ 2024-08-05 9:32 ` James Gowans
2024-08-05 9:32 ` [PATCH 03/10] guestmemfs: add persistent data block allocator James Gowans
From: James Gowans @ 2024-08-05 9:32 UTC (permalink / raw)
To: linux-kernel
Cc: James Gowans, Sean Christopherson, Paolo Bonzini, Alexander Viro,
Steve Sistare, Christian Brauner, Jan Kara, Anthony Yznaga,
Mike Rapoport, Andrew Morton, linux-mm, Jason Gunthorpe,
linux-fsdevel, Usama Arif, kvm, Alexander Graf, David Woodhouse,
Paul Durrant, Nicolas Saenz Julienne
Here inodes are added to the filesystem: inodes for both regular files
and directories. This involves supporting the callbacks to create inodes
in a directory, as well as being able to list the contents of a
directory and look up an inode by name.
The inode store is implemented as a 2 MiB page which is an array of
struct guestmemfs_inode. The reason to have a large allocation and put
them all in a big flat array is to make persistence easy: when it's time
to introduce persistence to the filesystem it will need to persist this
one big chunk of inodes across kexec using KHO.
Free inodes in the page form a slab-type structure, the first free inode
pointing to the next free inode, and so on. The super block points to the
first free inode, so allocating involves popping the head, and freeing an
inode involves pushing a new head.
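As a concrete (made-up) example: if the super block's next_free_ino is 4 and
inode 4's sibling_ino is 9, allocating pops inode 4 and sets next_free_ino to
9; freeing inode 4 straight afterwards writes 9 back into its sibling_ino and
sets next_free_ino to 4 again.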
Directories point to the first inode in the directory via a child_ino
reference. Subsequent inodes within the same directory are pointed to
via a sibling_ino member, essentially forming a linked list of inodes
within the directory.
Looking up an inode in a directory involves traversing the sibling_ino
linked list until one with a matching name is found.
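To illustrate the layout (file names and inode numbers are made up), a
directory holding three files forms a chain like:
  dir inode 2:       child_ino = 7
  inode 7 "vm-a":    sibling_ino = 5
  inode 5 "vm-b":    sibling_ino = 9
  inode 9 "vm-c":    sibling_ino = 0  (end of the directory)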
Filesystem stats are updated to account for total and allocated inodes.
Signed-off-by: James Gowans <jgowans@amazon.com>
---
fs/guestmemfs/Makefile | 2 +-
fs/guestmemfs/dir.c | 43 ++++++++++
fs/guestmemfs/guestmemfs.c | 21 ++++-
fs/guestmemfs/guestmemfs.h | 36 +++++++-
fs/guestmemfs/inode.c | 164 +++++++++++++++++++++++++++++++++++++
5 files changed, 260 insertions(+), 6 deletions(-)
create mode 100644 fs/guestmemfs/dir.c
create mode 100644 fs/guestmemfs/inode.c
diff --git a/fs/guestmemfs/Makefile b/fs/guestmemfs/Makefile
index 6dc820a9d4fe..804997799ce8 100644
--- a/fs/guestmemfs/Makefile
+++ b/fs/guestmemfs/Makefile
@@ -3,4 +3,4 @@
# Makefile for persistent kernel filesystem
#
-obj-y += guestmemfs.o
+obj-y += guestmemfs.o inode.o dir.o
diff --git a/fs/guestmemfs/dir.c b/fs/guestmemfs/dir.c
new file mode 100644
index 000000000000..4acd81421c85
--- /dev/null
+++ b/fs/guestmemfs/dir.c
@@ -0,0 +1,43 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include "guestmemfs.h"
+
+static int guestmemfs_dir_iterate(struct file *dir, struct dir_context *ctx)
+{
+ struct guestmemfs_inode *guestmemfs_inode;
+ struct super_block *sb = dir->f_inode->i_sb;
+
+ /* Indication from previous invoke that there's no more to iterate. */
+ if (ctx->pos == -1)
+ return 0;
+
+ if (!dir_emit_dots(dir, ctx))
+ return 0;
+
+ /*
+ * Just emitted this dir; go to dir contents. Use pos to smuggle
+ * the next inode number to emit across iterations.
+ * -1 indicates no valid inode. Can't use 0 because first loop has pos=0
+ */
+ if (ctx->pos == 2) {
+ ctx->pos = guestmemfs_get_persisted_inode(sb, dir->f_inode->i_ino)->child_ino;
+ /* Empty dir case. */
+ if (ctx->pos == 0)
+ ctx->pos = -1;
+ }
+
+ while (ctx->pos > 1) {
+ guestmemfs_inode = guestmemfs_get_persisted_inode(sb, ctx->pos);
+ dir_emit(ctx, guestmemfs_inode->filename, GUESTMEMFS_FILENAME_LEN,
+ ctx->pos, DT_UNKNOWN);
+ ctx->pos = guestmemfs_inode->sibling_ino;
+ if (!ctx->pos)
+ ctx->pos = -1;
+ }
+ return 0;
+}
+
+const struct file_operations guestmemfs_dir_fops = {
+ .owner = THIS_MODULE,
+ .iterate_shared = guestmemfs_dir_iterate,
+};
diff --git a/fs/guestmemfs/guestmemfs.c b/fs/guestmemfs/guestmemfs.c
index 3aaada1b8df6..21cb3490a2bd 100644
--- a/fs/guestmemfs/guestmemfs.c
+++ b/fs/guestmemfs/guestmemfs.c
@@ -18,6 +18,9 @@ static int statfs(struct dentry *root, struct kstatfs *buf)
buf->f_bsize = PMD_SIZE;
buf->f_blocks = guestmemfs_size / PMD_SIZE;
buf->f_bfree = buf->f_bavail = buf->f_blocks;
+ buf->f_files = PMD_SIZE / sizeof(struct guestmemfs_inode);
+ buf->f_ffree = buf->f_files -
+ GUESTMEMFS_PSB(root->d_sb)->allocated_inodes;
return 0;
}
@@ -31,24 +34,34 @@ static int guestmemfs_fill_super(struct super_block *sb, struct fs_context *fc)
struct dentry *dentry;
psb = kzalloc(sizeof(*psb), GFP_KERNEL);
+ psb->inodes = kzalloc(2 << 20, GFP_KERNEL);
+ if (!psb->inodes)
+ return -ENOMEM;
+
/*
* Keep a reference to the persistent super block in the
* ephemeral super block.
*/
sb->s_fs_info = psb;
+ spin_lock_init(&psb->allocation_lock);
+ guestmemfs_initialise_inode_store(sb);
+ guestmemfs_get_persisted_inode(sb, 1)->flags = GUESTMEMFS_INODE_FLAG_DIR;
+ strscpy(guestmemfs_get_persisted_inode(sb, 1)->filename, ".",
+ GUESTMEMFS_FILENAME_LEN);
+ psb->next_free_ino = 2;
+
sb->s_op = &guestmemfs_super_ops;
- inode = new_inode(sb);
+ inode = guestmemfs_inode_get(sb, 1);
if (!inode)
return -ENOMEM;
- inode->i_ino = 1;
inode->i_mode = S_IFDIR;
- inode->i_op = &simple_dir_inode_operations;
- inode->i_fop = &simple_dir_operations;
+ inode->i_fop = &guestmemfs_dir_fops;
simple_inode_init_ts(inode);
/* directory inodes start off with i_nlink == 2 (for "." entry) */
inc_nlink(inode);
+ inode_init_owner(&nop_mnt_idmap, inode, NULL, inode->i_mode);
dentry = d_make_root(inode);
if (!dentry)
diff --git a/fs/guestmemfs/guestmemfs.h b/fs/guestmemfs/guestmemfs.h
index 37d8cf630e0a..3a2954d1beec 100644
--- a/fs/guestmemfs/guestmemfs.h
+++ b/fs/guestmemfs/guestmemfs.h
@@ -3,7 +3,41 @@
#define pr_fmt(fmt) "guestmemfs: " KBUILD_MODNAME ": " fmt
#include <linux/guestmemfs.h>
+#include <linux/fs.h>
+
+#define GUESTMEMFS_FILENAME_LEN 255
+#define GUESTMEMFS_PSB(sb) ((struct guestmemfs_sb *)sb->s_fs_info)
struct guestmemfs_sb {
- /* Will be populated soon... */
+ /* Inode number */
+ unsigned long next_free_ino;
+ unsigned long allocated_inodes;
+ struct guestmemfs_inode *inodes;
+ spinlock_t allocation_lock;
+};
+
+// If neither of these are set the inode is not in use.
+#define GUESTMEMFS_INODE_FLAG_FILE (1 << 0)
+#define GUESTMEMFS_INODE_FLAG_DIR (1 << 1)
+struct guestmemfs_inode {
+ int flags;
+ /*
+ * Points to next inode in the same directory, or
+ * 0 if last file in directory.
+ */
+ unsigned long sibling_ino;
+ /*
+ * If this inode is a directory, this points to the
+ * first inode *in* that directory.
+ */
+ unsigned long child_ino;
+ char filename[GUESTMEMFS_FILENAME_LEN];
+ void *mappings;
+ int num_mappings;
};
+
+void guestmemfs_initialise_inode_store(struct super_block *sb);
+struct inode *guestmemfs_inode_get(struct super_block *sb, unsigned long ino);
+struct guestmemfs_inode *guestmemfs_get_persisted_inode(struct super_block *sb, int ino);
+
+extern const struct file_operations guestmemfs_dir_fops;
diff --git a/fs/guestmemfs/inode.c b/fs/guestmemfs/inode.c
new file mode 100644
index 000000000000..2360c3a4857d
--- /dev/null
+++ b/fs/guestmemfs/inode.c
@@ -0,0 +1,164 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include "guestmemfs.h"
+#include <linux/fs.h>
+
+const struct inode_operations guestmemfs_dir_inode_operations;
+
+struct guestmemfs_inode *guestmemfs_get_persisted_inode(struct super_block *sb, int ino)
+{
+ /*
+ * Inode index starts at 1, so -1 to get memory index.
+ */
+ return GUESTMEMFS_PSB(sb)->inodes + ino - 1;
+}
+
+struct inode *guestmemfs_inode_get(struct super_block *sb, unsigned long ino)
+{
+ struct inode *inode = iget_locked(sb, ino);
+
+ /* If this inode is cached it is already populated; just return */
+ if (!(inode->i_state & I_NEW))
+ return inode;
+ inode->i_op = &guestmemfs_dir_inode_operations;
+ inode->i_sb = sb;
+ inode->i_mode = S_IFREG;
+ unlock_new_inode(inode);
+ return inode;
+}
+
+static unsigned long guestmemfs_allocate_inode(struct super_block *sb)
+{
+
+ unsigned long next_free_ino = -ENOMEM;
+ struct guestmemfs_sb *psb = GUESTMEMFS_PSB(sb);
+
+ spin_lock(&psb->allocation_lock);
+ next_free_ino = psb->next_free_ino;
+ psb->allocated_inodes += 1;
+ if (!next_free_ino)
+ goto out;
+ psb->next_free_ino =
+ guestmemfs_get_persisted_inode(sb, next_free_ino)->sibling_ino;
+out:
+ spin_unlock(&psb->allocation_lock);
+ return next_free_ino;
+}
+
+/*
+ * Zeroes the inode and makes it the head of the free list.
+ */
+static void guestmemfs_free_inode(struct super_block *sb, unsigned long ino)
+{
+ struct guestmemfs_sb *psb = GUESTMEMFS_PSB(sb);
+ struct guestmemfs_inode *inode = guestmemfs_get_persisted_inode(sb, ino);
+
+ spin_lock(&psb->allocation_lock);
+ memset(inode, 0, sizeof(struct guestmemfs_inode));
+ inode->sibling_ino = psb->next_free_ino;
+ psb->next_free_ino = ino;
+ psb->allocated_inodes -= 1;
+ spin_unlock(&psb->allocation_lock);
+}
+
+/*
+ * Sets all inodes as free and points each free inode to the next one.
+ */
+void guestmemfs_initialise_inode_store(struct super_block *sb)
+{
+ /* Inode store is a PMD sized (ie: 2 MiB) page */
+ memset(guestmemfs_get_persisted_inode(sb, 1), 0, PMD_SIZE);
+ /* Point each inode for the next one; linked-list initialisation. */
+ for (unsigned long ino = 2; ino * sizeof(struct guestmemfs_inode) < PMD_SIZE; ino++)
+ guestmemfs_get_persisted_inode(sb, ino - 1)->sibling_ino = ino;
+}
+
+static int guestmemfs_create(struct mnt_idmap *id, struct inode *dir,
+ struct dentry *dentry, umode_t mode, bool excl)
+{
+ unsigned long free_inode;
+ struct guestmemfs_inode *guestmemfs_inode;
+ struct inode *vfs_inode;
+
+ free_inode = guestmemfs_allocate_inode(dir->i_sb);
+ if (free_inode <= 0)
+ return -ENOMEM;
+
+ guestmemfs_inode = guestmemfs_get_persisted_inode(dir->i_sb, free_inode);
+ guestmemfs_inode->sibling_ino =
+ guestmemfs_get_persisted_inode(dir->i_sb, dir->i_ino)->child_ino;
+ guestmemfs_get_persisted_inode(dir->i_sb, dir->i_ino)->child_ino = free_inode;
+ strscpy(guestmemfs_inode->filename, dentry->d_name.name, GUESTMEMFS_FILENAME_LEN);
+ guestmemfs_inode->flags = GUESTMEMFS_INODE_FLAG_FILE;
+ /* TODO: make dynamic */
+ guestmemfs_inode->mappings = kzalloc(PAGE_SIZE, GFP_KERNEL);
+
+ vfs_inode = guestmemfs_inode_get(dir->i_sb, free_inode);
+ d_instantiate(dentry, vfs_inode);
+ return 0;
+}
+
+static struct dentry *guestmemfs_lookup(struct inode *dir,
+ struct dentry *dentry,
+ unsigned int flags)
+{
+ struct guestmemfs_inode *guestmemfs_inode;
+ unsigned long ino;
+
+ guestmemfs_inode = guestmemfs_get_persisted_inode(dir->i_sb, dir->i_ino);
+ ino = guestmemfs_inode->child_ino;
+ while (ino) {
+ guestmemfs_inode = guestmemfs_get_persisted_inode(dir->i_sb, ino);
+ if (!strncmp(guestmemfs_inode->filename,
+ dentry->d_name.name,
+ GUESTMEMFS_FILENAME_LEN)) {
+ d_add(dentry, guestmemfs_inode_get(dir->i_sb, ino));
+ break;
+ }
+ ino = guestmemfs_inode->sibling_ino;
+ }
+ return NULL;
+}
+
+static int guestmemfs_unlink(struct inode *dir, struct dentry *dentry)
+{
+ unsigned long ino;
+ struct guestmemfs_inode *inode;
+
+ ino = guestmemfs_get_persisted_inode(dir->i_sb, dir->i_ino)->child_ino;
+
+ /* Special case for first file in dir */
+ if (ino == dentry->d_inode->i_ino) {
+ guestmemfs_get_persisted_inode(dir->i_sb, dir->i_ino)->child_ino =
+ guestmemfs_get_persisted_inode(dir->i_sb,
+ dentry->d_inode->i_ino)->sibling_ino;
+ guestmemfs_free_inode(dir->i_sb, ino);
+ return 0;
+ }
+
+ /*
+ * Although we know exactly the inode to free, because we maintain only
+ * a singly linked list we need to scan for it to find the previous
+ * element so it's "next" pointer can be updated.
+ */
+ while (ino) {
+ inode = guestmemfs_get_persisted_inode(dir->i_sb, ino);
+ /* We've found the one pointing to the one we want to delete */
+ if (inode->sibling_ino == dentry->d_inode->i_ino) {
+ inode->sibling_ino =
+ guestmemfs_get_persisted_inode(dir->i_sb,
+ dentry->d_inode->i_ino)->sibling_ino;
+ guestmemfs_free_inode(dir->i_sb, dentry->d_inode->i_ino);
+ break;
+ }
+ ino = guestmemfs_get_persisted_inode(dir->i_sb, ino)->sibling_ino;
+ }
+
+ return 0;
+}
+
+const struct inode_operations guestmemfs_dir_inode_operations = {
+ .create = guestmemfs_create,
+ .lookup = guestmemfs_lookup,
+ .unlink = guestmemfs_unlink,
+};
--
2.34.1
* [PATCH 03/10] guestmemfs: add persistent data block allocator
2024-08-05 9:32 [PATCH 00/10] Introduce guestmemfs: persistent in-memory filesystem James Gowans
2024-08-05 9:32 ` [PATCH 01/10] guestmemfs: Introduce filesystem skeleton James Gowans
2024-08-05 9:32 ` [PATCH 02/10] guestmemfs: add inode store, files and dirs James Gowans
@ 2024-08-05 9:32 ` James Gowans
2024-08-05 9:32 ` [PATCH 04/10] guestmemfs: support file truncation James Gowans
From: James Gowans @ 2024-08-05 9:32 UTC (permalink / raw)
To: linux-kernel
Cc: James Gowans, Sean Christopherson, Paolo Bonzini, Alexander Viro,
Steve Sistare, Christian Brauner, Jan Kara, Anthony Yznaga,
Mike Rapoport, Andrew Morton, linux-mm, Jason Gunthorpe,
linux-fsdevel, Usama Arif, kvm, Alexander Graf, David Woodhouse,
Paul Durrant, Nicolas Saenz Julienne
In order to assign backing data memory to files there needs to be the
ability to allocate blocks of data from the large contiguous reserved
memory block of filesystem memory. Here an allocator is added to serve
that purpose. For now it's a simple bitmap allocator: each bit
corresponds to a 2 MiB chunk in the filesystem data block.
On initialisation the bitmap is allocated for a fixed size (TODO: make
this dynamic based on filesystem memory size). Allocating a block
involves finding and setting the next free bit.
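Since each bit represents a 2 MiB chunk, block index N handed out by the
allocator corresponds to the physical range starting at
guestmemfs_base + N * 2 MiB; this is how the mmap path in a later patch
converts block indexes to PFNs. As a made-up example, a 4 GiB reservation
would need 2048 bits of bitmap.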
Allocations will be done in the next commit which adds support for
truncating files.
It's quite limiting having a fixed size bitmap, and we perhaps want to
look at making this a dynamic and potentially large allocation early in
boot using the memblock allocator. It may also turn out that a simple
bitmap is too limiting and something with more metadata is needed.
Signed-off-by: James Gowans <jgowans@amazon.com>
---
fs/guestmemfs/Makefile | 2 +-
fs/guestmemfs/allocator.c | 40 ++++++++++++++++++++++++++++++++++++++
fs/guestmemfs/guestmemfs.c | 4 ++++
fs/guestmemfs/guestmemfs.h | 3 +++
4 files changed, 48 insertions(+), 1 deletion(-)
create mode 100644 fs/guestmemfs/allocator.c
diff --git a/fs/guestmemfs/Makefile b/fs/guestmemfs/Makefile
index 804997799ce8..b357073a60f3 100644
--- a/fs/guestmemfs/Makefile
+++ b/fs/guestmemfs/Makefile
@@ -3,4 +3,4 @@
# Makefile for persistent kernel filesystem
#
-obj-y += guestmemfs.o inode.o dir.o
+obj-y += guestmemfs.o inode.o dir.o allocator.o
diff --git a/fs/guestmemfs/allocator.c b/fs/guestmemfs/allocator.c
new file mode 100644
index 000000000000..3da14d11b60f
--- /dev/null
+++ b/fs/guestmemfs/allocator.c
@@ -0,0 +1,40 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include "guestmemfs.h"
+
+/**
+ * For allocating blocks from the guestmemfs filesystem.
+ */
+
+static void *guestmemfs_allocations_bitmap(struct super_block *sb)
+{
+ return GUESTMEMFS_PSB(sb)->allocator_bitmap;
+}
+
+void guestmemfs_zero_allocations(struct super_block *sb)
+{
+ memset(guestmemfs_allocations_bitmap(sb), 0, (1 << 20));
+}
+
+/*
+ * Allocs one 2 MiB block, and returns the block index.
+ * Index is 2 MiB chunk index.
+ * Negative error code if unable to alloc.
+ */
+long guestmemfs_alloc_block(struct super_block *sb)
+{
+ unsigned long free_bit;
+ void *allocations_mem = guestmemfs_allocations_bitmap(sb);
+
+ free_bit = bitmap_find_next_zero_area(allocations_mem,
+ (1 << 20), /* Size */
+ 0, /* Start */
+ 1, /* Number of zeroed bits to look for */
+ 0); /* Alignment mask - none required. */
+
+ if (free_bit >= PMD_SIZE / 2)
+ return -ENOMEM;
+
+ bitmap_set(allocations_mem, free_bit, 1);
+ return free_bit;
+}
diff --git a/fs/guestmemfs/guestmemfs.c b/fs/guestmemfs/guestmemfs.c
index 21cb3490a2bd..c45c796c497a 100644
--- a/fs/guestmemfs/guestmemfs.c
+++ b/fs/guestmemfs/guestmemfs.c
@@ -37,6 +37,9 @@ static int guestmemfs_fill_super(struct super_block *sb, struct fs_context *fc)
psb->inodes = kzalloc(2 << 20, GFP_KERNEL);
if (!psb->inodes)
return -ENOMEM;
+ psb->allocator_bitmap = kzalloc(1 << 20, GFP_KERNEL);
+ if (!psb->allocator_bitmap)
+ return -ENOMEM;
/*
* Keep a reference to the persistent super block in the
@@ -45,6 +48,7 @@ static int guestmemfs_fill_super(struct super_block *sb, struct fs_context *fc)
sb->s_fs_info = psb;
spin_lock_init(&psb->allocation_lock);
guestmemfs_initialise_inode_store(sb);
+ guestmemfs_zero_allocations(sb);
guestmemfs_get_persisted_inode(sb, 1)->flags = GUESTMEMFS_INODE_FLAG_DIR;
strscpy(guestmemfs_get_persisted_inode(sb, 1)->filename, ".",
GUESTMEMFS_FILENAME_LEN);
diff --git a/fs/guestmemfs/guestmemfs.h b/fs/guestmemfs/guestmemfs.h
index 3a2954d1beec..af9832390be3 100644
--- a/fs/guestmemfs/guestmemfs.h
+++ b/fs/guestmemfs/guestmemfs.h
@@ -13,6 +13,7 @@ struct guestmemfs_sb {
unsigned long next_free_ino;
unsigned long allocated_inodes;
struct guestmemfs_inode *inodes;
+ void *allocator_bitmap;
spinlock_t allocation_lock;
};
@@ -37,6 +38,8 @@ struct guestmemfs_inode {
};
void guestmemfs_initialise_inode_store(struct super_block *sb);
+void guestmemfs_zero_allocations(struct super_block *sb);
+long guestmemfs_alloc_block(struct super_block *sb);
struct inode *guestmemfs_inode_get(struct super_block *sb, unsigned long ino);
struct guestmemfs_inode *guestmemfs_get_persisted_inode(struct super_block *sb, int ino);
--
2.34.1
* [PATCH 04/10] guestmemfs: support file truncation
2024-08-05 9:32 [PATCH 00/10] Introduce guestmemfs: persistent in-memory filesystem James Gowans
` (2 preceding siblings ...)
2024-08-05 9:32 ` [PATCH 03/10] guestmemfs: add persistent data block allocator James Gowans
@ 2024-08-05 9:32 ` James Gowans
2024-08-05 9:32 ` [PATCH 05/10] guestmemfs: add file mmap callback James Gowans
From: James Gowans @ 2024-08-05 9:32 UTC (permalink / raw)
To: linux-kernel
Cc: James Gowans, Sean Christopherson, Paolo Bonzini, Alexander Viro,
Steve Sistare, Christian Brauner, Jan Kara, Anthony Yznaga,
Mike Rapoport, Andrew Morton, linux-mm, Jason Gunthorpe,
linux-fsdevel, Usama Arif, kvm, Alexander Graf, David Woodhouse,
Paul Durrant, Nicolas Saenz Julienne
In a previous commit a block allocator was added. Now use that block
allocator to allocate blocks for files when ftruncate is run on them.
To do that an inode_operations is added on the file inodes with a setattr
callback handling the ATTR_SIZE attribute. When this is invoked, blocks
are allocated and their indexes are put into a mappings block.
The mappings block is an array with the index being the file offset
block and the value at that index being the guestmemfs data block backing
that file offset.
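As an illustrative example: truncating a new file to 5 MiB allocates three
2 MiB blocks, so num_mappings becomes 3 and mappings[0..2] hold the data
block indexes backing file offsets 0, 2 MiB and 4 MiB respectively.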
Signed-off-by: James Gowans <jgowans@amazon.com>
---
fs/guestmemfs/Makefile | 2 +-
fs/guestmemfs/file.c | 52 ++++++++++++++++++++++++++++++++++++++
fs/guestmemfs/guestmemfs.h | 2 ++
fs/guestmemfs/inode.c | 25 +++++++++++++++---
4 files changed, 77 insertions(+), 4 deletions(-)
create mode 100644 fs/guestmemfs/file.c
diff --git a/fs/guestmemfs/Makefile b/fs/guestmemfs/Makefile
index b357073a60f3..e93e43ba274b 100644
--- a/fs/guestmemfs/Makefile
+++ b/fs/guestmemfs/Makefile
@@ -3,4 +3,4 @@
# Makefile for persistent kernel filesystem
#
-obj-y += guestmemfs.o inode.o dir.o allocator.o
+obj-y += guestmemfs.o inode.o dir.o allocator.o file.o
diff --git a/fs/guestmemfs/file.c b/fs/guestmemfs/file.c
new file mode 100644
index 000000000000..618c93b12196
--- /dev/null
+++ b/fs/guestmemfs/file.c
@@ -0,0 +1,52 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include "guestmemfs.h"
+
+static int truncate(struct inode *inode, loff_t newsize)
+{
+ unsigned long free_block;
+ struct guestmemfs_inode *guestmemfs_inode;
+ unsigned long *mappings;
+
+ guestmemfs_inode = guestmemfs_get_persisted_inode(inode->i_sb, inode->i_ino);
+ mappings = guestmemfs_inode->mappings;
+ i_size_write(inode, newsize);
+ for (int block_idx = 0; block_idx * PMD_SIZE < newsize; ++block_idx) {
+ free_block = guestmemfs_alloc_block(inode->i_sb);
+ if (free_block < 0)
+ /* TODO: roll back allocations. */
+ return -ENOMEM;
+ *(mappings + block_idx) = free_block;
+ ++guestmemfs_inode->num_mappings;
+ }
+ return 0;
+}
+
+static int inode_setattr(struct mnt_idmap *idmap, struct dentry *dentry, struct iattr *iattr)
+{
+ struct inode *inode = dentry->d_inode;
+ int error;
+
+ error = setattr_prepare(idmap, dentry, iattr);
+ if (error)
+ return error;
+
+ if (iattr->ia_valid & ATTR_SIZE) {
+ error = truncate(inode, iattr->ia_size);
+ if (error)
+ return error;
+ }
+ setattr_copy(idmap, inode, iattr);
+ mark_inode_dirty(inode);
+ return 0;
+}
+
+const struct inode_operations guestmemfs_file_inode_operations = {
+ .setattr = inode_setattr,
+ .getattr = simple_getattr,
+};
+
+const struct file_operations guestmemfs_file_fops = {
+ .owner = THIS_MODULE,
+ .iterate_shared = NULL,
+};
diff --git a/fs/guestmemfs/guestmemfs.h b/fs/guestmemfs/guestmemfs.h
index af9832390be3..7ea03ac8ecca 100644
--- a/fs/guestmemfs/guestmemfs.h
+++ b/fs/guestmemfs/guestmemfs.h
@@ -44,3 +44,5 @@ struct inode *guestmemfs_inode_get(struct super_block *sb, unsigned long ino);
struct guestmemfs_inode *guestmemfs_get_persisted_inode(struct super_block *sb, int ino);
extern const struct file_operations guestmemfs_dir_fops;
+extern const struct file_operations guestmemfs_file_fops;
+extern const struct inode_operations guestmemfs_file_inode_operations;
diff --git a/fs/guestmemfs/inode.c b/fs/guestmemfs/inode.c
index 2360c3a4857d..61f70441d82c 100644
--- a/fs/guestmemfs/inode.c
+++ b/fs/guestmemfs/inode.c
@@ -15,14 +15,28 @@ struct guestmemfs_inode *guestmemfs_get_persisted_inode(struct super_block *sb,
struct inode *guestmemfs_inode_get(struct super_block *sb, unsigned long ino)
{
+ struct guestmemfs_inode *guestmemfs_inode;
struct inode *inode = iget_locked(sb, ino);
/* If this inode is cached it is already populated; just return */
if (!(inode->i_state & I_NEW))
return inode;
- inode->i_op = &guestmemfs_dir_inode_operations;
+ guestmemfs_inode = guestmemfs_get_persisted_inode(sb, ino);
inode->i_sb = sb;
- inode->i_mode = S_IFREG;
+
+ if (guestmemfs_inode->flags & GUESTMEMFS_INODE_FLAG_DIR) {
+ inode->i_op = &guestmemfs_dir_inode_operations;
+ inode->i_mode = S_IFDIR;
+ } else {
+ inode->i_op = &guestmemfs_file_inode_operations;
+ inode->i_mode = S_IFREG;
+ inode->i_fop = &guestmemfs_file_fops;
+ inode->i_size = guestmemfs_inode->num_mappings * PMD_SIZE;
+ }
+
+ set_nlink(inode, 1);
+
+ /* Switch based on file type */
unlock_new_inode(inode);
return inode;
}
@@ -103,6 +117,7 @@ static struct dentry *guestmemfs_lookup(struct inode *dir,
unsigned int flags)
{
struct guestmemfs_inode *guestmemfs_inode;
+ struct inode *vfs_inode;
unsigned long ino;
guestmemfs_inode = guestmemfs_get_persisted_inode(dir->i_sb, dir->i_ino);
@@ -112,7 +127,10 @@ static struct dentry *guestmemfs_lookup(struct inode *dir,
if (!strncmp(guestmemfs_inode->filename,
dentry->d_name.name,
GUESTMEMFS_FILENAME_LEN)) {
- d_add(dentry, guestmemfs_inode_get(dir->i_sb, ino));
+ vfs_inode = guestmemfs_inode_get(dir->i_sb, ino);
+ mark_inode_dirty(dir);
+ inode_update_timestamps(vfs_inode, S_ATIME);
+ d_add(dentry, vfs_inode);
break;
}
ino = guestmemfs_inode->sibling_ino;
@@ -162,3 +180,4 @@ const struct inode_operations guestmemfs_dir_inode_operations = {
.lookup = guestmemfs_lookup,
.unlink = guestmemfs_unlink,
};
+
--
2.34.1
* [PATCH 05/10] guestmemfs: add file mmap callback
2024-08-05 9:32 [PATCH 00/10] Introduce guestmemfs: persistent in-memory filesystem James Gowans
` (3 preceding siblings ...)
2024-08-05 9:32 ` [PATCH 04/10] guestmemfs: support file truncation James Gowans
@ 2024-08-05 9:32 ` James Gowans
2024-10-29 23:05 ` Elliot Berman
2024-08-05 9:32 ` [PATCH 06/10] kexec/kho: Add addr flag to not initialise memory James Gowans
From: James Gowans @ 2024-08-05 9:32 UTC (permalink / raw)
To: linux-kernel
Cc: James Gowans, Sean Christopherson, Paolo Bonzini, Alexander Viro,
Steve Sistare, Christian Brauner, Jan Kara, Anthony Yznaga,
Mike Rapoport, Andrew Morton, linux-mm, Jason Gunthorpe,
linux-fsdevel, Usama Arif, kvm, Alexander Graf, David Woodhouse,
Paul Durrant, Nicolas Saenz Julienne
Make the file data usable to userspace by adding mmap. That's all that
QEMU needs for guest RAM, so that's all we bother implementing for now.
When mmapping the file the VMA is marked as PFNMAP to indicate that there
are no struct pages for the memory in this VMA. remap_pfn_range() is
used to actually populate the page tables. All PTEs are pre-faulted into
the pgtables at mmap time so that the pgtables are usable when this
virtual address range is given to VFIO's MAP_DMA.
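To make the intended usage concrete, here is a minimal userspace sketch
(mount point and sizes are made up, error handling omitted) exercising the
truncate support from the previous patch and this mmap callback:
  #include <fcntl.h>
  #include <stddef.h>
  #include <sys/mman.h>
  #include <unistd.h>

  int main(void)
  {
  	size_t size = 1UL << 30;	/* 1 GiB, i.e. 512 x 2 MiB blocks */
  	int fd = open("/mnt/guestmemfs/vm1-ram", O_CREAT | O_RDWR, 0600);

  	/* Truncating assigns 2 MiB backing blocks to the file. */
  	ftruncate(fd, size);

  	/*
  	 * mmap pre-faults every PTE, so the range can immediately be
  	 * handed to KVM as guest RAM or to VFIO's MAP_DMA.
  	 */
  	void *ram = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED,
  			 fd, 0);

  	return ram == MAP_FAILED;
  }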
Signed-off-by: James Gowans <jgowans@amazon.com>
---
fs/guestmemfs/file.c | 43 +++++++++++++++++++++++++++++++++++++-
fs/guestmemfs/guestmemfs.c | 2 +-
fs/guestmemfs/guestmemfs.h | 3 +++
3 files changed, 46 insertions(+), 2 deletions(-)
diff --git a/fs/guestmemfs/file.c b/fs/guestmemfs/file.c
index 618c93b12196..b1a52abcde65 100644
--- a/fs/guestmemfs/file.c
+++ b/fs/guestmemfs/file.c
@@ -1,6 +1,7 @@
// SPDX-License-Identifier: GPL-2.0-only
#include "guestmemfs.h"
+#include <linux/mm.h>
static int truncate(struct inode *inode, loff_t newsize)
{
@@ -41,6 +42,46 @@ static int inode_setattr(struct mnt_idmap *idmap, struct dentry *dentry, struct
return 0;
}
+/*
+ * To be able to use PFNMAP VMAs for VFIO DMA mapping we need the page tables
+ * populated with mappings. Pre-fault everything.
+ */
+static int mmap(struct file *filp, struct vm_area_struct *vma)
+{
+ int rc;
+ unsigned long *mappings_block;
+ struct guestmemfs_inode *guestmemfs_inode;
+
+ guestmemfs_inode = guestmemfs_get_persisted_inode(filp->f_inode->i_sb,
+ filp->f_inode->i_ino);
+
+ mappings_block = guestmemfs_inode->mappings;
+
+ /* Remap-pfn-range will mark the range VM_IO */
+ for (unsigned long vma_addr_offset = vma->vm_start;
+ vma_addr_offset < vma->vm_end;
+ vma_addr_offset += PMD_SIZE) {
+ int block, mapped_block;
+ unsigned long map_size = min(PMD_SIZE, vma->vm_end - vma_addr_offset);
+
+ block = (vma_addr_offset - vma->vm_start) / PMD_SIZE;
+ mapped_block = *(mappings_block + block);
+ /*
+ * It's wrong to use rempa_pfn_range; this will install PTE-level entries.
+ * The whole point of 2 MiB allocs is to improve TLB perf!
+ * We should use something like mm/huge_memory.c#insert_pfn_pmd
+ * but that is currently static.
+ * TODO: figure out the best way to install PMDs.
+ */
+ rc = remap_pfn_range(vma,
+ vma_addr_offset,
+ (guestmemfs_base >> PAGE_SHIFT) + (mapped_block * 512),
+ map_size,
+ vma->vm_page_prot);
+ }
+ return 0;
+}
+
const struct inode_operations guestmemfs_file_inode_operations = {
.setattr = inode_setattr,
.getattr = simple_getattr,
@@ -48,5 +89,5 @@ const struct inode_operations guestmemfs_file_inode_operations = {
const struct file_operations guestmemfs_file_fops = {
.owner = THIS_MODULE,
- .iterate_shared = NULL,
+ .mmap = mmap,
};
diff --git a/fs/guestmemfs/guestmemfs.c b/fs/guestmemfs/guestmemfs.c
index c45c796c497a..38f20ad25286 100644
--- a/fs/guestmemfs/guestmemfs.c
+++ b/fs/guestmemfs/guestmemfs.c
@@ -9,7 +9,7 @@
#include <linux/memblock.h>
#include <linux/statfs.h>
-static phys_addr_t guestmemfs_base, guestmemfs_size;
+phys_addr_t guestmemfs_base, guestmemfs_size;
struct guestmemfs_sb *psb;
static int statfs(struct dentry *root, struct kstatfs *buf)
diff --git a/fs/guestmemfs/guestmemfs.h b/fs/guestmemfs/guestmemfs.h
index 7ea03ac8ecca..0f2788ce740e 100644
--- a/fs/guestmemfs/guestmemfs.h
+++ b/fs/guestmemfs/guestmemfs.h
@@ -8,6 +8,9 @@
#define GUESTMEMFS_FILENAME_LEN 255
#define GUESTMEMFS_PSB(sb) ((struct guestmemfs_sb *)sb->s_fs_info)
+/* Units of bytes */
+extern phys_addr_t guestmemfs_base, guestmemfs_size;
+
struct guestmemfs_sb {
/* Inode number */
unsigned long next_free_ino;
--
2.34.1
* Re: [PATCH 05/10] guestmemfs: add file mmap callback
2024-08-05 9:32 ` [PATCH 05/10] guestmemfs: add file mmap callback James Gowans
@ 2024-10-29 23:05 ` Elliot Berman
2024-10-30 22:18 ` Frank van der Linden
2024-10-31 15:30 ` Gowans, James
From: Elliot Berman @ 2024-10-29 23:05 UTC (permalink / raw)
To: James Gowans
Cc: linux-kernel, Sean Christopherson, Paolo Bonzini, Alexander Viro,
Steve Sistare, Christian Brauner, Jan Kara, Anthony Yznaga,
Mike Rapoport, Andrew Morton, linux-mm, Jason Gunthorpe,
linux-fsdevel, Usama Arif, kvm, Alexander Graf, David Woodhouse,
Paul Durrant, Nicolas Saenz Julienne
On Mon, Aug 05, 2024 at 11:32:40AM +0200, James Gowans wrote:
> Make the file data usable to userspace by adding mmap. That's all that
> QEMU needs for guest RAM, so that's all be bother implementing for now.
>
> When mmaping the file the VMA is marked as PFNMAP to indicate that there
> are no struct pages for the memory in this VMA. Remap_pfn_range() is
> used to actually populate the page tables. All PTEs are pre-faulted into
> the pgtables at mmap time so that the pgtables are usable when this
> virtual address range is given to VFIO's MAP_DMA.
Thanks for sending this out! I'm going through the series with the
intention to see how it might fit within the existing guest_memfd work
for pKVM/CoCo/Gunyah.
It might've been mentioned in the MM alignment session -- you might be
interested to join the guest_memfd bi-weekly call to see how we are
overlapping [1].
[1]: https://lore.kernel.org/kvm/ae794891-fe69-411a-b82e-6963b594a62a@redhat.com/T/
---
Was the decision to pre-fault everything because it was convenient to do
or otherwise intentionally different from hugetlb?
>
> Signed-off-by: James Gowans <jgowans@amazon.com>
> ---
> fs/guestmemfs/file.c | 43 +++++++++++++++++++++++++++++++++++++-
> fs/guestmemfs/guestmemfs.c | 2 +-
> fs/guestmemfs/guestmemfs.h | 3 +++
> 3 files changed, 46 insertions(+), 2 deletions(-)
>
> diff --git a/fs/guestmemfs/file.c b/fs/guestmemfs/file.c
> index 618c93b12196..b1a52abcde65 100644
> --- a/fs/guestmemfs/file.c
> +++ b/fs/guestmemfs/file.c
> @@ -1,6 +1,7 @@
> // SPDX-License-Identifier: GPL-2.0-only
>
> #include "guestmemfs.h"
> +#include <linux/mm.h>
>
> static int truncate(struct inode *inode, loff_t newsize)
> {
> @@ -41,6 +42,46 @@ static int inode_setattr(struct mnt_idmap *idmap, struct dentry *dentry, struct
> return 0;
> }
>
> +/*
> + * To be able to use PFNMAP VMAs for VFIO DMA mapping we need the page tables
> + * populated with mappings. Pre-fault everything.
> + */
> +static int mmap(struct file *filp, struct vm_area_struct *vma)
> +{
> + int rc;
> + unsigned long *mappings_block;
> + struct guestmemfs_inode *guestmemfs_inode;
> +
> + guestmemfs_inode = guestmemfs_get_persisted_inode(filp->f_inode->i_sb,
> + filp->f_inode->i_ino);
> +
> + mappings_block = guestmemfs_inode->mappings;
> +
> + /* Remap-pfn-range will mark the range VM_IO */
> + for (unsigned long vma_addr_offset = vma->vm_start;
> + vma_addr_offset < vma->vm_end;
> + vma_addr_offset += PMD_SIZE) {
> + int block, mapped_block;
> + unsigned long map_size = min(PMD_SIZE, vma->vm_end - vma_addr_offset);
> +
> + block = (vma_addr_offset - vma->vm_start) / PMD_SIZE;
> + mapped_block = *(mappings_block + block);
> + /*
> + * It's wrong to use rempa_pfn_range; this will install PTE-level entries.
> + * The whole point of 2 MiB allocs is to improve TLB perf!
> + * We should use something like mm/huge_memory.c#insert_pfn_pmd
> + * but that is currently static.
> + * TODO: figure out the best way to install PMDs.
> + */
> + rc = remap_pfn_range(vma,
> + vma_addr_offset,
> + (guestmemfs_base >> PAGE_SHIFT) + (mapped_block * 512),
> + map_size,
> + vma->vm_page_prot);
> + }
> + return 0;
> +}
> +
> const struct inode_operations guestmemfs_file_inode_operations = {
> .setattr = inode_setattr,
> .getattr = simple_getattr,
> @@ -48,5 +89,5 @@ const struct inode_operations guestmemfs_file_inode_operations = {
>
> const struct file_operations guestmemfs_file_fops = {
> .owner = THIS_MODULE,
> - .iterate_shared = NULL,
> + .mmap = mmap,
> };
> diff --git a/fs/guestmemfs/guestmemfs.c b/fs/guestmemfs/guestmemfs.c
> index c45c796c497a..38f20ad25286 100644
> --- a/fs/guestmemfs/guestmemfs.c
> +++ b/fs/guestmemfs/guestmemfs.c
> @@ -9,7 +9,7 @@
> #include <linux/memblock.h>
> #include <linux/statfs.h>
>
> -static phys_addr_t guestmemfs_base, guestmemfs_size;
> +phys_addr_t guestmemfs_base, guestmemfs_size;
> struct guestmemfs_sb *psb;
>
> static int statfs(struct dentry *root, struct kstatfs *buf)
> diff --git a/fs/guestmemfs/guestmemfs.h b/fs/guestmemfs/guestmemfs.h
> index 7ea03ac8ecca..0f2788ce740e 100644
> --- a/fs/guestmemfs/guestmemfs.h
> +++ b/fs/guestmemfs/guestmemfs.h
> @@ -8,6 +8,9 @@
> #define GUESTMEMFS_FILENAME_LEN 255
> #define GUESTMEMFS_PSB(sb) ((struct guestmemfs_sb *)sb->s_fs_info)
>
> +/* Units of bytes */
> +extern phys_addr_t guestmemfs_base, guestmemfs_size;
> +
> struct guestmemfs_sb {
> /* Inode number */
> unsigned long next_free_ino;
> --
> 2.34.1
>
>
* Re: [PATCH 05/10] guestmemfs: add file mmap callback
2024-10-29 23:05 ` Elliot Berman
@ 2024-10-30 22:18 ` Frank van der Linden
2024-11-01 12:55 ` Gowans, James
2024-10-31 15:30 ` Gowans, James
From: Frank van der Linden @ 2024-10-30 22:18 UTC (permalink / raw)
To: Elliot Berman
Cc: James Gowans, linux-kernel, Sean Christopherson, Paolo Bonzini,
Alexander Viro, Steve Sistare, Christian Brauner, Jan Kara,
Anthony Yznaga, Mike Rapoport, Andrew Morton, linux-mm,
Jason Gunthorpe, linux-fsdevel, Usama Arif, kvm, Alexander Graf,
David Woodhouse, Paul Durrant, Nicolas Saenz Julienne
On Tue, Oct 29, 2024 at 4:06 PM Elliot Berman <quic_eberman@quicinc.com> wrote:
>
> On Mon, Aug 05, 2024 at 11:32:40AM +0200, James Gowans wrote:
> > Make the file data usable to userspace by adding mmap. That's all that
> > QEMU needs for guest RAM, so that's all be bother implementing for now.
> >
> > When mmaping the file the VMA is marked as PFNMAP to indicate that there
> > are no struct pages for the memory in this VMA. Remap_pfn_range() is
> > used to actually populate the page tables. All PTEs are pre-faulted into
> > the pgtables at mmap time so that the pgtables are usable when this
> > virtual address range is given to VFIO's MAP_DMA.
>
> Thanks for sending this out! I'm going through the series with the
> intention to see how it might fit within the existing guest_memfd work
> for pKVM/CoCo/Gunyah.
>
> It might've been mentioned in the MM alignment session -- you might be
> interested to join the guest_memfd bi-weekly call to see how we are
> overlapping [1].
>
> [1]: https://lore.kernel.org/kvm/ae794891-fe69-411a-b82e-6963b594a62a@redhat.com/T/
>
> ---
>
> Was the decision to pre-fault everything because it was convenient to do
> or otherwise intentionally different from hugetlb?
>
It's memory that is placed outside of page allocator control, or
even outside of System RAM - VM_PFNMAP only. So you don't have much of
a choice.
In general, for things like guest memory or persistent memory, even if
struct pages were available, it doesn't seem all that useful to adhere
to the !MAP_POPULATE standard, why go through any faults to begin
with?
For guest_memfd: as I understand it, it's folio-based. And this is
VM_PFNMAP memory without struct pages / folios. So the main task there
is probably to teach guest_memfd about VM_PFNMAP memory. That would be
great, since it then ties in guest_memfd with external guest memory.
- Frank
* Re: [PATCH 05/10] guestmemfs: add file mmap callback
2024-10-30 22:18 ` Frank van der Linden
@ 2024-11-01 12:55 ` Gowans, James
From: Gowans, James @ 2024-11-01 12:55 UTC (permalink / raw)
To: quic_eberman@quicinc.com, fvdl@google.com
Cc: kvm@vger.kernel.org, rppt@kernel.org, brauner@kernel.org,
Graf (AWS), Alexander, anthony.yznaga@oracle.com,
steven.sistare@oracle.com, akpm@linux-foundation.org,
linux-kernel@vger.kernel.org, seanjc@google.com, Woodhouse, David,
pbonzini@redhat.com, linux-mm@kvack.org, Saenz Julienne, Nicolas,
Durrant, Paul, viro@zeniv.linux.org.uk, jack@suse.cz,
linux-fsdevel@vger.kernel.org, jgg@ziepe.ca
On Wed, 2024-10-30 at 15:18 -0700, Frank van der Linden wrote:
> On Tue, Oct 29, 2024 at 4:06 PM Elliot Berman <quic_eberman@quicinc.com> wrote:
> >
> > On Mon, Aug 05, 2024 at 11:32:40AM +0200, James Gowans wrote:
> > > Make the file data usable to userspace by adding mmap. That's all that
> > > QEMU needs for guest RAM, so that's all be bother implementing for now.
> > >
> > > When mmaping the file the VMA is marked as PFNMAP to indicate that there
> > > are no struct pages for the memory in this VMA. Remap_pfn_range() is
> > > used to actually populate the page tables. All PTEs are pre-faulted into
> > > the pgtables at mmap time so that the pgtables are usable when this
> > > virtual address range is given to VFIO's MAP_DMA.
> >
> > Thanks for sending this out! I'm going through the series with the
> > intention to see how it might fit within the existing guest_memfd work
> > for pKVM/CoCo/Gunyah.
> >
> > It might've been mentioned in the MM alignment session -- you might be
> > interested to join the guest_memfd bi-weekly call to see how we are
> > overlapping [1].
> >
> > [1]: https://lore.kernel.org/kvm/ae794891-fe69-411a-b82e-6963b594a62a@redhat.com/T/
> >
> > ---
> >
> > Was the decision to pre-fault everything because it was convenient to do
> > or otherwise intentionally different from hugetlb?
> >
>
> It's memory that is placed outside of of page allocator control, or
> even outside of System RAM - VM_PFNMAP only. So you don't have much of
> a choice..
>
> In general, for things like guest memory or persistent memory, even if
> struct pages were available, it doesn't seem all that useful to adhere
> to the !MAP_POPULATE standard, why go through any faults to begin
> with?
>
> For guest_memfd: as I understand it, it's folio-based. And this is
> VM_PFNMAP memory without struct pages / folios. So the main task there
> is probably to teach guest_memfd about VM_PFNMAP memory. That would be
> great, since it then ties in guest_memfd with external guest memory.
Exactly - I think all of the comments on this series are heading in a
similar direction: let's add a custom reserved (PFNMAP) persistent
memory allocator behind guest_memfd and expose that as a filesystem.
This is what the next version of the patch series will do.
JG
* Re: [PATCH 05/10] guestmemfs: add file mmap callback
2024-10-29 23:05 ` Elliot Berman
2024-10-30 22:18 ` Frank van der Linden
@ 2024-10-31 15:30 ` Gowans, James
2024-10-31 16:06 ` Jason Gunthorpe
From: Gowans, James @ 2024-10-31 15:30 UTC (permalink / raw)
To: quic_eberman@quicinc.com
Cc: kvm@vger.kernel.org, rppt@kernel.org, brauner@kernel.org,
Graf (AWS), Alexander, anthony.yznaga@oracle.com,
steven.sistare@oracle.com, akpm@linux-foundation.org,
linux-kernel@vger.kernel.org, seanjc@google.com, Woodhouse, David,
pbonzini@redhat.com, linux-mm@kvack.org, Saenz Julienne, Nicolas,
Durrant, Paul, viro@zeniv.linux.org.uk, jack@suse.cz,
linux-fsdevel@vger.kernel.org, jgg@ziepe.ca,
usama.arif@bytedance.com
On Tue, 2024-10-29 at 16:05 -0700, Elliot Berman wrote:
> On Mon, Aug 05, 2024 at 11:32:40AM +0200, James Gowans wrote:
> > Make the file data usable to userspace by adding mmap. That's all that
> > QEMU needs for guest RAM, so that's all we bother implementing for now.
> >
> > When mmaping the file the VMA is marked as PFNMAP to indicate that there
> > are no struct pages for the memory in this VMA. Remap_pfn_range() is
> > used to actually populate the page tables. All PTEs are pre-faulted into
> > the pgtables at mmap time so that the pgtables are usable when this
> > virtual address range is given to VFIO's MAP_DMA.
>
> Thanks for sending this out! I'm going through the series with the
> intention to see how it might fit within the existing guest_memfd work
> for pKVM/CoCo/Gunyah.
>
> It might've been mentioned in the MM alignment session -- you might be
> interested to join the guest_memfd bi-weekly call to see how we are
> overlapping [1].
>
> [1]: https://lore.kernel.org/kvm/ae794891-fe69-411a-b82e-6963b594a62a@redhat.com/T/
Hi Elliot, yes, I think that there is a lot more overlap with
guest_memfd necessary here. The idea was to extend guestmemfs at some
point to have a guest_memfd style interface, but it was pointed out at
the MM alignment call that doing so would require guestmemfs to
duplicate the API surface of guest_memfd. This is undesirable. Better
would be to have persistence implemented as a custom allocator behind a
normal guest_memfd. I'm not too sure how this would be actually done in
practice, specifically:
- how the persistent pool would be defined
- how it would be supplied to guest_memfd
- how the guest_memfds would be re-discovered after kexec
But assuming we can figure out some way to do this, I think it's a
better way to go.
I'll join the guest_memfd call shortly to see the developments there and
where persistence would fit best.
Hopefully we can figure out in theory how this could work, then I'll put
together another RFC sketching it out.
JG
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 05/10] guestmemfs: add file mmap callback
2024-10-31 15:30 ` Gowans, James
@ 2024-10-31 16:06 ` Jason Gunthorpe
2024-11-01 13:01 ` Gowans, James
0 siblings, 1 reply; 35+ messages in thread
From: Jason Gunthorpe @ 2024-10-31 16:06 UTC (permalink / raw)
To: Gowans, James
Cc: quic_eberman@quicinc.com, kvm@vger.kernel.org, rppt@kernel.org,
brauner@kernel.org, Graf (AWS), Alexander,
anthony.yznaga@oracle.com, steven.sistare@oracle.com,
akpm@linux-foundation.org, linux-kernel@vger.kernel.org,
seanjc@google.com, Woodhouse, David, pbonzini@redhat.com,
linux-mm@kvack.org, Saenz Julienne, Nicolas, Durrant, Paul,
viro@zeniv.linux.org.uk, jack@suse.cz,
linux-fsdevel@vger.kernel.org, usama.arif@bytedance.com
On Thu, Oct 31, 2024 at 03:30:59PM +0000, Gowans, James wrote:
> On Tue, 2024-10-29 at 16:05 -0700, Elliot Berman wrote:
> > On Mon, Aug 05, 2024 at 11:32:40AM +0200, James Gowans wrote:
> > > Make the file data usable to userspace by adding mmap. That's all that
> > > QEMU needs for guest RAM, so that's all we bother implementing for now.
> > >
> > > When mmaping the file the VMA is marked as PFNMAP to indicate that there
> > > are no struct pages for the memory in this VMA. Remap_pfn_range() is
> > > used to actually populate the page tables. All PTEs are pre-faulted into
> > > the pgtables at mmap time so that the pgtables are usable when this
> > > virtual address range is given to VFIO's MAP_DMA.
> >
> > Thanks for sending this out! I'm going through the series with the
> > intention to see how it might fit within the existing guest_memfd work
> > for pKVM/CoCo/Gunyah.
> >
> > It might've been mentioned in the MM alignment session -- you might be
> > interested to join the guest_memfd bi-weekly call to see how we are
> > overlapping [1].
> >
> > [1]: https://lore.kernel.org/kvm/ae794891-fe69-411a-b82e-6963b594a62a@redhat.com/T/
>
> Hi Elliot, yes, I think that there is a lot more overlap with
> guest_memfd necessary here. The idea was to extend guestmemfs at some
> point to have a guest_memfd style interface, but it was pointed out at
> the MM alignment call that doing so would require guestmemfs to
> duplicate the API surface of guest_memfd. This is undesirable. Better
> would be to have persistence implemented as a custom allocator behind a
> normal guest_memfd. I'm not too sure how this would be actually done in
> practice, specifically:
> - how the persistent pool would be defined
> - how it would be supplied to guest_memfd
> - how the guest_memfds would be re-discovered after kexec
> But assuming we can figure out some way to do this, I think it's a
> better way to go.
I think the filesystem interface seemed reasonable, you just want
open() on the filesystem to return back a normal guest_memfd and
re-use all of that code to implement it.
When opened through the filesystem guest_memfd would get hooked by the
KHO stuff to manage its memory, somehow.
Really KHO just needs to keep track of the addresses in the
guest_memfd when it serializes, right? So maybe all it needs is a way
to freeze the guest_memfd so its memory map doesn't change anymore,
then a way to extract the addresses from it for serialization?
Jason
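A rough sketch of that freeze-and-record flow, for illustration only:
guest_memfd_freeze() and guest_memfd_for_each_range() are hypothetical
placeholder names for interfaces that do not exist yet, while struct
kho_mem and the fdt helpers are the ones used by the KHO patches in this
thread.

/*
 * Illustration only: freeze the memfd so its physical layout can no
 * longer change, then record each backing range for KHO to preserve.
 */
static int kho_record_guest_memfd(struct file *memfd, void *fdt)
{
	u64 addr, len;
	int err;

	err = guest_memfd_freeze(memfd);		/* hypothetical */
	if (err)
		return err;

	err = fdt_begin_node(fdt, "guest_memfd");
	guest_memfd_for_each_range(memfd, addr, len) {	/* hypothetical iterator */
		struct kho_mem mem = { .addr = addr, .len = len };

		err |= fdt_property(fdt, "mem", &mem, sizeof(mem));
	}
	err |= fdt_end_node(fdt);

	return err;
}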
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 05/10] guestmemfs: add file mmap callback
2024-10-31 16:06 ` Jason Gunthorpe
@ 2024-11-01 13:01 ` Gowans, James
2024-11-01 13:42 ` Jason Gunthorpe
0 siblings, 1 reply; 35+ messages in thread
From: Gowans, James @ 2024-11-01 13:01 UTC (permalink / raw)
To: jgg@ziepe.ca
Cc: quic_eberman@quicinc.com, kvm@vger.kernel.org, rppt@kernel.org,
brauner@kernel.org, anthony.yznaga@oracle.com,
steven.sistare@oracle.com, akpm@linux-foundation.org,
linux-kernel@vger.kernel.org, seanjc@google.com, Durrant, Paul,
pbonzini@redhat.com, linux-mm@kvack.org, Woodhouse, David,
Saenz Julienne, Nicolas, viro@zeniv.linux.org.uk,
Graf (AWS), Alexander, jack@suse.cz,
linux-fsdevel@vger.kernel.org
On Thu, 2024-10-31 at 13:06 -0300, Jason Gunthorpe wrote:
> On Thu, Oct 31, 2024 at 03:30:59PM +0000, Gowans, James wrote:
> > On Tue, 2024-10-29 at 16:05 -0700, Elliot Berman wrote:
> > > On Mon, Aug 05, 2024 at 11:32:40AM +0200, James Gowans wrote:
> > > > Make the file data usable to userspace by adding mmap. That's all that
> > > > QEMU needs for guest RAM, so that's all we bother implementing for now.
> > > >
> > > > When mmaping the file the VMA is marked as PFNMAP to indicate that there
> > > > are no struct pages for the memory in this VMA. Remap_pfn_range() is
> > > > used to actually populate the page tables. All PTEs are pre-faulted into
> > > > the pgtables at mmap time so that the pgtables are usable when this
> > > > virtual address range is given to VFIO's MAP_DMA.
> > >
> > > Thanks for sending this out! I'm going through the series with the
> > > intention to see how it might fit within the existing guest_memfd work
> > > for pKVM/CoCo/Gunyah.
> > >
> > > It might've been mentioned in the MM alignment session -- you might be
> > > interested to join the guest_memfd bi-weekly call to see how we are
> > > overlapping [1].
> > >
> > > [1]: https://lore.kernel.org/kvm/ae794891-fe69-411a-b82e-6963b594a62a@redhat.com/T/
> >
> > Hi Elliot, yes, I think that there is a lot more overlap with
> > guest_memfd necessary here. The idea was to extend guestmemfs at some
> > point to have a guest_memfd style interface, but it was pointed out at
> > the MM alignment call that doing so would require guestmemfs to
> > duplicate the API surface of guest_memfd. This is undesirable. Better
> > would be to have persistence implemented as a custom allocator behind a
> > normal guest_memfd. I'm not too sure how this would be actually done in
> > practice, specifically:
> > - how the persistent pool would be defined
> > - how it would be supplied to guest_memfd
> > - how the guest_memfds would be re-discovered after kexec
> > But assuming we can figure out some way to do this, I think it's a
> > better way to go.
>
> I think the filesystem interface seemed reasonable, you just want
> open() on the filesystem to return back a normal guest_memfd and
> re-use all of that code to implement it.
>
> When opened through the filesystem guest_memfd would get hooked by the
> KHO stuff to manage its memory, somehow.
>
> Really KHO just needs to keep track of the addresses in the
> guest_memfd when it serializes, right? So maybe all it needs is a way
> to freeze the guest_memfd so its memory map doesn't change anymore,
> then a way to extract the addresses from it for serialization?
Thanks Jason, that sounds perfect. I'll work on the next rev which will:
- expose a filesystem which owns reserved/persistent memory, just like
this patch.
- rebased on top of the patches which pull out the guest_memfd code into
a library
- rebased on top of the guest_memfd patches which supports adding a
different backing allocator (hugetlbfs) to guest_memfd
- when a file in guestmemfs is opened, create a guest_memfd object from
the guest_memfd library code and set guestmemfs as the custom allocator
for the file.
- serialise and re-hydrate the guest_memfds which have been created in
guestmemfs on kexec via KHO.
The main difference is that opening a guestmemfs file won't give a
regular file, rather it will give a guest_memfd library object. This
gives good code re-use with the guest_memfd library and avoids needing
to re-implement the guest_memfd API surface here.
Sounds like a great path forward. :-)
JG
>
> Jason
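A minimal sketch of the open()-time hook described above, for illustration
only: it assumes a guest_memfd library that exposes an attach helper and a
custom allocator ops table. Every name below is a placeholder assumption,
not an existing kernel symbol.

/*
 * Sketch: back a guestmemfs inode with a guest_memfd library object so
 * userspace gets the guest_memfd API surface while guestmemfs only
 * supplies the persistent allocator.
 */
static const struct guest_memfd_allocator_ops guestmemfs_gmem_ops = {	/* hypothetical */
	.alloc	= guestmemfs_gmem_alloc,	/* hand out persistent PMD/PUD-sized blocks */
	.free	= guestmemfs_gmem_free,
};

static int guestmemfs_open(struct inode *inode, struct file *file)
{
	return guest_memfd_attach(file, inode, &guestmemfs_gmem_ops);	/* hypothetical */
}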
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 05/10] guestmemfs: add file mmap callback
2024-11-01 13:01 ` Gowans, James
@ 2024-11-01 13:42 ` Jason Gunthorpe
2024-11-02 8:24 ` Gowans, James
2024-11-04 10:49 ` Mike Rapoport
0 siblings, 2 replies; 35+ messages in thread
From: Jason Gunthorpe @ 2024-11-01 13:42 UTC (permalink / raw)
To: Gowans, James
Cc: quic_eberman@quicinc.com, kvm@vger.kernel.org, rppt@kernel.org,
brauner@kernel.org, anthony.yznaga@oracle.com,
steven.sistare@oracle.com, akpm@linux-foundation.org,
linux-kernel@vger.kernel.org, seanjc@google.com, Durrant, Paul,
pbonzini@redhat.com, linux-mm@kvack.org, Woodhouse, David,
Saenz Julienne, Nicolas, viro@zeniv.linux.org.uk,
Graf (AWS), Alexander, jack@suse.cz,
linux-fsdevel@vger.kernel.org
On Fri, Nov 01, 2024 at 01:01:00PM +0000, Gowans, James wrote:
> Thanks Jason, that sounds perfect. I'll work on the next rev which will:
> - expose a filesystem which owns reserved/persistent memory, just like
> this patch.
Is this step needed?
If the guest memfd is already told to get 1G pages in some normal way,
why do we need a dedicated pool just for the KHO filesystem?
Back to my suggestion, can't KHO simply freeze the guest memfd and
then extract the memory layout, and just use the normal allocator?
Or do you have a hard requirement that only KHO allocated memory can
be preserved across kexec?
Jason
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 05/10] guestmemfs: add file mmap callback
2024-11-01 13:42 ` Jason Gunthorpe
@ 2024-11-02 8:24 ` Gowans, James
2024-11-04 11:11 ` Mike Rapoport
2024-11-04 14:39 ` Jason Gunthorpe
2024-11-04 10:49 ` Mike Rapoport
1 sibling, 2 replies; 35+ messages in thread
From: Gowans, James @ 2024-11-02 8:24 UTC (permalink / raw)
To: jgg@ziepe.ca
Cc: quic_eberman@quicinc.com, kvm@vger.kernel.org, rppt@kernel.org,
brauner@kernel.org, anthony.yznaga@oracle.com,
steven.sistare@oracle.com, akpm@linux-foundation.org,
linux-kernel@vger.kernel.org, Durrant, Paul, Woodhouse, David,
pbonzini@redhat.com, seanjc@google.com, linux-mm@kvack.org,
Saenz Julienne, Nicolas, Graf (AWS), Alexander,
viro@zeniv.linux.org.uk, jack@suse.cz,
linux-fsdevel@vger.kernel.org
On Fri, 2024-11-01 at 10:42 -0300, Jason Gunthorpe wrote:
>
> On Fri, Nov 01, 2024 at 01:01:00PM +0000, Gowans, James wrote:
>
> > Thanks Jason, that sounds perfect. I'll work on the next rev which will:
> > - expose a filesystem which owns reserved/persistent memory, just like
> > this patch.
>
> Is this step needed?
>
> If the guest memfd is already told to get 1G pages in some normal way,
> why do we need a dedicated pool just for the KHO filesystem?
>
> Back to my suggestion, can't KHO simply freeze the guest memfd and
> then extract the memory layout, and just use the normal allocator?
>
> Or do you have a hard requirement that only KHO allocated memory can
> be preserved across kexec?
KHO can persist any memory ranges which are not MOVABLE. Provided that
guest_memfd does non-movable allocations then serialising and persisting
should be possible.
There are other requirements here, specifically the ability to be
*guaranteed* GiB-level allocations, have the guest memory out of the
direct map for secret hiding, and remove the struct page overhead.
Struct page overhead could be handled via HVO. But considering that the
memory must be out of the direct map it seems unnecessary to have struct
pages, and unnecessary to have it managed by an existing allocator. The
only existing 1 GiB allocator I know of is hugetlbfs? Let me know if
there's something else that can be used.
That's the main motivation for a separate pool allocated at early boot.
This is quite similar to hugetlbfs, so a natural question is whether we
could use and serialise hugetlbfs instead, but that probably opens
another can of worms of complexity.
There's more than just the guest_memfds and their allocations to
serialise; it's probably useful to be able to have a directory structure
in the filesystem, POSIX file ACLs, and perhaps some other filesystem
metadata. For this reason I still think that having a new filesystem
designed for this use-case which creates guest_memfd objects when files
are opened is the way to go.
Let me know what you think.
JG
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 05/10] guestmemfs: add file mmap callback
2024-11-02 8:24 ` Gowans, James
@ 2024-11-04 11:11 ` Mike Rapoport
2024-11-04 14:39 ` Jason Gunthorpe
1 sibling, 0 replies; 35+ messages in thread
From: Mike Rapoport @ 2024-11-04 11:11 UTC (permalink / raw)
To: Gowans, James
Cc: jgg@ziepe.ca, quic_eberman@quicinc.com, kvm@vger.kernel.org,
brauner@kernel.org, anthony.yznaga@oracle.com,
steven.sistare@oracle.com, akpm@linux-foundation.org,
linux-kernel@vger.kernel.org, Durrant, Paul, Woodhouse, David,
pbonzini@redhat.com, seanjc@google.com, linux-mm@kvack.org,
Saenz Julienne, Nicolas, Graf (AWS), Alexander,
viro@zeniv.linux.org.uk, jack@suse.cz,
linux-fsdevel@vger.kernel.org
On Sat, Nov 02, 2024 at 08:24:15AM +0000, Gowans, James wrote:
> On Fri, 2024-11-01 at 10:42 -0300, Jason Gunthorpe wrote:
> >
> > On Fri, Nov 01, 2024 at 01:01:00PM +0000, Gowans, James wrote:
> >
> > > Thanks Jason, that sounds perfect. I'll work on the next rev which will:
> > > - expose a filesystem which owns reserved/persistent memory, just like
> > > this patch.
> >
> > Is this step needed?
> >
> > If the guest memfd is already told to get 1G pages in some normal way,
> > why do we need a dedicated pool just for the KHO filesystem?
> >
> > Back to my suggestion, can't KHO simply freeze the guest memfd and
> > then extract the memory layout, and just use the normal allocator?
> >
> > Or do you have a hard requirement that only KHO allocated memory can
> > be preserved across kexec?
>
> KHO can persist any memory ranges which are not MOVABLE. Provided that
> guest_memfd does non-movable allocations then serialising and persisting
> should be possible.
>
> There are other requirements here, specifically the ability to be
> *guaranteed* GiB-level allocations, have the guest memory out of the
> direct map for secret hiding, and remove the struct page overhead.
> Struct page overhead could be handled via HVO. But considering that the
> memory must be out of the direct map it seems unnecessary to have struct
> pages, and unnecessary to have it managed by an existing allocator.
Having memory out of the direct map does not preclude manipulations of struct
page unless that memory is completely out of the kernel's control (e.g.
excluded by mem=X), and this is not necessarily the case even for VM hosts.
It's not necessary to manage the memory using an existing allocator,
but I think a specialized allocator should not be a part of guestmemfs.
> JG
--
Sincerely yours,
Mike.
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 05/10] guestmemfs: add file mmap callback
2024-11-02 8:24 ` Gowans, James
2024-11-04 11:11 ` Mike Rapoport
@ 2024-11-04 14:39 ` Jason Gunthorpe
1 sibling, 0 replies; 35+ messages in thread
From: Jason Gunthorpe @ 2024-11-04 14:39 UTC (permalink / raw)
To: Gowans, James
Cc: quic_eberman@quicinc.com, kvm@vger.kernel.org, rppt@kernel.org,
brauner@kernel.org, anthony.yznaga@oracle.com,
steven.sistare@oracle.com, akpm@linux-foundation.org,
linux-kernel@vger.kernel.org, Durrant, Paul, Woodhouse, David,
pbonzini@redhat.com, seanjc@google.com, linux-mm@kvack.org,
Saenz Julienne, Nicolas, Graf (AWS), Alexander,
viro@zeniv.linux.org.uk, jack@suse.cz,
linux-fsdevel@vger.kernel.org
On Sat, Nov 02, 2024 at 08:24:15AM +0000, Gowans, James wrote:
> KHO can persist any memory ranges which are not MOVABLE. Provided that
> guest_memfd does non-movable allocations then serialising and persisting
> should be possible.
>
> There are other requirements here, specifically the ability to be
> *guaranteed* GiB-level allocations, have the guest memory out of the
> direct map for secret hiding, and remove the struct page overhead.
> Struct page overhead could be handled via HVO.
IMHO this should all be handled as part of normal guestmemfd operation
because it has nothing to do with KHO. Many others have asked for the
same things in guest memfd already.
So I would start by assuming guest memfd will get those things
eventually and design around a 'freeze and record' model for KHO of a
guestmemfd, instead of yet another special memory allocator..
Jason
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 05/10] guestmemfs: add file mmap callback
2024-11-01 13:42 ` Jason Gunthorpe
2024-11-02 8:24 ` Gowans, James
@ 2024-11-04 10:49 ` Mike Rapoport
1 sibling, 0 replies; 35+ messages in thread
From: Mike Rapoport @ 2024-11-04 10:49 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Gowans, James, quic_eberman@quicinc.com, kvm@vger.kernel.org,
brauner@kernel.org, anthony.yznaga@oracle.com,
steven.sistare@oracle.com, akpm@linux-foundation.org,
linux-kernel@vger.kernel.org, seanjc@google.com, Durrant, Paul,
pbonzini@redhat.com, linux-mm@kvack.org, Woodhouse, David,
Saenz Julienne, Nicolas, viro@zeniv.linux.org.uk,
Graf (AWS), Alexander, jack@suse.cz,
linux-fsdevel@vger.kernel.org
On Fri, Nov 01, 2024 at 10:42:02AM -0300, Jason Gunthorpe wrote:
> On Fri, Nov 01, 2024 at 01:01:00PM +0000, Gowans, James wrote:
>
> > Thanks Jason, that sounds perfect. I'll work on the next rev which will:
> > - expose a filesystem which owns reserved/persistent memory, just like
> > this patch.
>
> Is this step needed?
>
> If the guest memfd is already told to get 1G pages in some normal way,
> why do we need a dedicated pool just for the KHO filesystem?
>
> Back to my suggestion, can't KHO simply freeze the guest memfd and
> then extract the memory layout, and just use the normal allocator?
>
> Or do you have a hard requirement that only KHO allocated memory can
> be preserved across kexec?
KHO does not allocate memory, it gets the ranges to preserve, makes sure
they are not overwritten during kexec and can be retrieved by the second
kernel.
For KHO it does not matter if the memory comes from a normal or a special
allocator.
> Jason
--
Sincerely yours,
Mike.
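To make that concrete, a minimal sketch of a KHO client using only the
interfaces that the patches in this thread already rely on
(register_kho_notifier(), struct kho_mem, KEXEC_KHO_DUMP, the fdt
helpers); my_base/my_size stand in for whatever range the driver wants
preserved - KHO itself does not allocate it.

static phys_addr_t my_base;
static u64 my_size;

static int my_kho_notifier(struct notifier_block *nb, unsigned long cmd,
			   void *fdt)
{
	struct kho_mem mem = { .addr = my_base, .len = my_size };
	int err = 0;

	if (cmd != KEXEC_KHO_DUMP)
		return NOTIFY_DONE;

	/* Tell KHO which physical range must survive the kexec. */
	err |= fdt_begin_node(fdt, "my-driver");
	err |= fdt_property(fdt, "mem", &mem, sizeof(mem));
	err |= fdt_end_node(fdt);

	return err ? NOTIFY_BAD : NOTIFY_DONE;
}

static struct notifier_block my_kho_nb = {
	.notifier_call = my_kho_notifier,
};
/* registered once at driver init with register_kho_notifier(&my_kho_nb) */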
^ permalink raw reply [flat|nested] 35+ messages in thread
* [PATCH 06/10] kexec/kho: Add addr flag to not initialise memory
2024-08-05 9:32 [PATCH 00/10] Introduce guestmemfs: persistent in-memory filesystem James Gowans
` (4 preceding siblings ...)
2024-08-05 9:32 ` [PATCH 05/10] guestmemfs: add file mmap callback James Gowans
@ 2024-08-05 9:32 ` James Gowans
2024-08-05 9:32 ` [PATCH 07/10] guestmemfs: Persist filesystem metadata via KHO James Gowans
` (7 subsequent siblings)
13 siblings, 0 replies; 35+ messages in thread
From: James Gowans @ 2024-08-05 9:32 UTC (permalink / raw)
To: linux-kernel
Cc: James Gowans, Sean Christopherson, Paolo Bonzini, Alexander Viro,
Steve Sistare, Christian Brauner, Jan Kara, Anthony Yznaga,
Mike Rapoport, Andrew Morton, linux-mm, Jason Gunthorpe,
linux-fsdevel, Usama Arif, kvm, Alexander Graf, David Woodhouse,
Paul Durrant, Nicolas Saenz Julienne
Smuggle a flag in the address field. If set, the memory region being
reserved via KHO will be marked as no-init in memblock so it will not
get struct pages, will not be given to the buddy allocator and will not
be part of the direct map.
This allows drivers to pass memory ranges which the driver has allocated
itself from memblock, independent of the kernel's mm and struct-page-based
memory management.
Signed-off-by: James Gowans <jgowans@amazon.com>
---
include/uapi/linux/kexec.h | 6 ++++++
kernel/kexec_kho_in.c | 12 +++++++++++-
kernel/kexec_kho_out.c | 4 ++++
3 files changed, 21 insertions(+), 1 deletion(-)
diff --git a/include/uapi/linux/kexec.h b/include/uapi/linux/kexec.h
index ad9e95b88b34..1c031a261c2c 100644
--- a/include/uapi/linux/kexec.h
+++ b/include/uapi/linux/kexec.h
@@ -52,6 +52,12 @@
/* KHO passes an array of kho_mem as "mem cache" to the new kernel */
struct kho_mem {
+ /*
+ * Use the last bits for flags; addrs should be at least word
+ * aligned.
+ */
+#define KHO_MEM_ADDR_FLAG_NOINIT BIT(0)
+#define KHO_MEM_ADDR_FLAG_MASK (BIT(1) - 1)
__u64 addr;
__u64 len;
};
diff --git a/kernel/kexec_kho_in.c b/kernel/kexec_kho_in.c
index 5f8e0d9f9e12..943d9483b009 100644
--- a/kernel/kexec_kho_in.c
+++ b/kernel/kexec_kho_in.c
@@ -75,6 +75,11 @@ __init void kho_populate_refcount(void)
*/
for (offset = 0; offset < mem_len; offset += sizeof(struct kho_mem)) {
struct kho_mem *mem = mem_virt + offset;
+
+ /* No struct pages for this region; nothing to claim. */
+ if (mem->addr & KHO_MEM_ADDR_FLAG_NOINIT)
+ continue;
+
u64 start_pfn = PFN_DOWN(mem->addr);
u64 end_pfn = PFN_UP(mem->addr + mem->len);
u64 pfn;
@@ -183,8 +188,13 @@ void __init kho_reserve_previous_mem(void)
/* Then populate all preserved memory areas as reserved */
for (off = 0; off < mem_len; off += sizeof(struct kho_mem)) {
struct kho_mem *mem = mem_virt + off;
+ __u64 addr = mem->addr & ~KHO_MEM_ADDR_FLAG_MASK;
- memblock_reserve(mem->addr, mem->len);
+ memblock_reserve(addr, mem->len);
+ if (mem->addr & KHO_MEM_ADDR_FLAG_NOINIT) {
+ memblock_reserved_mark_noinit(addr, mem->len);
+ memblock_mark_nomap(addr, mem->len);
+ }
}
/* Unreserve the mem cache - we don't need it from here on */
diff --git a/kernel/kexec_kho_out.c b/kernel/kexec_kho_out.c
index 2cf5755f5e4a..4d9da501c5dc 100644
--- a/kernel/kexec_kho_out.c
+++ b/kernel/kexec_kho_out.c
@@ -175,6 +175,10 @@ static int kho_alloc_mem_cache(struct kimage *image, void *fdt)
const struct kho_mem *mem = &mems[i];
ulong mstart = PAGE_ALIGN_DOWN(mem->addr);
ulong mend = PAGE_ALIGN(mem->addr + mem->len);
+
+ /* Re-apply flags lost during round down. */
+ mstart |= mem->addr & KHO_MEM_ADDR_FLAG_MASK;
+
struct kho_mem cmem = {
.addr = mstart,
.len = (mend - mstart),
--
2.34.1
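As a standalone illustration (not part of the series) of how the flag
rides in the low bit of a page-aligned address, using only the two macros
this patch defines:

static void kho_mem_flag_demo(void)
{
	__u64 addr   = 0x180000000ULL;			/* page-aligned physical address */
	__u64 tagged = addr | KHO_MEM_ADDR_FLAG_NOINIT;	/* mark the range as no-init */

	/* The flag is recoverable, and masking it off restores the address. */
	WARN_ON(!(tagged & KHO_MEM_ADDR_FLAG_NOINIT));
	WARN_ON((tagged & ~KHO_MEM_ADDR_FLAG_MASK) != addr);
}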
^ permalink raw reply related [flat|nested] 35+ messages in thread
* [PATCH 07/10] guestmemfs: Persist filesystem metadata via KHO
2024-08-05 9:32 [PATCH 00/10] Introduce guestmemfs: persistent in-memory filesystem James Gowans
` (5 preceding siblings ...)
2024-08-05 9:32 ` [PATCH 06/10] kexec/kho: Add addr flag to not initialise memory James Gowans
@ 2024-08-05 9:32 ` James Gowans
2024-08-05 9:32 ` [PATCH 08/10] guestmemfs: Block modifications when serialised James Gowans
` (6 subsequent siblings)
13 siblings, 0 replies; 35+ messages in thread
From: James Gowans @ 2024-08-05 9:32 UTC (permalink / raw)
To: linux-kernel
Cc: James Gowans, Sean Christopherson, Paolo Bonzini, Alexander Viro,
Steve Sistare, Christian Brauner, Jan Kara, Anthony Yznaga,
Mike Rapoport, Andrew Morton, linux-mm, Jason Gunthorpe,
linux-fsdevel, Usama Arif, kvm, Alexander Graf, David Woodhouse,
Paul Durrant, Nicolas Saenz Julienne
Filesystem metadata consists of: physical memory extents, superblock,
inodes block and allocation bitmap. Here serialisation and
deserialisation of all of these are done via the KHO framework.
A serialisation callback is added which is run when KHO activate is
triggered. This creates the device tree blob for the metadata and marks
the memory as persistent via struct kho_mem(s).
When the filesystem is mounted it attempts to re-hydrate metadata from
KHO. Only if this fails (on first boot, for example) does it allocate
fresh metadata pages.
The private data struct is switched from holding a reference to the
persistent superblock to referencing the regular struct super_block.
This is necessary for the serialisation code. Better would be to be able
to define callback private data, if that were possible.
Signed-off-by: James Gowans <jgowans@amazon.com>
---
fs/guestmemfs/Makefile | 2 +
fs/guestmemfs/guestmemfs.c | 72 ++++++---
fs/guestmemfs/guestmemfs.h | 8 +
fs/guestmemfs/serialise.c | 296 +++++++++++++++++++++++++++++++++++++
4 files changed, 355 insertions(+), 23 deletions(-)
create mode 100644 fs/guestmemfs/serialise.c
diff --git a/fs/guestmemfs/Makefile b/fs/guestmemfs/Makefile
index e93e43ba274b..8b95cac34564 100644
--- a/fs/guestmemfs/Makefile
+++ b/fs/guestmemfs/Makefile
@@ -4,3 +4,5 @@
#
obj-y += guestmemfs.o inode.o dir.o allocator.o file.o
+
+obj-$(CONFIG_KEXEC_KHO) += serialise.o
diff --git a/fs/guestmemfs/guestmemfs.c b/fs/guestmemfs/guestmemfs.c
index 38f20ad25286..cf47e5100504 100644
--- a/fs/guestmemfs/guestmemfs.c
+++ b/fs/guestmemfs/guestmemfs.c
@@ -3,6 +3,7 @@
#include "guestmemfs.h"
#include <linux/dcache.h>
#include <linux/fs.h>
+#include <linux/kexec.h>
#include <linux/module.h>
#include <linux/fs_context.h>
#include <linux/io.h>
@@ -10,7 +11,7 @@
#include <linux/statfs.h>
phys_addr_t guestmemfs_base, guestmemfs_size;
-struct guestmemfs_sb *psb;
+struct super_block *guestmemfs_sb;
static int statfs(struct dentry *root, struct kstatfs *buf)
{
@@ -33,26 +34,39 @@ static int guestmemfs_fill_super(struct super_block *sb, struct fs_context *fc)
struct inode *inode;
struct dentry *dentry;
- psb = kzalloc(sizeof(*psb), GFP_KERNEL);
- psb->inodes = kzalloc(2 << 20, GFP_KERNEL);
- if (!psb->inodes)
- return -ENOMEM;
- psb->allocator_bitmap = kzalloc(1 << 20, GFP_KERNEL);
- if (!psb->allocator_bitmap)
- return -ENOMEM;
-
/*
* Keep a reference to the persistent super block in the
* ephemeral super block.
*/
- sb->s_fs_info = psb;
- spin_lock_init(&psb->allocation_lock);
- guestmemfs_initialise_inode_store(sb);
- guestmemfs_zero_allocations(sb);
- guestmemfs_get_persisted_inode(sb, 1)->flags = GUESTMEMFS_INODE_FLAG_DIR;
- strscpy(guestmemfs_get_persisted_inode(sb, 1)->filename, ".",
- GUESTMEMFS_FILENAME_LEN);
- psb->next_free_ino = 2;
+ sb->s_fs_info = guestmemfs_restore_from_kho();
+
+ if (GUESTMEMFS_PSB(sb)) {
+ pr_info("Restored super block from KHO\n");
+ } else {
+ struct guestmemfs_sb *psb;
+
+ pr_info("Did not restore from KHO - allocating free\n");
+ psb = kzalloc(sizeof(*psb), GFP_KERNEL);
+ psb->inodes = kzalloc(2 << 20, GFP_KERNEL);
+ if (!psb->inodes)
+ return -ENOMEM;
+ psb->allocator_bitmap = kzalloc(1 << 20, GFP_KERNEL);
+ if (!psb->allocator_bitmap)
+ return -ENOMEM;
+ sb->s_fs_info = psb;
+ spin_lock_init(&psb->allocation_lock);
+ guestmemfs_initialise_inode_store(sb);
+ guestmemfs_zero_allocations(sb);
+ guestmemfs_get_persisted_inode(sb, 1)->flags = GUESTMEMFS_INODE_FLAG_DIR;
+ strscpy(guestmemfs_get_persisted_inode(sb, 1)->filename, ".",
+ GUESTMEMFS_FILENAME_LEN);
+ GUESTMEMFS_PSB(sb)->next_free_ino = 2;
+ }
+ /*
+ * Keep a reference to this sb; the serialise callback needs it
+ * and has no other way to get it.
+ */
+ guestmemfs_sb = sb;
sb->s_op = &guestmemfs_super_ops;
@@ -98,11 +112,18 @@ static struct file_system_type guestmemfs_fs_type = {
.fs_flags = FS_USERNS_MOUNT,
};
+
+static struct notifier_block trace_kho_nb = {
+ .notifier_call = guestmemfs_serialise_to_kho,
+};
+
static int __init guestmemfs_init(void)
{
int ret;
ret = register_filesystem(&guestmemfs_fs_type);
+ if (IS_ENABLED(CONFIG_KEXEC_KHO))
+ register_kho_notifier(&trace_kho_nb);
return ret;
}
@@ -120,13 +141,18 @@ early_param("guestmemfs", parse_guestmemfs_extents);
void __init guestmemfs_reserve_mem(void)
{
- guestmemfs_base = memblock_phys_alloc(guestmemfs_size, 4 << 10);
- if (guestmemfs_base) {
- memblock_reserved_mark_noinit(guestmemfs_base, guestmemfs_size);
- memblock_mark_nomap(guestmemfs_base, guestmemfs_size);
- } else {
- pr_warn("Failed to alloc %llu bytes for guestmemfs\n", guestmemfs_size);
+ if (guestmemfs_size) {
+ guestmemfs_base = memblock_phys_alloc(guestmemfs_size, 4 << 10);
+
+ if (guestmemfs_base) {
+ memblock_reserved_mark_noinit(guestmemfs_base, guestmemfs_size);
+ memblock_mark_nomap(guestmemfs_base, guestmemfs_size);
+ pr_debug("guestmemfs reserved base=%llu from memblocks\n", guestmemfs_base);
+ } else {
+ pr_warn("Failed to alloc %llu bytes for guestmemfs\n", guestmemfs_size);
+ }
}
+
}
MODULE_ALIAS_FS("guestmemfs");
diff --git a/fs/guestmemfs/guestmemfs.h b/fs/guestmemfs/guestmemfs.h
index 0f2788ce740e..263d995b75ed 100644
--- a/fs/guestmemfs/guestmemfs.h
+++ b/fs/guestmemfs/guestmemfs.h
@@ -10,11 +10,14 @@
/* Units of bytes */
extern phys_addr_t guestmemfs_base, guestmemfs_size;
+extern struct super_block *guestmemfs_sb;
struct guestmemfs_sb {
/* Inode number */
unsigned long next_free_ino;
unsigned long allocated_inodes;
+
+ /* Ephemeral fields - must be updated on deserialise */
struct guestmemfs_inode *inodes;
void *allocator_bitmap;
spinlock_t allocation_lock;
@@ -46,6 +49,11 @@ long guestmemfs_alloc_block(struct super_block *sb);
struct inode *guestmemfs_inode_get(struct super_block *sb, unsigned long ino);
struct guestmemfs_inode *guestmemfs_get_persisted_inode(struct super_block *sb, int ino);
+int guestmemfs_serialise_to_kho(struct notifier_block *self,
+ unsigned long cmd,
+ void *v);
+struct guestmemfs_sb *guestmemfs_restore_from_kho(void);
+
extern const struct file_operations guestmemfs_dir_fops;
extern const struct file_operations guestmemfs_file_fops;
extern const struct inode_operations guestmemfs_file_inode_operations;
diff --git a/fs/guestmemfs/serialise.c b/fs/guestmemfs/serialise.c
new file mode 100644
index 000000000000..eb70d496a3eb
--- /dev/null
+++ b/fs/guestmemfs/serialise.c
@@ -0,0 +1,296 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include "guestmemfs.h"
+#include <linux/kexec.h>
+#include <linux/memblock.h>
+
+/*
+ * Responsible for serialisation and deserialisation of filesystem metadata
+ * to and from KHO to survive kexec. The deserialisation logic needs to mirror
+ * serialisation, so putting them in the same file.
+ *
+ * The format of the device tree structure is:
+ *
+ * /guestmemfs
+ * compatible = "guestmemfs-v1"
+ * fs_mem {
+ * mem = [ ... ]
+ * };
+ * superblock {
+ * mem = [
+ * persistent super block,
+ * inodes,
+ * allocator_bitmap,
+ * };
+ * mappings_block {
+ * mem = [ ... ]
+ * };
+ * // For every mappings_block mem, which inode it belongs to.
+ * mappings_to_inode {
+ * num_inodes,
+ * mem = [ ... ],
+ * }
+ */
+
+static int serialise_superblock(struct super_block *sb, void *fdt)
+{
+ struct kho_mem mem[3];
+ int err = 0;
+ struct guestmemfs_sb *psb = sb->s_fs_info;
+
+ err |= fdt_begin_node(fdt, "superblock");
+
+ mem[0].addr = virt_to_phys(psb);
+ mem[0].len = sizeof(*psb);
+
+ mem[1].addr = virt_to_phys(psb->inodes);
+ mem[1].len = 2 << 20;
+
+ mem[2].addr = virt_to_phys(psb->allocator_bitmap);
+ mem[2].len = 1 << 20;
+
+ err |= fdt_property(fdt, "mem", &mem, sizeof(mem));
+ err |= fdt_end_node(fdt);
+
+ return err;
+}
+
+static int serialise_mappings_blocks(struct super_block *sb, void *fdt)
+{
+ struct kho_mem *mappings_mems;
+ struct kho_mem mappings_to_inode_mem;
+ struct guestmemfs_sb *psb = sb->s_fs_info;
+ int inode_idx;
+ size_t num_inodes = PMD_SIZE / sizeof(struct guestmemfs_inode);
+ struct guestmemfs_inode *inode;
+ int err = 0;
+ int *mappings_to_inode;
+ int mappings_to_inode_idx = 0;
+
+ mappings_to_inode = kzalloc(PAGE_SIZE, GFP_KERNEL);
+
+ mappings_mems = kcalloc(psb->allocated_inodes, sizeof(struct kho_mem), GFP_KERNEL);
+
+ for (inode_idx = 1; inode_idx < num_inodes; ++inode_idx) {
+ inode = guestmemfs_get_persisted_inode(sb, inode_idx);
+ if (inode->flags & GUESTMEMFS_INODE_FLAG_FILE) {
+ mappings_mems[mappings_to_inode_idx].addr = virt_to_phys(inode->mappings);
+ mappings_mems[mappings_to_inode_idx].len = PAGE_SIZE;
+ mappings_to_inode[mappings_to_inode_idx] = inode_idx;
+ mappings_to_inode_idx++;
+ }
+ }
+
+ err |= fdt_begin_node(fdt, "mappings_blocks");
+ err |= fdt_property(fdt, "mem", mappings_mems,
+ sizeof(struct kho_mem) * mappings_to_inode_idx);
+ err |= fdt_end_node(fdt);
+
+
+ err |= fdt_begin_node(fdt, "mappings_to_inode");
+ mappings_to_inode_mem.addr = virt_to_phys(mappings_to_inode);
+ mappings_to_inode_mem.len = PAGE_SIZE;
+ err |= fdt_property(fdt, "mem", &mappings_to_inode_mem,
+ sizeof(mappings_to_inode_mem));
+ err |= fdt_property(fdt, "num_inodes", &psb->allocated_inodes,
+ sizeof(psb->allocated_inodes));
+
+ err |= fdt_end_node(fdt);
+
+ return err;
+}
+
+int guestmemfs_serialise_to_kho(struct notifier_block *self,
+ unsigned long cmd,
+ void *v)
+{
+ static const char compatible[] = "guestmemfs-v1";
+ struct kho_mem mem;
+ void *fdt = v;
+ int err = 0;
+
+ switch (cmd) {
+ case KEXEC_KHO_ABORT:
+ /* No rollback action needed. */
+ return NOTIFY_DONE;
+ case KEXEC_KHO_DUMP:
+ /* Handled below */
+ break;
+ default:
+ return NOTIFY_BAD;
+ }
+
+ err |= fdt_begin_node(fdt, "guestmemfs");
+ err |= fdt_property(fdt, "compatible", compatible, sizeof(compatible));
+
+ err |= fdt_begin_node(fdt, "fs_mem");
+ mem.addr = guestmemfs_base | KHO_MEM_ADDR_FLAG_NOINIT;
+ mem.len = guestmemfs_size;
+ err |= fdt_property(fdt, "mem", &mem, sizeof(mem));
+ err |= fdt_end_node(fdt);
+
+ err |= serialise_superblock(guestmemfs_sb, fdt);
+ err |= serialise_mappings_blocks(guestmemfs_sb, fdt);
+
+ err |= fdt_end_node(fdt);
+
+ pr_info("Serialised extends [0x%llx + 0x%llx] via KHO: %i\n",
+ guestmemfs_base, guestmemfs_size, err);
+
+ return err;
+}
+
+static struct guestmemfs_sb *deserialise_superblock(const void *fdt, int root_off)
+{
+ const struct kho_mem *mem;
+ int mem_len;
+ struct guestmemfs_sb *old_sb;
+ int off;
+
+ off = fdt_subnode_offset(fdt, root_off, "superblock");
+ mem = fdt_getprop(fdt, off, "mem", &mem_len);
+
+ if (mem_len != 3 * sizeof(struct kho_mem)) {
+ pr_err("Incorrect mem_len; got %i\n", mem_len);
+ return NULL;
+ }
+
+ old_sb = kho_claim_mem(mem);
+ old_sb->inodes = kho_claim_mem(mem + 1);
+ old_sb->allocator_bitmap = kho_claim_mem(mem + 2);
+
+ return old_sb;
+}
+
+static int deserialise_mappings_blocks(const void *fdt, int root_off,
+ struct guestmemfs_sb *sb)
+{
+ int off;
+ int len = 0;
+ const unsigned long *num_inodes;
+ const struct kho_mem *mappings_to_inode_mem;
+ int *mappings_to_inode;
+ int mappings_block;
+ const struct kho_mem *mappings_blocks_mems;
+
+ /*
+ * Array of struct kho_mem - one for each persisted mappings
+ * blocks.
+ */
+ off = fdt_subnode_offset(fdt, root_off, "mappings_blocks");
+ mappings_blocks_mems = fdt_getprop(fdt, off, "mem", &len);
+
+ /*
+ * Array specifying which inode a specific index into the
+ * mappings_blocks kho_mem array corresponds to. num_inodes
+ * indicates the size of the array which is the number of mappings
+ * blocks which need to be restored.
+ */
+ off = fdt_subnode_offset(fdt, root_off, "mappings_to_inode");
+ if (off < 0) {
+ pr_warn("No fs_mem available in KHO\n");
+ return -EINVAL;
+ }
+ num_inodes = fdt_getprop(fdt, off, "num_inodes", &len);
+ if (len != sizeof(num_inodes)) {
+ pr_warn("Invalid num_inodes len: %i\n", len);
+ return -EINVAL;
+ }
+ mappings_to_inode_mem = fdt_getprop(fdt, off, "mem", &len);
+ if (len != sizeof(*mappings_to_inode_mem)) {
+ pr_warn("Invalid mappings_to_inode_mem len: %i\n", len);
+ return -EINVAL;
+ }
+ mappings_to_inode = kho_claim_mem(mappings_to_inode_mem);
+
+ /*
+ * Re-assigned the mappings block to the inodes. Indexes into
+ * mappings_to_inode specifies which inode to assign each mappings
+ * block to.
+ */
+ for (mappings_block = 0; mappings_block < *num_inodes; ++mappings_block) {
+ int inode = mappings_to_inode[mappings_block];
+
+ sb->inodes[inode].mappings = kho_claim_mem(&mappings_blocks_mems[mappings_block]);
+ }
+
+ return 0;
+}
+
+static int deserialise_fs_mem(const void *fdt, int root_off)
+{
+ int err;
+ /* Offset into the KHO DT */
+ int off;
+ int len = 0;
+ const struct kho_mem *mem;
+
+ off = fdt_subnode_offset(fdt, root_off, "fs_mem");
+ if (off < 0) {
+ pr_info("No fs_mem available in KHO\n");
+ return -EINVAL;
+ }
+
+ mem = fdt_getprop(fdt, off, "mem", &len);
+ if (mem && len == sizeof(*mem)) {
+ guestmemfs_base = mem->addr & ~KHO_MEM_ADDR_FLAG_MASK;
+ guestmemfs_size = mem->len;
+ } else {
+ pr_err("KHO did not contain a guestmemfs base address and size\n");
+ return -EINVAL;
+ }
+
+ pr_info("Reclaimed [%llx + %llx] via KHO\n", guestmemfs_base, guestmemfs_size);
+ if (err) {
+ pr_err("Unable to reserve [0x%llx + 0x%llx] from memblock: %i\n",
+ guestmemfs_base, guestmemfs_size, err);
+ return err;
+ }
+ return 0;
+}
+struct guestmemfs_sb *guestmemfs_restore_from_kho(void)
+{
+ const void *fdt = kho_get_fdt();
+ struct guestmemfs_sb *old_sb;
+ int err;
+ /* Offset into the KHO DT */
+ int off;
+
+ if (!fdt) {
+ pr_err("Unable to get KHO DT after KHO boot?\n");
+ return NULL;
+ }
+
+ off = fdt_path_offset(fdt, "/guestmemfs");
+ pr_info("guestmemfs offset: %i\n", off);
+
+ if (!off) {
+ pr_info("No guestmemfs data available in KHO\n");
+ return NULL;
+ }
+ err = fdt_node_check_compatible(fdt, off, "guestmemfs-v1");
+ if (err) {
+ pr_err("Existing KHO superblock format is not compatible with this kernel\n");
+ return NULL;
+ }
+
+ old_sb = deserialise_superblock(fdt, off);
+ if (!old_sb) {
+ pr_warn("Failed to restore superblock\n");
+ return NULL;
+ }
+
+ err = deserialise_mappings_blocks(fdt, off, old_sb);
+ if (err) {
+ pr_warn("Failed to restore mappings blocks\n");
+ return NULL;
+ }
+
+ err = deserialise_fs_mem(fdt, off);
+ if (err) {
+ pr_warn("Failed to restore filesystem memory extents\n");
+ return NULL;
+ }
+
+ return old_sb;
+}
--
2.34.1
^ permalink raw reply related [flat|nested] 35+ messages in thread
* [PATCH 08/10] guestmemfs: Block modifications when serialised
2024-08-05 9:32 [PATCH 00/10] Introduce guestmemfs: persistent in-memory filesystem James Gowans
` (6 preceding siblings ...)
2024-08-05 9:32 ` [PATCH 07/10] guestmemfs: Persist filesystem metadata via KHO James Gowans
@ 2024-08-05 9:32 ` James Gowans
2024-08-05 9:32 ` [PATCH 09/10] guestmemfs: Add documentation and usage instructions James Gowans
` (5 subsequent siblings)
13 siblings, 0 replies; 35+ messages in thread
From: James Gowans @ 2024-08-05 9:32 UTC (permalink / raw)
To: linux-kernel
Cc: James Gowans, Sean Christopherson, Paolo Bonzini, Alexander Viro,
Steve Sistare, Christian Brauner, Jan Kara, Anthony Yznaga,
Mike Rapoport, Andrew Morton, linux-mm, Jason Gunthorpe,
linux-fsdevel, Usama Arif, kvm, Alexander Graf, David Woodhouse,
Paul Durrant, Nicolas Saenz Julienne
Once the memory regions for inodes, mappings and allocations have been
serialised, further modifications would break the serialised data; it
would no longer be valid.
Return an error code if attempting to create new files or allocate data
for files once serialised.
Signed-off-by: James Gowans <jgowans@amazon.com>
---
fs/guestmemfs/file.c | 19 ++++++++++++++++---
fs/guestmemfs/guestmemfs.c | 1 +
fs/guestmemfs/guestmemfs.h | 1 +
fs/guestmemfs/inode.c | 6 ++++++
fs/guestmemfs/serialise.c | 8 +++++++-
5 files changed, 31 insertions(+), 4 deletions(-)
diff --git a/fs/guestmemfs/file.c b/fs/guestmemfs/file.c
index b1a52abcde65..8707a9d3ad90 100644
--- a/fs/guestmemfs/file.c
+++ b/fs/guestmemfs/file.c
@@ -8,19 +8,32 @@ static int truncate(struct inode *inode, loff_t newsize)
unsigned long free_block;
struct guestmemfs_inode *guestmemfs_inode;
unsigned long *mappings;
+ int rc = 0;
+ struct guestmemfs_sb *psb = GUESTMEMFS_PSB(inode->i_sb);
+
+ spin_lock(&psb->allocation_lock);
+
+ if (psb->serialised) {
+ rc = -EBUSY;
+ goto out;
+ }
guestmemfs_inode = guestmemfs_get_persisted_inode(inode->i_sb, inode->i_ino);
mappings = guestmemfs_inode->mappings;
i_size_write(inode, newsize);
for (int block_idx = 0; block_idx * PMD_SIZE < newsize; ++block_idx) {
free_block = guestmemfs_alloc_block(inode->i_sb);
- if (free_block < 0)
+ if (free_block < 0) {
/* TODO: roll back allocations. */
- return -ENOMEM;
+ rc = -ENOMEM;
+ goto out;
+ }
*(mappings + block_idx) = free_block;
++guestmemfs_inode->num_mappings;
}
- return 0;
+out:
+ spin_unlock(&psb->allocation_lock);
+ return rc;
}
static int inode_setattr(struct mnt_idmap *idmap, struct dentry *dentry, struct iattr *iattr)
diff --git a/fs/guestmemfs/guestmemfs.c b/fs/guestmemfs/guestmemfs.c
index cf47e5100504..d854033bfb7e 100644
--- a/fs/guestmemfs/guestmemfs.c
+++ b/fs/guestmemfs/guestmemfs.c
@@ -42,6 +42,7 @@ static int guestmemfs_fill_super(struct super_block *sb, struct fs_context *fc)
if (GUESTMEMFS_PSB(sb)) {
pr_info("Restored super block from KHO\n");
+ GUESTMEMFS_PSB(sb)->serialised = 0;
} else {
struct guestmemfs_sb *psb;
diff --git a/fs/guestmemfs/guestmemfs.h b/fs/guestmemfs/guestmemfs.h
index 263d995b75ed..91cc06ae45a5 100644
--- a/fs/guestmemfs/guestmemfs.h
+++ b/fs/guestmemfs/guestmemfs.h
@@ -21,6 +21,7 @@ struct guestmemfs_sb {
struct guestmemfs_inode *inodes;
void *allocator_bitmap;
spinlock_t allocation_lock;
+ bool serialised;
};
// If neither of these are set the inode is not in use.
diff --git a/fs/guestmemfs/inode.c b/fs/guestmemfs/inode.c
index 61f70441d82c..d521b35d4992 100644
--- a/fs/guestmemfs/inode.c
+++ b/fs/guestmemfs/inode.c
@@ -48,6 +48,12 @@ static unsigned long guestmemfs_allocate_inode(struct super_block *sb)
struct guestmemfs_sb *psb = GUESTMEMFS_PSB(sb);
spin_lock(&psb->allocation_lock);
+
+ if (psb->serialised) {
+ spin_unlock(&psb->allocation_lock);
+ return -EBUSY;
+ }
+
next_free_ino = psb->next_free_ino;
psb->allocated_inodes += 1;
if (!next_free_ino)
diff --git a/fs/guestmemfs/serialise.c b/fs/guestmemfs/serialise.c
index eb70d496a3eb..347eb8049a71 100644
--- a/fs/guestmemfs/serialise.c
+++ b/fs/guestmemfs/serialise.c
@@ -111,7 +111,7 @@ int guestmemfs_serialise_to_kho(struct notifier_block *self,
switch (cmd) {
case KEXEC_KHO_ABORT:
- /* No rollback action needed. */
+ GUESTMEMFS_PSB(guestmemfs_sb)->serialised = 0;
return NOTIFY_DONE;
case KEXEC_KHO_DUMP:
/* Handled below */
@@ -120,6 +120,7 @@ int guestmemfs_serialise_to_kho(struct notifier_block *self,
return NOTIFY_BAD;
}
+ spin_lock(&GUESTMEMFS_PSB(guestmemfs_sb)->allocation_lock);
err |= fdt_begin_node(fdt, "guestmemfs");
err |= fdt_property(fdt, "compatible", compatible, sizeof(compatible));
@@ -134,6 +135,11 @@ int guestmemfs_serialise_to_kho(struct notifier_block *self,
err |= fdt_end_node(fdt);
+ if (!err)
+ GUESTMEMFS_PSB(guestmemfs_sb)->serialised = 1;
+
+ spin_unlock(&GUESTMEMFS_PSB(guestmemfs_sb)->allocation_lock);
+
pr_info("Serialised extends [0x%llx + 0x%llx] via KHO: %i\n",
guestmemfs_base, guestmemfs_size, err);
--
2.34.1
^ permalink raw reply related [flat|nested] 35+ messages in thread
* [PATCH 09/10] guestmemfs: Add documentation and usage instructions
2024-08-05 9:32 [PATCH 00/10] Introduce guestmemfs: persistent in-memory filesystem James Gowans
` (7 preceding siblings ...)
2024-08-05 9:32 ` [PATCH 08/10] guestmemfs: Block modifications when serialised James Gowans
@ 2024-08-05 9:32 ` James Gowans
2024-08-05 9:32 ` [PATCH 10/10] MAINTAINERS: Add maintainers for guestmemfs James Gowans
` (4 subsequent siblings)
13 siblings, 0 replies; 35+ messages in thread
From: James Gowans @ 2024-08-05 9:32 UTC (permalink / raw)
To: linux-kernel
Cc: James Gowans, Sean Christopherson, Paolo Bonzini, Alexander Viro,
Steve Sistare, Christian Brauner, Jan Kara, Anthony Yznaga,
Mike Rapoport, Andrew Morton, linux-mm, Jason Gunthorpe,
linux-fsdevel, Usama Arif, kvm, Alexander Graf, David Woodhouse,
Paul Durrant, Nicolas Saenz Julienne
Describe the motivation for guestmemfs, the functionality it provides,
how to compile it in, how to use it as a source of guest memory, how to
persist it across kexec and save/restore a VM.
Signed-off-by: James Gowans <jgowans@amazon.com>
---
Documentation/filesystems/guestmemfs.rst | 87 ++++++++++++++++++++++++
1 file changed, 87 insertions(+)
create mode 100644 Documentation/filesystems/guestmemfs.rst
diff --git a/Documentation/filesystems/guestmemfs.rst b/Documentation/filesystems/guestmemfs.rst
new file mode 100644
index 000000000000..d6ce0d194cc8
--- /dev/null
+++ b/Documentation/filesystems/guestmemfs.rst
@@ -0,0 +1,87 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+======================================================
+Guestmemfs - Persistent in-memory guest RAM filesystem
+======================================================
+
+Overview
+========
+
+Guestmemfs is an in-memory filesystem designed specifically for live update of
+virtual machines: it provides a source of guest VM memory which persists
+across kexec.
+
+Live update of a hypervisor refers to the act of pausing running VMs, serialising
+state, kexec-ing into a new hypervisor image, re-hydrating the KVM guests and
+resuming them. To achieve this, guest memory must be preserved across kexec.
+
+Additionally, guestmemfs provides:
+- secret hiding for guest memory: the physical memory allocated for guestmemfs
+ is carved out of the direct map early in boot.
+- struct page overhead elimination: guestmemfs memory is not allocated by the
+ buddy allocator and does not have associated struct pages.
+- huge page mappings: allocations are done at PMD size and this improves TLB
+ performance (work in progress).
+
+Compilation
+===========
+
+Guestmemfs is enabled via CONFIG_GUESTMEMFS_FS
+
+Persistence across kexec is enabled via CONFIG_KEXEC_KHO
+
+Usage
+=====
+
+On first boot (cold boot), allocate a large contiguous chunk of memory for
+guestmemfs via a kernel cmdline argument, eg:
+`guestmemfs=10G`.
+
+Mount guestmemfs:
+mount -t guestmemfs guestmemfs /mnt/guestmemfs/
+
+Create and truncate a file which will be used for guest RAM:
+
+touch /mnt/guestmemfs/guest-ram
+truncate -s 500M /mnt/guestmemfs/guest-ram
+
+Boot a VM with this as the RAM source and the live update option enabled:
+
+qemu-system-x86_64 ... \
+ -object memory-backend-file,id=pc.ram,size=100M,mem-path=/mnt/guestmemfs/guest-ram,share=yes,prealloc=off \
+ -migrate-mode-enable cpr-reboot \
+ ...
+
+Suspend the guest and save its state via the QEMU monitor:
+
+migrate_set_parameter mode cpr-reboot
+migrate file:/qemu.sav
+
+Activate KHO to serialise guestmemfs metadata and then kexec to the new
+hypervisor image:
+
+echo 1 > /sys/kernel/kho/active
+kexec -s -l --reuse-cmdline
+kexec -e
+
+After the kexec completes, remount guestmemfs (or have it added to fstab).
+Re-start QEMU in live update restore mode:
+
+qemu-system-x86_64 ... \
+ -object memory-backend-file,id=pc.ram,size=100M,mem-path=/mnt/guestmemfs/guest-ram,share=yes,prealloc=off \
+ -migrate-mode-enable cpr-reboot \
+ -incoming defer
+ ...
+
+Finally restore the VM state and resume it via QEMU console:
+
+migrate_incoming file:/qemu.sav
+
+Future Work
+===========
+- NUMA awareness and multi-mount point support
+- Actually creating PMD-level mappings in page tables
+- guest_memfd style interface for confidential computing
+- supporting PUD-level allocations and mappings
+- MCE handling
+- Persisting IOMMU pgtables to allow DMA to guestmemfs during kexec
--
2.34.1
^ permalink raw reply related [flat|nested] 35+ messages in thread
* [PATCH 10/10] MAINTAINERS: Add maintainers for guestmemfs
2024-08-05 9:32 [PATCH 00/10] Introduce guestmemfs: persistent in-memory filesystem James Gowans
` (8 preceding siblings ...)
2024-08-05 9:32 ` [PATCH 09/10] guestmemfs: Add documentation and usage instructions James Gowans
@ 2024-08-05 9:32 ` James Gowans
2024-08-05 14:32 ` [PATCH 00/10] Introduce guestmemfs: persistent in-memory filesystem Theodore Ts'o
` (3 subsequent siblings)
13 siblings, 0 replies; 35+ messages in thread
From: James Gowans @ 2024-08-05 9:32 UTC (permalink / raw)
To: linux-kernel
Cc: James Gowans, Sean Christopherson, Paolo Bonzini, Alexander Viro,
Steve Sistare, Christian Brauner, Jan Kara, Anthony Yznaga,
Mike Rapoport, Andrew Morton, linux-mm, Jason Gunthorpe,
linux-fsdevel, Usama Arif, kvm, Alexander Graf, David Woodhouse,
Paul Durrant, Nicolas Saenz Julienne
Signed-off-by: James Gowans <jgowans@amazon.com>
---
MAINTAINERS | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/MAINTAINERS b/MAINTAINERS
index 1028eceb59ca..e9c841bb18ba 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9412,6 +9412,14 @@ S: Maintained
T: git git://git.kernel.org/pub/scm/linux/kernel/git/pablo/gtp.git
F: drivers/net/gtp.c
+GUESTMEMFS
+M: James Gowans <jgowans@amazon.com>
+M: Alex Graf <graf@amazon.de>
+L: linux-fsdevel@vger.kernel.org
+S: Maintained
+F: Documentation/filesystems/guestmemfs.rst
+F: fs/guestmemfs/
+
GUID PARTITION TABLE (GPT)
M: Davidlohr Bueso <dave@stgolabs.net>
L: linux-efi@vger.kernel.org
--
2.34.1
^ permalink raw reply related [flat|nested] 35+ messages in thread
* Re: [PATCH 00/10] Introduce guestmemfs: persistent in-memory filesystem
2024-08-05 9:32 [PATCH 00/10] Introduce guestmemfs: persistent in-memory filesystem James Gowans
` (9 preceding siblings ...)
2024-08-05 9:32 ` [PATCH 10/10] MAINTAINERS: Add maintainers for guestmemfs James Gowans
@ 2024-08-05 14:32 ` Theodore Ts'o
2024-08-05 14:41 ` Paolo Bonzini
2024-08-05 19:53 ` Gowans, James
2024-08-05 20:01 ` Jan Kara
` (2 subsequent siblings)
13 siblings, 2 replies; 35+ messages in thread
From: Theodore Ts'o @ 2024-08-05 14:32 UTC (permalink / raw)
To: James Gowans
Cc: linux-kernel, Sean Christopherson, Paolo Bonzini, Alexander Viro,
Steve Sistare, Christian Brauner, Jan Kara, Anthony Yznaga,
Mike Rapoport, Andrew Morton, linux-mm, Jason Gunthorpe,
linux-fsdevel, Usama Arif, kvm, Alexander Graf, David Woodhouse,
Paul Durrant, Nicolas Saenz Julienne
On Mon, Aug 05, 2024 at 11:32:35AM +0200, James Gowans wrote:
> Guestmemfs implements preservation across kexec by carving out a
> large contiguous block of host system RAM early in boot which is
> then used as the data for the guestmemfs files.
Why does the memory have to be (a) contiguous, and (b) carved out of
*host* system memory early in boot? This seems to be very inflexible;
it means that you have to know how much memory will be needed for
guestmemfs in early boot.
Also, the VMM update process is not a common case thing, so we don't
need to optimize for performance. If we need to temporarily use
swap/zswap to allocate memory at VMM update time, and if the pages
aren't contiguous when they are copied out before doing the VMM
update, that might be very well worth the vast amount of memory needed to
pay for reserving memory on the host for the VMM update that only
might happen once every few days/weeks/months (depending on whether
you are doing update just for high severity security fixes, or for
random VMM updates).
Even if you are updating the VMM every few days, it still doesn't seem
that permanently reserving contiguous memory on the host can be
justified from a TCO perspective.
Cheers,
- Ted
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 00/10] Introduce guestmemfs: persistent in-memory filesystem
2024-08-05 14:32 ` [PATCH 00/10] Introduce guestmemfs: persistent in-memory filesystem Theodore Ts'o
@ 2024-08-05 14:41 ` Paolo Bonzini
2024-08-05 19:47 ` Gowans, James
2024-08-05 19:53 ` Gowans, James
1 sibling, 1 reply; 35+ messages in thread
From: Paolo Bonzini @ 2024-08-05 14:41 UTC (permalink / raw)
To: Theodore Ts'o
Cc: James Gowans, linux-kernel, Sean Christopherson, Alexander Viro,
Steve Sistare, Christian Brauner, Jan Kara, Anthony Yznaga,
Mike Rapoport, Andrew Morton, linux-mm, Jason Gunthorpe,
linux-fsdevel, Usama Arif, kvm, Alexander Graf, David Woodhouse,
Paul Durrant, Nicolas Saenz Julienne
On Mon, Aug 5, 2024 at 4:35 PM Theodore Ts'o <tytso@mit.edu> wrote:
> On Mon, Aug 05, 2024 at 11:32:35AM +0200, James Gowans wrote:
> > Guestmemfs implements preservation across kexec by carving out a
> > large contiguous block of host system RAM early in boot which is
> > then used as the data for the guestmemfs files.
>
> Also, the VMM update process is not a common case thing, so we don't
> need to optimize for performance. If we need to temporarily use
> swap/zswap to allocate memory at VMM update time, and if the pages
> aren't contiguous when they are copied out before doing the VMM
> update
I'm not sure I understand, where would this temporary allocation happen?
> that might be very well worth the vast amount of memory needed to
> pay for reserving memory on the host for the VMM update that only
> might happen once every few days/weeks/months (depending on whether
> you are doing update just for high severity security fixes, or for
> random VMM updates).
>
> Even if you are updating the VMM every few days, it still doesn't seem
> that permanently reserving contiguous memory on the host can be
> justified from a TCO perspective.
As far as I understand, this is intended for use in systems that do
not do anything except hosting VMs, where anyway you'd devote 90%+ of
host memory to hugetlbfs gigapages.
Paolo
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 00/10] Introduce guestmemfs: persistent in-memory filesystem
2024-08-05 14:41 ` Paolo Bonzini
@ 2024-08-05 19:47 ` Gowans, James
0 siblings, 0 replies; 35+ messages in thread
From: Gowans, James @ 2024-08-05 19:47 UTC (permalink / raw)
To: pbonzini@redhat.com, tytso@mit.edu
Cc: kvm@vger.kernel.org, rppt@kernel.org, brauner@kernel.org,
Graf (AWS), Alexander, anthony.yznaga@oracle.com,
steven.sistare@oracle.com, akpm@linux-foundation.org,
linux-kernel@vger.kernel.org, seanjc@google.com, Woodhouse, David,
linux-mm@kvack.org, nh-open-source@amazon.com,
Saenz Julienne, Nicolas, Durrant, Paul, viro@zeniv.linux.org.uk,
jack@suse.cz, linux-fsdevel@vger.kernel.org, jgg@ziepe.ca,
usama.arif@bytedance.com
On Mon, 2024-08-05 at 16:41 +0200, Paolo Bonzini wrote:
> On Mon, Aug 5, 2024 at 4:35 PM Theodore Ts'o <tytso@mit.edu> wrote:
> > On Mon, Aug 05, 2024 at 11:32:35AM +0200, James Gowans wrote:
> > > Guestmemfs implements preservation across kexec by carving out a
> > > large contiguous block of host system RAM early in boot which is
> > > then used as the data for the guestmemfs files.
> >
> > Also, the VMM update process is not a common case thing, so we don't
> > need to optimize for performance. If we need to temporarily use
> > swap/zswap to allocate memory at VMM update time, and if the pages
> > aren't contiguous when they are copied out before doing the VMM
> > update
>
> I'm not sure I understand, where would this temporary allocation happen?
The intended use case for live update is to update the entirety of the
hypervisor: kexecing into a new kernel, launching new VMM processes. So
anything in kernel state (page tables, VMAs, (z)swap entries, struct
pages, etc) is all lost after kexec and needs to be re-created. That's
the job of guestmemfs: provide the persistence across kexec and ability
to re-create the mapping by re-opening the files.
It would be far too impactful to need to write out the whole VM memory
to disk. Also with CoCo VMs that's not really possible. When virtual
machines are running, every millisecond of down time counts. It would be
wasteful to need to keep terabytes of SSDs lying around just to briefly
write all the guest RAM there and then read it out a moment later. Much
better to leave all the guest memory where it is: in memory.
>
> > that might be very well worth the vast amount of memory needed to
> > pay for reserving memory on the host for the VMM update that only
> > might happen once every few days/weeks/months (depending on whether
> > you are doing update just for high severity security fixes, or for
> > random VMM updates).
> >
> > Even if you are updating the VMM every few days, it still doesn't seem
> > that permanently reserving contiguous memory on the host can be
> > justified from a TCO perspective.
>
> As far as I understand, this is intended for use in systems that do
> not do anything except hosting VMs, where anyway you'd devote 90%+ of
> host memory to hugetlbfs gigapages.
Exactly, the use case here is for machines whose only job is to be a KVM
hypervisor. The majority of system RAM is donated to guestmemfs;
anything else (host kernel memory and VMM anonymous memory) is
essentially overhead and should be minimised.
JG
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 00/10] Introduce guestmemfs: persistent in-memory filesystem
2024-08-05 14:32 ` [PATCH 00/10] Introduce guestmemfs: persistent in-memory filesystem Theodore Ts'o
2024-08-05 14:41 ` Paolo Bonzini
@ 2024-08-05 19:53 ` Gowans, James
1 sibling, 0 replies; 35+ messages in thread
From: Gowans, James @ 2024-08-05 19:53 UTC (permalink / raw)
To: tytso@mit.edu
Cc: kvm@vger.kernel.org, rppt@kernel.org, brauner@kernel.org,
Graf (AWS), Alexander, anthony.yznaga@oracle.com,
steven.sistare@oracle.com, akpm@linux-foundation.org,
linux-kernel@vger.kernel.org, seanjc@google.com, Woodhouse, David,
pbonzini@redhat.com, linux-mm@kvack.org,
nh-open-source@amazon.com, Saenz Julienne, Nicolas, Durrant, Paul,
viro@zeniv.linux.org.uk, jack@suse.cz,
linux-fsdevel@vger.kernel.org, jgg@ziepe.ca,
usama.arif@bytedance.com
On Mon, 2024-08-05 at 10:32 -0400, Theodore Ts'o wrote:
> On Mon, Aug 05, 2024 at 11:32:35AM +0200, James Gowans wrote:
> > Guestmemfs implements preservation across kexec by carving out a
> > large contiguous block of host system RAM early in boot which is
> > then used as the data for the guestmemfs files.
>
> Why does the memory have to be (a) contiguous, and (b) carved out of
> *host* system memory early in boot? This seems to be very inflexible;
> it means that you have to know how much memory will be needed for
> guestmemfs in early boot.
The main reason for both of these is to guarantee that the huge (2 MiB
PMD) and gigantic (1 GiB PUD) allocations can happen. While this patch
series only does huge page allocations for simplicity, the intention is
to extend it to gigantic PUD level allocations soon (I'd like to get the
simple functionality merged before adding more complexity).
Other than doing a memblock allocation at early boot there really is no
way that I know of to do GiB-size allocations dynamically.
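A rough sketch of what that early reservation amounts to (simplified and
illustrative; the hook name matches the one added to mem_init() in this
series, but the body and cmdline handling shown here are not the actual
patch):

#include <linux/init.h>
#include <linux/memblock.h>
#include <linux/printk.h>
#include <linux/sizes.h>

/* Sketch: grab a large, GiB-aligned contiguous range from memblock before
 * the buddy allocator takes ownership of system RAM. */
static phys_addr_t guestmemfs_base;
static phys_addr_t guestmemfs_size;	/* parsed from a cmdline parameter */

void __init guestmemfs_reserve_mem(void)
{
	if (!guestmemfs_size)
		return;

	guestmemfs_base = memblock_phys_alloc(guestmemfs_size, SZ_1G);
	if (!guestmemfs_base)
		pr_warn("guestmemfs: failed to reserve %pa bytes\n",
			&guestmemfs_size);
}

Once the buddy allocator is up, an allocation this large and this aligned
can generally no longer be satisfied, which is why it has to happen here.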
In terms of the need for a contiguous chunk, that's a bit of a
simplification for now. As mentioned in the cover letter there currently
isn't any NUMA support in this patch series. We'd want to add the
ability to do NUMA handling in following patch series. In that case it
would be multiple contiguous allocations, one for each NUMA node that
the user wants to run VMs on.
JG
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 00/10] Introduce guestmemfs: persistent in-memory filesystem
2024-08-05 9:32 [PATCH 00/10] Introduce guestmemfs: persistent in-memory filesystem James Gowans
` (10 preceding siblings ...)
2024-08-05 14:32 ` [PATCH 00/10] Introduce guestmemfs: persistent in-memory filesystem Theodore Ts'o
@ 2024-08-05 20:01 ` Jan Kara
2024-08-05 23:29 ` Jason Gunthorpe
2024-08-06 8:12 ` Gowans, James
2024-08-07 23:45 ` David Matlack
2024-10-17 4:53 ` Vishal Annapurve
13 siblings, 2 replies; 35+ messages in thread
From: Jan Kara @ 2024-08-05 20:01 UTC (permalink / raw)
To: James Gowans
Cc: linux-kernel, Sean Christopherson, Paolo Bonzini, Alexander Viro,
Steve Sistare, Christian Brauner, Jan Kara, Anthony Yznaga,
Mike Rapoport, Andrew Morton, linux-mm, Jason Gunthorpe,
linux-fsdevel, Usama Arif, kvm, Alexander Graf, David Woodhouse,
Paul Durrant, Nicolas Saenz Julienne, Muchun Song
On Mon 05-08-24 11:32:35, James Gowans wrote:
> In this patch series a new in-memory filesystem designed specifically
> for live update is implemented. Live update is a mechanism to support
> updating a hypervisor in a way that has limited impact to running
> virtual machines. This is done by pausing/serialising running VMs,
> kexec-ing into a new kernel, starting new VMM processes and then
> deserialising/resuming the VMs so that they continue running from where
> they were. To support this, guest memory needs to be preserved.
>
> Guestmemfs implements preservation across kexec by carving out a large
> contiguous block of host system RAM early in boot which is then used as
> the data for the guestmemfs files. As well as preserving that large
> block of data memory across kexec, the filesystem metadata is preserved
> via the Kexec Hand Over (KHO) framework (still under review):
> https://lore.kernel.org/all/20240117144704.602-1-graf@amazon.com/
>
> Filesystem metadata is structured to make preservation across kexec
> easy: inodes are one large contiguous array, and each inode has a
> "mappings" block which defines which block from the filesystem data
> memory corresponds to which offset in the file.
>
> There are additional constraints/requirements which guestmemfs aims to
> meet:
>
> 1. Secret hiding: all filesystem data is removed from the kernel direct
> map so it is immune from speculative access. read()/write() are not supported;
> the only way to get at the data is via mmap.
>
> 2. Struct page overhead elimination: the memory is not managed by the
> buddy allocator and hence has no struct pages.
>
> 3. PMD and PUD level allocations for TLB performance: guestmemfs
> allocates PMD-sized pages to back files which improves TLB perf (caveat
> below!). PUD size allocations are a next step.
>
> 4. Device assignment: being able to use guestmemfs memory for
> VFIO/iommufd mappings, and allow those mappings to survive and continue
> to be used across kexec.
To me the basic functionality resembles hugetlbfs a lot. Now I know very
little about the details of hugetlbfs so I've added relevant folks to CC. Have you
considered extending hugetlbfs with the functionality you need (such as
preservation across kexec) instead of implementing a completely new filesystem?
Honza
> Next steps
> =========
>
> The idea is that this patch series implements a minimal filesystem to
> provide the foundations for in-memory files that persist across kexec.
> Once this foundation is in place it will be extended:
>
> 1. Improve the filesystem to be more comprehensive - currently it's just
> functional enough to demonstrate the main objective of reserved memory
> and persistence via KHO.
>
> 2. Build support for iommufd IOAS and HWPT persistence, and integrate
> that with guestmemfs. The idea is that if VMs have DMA devices assigned
> to them, DMA should continue running across kexec. A future patch series
> will add support for this in iommufd and connect iommufd to guestmemfs
> so that guestmemfs files can remain mapped into the IOMMU during kexec.
>
> 3. Support a guest_memfd interface to files so that they can be used for
> confidential computing without needing to mmap into userspace.
>
> 4. Gigantic PUD level mappings for even better TLB perf.
>
> Caveats
> =======
>
> There are a few issues with the current implementation which should be
> solved either in this patch series or soon in follow-on work:
>
> 1. Although PMD-size allocations are done, PTE-level page tables are
> still created. This is because guestmemfs uses remap_pfn_range() to set
> up userspace pgtables. Currently remap_pfn_range() only creates
> PTE-level mappings. I suggest enhancing remap_pfn_range() to support
> creating higher level mappings where possible, by adding pmd_special
> and pud_special flags.
>
> 2. NUMA support is currently non-existent. To make this more generally
> useful it's necessary to have NUMA-awareness. One thought on how to do
> this is to be able to specify multiple allocations with NUMA affinity
> on the kernel cmdline and have multiple mount points, one per NUMA node.
> Currently, for simplicity, only a single contiguous filesystem data
> allocation and a single mount point is supported.
>
> 3. MCEs are currently not handled - we need to add functionality for
> this to be able to track block ownership and deliver an MCE correctly.
>
> 4. Looking for reviews from filesystem experts to see if necessary
> callbacks, refcounting, locking, etc, is done correctly.
>
> Open questions
> ==============
>
> It is not too clear if or how guestmemfs should use DAX as a source of
> memory. Seeing as guestmemfs has an in-memory design, it seems that it
> is not necessary to use DAX as a source of memory, but I am keen for
> guidance/input on whether DAX should be used here.
>
> The filesystem data memory is removed from the direct map for secret
> hiding, but it is still necessary to mmap it to be accessible to KVM.
> For improving secret hiding even more a guest_memfd-style interface
> could be used to remove the need to mmap. That introduces a new problem
> of the memory being completely inaccessible to KVM for things like MMIO
> instruction emulation. How can this be handled?
>
> Related Work
> ============
>
> There are similarities to a few attempts at solving aspects of this
> problem previously.
>
> The original was probably PKRAM from Oracle; a tmpfs filesystem with
> persistence:
> https://lore.kernel.org/kexec/1682554137-13938-1-git-send-email-anthony.yznaga@oracle.com/
> guestmemfs will additionally provide secret hiding, PMD/PUD allocations
> and a path to DMA persistence and NUMA support.
>
> Dmemfs from Tencent aimed to remove the need for struct page overhead:
> https://lore.kernel.org/kvm/cover.1602093760.git.yuleixzhang@tencent.com/
> Guestmemfs provides this benefit too, along with persistence across
> kexec and secret hiding.
>
> Pkernfs attempted to solve guest memory persistence and IOMMU
> persistence all in one:
> https://lore.kernel.org/all/20240205120203.60312-1-jgowans@amazon.com/
> Guestmemfs is a re-work of that to only persist guest RAM in the
> filesystem, and to use KHO for filesystem metadata. IOMMU persistence
> will be implemented independently with persistent iommufd domains via
> KHO.
>
> Testing
> =======
>
> The testing for this can be seen in the Documentation file in this patch
> series. Essentially it is using a guestmemfs file for a QEMU VM's RAM,
> doing a kexec, restoring the QEMU VM and confirming that the VM picked
> up from where it left off.
>
> James Gowans (10):
> guestmemfs: Introduce filesystem skeleton
> guestmemfs: add inode store, files and dirs
> guestmemfs: add persistent data block allocator
> guestmemfs: support file truncation
> guestmemfs: add file mmap callback
> kexec/kho: Add addr flag to not initialise memory
> guestmemfs: Persist filesystem metadata via KHO
> guestmemfs: Block modifications when serialised
> guestmemfs: Add documentation and usage instructions
> MAINTAINERS: Add maintainers for guestmemfs
>
> Documentation/filesystems/guestmemfs.rst | 87 +++++++
> MAINTAINERS | 8 +
> arch/x86/mm/init_64.c | 2 +
> fs/Kconfig | 1 +
> fs/Makefile | 1 +
> fs/guestmemfs/Kconfig | 11 +
> fs/guestmemfs/Makefile | 8 +
> fs/guestmemfs/allocator.c | 40 +++
> fs/guestmemfs/dir.c | 43 ++++
> fs/guestmemfs/file.c | 106 ++++++++
> fs/guestmemfs/guestmemfs.c | 160 ++++++++++++
> fs/guestmemfs/guestmemfs.h | 60 +++++
> fs/guestmemfs/inode.c | 189 ++++++++++++++
> fs/guestmemfs/serialise.c | 302 +++++++++++++++++++++++
> include/linux/guestmemfs.h | 16 ++
> include/uapi/linux/kexec.h | 6 +
> kernel/kexec_kho_in.c | 12 +-
> kernel/kexec_kho_out.c | 4 +
> 18 files changed, 1055 insertions(+), 1 deletion(-)
> create mode 100644 Documentation/filesystems/guestmemfs.rst
> create mode 100644 fs/guestmemfs/Kconfig
> create mode 100644 fs/guestmemfs/Makefile
> create mode 100644 fs/guestmemfs/allocator.c
> create mode 100644 fs/guestmemfs/dir.c
> create mode 100644 fs/guestmemfs/file.c
> create mode 100644 fs/guestmemfs/guestmemfs.c
> create mode 100644 fs/guestmemfs/guestmemfs.h
> create mode 100644 fs/guestmemfs/inode.c
> create mode 100644 fs/guestmemfs/serialise.c
> create mode 100644 include/linux/guestmemfs.h
>
> --
> 2.34.1
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 00/10] Introduce guestmemfs: persistent in-memory filesystem
2024-08-05 20:01 ` Jan Kara
@ 2024-08-05 23:29 ` Jason Gunthorpe
2024-08-06 8:26 ` Gowans, James
2024-08-06 8:12 ` Gowans, James
1 sibling, 1 reply; 35+ messages in thread
From: Jason Gunthorpe @ 2024-08-05 23:29 UTC (permalink / raw)
To: Jan Kara
Cc: James Gowans, linux-kernel, Sean Christopherson, Paolo Bonzini,
Alexander Viro, Steve Sistare, Christian Brauner, Anthony Yznaga,
Mike Rapoport, Andrew Morton, linux-mm, linux-fsdevel, Usama Arif,
kvm, Alexander Graf, David Woodhouse, Paul Durrant,
Nicolas Saenz Julienne, Muchun Song
On Mon, Aug 05, 2024 at 10:01:51PM +0200, Jan Kara wrote:
> > 4. Device assignment: being able to use guestmemfs memory for
> > VFIO/iommufd mappings, and allow those mappings to survive and continue
> > to be used across kexec.
That's a fun one. Proposals for that will be very interesting!
> To me the basic functionality resembles hugetlbfs a lot. Now I know very
> little about the details of hugetlbfs so I've added relevant folks to CC. Have you
> considered extending hugetlbfs with the functionality you need (such as
> preservation across kexec) instead of implementing a completely new filesystem?
In mm circles we've broadly been talking about splitting the "memory
provider" part out of hugetlbfs into its own layer. This would include
the carving out of kernel memory at boot and organizing it by page
size to allow huge ptes.
It would make a lot of sense to have only one carve out mechanism, and
several consumers - hugetlbfs, the new private guestmemfd, this thing,
for example.
Jason
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 00/10] Introduce guestmemfs: persistent in-memory filesystem
2024-08-05 23:29 ` Jason Gunthorpe
@ 2024-08-06 8:26 ` Gowans, James
0 siblings, 0 replies; 35+ messages in thread
From: Gowans, James @ 2024-08-06 8:26 UTC (permalink / raw)
To: jack@suse.cz, jgg@ziepe.ca
Cc: kvm@vger.kernel.org, rppt@kernel.org, brauner@kernel.org,
Graf (AWS), Alexander, anthony.yznaga@oracle.com,
steven.sistare@oracle.com, akpm@linux-foundation.org,
linux-kernel@vger.kernel.org, Durrant, Paul, seanjc@google.com,
pbonzini@redhat.com, linux-mm@kvack.org, Woodhouse, David,
Saenz Julienne, Nicolas, muchun.song@linux.dev,
viro@zeniv.linux.org.uk, nh-open-source@amazon.com,
linux-fsdevel@vger.kernel.org
On Mon, 2024-08-05 at 20:29 -0300, Jason Gunthorpe wrote:
>
> On Mon, Aug 05, 2024 at 10:01:51PM +0200, Jan Kara wrote:
>
> > > 4. Device assignment: being able to use guestmemfs memory for
> > > VFIO/iommufd mappings, and allow those mappings to survive and continue
> > > to be used across kexec.
>
> That's a fun one. Proposals for that will be very interesting!
Yup! We have an LPC session for this; looking forward to discussing more
there: https://lpc.events/event/18/contributions/1686/
I'll be working on an iommufd RFC soon; should get it out before then.
>
> > To me the basic functionality resembles hugetlbfs a lot. Now I know very
> > little about the details of hugetlbfs so I've added relevant folks to CC. Have you
> > considered extending hugetlbfs with the functionality you need (such as
> > preservation across kexec) instead of implementing a completely new filesystem?
>
> In mm circles we've broadly been talking about splitting the "memory
> provider" part out of hugetlbfs into its own layer. This would include
> the carving out of kernel memory at boot and organizing it by page
> size to allow huge ptes.
>
> It would make a lot of sense to have only one carve out mechanism, and
> several consumers - hugetlbfs, the new private guestmemfd, this thing,
> for example.
The actual allocation in guestmemfs isn't too complex, basically just a
hook in mem_init() (that's a bit yucky as it's arch-specific) and then a
call to the memblock allocator.
That being said, the functionality for this patch series is currently
intentionally limited: missing NUMA support, and only doing PMD (2 MiB)
block allocations for files - we want PUD (1 GiB) where possible falling
back to splitting to 2 MiB for smaller files. That will complicate
things, so perhaps a memory provider will be useful when this gets more
functionally complete. Keen to hear more!
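To give a feel for the scale of it, the data-block allocator can
conceptually be as small as a bitmap over fixed-size blocks, something
like this (a simplified sketch; names, locking and the persisted layout
are glossed over):

#include <linux/bitmap.h>
#include <linux/bitops.h>
#include <linux/errno.h>

/* Sketch: hand out 2 MiB blocks from the reserved region via a bitmap
 * which itself lives in persisted memory, so it survives kexec via KHO. */
struct guestmemfs_allocator {
	unsigned long *bitmap;		/* one bit per 2 MiB block */
	unsigned long num_blocks;
};

static long guestmemfs_alloc_block(struct guestmemfs_allocator *alloc)
{
	unsigned long blk = find_first_zero_bit(alloc->bitmap,
						alloc->num_blocks);

	if (blk >= alloc->num_blocks)
		return -ENOSPC;
	set_bit(blk, alloc->bitmap);
	/* a file offset is then backed by physical block 'blk' */
	return blk;
}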
JG
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 00/10] Introduce guestmemfs: persistent in-memory filesystem
2024-08-05 20:01 ` Jan Kara
2024-08-05 23:29 ` Jason Gunthorpe
@ 2024-08-06 8:12 ` Gowans, James
2024-08-06 13:43 ` David Hildenbrand
1 sibling, 1 reply; 35+ messages in thread
From: Gowans, James @ 2024-08-06 8:12 UTC (permalink / raw)
To: jack@suse.cz, muchun.song@linux.dev
Cc: kvm@vger.kernel.org, rppt@kernel.org, brauner@kernel.org,
Graf (AWS), Alexander, anthony.yznaga@oracle.com,
steven.sistare@oracle.com, akpm@linux-foundation.org,
linux-kernel@vger.kernel.org, Durrant, Paul, seanjc@google.com,
pbonzini@redhat.com, linux-mm@kvack.org, Woodhouse, David,
Saenz Julienne, Nicolas, viro@zeniv.linux.org.uk,
nh-open-source@amazon.com, linux-fsdevel@vger.kernel.org,
jgg@ziepe.ca
On Mon, 2024-08-05 at 22:01 +0200, Jan Kara wrote:
>
> On Mon 05-08-24 11:32:35, James Gowans wrote:
> > In this patch series a new in-memory filesystem designed specifically
> > for live update is implemented. Live update is a mechanism to support
> > updating a hypervisor in a way that has limited impact to running
> > virtual machines. This is done by pausing/serialising running VMs,
> > kexec-ing into a new kernel, starting new VMM processes and then
> > deserialising/resuming the VMs so that they continue running from where
> > they were. To support this, guest memory needs to be preserved.
> >
> > Guestmemfs implements preservation across kexec by carving out a large
> > contiguous block of host system RAM early in boot which is then used as
> > the data for the guestmemfs files. As well as preserving that large
> > block of data memory across kexec, the filesystem metadata is preserved
> > via the Kexec Hand Over (KHO) framework (still under review):
> > https://lore.kernel.org/all/20240117144704.602-1-graf@amazon.com/
> >
> > Filesystem metadata is structured to make preservation across kexec
> > easy: inodes are one large contiguous array, and each inode has a
> > "mappings" block which defines which block from the filesystem data
> > memory corresponds to which offset in the file.
> >
> > There are additional constraints/requirements which guestmemfs aims to
> > meet:
> >
> > 1. Secret hiding: all filesystem data is removed from the kernel direct
> > map so it is immune from speculative access. read()/write() are not supported;
> > the only way to get at the data is via mmap.
> >
> > 2. Struct page overhead elimination: the memory is not managed by the
> > buddy allocator and hence has no struct pages.
> >
> > 3. PMD and PUD level allocations for TLB performance: guestmemfs
> > allocates PMD-sized pages to back files which improves TLB perf (caveat
> > below!). PUD size allocations are a next step.
> >
> > 4. Device assignment: being able to use guestmemfs memory for
> > VFIO/iommufd mappings, and allow those mappings to survive and continue
> > to be used across kexec.
>
> To me the basic functionality resembles hugetlbfs a lot. Now I know very
> little about the details of hugetlbfs so I've added relevant folks to CC. Have you
> considered extending hugetlbfs with the functionality you need (such as
> preservation across kexec) instead of implementing a completely new filesystem?
Oof, I forgot to mention hugetlbfs in the cover letter - thanks for
raising this! Indeed, there are similarities: in-memory fs, with
huge/gigantic allocations.
We did consider extending hugetlbfs to support persistence, but there
are differences in requirements which we're not sure would be practical
or desirable to add to hugetlbfs.
1. Secret hiding: with guestmemfs all of the memory is out of the kernel
direct map as an additional defence mechanism. This means no
read()/write() syscalls to guestmemfs files, and no IO to it. The only
way to access it is to mmap the file.
2. No struct page overhead: the intended use case is for systems whose
sole job is to be a hypervisor, typically for large (multi-GiB) VMs, so
the majority of system RAM would be donated to this fs. We definitely
don't want 4 KiB struct pages here as it would be a significant
overhead. That's why guestmemfs carves the memory out in early boot and
sets memblock flags to avoid struct page allocation. I don't know if
hugetlbfs does anything fancy to avoid allocating PTE-level struct pages
for its memory?
3. guest_memfd interface: For confidential computing use-cases we need
to provide a guest_memfd style interface so that these FDs can be used
as a guest_memfd file in KVM memslots. Would there be interest in
extending hugetlbfs to also support a guest_memfd style interface?
4. Metadata designed for persistence: guestmemfs will need to keep
simple internal metadata data structures (limited allocations, limited
fragmentation) so that pages can easily and efficiently be marked as
persistent via KHO. Something like slab allocations would probably be a
no-go as then we'd need to persist and reconstruct the slab allocator. I
don't know how hugetlbfs structures its fs metadata but I'm guessing it
uses the slab and does lots of small allocations so trying to retrofit
persistence via KHO to it may be challenging.
5. Integration with persistent IOMMU mappings: to keep DMA running
across kexec, iommufd needs to know that the backing memory for an IOAS
is persistent too. The idea is to do some DMA pinning of persistent
files, which would require iommufd/guestmemfs integration - would we
want to add this to hugetlbfs?
6. Virtualisation-specific APIs: starting to get a bit esoteric here,
but use-cases like being able to carve out specific chunks of memory
from a running VM and turn it into memory for another side car VM, or
doing post-copy LM via DMA by mapping memory into the IOMMU but taking
page faults on the CPU. This may require virtualisation-specific ioctls
on the files which wouldn't be generally applicable to hugetlbfs.
7. NUMA control: a requirement is to always have correct NUMA affinity.
While currently not implemented the idea is to extend the guestmemfs
allocation to support specifying allocation sizes from each NUMA node at
early boot, and then having multiple mount points, one per NUMA node (or
something like that...). Unclear if this is something hugetlbfs would
want.
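Coming back to point 4 above, the sort of persistence-friendly layout
guestmemfs is aiming for can be pictured roughly like this (a sketch
only; field names and sizes are illustrative, not the actual on-memory
format):

/* Sketch: all metadata lives in a few flat, contiguous regions so KHO can
 * hand the whole lot over at kexec -- no slab objects to walk or rebuild. */
#define GUESTMEMFS_FILENAME_LEN	 64
#define GUESTMEMFS_MAX_MAPPINGS	512

struct guestmemfs_mappings {
	/* block[i] = data-block index backing file offset i * block size */
	unsigned long block[GUESTMEMFS_MAX_MAPPINGS];
};

struct guestmemfs_pinode {
	char filename[GUESTMEMFS_FILENAME_LEN];
	unsigned long size;
	unsigned long mappings_block;	/* which mappings block this inode owns */
};

struct guestmemfs_psb {
	unsigned long num_inodes;
	struct guestmemfs_pinode inodes[];	/* one large contiguous array */
};

A layout like that can be described to KHO as a handful of physical
ranges, rather than having to serialise scattered slab allocations.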
There are probably more potential issues, but those are the ones that
come to mind... That being said, if hugetlbfs maintainers are interested
in going in this direction then we can definitely look at enhancing
hugetlbfs.
I think there are two types of problems: "Would hugetlbfs want this
functionality?" - that's the majority. And a few are "This would be hard
with hugetlbfs!" - persistence probably falls into this category.
Looking forward to input from maintainers. :-)
JG
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 00/10] Introduce guestmemfs: persistent in-memory filesystem
2024-08-06 8:12 ` Gowans, James
@ 2024-08-06 13:43 ` David Hildenbrand
0 siblings, 0 replies; 35+ messages in thread
From: David Hildenbrand @ 2024-08-06 13:43 UTC (permalink / raw)
To: Gowans, James, jack@suse.cz, muchun.song@linux.dev
Cc: kvm@vger.kernel.org, rppt@kernel.org, brauner@kernel.org,
Graf (AWS), Alexander, anthony.yznaga@oracle.com,
steven.sistare@oracle.com, akpm@linux-foundation.org,
linux-kernel@vger.kernel.org, Durrant, Paul, seanjc@google.com,
pbonzini@redhat.com, linux-mm@kvack.org, Woodhouse, David,
Saenz Julienne, Nicolas, viro@zeniv.linux.org.uk,
nh-open-source@amazon.com, linux-fsdevel@vger.kernel.org,
jgg@ziepe.ca
> 1. Secret hiding: with guestmemfs all of the memory is out of the kernel
> direct map as an additional defence mechanism. This means no
> read()/write() syscalls to guestmemfs files, and no IO to it. The only
> way to access it is to mmap the file.
There are people interested into similar things for guest_memfd.
>
> 2. No struct page overhead: the intended use case is for systems whose
> sole job is to be a hypervisor, typically for large (multi-GiB) VMs, so
> the majority of system RAM would be donated to this fs. We definitely
> don't want 4 KiB struct pages here as it would be a significant
> overhead. That's why guestmemfs carves the memory out in early boot and
> sets memblock flags to avoid struct page allocation. I don't know if
> hugetlbfs does anything fancy to avoid allocating PTE-level struct pages
> for its memory?
Sure, it's called HVO and can optimize out a significant portion of the
vmemmap.
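If I remember the knobs correctly, it's gated on
CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP and toggled with either of:

	hugetlb_free_vmemmap=on                 # kernel boot parameter
	sysctl vm.hugetlb_optimize_vmemmap=1    # runtime toggle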
>
> 3. guest_memfd interface: For confidential computing use-cases we need
> to provide a guest_memfd style interface so that these FDs can be used
> as a guest_memfd file in KVM memslots. Would there be interest in
> extending hugetlbfs to also support a guest_memfd style interface?
>
"Extending hugetlbfs" sounds wrong; hugetlbfs is a blast from the past
and not something people are particularly keen to extend for such use
cases. :)
Instead, as Jason said, we're looking into letting guest_memfd own and
manage large chunks of contiguous memory.
> 4. Metadata designed for persistence: guestmemfs will need to keep
> simple internal metadata data structures (limited allocations, limited
> fragmentation) so that pages can easily and efficiently be marked as
> persistent via KHO. Something like slab allocations would probably be a
> no-go as then we'd need to persist and reconstruct the slab allocator. I
> don't know how hugetlbfs structures its fs metadata but I'm guessing it
> uses the slab and does lots of small allocations so trying to retrofit
> persistence via KHO to it may be challenging.
>
> 5. Integration with persistent IOMMU mappings: to keep DMA running
> across kexec, iommufd needs to know that the backing memory for an IOAS
> is persistent too. The idea is to do some DMA pinning of persistent
> files, which would require iommufd/guestmemfs integration - would we
> want to add this to hugetlbfs?
>
> 6. Virtualisation-specific APIs: starting to get a bit esoteric here,
> but use-cases like being able to carve out specific chunks of memory
> from a running VM and turn it into memory for another side car VM, or
> doing post-copy LM via DMA by mapping memory into the IOMMU but taking
> page faults on the CPU. This may require virtualisation-specific ioctls
> on the files which wouldn't be generally applicable to hugetlbfs.
>
> 7. NUMA control: a requirement is to always have correct NUMA affinity.
> While currently not implemented the idea is to extend the guestmemfs
> allocation to support specifying allocation sizes from each NUMA node at
> early boot, and then having multiple mount points, one per NUMA node (or
> something like that...). Unclear if this is something hugetlbfs would
> want.
>
> There are probably more potential issues, but those are the ones that
> come to mind... That being said, if hugetlbfs maintainers are interested
> in going in this direction then we can definitely look at enhancing
> hugetlbfs.
>
> I think there are two types of problems: "Would hugetlbfs want this
> functionality?" - that's the majority. And a few are "This would be hard
> with hugetlbfs!" - persistence probably falls into this category.
I'm rather asking myself whether you should instead teach/extend the
guest_memfd concept with some of what you propose here.
At least "guest_memfd" sounds a lot like the "anonymous fd" based
variant of guestmemfs ;)
Like we have hugetlbfs and memfd with hugetlb pages.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 00/10] Introduce guestmemfs: persistent in-memory filesystem
2024-08-05 9:32 [PATCH 00/10] Introduce guestmemfs: persistent in-memory filesystem James Gowans
` (11 preceding siblings ...)
2024-08-05 20:01 ` Jan Kara
@ 2024-08-07 23:45 ` David Matlack
2024-10-17 4:53 ` Vishal Annapurve
13 siblings, 0 replies; 35+ messages in thread
From: David Matlack @ 2024-08-07 23:45 UTC (permalink / raw)
To: James Gowans
Cc: linux-kernel, Sean Christopherson, Paolo Bonzini, Alexander Viro,
Steve Sistare, Christian Brauner, Jan Kara, Anthony Yznaga,
Mike Rapoport, Andrew Morton, linux-mm, Jason Gunthorpe,
linux-fsdevel, Usama Arif, kvm, Alexander Graf, David Woodhouse,
Paul Durrant, Nicolas Saenz Julienne, James Houghton
Hi James,
On Mon, Aug 5, 2024 at 2:33 AM James Gowans <jgowans@amazon.com> wrote:
>
> In this patch series a new in-memory filesystem designed specifically
> for live update is implemented. Live update is a mechanism to support
> updating a hypervisor in a way that has limited impact to running
> virtual machines. This is done by pausing/serialising running VMs,
> kexec-ing into a new kernel, starting new VMM processes and then
> deserialising/resuming the VMs so that they continue running from where
> they were. To support this, guest memory needs to be preserved.
How do you envision VM state (or other userspace state) being
preserved? I guess it could just be regular files on this filesystem
but I wonder if that would become inefficient if the files are
(eventually) backed with PUD-sized allocations.
>
> Guestmemfs implements preservation across kexec by carving out a large
> contiguous block of host system RAM early in boot which is then used as
> the data for the guestmemfs files. As well as preserving that large
> block of data memory across kexec, the filesystem metadata is preserved
> via the Kexec Hand Over (KHO) framework (still under review):
> https://lore.kernel.org/all/20240117144704.602-1-graf@amazon.com/
>
> Filesystem metadata is structured to make preservation across kexec
> easy: inodes are one large contiguous array, and each inode has a
> "mappings" block which defines which block from the filesystem data
> memory corresponds to which offset in the file.
>
> There are additional constraints/requirements which guestmemfs aims to
> meet:
>
> 1. Secret hiding: all filesystem data is removed from the kernel direct
> map so it is immune from speculative access. read()/write() are not supported;
> the only way to get at the data is via mmap.
>
> 2. Struct page overhead elimination: the memory is not managed by the
> buddy allocator and hence has no struct pages.
I'm curious if there are any downsides to eliminating struct pages, e.g.
certain operations/features in the kernel relevant for running VMs
that do not work?
>
> 3. PMD and PUD level allocations for TLB performance: guestmemfs
> allocates PMD-sized pages to back files which improves TLB perf (caveat
> below!). PUD size allocations are a next step.
>
> 4. Device assignment: being able to use guestmemfs memory for
> VFIO/iommufd mappings, and allow those mappings to survive and continue
> to be used across kexec.
>
>
> Next steps
> =========
>
> The idea is that this patch series implements a minimal filesystem to
> provide the foundations for in-memory files that persist across kexec.
> Once this foundation is in place it will be extended:
>
> 1. Improve the filesystem to be more comprehensive - currently it's just
> functional enough to demonstrate the main objective of reserved memory
> and persistence via KHO.
>
> 2. Build support for iommufd IOAS and HWPT persistence, and integrate
> that with guestmemfs. The idea is that if VMs have DMA devices assigned
> to them, DMA should continue running across kexec. A future patch series
> will add support for this in iommufd and connect iommufd to guestmemfs
> so that guestmemfs files can remain mapped into the IOMMU during kexec.
>
> 3. Support a guest_memfd interface to files so that they can be used for
> confidential computing without needing to mmap into userspace.
>
> 4. Gigantic PUD level mappings for even better TLB perf.
>
> Caveats
> =======
>
> There are a few issues with the current implementation which should be
> solved either in this patch series or soon in follow-on work:
>
> 1. Although PMD-size allocations are done, PTE-level page tables are
> still created. This is because guestmemfs uses remap_pfn_range() to set
> up userspace pgtables. Currently remap_pfn_range() only creates
> PTE-level mappings. I suggest enhancing remap_pfn_range() to support
> creating higher level mappings where possible, by adding pmd_special
> and pud_special flags.
This might actually be beneficial.
Creating PTEs for userspace mappings would make it possible for UserfaultFD to
intercept at PAGE_SIZE granularity. A big pain point for Google with
using HugeTLB is the inability to use UserfaultFD to intercept at
PAGE_SIZE for the post-copy phase of VM Live Migration.
As long as the memory is physically contiguous it should be possible
for KVM to still map it into the guest with PMD or PUD mappings.
KVM/arm64 already even has support for that for VM_PFNMAP VMAs.
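For context, the remap_pfn_range()-based mmap path being discussed is
roughly this shape (a simplified sketch, not the actual patch; the PFN
lookup helper is illustrative):

/* Sketch: map a file's physically contiguous backing region to userspace.
 * remap_pfn_range() currently populates PTE-level entries only, which is
 * exactly the limitation (and the UserfaultFD opportunity) noted above. */
static int guestmemfs_mmap(struct file *file, struct vm_area_struct *vma)
{
	unsigned long size = vma->vm_end - vma->vm_start;
	unsigned long pfn = guestmemfs_file_pfn(file);	/* illustrative helper */

	return remap_pfn_range(vma, vma->vm_start, pfn, size,
			       vma->vm_page_prot);
}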
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 00/10] Introduce guestmemfs: persistent in-memory filesystem
2024-08-05 9:32 [PATCH 00/10] Introduce guestmemfs: persistent in-memory filesystem James Gowans
` (12 preceding siblings ...)
2024-08-07 23:45 ` David Matlack
@ 2024-10-17 4:53 ` Vishal Annapurve
2024-11-01 12:53 ` Gowans, James
13 siblings, 1 reply; 35+ messages in thread
From: Vishal Annapurve @ 2024-10-17 4:53 UTC (permalink / raw)
To: James Gowans
Cc: linux-kernel, Sean Christopherson, Paolo Bonzini, Alexander Viro,
Steve Sistare, Christian Brauner, Jan Kara, Anthony Yznaga,
Mike Rapoport, Andrew Morton, linux-mm, Jason Gunthorpe,
linux-fsdevel, Usama Arif, kvm, Alexander Graf, David Woodhouse,
Paul Durrant, Nicolas Saenz Julienne
On Mon, Aug 5, 2024 at 3:03 PM James Gowans <jgowans@amazon.com> wrote:
>
> In this patch series a new in-memory filesystem designed specifically
> for live update is implemented. Live update is a mechanism to support
> updating a hypervisor in a way that has limited impact to running
> virtual machines. This is done by pausing/serialising running VMs,
> kexec-ing into a new kernel, starting new VMM processes and then
> deserialising/resuming the VMs so that they continue running from where
> they were. To support this, guest memory needs to be preserved.
>
> Guestmemfs implements preservation across kexec by carving out a large
> contiguous block of host system RAM early in boot which is then used as
> the data for the guestmemfs files. As well as preserving that large
> block of data memory across kexec, the filesystem metadata is preserved
> via the Kexec Hand Over (KHO) framework (still under review):
> https://lore.kernel.org/all/20240117144704.602-1-graf@amazon.com/
>
> Filesystem metadata is structured to make preservation across kexec
> easy: inodes are one large contiguous array, and each inode has a
> "mappings" block which defines which block from the filesystem data
> memory corresponds to which offset in the file.
>
> There are additional constraints/requirements which guestmemfs aims to
> meet:
>
> 1. Secret hiding: all filesystem data is removed from the kernel direct
> map so it is immune from speculative access. read()/write() are not supported;
> the only way to get at the data is via mmap.
>
> 2. Struct page overhead elimination: the memory is not managed by the
> buddy allocator and hence has no struct pages.
>
> 3. PMD and PUD level allocations for TLB performance: guestmemfs
> allocates PMD-sized pages to back files which improves TLB perf (caveat
> below!). PUD size allocations are a next step.
>
> 4. Device assignment: being able to use guestmemfs memory for
> VFIO/iommufd mappings, and allow those mappings to survive and continue
> to be used across kexec.
>
>
> Next steps
> =========
>
> The idea is that this patch series implements a minimal filesystem to
> provide the foundations for in-memory files that persist across kexec.
> Once this foundation is in place it will be extended:
>
> 1. Improve the filesystem to be more comprehensive - currently it's just
> functional enough to demonstrate the main objective of reserved memory
> and persistence via KHO.
>
> 2. Build support for iommufd IOAS and HWPT persistence, and integrate
> that with guestmemfs. The idea is that if VMs have DMA devices assigned
> to them, DMA should continue running across kexec. A future patch series
> will add support for this in iommufd and connect iommufd to guestmemfs
> so that guestmemfs files can remain mapped into the IOMMU during kexec.
>
> 3. Support a guest_memfd interface to files so that they can be used for
> confidential computing without needing to mmap into userspace.
I am guessing this goal was before we discussed the need to support
mmap on guest_memfd for confidential computing use cases to support
hugepages [1]. This series [1] as of today tries to leverage hugetlb
allocator functionality to allocate huge pages which seems to be along
the lines of what you are aiming for. There are also discussions to
support NUMA mempolicy [2] for guest memfd. In order to use
guest_memfd to back non-confidential VMs with hugepages, core-mm will
need to support PMD/PUD level mappings in future.
David H's suggestion from the other thread to extend guest_memfd to
support guest memory persistence over kexec instead of introducing
guestmemfs as a parallel subsystem seems appealing to me.
[1] https://lore.kernel.org/kvm/cover.1726009989.git.ackerleytng@google.com/T/
[2] https://lore.kernel.org/kvm/47476c27-897c-4487-bcd2-7ef6ec089dd1@amd.com/T/
>
> 4. Gigantic PUD level mappings for even better TLB perf.
>
> Caveats
> =======
>
> There are a few issues with the current implementation which should be
> solved either in this patch series or soon in follow-on work:
>
> 1. Although PMD-size allocations are done, PTE-level page tables are
> still created. This is because guestmemfs uses remap_pfn_range() to set
> up userspace pgtables. Currently remap_pfn_range() only creates
> PTE-level mappings. I suggest enhancing remap_pfn_range() to support
> creating higher level mappings where possible, by adding pmd_special
> and pud_special flags.
>
> 2. NUMA support is currently non-existent. To make this more generally
> useful it's necessary to have NUMA-awareness. One thought on how to do
> this is to be able to specify multiple allocations with NUMA affinity
> on the kernel cmdline and have multiple mount points, one per NUMA node.
> Currently, for simplicity, only a single contiguous filesystem data
> allocation and a single mount point is supported.
>
> 3. MCEs are currently not handled - we need to add functionality for
> this to be able to track block ownership and deliver an MCE correctly.
>
> 4. Looking for reviews from filesystem experts to see if necessary
> callbacks, refcounting, locking, etc, is done correctly.
>
> Open questions
> ==============
>
> It is not too clear if or how guestmemfs should use DAX as a source of
> memory. Seeing as guestmemfs has an in-memory design, it seems that it
> is not necessary to use DAX as a source of memory, but I am keen for
> guidance/input on whether DAX should be used here.
>
> The filesystem data memory is removed from the direct map for secret
> hiding, but it is still necessary to mmap it to be accessible to KVM.
> For improving secret hiding even more a guest_memfd-style interface
> could be used to remove the need to mmap. That introduces a new problem
> of the memory being completely inaccessible to KVM for things like MMIO
> instruction emulation. How can this be handled?
>
> Related Work
> ============
>
> There are similarities to a few attempts at solving aspects of this
> problem previously.
>
> The original was probably PKRAM from Oracle; a tmpfs filesystem with
> persistence:
> https://lore.kernel.org/kexec/1682554137-13938-1-git-send-email-anthony.yznaga@oracle.com/
> guestmemfs will additionally provide secret hiding, PMD/PUD allocations
> and a path to DMA persistence and NUMA support.
>
> Dmemfs from Tencent aimed to remove the need for struct page overhead:
> https://lore.kernel.org/kvm/cover.1602093760.git.yuleixzhang@tencent.com/
> Guestmemfs provides this benefit too, along with persistence across
> kexec and secret hiding.
>
> Pkernfs attempted to solve guest memory persistence and IOMMU
> persistence all in one:
> https://lore.kernel.org/all/20240205120203.60312-1-jgowans@amazon.com/
> Guestmemfs is a re-work of that to only persist guest RAM in the
> filesystem, and to use KHO for filesystem metadata. IOMMU persistence
> will be implemented independently with persistent iommufd domains via
> KHO.
>
> Testing
> =======
>
> The testing for this can be seen in the Documentation file in this patch
> series. Essentially it is using a guestmemfs file for a QEMU VM's RAM,
> doing a kexec, restoring the QEMU VM and confirming that the VM picked
> up from where it left off.
>
> James Gowans (10):
> guestmemfs: Introduce filesystem skeleton
> guestmemfs: add inode store, files and dirs
> guestmemfs: add persistent data block allocator
> guestmemfs: support file truncation
> guestmemfs: add file mmap callback
> kexec/kho: Add addr flag to not initialise memory
> guestmemfs: Persist filesystem metadata via KHO
> guestmemfs: Block modifications when serialised
> guestmemfs: Add documentation and usage instructions
> MAINTAINERS: Add maintainers for guestmemfs
>
> Documentation/filesystems/guestmemfs.rst | 87 +++++++
> MAINTAINERS | 8 +
> arch/x86/mm/init_64.c | 2 +
> fs/Kconfig | 1 +
> fs/Makefile | 1 +
> fs/guestmemfs/Kconfig | 11 +
> fs/guestmemfs/Makefile | 8 +
> fs/guestmemfs/allocator.c | 40 +++
> fs/guestmemfs/dir.c | 43 ++++
> fs/guestmemfs/file.c | 106 ++++++++
> fs/guestmemfs/guestmemfs.c | 160 ++++++++++++
> fs/guestmemfs/guestmemfs.h | 60 +++++
> fs/guestmemfs/inode.c | 189 ++++++++++++++
> fs/guestmemfs/serialise.c | 302 +++++++++++++++++++++++
> include/linux/guestmemfs.h | 16 ++
> include/uapi/linux/kexec.h | 6 +
> kernel/kexec_kho_in.c | 12 +-
> kernel/kexec_kho_out.c | 4 +
> 18 files changed, 1055 insertions(+), 1 deletion(-)
> create mode 100644 Documentation/filesystems/guestmemfs.rst
> create mode 100644 fs/guestmemfs/Kconfig
> create mode 100644 fs/guestmemfs/Makefile
> create mode 100644 fs/guestmemfs/allocator.c
> create mode 100644 fs/guestmemfs/dir.c
> create mode 100644 fs/guestmemfs/file.c
> create mode 100644 fs/guestmemfs/guestmemfs.c
> create mode 100644 fs/guestmemfs/guestmemfs.h
> create mode 100644 fs/guestmemfs/inode.c
> create mode 100644 fs/guestmemfs/serialise.c
> create mode 100644 include/linux/guestmemfs.h
>
> --
> 2.34.1
>
>
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 00/10] Introduce guestmemfs: persistent in-memory filesystem
2024-10-17 4:53 ` Vishal Annapurve
@ 2024-11-01 12:53 ` Gowans, James
0 siblings, 0 replies; 35+ messages in thread
From: Gowans, James @ 2024-11-01 12:53 UTC (permalink / raw)
To: vannapurve@google.com
Cc: kvm@vger.kernel.org, rppt@kernel.org, brauner@kernel.org,
Graf (AWS), Alexander, anthony.yznaga@oracle.com,
steven.sistare@oracle.com, akpm@linux-foundation.org,
linux-kernel@vger.kernel.org, seanjc@google.com, Woodhouse, David,
pbonzini@redhat.com, linux-mm@kvack.org, Saenz Julienne, Nicolas,
Durrant, Paul, viro@zeniv.linux.org.uk, jack@suse.cz,
linux-fsdevel@vger.kernel.org, jgg@ziepe.ca,
usama.arif@bytedance.com
On Thu, 2024-10-17 at 10:23 +0530, Vishal Annapurve wrote:
> On Mon, Aug 5, 2024 at 3:03 PM James Gowans <jgowans@amazon.com> wrote:
> >
> > In this patch series a new in-memory filesystem designed specifically
> > for live update is implemented. Live update is a mechanism to support
> > updating a hypervisor in a way that has limited impact to running
> > virtual machines. This is done by pausing/serialising running VMs,
> > kexec-ing into a new kernel, starting new VMM processes and then
> > deserialising/resuming the VMs so that they continue running from where
> > they were. To support this, guest memory needs to be preserved.
> >
> > Guestmemfs implements preservation across kexec by carving out a large
> > contiguous block of host system RAM early in boot which is then used as
> > the data for the guestmemfs files. As well as preserving that large
> > block of data memory across kexec, the filesystem metadata is preserved
> > via the Kexec Hand Over (KHO) framework (still under review):
> > https://lore.kernel.org/all/20240117144704.602-1-graf@amazon.com/
> >
> > Filesystem metadata is structured to make preservation across kexec
> > easy: inodes are one large contiguous array, and each inode has a
> > "mappings" block which defines which block from the filesystem data
> > memory corresponds to which offset in the file.
> >
> > There are additional constraints/requirements which guestmemfs aims to
> > meet:
> >
> > 1. Secret hiding: all filesystem data is removed from the kernel direct
> > map so it is immune from speculative access. read()/write() are not supported;
> > the only way to get at the data is via mmap.
> >
> > 2. Struct page overhead elimination: the memory is not managed by the
> > buddy allocator and hence has no struct pages.
> >
> > 3. PMD and PUD level allocations for TLB performance: guestmemfs
> > allocates PMD-sized pages to back files which improves TLB perf (caveat
> > below!). PUD size allocations are a next step.
> >
> > 4. Device assignment: being able to use guestmemfs memory for
> > VFIO/iommufd mappings, and allow those mappings to survive and continue
> > to be used across kexec.
> >
> >
> > Next steps
> > =========
> >
> > The idea is that this patch series implements a minimal filesystem to
> > provide the foundations for in-memory persistent across kexec files.
> > One this foundation is in place it will be extended:
> >
> > 1. Improve the filesystem to be more comprehensive - currently it's just
> > functional enough to demonstrate the main objective of reserved memory
> > and persistence via KHO.
> >
> > 2. Build support for iommufd IOAS and HWPT persistence, and integrate
> > that with guestmemfs. The idea is that if VMs have DMA devices assigned
> > to them, DMA should continue running across kexec. A future patch series
> > will add support for this in iommufd and connect iommufd to guestmemfs
> > so that guestmemfs files can remain mapped into the IOMMU during kexec.
> >
> > 3. Support a guest_memfd interface to files so that they can be used for
> > confidential computing without needing to mmap into userspace.
>
> I am guessing this goal was before we discussed the need to support
> mmap on guest_memfd for confidential computing use cases to support
> hugepages [1]. This series [1] as of today tries to leverage hugetlb
> allocator functionality to allocate huge pages which seems to be along
> the lines of what you are aiming for. There are also discussions to
> support NUMA mempolicy [2] for guest memfd. In order to use
> guest_memfd to back non-confidential VMs with hugepages, core-mm will
> need to support PMD/PUD level mappings in future.
>
> David H's suggestion from the other thread to extend guest_memfd to
> support guest memory persistence over kexec instead of introducing
> guestmemfs as a parallel subsystem seems appealing to me.
I think there is a lot of overlap with the huge page goals for
guest_memfd. Especially the 1 GiB allocations; that also needs a custom
allocator to be able to allocate chunks from something other than the core
MM buddy allocator. My rough plan is to rebase on top of the 1 GiB
guest_memfd support code, and add guestmemfs as another allocator, very
similar to hugetlbfs 1 GiB allocations.
I still need to engage on the hugetlb(fs?) allocator patch series, but I
think in concept it's all going in the right direction for this
persistence use case too.
JG
>
> [1] https://lore.kernel.org/kvm/cover.1726009989.git.ackerleytng@google.com/T/
> [2] https://lore.kernel.org/kvm/47476c27-897c-4487-bcd2-7ef6ec089dd1@amd.com/T/
^ permalink raw reply [flat|nested] 35+ messages in thread