linux-mm.kvack.org archive mirror
* [PATCH v7 0/7] KVM: Restricted mapping of guest_memfd at the host and arm64 support
@ 2025-03-28 15:31 Fuad Tabba
  2025-03-28 15:31 ` [PATCH v7 1/7] KVM: guest_memfd: Make guest mem use guest mem inodes instead of anonymous inodes Fuad Tabba
                   ` (7 more replies)
  0 siblings, 8 replies; 30+ messages in thread
From: Fuad Tabba @ 2025-03-28 15:31 UTC (permalink / raw)
  To: kvm, linux-arm-msm, linux-mm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, isaku.yamahata, mic,
	vbabka, vannapurve, ackerleytng, mail, david, michael.roth,
	wei.w.wang, liam.merwick, isaku.yamahata, kirill.shutemov,
	suzuki.poulose, steven.price, quic_eberman, quic_mnalajal,
	quic_tsoni, quic_svaddagi, quic_cvanscha, quic_pderrin,
	quic_pheragu, catalin.marinas, james.morse, yuzenghui,
	oliver.upton, maz, will, qperret, keirf, roypat, shuah, hch, jgg,
	rientjes, jhubbard, fvdl, hughd, jthoughton, peterx, pankaj.gupta,
	tabba

This series adds restricted mmap() support to guest_memfd, as well as
support for guest_memfd on arm64. Please see v3 for the context [1].

Main change since v6 [2]:
Protected the shared_offsets array with an rwlock instead of piggybacking
on the invalidate_lock. The main reason for this is that the final put
callback (kvm_gmem_handle_folio_put()) can be called from atomic
context, while the invalidate_lock is an rw_semaphore (Vishal).

The other reason is that, in hindsight, it didn't make sense to use the
invalidate_lock, since the two locks don't really protect the same thing.

I did consider using the folio lock to implicitly protect the array, and
even have another series that does that. The result was more
complicated, and not obviously race free. One of the difficulties with
relying on the folio lock is that there are cases (e.g., on
initialization and teardown) when we want to toggle the state of an
offset that doesn't have a folio in the cache. We could special case
these, but the result was more complicated (and to me not obviously
better) than adding a separate lock for the shared_offsets array.
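
To make that concrete, here is a minimal sketch (illustration only, not
part of the series; the function name is made up, and the real code is
in patches 4 and 5) of why a spinning rwlock works here where the
invalidate_lock, an rw_semaphore, would not:

  /*
   * Sketch: the final put callback may run in atomic context, so it may
   * only take locks that never sleep and must allocate with GFP_ATOMIC
   * (error checking via xa_err() elided for brevity).
   */
  static void gmem_folio_put_sketch(struct kvm_gmem_inode_private *private,
                                    struct folio *folio)
  {
          write_lock(&private->offsets_lock);     /* spins, never sleeps */
          xa_store(&private->shared_offsets, folio->index,
                   xa_mk_value(KVM_GMEM_GUEST_SHARED), GFP_ATOMIC);
          write_unlock(&private->offsets_lock);
  }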

Based on the `KVM: Mapping guest_memfd backed memory at the host for
software protected VMs` series [3], which in turn is based on Linux
6.14-rc7.

A state diagram showing the new states introduced in this patch series,
and how they interact with sharing/unsharing in pKVM, is available at [4].

Cheers,
/fuad

[1] https://lore.kernel.org/all/20241010085930.1546800-1-tabba@google.com/
[2] https://lore.kernel.org/all/20250318162046.4016367-1-tabba@google.com/
[3] https://lore.kernel.org/all/20250318161823.4005529-1-tabba@google.com/
[4] https://lpc.events/event/18/contributions/1758/attachments/1457/3699/Guestmemfd%20folio%20state%20page_type.pdf

Ackerley Tng (2):
  KVM: guest_memfd: Make guest mem use guest mem inodes instead of
    anonymous inodes
  KVM: guest_memfd: Track folio sharing within a struct kvm_gmem_private

Fuad Tabba (5):
  KVM: guest_memfd: Introduce kvm_gmem_get_pfn_locked(), which retains
    the folio lock
  KVM: guest_memfd: Folio sharing states and functions that manage their
    transition
  KVM: guest_memfd: Restore folio state after final folio_put()
  KVM: guest_memfd: Handle invalidation of shared memory
  KVM: guest_memfd: Add a guest_memfd() flag to initialize it as shared

 Documentation/virt/kvm/api.rst                |   4 +
 include/linux/kvm_host.h                      |  56 +-
 include/uapi/linux/kvm.h                      |   1 +
 include/uapi/linux/magic.h                    |   1 +
 .../testing/selftests/kvm/guest_memfd_test.c  |   7 +-
 virt/kvm/guest_memfd.c                        | 613 +++++++++++++++++-
 virt/kvm/kvm_main.c                           |  62 ++
 7 files changed, 706 insertions(+), 38 deletions(-)


base-commit: 62aff24816ac463bf3f754a15b2e7cdff59976ea
-- 
2.49.0.472.ge94155a9ec-goog



^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH v7 1/7] KVM: guest_memfd: Make guest mem use guest mem inodes instead of anonymous inodes
  2025-03-28 15:31 [PATCH v7 0/7] KVM: Restricted mapping of guest_memfd at the host and arm64 support Fuad Tabba
@ 2025-03-28 15:31 ` Fuad Tabba
  2025-04-08 11:53   ` Shivank Garg
  2025-03-28 15:31 ` [PATCH v7 2/7] KVM: guest_memfd: Introduce kvm_gmem_get_pfn_locked(), which retains the folio lock Fuad Tabba
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 30+ messages in thread
From: Fuad Tabba @ 2025-03-28 15:31 UTC (permalink / raw)
  To: kvm, linux-arm-msm, linux-mm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, isaku.yamahata, mic,
	vbabka, vannapurve, ackerleytng, mail, david, michael.roth,
	wei.w.wang, liam.merwick, isaku.yamahata, kirill.shutemov,
	suzuki.poulose, steven.price, quic_eberman, quic_mnalajal,
	quic_tsoni, quic_svaddagi, quic_cvanscha, quic_pderrin,
	quic_pheragu, catalin.marinas, james.morse, yuzenghui,
	oliver.upton, maz, will, qperret, keirf, roypat, shuah, hch, jgg,
	rientjes, jhubbard, fvdl, hughd, jthoughton, peterx, pankaj.gupta,
	tabba

From: Ackerley Tng <ackerleytng@google.com>

Using guest mem inodes allows us to store metadata for the backing
memory on the inode. Metadata will be added in a later patch to support
HugeTLB pages.

Metadata about backing memory should not be stored on the file, since
the file represents a guest_memfd's binding with a struct kvm, and
metadata about backing memory is not unique to a specific binding and
struct kvm.
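
As a sketch of the intended split (illustration only; the variable name
and the per-inode private structure referenced below are only fleshed
out later in this series):

  /* Per-binding state: one struct kvm_gmem per file, tied to a struct kvm. */
  struct kvm_gmem *gmem = file->private_data;

  /* Per-memory metadata: lives on the inode, shared by all bindings of it. */
  void *memory_metadata = file_inode(file)->i_mapping->i_private_data;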

Signed-off-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 include/uapi/linux/magic.h |   1 +
 virt/kvm/guest_memfd.c     | 130 +++++++++++++++++++++++++++++++------
 2 files changed, 111 insertions(+), 20 deletions(-)

diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index bb575f3ab45e..169dba2a6920 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -103,5 +103,6 @@
 #define DEVMEM_MAGIC		0x454d444d	/* "DMEM" */
 #define SECRETMEM_MAGIC		0x5345434d	/* "SECM" */
 #define PID_FS_MAGIC		0x50494446	/* "PIDF" */
+#define GUEST_MEMORY_MAGIC	0x474d454d	/* "GMEM" */
 
 #endif /* __LINUX_MAGIC_H__ */
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index fbf89e643add..844e70c82558 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -1,12 +1,16 @@
 // SPDX-License-Identifier: GPL-2.0
+#include <linux/fs.h>
 #include <linux/backing-dev.h>
 #include <linux/falloc.h>
 #include <linux/kvm_host.h>
+#include <linux/pseudo_fs.h>
 #include <linux/pagemap.h>
 #include <linux/anon_inodes.h>
 
 #include "kvm_mm.h"
 
+static struct vfsmount *kvm_gmem_mnt;
+
 struct kvm_gmem {
 	struct kvm *kvm;
 	struct xarray bindings;
@@ -320,6 +324,38 @@ static pgoff_t kvm_gmem_get_index(struct kvm_memory_slot *slot, gfn_t gfn)
 	return gfn - slot->base_gfn + slot->gmem.pgoff;
 }
 
+static const struct super_operations kvm_gmem_super_operations = {
+	.statfs		= simple_statfs,
+};
+
+static int kvm_gmem_init_fs_context(struct fs_context *fc)
+{
+	struct pseudo_fs_context *ctx;
+
+	if (!init_pseudo(fc, GUEST_MEMORY_MAGIC))
+		return -ENOMEM;
+
+	ctx = fc->fs_private;
+	ctx->ops = &kvm_gmem_super_operations;
+
+	return 0;
+}
+
+static struct file_system_type kvm_gmem_fs = {
+	.name		 = "kvm_guest_memory",
+	.init_fs_context = kvm_gmem_init_fs_context,
+	.kill_sb	 = kill_anon_super,
+};
+
+static void kvm_gmem_init_mount(void)
+{
+	kvm_gmem_mnt = kern_mount(&kvm_gmem_fs);
+	BUG_ON(IS_ERR(kvm_gmem_mnt));
+
+	/* For giggles. Userspace can never map this anyways. */
+	kvm_gmem_mnt->mnt_flags |= MNT_NOEXEC;
+}
+
 #ifdef CONFIG_KVM_GMEM_SHARED_MEM
 static bool kvm_gmem_offset_is_shared(struct file *file, pgoff_t index)
 {
@@ -430,6 +466,8 @@ static struct file_operations kvm_gmem_fops = {
 void kvm_gmem_init(struct module *module)
 {
 	kvm_gmem_fops.owner = module;
+
+	kvm_gmem_init_mount();
 }
 
 static int kvm_gmem_migrate_folio(struct address_space *mapping,
@@ -511,11 +549,79 @@ static const struct inode_operations kvm_gmem_iops = {
 	.setattr	= kvm_gmem_setattr,
 };
 
+static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
+						      loff_t size, u64 flags)
+{
+	const struct qstr qname = QSTR_INIT(name, strlen(name));
+	struct inode *inode;
+	int err;
+
+	inode = alloc_anon_inode(kvm_gmem_mnt->mnt_sb);
+	if (IS_ERR(inode))
+		return inode;
+
+	err = security_inode_init_security_anon(inode, &qname, NULL);
+	if (err) {
+		iput(inode);
+		return ERR_PTR(err);
+	}
+
+	inode->i_private = (void *)(unsigned long)flags;
+	inode->i_op = &kvm_gmem_iops;
+	inode->i_mapping->a_ops = &kvm_gmem_aops;
+	inode->i_mode |= S_IFREG;
+	inode->i_size = size;
+	mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
+	mapping_set_inaccessible(inode->i_mapping);
+	/* Unmovable mappings are supposed to be marked unevictable as well. */
+	WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
+
+	return inode;
+}
+
+static struct file *kvm_gmem_inode_create_getfile(void *priv, loff_t size,
+						  u64 flags)
+{
+	static const char *name = "[kvm-gmem]";
+	struct inode *inode;
+	struct file *file;
+	int err;
+
+	err = -ENOENT;
+	if (!try_module_get(kvm_gmem_fops.owner))
+		goto err;
+
+	inode = kvm_gmem_inode_make_secure_inode(name, size, flags);
+	if (IS_ERR(inode)) {
+		err = PTR_ERR(inode);
+		goto err_put_module;
+	}
+
+	file = alloc_file_pseudo(inode, kvm_gmem_mnt, name, O_RDWR,
+				 &kvm_gmem_fops);
+	if (IS_ERR(file)) {
+		err = PTR_ERR(file);
+		goto err_put_inode;
+	}
+
+	file->f_flags |= O_LARGEFILE;
+	file->private_data = priv;
+
+out:
+	return file;
+
+err_put_inode:
+	iput(inode);
+err_put_module:
+	module_put(kvm_gmem_fops.owner);
+err:
+	file = ERR_PTR(err);
+	goto out;
+}
+
 static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
 {
-	const char *anon_name = "[kvm-gmem]";
 	struct kvm_gmem *gmem;
-	struct inode *inode;
 	struct file *file;
 	int fd, err;
 
@@ -529,32 +635,16 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
 		goto err_fd;
 	}
 
-	file = anon_inode_create_getfile(anon_name, &kvm_gmem_fops, gmem,
-					 O_RDWR, NULL);
+	file = kvm_gmem_inode_create_getfile(gmem, size, flags);
 	if (IS_ERR(file)) {
 		err = PTR_ERR(file);
 		goto err_gmem;
 	}
 
-	file->f_flags |= O_LARGEFILE;
-
-	inode = file->f_inode;
-	WARN_ON(file->f_mapping != inode->i_mapping);
-
-	inode->i_private = (void *)(unsigned long)flags;
-	inode->i_op = &kvm_gmem_iops;
-	inode->i_mapping->a_ops = &kvm_gmem_aops;
-	inode->i_mode |= S_IFREG;
-	inode->i_size = size;
-	mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
-	mapping_set_inaccessible(inode->i_mapping);
-	/* Unmovable mappings are supposed to be marked unevictable as well. */
-	WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
-
 	kvm_get_kvm(kvm);
 	gmem->kvm = kvm;
 	xa_init(&gmem->bindings);
-	list_add(&gmem->entry, &inode->i_mapping->i_private_list);
+	list_add(&gmem->entry, &file_inode(file)->i_mapping->i_private_list);
 
 	fd_install(fd, file);
 	return fd;
-- 
2.49.0.472.ge94155a9ec-goog



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v7 2/7] KVM: guest_memfd: Introduce kvm_gmem_get_pfn_locked(), which retains the folio lock
  2025-03-28 15:31 [PATCH v7 0/7] KVM: Restricted mapping of guest_memfd at the host and arm64 support Fuad Tabba
  2025-03-28 15:31 ` [PATCH v7 1/7] KVM: guest_memfd: Make guest mem use guest mem inodes instead of anonymous inodes Fuad Tabba
@ 2025-03-28 15:31 ` Fuad Tabba
  2025-03-28 15:31 ` [PATCH v7 3/7] KVM: guest_memfd: Track folio sharing within a struct kvm_gmem_private Fuad Tabba
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 30+ messages in thread
From: Fuad Tabba @ 2025-03-28 15:31 UTC (permalink / raw)
  To: kvm, linux-arm-msm, linux-mm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, isaku.yamahata, mic,
	vbabka, vannapurve, ackerleytng, mail, david, michael.roth,
	wei.w.wang, liam.merwick, isaku.yamahata, kirill.shutemov,
	suzuki.poulose, steven.price, quic_eberman, quic_mnalajal,
	quic_tsoni, quic_svaddagi, quic_cvanscha, quic_pderrin,
	quic_pheragu, catalin.marinas, james.morse, yuzenghui,
	oliver.upton, maz, will, qperret, keirf, roypat, shuah, hch, jgg,
	rientjes, jhubbard, fvdl, hughd, jthoughton, peterx, pankaj.gupta,
	tabba

Create a new variant of kvm_gmem_get_pfn(), which retains the folio lock
if it returns successfully. This is needed in subsequent patches to
protect against races when checking whether a folio can be shared with
the host.
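
A hypothetical caller (sketch only, not from this series; the function
name below is made up) would keep the folio locked across its check:

  static int sketch_check_under_folio_lock(struct kvm *kvm,
                                           struct kvm_memory_slot *slot,
                                           gfn_t gfn, kvm_pfn_t *pfn)
  {
          struct page *page;
          int max_order;
          int r;

          r = kvm_gmem_get_pfn_locked(kvm, slot, gfn, pfn, &page, &max_order);
          if (r)
                  return r;

          /*
           * The folio lock is still held here, so anything checked about
           * the folio (e.g., whether it may be shared with the host)
           * cannot change until unlock_page() below.
           */

          unlock_page(page);
          put_page(page);

          return 0;
  }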

Signed-off-by: Fuad Tabba <tabba@google.com>
---
 include/linux/kvm_host.h | 11 +++++++++++
 virt/kvm/guest_memfd.c   | 27 ++++++++++++++++++++-------
 2 files changed, 31 insertions(+), 7 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index ec3bedc18eab..bc73d7426363 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2535,6 +2535,9 @@ static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
 int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
 		     gfn_t gfn, kvm_pfn_t *pfn, struct page **page,
 		     int *max_order);
+int kvm_gmem_get_pfn_locked(struct kvm *kvm, struct kvm_memory_slot *slot,
+			    gfn_t gfn, kvm_pfn_t *pfn, struct page **page,
+			    int *max_order);
 #else
 static inline int kvm_gmem_get_pfn(struct kvm *kvm,
 				   struct kvm_memory_slot *slot, gfn_t gfn,
@@ -2544,6 +2547,14 @@ static inline int kvm_gmem_get_pfn(struct kvm *kvm,
 	KVM_BUG_ON(1, kvm);
 	return -EIO;
 }
+static inline int kvm_gmem_get_pfn_locked(struct kvm *kvm,
+					  struct kvm_memory_slot *slot,
+					  gfn_t gfn, kvm_pfn_t *pfn,
+					  struct page **page, int *max_order)
+{
+	KVM_BUG_ON(1, kvm);
+	return -EIO;
+}
 #endif /* CONFIG_KVM_PRIVATE_MEM */
 
 #ifdef CONFIG_HAVE_KVM_ARCH_GMEM_PREPARE
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 844e70c82558..ac6b8853699d 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -802,9 +802,9 @@ static struct folio *__kvm_gmem_get_pfn(struct file *file,
 	return folio;
 }
 
-int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
-		     gfn_t gfn, kvm_pfn_t *pfn, struct page **page,
-		     int *max_order)
+int kvm_gmem_get_pfn_locked(struct kvm *kvm, struct kvm_memory_slot *slot,
+			    gfn_t gfn, kvm_pfn_t *pfn, struct page **page,
+			    int *max_order)
 {
 	pgoff_t index = kvm_gmem_get_index(slot, gfn);
 	struct file *file = kvm_gmem_get_file(slot);
@@ -824,17 +824,30 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
 	if (!is_prepared)
 		r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
 
-	folio_unlock(folio);
-
-	if (!r)
+	if (!r) {
 		*page = folio_file_page(folio, index);
-	else
+	} else {
+		folio_unlock(folio);
 		folio_put(folio);
+	}
 
 out:
 	fput(file);
 	return r;
 }
+EXPORT_SYMBOL_GPL(kvm_gmem_get_pfn_locked);
+
+int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
+		     gfn_t gfn, kvm_pfn_t *pfn, struct page **page,
+		     int *max_order)
+{
+	int r = kvm_gmem_get_pfn_locked(kvm, slot, gfn, pfn, page, max_order);
+
+	if (!r)
+		unlock_page(*page);
+
+	return r;
+}
 EXPORT_SYMBOL_GPL(kvm_gmem_get_pfn);
 
 #ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
-- 
2.49.0.472.ge94155a9ec-goog



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v7 3/7] KVM: guest_memfd: Track folio sharing within a struct kvm_gmem_private
  2025-03-28 15:31 [PATCH v7 0/7] KVM: Restricted mapping of guest_memfd at the host and arm64 support Fuad Tabba
  2025-03-28 15:31 ` [PATCH v7 1/7] KVM: guest_memfd: Make guest mem use guest mem inodes instead of anonymous inodes Fuad Tabba
  2025-03-28 15:31 ` [PATCH v7 2/7] KVM: guest_memfd: Introduce kvm_gmem_get_pfn_locked(), which retains the folio lock Fuad Tabba
@ 2025-03-28 15:31 ` Fuad Tabba
  2025-04-02 22:25   ` Ackerley Tng
  2025-03-28 15:31 ` [PATCH v7 4/7] KVM: guest_memfd: Folio sharing states and functions that manage their transition Fuad Tabba
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 30+ messages in thread
From: Fuad Tabba @ 2025-03-28 15:31 UTC (permalink / raw)
  To: kvm, linux-arm-msm, linux-mm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, isaku.yamahata, mic,
	vbabka, vannapurve, ackerleytng, mail, david, michael.roth,
	wei.w.wang, liam.merwick, isaku.yamahata, kirill.shutemov,
	suzuki.poulose, steven.price, quic_eberman, quic_mnalajal,
	quic_tsoni, quic_svaddagi, quic_cvanscha, quic_pderrin,
	quic_pheragu, catalin.marinas, james.morse, yuzenghui,
	oliver.upton, maz, will, qperret, keirf, roypat, shuah, hch, jgg,
	rientjes, jhubbard, fvdl, hughd, jthoughton, peterx, pankaj.gupta,
	tabba

From: Ackerley Tng <ackerleytng@google.com>

Track guest_memfd folio sharing state within the inode, since it is a
property of the guest_memfd's memory contents.

The guest_memfd PRIVATE memory attribute is not used for two reasons. It
reflects the userspace expectation for the memory state, and therefore
can be toggled by userspace. Also, although each guest_memfd file has a
1:1 binding with a KVM instance, the plan is to allow multiple files per
inode, e.g. to allow intra-host migration to a new KVM instance, without
destroying guest_memfd.

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Co-developed-by: Vishal Annapurve <vannapurve@google.com>
Signed-off-by: Vishal Annapurve <vannapurve@google.com>
Co-developed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
---
 virt/kvm/guest_memfd.c | 58 ++++++++++++++++++++++++++++++++++++++----
 1 file changed, 53 insertions(+), 5 deletions(-)

diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index ac6b8853699d..cde16ed3b230 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -17,6 +17,18 @@ struct kvm_gmem {
 	struct list_head entry;
 };
 
+struct kvm_gmem_inode_private {
+#ifdef CONFIG_KVM_GMEM_SHARED_MEM
+	struct xarray shared_offsets;
+	rwlock_t offsets_lock;
+#endif
+};
+
+static struct kvm_gmem_inode_private *kvm_gmem_private(struct inode *inode)
+{
+	return inode->i_mapping->i_private_data;
+}
+
 #ifdef CONFIG_KVM_GMEM_SHARED_MEM
 void kvm_gmem_handle_folio_put(struct folio *folio)
 {
@@ -324,8 +336,28 @@ static pgoff_t kvm_gmem_get_index(struct kvm_memory_slot *slot, gfn_t gfn)
 	return gfn - slot->base_gfn + slot->gmem.pgoff;
 }
 
+static void kvm_gmem_evict_inode(struct inode *inode)
+{
+	struct kvm_gmem_inode_private *private = kvm_gmem_private(inode);
+
+#ifdef CONFIG_KVM_GMEM_SHARED_MEM
+	/*
+	 * .evict_inode can be called before private data is set up if there are
+	 * issues during inode creation.
+	 */
+	if (private)
+		xa_destroy(&private->shared_offsets);
+#endif
+
+	truncate_inode_pages_final(inode->i_mapping);
+
+	kfree(private);
+	clear_inode(inode);
+}
+
 static const struct super_operations kvm_gmem_super_operations = {
-	.statfs		= simple_statfs,
+	.statfs         = simple_statfs,
+	.evict_inode	= kvm_gmem_evict_inode,
 };
 
 static int kvm_gmem_init_fs_context(struct fs_context *fc)
@@ -553,6 +585,7 @@ static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
 						      loff_t size, u64 flags)
 {
 	const struct qstr qname = QSTR_INIT(name, strlen(name));
+	struct kvm_gmem_inode_private *private;
 	struct inode *inode;
 	int err;
 
@@ -561,10 +594,20 @@ static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
 		return inode;
 
 	err = security_inode_init_security_anon(inode, &qname, NULL);
-	if (err) {
-		iput(inode);
-		return ERR_PTR(err);
-	}
+	if (err)
+		goto out;
+
+	err = -ENOMEM;
+	private = kzalloc(sizeof(*private), GFP_KERNEL);
+	if (!private)
+		goto out;
+
+#ifdef CONFIG_KVM_GMEM_SHARED_MEM
+	xa_init(&private->shared_offsets);
+	rwlock_init(&private->offsets_lock);
+#endif
+
+	inode->i_mapping->i_private_data = private;
 
 	inode->i_private = (void *)(unsigned long)flags;
 	inode->i_op = &kvm_gmem_iops;
@@ -577,6 +620,11 @@ static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
 	WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
 
 	return inode;
+
+out:
+	iput(inode);
+
+	return ERR_PTR(err);
 }
 
 static struct file *kvm_gmem_inode_create_getfile(void *priv, loff_t size,
-- 
2.49.0.472.ge94155a9ec-goog



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v7 4/7] KVM: guest_memfd: Folio sharing states and functions that manage their transition
  2025-03-28 15:31 [PATCH v7 0/7] KVM: Restricted mapping of guest_memfd at the host and arm64 support Fuad Tabba
                   ` (2 preceding siblings ...)
  2025-03-28 15:31 ` [PATCH v7 3/7] KVM: guest_memfd: Track folio sharing within a struct kvm_gmem_private Fuad Tabba
@ 2025-03-28 15:31 ` Fuad Tabba
  2025-04-02 23:48   ` Ackerley Tng
  2025-04-03  0:19   ` Sean Christopherson
  2025-03-28 15:31 ` [PATCH v7 5/7] KVM: guest_memfd: Restore folio state after final folio_put() Fuad Tabba
                   ` (3 subsequent siblings)
  7 siblings, 2 replies; 30+ messages in thread
From: Fuad Tabba @ 2025-03-28 15:31 UTC (permalink / raw)
  To: kvm, linux-arm-msm, linux-mm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, isaku.yamahata, mic,
	vbabka, vannapurve, ackerleytng, mail, david, michael.roth,
	wei.w.wang, liam.merwick, isaku.yamahata, kirill.shutemov,
	suzuki.poulose, steven.price, quic_eberman, quic_mnalajal,
	quic_tsoni, quic_svaddagi, quic_cvanscha, quic_pderrin,
	quic_pheragu, catalin.marinas, james.morse, yuzenghui,
	oliver.upton, maz, will, qperret, keirf, roypat, shuah, hch, jgg,
	rientjes, jhubbard, fvdl, hughd, jthoughton, peterx, pankaj.gupta,
	tabba

To allow in-place sharing of guest_memfd folios with the host,
guest_memfd needs to track their sharing state, because mapping of
shared folios will only be allowed where it is safe to access them. It is
safe to map and access these folios when they are explicitly shared with
the host, or potentially if they have not yet been exposed to the guest
(e.g., at initialization).

This patch introduces sharing states for guest_memfd folios as well as
the functions that manage transitioning between those states.
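
In rough terms (a sketch of the transitions this patch implements, plus
a made-up example caller; kvm_gmem_set_shared() and
kvm_gmem_clear_shared() are the actual helpers added here):

  /*
   * KVM_GMEM_ALL_SHARED   - shared with the host and the guest
   * KVM_GMEM_GUEST_SHARED - shared only with the guest
   * KVM_GMEM_NONE_SHARED  - transient: the guest unshared it, but the
   *                         host still holds references that must drain
   *
   * Sketch of how an architecture might wire a guest share/unshare
   * request to the new helpers:
   */
  static int sketch_handle_guest_share(struct kvm *kvm, gfn_t gfn,
                                       u64 nr_pages, bool share)
  {
          gfn_t end = gfn + nr_pages;

          return share ? kvm_gmem_set_shared(kvm, gfn, end) :
                         kvm_gmem_clear_shared(kvm, gfn, end);
  }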

Signed-off-by: Fuad Tabba <tabba@google.com>
---
 include/linux/kvm_host.h |  39 +++++++-
 virt/kvm/guest_memfd.c   | 208 ++++++++++++++++++++++++++++++++++++---
 virt/kvm/kvm_main.c      |  62 ++++++++++++
 3 files changed, 295 insertions(+), 14 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index bc73d7426363..bf82faf16c53 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2600,7 +2600,44 @@ long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
 #endif
 
 #ifdef CONFIG_KVM_GMEM_SHARED_MEM
+int kvm_gmem_set_shared(struct kvm *kvm, gfn_t start, gfn_t end);
+int kvm_gmem_clear_shared(struct kvm *kvm, gfn_t start, gfn_t end);
+int kvm_gmem_slot_set_shared(struct kvm_memory_slot *slot, gfn_t start,
+			     gfn_t end);
+int kvm_gmem_slot_clear_shared(struct kvm_memory_slot *slot, gfn_t start,
+			       gfn_t end);
+bool kvm_gmem_slot_is_guest_shared(struct kvm_memory_slot *slot, gfn_t gfn);
 void kvm_gmem_handle_folio_put(struct folio *folio);
-#endif
+#else
+static inline int kvm_gmem_set_shared(struct kvm *kvm, gfn_t start, gfn_t end)
+{
+	WARN_ON_ONCE(1);
+	return -EINVAL;
+}
+static inline int kvm_gmem_clear_shared(struct kvm *kvm, gfn_t start,
+					gfn_t end)
+{
+	WARN_ON_ONCE(1);
+	return -EINVAL;
+}
+static inline int kvm_gmem_slot_set_shared(struct kvm_memory_slot *slot,
+					   gfn_t start, gfn_t end)
+{
+	WARN_ON_ONCE(1);
+	return -EINVAL;
+}
+static inline int kvm_gmem_slot_clear_shared(struct kvm_memory_slot *slot,
+					     gfn_t start, gfn_t end)
+{
+	WARN_ON_ONCE(1);
+	return -EINVAL;
+}
+static inline bool kvm_gmem_slot_is_guest_shared(struct kvm_memory_slot *slot,
+						 gfn_t gfn)
+{
+	WARN_ON_ONCE(1);
+	return false;
+}
+#endif /* CONFIG_KVM_GMEM_SHARED_MEM */
 
 #endif
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index cde16ed3b230..3b4d724084a8 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -29,14 +29,6 @@ static struct kvm_gmem_inode_private *kvm_gmem_private(struct inode *inode)
 	return inode->i_mapping->i_private_data;
 }
 
-#ifdef CONFIG_KVM_GMEM_SHARED_MEM
-void kvm_gmem_handle_folio_put(struct folio *folio)
-{
-	WARN_ONCE(1, "A placeholder that shouldn't trigger. Work in progress.");
-}
-EXPORT_SYMBOL_GPL(kvm_gmem_handle_folio_put);
-#endif /* CONFIG_KVM_GMEM_SHARED_MEM */
-
 /**
  * folio_file_pfn - like folio_file_page, but return a pfn.
  * @folio: The folio which contains this index.
@@ -389,22 +381,211 @@ static void kvm_gmem_init_mount(void)
 }
 
 #ifdef CONFIG_KVM_GMEM_SHARED_MEM
-static bool kvm_gmem_offset_is_shared(struct file *file, pgoff_t index)
+/*
+ * An enum of the valid folio sharing states:
+ * Bit 0: set if not shared with the guest (guest cannot fault it in)
+ * Bit 1: set if not shared with the host (host cannot fault it in)
+ */
+enum folio_shareability {
+	KVM_GMEM_ALL_SHARED	= 0b00,	/* Shared with the host and the guest. */
+	KVM_GMEM_GUEST_SHARED	= 0b10, /* Shared only with the guest. */
+	KVM_GMEM_NONE_SHARED	= 0b11, /* Not shared, transient state. */
+};
+
+static int kvm_gmem_offset_set_shared(struct inode *inode, pgoff_t index)
 {
-	struct kvm_gmem *gmem = file->private_data;
+	struct xarray *shared_offsets = &kvm_gmem_private(inode)->shared_offsets;
+	rwlock_t *offsets_lock = &kvm_gmem_private(inode)->offsets_lock;
+	void *xval = xa_mk_value(KVM_GMEM_ALL_SHARED);
+
+	lockdep_assert_held_write(offsets_lock);
+
+	return xa_err(xa_store(shared_offsets, index, xval, GFP_KERNEL));
+}
+
+/*
+ * Marks the range [start, end) as shared with both the host and the guest.
+ * Called when guest shares memory with the host.
+ */
+static int kvm_gmem_offset_range_set_shared(struct inode *inode,
+					    pgoff_t start, pgoff_t end)
+{
+	rwlock_t *offsets_lock = &kvm_gmem_private(inode)->offsets_lock;
+	pgoff_t i;
+	int r = 0;
+
+	write_lock(offsets_lock);
+	for (i = start; i < end; i++) {
+		r = kvm_gmem_offset_set_shared(inode, i);
+		if (WARN_ON_ONCE(r))
+			break;
+	}
+	write_unlock(offsets_lock);
+
+	return r;
+}
+
+static int kvm_gmem_offset_clear_shared(struct inode *inode, pgoff_t index)
+{
+	struct xarray *shared_offsets = &kvm_gmem_private(inode)->shared_offsets;
+	rwlock_t *offsets_lock = &kvm_gmem_private(inode)->offsets_lock;
+	void *xval_guest = xa_mk_value(KVM_GMEM_GUEST_SHARED);
+	void *xval_none = xa_mk_value(KVM_GMEM_NONE_SHARED);
+	struct folio *folio;
+	int refcount;
+	int r;
+
+	lockdep_assert_held_write(offsets_lock);
+
+	folio = filemap_lock_folio(inode->i_mapping, index);
+	if (!IS_ERR(folio)) {
+		/* +1 references are expected because of filemap_lock_folio(). */
+		refcount = folio_nr_pages(folio) + 1;
+	} else {
+		r = PTR_ERR(folio);
+		if (WARN_ON_ONCE(r != -ENOENT))
+			return r;
+
+		folio = NULL;
+	}
+
+	if (!folio || folio_ref_freeze(folio, refcount)) {
+		/*
+		 * No outstanding references: transition to guest shared.
+		 */
+		r = xa_err(xa_store(shared_offsets, index, xval_guest, GFP_KERNEL));
+
+		if (folio)
+			folio_ref_unfreeze(folio, refcount);
+	} else {
+		/*
+		 * Outstanding references: the folio cannot be faulted in by
+		 * anyone until they're dropped.
+		 */
+		r = xa_err(xa_store(shared_offsets, index, xval_none, GFP_KERNEL));
+	}
+
+	if (folio) {
+		folio_unlock(folio);
+		folio_put(folio);
+	}
+
+	return r;
+}
 
+/*
+ * Marks the range [start, end) as not shared with the host. If the host doesn't
+ * have any references to a particular folio, then that folio is marked as
+ * shared with the guest.
+ *
+ * However, if the host still has references to the folio, then the folio is
+ * marked as not shared with anyone. Marking it as not shared allows draining
+ * all references from the host, and ensures that the hypervisor does not
+ * transition the folio to private, since the host still might access it.
+ *
+ * Called when guest unshares memory with the host.
+ */
+static int kvm_gmem_offset_range_clear_shared(struct inode *inode,
+					      pgoff_t start, pgoff_t end)
+{
+	rwlock_t *offsets_lock = &kvm_gmem_private(inode)->offsets_lock;
+	pgoff_t i;
+	int r = 0;
+
+	write_lock(offsets_lock);
+	for (i = start; i < end; i++) {
+		r = kvm_gmem_offset_clear_shared(inode, i);
+		if (WARN_ON_ONCE(r))
+			break;
+	}
+	write_unlock(offsets_lock);
+
+	return r;
+}
+
+void kvm_gmem_handle_folio_put(struct folio *folio)
+{
+	WARN_ONCE(1, "A placeholder that shouldn't trigger. Work in progress.");
+}
+EXPORT_SYMBOL_GPL(kvm_gmem_handle_folio_put);
+
+/*
+ * Returns true if the folio is shared with the host and the guest.
+ *
+ * Must be called with the offsets_lock lock held.
+ */
+static bool kvm_gmem_offset_is_shared(struct inode *inode, pgoff_t index)
+{
+	struct xarray *shared_offsets = &kvm_gmem_private(inode)->shared_offsets;
+	rwlock_t *offsets_lock = &kvm_gmem_private(inode)->offsets_lock;
+	unsigned long r;
+
+	lockdep_assert_held(offsets_lock);
 
-	/* For now, VMs that support shared memory share all their memory. */
-	return kvm_arch_gmem_supports_shared_mem(gmem->kvm);
+	r = xa_to_value(xa_load(shared_offsets, index));
+
+	return r == KVM_GMEM_ALL_SHARED;
+}
+
+/*
+ * Returns true if the folio is shared with the guest (not transitioning).
+ *
+ * Must be called with the offsets_lock lock held.
+ */
+static bool kvm_gmem_offset_is_guest_shared(struct inode *inode, pgoff_t index)
+{
+	struct xarray *shared_offsets = &kvm_gmem_private(inode)->shared_offsets;
+	rwlock_t *offsets_lock = &kvm_gmem_private(inode)->offsets_lock;
+	unsigned long r;
+
+	lockdep_assert_held(offsets_lock);
+
+	r = xa_to_value(xa_load(shared_offsets, index));
+
+	return (r == KVM_GMEM_ALL_SHARED || r == KVM_GMEM_GUEST_SHARED);
+}
+
+int kvm_gmem_slot_set_shared(struct kvm_memory_slot *slot, gfn_t start, gfn_t end)
+{
+	struct inode *inode = file_inode(READ_ONCE(slot->gmem.file));
+	pgoff_t start_off = slot->gmem.pgoff + start - slot->base_gfn;
+	pgoff_t end_off = start_off + end - start;
+
+	return kvm_gmem_offset_range_set_shared(inode, start_off, end_off);
+}
+
+int kvm_gmem_slot_clear_shared(struct kvm_memory_slot *slot, gfn_t start, gfn_t end)
+{
+	struct inode *inode = file_inode(READ_ONCE(slot->gmem.file));
+	pgoff_t start_off = slot->gmem.pgoff + start - slot->base_gfn;
+	pgoff_t end_off = start_off + end - start;
+
+	return kvm_gmem_offset_range_clear_shared(inode, start_off, end_off);
+}
+
+bool kvm_gmem_slot_is_guest_shared(struct kvm_memory_slot *slot, gfn_t gfn)
+{
+	struct inode *inode = file_inode(READ_ONCE(slot->gmem.file));
+	rwlock_t *offsets_lock = &kvm_gmem_private(inode)->offsets_lock;
+	unsigned long pgoff = slot->gmem.pgoff + gfn - slot->base_gfn;
+	bool r;
+
+	read_lock(offsets_lock);
+	r = kvm_gmem_offset_is_guest_shared(inode, pgoff);
+	read_unlock(offsets_lock);
+
+	return r;
 }
 
 static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
 {
 	struct inode *inode = file_inode(vmf->vma->vm_file);
+	rwlock_t *offsets_lock = &kvm_gmem_private(inode)->offsets_lock;
 	struct folio *folio;
 	vm_fault_t ret = VM_FAULT_LOCKED;
 
 	filemap_invalidate_lock_shared(inode->i_mapping);
+	read_lock(offsets_lock);
 
 	folio = kvm_gmem_get_folio(inode, vmf->pgoff);
 	if (IS_ERR(folio)) {
@@ -423,7 +604,7 @@ static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
 		goto out_folio;
 	}
 
-	if (!kvm_gmem_offset_is_shared(vmf->vma->vm_file, vmf->pgoff)) {
+	if (!kvm_gmem_offset_is_shared(inode, vmf->pgoff)) {
 		ret = VM_FAULT_SIGBUS;
 		goto out_folio;
 	}
@@ -457,6 +638,7 @@ static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
 	}
 
 out_filemap:
+	read_unlock(offsets_lock);
 	filemap_invalidate_unlock_shared(inode->i_mapping);
 
 	return ret;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 3e40acb9f5c0..90762252381c 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3091,6 +3091,68 @@ static int next_segment(unsigned long len, int offset)
 		return len;
 }
 
+#ifdef CONFIG_KVM_GMEM_SHARED_MEM
+int kvm_gmem_set_shared(struct kvm *kvm, gfn_t start, gfn_t end)
+{
+	struct kvm_memslot_iter iter;
+	int r = 0;
+
+	mutex_lock(&kvm->slots_lock);
+
+	kvm_for_each_memslot_in_gfn_range(&iter, kvm_memslots(kvm), start, end) {
+		struct kvm_memory_slot *memslot = iter.slot;
+		gfn_t gfn_start, gfn_end;
+
+		if (!kvm_slot_can_be_private(memslot))
+			continue;
+
+		gfn_start = max(start, memslot->base_gfn);
+		gfn_end = min(end, memslot->base_gfn + memslot->npages);
+		if (WARN_ON_ONCE(start >= end))
+			continue;
+
+		r = kvm_gmem_slot_set_shared(memslot, gfn_start, gfn_end);
+		if (WARN_ON_ONCE(r))
+			break;
+	}
+
+	mutex_unlock(&kvm->slots_lock);
+
+	return r;
+}
+EXPORT_SYMBOL_GPL(kvm_gmem_set_shared);
+
+int kvm_gmem_clear_shared(struct kvm *kvm, gfn_t start, gfn_t end)
+{
+	struct kvm_memslot_iter iter;
+	int r = 0;
+
+	mutex_lock(&kvm->slots_lock);
+
+	kvm_for_each_memslot_in_gfn_range(&iter, kvm_memslots(kvm), start, end) {
+		struct kvm_memory_slot *memslot = iter.slot;
+		gfn_t gfn_start, gfn_end;
+
+		if (!kvm_slot_can_be_private(memslot))
+			continue;
+
+		gfn_start = max(start, memslot->base_gfn);
+		gfn_end = min(end, memslot->base_gfn + memslot->npages);
+		if (WARN_ON_ONCE(start >= end))
+			continue;
+
+		r = kvm_gmem_slot_clear_shared(memslot, gfn_start, gfn_end);
+		if (WARN_ON_ONCE(r))
+			break;
+	}
+
+	mutex_unlock(&kvm->slots_lock);
+
+	return r;
+}
+EXPORT_SYMBOL_GPL(kvm_gmem_clear_shared);
+#endif /* CONFIG_KVM_GMEM_SHARED_MEM */
+
 /* Copy @len bytes from guest memory at '(@gfn * PAGE_SIZE) + @offset' to @data */
 static int __kvm_read_guest_page(struct kvm_memory_slot *slot, gfn_t gfn,
 				 void *data, int offset, int len)
-- 
2.49.0.472.ge94155a9ec-goog



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v7 5/7] KVM: guest_memfd: Restore folio state after final folio_put()
  2025-03-28 15:31 [PATCH v7 0/7] KVM: Restricted mapping of guest_memfd at the host and arm64 support Fuad Tabba
                   ` (3 preceding siblings ...)
  2025-03-28 15:31 ` [PATCH v7 4/7] KVM: guest_memfd: Folio sharing states and functions that manage their transition Fuad Tabba
@ 2025-03-28 15:31 ` Fuad Tabba
  2025-04-05 16:23   ` Matthew Wilcox
  2025-03-28 15:31 ` [PATCH v7 6/7] KVM: guest_memfd: Handle invalidation of shared memory Fuad Tabba
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 30+ messages in thread
From: Fuad Tabba @ 2025-03-28 15:31 UTC (permalink / raw)
  To: kvm, linux-arm-msm, linux-mm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, isaku.yamahata, mic,
	vbabka, vannapurve, ackerleytng, mail, david, michael.roth,
	wei.w.wang, liam.merwick, isaku.yamahata, kirill.shutemov,
	suzuki.poulose, steven.price, quic_eberman, quic_mnalajal,
	quic_tsoni, quic_svaddagi, quic_cvanscha, quic_pderrin,
	quic_pheragu, catalin.marinas, james.morse, yuzenghui,
	oliver.upton, maz, will, qperret, keirf, roypat, shuah, hch, jgg,
	rientjes, jhubbard, fvdl, hughd, jthoughton, peterx, pankaj.gupta,
	tabba

Before transitioning a guest_memfd folio to unshared, thereby
disallowing access by the host and allowing the hypervisor to transition
its view of the guest page to private, we need to ensure that the host
doesn't have any references to the folio.

This patch uses the guest_memfd folio type to register a callback that
informs the guest_memfd subsystem when the last reference is dropped,
so that it knows the host no longer has any references to the folio.

Signed-off-by: Fuad Tabba <tabba@google.com>
---
The function kvm_gmem_slot_register_callback() isn't used in this
series. It will be used later in code that performs unsharing of
memory. I have tested it with pKVM, based on downstream code [*].
It's included in this RFC since it demonstrates the plan for
handling the unsharing of private folios.

[*] https://android-kvm.googlesource.com/linux/+/refs/heads/tabba/guestmem-6.13-v7-pkvm
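
A rough sketch of how the unshare path is expected to use it
(illustration only; the function below is made up and not part of this
series):

  static int sketch_unshare_gfn_with_host(struct kvm_memory_slot *slot, gfn_t gfn)
  {
          int r;

          /*
           * Stop the host from faulting the offset in; it becomes
           * NONE_SHARED if host references remain, GUEST_SHARED otherwise.
           */
          r = kvm_gmem_slot_clear_shared(slot, gfn, gfn + 1);
          if (r)
                  return r;

          /*
           * Ask gmem to notify us via kvm_gmem_handle_folio_put() once the
           * last host reference is gone. -EAGAIN means the folio is still
           * mapped or pinned by the host and the caller should retry later.
           */
          return kvm_gmem_slot_register_callback(slot, gfn);
  }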
---
 include/linux/kvm_host.h |   6 ++
 virt/kvm/guest_memfd.c   | 143 ++++++++++++++++++++++++++++++++++++++-
 2 files changed, 148 insertions(+), 1 deletion(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index bf82faf16c53..d9d9d72d8beb 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2607,6 +2607,7 @@ int kvm_gmem_slot_set_shared(struct kvm_memory_slot *slot, gfn_t start,
 int kvm_gmem_slot_clear_shared(struct kvm_memory_slot *slot, gfn_t start,
 			       gfn_t end);
 bool kvm_gmem_slot_is_guest_shared(struct kvm_memory_slot *slot, gfn_t gfn);
+int kvm_gmem_slot_register_callback(struct kvm_memory_slot *slot, gfn_t gfn);
 void kvm_gmem_handle_folio_put(struct folio *folio);
 #else
 static inline int kvm_gmem_set_shared(struct kvm *kvm, gfn_t start, gfn_t end)
@@ -2638,6 +2639,11 @@ static inline bool kvm_gmem_slot_is_guest_shared(struct kvm_memory_slot *slot,
 	WARN_ON_ONCE(1);
 	return false;
 }
+static inline int kvm_gmem_slot_register_callback(struct kvm_memory_slot *slot, gfn_t gfn)
+{
+	WARN_ON_ONCE(1);
+	return -EINVAL;
+}
 #endif /* CONFIG_KVM_GMEM_SHARED_MEM */
 
 #endif
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 3b4d724084a8..ce19bd6c2031 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -392,6 +392,27 @@ enum folio_shareability {
 	KVM_GMEM_NONE_SHARED	= 0b11, /* Not shared, transient state. */
 };
 
+/*
+ * Unregisters the __folio_put() callback from the folio.
+ *
+ * Restores a folio's refcount after all pending references have been released,
+ * and removes the folio type, thereby removing the callback. Now the folio can
+ * be freed normally once all actual references have been dropped.
+ *
+ * Must be called with the folio locked and the offsets_lock write lock held.
+ */
+static void kvm_gmem_restore_pending_folio(struct folio *folio, struct inode *inode)
+{
+	rwlock_t *offsets_lock = &kvm_gmem_private(inode)->offsets_lock;
+
+	lockdep_assert_held_write(offsets_lock);
+	WARN_ON_ONCE(!folio_test_locked(folio));
+	WARN_ON_ONCE(!folio_test_guestmem(folio));
+
+	__folio_clear_guestmem(folio);
+	folio_ref_add(folio, folio_nr_pages(folio));
+}
+
 static int kvm_gmem_offset_set_shared(struct inode *inode, pgoff_t index)
 {
 	struct xarray *shared_offsets = &kvm_gmem_private(inode)->shared_offsets;
@@ -400,6 +421,24 @@ static int kvm_gmem_offset_set_shared(struct inode *inode, pgoff_t index)
 
 	lockdep_assert_held_write(offsets_lock);
 
+	/*
+	 * If the folio is NONE_SHARED, it indicates that it is transitioning to
+	 * private (GUEST_SHARED). Transition it to shared (ALL_SHARED)
+	 * immediately, and remove the callback.
+	 */
+	if (xa_to_value(xa_load(shared_offsets, index)) == KVM_GMEM_NONE_SHARED) {
+		struct folio *folio = filemap_lock_folio(inode->i_mapping, index);
+
+		if (WARN_ON_ONCE(IS_ERR(folio)))
+			return PTR_ERR(folio);
+
+		if (folio_test_guestmem(folio))
+			kvm_gmem_restore_pending_folio(folio, inode);
+
+		folio_unlock(folio);
+		folio_put(folio);
+	}
+
 	return xa_err(xa_store(shared_offsets, index, xval, GFP_KERNEL));
 }
 
@@ -503,9 +542,111 @@ static int kvm_gmem_offset_range_clear_shared(struct inode *inode,
 	return r;
 }
 
+/*
+ * Registers a callback to __folio_put(), so that gmem knows that the host does
+ * not have any references to the folio. The callback itself is registered by
+ * setting the folio type to guestmem.
+ *
+ * Returns 0 if a callback was registered or already has been registered, or
+ * -EAGAIN if the host has references, indicating a callback wasn't registered.
+ *
+ * Must be called with the folio locked and the offsets_lock write lock held.
+ */
+static int kvm_gmem_register_callback(struct folio *folio, struct inode *inode, pgoff_t index)
+{
+	struct xarray *shared_offsets = &kvm_gmem_private(inode)->shared_offsets;
+	rwlock_t *offsets_lock = &kvm_gmem_private(inode)->offsets_lock;
+	void *xval_guest = xa_mk_value(KVM_GMEM_GUEST_SHARED);
+	int refcount;
+	int r = 0;
+
+	lockdep_assert_held_write(offsets_lock);
+	WARN_ON_ONCE(!folio_test_locked(folio));
+
+	if (folio_test_guestmem(folio))
+		return 0;
+
+	if (folio_mapped(folio))
+		return -EAGAIN;
+
+	refcount = folio_ref_count(folio);
+	if (!folio_ref_freeze(folio, refcount))
+		return -EAGAIN;
+
+	/*
+	 * Register callback by setting the folio type and subtracting gmem's
+	 * references for it to trigger once outstanding references are dropped.
+	 */
+	if (refcount > 1) {
+		__folio_set_guestmem(folio);
+		refcount -= folio_nr_pages(folio);
+	} else {
+		/* No outstanding references, transition it to guest shared. */
+		r = WARN_ON_ONCE(xa_err(xa_store(shared_offsets, index, xval_guest, GFP_KERNEL)));
+	}
+
+	folio_ref_unfreeze(folio, refcount);
+	return r;
+}
+
+int kvm_gmem_slot_register_callback(struct kvm_memory_slot *slot, gfn_t gfn)
+{
+	unsigned long pgoff = slot->gmem.pgoff + gfn - slot->base_gfn;
+	struct inode *inode = file_inode(READ_ONCE(slot->gmem.file));
+	rwlock_t *offsets_lock = &kvm_gmem_private(inode)->offsets_lock;
+	struct folio *folio;
+	int r;
+
+	write_lock(offsets_lock);
+
+	folio = filemap_lock_folio(inode->i_mapping, pgoff);
+	if (WARN_ON_ONCE(IS_ERR(folio))) {
+		write_unlock(offsets_lock);
+		return PTR_ERR(folio);
+	}
+
+	r = kvm_gmem_register_callback(folio, inode, pgoff);
+
+	folio_unlock(folio);
+	folio_put(folio);
+	write_unlock(offsets_lock);
+
+	return r;
+}
+EXPORT_SYMBOL_GPL(kvm_gmem_slot_register_callback);
+
+/*
+ * Callback function for __folio_put(), i.e., called once all references by the
+ * host to the folio have been dropped. This allows gmem to transition the state
+ * of the folio to shared with the guest, and allows the hypervisor to continue
+ * transitioning its state to private, since the host cannot attempt to access
+ * it anymore.
+ */
 void kvm_gmem_handle_folio_put(struct folio *folio)
 {
-	WARN_ONCE(1, "A placeholder that shouldn't trigger. Work in progress.");
+	struct address_space *mapping;
+	struct xarray *shared_offsets;
+	rwlock_t *offsets_lock;
+	struct inode *inode;
+	pgoff_t index;
+	void *xval;
+
+	mapping = folio->mapping;
+	if (WARN_ON_ONCE(!mapping))
+		return;
+
+	inode = mapping->host;
+	index = folio->index;
+	shared_offsets = &kvm_gmem_private(inode)->shared_offsets;
+	offsets_lock = &kvm_gmem_private(inode)->offsets_lock;
+	xval = xa_mk_value(KVM_GMEM_GUEST_SHARED);
+
+	write_lock(offsets_lock);
+	folio_lock(folio);
+	kvm_gmem_restore_pending_folio(folio, inode);
+	folio_unlock(folio);
+	WARN_ON_ONCE(xa_err(xa_store(shared_offsets, index, xval, GFP_KERNEL)));
+	write_unlock(offsets_lock);
 }
 EXPORT_SYMBOL_GPL(kvm_gmem_handle_folio_put);
 
-- 
2.49.0.472.ge94155a9ec-goog



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v7 6/7] KVM: guest_memfd: Handle invalidation of shared memory
  2025-03-28 15:31 [PATCH v7 0/7] KVM: Restricted mapping of guest_memfd at the host and arm64 support Fuad Tabba
                   ` (4 preceding siblings ...)
  2025-03-28 15:31 ` [PATCH v7 5/7] KVM: guest_memfd: Restore folio state after final folio_put() Fuad Tabba
@ 2025-03-28 15:31 ` Fuad Tabba
  2025-03-28 15:31 ` [PATCH v7 7/7] KVM: guest_memfd: Add a guest_memfd() flag to initialize it as shared Fuad Tabba
  2025-04-04 18:05 ` [PATCH v7 0/7] KVM: Restricted mapping of guest_memfd at the host and arm64 support Liam R. Howlett
  7 siblings, 0 replies; 30+ messages in thread
From: Fuad Tabba @ 2025-03-28 15:31 UTC (permalink / raw)
  To: kvm, linux-arm-msm, linux-mm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, isaku.yamahata, mic,
	vbabka, vannapurve, ackerleytng, mail, david, michael.roth,
	wei.w.wang, liam.merwick, isaku.yamahata, kirill.shutemov,
	suzuki.poulose, steven.price, quic_eberman, quic_mnalajal,
	quic_tsoni, quic_svaddagi, quic_cvanscha, quic_pderrin,
	quic_pheragu, catalin.marinas, james.morse, yuzenghui,
	oliver.upton, maz, will, qperret, keirf, roypat, shuah, hch, jgg,
	rientjes, jhubbard, fvdl, hughd, jthoughton, peterx, pankaj.gupta,
	tabba

When guest_memfd backed memory is invalidated, e.g., when punching holes
or on release, ensure that the sharing states are updated and that any
folios in a transient state are restored to an appropriate state.

Signed-off-by: Fuad Tabba <tabba@google.com>
---
 virt/kvm/guest_memfd.c | 57 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 57 insertions(+)

diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index ce19bd6c2031..eec9d5e09f09 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -118,6 +118,16 @@ static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
 	return filemap_grab_folio(inode->i_mapping, index);
 }
 
+#ifdef CONFIG_KVM_GMEM_SHARED_MEM
+static void kvm_gmem_offset_range_invalidate_shared(struct inode *inode,
+						    pgoff_t start, pgoff_t end);
+#else
+static inline void kvm_gmem_offset_range_invalidate_shared(struct inode *inode,
+							   pgoff_t start, pgoff_t end)
+{
+}
+#endif
+
 static void kvm_gmem_invalidate_begin(struct kvm_gmem *gmem, pgoff_t start,
 				      pgoff_t end)
 {
@@ -127,6 +137,7 @@ static void kvm_gmem_invalidate_begin(struct kvm_gmem *gmem, pgoff_t start,
 	unsigned long index;
 
 	xa_for_each_range(&gmem->bindings, index, slot, start, end - 1) {
+		struct file *file = READ_ONCE(slot->gmem.file);
 		pgoff_t pgoff = slot->gmem.pgoff;
 
 		struct kvm_gfn_range gfn_range = {
@@ -146,6 +157,16 @@ static void kvm_gmem_invalidate_begin(struct kvm_gmem *gmem, pgoff_t start,
 		}
 
 		flush |= kvm_mmu_unmap_gfn_range(kvm, &gfn_range);
+
+		/*
+		 * If this gets called after kvm_gmem_unbind() it means that all
+		 * in-flight operations are gone, and the file has been closed.
+		 */
+		if (file) {
+			kvm_gmem_offset_range_invalidate_shared(file_inode(file),
+								gfn_range.start,
+								gfn_range.end);
+		}
 	}
 
 	if (flush)
@@ -512,6 +533,42 @@ static int kvm_gmem_offset_clear_shared(struct inode *inode, pgoff_t index)
 	return r;
 }
 
+/*
+ * Callback when invalidating memory that is potentially shared.
+ *
+ * Must be called with the offsets_lock write lock held.
+ */
+static void kvm_gmem_offset_range_invalidate_shared(struct inode *inode,
+						    pgoff_t start, pgoff_t end)
+{
+	struct xarray *shared_offsets = &kvm_gmem_private(inode)->shared_offsets;
+	rwlock_t *offsets_lock = &kvm_gmem_private(inode)->offsets_lock;
+	pgoff_t i;
+
+	lockdep_assert_held_write(offsets_lock);
+
+	for (i = start; i < end; i++) {
+		/*
+		 * If the folio is NONE_SHARED, it indicates that it's
+		 * transitioning to private (GUEST_SHARED). Transition it to
+		 * shared (ALL_SHARED) and remove the callback.
+		 */
+		if (xa_to_value(xa_load(shared_offsets, i)) == KVM_GMEM_NONE_SHARED) {
+			struct folio *folio = filemap_lock_folio(inode->i_mapping, i);
+
+			if (!WARN_ON_ONCE(IS_ERR(folio))) {
+				if (folio_test_guestmem(folio))
+					kvm_gmem_restore_pending_folio(folio, inode);
+
+				folio_unlock(folio);
+				folio_put(folio);
+			}
+		}
+
+		xa_erase(shared_offsets, i);
+	}
+}
+
 /*
  * Marks the range [start, end) as not shared with the host. If the host doesn't
  * have any references to a particular folio, then that folio is marked as
-- 
2.49.0.472.ge94155a9ec-goog



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v7 7/7] KVM: guest_memfd: Add a guest_memfd() flag to initialize it as shared
  2025-03-28 15:31 [PATCH v7 0/7] KVM: Restricted mapping of guest_memfd at the host and arm64 support Fuad Tabba
                   ` (5 preceding siblings ...)
  2025-03-28 15:31 ` [PATCH v7 6/7] KVM: guest_memfd: Handle invalidation of shared memory Fuad Tabba
@ 2025-03-28 15:31 ` Fuad Tabba
  2025-04-02 22:47   ` Ackerley Tng
  2025-04-02 23:03   ` Ackerley Tng
  2025-04-04 18:05 ` [PATCH v7 0/7] KVM: Restricted mapping of guest_memfd at the host and arm64 support Liam R. Howlett
  7 siblings, 2 replies; 30+ messages in thread
From: Fuad Tabba @ 2025-03-28 15:31 UTC (permalink / raw)
  To: kvm, linux-arm-msm, linux-mm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, isaku.yamahata, mic,
	vbabka, vannapurve, ackerleytng, mail, david, michael.roth,
	wei.w.wang, liam.merwick, isaku.yamahata, kirill.shutemov,
	suzuki.poulose, steven.price, quic_eberman, quic_mnalajal,
	quic_tsoni, quic_svaddagi, quic_cvanscha, quic_pderrin,
	quic_pheragu, catalin.marinas, james.morse, yuzenghui,
	oliver.upton, maz, will, qperret, keirf, roypat, shuah, hch, jgg,
	rientjes, jhubbard, fvdl, hughd, jthoughton, peterx, pankaj.gupta,
	tabba

Not all use cases require guest_memfd() to be shared with the host when
first created. Add a new flag, GUEST_MEMFD_FLAG_INIT_SHARED, which when
set on KVM_CREATE_GUEST_MEMFD initializes the memory as shared with the
host, and therefore mappable by it. Otherwise, memory is private until
explicitly shared by the guest with the host.
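
From userspace this looks roughly like the following (sketch only;
vm_fd and guest_mem_size are assumed to already exist):

  struct kvm_create_guest_memfd args = {
          .size  = guest_mem_size,
          .flags = GUEST_MEMFD_FLAG_INIT_SHARED, /* host-mappable from the start */
  };
  int gmem_fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &args);

  /*
   * Without the flag, mmap() on gmem_fd still succeeds, but touching the
   * mapping faults (SIGBUS) until the guest shares the memory.
   */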

Signed-off-by: Fuad Tabba <tabba@google.com>
---
 Documentation/virt/kvm/api.rst                 |  4 ++++
 include/uapi/linux/kvm.h                       |  1 +
 tools/testing/selftests/kvm/guest_memfd_test.c |  7 +++++--
 virt/kvm/guest_memfd.c                         | 12 ++++++++++++
 4 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 2b52eb77e29c..a5496d7d323b 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6386,6 +6386,10 @@ most one mapping per page, i.e. binding multiple memory regions to a single
 guest_memfd range is not allowed (any number of memory regions can be bound to
 a single guest_memfd file, but the bound ranges must not overlap).
 
+If the capability KVM_CAP_GMEM_SHARED_MEM is supported, then the flags field
+supports GUEST_MEMFD_FLAG_INIT_SHARED, which initializes the memory as shared
+with the host, and thereby, mappable by it.
+
 See KVM_SET_USER_MEMORY_REGION2 for additional details.
 
 4.143 KVM_PRE_FAULT_MEMORY
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 117937a895da..22d7e33bf09c 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1566,6 +1566,7 @@ struct kvm_memory_attributes {
 #define KVM_MEMORY_ATTRIBUTE_PRIVATE           (1ULL << 3)
 
 #define KVM_CREATE_GUEST_MEMFD	_IOWR(KVMIO,  0xd4, struct kvm_create_guest_memfd)
+#define GUEST_MEMFD_FLAG_INIT_SHARED		(1UL << 0)
 
 struct kvm_create_guest_memfd {
 	__u64 size;
diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c b/tools/testing/selftests/kvm/guest_memfd_test.c
index 38c501e49e0e..4a7fcd6aa372 100644
--- a/tools/testing/selftests/kvm/guest_memfd_test.c
+++ b/tools/testing/selftests/kvm/guest_memfd_test.c
@@ -159,7 +159,7 @@ static void test_invalid_punch_hole(int fd, size_t page_size, size_t total_size)
 static void test_create_guest_memfd_invalid(struct kvm_vm *vm)
 {
 	size_t page_size = getpagesize();
-	uint64_t flag;
+	uint64_t flag = BIT(0);
 	size_t size;
 	int fd;
 
@@ -170,7 +170,10 @@ static void test_create_guest_memfd_invalid(struct kvm_vm *vm)
 			    size);
 	}
 
-	for (flag = BIT(0); flag; flag <<= 1) {
+	if (kvm_has_cap(KVM_CAP_GMEM_SHARED_MEM))
+		flag = GUEST_MEMFD_FLAG_INIT_SHARED << 1;
+
+	for (; flag; flag <<= 1) {
 		fd = __vm_create_guest_memfd(vm, page_size, flag);
 		TEST_ASSERT(fd == -1 && errno == EINVAL,
 			    "guest_memfd() with flag '0x%lx' should fail with EINVAL",
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index eec9d5e09f09..32e149478b04 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -1069,6 +1069,15 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
 		goto err_gmem;
 	}
 
+	if (IS_ENABLED(CONFIG_KVM_GMEM_SHARED_MEM) &&
+	    (flags & GUEST_MEMFD_FLAG_INIT_SHARED)) {
+		err = kvm_gmem_offset_range_set_shared(file_inode(file), 0, size >> PAGE_SHIFT);
+		if (err) {
+			fput(file);
+			goto err_gmem;
+		}
+	}
+
 	kvm_get_kvm(kvm);
 	gmem->kvm = kvm;
 	xa_init(&gmem->bindings);
@@ -1090,6 +1099,9 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
 	u64 flags = args->flags;
 	u64 valid_flags = 0;
 
+	if (IS_ENABLED(CONFIG_KVM_GMEM_SHARED_MEM))
+		valid_flags |= GUEST_MEMFD_FLAG_INIT_SHARED;
+
 	if (flags & ~valid_flags)
 		return -EINVAL;
 
-- 
2.49.0.472.ge94155a9ec-goog



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH v7 3/7] KVM: guest_memfd: Track folio sharing within a struct kvm_gmem_private
  2025-03-28 15:31 ` [PATCH v7 3/7] KVM: guest_memfd: Track folio sharing within a struct kvm_gmem_private Fuad Tabba
@ 2025-04-02 22:25   ` Ackerley Tng
  2025-04-02 23:56     ` Sean Christopherson
  0 siblings, 1 reply; 30+ messages in thread
From: Ackerley Tng @ 2025-04-02 22:25 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe, anup,
	paul.walmsley, palmer, aou, seanjc, viro, brauner, willy, akpm,
	xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
	isaku.yamahata, mic, vbabka, vannapurve, mail, david,
	michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat, shuah,
	hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton, peterx,
	pankaj.gupta, tabba

Fuad Tabba <tabba@google.com> writes:

> From: Ackerley Tng <ackerleytng@google.com>
>
> Track guest_memfd folio sharing state within the inode, since it is a
> property of the guest_memfd's memory contents.
>
> The guest_memfd PRIVATE memory attribute is not used for two reasons. It
> reflects the userspace expectation for the memory state, and therefore
> can be toggled by userspace. Also, although each guest_memfd file has a
> 1:1 binding with a KVM instance, the plan is to allow multiple files per
> inode, e.g. to allow intra-host migration to a new KVM instance, without
> destroying guest_memfd.
>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> Co-developed-by: Vishal Annapurve <vannapurve@google.com>
> Signed-off-by: Vishal Annapurve <vannapurve@google.com>
> Co-developed-by: Fuad Tabba <tabba@google.com>
> Signed-off-by: Fuad Tabba <tabba@google.com>
> ---
>  virt/kvm/guest_memfd.c | 58 ++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 53 insertions(+), 5 deletions(-)
>
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index ac6b8853699d..cde16ed3b230 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -17,6 +17,18 @@ struct kvm_gmem {
>  	struct list_head entry;
>  };
>  
> +struct kvm_gmem_inode_private {
> +#ifdef CONFIG_KVM_GMEM_SHARED_MEM
> +	struct xarray shared_offsets;
> +	rwlock_t offsets_lock;

This lock doesn't work, either that or this lock can't be held while
faulting, because holding this lock means we can't sleep, and we need to
sleep to allocate.

One of these config options must have helped throw a BUG

CONFIG_DEBUG_ATOMIC_SLEEP=y
CONFIG_DEBUG_IRQFLAGS=y
CONFIG_DEBUG_LOCK_ALLOC=y
CONFIG_DEBUG_LOCKDEP=y
CONFIG_DEBUG_MUTEXES=y
CONFIG_DEBUG_RT_MUTEXES=y
CONFIG_DEBUG_RWSEMS=y
CONFIG_DEBUG_SPINLOCK=y
CONFIG_DEBUG_WW_MUTEX_SLOWPATH=y
CONFIG_LOCKDEP=y
CONFIG_LOCK_STAT=y
CONFIG_PREEMPT_COUNT=y
CONFIG_PREEMPTIRQ_TRACEPOINTS=y
CONFIG_PROVE_LOCKING=y
CONFIG_PROVE_RAW_LOCK_NESTING=y
CONFIG_PROVE_RCU=y
CONFIG_RCU_CPU_STALL_CPUTIME=y
CONFIG_TRACE_IRQFLAGS_NMI=y
CONFIG_TRACE_IRQFLAGS=y
CONFIG_UNINLINE_SPIN_UNLOCK=y

with ./guest_memfd_test

[  161.255012] BUG: sleeping function called from invalid context at ./include/linux/sched/mm.h:321
[  161.257350] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 254, name: guest_memfd_tes
[  161.259662] preempt_count: 1, expected: 0
[  161.260884] RCU nest depth: 0, expected: 0
[  161.262119] 3 locks held by guest_memfd_tes/254:
[  161.263470]  #0: ffff8883064c3c80 (&mm->mmap_lock){++++}-{4:4}, at: lock_mm_and_find_vma+0x29/0x140
[  161.265932]  #1: ffff88830dedbc10 (mapping.invalidate_lock#4){++++}-{4:4}, at: kvm_gmem_fault+0x3d/0x1f0
[  161.268507]  #2: ffff88830d510d30 (&private->offsets_lock){.+.+}-{3:3}, at: kvm_gmem_fault+0x45/0x1f0
[  161.270992] CPU: 2 UID: 0 PID: 254 Comm: guest_memfd_tes Tainted: G        W          6.14.0-rc7-00016-g174a15c15f96 #1
[  161.270995] Tainted: [W]=WARN
[  161.270996] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[  161.270997] Call Trace:
[  161.270998]  <TASK>
[  161.271000]  dump_stack_lvl+0xa7/0x100
[  161.271005]  __might_resched+0x261/0x280
[  161.271009]  prepare_alloc_pages+0xe5/0x1e0
[  161.271013]  __alloc_frozen_pages_noprof+0xbc/0x2a0
[  161.271019]  alloc_pages_mpol+0x111/0x1d0
[  161.271025]  alloc_pages_noprof+0x7e/0x120
[  161.271028]  folio_alloc_noprof+0x14/0x30
[  161.271030]  __filemap_get_folio+0x189/0x380
[  161.271036]  kvm_gmem_fault+0x5e/0x1f0
[  161.271041]  __do_fault+0x42/0xc0
[  161.271045]  handle_mm_fault+0xf37/0x1c90
[  161.271047]  ? handle_mm_fault+0x3c/0x1c90
[  161.271053]  ? mt_find+0x208/0x2a0
[  161.271088]  do_user_addr_fault+0x3c0/0x740
[  161.271095]  exc_page_fault+0x69/0x110
[  161.271099]  asm_exc_page_fault+0x26/0x30
[  161.271102] RIP: 0033:0x419fb0
[  161.271104] Code: 48 8d 3c 17 48 89 c1 48 85 d2 74 2e 48 89 fa 48 29 c2 83 e2 01 74 13 48 8d 48 01 40 88 71 ff 48 39 cf 74 17 66 0f 1f 44 00 00 <44> 88 01 48 83 c1 02 44 88 41 ff 48 39 cf
[  161.271105] RSP: 002b:00007fffd695b568 EFLAGS: 00010246
[  161.271107] RAX: 00007f4f395a0000 RBX: 00007fffd695b590 RCX: 00007f4f395a0000
[  161.271108] RDX: 0000000000000000 RSI: 00000000ffffffaa RDI: 00007f4f395a4000
[  161.271109] RBP: 0000000000000005 R08: 00000000ffffffaa R09: 0000000000000000
[  161.271109] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000001000
[  161.271110] R13: 000000002e03ab40 R14: 00007f4f395a0000 R15: 0000000000004000
[  161.271119]  </TASK>

> +#endif
> +};
> +
> +static struct kvm_gmem_inode_private *kvm_gmem_private(struct inode *inode)
> +{
> +	return inode->i_mapping->i_private_data;
> +}
> +
>  #ifdef CONFIG_KVM_GMEM_SHARED_MEM
>  void kvm_gmem_handle_folio_put(struct folio *folio)
>  {
> @@ -324,8 +336,28 @@ static pgoff_t kvm_gmem_get_index(struct kvm_memory_slot *slot, gfn_t gfn)
>  	return gfn - slot->base_gfn + slot->gmem.pgoff;
>  }
>  
> +static void kvm_gmem_evict_inode(struct inode *inode)
> +{
> +	struct kvm_gmem_inode_private *private = kvm_gmem_private(inode);
> +
> +#ifdef CONFIG_KVM_GMEM_SHARED_MEM
> +	/*
> +	 * .evict_inode can be called before private data is set up if there are
> +	 * issues during inode creation.
> +	 */
> +	if (private)
> +		xa_destroy(&private->shared_offsets);
> +#endif
> +
> +	truncate_inode_pages_final(inode->i_mapping);
> +
> +	kfree(private);
> +	clear_inode(inode);
> +}
> +
>  static const struct super_operations kvm_gmem_super_operations = {
> -	.statfs		= simple_statfs,
> +	.statfs         = simple_statfs,
> +	.evict_inode	= kvm_gmem_evict_inode,
>  };
>  
>  static int kvm_gmem_init_fs_context(struct fs_context *fc)
> @@ -553,6 +585,7 @@ static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
>  						      loff_t size, u64 flags)
>  {
>  	const struct qstr qname = QSTR_INIT(name, strlen(name));
> +	struct kvm_gmem_inode_private *private;
>  	struct inode *inode;
>  	int err;
>  
> @@ -561,10 +594,20 @@ static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
>  		return inode;
>  
>  	err = security_inode_init_security_anon(inode, &qname, NULL);
> -	if (err) {
> -		iput(inode);
> -		return ERR_PTR(err);
> -	}
> +	if (err)
> +		goto out;
> +
> +	err = -ENOMEM;
> +	private = kzalloc(sizeof(*private), GFP_KERNEL);
> +	if (!private)
> +		goto out;
> +
> +#ifdef CONFIG_KVM_GMEM_SHARED_MEM
> +	xa_init(&private->shared_offsets);
> +	rwlock_init(&private->offsets_lock);
> +#endif
> +
> +	inode->i_mapping->i_private_data = private;
>  
>  	inode->i_private = (void *)(unsigned long)flags;
>  	inode->i_op = &kvm_gmem_iops;
> @@ -577,6 +620,11 @@ static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
>  	WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
>  
>  	return inode;
> +
> +out:
> +	iput(inode);
> +
> +	return ERR_PTR(err);
>  }
>  
>  static struct file *kvm_gmem_inode_create_getfile(void *priv, loff_t size,


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v7 7/7] KVM: guest_memfd: Add a guest_memfd() flag to initialize it as shared
  2025-03-28 15:31 ` [PATCH v7 7/7] KVM: guest_memfd: Add a guest_memfd() flag to initialize it as shared Fuad Tabba
@ 2025-04-02 22:47   ` Ackerley Tng
  2025-04-04  7:39     ` Fuad Tabba
  2025-04-02 23:03   ` Ackerley Tng
  1 sibling, 1 reply; 30+ messages in thread
From: Ackerley Tng @ 2025-04-02 22:47 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe, anup,
	paul.walmsley, palmer, aou, seanjc, viro, brauner, willy, akpm,
	xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
	isaku.yamahata, mic, vbabka, vannapurve, mail, david,
	michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat, shuah,
	hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton, peterx,
	pankaj.gupta, tabba

Fuad Tabba <tabba@google.com> writes:

> Not all use cases require guest_memfd() to be shared with the host when
> first created. Add a new flag, GUEST_MEMFD_FLAG_INIT_SHARED, which when
> set on KVM_CREATE_GUEST_MEMFD initializes the memory as shared with the
> host, and therefore mappable by it. Otherwise, memory is private until
> explicitly shared by the guest with the host.
>
> Signed-off-by: Fuad Tabba <tabba@google.com>
> ---
>  Documentation/virt/kvm/api.rst                 |  4 ++++
>  include/uapi/linux/kvm.h                       |  1 +
>  tools/testing/selftests/kvm/guest_memfd_test.c |  7 +++++--
>  virt/kvm/guest_memfd.c                         | 12 ++++++++++++
>  4 files changed, 22 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 2b52eb77e29c..a5496d7d323b 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6386,6 +6386,10 @@ most one mapping per page, i.e. binding multiple memory regions to a single
>  guest_memfd range is not allowed (any number of memory regions can be bound to
>  a single guest_memfd file, but the bound ranges must not overlap).
>  
> +If the capability KVM_CAP_GMEM_SHARED_MEM is supported, then the flags field
> +supports GUEST_MEMFD_FLAG_INIT_SHARED, which initializes the memory as shared
> +with the host, and thereby, mappable by it.
> +
>  See KVM_SET_USER_MEMORY_REGION2 for additional details.
>  
>  4.143 KVM_PRE_FAULT_MEMORY
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 117937a895da..22d7e33bf09c 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1566,6 +1566,7 @@ struct kvm_memory_attributes {
>  #define KVM_MEMORY_ATTRIBUTE_PRIVATE           (1ULL << 3)
>  
>  #define KVM_CREATE_GUEST_MEMFD	_IOWR(KVMIO,  0xd4, struct kvm_create_guest_memfd)
> +#define GUEST_MEMFD_FLAG_INIT_SHARED		(1UL << 0)
>  
>  struct kvm_create_guest_memfd {
>  	__u64 size;
> diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c b/tools/testing/selftests/kvm/guest_memfd_test.c
> index 38c501e49e0e..4a7fcd6aa372 100644
> --- a/tools/testing/selftests/kvm/guest_memfd_test.c
> +++ b/tools/testing/selftests/kvm/guest_memfd_test.c
> @@ -159,7 +159,7 @@ static void test_invalid_punch_hole(int fd, size_t page_size, size_t total_size)
>  static void test_create_guest_memfd_invalid(struct kvm_vm *vm)
>  {
>  	size_t page_size = getpagesize();
> -	uint64_t flag;
> +	uint64_t flag = BIT(0);
>  	size_t size;
>  	int fd;
>  
> @@ -170,7 +170,10 @@ static void test_create_guest_memfd_invalid(struct kvm_vm *vm)
>  			    size);
>  	}
>  
> -	for (flag = BIT(0); flag; flag <<= 1) {
> +	if (kvm_has_cap(KVM_CAP_GMEM_SHARED_MEM))
> +		flag = GUEST_MEMFD_FLAG_INIT_SHARED << 1;
> +
> +	for (; flag; flag <<= 1) {

This would end up shifting the GUEST_MEMFD_FLAG_INIT_SHARED flag on
each loop iteration.

>  		fd = __vm_create_guest_memfd(vm, page_size, flag);
>  		TEST_ASSERT(fd == -1 && errno == EINVAL,
>  			    "guest_memfd() with flag '0x%lx' should fail with EINVAL",
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index eec9d5e09f09..32e149478b04 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -1069,6 +1069,15 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
>  		goto err_gmem;
>  	}
>  
> +	if (IS_ENABLED(CONFIG_KVM_GMEM_SHARED_MEM) &&
> +	    (flags & GUEST_MEMFD_FLAG_INIT_SHARED)) {
> +		err = kvm_gmem_offset_range_set_shared(file_inode(file), 0, size >> PAGE_SHIFT);
> +		if (err) {
> +			fput(file);
> +			goto err_gmem;
> +		}
> +	}
> +
>  	kvm_get_kvm(kvm);
>  	gmem->kvm = kvm;
>  	xa_init(&gmem->bindings);
> @@ -1090,6 +1099,9 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
>  	u64 flags = args->flags;
>  	u64 valid_flags = 0;
>  
> +	if (IS_ENABLED(CONFIG_KVM_GMEM_SHARED_MEM))
> +		valid_flags |= GUEST_MEMFD_FLAG_INIT_SHARED;
> +
>  	if (flags & ~valid_flags)
>  		return -EINVAL;


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v7 7/7] KVM: guest_memfd: Add a guest_memfd() flag to initialize it as shared
  2025-03-28 15:31 ` [PATCH v7 7/7] KVM: guest_memfd: Add a guest_memfd() flag to initialize it as shared Fuad Tabba
  2025-04-02 22:47   ` Ackerley Tng
@ 2025-04-02 23:03   ` Ackerley Tng
  2025-04-03 11:32     ` Fuad Tabba
  1 sibling, 1 reply; 30+ messages in thread
From: Ackerley Tng @ 2025-04-02 23:03 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe, anup,
	paul.walmsley, palmer, aou, seanjc, viro, brauner, willy, akpm,
	xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
	isaku.yamahata, mic, vbabka, vannapurve, mail, david,
	michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat, shuah,
	hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton, peterx,
	pankaj.gupta, tabba

Fuad Tabba <tabba@google.com> writes:

> Not all use cases require guest_memfd() to be shared with the host when
> first created. Add a new flag, GUEST_MEMFD_FLAG_INIT_SHARED, which when
> set on KVM_CREATE_GUEST_MEMFD initializes the memory as shared with the
> host, and therefore mappable by it. Otherwise, memory is private until
> explicitly shared by the guest with the host.
>
> Signed-off-by: Fuad Tabba <tabba@google.com>
> ---
>  Documentation/virt/kvm/api.rst                 |  4 ++++
>  include/uapi/linux/kvm.h                       |  1 +
>  tools/testing/selftests/kvm/guest_memfd_test.c |  7 +++++--
>  virt/kvm/guest_memfd.c                         | 12 ++++++++++++
>  4 files changed, 22 insertions(+), 2 deletions(-)
>
> <snip>
> 
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index eec9d5e09f09..32e149478b04 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -1069,6 +1069,15 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
>  		goto err_gmem;
>  	}
>  
> +	if (IS_ENABLED(CONFIG_KVM_GMEM_SHARED_MEM) &&
> +	    (flags & GUEST_MEMFD_FLAG_INIT_SHARED)) {
> +		err = kvm_gmem_offset_range_set_shared(file_inode(file), 0, size >> PAGE_SHIFT);

I think if GUEST_MEMFD_FLAG_INIT_SHARED is not set, we should call
kvm_gmem_offset_range_clear_shared(), so that there is always some
shareability defined for every offset in a file. Otherwise, when reading
shareability, we'd have to check against GUEST_MEMFD_FLAG_INIT_SHARED
to find out what to initialize it to.
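
Roughly, what I'm suggesting for __kvm_gmem_create() is something like
this (just a sketch; it assumes the clear path can reuse the same error
handling as the INIT_SHARED path above):

	if (IS_ENABLED(CONFIG_KVM_GMEM_SHARED_MEM)) {
		struct inode *inode = file_inode(file);
		pgoff_t nr_pages = size >> PAGE_SHIFT;

		if (flags & GUEST_MEMFD_FLAG_INIT_SHARED)
			err = kvm_gmem_offset_range_set_shared(inode, 0, nr_pages);
		else
			err = kvm_gmem_offset_range_clear_shared(inode, 0, nr_pages);

		if (err) {
			fput(file);
			goto err_gmem;
		}
	}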

> +		if (err) {
> +			fput(file);
> +			goto err_gmem;
> +		}
> +	}
> +
>  	kvm_get_kvm(kvm);
>  	gmem->kvm = kvm;
>  	xa_init(&gmem->bindings);
> @@ -1090,6 +1099,9 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
>  	u64 flags = args->flags;
>  	u64 valid_flags = 0;
>  
> +	if (IS_ENABLED(CONFIG_KVM_GMEM_SHARED_MEM))
> +		valid_flags |= GUEST_MEMFD_FLAG_INIT_SHARED;
> +
>  	if (flags & ~valid_flags)
>  		return -EINVAL;


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v7 4/7] KVM: guest_memfd: Folio sharing states and functions that manage their transition
  2025-03-28 15:31 ` [PATCH v7 4/7] KVM: guest_memfd: Folio sharing states and functions that manage their transition Fuad Tabba
@ 2025-04-02 23:48   ` Ackerley Tng
  2025-04-03  8:58     ` Fuad Tabba
  2025-04-03  0:19   ` Sean Christopherson
  1 sibling, 1 reply; 30+ messages in thread
From: Ackerley Tng @ 2025-04-02 23:48 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe, anup,
	paul.walmsley, palmer, aou, seanjc, viro, brauner, willy, akpm,
	xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
	isaku.yamahata, mic, vbabka, vannapurve, mail, david,
	michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat, shuah,
	hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton, peterx,
	pankaj.gupta, tabba

Fuad Tabba <tabba@google.com> writes:

> To allow in-place sharing of guest_memfd folios with the host,
> guest_memfd needs to track their sharing state, because mapping of
> shared folios will only be allowed where it safe to access these folios.
> It is safe to map and access these folios when explicitly shared with
> the host, or potentially if not yet exposed to the guest (e.g., at
> initialization).
>
> This patch introduces sharing states for guest_memfd folios as well as
> the functions that manage transitioning between those states.
>
> Signed-off-by: Fuad Tabba <tabba@google.com>
> ---
>  include/linux/kvm_host.h |  39 +++++++-
>  virt/kvm/guest_memfd.c   | 208 ++++++++++++++++++++++++++++++++++++---
>  virt/kvm/kvm_main.c      |  62 ++++++++++++
>  3 files changed, 295 insertions(+), 14 deletions(-)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index bc73d7426363..bf82faf16c53 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2600,7 +2600,44 @@ long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
>  #endif
>  
>  #ifdef CONFIG_KVM_GMEM_SHARED_MEM
> +int kvm_gmem_set_shared(struct kvm *kvm, gfn_t start, gfn_t end);
> +int kvm_gmem_clear_shared(struct kvm *kvm, gfn_t start, gfn_t end);
> +int kvm_gmem_slot_set_shared(struct kvm_memory_slot *slot, gfn_t start,
> +			     gfn_t end);
> +int kvm_gmem_slot_clear_shared(struct kvm_memory_slot *slot, gfn_t start,
> +			       gfn_t end);
> +bool kvm_gmem_slot_is_guest_shared(struct kvm_memory_slot *slot, gfn_t gfn);
>  void kvm_gmem_handle_folio_put(struct folio *folio);
> -#endif
> +#else
> +static inline int kvm_gmem_set_shared(struct kvm *kvm, gfn_t start, gfn_t end)
> +{
> +	WARN_ON_ONCE(1);
> +	return -EINVAL;
> +}
> +static inline int kvm_gmem_clear_shared(struct kvm *kvm, gfn_t start,
> +					gfn_t end)
> +{
> +	WARN_ON_ONCE(1);
> +	return -EINVAL;
> +}
> +static inline int kvm_gmem_slot_set_shared(struct kvm_memory_slot *slot,
> +					   gfn_t start, gfn_t end)
> +{
> +	WARN_ON_ONCE(1);
> +	return -EINVAL;
> +}
> +static inline int kvm_gmem_slot_clear_shared(struct kvm_memory_slot *slot,
> +					     gfn_t start, gfn_t end)
> +{
> +	WARN_ON_ONCE(1);
> +	return -EINVAL;
> +}
> +static inline bool kvm_gmem_slot_is_guest_shared(struct kvm_memory_slot *slot,
> +						 gfn_t gfn)
> +{
> +	WARN_ON_ONCE(1);
> +	return false;
> +}
> +#endif /* CONFIG_KVM_GMEM_SHARED_MEM */
>  
>  #endif
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index cde16ed3b230..3b4d724084a8 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -29,14 +29,6 @@ static struct kvm_gmem_inode_private *kvm_gmem_private(struct inode *inode)
>  	return inode->i_mapping->i_private_data;
>  }
>  
> -#ifdef CONFIG_KVM_GMEM_SHARED_MEM
> -void kvm_gmem_handle_folio_put(struct folio *folio)
> -{
> -	WARN_ONCE(1, "A placeholder that shouldn't trigger. Work in progress.");
> -}
> -EXPORT_SYMBOL_GPL(kvm_gmem_handle_folio_put);
> -#endif /* CONFIG_KVM_GMEM_SHARED_MEM */
> -
>  /**
>   * folio_file_pfn - like folio_file_page, but return a pfn.
>   * @folio: The folio which contains this index.
> @@ -389,22 +381,211 @@ static void kvm_gmem_init_mount(void)
>  }
>  
>  #ifdef CONFIG_KVM_GMEM_SHARED_MEM
> -static bool kvm_gmem_offset_is_shared(struct file *file, pgoff_t index)
> +/*
> + * An enum of the valid folio sharing states:
> + * Bit 0: set if not shared with the guest (guest cannot fault it in)
> + * Bit 1: set if not shared with the host (host cannot fault it in)
> + */
> +enum folio_shareability {
> +	KVM_GMEM_ALL_SHARED	= 0b00,	/* Shared with the host and the guest. */
> +	KVM_GMEM_GUEST_SHARED	= 0b10, /* Shared only with the guest. */
> +	KVM_GMEM_NONE_SHARED	= 0b11, /* Not shared, transient state. */
> +};
> +
> +static int kvm_gmem_offset_set_shared(struct inode *inode, pgoff_t index)
>  {
> -	struct kvm_gmem *gmem = file->private_data;
> +	struct xarray *shared_offsets = &kvm_gmem_private(inode)->shared_offsets;
> +	rwlock_t *offsets_lock = &kvm_gmem_private(inode)->offsets_lock;
> +	void *xval = xa_mk_value(KVM_GMEM_ALL_SHARED);
> +
> +	lockdep_assert_held_write(offsets_lock);
> +
> +	return xa_err(xa_store(shared_offsets, index, xval, GFP_KERNEL));
> +}
> +
> +/*
> + * Marks the range [start, end) as shared with both the host and the guest.
> + * Called when guest shares memory with the host.
> + */
> +static int kvm_gmem_offset_range_set_shared(struct inode *inode,
> +					    pgoff_t start, pgoff_t end)
> +{
> +	rwlock_t *offsets_lock = &kvm_gmem_private(inode)->offsets_lock;
> +	pgoff_t i;
> +	int r = 0;
> +
> +	write_lock(offsets_lock);
> +	for (i = start; i < end; i++) {
> +		r = kvm_gmem_offset_set_shared(inode, i);
> +		if (WARN_ON_ONCE(r))
> +			break;
> +	}
> +	write_unlock(offsets_lock);
> +
> +	return r;
> +}
> +
> +static int kvm_gmem_offset_clear_shared(struct inode *inode, pgoff_t index)
> +{
> +	struct xarray *shared_offsets = &kvm_gmem_private(inode)->shared_offsets;
> +	rwlock_t *offsets_lock = &kvm_gmem_private(inode)->offsets_lock;
> +	void *xval_guest = xa_mk_value(KVM_GMEM_GUEST_SHARED);
> +	void *xval_none = xa_mk_value(KVM_GMEM_NONE_SHARED);
> +	struct folio *folio;
> +	int refcount;
> +	int r;
> +
> +	lockdep_assert_held_write(offsets_lock);
> +
> +	folio = filemap_lock_folio(inode->i_mapping, index);
> +	if (!IS_ERR(folio)) {
> +		/* +1 references are expected because of filemap_lock_folio(). */
> +		refcount = folio_nr_pages(folio) + 1;
> +	} else {
> +		r = PTR_ERR(folio);
> +		if (WARN_ON_ONCE(r != -ENOENT))
> +			return r;
> +
> +		folio = NULL;
> +	}
> +
> +	if (!folio || folio_ref_freeze(folio, refcount)) {
> +		/*
> +		 * No outstanding references: transition to guest shared.
> +		 */
> +		r = xa_err(xa_store(shared_offsets, index, xval_guest, GFP_KERNEL));
> +
> +		if (folio)
> +			folio_ref_unfreeze(folio, refcount);
> +	} else {
> +		/*
> +		 * Outstanding references: the folio cannot be faulted in by
> +		 * anyone until they're dropped.
> +		 */
> +		r = xa_err(xa_store(shared_offsets, index, xval_none, GFP_KERNEL));

Once we do this on elevated refcounts, truncate needs to be updated to
handle the case where some folio is still in a KVM_GMEM_NONE_SHARED
state.

When a folio is found in a KVM_GMEM_NONE_SHARED state, the shareability
should be fast-forwarded to KVM_GMEM_GUEST_SHARED, and the filemap's
refcounts restored.  The folio can then be truncated from the filemap as
usual (which will drop the filemap's refcounts)
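
Roughly what I'm picturing, purely as a sketch (the helper name is made
up, and the exact refcount bookkeeping is an assumption on my part about
how the NONE_SHARED transition hands the filemap's references over to
the folio_put path):

static void kvm_gmem_restore_none_shared(struct inode *inode, pgoff_t index,
					 struct folio *folio)
{
	struct xarray *shared_offsets = &kvm_gmem_private(inode)->shared_offsets;
	void *xval = xa_mk_value(KVM_GMEM_GUEST_SHARED);

	if (xa_to_value(xa_load(shared_offsets, index)) != KVM_GMEM_NONE_SHARED)
		return;

	/* Fast-forward the shareability state... */
	WARN_ON_ONCE(xa_err(xa_store(shared_offsets, index, xval, GFP_KERNEL)));

	/*
	 * ...and give the filemap its references back (assumption: they were
	 * handed off when the offset went to NONE_SHARED), so that truncation
	 * can drop them as usual.
	 */
	folio_ref_add(folio, folio_nr_pages(folio));
}

The truncate path would call this for each folio it finds in the range
before removing it from the filemap.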

> +	}
> +
> +	if (folio) {
> +		folio_unlock(folio);
> +		folio_put(folio);
> +	}
> +
> +	return r;
> +}
>  
> +/*
> + * Marks the range [start, end) as not shared with the host. If the host doesn't
> + * have any references to a particular folio, then that folio is marked as
> + * shared with the guest.
> + *
> + * However, if the host still has references to the folio, then the folio is
> + * marked and not shared with anyone. Marking it as not shared allows draining
> + * all references from the host, and ensures that the hypervisor does not
> + * transition the folio to private, since the host still might access it.
> + *
> + * Called when guest unshares memory with the host.
> + */
> +static int kvm_gmem_offset_range_clear_shared(struct inode *inode,
> +					      pgoff_t start, pgoff_t end)
> +{
> +	rwlock_t *offsets_lock = &kvm_gmem_private(inode)->offsets_lock;
> +	pgoff_t i;
> +	int r = 0;
> +
> +	write_lock(offsets_lock);
> +	for (i = start; i < end; i++) {
> +		r = kvm_gmem_offset_clear_shared(inode, i);
> +		if (WARN_ON_ONCE(r))
> +			break;
> +	}
> +	write_unlock(offsets_lock);
> +
> +	return r;
> +}
> +
> +void kvm_gmem_handle_folio_put(struct folio *folio)
> +{
> +	WARN_ONCE(1, "A placeholder that shouldn't trigger. Work in progress.");
> +}
> +EXPORT_SYMBOL_GPL(kvm_gmem_handle_folio_put);
> +
> +/*
> + * Returns true if the folio is shared with the host and the guest.
> + *
> + * Must be called with the offsets_lock lock held.
> + */
> +static bool kvm_gmem_offset_is_shared(struct inode *inode, pgoff_t index)
> +{
> +	struct xarray *shared_offsets = &kvm_gmem_private(inode)->shared_offsets;
> +	rwlock_t *offsets_lock = &kvm_gmem_private(inode)->offsets_lock;
> +	unsigned long r;
> +
> +	lockdep_assert_held(offsets_lock);
>  
> -	/* For now, VMs that support shared memory share all their memory. */
> -	return kvm_arch_gmem_supports_shared_mem(gmem->kvm);
> +	r = xa_to_value(xa_load(shared_offsets, index));
> +
> +	return r == KVM_GMEM_ALL_SHARED;
> +}
> +
> +/*
> + * Returns true if the folio is shared with the guest (not transitioning).
> + *
> + * Must be called with the offsets_lock lock held.
> + */
> +static bool kvm_gmem_offset_is_guest_shared(struct inode *inode, pgoff_t index)
> +{
> +	struct xarray *shared_offsets = &kvm_gmem_private(inode)->shared_offsets;
> +	rwlock_t *offsets_lock = &kvm_gmem_private(inode)->offsets_lock;
> +	unsigned long r;
> +
> +	lockdep_assert_held(offsets_lock);
> +
> +	r = xa_to_value(xa_load(shared_offsets, index));
> +
> +	return (r == KVM_GMEM_ALL_SHARED || r == KVM_GMEM_GUEST_SHARED);
> +}
> +
> +int kvm_gmem_slot_set_shared(struct kvm_memory_slot *slot, gfn_t start, gfn_t end)
> +{
> +	struct inode *inode = file_inode(READ_ONCE(slot->gmem.file));
> +	pgoff_t start_off = slot->gmem.pgoff + start - slot->base_gfn;
> +	pgoff_t end_off = start_off + end - start;
> +
> +	return kvm_gmem_offset_range_set_shared(inode, start_off, end_off);
> +}
> +
> +int kvm_gmem_slot_clear_shared(struct kvm_memory_slot *slot, gfn_t start, gfn_t end)
> +{
> +	struct inode *inode = file_inode(READ_ONCE(slot->gmem.file));
> +	pgoff_t start_off = slot->gmem.pgoff + start - slot->base_gfn;
> +	pgoff_t end_off = start_off + end - start;
> +
> +	return kvm_gmem_offset_range_clear_shared(inode, start_off, end_off);
> +}
> +
> +bool kvm_gmem_slot_is_guest_shared(struct kvm_memory_slot *slot, gfn_t gfn)
> +{
> +	struct inode *inode = file_inode(READ_ONCE(slot->gmem.file));
> +	rwlock_t *offsets_lock = &kvm_gmem_private(inode)->offsets_lock;
> +	unsigned long pgoff = slot->gmem.pgoff + gfn - slot->base_gfn;
> +	bool r;
> +
> +	read_lock(offsets_lock);
> +	r = kvm_gmem_offset_is_guest_shared(inode, pgoff);
> +	read_unlock(offsets_lock);
> +
> +	return r;
>  }
>  
>  static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
>  {
>  	struct inode *inode = file_inode(vmf->vma->vm_file);
> +	rwlock_t *offsets_lock = &kvm_gmem_private(inode)->offsets_lock;
>  	struct folio *folio;
>  	vm_fault_t ret = VM_FAULT_LOCKED;
>  
>  	filemap_invalidate_lock_shared(inode->i_mapping);
> +	read_lock(offsets_lock);
>  
>  	folio = kvm_gmem_get_folio(inode, vmf->pgoff);
>  	if (IS_ERR(folio)) {
> @@ -423,7 +604,7 @@ static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
>  		goto out_folio;
>  	}
>  
> -	if (!kvm_gmem_offset_is_shared(vmf->vma->vm_file, vmf->pgoff)) {
> +	if (!kvm_gmem_offset_is_shared(inode, vmf->pgoff)) {
>  		ret = VM_FAULT_SIGBUS;
>  		goto out_folio;
>  	}
> @@ -457,6 +638,7 @@ static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
>  	}
>  
>  out_filemap:
> +	read_unlock(offsets_lock);
>  	filemap_invalidate_unlock_shared(inode->i_mapping);
>  
>  	return ret;
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 3e40acb9f5c0..90762252381c 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -3091,6 +3091,68 @@ static int next_segment(unsigned long len, int offset)
>  		return len;
>  }
>  
> +#ifdef CONFIG_KVM_GMEM_SHARED_MEM
> +int kvm_gmem_set_shared(struct kvm *kvm, gfn_t start, gfn_t end)
> +{
> +	struct kvm_memslot_iter iter;
> +	int r = 0;
> +
> +	mutex_lock(&kvm->slots_lock);
> +
> +	kvm_for_each_memslot_in_gfn_range(&iter, kvm_memslots(kvm), start, end) {
> +		struct kvm_memory_slot *memslot = iter.slot;
> +		gfn_t gfn_start, gfn_end;
> +
> +		if (!kvm_slot_can_be_private(memslot))
> +			continue;
> +
> +		gfn_start = max(start, memslot->base_gfn);
> +		gfn_end = min(end, memslot->base_gfn + memslot->npages);
> +		if (WARN_ON_ONCE(start >= end))
> +			continue;
> +
> +		r = kvm_gmem_slot_set_shared(memslot, gfn_start, gfn_end);
> +		if (WARN_ON_ONCE(r))
> +			break;
> +	}
> +
> +	mutex_unlock(&kvm->slots_lock);
> +
> +	return r;
> +}
> +EXPORT_SYMBOL_GPL(kvm_gmem_set_shared);
> +
> +int kvm_gmem_clear_shared(struct kvm *kvm, gfn_t start, gfn_t end)
> +{
> +	struct kvm_memslot_iter iter;
> +	int r = 0;
> +
> +	mutex_lock(&kvm->slots_lock);
> +
> +	kvm_for_each_memslot_in_gfn_range(&iter, kvm_memslots(kvm), start, end) {
> +		struct kvm_memory_slot *memslot = iter.slot;
> +		gfn_t gfn_start, gfn_end;
> +
> +		if (!kvm_slot_can_be_private(memslot))
> +			continue;
> +
> +		gfn_start = max(start, memslot->base_gfn);
> +		gfn_end = min(end, memslot->base_gfn + memslot->npages);
> +		if (WARN_ON_ONCE(start >= end))
> +			continue;
> +
> +		r = kvm_gmem_slot_clear_shared(memslot, gfn_start, gfn_end);
> +		if (WARN_ON_ONCE(r))
> +			break;
> +	}
> +
> +	mutex_unlock(&kvm->slots_lock);
> +
> +	return r;
> +}
> +EXPORT_SYMBOL_GPL(kvm_gmem_clear_shared);
> +#endif /* CONFIG_KVM_GMEM_SHARED_MEM */
> +
>  /* Copy @len bytes from guest memory at '(@gfn * PAGE_SIZE) + @offset' to @data */
>  static int __kvm_read_guest_page(struct kvm_memory_slot *slot, gfn_t gfn,
>  				 void *data, int offset, int len)


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v7 3/7] KVM: guest_memfd: Track folio sharing within a struct kvm_gmem_private
  2025-04-02 22:25   ` Ackerley Tng
@ 2025-04-02 23:56     ` Sean Christopherson
  2025-04-03  8:58       ` Fuad Tabba
  0 siblings, 1 reply; 30+ messages in thread
From: Sean Christopherson @ 2025-04-02 23:56 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: Fuad Tabba, kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai,
	mpe, anup, paul.walmsley, palmer, aou, viro, brauner, willy, akpm,
	xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
	isaku.yamahata, mic, vbabka, vannapurve, mail, david,
	michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat, shuah,
	hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton, peterx,
	pankaj.gupta

On Wed, Apr 02, 2025, Ackerley Tng wrote:
> > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> > index ac6b8853699d..cde16ed3b230 100644
> > --- a/virt/kvm/guest_memfd.c
> > +++ b/virt/kvm/guest_memfd.c
> > @@ -17,6 +17,18 @@ struct kvm_gmem {
> >  	struct list_head entry;
> >  };
> >  
> > +struct kvm_gmem_inode_private {
> > +#ifdef CONFIG_KVM_GMEM_SHARED_MEM
> > +	struct xarray shared_offsets;
> > +	rwlock_t offsets_lock;
> 
> This lock doesn't work, either that or this lock can't be held while
> faulting, because holding this lock means we can't sleep, and we need to
> sleep to allocate.

rwlock_t is a variant of a spinlock, which can't be held when sleeping.

What exactly does offsets_lock protect, and what are the rules for holding it?
At a glance, it's flawed.  Something needs to prevent KVM from installing a mapping
for a private gfn that is being converted to shared.  KVM doesn't hold references
to PFNs while they're mapped into the guest, and kvm_gmem_get_pfn() doesn't check
shared_offsets let alone take offsets_lock.

guest_memfd currently handles races between kvm_gmem_fault() and PUNCH_HOLE via
kvm_gmem_invalidate_{begin,end}().  I don't see any equivalent functionality in
the shared/private conversion code.

In general, this series is unreviewable as many of the APIs have no callers,
which makes it impossible to review the safety/correctness of locking.  Ditto
for all the stubs that are guarded by CONFIG_KVM_GMEM_SHARED_MEM.  

Lack of uAPI also makes this series unreviewable.  The tracking is done on the
guest_memfd inode, but it looks like the uAPI to toggle private/shared is going
to be attached to the VM.  Why?  That just adds extra locks and complexity to
ensure the memslots are stable.

Lastly, this series is at v7, but to be very blunt, it looks RFC quality to my
eyes.  On top of the fact that it's practically impossible to review, it doesn't
even compile on x86 when CONFIG_KVM=m.

  mm/swap.c:110:(.text+0x18ba): undefined reference to `kvm_gmem_handle_folio_put'
  ERROR: modpost: "security_inode_init_security_anon" [arch/x86/kvm/kvm.ko] undefined!

The latter can be solved with an EXPORT, but calling module code from core mm/
is a complete non-starter.

Maybe much of this has been discussed elsewhere and there's a grand
plan for how all of this will shake out.  I haven't been closely following
guest_memfd development due to lack of cycles, and unfortunately I don't expect
that to change in the near future.  I am more than happy to let y'all do the heavy
lifting, but when you submit something for inclusion (i.e. not RFC), then it needs
to actually be ready for inclusion.

I would much, much prefer one large series that shows the full picture than a
mish mash of partial series that I can't actually review, even if the big series
is 100+ patches (hopefully not).


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v7 4/7] KVM: guest_memfd: Folio sharing states and functions that manage their transition
  2025-03-28 15:31 ` [PATCH v7 4/7] KVM: guest_memfd: Folio sharing states and functions that manage their transition Fuad Tabba
  2025-04-02 23:48   ` Ackerley Tng
@ 2025-04-03  0:19   ` Sean Christopherson
  2025-04-03  9:11     ` Fuad Tabba
  1 sibling, 1 reply; 30+ messages in thread
From: Sean Christopherson @ 2025-04-03  0:19 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe, anup,
	paul.walmsley, palmer, aou, viro, brauner, willy, akpm,
	xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
	isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail, david,
	michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat, shuah,
	hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton, peterx,
	pankaj.gupta

On Fri, Mar 28, 2025, Fuad Tabba wrote:
> @@ -389,22 +381,211 @@ static void kvm_gmem_init_mount(void)
>  }
>  
>  #ifdef CONFIG_KVM_GMEM_SHARED_MEM
> -static bool kvm_gmem_offset_is_shared(struct file *file, pgoff_t index)
> +/*
> + * An enum of the valid folio sharing states:
> + * Bit 0: set if not shared with the guest (guest cannot fault it in)
> + * Bit 1: set if not shared with the host (host cannot fault it in)
> + */
> +enum folio_shareability {
> +	KVM_GMEM_ALL_SHARED	= 0b00,	/* Shared with the host and the guest. */
> +	KVM_GMEM_GUEST_SHARED	= 0b10, /* Shared only with the guest. */
> +	KVM_GMEM_NONE_SHARED	= 0b11, /* Not shared, transient state. */

Absolutely not.  The proper way to define bitmasks is to use BIT(xxx).  Based on
past discussions, I suspect you went this route so that the most common value
is '0' to avoid extra, but that should be an implementation detail buried deep
in the low level xarray handling, not a

The name is also bizarre and confusing.  To map memory into the guest as private,
it needs to be in KVM_GMEM_GUEST_SHARED.  That's completely unworkable.
Of course, it's not at all obvious that you're actually trying to create a bitmask.
The above looks like an inverted bitmask, but then it's used as if the values don't
matter.

	return (r == KVM_GMEM_ALL_SHARED || r == KVM_GMEM_GUEST_SHARED);

Given that I can't think of a sane use case for allowing guest_memfd to be mapped
into the host but not the guest (modulo temporary demand paging scenarios), I
think all we need is:

	KVM_GMEM_SHARED		  = BIT(0),
	KVM_GMEM_INVALID	  = BIT(1),

As for optimizing xarray storage, assuming it's actually a bitmask, simply let
KVM specify which bits to invert when storing/loading to/from the xarray so that
KVM can optimize storage for the most common value (which is presumably
KVM_GEM_SHARED on arm64?).

If KVM_GMEM_SHARED is the desired "default", invert bit 0, otherwise dont.  If
for some reason we get to a state where the default value is multiple bits, the
inversion trick still works.  E.g. if KVM_GMEM_SHARED where a composite value,
then invert bits 0 and 1.  The polarity shenanigans should be easy to hide in two
low level macros/helpers.
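
E.g. the low level helpers could be nothing more than (purely
illustrative, all names invented):

#define KVM_GMEM_SHARED		BIT(0)
#define KVM_GMEM_INVALID	BIT(1)

/* @invert = the most common state, so that it gets stored as "no entry". */
static void *kvm_gmem_state_to_entry(unsigned long state, unsigned long invert)
{
	return xa_mk_value(state ^ invert);
}

static unsigned long kvm_gmem_entry_to_state(void *entry, unsigned long invert)
{
	/* A missing entry naturally decodes to the most common state. */
	return (entry ? xa_to_value(entry) : 0) ^ invert;
}

where guest_memfd passes KVM_GMEM_SHARED as @invert on arm64 (or
whatever the arch's common case is).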

> +/*
> + * Returns true if the folio is shared with the host and the guest.

This is a superfluous comment.  Simple predicates should be self-explanatory
based on function name alone.

> + *
> + * Must be called with the offsets_lock lock held.

Drop these types of comments and document through code, i.e. via lockdep
assertions (which you already have).

> + */
> +static bool kvm_gmem_offset_is_shared(struct inode *inode, pgoff_t index)
> +{
> +	struct xarray *shared_offsets = &kvm_gmem_private(inode)->shared_offsets;
> +	rwlock_t *offsets_lock = &kvm_gmem_private(inode)->offsets_lock;
> +	unsigned long r;
> +
> +	lockdep_assert_held(offsets_lock);
>  
> -	/* For now, VMs that support shared memory share all their memory. */
> -	return kvm_arch_gmem_supports_shared_mem(gmem->kvm);
> +	r = xa_to_value(xa_load(shared_offsets, index));
> +
> +	return r == KVM_GMEM_ALL_SHARED;
> +}
> +
> +/*
> + * Returns true if the folio is shared with the guest (not transitioning).
> + *
> + * Must be called with the offsets_lock lock held.

See above.

>  static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)

This should be something like kvm_gmem_fault_shared() to make it abundantly
clear what's being done, because it took me a few looks to realize this is
faulting memory into host userspace, not into the guest.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v7 3/7] KVM: guest_memfd: Track folio sharing within a struct kvm_gmem_private
  2025-04-02 23:56     ` Sean Christopherson
@ 2025-04-03  8:58       ` Fuad Tabba
  2025-04-03 14:51         ` Sean Christopherson
  0 siblings, 1 reply; 30+ messages in thread
From: Fuad Tabba @ 2025-04-03  8:58 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Ackerley Tng, kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai,
	mpe, anup, paul.walmsley, palmer, aou, viro, brauner, willy, akpm,
	xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
	isaku.yamahata, mic, vbabka, vannapurve, mail, david,
	michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat, shuah,
	hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton, peterx,
	pankaj.gupta

Hi Ackerley and Sean,


On Thu, 3 Apr 2025 at 00:56, Sean Christopherson <seanjc@google.com> wrote:
>
> On Wed, Apr 02, 2025, Ackerley Tng wrote:
> > > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> > > index ac6b8853699d..cde16ed3b230 100644
> > > --- a/virt/kvm/guest_memfd.c
> > > +++ b/virt/kvm/guest_memfd.c
> > > @@ -17,6 +17,18 @@ struct kvm_gmem {
> > >     struct list_head entry;
> > >  };
> > >
> > > +struct kvm_gmem_inode_private {
> > > +#ifdef CONFIG_KVM_GMEM_SHARED_MEM
> > > +   struct xarray shared_offsets;
> > > +   rwlock_t offsets_lock;
> >
> > This lock doesn't work, either that or this lock can't be held while
> > faulting, because holding this lock means we can't sleep, and we need to
> > sleep to allocate.
>
> rwlock_t is a variant of a spinlock, which can't be held when sleeping.
>
> What exactly does offsets_lock protect, and what are the rules for holding it?
> At a glance, it's flawed.  Something needs to prevent KVM from installing a mapping
> for a private gfn that is being converted to shared.  KVM doesn't hold references
> to PFNs while they're mapped into the guest, and kvm_gmem_get_pfn() doesn't check
> shared_offsets let alone take offsets_lock.

You're right about the rwlock_t. The goal of the offsets_lock is to
protect the shared offsets -- i.e., it's only meant to protect the
SHARED/PRIVATE status of a folio, nothing more, which is why it's not
checked in kvm_gmem_get_pfn(). The array used to be protected by the
filemap_invalidate_lock, but the problem is that the lock would have to
be taken from an interrupt context.

However, this is wrong, as you've pointed out. The purpose of the
locking is to ensure that no two conversions of the same folio happen at
the same time. An alternative I had written up is to rely on having
exclusive access to the folio to guarantee that, since the sharing state
is tied to the folio. That could be achieved either by acquiring the
folio lock, or by ensuring that the folio doesn't have any outstanding
references, which indicates that we have exclusive access to it. This
would avoid the whole locking issue.

> ... Something needs to prevent KVM from installing a mapping
> for a private gfn that is being converted to shared.  ...

> guest_memfd currently handles races between kvm_gmem_fault() and PUNCH_HOLE via
> kvm_gmem_invalidate_{begin,end}().  I don't see any equivalent functionality in
> the shared/private conversion code.

For in-place sharing, KVM can install a mapping for a SHARED gfn. What
it cannot do is install a mapping for a transient (i.e., NONE) gfn. We
don't rely on kvm_gmem_get_pfn() for that, but on the individual KVM
MMU fault handlers. That said...

> In general, this series is unreviewable as many of the APIs have no callers,
> which makes it impossible to review the safety/correctness of locking.  Ditto
> for all the stubs that are guarded by CONFIG_KVM_GMEM_SHARED_MEM.
>
> Lack of uAPI also makes this series unreviewable.  The tracking is done on the
> guest_memfd inode, but it looks like the uAPI to toggle private/shared is going
> to be attached to the VM.  Why?  That just adds extra locks and complexity to
> ensure the memslots are stable.

You're right. Back to your point from above, it might make more sense
to have that logic in kvm_gmem_get_pfn() instead. I'll work something
out.

> Lastly, this series is at v7, but to be very blunt, it looks RFC quality to my
> eyes.  On top of the fact that it's practically impossible to review, it doesn't
> even compile on x86 when CONFIG_KVM=m.
>
>   mm/swap.c:110:(.text+0x18ba): undefined reference to `kvm_gmem_handle_folio_put'
>   ERROR: modpost: "security_inode_init_security_anon" [arch/x86/kvm/kvm.ko] undefined!
>
> The latter can be solved with an EXPORT, but calling module code from core mm/
> is a complete non-starter.
>
> Maybe much of this has been discussed elsewhere and there's a grand
> plan for how all of this will shake out.  I haven't been closely following
> guest_memfd development due to lack of cycles, and unfortunately I don't expect
> that to change in the near future.  I am more than happy to let y'all do the heavy
> lifting, but when you submit something for inclusion (i.e. not RFC), then it needs
> to actually be ready for inclusion.
>
> I would much, much prefer one large series that shows the full picture than a
> mish mash of partial series that I can't actually review, even if the big series
> is 100+ patches (hopefully not).

Dropping the RFC from the second series was not intentional; the first
series is the one where I intended to drop the RFC. I apologize for
that, especially since I obviously don't know how to handle modules
and wanted some input on how to do that :)

Splitting this series into two was my attempt at making it easier to
review, but I see it ended up making things more complicated. The next
one will be just one series.

Cheers,
/fuad


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v7 4/7] KVM: guest_memfd: Folio sharing states and functions that manage their transition
  2025-04-02 23:48   ` Ackerley Tng
@ 2025-04-03  8:58     ` Fuad Tabba
  0 siblings, 0 replies; 30+ messages in thread
From: Fuad Tabba @ 2025-04-03  8:58 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe, anup,
	paul.walmsley, palmer, aou, seanjc, viro, brauner, willy, akpm,
	xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
	isaku.yamahata, mic, vbabka, vannapurve, mail, david,
	michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat, shuah,
	hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton, peterx,
	pankaj.gupta

Hi Ackerley,

On Thu, 3 Apr 2025 at 00:48, Ackerley Tng <ackerleytng@google.com> wrote:
>
> Fuad Tabba <tabba@google.com> writes:
>
> > To allow in-place sharing of guest_memfd folios with the host,
> > guest_memfd needs to track their sharing state, because mapping of
> > shared folios will only be allowed where it safe to access these folios.
> > It is safe to map and access these folios when explicitly shared with
> > the host, or potentially if not yet exposed to the guest (e.g., at
> > initialization).
> >
> > This patch introduces sharing states for guest_memfd folios as well as
> > the functions that manage transitioning between those states.
> >
> > Signed-off-by: Fuad Tabba <tabba@google.com>
> > ---
> >  include/linux/kvm_host.h |  39 +++++++-
> >  virt/kvm/guest_memfd.c   | 208 ++++++++++++++++++++++++++++++++++++---
> >  virt/kvm/kvm_main.c      |  62 ++++++++++++
> >  3 files changed, 295 insertions(+), 14 deletions(-)
> >
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index bc73d7426363..bf82faf16c53 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -2600,7 +2600,44 @@ long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
> >  #endif
> >
> >  #ifdef CONFIG_KVM_GMEM_SHARED_MEM
> > +int kvm_gmem_set_shared(struct kvm *kvm, gfn_t start, gfn_t end);
> > +int kvm_gmem_clear_shared(struct kvm *kvm, gfn_t start, gfn_t end);
> > +int kvm_gmem_slot_set_shared(struct kvm_memory_slot *slot, gfn_t start,
> > +                          gfn_t end);
> > +int kvm_gmem_slot_clear_shared(struct kvm_memory_slot *slot, gfn_t start,
> > +                            gfn_t end);
> > +bool kvm_gmem_slot_is_guest_shared(struct kvm_memory_slot *slot, gfn_t gfn);
> >  void kvm_gmem_handle_folio_put(struct folio *folio);
> > -#endif
> > +#else
> > +static inline int kvm_gmem_set_shared(struct kvm *kvm, gfn_t start, gfn_t end)
> > +{
> > +     WARN_ON_ONCE(1);
> > +     return -EINVAL;
> > +}
> > +static inline int kvm_gmem_clear_shared(struct kvm *kvm, gfn_t start,
> > +                                     gfn_t end)
> > +{
> > +     WARN_ON_ONCE(1);
> > +     return -EINVAL;
> > +}
> > +static inline int kvm_gmem_slot_set_shared(struct kvm_memory_slot *slot,
> > +                                        gfn_t start, gfn_t end)
> > +{
> > +     WARN_ON_ONCE(1);
> > +     return -EINVAL;
> > +}
> > +static inline int kvm_gmem_slot_clear_shared(struct kvm_memory_slot *slot,
> > +                                          gfn_t start, gfn_t end)
> > +{
> > +     WARN_ON_ONCE(1);
> > +     return -EINVAL;
> > +}
> > +static inline bool kvm_gmem_slot_is_guest_shared(struct kvm_memory_slot *slot,
> > +                                              gfn_t gfn)
> > +{
> > +     WARN_ON_ONCE(1);
> > +     return false;
> > +}
> > +#endif /* CONFIG_KVM_GMEM_SHARED_MEM */
> >
> >  #endif
> > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> > index cde16ed3b230..3b4d724084a8 100644
> > --- a/virt/kvm/guest_memfd.c
> > +++ b/virt/kvm/guest_memfd.c
> > @@ -29,14 +29,6 @@ static struct kvm_gmem_inode_private *kvm_gmem_private(struct inode *inode)
> >       return inode->i_mapping->i_private_data;
> >  }
> >
> > -#ifdef CONFIG_KVM_GMEM_SHARED_MEM
> > -void kvm_gmem_handle_folio_put(struct folio *folio)
> > -{
> > -     WARN_ONCE(1, "A placeholder that shouldn't trigger. Work in progress.");
> > -}
> > -EXPORT_SYMBOL_GPL(kvm_gmem_handle_folio_put);
> > -#endif /* CONFIG_KVM_GMEM_SHARED_MEM */
> > -
> >  /**
> >   * folio_file_pfn - like folio_file_page, but return a pfn.
> >   * @folio: The folio which contains this index.
> > @@ -389,22 +381,211 @@ static void kvm_gmem_init_mount(void)
> >  }
> >
> >  #ifdef CONFIG_KVM_GMEM_SHARED_MEM
> > -static bool kvm_gmem_offset_is_shared(struct file *file, pgoff_t index)
> > +/*
> > + * An enum of the valid folio sharing states:
> > + * Bit 0: set if not shared with the guest (guest cannot fault it in)
> > + * Bit 1: set if not shared with the host (host cannot fault it in)
> > + */
> > +enum folio_shareability {
> > +     KVM_GMEM_ALL_SHARED     = 0b00, /* Shared with the host and the guest. */
> > +     KVM_GMEM_GUEST_SHARED   = 0b10, /* Shared only with the guest. */
> > +     KVM_GMEM_NONE_SHARED    = 0b11, /* Not shared, transient state. */
> > +};
> > +
> > +static int kvm_gmem_offset_set_shared(struct inode *inode, pgoff_t index)
> >  {
> > -     struct kvm_gmem *gmem = file->private_data;
> > +     struct xarray *shared_offsets = &kvm_gmem_private(inode)->shared_offsets;
> > +     rwlock_t *offsets_lock = &kvm_gmem_private(inode)->offsets_lock;
> > +     void *xval = xa_mk_value(KVM_GMEM_ALL_SHARED);
> > +
> > +     lockdep_assert_held_write(offsets_lock);
> > +
> > +     return xa_err(xa_store(shared_offsets, index, xval, GFP_KERNEL));
> > +}
> > +
> > +/*
> > + * Marks the range [start, end) as shared with both the host and the guest.
> > + * Called when guest shares memory with the host.
> > + */
> > +static int kvm_gmem_offset_range_set_shared(struct inode *inode,
> > +                                         pgoff_t start, pgoff_t end)
> > +{
> > +     rwlock_t *offsets_lock = &kvm_gmem_private(inode)->offsets_lock;
> > +     pgoff_t i;
> > +     int r = 0;
> > +
> > +     write_lock(offsets_lock);
> > +     for (i = start; i < end; i++) {
> > +             r = kvm_gmem_offset_set_shared(inode, i);
> > +             if (WARN_ON_ONCE(r))
> > +                     break;
> > +     }
> > +     write_unlock(offsets_lock);
> > +
> > +     return r;
> > +}
> > +
> > +static int kvm_gmem_offset_clear_shared(struct inode *inode, pgoff_t index)
> > +{
> > +     struct xarray *shared_offsets = &kvm_gmem_private(inode)->shared_offsets;
> > +     rwlock_t *offsets_lock = &kvm_gmem_private(inode)->offsets_lock;
> > +     void *xval_guest = xa_mk_value(KVM_GMEM_GUEST_SHARED);
> > +     void *xval_none = xa_mk_value(KVM_GMEM_NONE_SHARED);
> > +     struct folio *folio;
> > +     int refcount;
> > +     int r;
> > +
> > +     lockdep_assert_held_write(offsets_lock);
> > +
> > +     folio = filemap_lock_folio(inode->i_mapping, index);
> > +     if (!IS_ERR(folio)) {
> > +             /* +1 references are expected because of filemap_lock_folio(). */
> > +             refcount = folio_nr_pages(folio) + 1;
> > +     } else {
> > +             r = PTR_ERR(folio);
> > +             if (WARN_ON_ONCE(r != -ENOENT))
> > +                     return r;
> > +
> > +             folio = NULL;
> > +     }
> > +
> > +     if (!folio || folio_ref_freeze(folio, refcount)) {
> > +             /*
> > +              * No outstanding references: transition to guest shared.
> > +              */
> > +             r = xa_err(xa_store(shared_offsets, index, xval_guest, GFP_KERNEL));
> > +
> > +             if (folio)
> > +                     folio_ref_unfreeze(folio, refcount);
> > +     } else {
> > +             /*
> > +              * Outstanding references: the folio cannot be faulted in by
> > +              * anyone until they're dropped.
> > +              */
> > +             r = xa_err(xa_store(shared_offsets, index, xval_none, GFP_KERNEL));
>
> Once we do this on elevated refcounts, truncate needs to be updated to
> handle the case where some folio is still in a KVM_GMEM_NONE_SHARED
> state.
>
> When a folio is found in a KVM_GMEM_NONE_SHARED state, the shareability
> should be fast-forwarded to KVM_GMEM_GUEST_SHARED, and the filemap's
> refcounts restored.  The folio can then be truncated from the filemap as
> usual (which will drop the filemap's refcounts)

Ack.

Thanks,
/fuad


> > +     }
> > +
> > +     if (folio) {
> > +             folio_unlock(folio);
> > +             folio_put(folio);
> > +     }
> > +
> > +     return r;
> > +}
> >
> > +/*
> > + * Marks the range [start, end) as not shared with the host. If the host doesn't
> > + * have any references to a particular folio, then that folio is marked as
> > + * shared with the guest.
> > + *
> > + * However, if the host still has references to the folio, then the folio is
> > + * marked and not shared with anyone. Marking it as not shared allows draining
> > + * all references from the host, and ensures that the hypervisor does not
> > + * transition the folio to private, since the host still might access it.
> > + *
> > + * Called when guest unshares memory with the host.
> > + */
> > +static int kvm_gmem_offset_range_clear_shared(struct inode *inode,
> > +                                           pgoff_t start, pgoff_t end)
> > +{
> > +     rwlock_t *offsets_lock = &kvm_gmem_private(inode)->offsets_lock;
> > +     pgoff_t i;
> > +     int r = 0;
> > +
> > +     write_lock(offsets_lock);
> > +     for (i = start; i < end; i++) {
> > +             r = kvm_gmem_offset_clear_shared(inode, i);
> > +             if (WARN_ON_ONCE(r))
> > +                     break;
> > +     }
> > +     write_unlock(offsets_lock);
> > +
> > +     return r;
> > +}
> > +
> > +void kvm_gmem_handle_folio_put(struct folio *folio)
> > +{
> > +     WARN_ONCE(1, "A placeholder that shouldn't trigger. Work in progress.");
> > +}
> > +EXPORT_SYMBOL_GPL(kvm_gmem_handle_folio_put);
> > +
> > +/*
> > + * Returns true if the folio is shared with the host and the guest.
> > + *
> > + * Must be called with the offsets_lock lock held.
> > + */
> > +static bool kvm_gmem_offset_is_shared(struct inode *inode, pgoff_t index)
> > +{
> > +     struct xarray *shared_offsets = &kvm_gmem_private(inode)->shared_offsets;
> > +     rwlock_t *offsets_lock = &kvm_gmem_private(inode)->offsets_lock;
> > +     unsigned long r;
> > +
> > +     lockdep_assert_held(offsets_lock);
> >
> > -     /* For now, VMs that support shared memory share all their memory. */
> > -     return kvm_arch_gmem_supports_shared_mem(gmem->kvm);
> > +     r = xa_to_value(xa_load(shared_offsets, index));
> > +
> > +     return r == KVM_GMEM_ALL_SHARED;
> > +}
> > +
> > +/*
> > + * Returns true if the folio is shared with the guest (not transitioning).
> > + *
> > + * Must be called with the offsets_lock lock held.
> > + */
> > +static bool kvm_gmem_offset_is_guest_shared(struct inode *inode, pgoff_t index)
> > +{
> > +     struct xarray *shared_offsets = &kvm_gmem_private(inode)->shared_offsets;
> > +     rwlock_t *offsets_lock = &kvm_gmem_private(inode)->offsets_lock;
> > +     unsigned long r;
> > +
> > +     lockdep_assert_held(offsets_lock);
> > +
> > +     r = xa_to_value(xa_load(shared_offsets, index));
> > +
> > +     return (r == KVM_GMEM_ALL_SHARED || r == KVM_GMEM_GUEST_SHARED);
> > +}
> > +
> > +int kvm_gmem_slot_set_shared(struct kvm_memory_slot *slot, gfn_t start, gfn_t end)
> > +{
> > +     struct inode *inode = file_inode(READ_ONCE(slot->gmem.file));
> > +     pgoff_t start_off = slot->gmem.pgoff + start - slot->base_gfn;
> > +     pgoff_t end_off = start_off + end - start;
> > +
> > +     return kvm_gmem_offset_range_set_shared(inode, start_off, end_off);
> > +}
> > +
> > +int kvm_gmem_slot_clear_shared(struct kvm_memory_slot *slot, gfn_t start, gfn_t end)
> > +{
> > +     struct inode *inode = file_inode(READ_ONCE(slot->gmem.file));
> > +     pgoff_t start_off = slot->gmem.pgoff + start - slot->base_gfn;
> > +     pgoff_t end_off = start_off + end - start;
> > +
> > +     return kvm_gmem_offset_range_clear_shared(inode, start_off, end_off);
> > +}
> > +
> > +bool kvm_gmem_slot_is_guest_shared(struct kvm_memory_slot *slot, gfn_t gfn)
> > +{
> > +     struct inode *inode = file_inode(READ_ONCE(slot->gmem.file));
> > +     rwlock_t *offsets_lock = &kvm_gmem_private(inode)->offsets_lock;
> > +     unsigned long pgoff = slot->gmem.pgoff + gfn - slot->base_gfn;
> > +     bool r;
> > +
> > +     read_lock(offsets_lock);
> > +     r = kvm_gmem_offset_is_guest_shared(inode, pgoff);
> > +     read_unlock(offsets_lock);
> > +
> > +     return r;
> >  }
> >
> >  static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
> >  {
> >       struct inode *inode = file_inode(vmf->vma->vm_file);
> > +     rwlock_t *offsets_lock = &kvm_gmem_private(inode)->offsets_lock;
> >       struct folio *folio;
> >       vm_fault_t ret = VM_FAULT_LOCKED;
> >
> >       filemap_invalidate_lock_shared(inode->i_mapping);
> > +     read_lock(offsets_lock);
> >
> >       folio = kvm_gmem_get_folio(inode, vmf->pgoff);
> >       if (IS_ERR(folio)) {
> > @@ -423,7 +604,7 @@ static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
> >               goto out_folio;
> >       }
> >
> > -     if (!kvm_gmem_offset_is_shared(vmf->vma->vm_file, vmf->pgoff)) {
> > +     if (!kvm_gmem_offset_is_shared(inode, vmf->pgoff)) {
> >               ret = VM_FAULT_SIGBUS;
> >               goto out_folio;
> >       }
> > @@ -457,6 +638,7 @@ static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
> >       }
> >
> >  out_filemap:
> > +     read_unlock(offsets_lock);
> >       filemap_invalidate_unlock_shared(inode->i_mapping);
> >
> >       return ret;
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 3e40acb9f5c0..90762252381c 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -3091,6 +3091,68 @@ static int next_segment(unsigned long len, int offset)
> >               return len;
> >  }
> >
> > +#ifdef CONFIG_KVM_GMEM_SHARED_MEM
> > +int kvm_gmem_set_shared(struct kvm *kvm, gfn_t start, gfn_t end)
> > +{
> > +     struct kvm_memslot_iter iter;
> > +     int r = 0;
> > +
> > +     mutex_lock(&kvm->slots_lock);
> > +
> > +     kvm_for_each_memslot_in_gfn_range(&iter, kvm_memslots(kvm), start, end) {
> > +             struct kvm_memory_slot *memslot = iter.slot;
> > +             gfn_t gfn_start, gfn_end;
> > +
> > +             if (!kvm_slot_can_be_private(memslot))
> > +                     continue;
> > +
> > +             gfn_start = max(start, memslot->base_gfn);
> > +             gfn_end = min(end, memslot->base_gfn + memslot->npages);
> > +             if (WARN_ON_ONCE(start >= end))
> > +                     continue;
> > +
> > +             r = kvm_gmem_slot_set_shared(memslot, gfn_start, gfn_end);
> > +             if (WARN_ON_ONCE(r))
> > +                     break;
> > +     }
> > +
> > +     mutex_unlock(&kvm->slots_lock);
> > +
> > +     return r;
> > +}
> > +EXPORT_SYMBOL_GPL(kvm_gmem_set_shared);
> > +
> > +int kvm_gmem_clear_shared(struct kvm *kvm, gfn_t start, gfn_t end)
> > +{
> > +     struct kvm_memslot_iter iter;
> > +     int r = 0;
> > +
> > +     mutex_lock(&kvm->slots_lock);
> > +
> > +     kvm_for_each_memslot_in_gfn_range(&iter, kvm_memslots(kvm), start, end) {
> > +             struct kvm_memory_slot *memslot = iter.slot;
> > +             gfn_t gfn_start, gfn_end;
> > +
> > +             if (!kvm_slot_can_be_private(memslot))
> > +                     continue;
> > +
> > +             gfn_start = max(start, memslot->base_gfn);
> > +             gfn_end = min(end, memslot->base_gfn + memslot->npages);
> > +             if (WARN_ON_ONCE(start >= end))
> > +                     continue;
> > +
> > +             r = kvm_gmem_slot_clear_shared(memslot, gfn_start, gfn_end);
> > +             if (WARN_ON_ONCE(r))
> > +                     break;
> > +     }
> > +
> > +     mutex_unlock(&kvm->slots_lock);
> > +
> > +     return r;
> > +}
> > +EXPORT_SYMBOL_GPL(kvm_gmem_clear_shared);
> > +#endif /* CONFIG_KVM_GMEM_SHARED_MEM */
> > +
> >  /* Copy @len bytes from guest memory at '(@gfn * PAGE_SIZE) + @offset' to @data */
> >  static int __kvm_read_guest_page(struct kvm_memory_slot *slot, gfn_t gfn,
> >                                void *data, int offset, int len)
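
For context, a hypothetical caller of the two exported helpers above, e.g. a
host-side handler for a guest "share" hypercall in pKVM, might look like the
sketch below (purely illustrative, not part of the posted series):

	static int handle_guest_share_hypercall(struct kvm *kvm, gfn_t gfn,
						u64 nr_pages)
	{
		/* Share [gfn, gfn + nr_pages) with the host; takes slots_lock. */
		return kvm_gmem_set_shared(kvm, gfn, gfn + nr_pages);
	}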


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v7 4/7] KVM: guest_memfd: Folio sharing states and functions that manage their transition
  2025-04-03  0:19   ` Sean Christopherson
@ 2025-04-03  9:11     ` Fuad Tabba
  2025-04-03 14:52       ` Sean Christopherson
  0 siblings, 1 reply; 30+ messages in thread
From: Fuad Tabba @ 2025-04-03  9:11 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe, anup,
	paul.walmsley, palmer, aou, viro, brauner, willy, akpm,
	xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
	isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail, david,
	michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat, shuah,
	hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton, peterx,
	pankaj.gupta

Hi Sean,

On Thu, 3 Apr 2025 at 01:19, Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, Mar 28, 2025, Fuad Tabba wrote:
> > @@ -389,22 +381,211 @@ static void kvm_gmem_init_mount(void)
> >  }
> >
> >  #ifdef CONFIG_KVM_GMEM_SHARED_MEM
> > -static bool kvm_gmem_offset_is_shared(struct file *file, pgoff_t index)
> > +/*
> > + * An enum of the valid folio sharing states:
> > + * Bit 0: set if not shared with the guest (guest cannot fault it in)
> > + * Bit 1: set if not shared with the host (host cannot fault it in)
> > + */
> > +enum folio_shareability {
> > +     KVM_GMEM_ALL_SHARED     = 0b00, /* Shared with the host and the guest. */
> > +     KVM_GMEM_GUEST_SHARED   = 0b10, /* Shared only with the guest. */
> > +     KVM_GMEM_NONE_SHARED    = 0b11, /* Not shared, transient state. */
>
> Absolutely not.  The proper way to define bitmasks is to use BIT(xxx).  Based on
> past discussions, I suspect you went this route so that the most common value
> is '0' to avoid extra xarray storage, but that should be an implementation
> detail buried deep in the low level xarray handling.
>
> The name is also bizarre and confusing.  To map memory into the guest as private,
> it needs to be in KVM_GMEM_GUEST_SHARED.  That's completely unworkable.
> Of course, it's not at all obvious that you're actually trying to create a bitmask.
> The above looks like an inverted bitmask, but then it's used as if the values don't
> matter.
>
>         return (r == KVM_GMEM_ALL_SHARED || r == KVM_GMEM_GUEST_SHARED);

Ack.

> Given that I can't think of a sane use case for allowing guest_memfd to be mapped
> into the host but not the guest (modulo temporary demand paging scenarios), I
> think all we need is:
>
>         KVM_GMEM_SHARED           = BIT(0),
>         KVM_GMEM_INVALID          = BIT(1),

We need the third state for the transient case, i.e., when a page is
transitioning from being shared with the host to going back to
private, in order to ensure that neither the guest nor the host can
install a mapping/fault it in. But I see your point.

> As for optimizing xarray storage, assuming it's actually a bitmask, simply let
> KVM specify which bits to invert when storing/loading to/from the xarray so that
> KVM can optimize storage for the most common value (which is presumably
> KVM_GMEM_SHARED on arm64?).
>
> If KVM_GMEM_SHARED is the desired "default", invert bit 0, otherwise don't.  If
> for some reason we get to a state where the default value is multiple bits, the
> inversion trick still works.  E.g. if KVM_GMEM_SHARED were a composite value,
> then invert bits 0 and 1.  The polarity shenanigans should be easy to hide in two
> low level macros/helpers.

Ack.
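
For illustration, the store/load inversion could be hidden behind two small
helpers along these lines (the helper names and the choice of default are
assumptions, not part of the posted series):

	/* Encode states so the arch's most common state is an empty slot. */
	#define KVM_GMEM_XA_DEFAULT_STATE	KVM_GMEM_SHARED	/* e.g. arm64/pKVM */

	static int kvm_gmem_store_state(struct xarray *shared_offsets,
					pgoff_t index, unsigned long state)
	{
		unsigned long encoded = state ^ KVM_GMEM_XA_DEFAULT_STATE;

		if (!encoded) {
			/* The default state is represented by an empty slot. */
			xa_erase(shared_offsets, index);
			return 0;
		}

		return xa_err(xa_store(shared_offsets, index,
				       xa_mk_value(encoded), GFP_KERNEL));
	}

	static unsigned long kvm_gmem_load_state(struct xarray *shared_offsets,
						 pgoff_t index)
	{
		/* An empty slot decodes back to the default state. */
		return xa_to_value(xa_load(shared_offsets, index)) ^
		       KVM_GMEM_XA_DEFAULT_STATE;
	}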


> > +/*
> > + * Returns true if the folio is shared with the host and the guest.
>
> This is a superfluous comment.  Simple predicates should be self-explanatory
> based on function name alone.
>
> > + *
> > + * Must be called with the offsets_lock lock held.
>
> Drop these types of comments and document through code, i.e. via lockdep
> assertions (which you already have).

Ack.

> > + */
> > +static bool kvm_gmem_offset_is_shared(struct inode *inode, pgoff_t index)
> > +{
> > +     struct xarray *shared_offsets = &kvm_gmem_private(inode)->shared_offsets;
> > +     rwlock_t *offsets_lock = &kvm_gmem_private(inode)->offsets_lock;
> > +     unsigned long r;
> > +
> > +     lockdep_assert_held(offsets_lock);
> >
> > -     /* For now, VMs that support shared memory share all their memory. */
> > -     return kvm_arch_gmem_supports_shared_mem(gmem->kvm);
> > +     r = xa_to_value(xa_load(shared_offsets, index));
> > +
> > +     return r == KVM_GMEM_ALL_SHARED;
> > +}
> > +
> > +/*
> > + * Returns true if the folio is shared with the guest (not transitioning).
> > + *
> > + * Must be called with the offsets_lock lock held.
>
> See above.

Ack.

> >  static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
>
> This should be something like kvm_gmem_fault_shared() to make it abundantly clear
> what's being done.  Because it took me a few looks to realize this is faulting
> memory into host userspace, not into the guest.

Ack.

Thanks!


/fuad


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v7 7/7] KVM: guest_memfd: Add a guest_memfd() flag to initialize it as shared
  2025-04-02 23:03   ` Ackerley Tng
@ 2025-04-03 11:32     ` Fuad Tabba
  0 siblings, 0 replies; 30+ messages in thread
From: Fuad Tabba @ 2025-04-03 11:32 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe, anup,
	paul.walmsley, palmer, aou, seanjc, viro, brauner, willy, akpm,
	xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
	isaku.yamahata, mic, vbabka, vannapurve, mail, david,
	michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat, shuah,
	hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton, peterx,
	pankaj.gupta

Hi Ackerley,

On Thu, 3 Apr 2025 at 00:03, Ackerley Tng <ackerleytng@google.com> wrote:
>
> Fuad Tabba <tabba@google.com> writes:
>
> > Not all use cases require guest_memfd() to be shared with the host when
> > first created. Add a new flag, GUEST_MEMFD_FLAG_INIT_SHARED, which when
> > set on KVM_CREATE_GUEST_MEMFD initializes the memory as shared with the
> > host, and therefore mappable by it. Otherwise, memory is private until
> > explicitly shared by the guest with the host.
> >
> > Signed-off-by: Fuad Tabba <tabba@google.com>
> > ---
> >  Documentation/virt/kvm/api.rst                 |  4 ++++
> >  include/uapi/linux/kvm.h                       |  1 +
> >  tools/testing/selftests/kvm/guest_memfd_test.c |  7 +++++--
> >  virt/kvm/guest_memfd.c                         | 12 ++++++++++++
> >  4 files changed, 22 insertions(+), 2 deletions(-)
> >
> > <snip>
> >
> > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> > index eec9d5e09f09..32e149478b04 100644
> > --- a/virt/kvm/guest_memfd.c
> > +++ b/virt/kvm/guest_memfd.c
> > @@ -1069,6 +1069,15 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
> >               goto err_gmem;
> >       }
> >
> > +     if (IS_ENABLED(CONFIG_KVM_GMEM_SHARED_MEM) &&
> > +         (flags & GUEST_MEMFD_FLAG_INIT_SHARED)) {
> > +             err = kvm_gmem_offset_range_set_shared(file_inode(file), 0, size >> PAGE_SHIFT);
>
> I think if GUEST_MEMFD_FLAG_INIT_SHARED is not set, we should call
> kvm_gmem_offset_range_clear_shared(); so that there is always some
> shareability defined for all offsets in a file. Otherwise, when reading
> shareability, we'd have to check against GUEST_MEMFD_FLAG_INIT_SHARED
> to find out what to initialize it to.

Ack.

Thanks!

/fuad
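
A rough sketch of how that suggestion could be folded into __kvm_gmem_create(),
assuming the helpers keep the names used in this series, so that every offset
has a defined shareability from creation onwards:

	if (IS_ENABLED(CONFIG_KVM_GMEM_SHARED_MEM)) {
		pgoff_t nr_pages = size >> PAGE_SHIFT;

		if (flags & GUEST_MEMFD_FLAG_INIT_SHARED)
			err = kvm_gmem_offset_range_set_shared(file_inode(file),
							       0, nr_pages);
		else
			err = kvm_gmem_offset_range_clear_shared(file_inode(file),
								 0, nr_pages);
		if (err) {
			fput(file);
			goto err_gmem;
		}
	}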

> > +             if (err) {
> > +                     fput(file);
> > +                     goto err_gmem;
> > +             }
> > +     }
> > +
> >       kvm_get_kvm(kvm);
> >       gmem->kvm = kvm;
> >       xa_init(&gmem->bindings);
> > @@ -1090,6 +1099,9 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
> >       u64 flags = args->flags;
> >       u64 valid_flags = 0;
> >
> > +     if (IS_ENABLED(CONFIG_KVM_GMEM_SHARED_MEM))
> > +             valid_flags |= GUEST_MEMFD_FLAG_INIT_SHARED;
> > +
> >       if (flags & ~valid_flags)
> >               return -EINVAL;


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v7 3/7] KVM: guest_memfd: Track folio sharing within a struct kvm_gmem_private
  2025-04-03  8:58       ` Fuad Tabba
@ 2025-04-03 14:51         ` Sean Christopherson
  2025-04-04  6:43           ` Fuad Tabba
  0 siblings, 1 reply; 30+ messages in thread
From: Sean Christopherson @ 2025-04-03 14:51 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: Ackerley Tng, kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai,
	mpe, anup, paul.walmsley, palmer, aou, viro, brauner, willy, akpm,
	xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
	isaku.yamahata, mic, vbabka, vannapurve, mail, david,
	michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat, shuah,
	hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton, peterx,
	pankaj.gupta

On Thu, Apr 03, 2025, Fuad Tabba wrote:
> On Thu, 3 Apr 2025 at 00:56, Sean Christopherson <seanjc@google.com> wrote:
> > On Wed, Apr 02, 2025, Ackerley Tng wrote:
> > > > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> > > > index ac6b8853699d..cde16ed3b230 100644
> > > > --- a/virt/kvm/guest_memfd.c
> > > > +++ b/virt/kvm/guest_memfd.c
> > > > @@ -17,6 +17,18 @@ struct kvm_gmem {
> > > >     struct list_head entry;
> > > >  };
> > > >
> > > > +struct kvm_gmem_inode_private {
> > > > +#ifdef CONFIG_KVM_GMEM_SHARED_MEM
> > > > +   struct xarray shared_offsets;
> > > > +   rwlock_t offsets_lock;
> > >
> > > This lock doesn't work, either that or this lock can't be held while
> > > faulting, because holding this lock means we can't sleep, and we need to
> > > sleep to allocate.
> >
> > rwlock_t is a variant of a spinlock, which can't be held when sleeping.
> >
> > What exactly does offsets_lock protect, and what are the rules for holding it?
> > At a glance, it's flawed.  Something needs to prevent KVM from installing a mapping
> > for a private gfn that is being converted to shared.  KVM doesn't hold references
> > to PFNs while they're mapped into the guest, and kvm_gmem_get_pfn() doesn't check
> > shared_offsets let alone take offsets_lock.
> 
> You're right about the rwlock_t. The goal of the offsets_lock is to
> protect the shared offsets -- i.e., it's just meant to protect the
> SHARED/PRIVATE status of a folio, not more, hence why it's not checked
> in kvm_gmem_get_pfn(). It used to be protected by the
> filemap_invalidate_lock, but the problem is that it would be called
> from an interrupt context.
> 
> However, this is wrong, as you've pointed out. The purpose of locking
> is to ensure  that no two conversions of the same folio happen at the
> same time. An alternative I had written up is to rely on having
> exclusive access to the folio to ensure that, since this is tied to
> the folio. That could be either by acquiring the folio lock, or
> ensuring that the folio doesn't have any outstanding references,
> indicating that we have exclusive access to it. This would avoid the
> whole locking issue.
> 
> > ... Something needs to prevent KVM from installing a mapping
> > for a private gfn that is being converted to shared.  ...
> 
> > guest_memfd currently handles races between kvm_gmem_fault() and PUNCH_HOLE via
> > kvm_gmem_invalidate_{begin,end}().  I don't see any equivalent functionality in
> > the shared/private conversion code.
> 
> For in-place sharing, KVM can install a mapping for a SHARED gfn. What
> it cannot do is install a mapping for a transient (i.e., NONE) gfn. We
> don't rely on kvm_gmem_get_pfn() for that, but on the individual KVM
> mmu fault handlers, but that said...

Consumption of shared/private physical pages _must_ be enforced by guest_memfd.
The private vs. shared state in the MMU handlers is that VM's view of the world
> and desired state.  The guest_memfd inode is the single source of truth for the
state of the _physical_ page.

E.g. on TDX, if KVM installs a private SPTE for a PFN that is in actuality shared,
there will be machine checks and the host will likely crash.

> > > I would much, much prefer one large series that shows the full picture over a
> > mish mash of partial series that I can't actually review, even if the big series
> > is 100+ patches (hopefully not).
> 
> Dropping the RFC from the second series was not intentional, the first
> series is the one where I intended to drop the RFC. I apologize for
> that.  Especially since I obviously don't know how to handle modules
> and wanted some input on how to do that :)

In this case, the rules for modules are pretty simple.  Code in mm/ can't call
into KVM.  Either avoid callbacks entirely, or implement via a layer of
indirection, e.g. function pointer or ops table, so that KVM can provide its
implementation at runtime.
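
A minimal sketch of the ops-table flavour of that indirection, with all names
hypothetical:

	/* mm/ side: no direct calls into KVM, just an optional ops table. */
	struct guestmem_ops {
		void (*handle_folio_put)(struct folio *folio);
	};

	static const struct guestmem_ops *guestmem_ops;

	void guestmem_register_ops(const struct guestmem_ops *ops)
	{
		/* KVM (possibly built as a module) installs its callbacks here. */
		WRITE_ONCE(guestmem_ops, ops);
	}

	void guestmem_handle_folio_put(struct folio *folio)
	{
		const struct guestmem_ops *ops = READ_ONCE(guestmem_ops);

		if (ops && ops->handle_folio_put)
			ops->handle_folio_put(folio);
	}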


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v7 4/7] KVM: guest_memfd: Folio sharing states and functions that manage their transition
  2025-04-03  9:11     ` Fuad Tabba
@ 2025-04-03 14:52       ` Sean Christopherson
  2025-04-04  6:44         ` Fuad Tabba
  0 siblings, 1 reply; 30+ messages in thread
From: Sean Christopherson @ 2025-04-03 14:52 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe, anup,
	paul.walmsley, palmer, aou, viro, brauner, willy, akpm,
	xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
	isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail, david,
	michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat, shuah,
	hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton, peterx,
	pankaj.gupta

On Thu, Apr 03, 2025, Fuad Tabba wrote:
> Hi Sean,
> 
> On Thu, 3 Apr 2025 at 01:19, Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Fri, Mar 28, 2025, Fuad Tabba wrote:
> > > @@ -389,22 +381,211 @@ static void kvm_gmem_init_mount(void)
> > >  }
> > >
> > >  #ifdef CONFIG_KVM_GMEM_SHARED_MEM
> > > -static bool kvm_gmem_offset_is_shared(struct file *file, pgoff_t index)
> > > +/*
> > > + * An enum of the valid folio sharing states:
> > > + * Bit 0: set if not shared with the guest (guest cannot fault it in)
> > > + * Bit 1: set if not shared with the host (host cannot fault it in)
> > > + */
> > > +enum folio_shareability {
> > > +     KVM_GMEM_ALL_SHARED     = 0b00, /* Shared with the host and the guest. */
> > > +     KVM_GMEM_GUEST_SHARED   = 0b10, /* Shared only with the guest. */
> > > +     KVM_GMEM_NONE_SHARED    = 0b11, /* Not shared, transient state. */
> >
> > Absolutely not.  The proper way to define bitmasks is to use BIT(xxx).  Based on
> > past discussions, I suspect you went this route so that the most common value
> > is '0' to avoid extra xarray storage, but that should be an implementation
> > detail buried deep in the low level xarray handling.
> >
> > The name is also bizarre and confusing.  To map memory into the guest as private,
> > it needs to be in KVM_GMEM_GUEST_SHARED.  That's completely unworkable.
> > Of course, it's not at all obvious that you're actually trying to create a bitmask.
> > The above looks like an inverted bitmask, but then it's used as if the values don't
> > matter.
> >
> >         return (r == KVM_GMEM_ALL_SHARED || r == KVM_GMEM_GUEST_SHARED);
> 
> Ack.
> 
> > Given that I can't think of a sane use case for allowing guest_memfd to be mapped
> > into the host but not the guest (modulo temporary demand paging scenarios), I
> > think all we need is:
> >
> >         KVM_GMEM_SHARED           = BIT(0),
> >         KVM_GMEM_INVALID          = BIT(1),
> 
> We need the third state for the transient case, i.e., when a page is
> transitioning from being shared with the host to going back to
> private, in order to ensure that neither the guest nor the host can
> install a mapping/fault it in. But I see your point.

That's KVM_GMEM_INVALID.  Translating to what you had:

  KVM_GMEM_ALL_SHARED   = KVM_GMEM_SHARED
  KVM_GMEM_GUEST_SHARED = 0
  KVM_GMEM_NONE_SHARED  = KVM_GMEM_INVALID


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v7 3/7] KVM: guest_memfd: Track folio sharing within a struct kvm_gmem_private
  2025-04-03 14:51         ` Sean Christopherson
@ 2025-04-04  6:43           ` Fuad Tabba
  0 siblings, 0 replies; 30+ messages in thread
From: Fuad Tabba @ 2025-04-04  6:43 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Ackerley Tng, kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai,
	mpe, anup, paul.walmsley, palmer, aou, viro, brauner, willy, akpm,
	xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
	isaku.yamahata, mic, vbabka, vannapurve, mail, david,
	michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat, shuah,
	hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton, peterx,
	pankaj.gupta

Hi Sean,

On Thu, 3 Apr 2025 at 15:51, Sean Christopherson <seanjc@google.com> wrote:
>
> On Thu, Apr 03, 2025, Fuad Tabba wrote:
> > On Thu, 3 Apr 2025 at 00:56, Sean Christopherson <seanjc@google.com> wrote:
> > > On Wed, Apr 02, 2025, Ackerley Tng wrote:
> > > > > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> > > > > index ac6b8853699d..cde16ed3b230 100644
> > > > > --- a/virt/kvm/guest_memfd.c
> > > > > +++ b/virt/kvm/guest_memfd.c
> > > > > @@ -17,6 +17,18 @@ struct kvm_gmem {
> > > > >     struct list_head entry;
> > > > >  };
> > > > >
> > > > > +struct kvm_gmem_inode_private {
> > > > > +#ifdef CONFIG_KVM_GMEM_SHARED_MEM
> > > > > +   struct xarray shared_offsets;
> > > > > +   rwlock_t offsets_lock;
> > > >
> > > > This lock doesn't work, either that or this lock can't be held while
> > > > faulting, because holding this lock means we can't sleep, and we need to
> > > > sleep to allocate.
> > >
> > > rwlock_t is a variant of a spinlock, which can't be held when sleeping.
> > >
> > > What exactly does offsets_lock protect, and what are the rules for holding it?
> > > At a glance, it's flawed.  Something needs to prevent KVM from installing a mapping
> > > for a private gfn that is being converted to shared.  KVM doesn't hold references
> > > to PFNs while they're mapped into the guest, and kvm_gmem_get_pfn() doesn't check
> > > shared_offsets let alone take offsets_lock.
> >
> > You're right about the rwlock_t. The goal of the offsets_lock is to
> > protect the shared offsets -- i.e., it's just meant to protect the
> > SHARED/PRIVATE status of a folio, not more, hence why it's not checked
> > in kvm_gmem_get_pfn(). It used to be protected by the
> > filemap_invalidate_lock, but the problem is that it would be called
> > from an interrupt context.
> >
> > However, this is wrong, as you've pointed out. The purpose of locking
> > is to ensure  that no two conversions of the same folio happen at the
> > same time. An alternative I had written up is to rely on having
> > exclusive access to the folio to ensure that, since this is tied to
> > the folio. That could be either by acquiring the folio lock, or
> > ensuring that the folio doesn't have any outstanding references,
> > indicating that we have exclusive access to it. This would avoid the
> > whole locking issue.
> >
> > > ... Something needs to prevent KVM from installing a mapping
> > > for a private gfn that is being converted to shared.  ...
> >
> > > guest_memfd currently handles races between kvm_gmem_fault() and PUNCH_HOLE via
> > > kvm_gmem_invalidate_{begin,end}().  I don't see any equivalent functionality in
> > > the shared/private conversion code.
> >
> > For in-place sharing, KVM can install a mapping for a SHARED gfn. What
> > it cannot do is install a mapping for a transient (i.e., NONE) gfn. We
> > don't rely on kvm_gmem_get_pfn() for that, but on the individual KVM
> > mmu fault handlers, but that said...
>
> Consumption of shared/private physical pages _must_ be enforced by guest_memfd.
> The private vs. shared state in the MMU handlers is that VM's view of the world
> > and desired state.  The guest_memfd inode is the single source of truth for the
> state of the _physical_ page.
>
> E.g. on TDX, if KVM installs a private SPTE for a PFN that is in actuality shared,
> there will be machine checks and the host will likely crash.

I agree. As a plus, I've made that change and it actually simplifies the logic.

> > > I would much, much prefer one large series that shows the full picture over a
> > > mish mash of partial series that I can't actually review, even if the big series
> > > is 100+ patches (hopefully not).
> >
> > Dropping the RFC from the second series was not intentional, the first
> > series is the one where I intended to drop the RFC. I apologize for
> > that.  Especially since I obviously don't know how to handle modules
> > and wanted some input on how to do that :)
>
> In this case, the rules for modules are pretty simple.  Code in mm/ can't call
> into KVM.  Either avoid callbacks entirely, or implement via a layer of
> indirection, e.g. function pointer or ops table, so that KVM can provide its
> implementation at runtime.

Ack.

Thanks again!
/fuad


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v7 4/7] KVM: guest_memfd: Folio sharing states and functions that manage their transition
  2025-04-03 14:52       ` Sean Christopherson
@ 2025-04-04  6:44         ` Fuad Tabba
  0 siblings, 0 replies; 30+ messages in thread
From: Fuad Tabba @ 2025-04-04  6:44 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe, anup,
	paul.walmsley, palmer, aou, viro, brauner, willy, akpm,
	xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
	isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail, david,
	michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat, shuah,
	hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton, peterx,
	pankaj.gupta

On Thu, 3 Apr 2025 at 15:53, Sean Christopherson <seanjc@google.com> wrote:
>
> On Thu, Apr 03, 2025, Fuad Tabba wrote:
> > Hi Sean,
> >
> > On Thu, 3 Apr 2025 at 01:19, Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > On Fri, Mar 28, 2025, Fuad Tabba wrote:
> > > > @@ -389,22 +381,211 @@ static void kvm_gmem_init_mount(void)
> > > >  }
> > > >
> > > >  #ifdef CONFIG_KVM_GMEM_SHARED_MEM
> > > > -static bool kvm_gmem_offset_is_shared(struct file *file, pgoff_t index)
> > > > +/*
> > > > + * An enum of the valid folio sharing states:
> > > > + * Bit 0: set if not shared with the guest (guest cannot fault it in)
> > > > + * Bit 1: set if not shared with the host (host cannot fault it in)
> > > > + */
> > > > +enum folio_shareability {
> > > > +     KVM_GMEM_ALL_SHARED     = 0b00, /* Shared with the host and the guest. */
> > > > +     KVM_GMEM_GUEST_SHARED   = 0b10, /* Shared only with the guest. */
> > > > +     KVM_GMEM_NONE_SHARED    = 0b11, /* Not shared, transient state. */
> > >
> > > Absolutely not.  The proper way to define bitmasks is to use BIT(xxx).  Based on
> > > past discussions, I suspect you went this route so that the most common value
> > > is '0' to avoid extra xarray storage, but that should be an implementation
> > > detail buried deep in the low level xarray handling.
> > >
> > > The name is also bizarre and confusing.  To map memory into the guest as private,
> > > it needs to be in KVM_GMEM_GUEST_SHARED.  That's completely unworkable.
> > > Of course, it's not at all obvious that you're actually trying to create a bitmask.
> > > The above looks like an inverted bitmask, but then it's used as if the values don't
> > > matter.
> > >
> > >         return (r == KVM_GMEM_ALL_SHARED || r == KVM_GMEM_GUEST_SHARED);
> >
> > Ack.
> >
> > > Given that I can't think of a sane use case for allowing guest_memfd to be mapped
> > > into the host but not the guest (modulo temporary demand paging scenarios), I
> > > think all we need is:
> > >
> > >         KVM_GMEM_SHARED           = BIT(0),
> > >         KVM_GMEM_INVALID          = BIT(1),
> >
> > We need the third state for the transient case, i.e., when a page is
> > transitioning from being shared with the host to going back to
> > private, in order to ensure that neither the guest nor the host can
> > install a mapping/fault it in. But I see your point.
>
> That's KVM_GMEM_INVALID.  Translating to what you had:
>
>   KVM_GMEM_ALL_SHARED   = KVM_GMEM_SHARED
>   KVM_GMEM_GUEST_SHARED = 0
>   KVM_GMEM_NONE_SHARED  = KVM_GMEM_INVALID

Ack.

Thanks,
/fuad


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v7 7/7] KVM: guest_memfd: Add a guest_memfd() flag to initialize it as shared
  2025-04-02 22:47   ` Ackerley Tng
@ 2025-04-04  7:39     ` Fuad Tabba
  0 siblings, 0 replies; 30+ messages in thread
From: Fuad Tabba @ 2025-04-04  7:39 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe, anup,
	paul.walmsley, palmer, aou, seanjc, viro, brauner, willy, akpm,
	xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
	isaku.yamahata, mic, vbabka, vannapurve, mail, david,
	michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat, shuah,
	hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton, peterx,
	pankaj.gupta

Hi Ackerley,

On Wed, 2 Apr 2025 at 23:47, Ackerley Tng <ackerleytng@google.com> wrote:
>
> Fuad Tabba <tabba@google.com> writes:
>
> > Not all use cases require guest_memfd() to be shared with the host when
> > first created. Add a new flag, GUEST_MEMFD_FLAG_INIT_SHARED, which when
> > set on KVM_CREATE_GUEST_MEMFD initializes the memory as shared with the
> > host, and therefore mappable by it. Otherwise, memory is private until
> > explicitly shared by the guest with the host.
> >
> > Signed-off-by: Fuad Tabba <tabba@google.com>
> > ---
> >  Documentation/virt/kvm/api.rst                 |  4 ++++
> >  include/uapi/linux/kvm.h                       |  1 +
> >  tools/testing/selftests/kvm/guest_memfd_test.c |  7 +++++--
> >  virt/kvm/guest_memfd.c                         | 12 ++++++++++++
> >  4 files changed, 22 insertions(+), 2 deletions(-)
> >
> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > index 2b52eb77e29c..a5496d7d323b 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -6386,6 +6386,10 @@ most one mapping per page, i.e. binding multiple memory regions to a single
> >  guest_memfd range is not allowed (any number of memory regions can be bound to
> >  a single guest_memfd file, but the bound ranges must not overlap).
> >
> > +If the capability KVM_CAP_GMEM_SHARED_MEM is supported, then the flags field
> > +supports GUEST_MEMFD_FLAG_INIT_SHARED, which initializes the memory as shared
> > +with the host, and thereby, mappable by it.
> > +
> >  See KVM_SET_USER_MEMORY_REGION2 for additional details.
> >
> >  4.143 KVM_PRE_FAULT_MEMORY
> > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > index 117937a895da..22d7e33bf09c 100644
> > --- a/include/uapi/linux/kvm.h
> > +++ b/include/uapi/linux/kvm.h
> > @@ -1566,6 +1566,7 @@ struct kvm_memory_attributes {
> >  #define KVM_MEMORY_ATTRIBUTE_PRIVATE           (1ULL << 3)
> >
> >  #define KVM_CREATE_GUEST_MEMFD       _IOWR(KVMIO,  0xd4, struct kvm_create_guest_memfd)
> > +#define GUEST_MEMFD_FLAG_INIT_SHARED         (1UL << 0)
> >
> >  struct kvm_create_guest_memfd {
> >       __u64 size;
> > diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c b/tools/testing/selftests/kvm/guest_memfd_test.c
> > index 38c501e49e0e..4a7fcd6aa372 100644
> > --- a/tools/testing/selftests/kvm/guest_memfd_test.c
> > +++ b/tools/testing/selftests/kvm/guest_memfd_test.c
> > @@ -159,7 +159,7 @@ static void test_invalid_punch_hole(int fd, size_t page_size, size_t total_size)
> >  static void test_create_guest_memfd_invalid(struct kvm_vm *vm)
> >  {
> >       size_t page_size = getpagesize();
> > -     uint64_t flag;
> > +     uint64_t flag = BIT(0);
> >       size_t size;
> >       int fd;
> >
> > @@ -170,7 +170,10 @@ static void test_create_guest_memfd_invalid(struct kvm_vm *vm)
> >                           size);
> >       }
> >
> > -     for (flag = BIT(0); flag; flag <<= 1) {
> > +     if (kvm_has_cap(KVM_CAP_GMEM_SHARED_MEM))
> > +             flag = GUEST_MEMFD_FLAG_INIT_SHARED << 1;
> > +
> > +     for (; flag; flag <<= 1) {
>
> This would end up shifting the GUEST_MEMFD_FLAG_INIT_SHARED flag for
> each loop.

Yes. That's my intention. This tests whether the flags are invalid.
Without KVM_CAP_GMEM_SHARED_MEM, no flag is valid, so the loop starts
with bit zero and goes through all the flags.

If KVM_CAP_GMEM_SHARED_MEM is available, then we want to start from
bit 1 (i.e., skip bit 0, which is GUEST_MEMFD_FLAG_INIT_SHARED).

Cheers,
/fuad

> >               fd = __vm_create_guest_memfd(vm, page_size, flag);
> >               TEST_ASSERT(fd == -1 && errno == EINVAL,
> >                           "guest_memfd() with flag '0x%lx' should fail with EINVAL",
> > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> > index eec9d5e09f09..32e149478b04 100644
> > --- a/virt/kvm/guest_memfd.c
> > +++ b/virt/kvm/guest_memfd.c
> > @@ -1069,6 +1069,15 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
> >               goto err_gmem;
> >       }
> >
> > +     if (IS_ENABLED(CONFIG_KVM_GMEM_SHARED_MEM) &&
> > +         (flags & GUEST_MEMFD_FLAG_INIT_SHARED)) {
> > +             err = kvm_gmem_offset_range_set_shared(file_inode(file), 0, size >> PAGE_SHIFT);
> > +             if (err) {
> > +                     fput(file);
> > +                     goto err_gmem;
> > +             }
> > +     }
> > +
> >       kvm_get_kvm(kvm);
> >       gmem->kvm = kvm;
> >       xa_init(&gmem->bindings);
> > @@ -1090,6 +1099,9 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
> >       u64 flags = args->flags;
> >       u64 valid_flags = 0;
> >
> > +     if (IS_ENABLED(CONFIG_KVM_GMEM_SHARED_MEM))
> > +             valid_flags |= GUEST_MEMFD_FLAG_INIT_SHARED;
> > +
> >       if (flags & ~valid_flags)
> >               return -EINVAL;


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v7 0/7] KVM: Restricted mapping of guest_memfd at the host and arm64 support
  2025-03-28 15:31 [PATCH v7 0/7] KVM: Restricted mapping of guest_memfd at the host and arm64 support Fuad Tabba
                   ` (6 preceding siblings ...)
  2025-03-28 15:31 ` [PATCH v7 7/7] KVM: guest_memfd: Add a guest_memfd() flag to initialize it as shared Fuad Tabba
@ 2025-04-04 18:05 ` Liam R. Howlett
  2025-04-04 20:10   ` David Hildenbrand
  7 siblings, 1 reply; 30+ messages in thread
From: Liam R. Howlett @ 2025-04-04 18:05 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe, anup,
	paul.walmsley, palmer, aou, seanjc, viro, brauner, willy, akpm,
	xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
	isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail, david,
	michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat, shuah,
	hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton, peterx,
	pankaj.gupta

* Fuad Tabba <tabba@google.com> [250328 11:32]:
> This series adds restricted mmap() support to guest_memfd, as well as
> support for guest_memfd on arm64. Please see v3 for the context [1].

As I'm sure you are aware, we have issues getting people to review
patches.  The lower the barrier to entry, the better for everyone.

Sorry if I go too much into detail about the process, I am not sure
where you are starting from.  The linux-mm maintainer (Andrew, aka
akpm), usually takes this cover letter and puts it on patch 1 so that
the git history captures all the details necessary for the entire patch
set to make sense.  What you have done here is made a lot of work for
the maintainer.  I'm not sure what information below is or is not
captured in the v3 context.

But then again, maybe you are not going through linux-mm for upstream?


> 
> Main change since v6 [2]:
> Protected the shared_offsets array with a rwlock instead of hopping on
> the invalidate_lock. The main reason for this is that the final put
> callback (kvm_gmem_handle_folio_put()) could be called from an atomic
> context, and the invalidate_lock is an rw_semaphore (Vishal).
> 
> The other reason is, in hindsight, 

Can you please try to be more brief and to the point?

>it didn't make sense to use the
> invalidate_lock since they're not quite protecting the same thing.
> 
> I did consider using the folio lock to implicitly protect the array, and
> even have another series that does that. The result was more
> complicated, and not obviously race free. One of the difficulties with
> relying on the folio lock is that there are cases (e.g., on
> initilization and teardown) when we want to toggle the state of an
> offset that doesn't have a folio in the cache. We could special case
> these, but the result was more complicated (and to me not obviously
> better) than adding a separate lock for the shared_offsets array.

This must be captured elsewhere and doesn't need to be here?

> 
> Based on the `KVM: Mapping guest_memfd backed memory at the host for
> software protected VMs` series [3], which in turn is based on Linux
> 6.14-rc7.

Wait, what?  You have another in-flight series that I need for this
series?

So this means I need 6.14-rc7 + patches from march 18th to review your
series?

Oh my, and I just responded to another patch set built off this patch
set?  So we have 3 in-flight patches that I need for the last patch set?
What is going on with guest_memfd that everything needs to be in-flight
at once?

I was actually thinking that maybe it would be better to split *this*
set into 2 logical parts: 1. mmap() support and 2. guest_memfd arm64
support.  But now I have so many lore emails opened in my browser trying
to figure this out that I don't want any more.

There's also "mm: Consolidate freeing of typed folios on final
folio_put()" and I don't know where it fits.

Is this because the upstream path differs?  It's very difficult to
follow what's going on.

> 
> The state diagram that uses the new states in this patch series,
> and how they would interact with sharing/unsharing in pKVM [4].

This cover letter is very difficult to follow.  Where do the main
changes since v6 end?  I was going to suggest bullets, but v3 has
bullets and is more clear already.  I'd much prefer if it remained like
v3, any reason you changed?

I am not sure what upstream repository you are working with, but
if you are sending this for the mm stream, please base your work on
mm-new and AT LEAST wait until the patches are in mm-new, but ideally
wait for mm-unstable and mm-stable for wider test coverage.  This might
vary based on your upstream path though.

I did respond to one of the other patch sets that's based off this one,
noting that the lockdep issue in patch 3 makes testing a concern.  Considering
there are patches on patches, I don't think you are going to get a whole
lot of people reviewing or testing it until it falls over once it hits
linux-next.

The number of patches in-flight, the ordering, and the base are so
confusing.  Are you sure none of these should be RFC?  The flood of
changes pretty much guarantees things will be missed, more work will be
created, and people (like me) will get frustrated.

It looks like a small team across companies are collaborating on this,
and that's awesome.  I think you need to change how you are doing things
and let the rest of us in on the code earlier.  Otherwise you will be
forced to rebase on changed patch series every time something is
accepted upstream.  That's all to say, you don't have a solid base of
development for the newer patches and I don't know what interactions
will happen when all the in-flight things meet.

I'm sorry, but you have made the barrier of entry to review this too
high.

Thanks,
Liam

> 
> Cheers,
> /fuad
> 
> [1] https://lore.kernel.org/all/20241010085930.1546800-1-tabba@google.com/
> [2] https://lore.kernel.org/all/20250318162046.4016367-1-tabba@google.com/
> [3] https://lore.kernel.org/all/20250318161823.4005529-1-tabba@google.com/
> [4] https://lpc.events/event/18/contributions/1758/attachments/1457/3699/Guestmemfd%20folio%20state%20page_type.pdf


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v7 0/7] KVM: Restricted mapping of guest_memfd at the host and arm64 support
  2025-04-04 18:05 ` [PATCH v7 0/7] KVM: Restricted mapping of guest_memfd at the host and arm64 support Liam R. Howlett
@ 2025-04-04 20:10   ` David Hildenbrand
  2025-04-04 21:48     ` Sean Christopherson
  0 siblings, 1 reply; 30+ messages in thread
From: David Hildenbrand @ 2025-04-04 20:10 UTC (permalink / raw)
  To: Liam R. Howlett, Fuad Tabba, kvm, linux-arm-msm, linux-mm,
	pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, isaku.yamahata, mic,
	vbabka, vannapurve, ackerleytng, mail, michael.roth, wei.w.wang,
	liam.merwick, isaku.yamahata, kirill.shutemov, suzuki.poulose,
	steven.price, quic_eberman, quic_mnalajal, quic_tsoni,
	quic_svaddagi, quic_cvanscha, quic_pderrin, quic_pheragu,
	catalin.marinas, james.morse, yuzenghui, oliver.upton, maz, will,
	qperret, keirf, roypat, shuah, hch, jgg, rientjes, jhubbard, fvdl,
	hughd, jthoughton, peterx, pankaj.gupta

On 04.04.25 20:05, Liam R. Howlett wrote:
> * Fuad Tabba <tabba@google.com> [250328 11:32]:
>> This series adds restricted mmap() support to guest_memfd, as well as
>> support for guest_memfd on arm64. Please see v3 for the context [1].
> 
> As I'm sure you are aware, we have issues getting people to review
> patches.  The lower the barrier to entry, the better for everyone.
> 
> Sorry if I go too much into detail about the process, I am not sure
> where you are starting from.  The linux-mm maintainer (Andrew, aka
> akpm), usually takes this cover letter and puts it on patch 1 so that
> the git history captures all the details necessary for the entire patch
> set to make sense.  What you have done here is made a lot of work for
> the maintainer.  I'm not sure what information below is or is not
> captured in the v3 context.
> 
> But then again, maybe you are not going through linux-mm for upstream?

[replying to some bits]

As all patches and this subject is "KVM:" I would assume this goes 
through the kvm tree once ready.

[...]

> 
> So this means I need 6.14-rc7 + patches from march 18th to review your
> series?
> 
> Oh my, and I just responded to another patch set built off this patch
> set?  So we have 3 in-flight patches that I need for the last patch set?
> What is going on with guest_memfd that everything needs to be in-flight
> at once?

Yeah, we're still all trying to connect the pieces to make it all fly 
together. Looks like we're getting there.

> 
> I was actually thinking that maybe it would be better to split *this*
> set into 2 logical parts: 1. mmap() support and 2. guest_memfd arm64
> support.  But now I have so many lore emails opened in my browser trying
> to figure this out that I don't want any more.
> 
> There's also "mm: Consolidate freeing of typed folios on final
> folio_put()" and I don't know where it fits.

That will likely be moved into a different patch set after some 
discussions we had yesterday.

> 
> Is this because the upstream path differs?  It's very difficult to
> follow what's going on.
> 
>>
>> The state diagram that uses the new states in this patch series,
>> and how they would interact with sharing/unsharing in pKVM [4].
> 
> This cover letter is very difficult to follow.  Where do the main
> changes since v6 end?  I was going to suggest bullets, but v3 has
> bullets and is more clear already.  I'd much prefer if it remained line
> v3, any reason you changed?
> 
> I am not sure what upstream repository you are working with, but
> if you are sending this for the mm stream, please base your work on
> mm-new and AT LEAST wait until the patches are in mm-new, but ideally
> wait for mm-unstable and mm-stable for wider test coverage.  This might
> vary based on your upstream path though.
> 
> I did respond to one of the other patch set that's based off this one
> that the lockdep issue in patch 3 makes testing a concern.  Considering
> there are patches on patches, I don't think you are going to get a whole
> lot of people reviewing or testing it until it falls over once it hits
> linux-next.

I think for now it's mostly KVM + guest_memfd people reviewing this. 
linux-mm is only CCed for awareness (clearly MM related stuff).

> 
> The number of patches in-flight, the ordering, and the base is so
> confusing.  Are you sure none of these should be RFC?  The flood of
> changes pretty much guarantees things will be missed, more work will be
> created, and people (like me) will get frustrated.

Yeah, I think everything based on this series should be RFC, or otherwise
clearly tagged as being based on this work.

> It looks like a small team across companies are collaborating on this,
> and that's awesome.  I think you need to change how you are doing things
> and let the rest of us in on the code earlier. 

I think the approach taken to share the different pieces early makes 
sense, it just has to be clearer what the dependencies are and what is 
actually the first thing that should go in so people can focus review on 
that.

> Otherwise you will be
> forced to rebase on changed patch series every time something is
> accepted upstream. 

I don't think that's a problem; the work built on top of this is mostly
shared for early review/feedback -- though I agree that tagging it
differently makes a lot of sense.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v7 0/7] KVM: Restricted mapping of guest_memfd at the host and arm64 support
  2025-04-04 20:10   ` David Hildenbrand
@ 2025-04-04 21:48     ` Sean Christopherson
  2025-04-05  2:45       ` Liam R. Howlett
  0 siblings, 1 reply; 30+ messages in thread
From: Sean Christopherson @ 2025-04-04 21:48 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Liam R. Howlett, Fuad Tabba, kvm, linux-arm-msm, linux-mm,
	pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou, viro,
	brauner, willy, akpm, xiaoyao.li, yilun.xu, chao.p.peng, jarkko,
	amoorthy, dmatlack, isaku.yamahata, mic, vbabka, vannapurve,
	ackerleytng, mail, michael.roth, wei.w.wang, liam.merwick,
	isaku.yamahata, kirill.shutemov, suzuki.poulose, steven.price,
	quic_eberman, quic_mnalajal, quic_tsoni, quic_svaddagi,
	quic_cvanscha, quic_pderrin, quic_pheragu, catalin.marinas,
	james.morse, yuzenghui, oliver.upton, maz, will, qperret, keirf,
	roypat, shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd,
	jthoughton, peterx, pankaj.gupta

On Fri, Apr 04, 2025, David Hildenbrand wrote:
> On 04.04.25 20:05, Liam R. Howlett wrote:
> > > But then again, maybe you are not going through linux-mm for upstream?
> 
> [replying to some bits]
> 
> As all patches and this subject is "KVM:" I would assume this goes through
> the kvm tree once ready.

Yeah, that's a safe assumption.

> > It looks like a small team across companies are collaborating on this,
> > and that's awesome.  I think you need to change how you are doing things
> > and let the rest of us in on the code earlier.
> 
> I think the approach taken to share the different pieces early makes sense,
> it just has to be clearer what the dependencies are and what is actually the
> first thing that should go in so people can focus review on that.

100% agreed that sharing early makes sense, but I also 100% agree with Liam that
having multiple series flying around with multiple dependencies makes them all
unreviewable.  I simply don't have the bandwidth to track down who's doing what
and where.

I don't see those two sides as conflicting.  Someone "just" needs to take point
on collecting, squashing, and posting the various inter-related works as a single
cohesive series.

As you said, things are getting there, but I recommend prioritizing that above
the actual code, otherwise reviewers are going to continue ignoring the individual
series.

FWIW, if necessary, I would much prefer a single massive series over a bunch of
smaller series all pointing at each other, at least for the initial review.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v7 0/7] KVM: Restricted mapping of guest_memfd at the host and arm64 support
  2025-04-04 21:48     ` Sean Christopherson
@ 2025-04-05  2:45       ` Liam R. Howlett
  2025-04-07  6:28         ` Fuad Tabba
  0 siblings, 1 reply; 30+ messages in thread
From: Liam R. Howlett @ 2025-04-05  2:45 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: David Hildenbrand, Fuad Tabba, kvm, linux-arm-msm, linux-mm,
	pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou, viro,
	brauner, willy, akpm, xiaoyao.li, yilun.xu, chao.p.peng, jarkko,
	amoorthy, dmatlack, isaku.yamahata, mic, vbabka, vannapurve,
	ackerleytng, mail, michael.roth, wei.w.wang, liam.merwick,
	isaku.yamahata, kirill.shutemov, suzuki.poulose, steven.price,
	quic_eberman, quic_mnalajal, quic_tsoni, quic_svaddagi,
	quic_cvanscha, quic_pderrin, quic_pheragu, catalin.marinas,
	james.morse, yuzenghui, oliver.upton, maz, will, qperret, keirf,
	roypat, shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd,
	jthoughton, peterx, pankaj.gupta

* Sean Christopherson <seanjc@google.com> [250404 17:48]:
> On Fri, Apr 04, 2025, David Hildenbrand wrote:
> > On 04.04.25 20:05, Liam R. Howlett wrote:
> > > > But then again, maybe you are not going through linux-mm for upstream?
> > 
> > [replying to some bits]
> > 
> > As all patches and this subject is "KVM:" I would assume this goes through
> > the kvm tree once ready.
> 
> Yeah, that's a safe assumption.

Okay, thanks.

It still seems strange to have such vast differences in the cover letter
between versions and require others to hunt down the information vs an
updated cover letter with revision history and links.

I did get lost on where the changes since v6 stopped vs new comments
started in the update.

But maybe it's that I'm set in my grumpy old ways.

I really like having the cover letter information in the first patch (this
being patch 1 of 7..) so that the information ends up in the git log.

> 
> > > It looks like a small team across companies are collaborating on this,
> > > and that's awesome.  I think you need to change how you are doing things
> > > and let the rest of us in on the code earlier.
> > 
> > I think the approach taken to share the different pieces early makes sense,
> > it just has to be clearer what the dependencies are and what is actually the
> > first thing that should go in so people can focus review on that.
> 
> 100% agreed that sharing early makes sense, but I also 100% agree with Liam that
> having multiple series flying around with multiple dependencies makes them all
> unreviewable.  I simply don't have the bandwidth to track down who's doing what
> and where.

Yes, sharing early is crucial, but we lack the quilt file to stitch them
together in the proper sequence so we can do a proper review.

My main issue is barrier of entry, I have no idea how I can help this
effort as it is today.

> 
> I don't see those two sides as conflicting.  Someone "just" needs to take point
> on collecting, squashing, and posting the various inter-related works as a single
> cohesive series.

It *looks* like all these patches need to go in now (no RFC tags, for
instance), but when you start reading through the cover letters it has
many levels and the effort quickly grows; which branch do I need, what
order, and which of these landed?  This is like SMR, but with code.

> 
> As you said, things are getting there, but I recommend prioritizing that above
> the actual code, otherwise reviewers are going to continue ignoring the individual
> series.
> 
> FWIW, if necessary, I would much prefer a single massive series over a bunch of
> smaller series all pointing at each other, at least for the initial review.
> 

Yes, at least then the dependency order and versions would not be such
an effort to get correct.  If it's really big maybe a git repo would be
a good idea along with the large patch set?  You'd probably be using
that repo to combine/squash and generate the patches anyways.  I still
don't know what patch I need to start with and which ones have landed.

If each part is worth doing on its own, then send one at a time and wait
for them to land.  This might result in wasted time on a plan that needs
to be changed for upstreaming, so I think the big patch bomb/git branch
is the way to go.

Both of these methods will provide better end-to-end testing and code
review than the individual parts being sent out in short succession with
references to each other.

Thanks,
Liam


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v7 5/7] KVM: guest_memfd: Restore folio state after final folio_put()
  2025-03-28 15:31 ` [PATCH v7 5/7] KVM: guest_memfd: Restore folio state after final folio_put() Fuad Tabba
@ 2025-04-05 16:23   ` Matthew Wilcox
  0 siblings, 0 replies; 30+ messages in thread
From: Matthew Wilcox @ 2025-04-05 16:23 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe, anup,
	paul.walmsley, palmer, aou, seanjc, viro, brauner, akpm,
	xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
	isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail, david,
	michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat, shuah,
	hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton, peterx,
	pankaj.gupta

On Fri, Mar 28, 2025 at 03:31:31PM +0000, Fuad Tabba wrote:
> +	write_lock(offsets_lock);
> +	folio_lock(folio);

write_lock() is a spinlock.
folio_lock() is a sleeping lock.

Do you not enable any kernel debugging during development?  It would
save you the embarrassment of sending patches which are obvious garbage.
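
For reference, a debug config along these lines would flag sleeping while
holding a spinlock at runtime (illustrative, not an exhaustive list):

	CONFIG_DEBUG_ATOMIC_SLEEP=y
	CONFIG_DEBUG_SPINLOCK=y
	CONFIG_PROVE_LOCKING=y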


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v7 0/7] KVM: Restricted mapping of guest_memfd at the host and arm64 support
  2025-04-05  2:45       ` Liam R. Howlett
@ 2025-04-07  6:28         ` Fuad Tabba
  0 siblings, 0 replies; 30+ messages in thread
From: Fuad Tabba @ 2025-04-07  6:28 UTC (permalink / raw)
  To: Liam R. Howlett, Sean Christopherson, David Hildenbrand,
	Fuad Tabba, kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai,
	mpe, anup, paul.walmsley, palmer, aou, viro, brauner, willy, akpm,
	xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
	isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
	michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat, shuah,
	hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton, peterx,
	pankaj.gupta

Hi,

On Sat, 5 Apr 2025 at 03:46, Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> * Sean Christopherson <seanjc@google.com> [250404 17:48]:
> > On Fri, Apr 04, 2025, David Hildenbrand wrote:
> > > On 04.04.25 20:05, Liam R. Howlett wrote:
> > > > > But then again, maybe you are not going through linux-mm for upstream?
> > >
> > > [replying to some bits]
> > >
> > > As all patches and this subject is "KVM:" I would assume this goes through
> > > the kvm tree once ready.
> >
> > Yeah, that's a safe assumption.
>
> Okay, thanks.
>
> It still seems strange to have such vast differences in the cover letter
> between versions and require others to hunt down the information vs an
> updated cover letter with revision history and links.
>
> I did get lost on where the changes since v6 stopped vs new comments
> started in the update.
>
> But maybe it's that I'm set in my grumpy old ways.
>
> I really like the cover letter information in the first patch (this
> patch 1 of 7..) to have the information in the git log.
>
> >
> > > > It looks like a small team across companies are collaborating on this,
> > > > and that's awesome.  I think you need to change how you are doing things
> > > > and let the rest of us in on the code earlier.
> > >
> > > I think the approach taken to share the different pieces early makes sense,
> > > it just has to be clearer what the dependencies are and what is actually the
> > > first thing that should go in so people can focus review on that.
> >
> > 100% agreed that sharing early makes sense, but I also 100% agree with Liam that
> > having multiple series flying around with multiple dependencies makes them all
> > unreviewable.  I simply don't have the bandwidth to track down who's doing what
> > and where.
>
> Yes, sharing early is crucial, but we lack the quilt file to stitch them
> together in the proper sequence so we can do a proper review.
>
> My main issue is barrier of entry, I have no idea how I can help this
> effort as it is today.
>
> >
> > I don't see those two sides as conflicting.  Someone "just" needs to take point
> > on collecting, squashing, and posting the various inter-related works as a single
> > cohesive series.
>
> It *looks* like all these patches need to go in now (no RFC tags, for
> instance), but when you start reading through the cover letters it has
> many levels and the effort quickly grows; which branch do I need, what
> order, and which of these landed?  This is like SMR, but with code.
>
> >
> > As you said, things are getting there, but I recommend prioritizing that above
> > the actual code, otherwise reviewers are going to continue ignoring the individual
> > series.
> >
> > FWIW, if necessary, I would much prefer a single massive series over a bunch of
> > smaller series all pointing at each other, at least for the initial review.
> >
>
> Yes, at least then the dependency order and versions would not be such
> an effort to get correct.  If it's really big, maybe a git repo would
> be a good idea along with the large patch set?  You'd probably be using
> that repo to combine/squash and generate the patches anyway.  I still
> don't know which patch I need to start with and which ones have landed.
>
> If each part is worth doing on its own, then send them one at a time and
> wait for each to land.  This might result in wasted time on a plan that
> needs to be changed for upstreaming, so I think the big patch bomb/git
> branch is the way to go.
>
> Both of these methods will provide better end-to-end testing and code
> review than the individual parts being sent out in short succession with
> references to each other.

The idea was that it would be easier to review the two series
separately, with the second one tagged as RFC (a tag I accidentally
dropped). That ended up making things more confusing and difficult to
review. The next posting will be a single series.

Thanks for your feedback and sorry for the noise.

Cheers,
/fuad

> Thanks,
> Liam


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v7 1/7] KVM: guest_memfd: Make guest mem use guest mem inodes instead of anonymous inodes
  2025-03-28 15:31 ` [PATCH v7 1/7] KVM: guest_memfd: Make guest mem use guest mem inodes instead of anonymous inodes Fuad Tabba
@ 2025-04-08 11:53   ` Shivank Garg
  0 siblings, 0 replies; 30+ messages in thread
From: Shivank Garg @ 2025-04-08 11:53 UTC (permalink / raw)
  To: Fuad Tabba, ackerleytng, kvm, linux-arm-msm, linux-mm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, isaku.yamahata, mic,
	vbabka, vannapurve, mail, david, michael.roth, wei.w.wang,
	liam.merwick, isaku.yamahata, kirill.shutemov, suzuki.poulose,
	steven.price, quic_eberman, quic_mnalajal, quic_tsoni,
	quic_svaddagi, quic_cvanscha, quic_pderrin, quic_pheragu,
	catalin.marinas, james.morse, yuzenghui, oliver.upton, maz, will,
	qperret, keirf, roypat, shuah, hch, jgg, rientjes, jhubbard, fvdl,
	hughd, jthoughton, peterx, pankaj.gupta

Hi Ackerley, Fuad,

On 3/28/2025 9:01 PM, Fuad Tabba wrote:
> From: Ackerley Tng <ackerleytng@google.com>
> 
> Using guest mem inodes allows us to store metadata for the backing
> memory on the inode. Metadata will be added in a later patch to support
> HugeTLB pages.
> 
> Metadata about backing memory should not be stored on the file, since
> the file represents a guest_memfd's binding with a struct kvm, and
> metadata about backing memory is not unique to a specific binding and
> struct kvm.
> 
> Signed-off-by: Fuad Tabba <tabba@google.com>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> ---

<snip>

> +
> +static void kvm_gmem_init_mount(void)
> +{
> +	kvm_gmem_mnt = kern_mount(&kvm_gmem_fs);
> +	BUG_ON(IS_ERR(kvm_gmem_mnt));
> +
> +	/* For giggles. Userspace can never map this anyways. */
> +	kvm_gmem_mnt->mnt_flags |= MNT_NOEXEC;
> +}
> +
...

>  void kvm_gmem_init(struct module *module)
>  {
>  	kvm_gmem_fops.owner = module;
> +
> +	kvm_gmem_init_mount();
>  }

Looks like we’re missing a kern_unmount() call to properly clean up when
the KVM module gets unloaded. How about adding this?

+void kvm_gmem_exit(void)
+{
+	kern_unmount(kvm_gmem_mnt);
+}

This hooks up the teardown for the guest_memfd code nicely.
Right now, kvm_gmem_exit() isn’t even there, so this needs to be added.

https://lore.kernel.org/linux-mm/20250408112402.181574-6-shivankg@amd.com
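
In case it helps, here's a rough sketch of how I'd expect the teardown to
be wired up. This is only an illustration, assuming the declaration would
sit next to kvm_gmem_init() in virt/kvm/kvm_mm.h:

/* virt/kvm/kvm_mm.h */
void kvm_gmem_exit(void);

/* virt/kvm/guest_memfd.c */
void kvm_gmem_exit(void)
{
	/* Tear down the internal mount created by kvm_gmem_init_mount(). */
	kern_unmount(kvm_gmem_mnt);
}

with kvm_gmem_exit() then called from kvm_exit() in virt/kvm/kvm_main.c
(or wherever the module teardown path ends up living).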
 
>  
>  static int kvm_gmem_migrate_folio(struct address_space *mapping,
> @@ -511,11 +549,79 @@ static const struct inode_operations kvm_gmem_iops = {
>  	.setattr	= kvm_gmem_setattr,
>  };
>  
> +static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
> +						      loff_t size, u64 flags)
> +{
> +	const struct qstr qname = QSTR_INIT(name, strlen(name));
> +	struct inode *inode;
> +	int err;
> +
> +	inode = alloc_anon_inode(kvm_gmem_mnt->mnt_sb);
> +	if (IS_ERR(inode))
> +		return inode;
> +
> +	err = security_inode_init_security_anon(inode, &qname, NULL);

Also, I hit a build error with security_inode_init_security_anon() when
building KVM as a module. Looks like the symbol needs to be exported.
Adding this fixed it for me:

+EXPORT_SYMBOL(security_inode_init_security_anon);
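
For reference, the export would just go after the function definition
(assuming it still lives in security/security.c); whether EXPORT_SYMBOL
or EXPORT_SYMBOL_GPL is more appropriate for an LSM hook is probably a
call for the security maintainers:

 int security_inode_init_security_anon(struct inode *inode,
				       const struct qstr *name,
				       const struct inode *context_inode)
 {
 	...
 }
+EXPORT_SYMBOL(security_inode_init_security_anon);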

Can you guys check if this looks good?
If so, there’s a revised patch here that includes these changes:
https://lore.kernel.org/linux-mm/20250408112402.181574-1-shivankg@amd.com

Let me know what you think!

Thanks,
Shivank



^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2025-04-08 11:53 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-03-28 15:31 [PATCH v7 0/7] KVM: Restricted mapping of guest_memfd at the host and arm64 support Fuad Tabba
2025-03-28 15:31 ` [PATCH v7 1/7] KVM: guest_memfd: Make guest mem use guest mem inodes instead of anonymous inodes Fuad Tabba
2025-04-08 11:53   ` Shivank Garg
2025-03-28 15:31 ` [PATCH v7 2/7] KVM: guest_memfd: Introduce kvm_gmem_get_pfn_locked(), which retains the folio lock Fuad Tabba
2025-03-28 15:31 ` [PATCH v7 3/7] KVM: guest_memfd: Track folio sharing within a struct kvm_gmem_private Fuad Tabba
2025-04-02 22:25   ` Ackerley Tng
2025-04-02 23:56     ` Sean Christopherson
2025-04-03  8:58       ` Fuad Tabba
2025-04-03 14:51         ` Sean Christopherson
2025-04-04  6:43           ` Fuad Tabba
2025-03-28 15:31 ` [PATCH v7 4/7] KVM: guest_memfd: Folio sharing states and functions that manage their transition Fuad Tabba
2025-04-02 23:48   ` Ackerley Tng
2025-04-03  8:58     ` Fuad Tabba
2025-04-03  0:19   ` Sean Christopherson
2025-04-03  9:11     ` Fuad Tabba
2025-04-03 14:52       ` Sean Christopherson
2025-04-04  6:44         ` Fuad Tabba
2025-03-28 15:31 ` [PATCH v7 5/7] KVM: guest_memfd: Restore folio state after final folio_put() Fuad Tabba
2025-04-05 16:23   ` Matthew Wilcox
2025-03-28 15:31 ` [PATCH v7 6/7] KVM: guest_memfd: Handle invalidation of shared memory Fuad Tabba
2025-03-28 15:31 ` [PATCH v7 7/7] KVM: guest_memfd: Add a guest_memfd() flag to initialize it as shared Fuad Tabba
2025-04-02 22:47   ` Ackerley Tng
2025-04-04  7:39     ` Fuad Tabba
2025-04-02 23:03   ` Ackerley Tng
2025-04-03 11:32     ` Fuad Tabba
2025-04-04 18:05 ` [PATCH v7 0/7] KVM: Restricted mapping of guest_memfd at the host and arm64 support Liam R. Howlett
2025-04-04 20:10   ` David Hildenbrand
2025-04-04 21:48     ` Sean Christopherson
2025-04-05  2:45       ` Liam R. Howlett
2025-04-07  6:28         ` Fuad Tabba

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).