linux-mm.kvack.org archive mirror
* [PATCH v3 00/11] KVM: Restricted mapping of guest_memfd at the host and arm64 support
@ 2024-10-10  8:59 Fuad Tabba
  2024-10-10  8:59 ` [PATCH v3 01/11] KVM: guest_memfd: Make guest mem use guest mem inodes instead of anonymous inodes Fuad Tabba
                   ` (10 more replies)
  0 siblings, 11 replies; 28+ messages in thread
From: Fuad Tabba @ 2024-10-10  8:59 UTC (permalink / raw)
  To: kvm, linux-arm-msm, linux-mm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail, david,
	michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat, shuah,
	hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton, tabba

This series adds restricted mmap() support to guest_memfd, as
well as support for guest_memfd on arm64. It is based on Linux
6.12-rc2.

Changes since V2 [1]:
- Use a refcount to determine whether a page/folio is mapped by
  the host, rather than folio_mapcount() + folio_maybe_dma_pinned()
  (DavidH)
- Track the mappability of guest memory at the host in the
  guest_memfd inode (Ackerley)
- Refactoring and tidying up (Sean, Ackerley)

By design, guest_memfd cannot be mapped, read, or written by the
host. In pKVM, memory shared between a protected guest and the
host is shared in-place, unlike other confidential computing
solutions that guest_memfd was originally envisaged for (e.g.,
TDX). When initializing a guest, as well as when accessing memory
shared by a protected guest with the host, it would be useful to
support mapping guest memory at the host to avoid copying its
contents.

One of the benefits of guest_memfd is that it prevents a
misbehaving host from crashing the system when attempting to
access private guest memory (deliberately or accidentally), since
this memory isn't mapped to begin with. Without guest_memfd, the
hypervisor would still prevent such accesses, but in certain
cases the host kernel wouldn't be able to recover, causing the
system to crash.

Support for mmap() in this patch series maintains the invariant
that only memory shared with the host, either explicitly by the
guest or implicitly before the guest has started running (in
order to populate its memory), is allowed to have a valid mapping
at the host. At no point should _private_ guest memory have any
mappings at the host.

This patch series is divided into two parts:

The first part applies to the KVM core code. It adds opt-in
support for mapping guest memory only as long as it is shared,
or, optionally, when it is first created. For that, the host
needs to know the mappability status of guest memory. Therefore,
the series adds a structure to track whether memory is mappable.
This new structure is associated with each guest_memfd inode
object.

The second part of the series adds guest_memfd support for arm64.

The patch series enforces the invariant that only memory shared
with the host can be mapped by host userspace in
vm_operations_struct::fault(), instead of file_operations::mmap().
On a fault, we check whether the page is allowed to be mapped. If
not, we deliver a SIGBUS to the current task, as discussed in the
Linux MM Alignment Session and at LPC 2024 [2, 3].
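
As a rough illustration of the intended userspace flow (a sketch,
not taken verbatim from the series; fd is a guest_memfd whose
contents are currently shared with the host, and size is its
size):

	char *mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
			 MAP_SHARED, fd, 0);	/* mapping must be MAP_SHARED */

	/* Faulting in a shared offset behaves like regular memory. */
	mem[0] = 0xaa;

	/*
	 * Faulting in an offset that is private to the guest does not
	 * bring down the host; the faulting task gets SIGBUS instead.
	 */

The mmap() support and the SIGBUS behaviour are added in patch 04
of this series.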

Currently, there is no support for huge pages; this is something
we hope to add in the near future [4].

Cheers,
/fuad

[1] https://lore.kernel.org/all/20240801090117.3841080-1-tabba@google.com/

[2] https://lore.kernel.org/all/20240712232937.2861788-1-ackerleytng@google.com/

[3] https://lpc.events/event/18/sessions/183/#20240919

[4] https://lore.kernel.org/all/cover.1726009989.git.ackerleytng@google.com/

Ackerley Tng (2):
  KVM: guest_memfd: Make guest mem use guest mem inodes instead of
    anonymous inodes
  KVM: guest_memfd: Track mappability within a struct kvm_gmem_private

Fuad Tabba (9):
  KVM: guest_memfd: Introduce kvm_gmem_get_pfn_locked(), which retains
    the folio lock
  KVM: guest_memfd: Allow host to mmap guest_memfd() pages when shared
  KVM: guest_memfd: Add guest_memfd support to
    kvm_(read|write)_guest_page()
  KVM: guest_memfd: Add KVM capability to check if guest_memfd is host
    mappable
  KVM: guest_memfd: Add a guest_memfd() flag to initialize it as
    mappable
  KVM: guest_memfd: selftests: guest_memfd mmap() test when mapping is
    allowed
  KVM: arm64: Skip VMA checks for slots without userspace address
  KVM: arm64: Handle guest_memfd()-backed guest page faults
  KVM: arm64: Enable guest_memfd private memory when pKVM is enabled

 Documentation/virt/kvm/api.rst                |   4 +
 arch/arm64/include/asm/kvm_host.h             |   3 +
 arch/arm64/kvm/Kconfig                        |   1 +
 arch/arm64/kvm/mmu.c                          | 120 +++++-
 include/linux/kvm_host.h                      |  63 +++
 include/uapi/linux/kvm.h                      |   2 +
 include/uapi/linux/magic.h                    |   1 +
 tools/testing/selftests/kvm/Makefile          |   1 +
 .../testing/selftests/kvm/guest_memfd_test.c  |  57 ++-
 virt/kvm/Kconfig                              |   4 +
 virt/kvm/guest_memfd.c                        | 397 ++++++++++++++++--
 virt/kvm/kvm_main.c                           | 279 +++++++++++-
 12 files changed, 877 insertions(+), 55 deletions(-)


base-commit: 8cf0b93919e13d1e8d4466eb4080a4c4d9d66d7b
-- 
2.47.0.rc0.187.ge670bccf7e-goog




* [PATCH v3 01/11] KVM: guest_memfd: Make guest mem use guest mem inodes instead of anonymous inodes
  2024-10-10  8:59 [PATCH v3 00/11] KVM: Restricted mapping of guest_memfd at the host and arm64 support Fuad Tabba
@ 2024-10-10  8:59 ` Fuad Tabba
  2024-10-12  6:12   ` kernel test robot
  2024-10-10  8:59 ` [PATCH v3 02/11] KVM: guest_memfd: Track mappability within a struct kvm_gmem_private Fuad Tabba
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 28+ messages in thread
From: Fuad Tabba @ 2024-10-10  8:59 UTC (permalink / raw)
  To: kvm, linux-arm-msm, linux-mm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail, david,
	michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat, shuah,
	hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton, tabba

From: Ackerley Tng <ackerleytng@google.com>

Using guest mem inodes allows us to store metadata for the backing
memory on the inode. Metadata will be added in a later patch to
support HugeTLB pages.

Metadata about backing memory should not be stored on the file, since
the file represents a guest_memfd's binding with a struct kvm, and
metadata about backing memory is not unique to a specific binding and
struct kvm.
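
As a rough sketch of the resulting split (struct
kvm_gmem_inode_private and the per-inode data are introduced in
the next patch):

	/* Per-binding state: one struct kvm_gmem per file and struct kvm. */
	struct kvm_gmem *gmem = file->private_data;

	/* Per-memory state: shared by all files referring to this inode. */
	struct kvm_gmem_inode_private *private =
		file_inode(file)->i_mapping->i_private_data;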

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
---
 include/uapi/linux/magic.h |   1 +
 virt/kvm/guest_memfd.c     | 119 ++++++++++++++++++++++++++++++-------
 2 files changed, 100 insertions(+), 20 deletions(-)

diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index bb575f3ab45e..169dba2a6920 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -103,5 +103,6 @@
 #define DEVMEM_MAGIC		0x454d444d	/* "DMEM" */
 #define SECRETMEM_MAGIC		0x5345434d	/* "SECM" */
 #define PID_FS_MAGIC		0x50494446	/* "PIDF" */
+#define GUEST_MEMORY_MAGIC	0x474d454d	/* "GMEM" */
 
 #endif /* __LINUX_MAGIC_H__ */
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 8f079a61a56d..5d7fd1f708a6 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -1,12 +1,17 @@
 // SPDX-License-Identifier: GPL-2.0
+#include <linux/fs.h>
+#include <linux/mount.h>
 #include <linux/backing-dev.h>
 #include <linux/falloc.h>
 #include <linux/kvm_host.h>
+#include <linux/pseudo_fs.h>
 #include <linux/pagemap.h>
 #include <linux/anon_inodes.h>
 
 #include "kvm_mm.h"
 
+static struct vfsmount *kvm_gmem_mnt;
+
 struct kvm_gmem {
 	struct kvm *kvm;
 	struct xarray bindings;
@@ -302,6 +307,38 @@ static inline struct file *kvm_gmem_get_file(struct kvm_memory_slot *slot)
 	return get_file_active(&slot->gmem.file);
 }
 
+static const struct super_operations kvm_gmem_super_operations = {
+	.statfs		= simple_statfs,
+};
+
+static int kvm_gmem_init_fs_context(struct fs_context *fc)
+{
+	struct pseudo_fs_context *ctx;
+
+	if (!init_pseudo(fc, GUEST_MEMORY_MAGIC))
+		return -ENOMEM;
+
+	ctx = fc->fs_private;
+	ctx->ops = &kvm_gmem_super_operations;
+
+	return 0;
+}
+
+static struct file_system_type kvm_gmem_fs = {
+	.name		 = "kvm_guest_memory",
+	.init_fs_context = kvm_gmem_init_fs_context,
+	.kill_sb	 = kill_anon_super,
+};
+
+static void kvm_gmem_init_mount(void)
+{
+	kvm_gmem_mnt = kern_mount(&kvm_gmem_fs);
+	BUG_ON(IS_ERR(kvm_gmem_mnt));
+
+	/* For giggles. Userspace can never map this anyways. */
+	kvm_gmem_mnt->mnt_flags |= MNT_NOEXEC;
+}
+
 static struct file_operations kvm_gmem_fops = {
 	.open		= generic_file_open,
 	.release	= kvm_gmem_release,
@@ -311,6 +348,8 @@ static struct file_operations kvm_gmem_fops = {
 void kvm_gmem_init(struct module *module)
 {
 	kvm_gmem_fops.owner = module;
+
+	kvm_gmem_init_mount();
 }
 
 static int kvm_gmem_migrate_folio(struct address_space *mapping,
@@ -392,11 +431,67 @@ static const struct inode_operations kvm_gmem_iops = {
 	.setattr	= kvm_gmem_setattr,
 };
 
+static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
+						      loff_t size, u64 flags)
+{
+	const struct qstr qname = QSTR_INIT(name, strlen(name));
+	struct inode *inode;
+	int err;
+
+	inode = alloc_anon_inode(kvm_gmem_mnt->mnt_sb);
+	if (IS_ERR(inode))
+		return inode;
+
+	err = security_inode_init_security_anon(inode, &qname, NULL);
+	if (err) {
+		iput(inode);
+		return ERR_PTR(err);
+	}
+
+	inode->i_private = (void *)(unsigned long)flags;
+	inode->i_op = &kvm_gmem_iops;
+	inode->i_mapping->a_ops = &kvm_gmem_aops;
+	inode->i_mode |= S_IFREG;
+	inode->i_size = size;
+	mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
+	mapping_set_inaccessible(inode->i_mapping);
+	/* Unmovable mappings are supposed to be marked unevictable as well. */
+	WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
+
+	return inode;
+}
+
+static struct file *kvm_gmem_inode_create_getfile(void *priv, loff_t size,
+						  u64 flags)
+{
+	static const char *name = "[kvm-gmem]";
+	struct inode *inode;
+	struct file *file;
+
+	if (kvm_gmem_fops.owner && !try_module_get(kvm_gmem_fops.owner))
+		return ERR_PTR(-ENOENT);
+
+	inode = kvm_gmem_inode_make_secure_inode(name, size, flags);
+	if (IS_ERR(inode))
+		return ERR_CAST(inode);
+
+	file = alloc_file_pseudo(inode, kvm_gmem_mnt, name, O_RDWR,
+				 &kvm_gmem_fops);
+	if (IS_ERR(file)) {
+		iput(inode);
+		return file;
+	}
+
+	file->f_mapping = inode->i_mapping;
+	file->f_flags |= O_LARGEFILE;
+	file->private_data = priv;
+
+	return file;
+}
+
 static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
 {
-	const char *anon_name = "[kvm-gmem]";
 	struct kvm_gmem *gmem;
-	struct inode *inode;
 	struct file *file;
 	int fd, err;
 
@@ -410,32 +505,16 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
 		goto err_fd;
 	}
 
-	file = anon_inode_create_getfile(anon_name, &kvm_gmem_fops, gmem,
-					 O_RDWR, NULL);
+	file = kvm_gmem_inode_create_getfile(gmem, size, flags);
 	if (IS_ERR(file)) {
 		err = PTR_ERR(file);
 		goto err_gmem;
 	}
 
-	file->f_flags |= O_LARGEFILE;
-
-	inode = file->f_inode;
-	WARN_ON(file->f_mapping != inode->i_mapping);
-
-	inode->i_private = (void *)(unsigned long)flags;
-	inode->i_op = &kvm_gmem_iops;
-	inode->i_mapping->a_ops = &kvm_gmem_aops;
-	inode->i_mode |= S_IFREG;
-	inode->i_size = size;
-	mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
-	mapping_set_inaccessible(inode->i_mapping);
-	/* Unmovable mappings are supposed to be marked unevictable as well. */
-	WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
-
 	kvm_get_kvm(kvm);
 	gmem->kvm = kvm;
 	xa_init(&gmem->bindings);
-	list_add(&gmem->entry, &inode->i_mapping->i_private_list);
+	list_add(&gmem->entry, &file_inode(file)->i_mapping->i_private_list);
 
 	fd_install(fd, file);
 	return fd;
-- 
2.47.0.rc0.187.ge670bccf7e-goog




* [PATCH v3 02/11] KVM: guest_memfd: Track mappability within a struct kvm_gmem_private
  2024-10-10  8:59 [PATCH v3 00/11] KVM: Restricted mapping of guest_memfd at the host and arm64 support Fuad Tabba
  2024-10-10  8:59 ` [PATCH v3 01/11] KVM: guest_memfd: Make guest mem use guest mem inodes instead of anonymous inodes Fuad Tabba
@ 2024-10-10  8:59 ` Fuad Tabba
  2024-10-10  8:59 ` [PATCH v3 03/11] KVM: guest_memfd: Introduce kvm_gmem_get_pfn_locked(), which retains the folio lock Fuad Tabba
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 28+ messages in thread
From: Fuad Tabba @ 2024-10-10  8:59 UTC (permalink / raw)
  To: kvm, linux-arm-msm, linux-mm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail, david,
	michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat, shuah,
	hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton, tabba

From: Ackerley Tng <ackerleytng@google.com>

Track in the inode whether guest_memfd memory can be mapped by
the host, since this is a property of the guest_memfd's memory
contents.

The guest_memfd PRIVATE memory attribute is not used for two
reasons. First, it reflects the userspace expectation for that
memory location, and therefore can be toggled by userspace.
Second, although each guest_memfd file has a 1:1 binding with a
KVM instance, the plan is to allow multiple files per inode, e.g.
to allow intra-host migration to a new KVM instance without
destroying the guest_memfd.
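
As a rough sketch of what this tracking looks like (the xarray
and the kvm_gmem_private() helper are added in this patch; the
store itself happens in a later patch in this series):

	struct kvm_gmem_inode_private *private = kvm_gmem_private(inode);

	/* Offset 'index' is considered mappable iff an entry is present. */
	xa_store(&private->mappable_offsets, index, xa_mk_value(true),
		 GFP_KERNEL);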

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Co-developed-by: Vishal Annapurve <vannapurve@google.com>
Signed-off-by: Vishal Annapurve <vannapurve@google.com>
Co-developed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
---
 virt/kvm/guest_memfd.c | 56 ++++++++++++++++++++++++++++++++++++++----
 1 file changed, 51 insertions(+), 5 deletions(-)

diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 5d7fd1f708a6..4d3ba346c415 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -18,6 +18,17 @@ struct kvm_gmem {
 	struct list_head entry;
 };
 
+struct kvm_gmem_inode_private {
+#ifdef CONFIG_KVM_GMEM_MAPPABLE
+	struct xarray mappable_offsets;
+#endif
+};
+
+static struct kvm_gmem_inode_private *kvm_gmem_private(struct inode *inode)
+{
+	return inode->i_mapping->i_private_data;
+}
+
 /**
  * folio_file_pfn - like folio_file_page, but return a pfn.
  * @folio: The folio which contains this index.
@@ -307,8 +318,28 @@ static inline struct file *kvm_gmem_get_file(struct kvm_memory_slot *slot)
 	return get_file_active(&slot->gmem.file);
 }
 
+static void kvm_gmem_evict_inode(struct inode *inode)
+{
+	struct kvm_gmem_inode_private *private = kvm_gmem_private(inode);
+
+#ifdef CONFIG_KVM_GMEM_MAPPABLE
+	/*
+	 * .free_inode can be called before private data is set up if there are
+	 * issues during inode creation.
+	 */
+	if (private)
+		xa_destroy(&private->mappable_offsets);
+#endif
+
+	truncate_inode_pages_final(inode->i_mapping);
+
+	kfree(private);
+	clear_inode(inode);
+}
+
 static const struct super_operations kvm_gmem_super_operations = {
-	.statfs		= simple_statfs,
+	.statfs         = simple_statfs,
+	.evict_inode	= kvm_gmem_evict_inode,
 };
 
 static int kvm_gmem_init_fs_context(struct fs_context *fc)
@@ -435,6 +466,7 @@ static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
 						      loff_t size, u64 flags)
 {
 	const struct qstr qname = QSTR_INIT(name, strlen(name));
+	struct kvm_gmem_inode_private *private;
 	struct inode *inode;
 	int err;
 
@@ -443,10 +475,19 @@ static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
 		return inode;
 
 	err = security_inode_init_security_anon(inode, &qname, NULL);
-	if (err) {
-		iput(inode);
-		return ERR_PTR(err);
-	}
+	if (err)
+		goto out;
+
+	err = -ENOMEM;
+	private = kzalloc(sizeof(*private), GFP_KERNEL);
+	if (!private)
+		goto out;
+
+#ifdef CONFIG_KVM_GMEM_MAPPABLE
+	xa_init(&private->mappable_offsets);
+#endif
+
+	inode->i_mapping->i_private_data = private;
 
 	inode->i_private = (void *)(unsigned long)flags;
 	inode->i_op = &kvm_gmem_iops;
@@ -459,6 +500,11 @@ static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
 	WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
 
 	return inode;
+
+out:
+	iput(inode);
+
+	return ERR_PTR(err);
 }
 
 static struct file *kvm_gmem_inode_create_getfile(void *priv, loff_t size,
-- 
2.47.0.rc0.187.ge670bccf7e-goog




* [PATCH v3 03/11] KVM: guest_memfd: Introduce kvm_gmem_get_pfn_locked(), which retains the folio lock
  2024-10-10  8:59 [PATCH v3 00/11] KVM: Restricted mapping of guest_memfd at the host and arm64 support Fuad Tabba
  2024-10-10  8:59 ` [PATCH v3 01/11] KVM: guest_memfd: Make guest mem use guest mem inodes instead of anonymous inodes Fuad Tabba
  2024-10-10  8:59 ` [PATCH v3 02/11] KVM: guest_memfd: Track mappability within a struct kvm_gmem_private Fuad Tabba
@ 2024-10-10  8:59 ` Fuad Tabba
  2024-10-10  8:59 ` [PATCH v3 04/11] KVM: guest_memfd: Allow host to mmap guest_memfd() pages when shared Fuad Tabba
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 28+ messages in thread
From: Fuad Tabba @ 2024-10-10  8:59 UTC (permalink / raw)
  To: kvm, linux-arm-msm, linux-mm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail, david,
	michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat, shuah,
	hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton, tabba

Create a new variant of kvm_gmem_get_pfn(), which retains the
folio lock if it returns successfully. This is needed in
subsequent patches in order to protect against races when
checking whether a folio can be mapped by the host.
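
For reference, a sketch of the intended calling pattern (this is
how kvm_gmem_is_pfn_mapped(), added later in this series, uses
it):

	struct page *page;
	bool is_mapped;
	kvm_pfn_t pfn;

	if (kvm_gmem_get_pfn_locked(kvm, slot, gfn, &pfn, NULL))
		return false;

	page = pfn_to_page(pfn);

	/* The folio lock keeps new faults out while the refcount is read. */
	is_mapped = page_ref_count(page) > 2;

	put_page(page);
	unlock_page(page);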

Signed-off-by: Fuad Tabba <tabba@google.com>
---
 include/linux/kvm_host.h | 11 ++++++++++
 virt/kvm/guest_memfd.c   | 45 +++++++++++++++++++++++++++++++---------
 2 files changed, 46 insertions(+), 10 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index db567d26f7b9..acf85995b582 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2464,6 +2464,9 @@ static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
 #ifdef CONFIG_KVM_PRIVATE_MEM
 int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
 		     gfn_t gfn, kvm_pfn_t *pfn, int *max_order);
+int kvm_gmem_get_pfn_locked(struct kvm *kvm, struct kvm_memory_slot *slot,
+			    gfn_t gfn, kvm_pfn_t *pfn, int *max_order);
+
 #else
 static inline int kvm_gmem_get_pfn(struct kvm *kvm,
 				   struct kvm_memory_slot *slot, gfn_t gfn,
@@ -2472,6 +2475,14 @@ static inline int kvm_gmem_get_pfn(struct kvm *kvm,
 	KVM_BUG_ON(1, kvm);
 	return -EIO;
 }
+static inline int kvm_gmem_get_pfn_locked(struct kvm *kvm,
+					  struct kvm_memory_slot *slot,
+					  gfn_t gfn,
+					  kvm_pfn_t *pfn, int *max_order)
+{
+	KVM_BUG_ON(1, kvm);
+	return -EIO;
+}
 #endif /* CONFIG_KVM_PRIVATE_MEM */
 
 #ifdef CONFIG_HAVE_KVM_ARCH_GMEM_PREPARE
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 4d3ba346c415..f414646c475b 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -714,34 +714,59 @@ __kvm_gmem_get_pfn(struct file *file, struct kvm_memory_slot *slot,
 	return folio;
 }
 
-int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
-		     gfn_t gfn, kvm_pfn_t *pfn, int *max_order)
+static int
+kvm_gmem_get_pfn_folio_locked(struct kvm *kvm, struct kvm_memory_slot *slot,
+			      gfn_t gfn, kvm_pfn_t *pfn, int *max_order,
+			      struct folio **folio)
 {
 	struct file *file = kvm_gmem_get_file(slot);
-	struct folio *folio;
 	bool is_prepared = false;
 	int r = 0;
 
 	if (!file)
 		return -EFAULT;
 
-	folio = __kvm_gmem_get_pfn(file, slot, gfn, pfn, &is_prepared, max_order);
-	if (IS_ERR(folio)) {
-		r = PTR_ERR(folio);
+	*folio = __kvm_gmem_get_pfn(file, slot, gfn, pfn, &is_prepared, max_order);
+	if (IS_ERR(*folio)) {
+		r = PTR_ERR(*folio);
 		goto out;
 	}
 
 	if (!is_prepared)
-		r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
+		r = kvm_gmem_prepare_folio(kvm, slot, gfn, *folio);
 
-	folio_unlock(folio);
-	if (r < 0)
-		folio_put(folio);
+	if (r) {
+		folio_unlock(*folio);
+		folio_put(*folio);
+	}
 
 out:
 	fput(file);
 	return r;
 }
+
+int kvm_gmem_get_pfn_locked(struct kvm *kvm, struct kvm_memory_slot *slot,
+			    gfn_t gfn, kvm_pfn_t *pfn, int *max_order)
+{
+	struct folio *folio;
+
+	return kvm_gmem_get_pfn_folio_locked(kvm, slot, gfn, pfn, max_order, &folio);
+
+}
+EXPORT_SYMBOL_GPL(kvm_gmem_get_pfn_locked);
+
+int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
+		     gfn_t gfn, kvm_pfn_t *pfn, int *max_order)
+{
+	struct folio *folio;
+	int r;
+
+	r = kvm_gmem_get_pfn_folio_locked(kvm, slot, gfn, pfn, max_order, &folio);
+	if (!r)
+		folio_unlock(folio);
+
+	return r;
+}
 EXPORT_SYMBOL_GPL(kvm_gmem_get_pfn);
 
 #ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
-- 
2.47.0.rc0.187.ge670bccf7e-goog




* [PATCH v3 04/11] KVM: guest_memfd: Allow host to mmap guest_memfd() pages when shared
  2024-10-10  8:59 [PATCH v3 00/11] KVM: Restricted mapping of guest_memfd at the host and arm64 support Fuad Tabba
                   ` (2 preceding siblings ...)
  2024-10-10  8:59 ` [PATCH v3 03/11] KVM: guest_memfd: Introduce kvm_gmem_get_pfn_locked(), which retains the folio lock Fuad Tabba
@ 2024-10-10  8:59 ` Fuad Tabba
  2024-10-10 10:14   ` Kirill A. Shutemov
  2024-10-14 16:52   ` Elliot Berman
  2024-10-10  8:59 ` [PATCH v3 05/11] KVM: guest_memfd: Add guest_memfd support to kvm_(read|write)_guest_page() Fuad Tabba
                   ` (6 subsequent siblings)
  10 siblings, 2 replies; 28+ messages in thread
From: Fuad Tabba @ 2024-10-10  8:59 UTC (permalink / raw)
  To: kvm, linux-arm-msm, linux-mm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail, david,
	michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat, shuah,
	hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton, tabba

Add support for mmap() and fault() for guest_memfd in the host.
The ability to fault in a guest page is contingent on that page
being shared with the host.

The guest_memfd PRIVATE memory attribute is not used for two
reasons. First, it reflects the userspace expectation for that
memory location, and therefore can be toggled by userspace.
Second, although each guest_memfd file has a 1:1 binding with a
KVM instance, the plan is to allow multiple files per inode, e.g.
to allow intra-host migration to a new KVM instance without
destroying the guest_memfd.

The mapping is restricted to memory explicitly shared with the
host. KVM checks that the host doesn't have any mappings for
private memory via the folio's refcount. To avoid races between
paths that check mappability and paths that check whether the
host has any mappings (via the refcount), the folio lock is held
while either check is being performed.

This new feature is gated behind a new configuration option,
CONFIG_KVM_GMEM_MAPPABLE.

Co-developed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Co-developed-by: Elliot Berman <quic_eberman@quicinc.com>
Signed-off-by: Elliot Berman <quic_eberman@quicinc.com>
Signed-off-by: Fuad Tabba <tabba@google.com>

---

Note that the functions kvm_gmem_is_mapped(),
kvm_gmem_set_mappable(), and kvm_gmem_clear_mappable() are not
used in this patch series. They are intended to be used in future
patches [*], which check and toggle mappability when the guest
shares/unshares pages with the host.

[*] https://android-kvm.googlesource.com/linux/+/refs/heads/tabba/guestmem-6.12-v3-pkvm

---
 include/linux/kvm_host.h |  52 +++++++++++
 virt/kvm/Kconfig         |   4 +
 virt/kvm/guest_memfd.c   | 185 +++++++++++++++++++++++++++++++++++++++
 virt/kvm/kvm_main.c      | 138 +++++++++++++++++++++++++++++
 4 files changed, 379 insertions(+)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index acf85995b582..bda7fda9945e 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2527,4 +2527,56 @@ long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
 				    struct kvm_pre_fault_memory *range);
 #endif
 
+#ifdef CONFIG_KVM_GMEM_MAPPABLE
+bool kvm_gmem_is_mappable(struct kvm *kvm, gfn_t gfn, gfn_t end);
+bool kvm_gmem_is_mapped(struct kvm *kvm, gfn_t start, gfn_t end);
+int kvm_gmem_set_mappable(struct kvm *kvm, gfn_t start, gfn_t end);
+int kvm_gmem_clear_mappable(struct kvm *kvm, gfn_t start, gfn_t end);
+int kvm_slot_gmem_set_mappable(struct kvm_memory_slot *slot, gfn_t start,
+			       gfn_t end);
+int kvm_slot_gmem_clear_mappable(struct kvm_memory_slot *slot, gfn_t start,
+				 gfn_t end);
+bool kvm_slot_gmem_is_mappable(struct kvm_memory_slot *slot, gfn_t gfn);
+#else
+static inline bool kvm_gmem_is_mappable(struct kvm *kvm, gfn_t gfn, gfn_t end)
+{
+	WARN_ON_ONCE(1);
+	return false;
+}
+static inline bool kvm_gmem_is_mapped(struct kvm *kvm, gfn_t start, gfn_t end)
+{
+	WARN_ON_ONCE(1);
+	return false;
+}
+static inline int kvm_gmem_set_mappable(struct kvm *kvm, gfn_t start, gfn_t end)
+{
+	WARN_ON_ONCE(1);
+	return -EINVAL;
+}
+static inline int kvm_gmem_clear_mappable(struct kvm *kvm, gfn_t start,
+					  gfn_t end)
+{
+	WARN_ON_ONCE(1);
+	return -EINVAL;
+}
+static inline int kvm_slot_gmem_set_mappable(struct kvm_memory_slot *slot,
+					     gfn_t start, gfn_t end)
+{
+	WARN_ON_ONCE(1);
+	return -EINVAL;
+}
+static inline int kvm_slot_gmem_clear_mappable(struct kvm_memory_slot *slot,
+					       gfn_t start, gfn_t end)
+{
+	WARN_ON_ONCE(1);
+	return -EINVAL;
+}
+static inline bool kvm_slot_gmem_is_mappable(struct kvm_memory_slot *slot,
+					     gfn_t gfn)
+{
+	WARN_ON_ONCE(1);
+	return false;
+}
+#endif /* CONFIG_KVM_GMEM_MAPPABLE */
+
 #endif
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index fd6a3010afa8..2cfcb0848e37 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -120,3 +120,7 @@ config HAVE_KVM_ARCH_GMEM_PREPARE
 config HAVE_KVM_ARCH_GMEM_INVALIDATE
        bool
        depends on KVM_PRIVATE_MEM
+
+config KVM_GMEM_MAPPABLE
+       select KVM_PRIVATE_MEM
+       bool
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index f414646c475b..df3a6f05a16e 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -370,7 +370,184 @@ static void kvm_gmem_init_mount(void)
 	kvm_gmem_mnt->mnt_flags |= MNT_NOEXEC;
 }
 
+#ifdef CONFIG_KVM_GMEM_MAPPABLE
+static struct folio *
+__kvm_gmem_get_pfn(struct file *file, struct kvm_memory_slot *slot,
+		   gfn_t gfn, kvm_pfn_t *pfn, bool *is_prepared,
+		   int *max_order);
+
+static int gmem_set_mappable(struct inode *inode, pgoff_t start, pgoff_t end)
+{
+	struct xarray *mappable_offsets = &kvm_gmem_private(inode)->mappable_offsets;
+	void *xval = xa_mk_value(true);
+	pgoff_t i;
+	bool r;
+
+	filemap_invalidate_lock(inode->i_mapping);
+	for (i = start; i < end; i++) {
+		r = xa_err(xa_store(mappable_offsets, i, xval, GFP_KERNEL));
+		if (r)
+			break;
+	}
+	filemap_invalidate_unlock(inode->i_mapping);
+
+	return r;
+}
+
+static int gmem_clear_mappable(struct inode *inode, pgoff_t start, pgoff_t end)
+{
+	struct xarray *mappable_offsets = &kvm_gmem_private(inode)->mappable_offsets;
+	pgoff_t i;
+	int r = 0;
+
+	filemap_invalidate_lock(inode->i_mapping);
+	for (i = start; i < end; i++) {
+		struct folio *folio;
+
+		/*
+		 * Holds the folio lock until after checking its refcount,
+		 * to avoid races with paths that fault in the folio.
+		 */
+		folio = kvm_gmem_get_folio(inode, i);
+		if (WARN_ON_ONCE(IS_ERR(folio)))
+			continue;
+
+		/*
+		 * Check that the host doesn't have any mappings on clearing
+		 * the mappable flag, because clearing the flag implies that the
+		 * memory will be unshared from the host. Therefore, to maintain
+		 * the invariant that the host cannot access private memory, we
+		 * need to check that it doesn't have any mappings to that
+		 * memory before making it private.
+		 *
+		 * Two references are expected because of kvm_gmem_get_folio().
+		 */
+		if (folio_ref_count(folio) > 2)
+			r = -EPERM;
+		else
+			xa_erase(mappable_offsets, i);
+
+		folio_put(folio);
+		folio_unlock(folio);
+
+		if (r)
+			break;
+	}
+	filemap_invalidate_unlock(inode->i_mapping);
+
+	return r;
+}
+
+static bool gmem_is_mappable(struct inode *inode, pgoff_t pgoff)
+{
+	struct xarray *mappable_offsets = &kvm_gmem_private(inode)->mappable_offsets;
+	bool r;
+
+	filemap_invalidate_lock_shared(inode->i_mapping);
+	r = xa_find(mappable_offsets, &pgoff, pgoff, XA_PRESENT);
+	filemap_invalidate_unlock_shared(inode->i_mapping);
+
+	return r;
+}
+
+int kvm_slot_gmem_set_mappable(struct kvm_memory_slot *slot, gfn_t start, gfn_t end)
+{
+	struct inode *inode = file_inode(slot->gmem.file);
+	pgoff_t start_off = slot->gmem.pgoff + start - slot->base_gfn;
+	pgoff_t end_off = start_off + end - start;
+
+	return gmem_set_mappable(inode, start_off, end_off);
+}
+
+int kvm_slot_gmem_clear_mappable(struct kvm_memory_slot *slot, gfn_t start, gfn_t end)
+{
+	struct inode *inode = file_inode(slot->gmem.file);
+	pgoff_t start_off = slot->gmem.pgoff + start - slot->base_gfn;
+	pgoff_t end_off = start_off + end - start;
+
+	return gmem_clear_mappable(inode, start_off, end_off);
+}
+
+bool kvm_slot_gmem_is_mappable(struct kvm_memory_slot *slot, gfn_t gfn)
+{
+	struct inode *inode = file_inode(slot->gmem.file);
+	unsigned long pgoff = slot->gmem.pgoff + gfn - slot->base_gfn;
+
+	return gmem_is_mappable(inode, pgoff);
+}
+
+static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
+{
+	struct inode *inode = file_inode(vmf->vma->vm_file);
+	struct folio *folio;
+	vm_fault_t ret = VM_FAULT_LOCKED;
+
+	/*
+	 * Holds the folio lock until after checking whether it can be faulted
+	 * in, to avoid races with paths that change a folio's mappability.
+	 */
+	folio = kvm_gmem_get_folio(inode, vmf->pgoff);
+	if (!folio)
+		return VM_FAULT_SIGBUS;
+
+	if (folio_test_hwpoison(folio)) {
+		ret = VM_FAULT_HWPOISON;
+		goto out;
+	}
+
+	if (!gmem_is_mappable(inode, vmf->pgoff)) {
+		ret = VM_FAULT_SIGBUS;
+		goto out;
+	}
+
+	if (!folio_test_uptodate(folio)) {
+		unsigned long nr_pages = folio_nr_pages(folio);
+		unsigned long i;
+
+		for (i = 0; i < nr_pages; i++)
+			clear_highpage(folio_page(folio, i));
+
+		folio_mark_uptodate(folio);
+	}
+
+	vmf->page = folio_file_page(folio, vmf->pgoff);
+out:
+	if (ret != VM_FAULT_LOCKED) {
+		folio_put(folio);
+		folio_unlock(folio);
+	}
+
+	return ret;
+}
+
+static const struct vm_operations_struct kvm_gmem_vm_ops = {
+	.fault = kvm_gmem_fault,
+};
+
+static int kvm_gmem_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	if ((vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) !=
+	    (VM_SHARED | VM_MAYSHARE)) {
+		return -EINVAL;
+	}
+
+	file_accessed(file);
+	vm_flags_set(vma, VM_DONTDUMP);
+	vma->vm_ops = &kvm_gmem_vm_ops;
+
+	return 0;
+}
+#else
+static int gmem_set_mappable(struct inode *inode, pgoff_t start, pgoff_t end)
+{
+	WARN_ON_ONCE(1);
+	return -EINVAL;
+}
+#define kvm_gmem_mmap NULL
+#endif /* CONFIG_KVM_GMEM_MAPPABLE */
+
 static struct file_operations kvm_gmem_fops = {
+	.mmap		= kvm_gmem_mmap,
 	.open		= generic_file_open,
 	.release	= kvm_gmem_release,
 	.fallocate	= kvm_gmem_fallocate,
@@ -557,6 +734,14 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
 		goto err_gmem;
 	}
 
+	if (IS_ENABLED(CONFIG_KVM_GMEM_MAPPABLE)) {
+		err = gmem_set_mappable(file_inode(file), 0, size >> PAGE_SHIFT);
+		if (err) {
+			fput(file);
+			goto err_gmem;
+		}
+	}
+
 	kvm_get_kvm(kvm);
 	gmem->kvm = kvm;
 	xa_init(&gmem->bindings);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 05cbb2548d99..aed9cf2f1685 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3263,6 +3263,144 @@ static int next_segment(unsigned long len, int offset)
 		return len;
 }
 
+#ifdef CONFIG_KVM_GMEM_MAPPABLE
+static bool __kvm_gmem_is_mappable(struct kvm *kvm, gfn_t start, gfn_t end)
+{
+	struct kvm_memslot_iter iter;
+
+	lockdep_assert_held(&kvm->slots_lock);
+
+	kvm_for_each_memslot_in_gfn_range(&iter, kvm_memslots(kvm), start, end) {
+		struct kvm_memory_slot *memslot = iter.slot;
+		gfn_t gfn_start, gfn_end, i;
+
+		gfn_start = max(start, memslot->base_gfn);
+		gfn_end = min(end, memslot->base_gfn + memslot->npages);
+		if (WARN_ON_ONCE(gfn_start >= gfn_end))
+			continue;
+
+		for (i = gfn_start; i < gfn_end; i++) {
+			if (!kvm_slot_gmem_is_mappable(memslot, i))
+				return false;
+		}
+	}
+
+	return true;
+}
+
+bool kvm_gmem_is_mappable(struct kvm *kvm, gfn_t start, gfn_t end)
+{
+	bool r;
+
+	mutex_lock(&kvm->slots_lock);
+	r = __kvm_gmem_is_mappable(kvm, start, end);
+	mutex_unlock(&kvm->slots_lock);
+
+	return r;
+}
+
+static bool kvm_gmem_is_pfn_mapped(struct kvm *kvm, struct kvm_memory_slot *memslot, gfn_t gfn_idx)
+{
+	struct page *page;
+	bool is_mapped;
+	kvm_pfn_t pfn;
+
+	/*
+	 * Holds the folio lock until after checking its refcount,
+	 * to avoid races with paths that fault in the folio.
+	 */
+	if (WARN_ON_ONCE(kvm_gmem_get_pfn_locked(kvm, memslot, gfn_idx, &pfn, NULL)))
+		return false;
+
+	page = pfn_to_page(pfn);
+
+	/* Two references are expected because of kvm_gmem_get_pfn_locked(). */
+	is_mapped = page_ref_count(page) > 2;
+
+	put_page(page);
+	unlock_page(page);
+
+	return is_mapped;
+}
+
+static bool __kvm_gmem_is_mapped(struct kvm *kvm, gfn_t start, gfn_t end)
+{
+	struct kvm_memslot_iter iter;
+
+	lockdep_assert_held(&kvm->slots_lock);
+
+	kvm_for_each_memslot_in_gfn_range(&iter, kvm_memslots(kvm), start, end) {
+		struct kvm_memory_slot *memslot = iter.slot;
+		gfn_t gfn_start, gfn_end, i;
+
+		gfn_start = max(start, memslot->base_gfn);
+		gfn_end = min(end, memslot->base_gfn + memslot->npages);
+		if (WARN_ON_ONCE(gfn_start >= gfn_end))
+			continue;
+
+		for (i = gfn_start; i < gfn_end; i++) {
+			if (kvm_gmem_is_pfn_mapped(kvm, memslot, i))
+				return true;
+		}
+	}
+
+	return false;
+}
+
+bool kvm_gmem_is_mapped(struct kvm *kvm, gfn_t start, gfn_t end)
+{
+	bool r;
+
+	mutex_lock(&kvm->slots_lock);
+	r = __kvm_gmem_is_mapped(kvm, start, end);
+	mutex_unlock(&kvm->slots_lock);
+
+	return r;
+}
+
+static int kvm_gmem_toggle_mappable(struct kvm *kvm, gfn_t start, gfn_t end,
+				    bool is_mappable)
+{
+	struct kvm_memslot_iter iter;
+	int r = 0;
+
+	mutex_lock(&kvm->slots_lock);
+
+	kvm_for_each_memslot_in_gfn_range(&iter, kvm_memslots(kvm), start, end) {
+		struct kvm_memory_slot *memslot = iter.slot;
+		gfn_t gfn_start, gfn_end;
+
+		gfn_start = max(start, memslot->base_gfn);
+		gfn_end = min(end, memslot->base_gfn + memslot->npages);
+		if (WARN_ON_ONCE(start >= end))
+			continue;
+
+		if (is_mappable)
+			r = kvm_slot_gmem_set_mappable(memslot, gfn_start, gfn_end);
+		else
+			r = kvm_slot_gmem_clear_mappable(memslot, gfn_start, gfn_end);
+
+		if (WARN_ON_ONCE(r))
+			break;
+	}
+
+	mutex_unlock(&kvm->slots_lock);
+
+	return r;
+}
+
+int kvm_gmem_set_mappable(struct kvm *kvm, gfn_t start, gfn_t end)
+{
+	return kvm_gmem_toggle_mappable(kvm, start, end, true);
+}
+
+int kvm_gmem_clear_mappable(struct kvm *kvm, gfn_t start, gfn_t end)
+{
+	return kvm_gmem_toggle_mappable(kvm, start, end, false);
+}
+
+#endif /* CONFIG_KVM_GMEM_MAPPABLE */
+
 /* Copy @len bytes from guest memory at '(@gfn * PAGE_SIZE) + @offset' to @data */
 static int __kvm_read_guest_page(struct kvm_memory_slot *slot, gfn_t gfn,
 				 void *data, int offset, int len)
-- 
2.47.0.rc0.187.ge670bccf7e-goog




* [PATCH v3 05/11] KVM: guest_memfd: Add guest_memfd support to kvm_(read|write)_guest_page()
  2024-10-10  8:59 [PATCH v3 00/11] KVM: Restricted mapping of guest_memfd at the host and arm64 support Fuad Tabba
                   ` (3 preceding siblings ...)
  2024-10-10  8:59 ` [PATCH v3 04/11] KVM: guest_memfd: Allow host to mmap guest_memfd() pages when shared Fuad Tabba
@ 2024-10-10  8:59 ` Fuad Tabba
  2024-10-17 21:53   ` Ackerley Tng
  2024-10-10  8:59 ` [PATCH v3 06/11] KVM: guest_memfd: Add KVM capability to check if guest_memfd is host mappable Fuad Tabba
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 28+ messages in thread
From: Fuad Tabba @ 2024-10-10  8:59 UTC (permalink / raw)
  To: kvm, linux-arm-msm, linux-mm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail, david,
	michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat, shuah,
	hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton, tabba

Make kvm_(read|write)_guest_page() capable of accessing guest
memory for slots that don't have a userspace address, but only if
the memory is mappable, which also indicates that it is
accessible by the host.

Signed-off-by: Fuad Tabba <tabba@google.com>
---
 virt/kvm/kvm_main.c | 137 ++++++++++++++++++++++++++++++++++++++------
 1 file changed, 118 insertions(+), 19 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index aed9cf2f1685..77e6412034b9 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3399,23 +3399,114 @@ int kvm_gmem_clear_mappable(struct kvm *kvm, gfn_t start, gfn_t end)
 	return kvm_gmem_toggle_mappable(kvm, start, end, false);
 }
 
+static int __kvm_read_guest_memfd_page(struct kvm *kvm,
+				       struct kvm_memory_slot *slot,
+				       gfn_t gfn, void *data, int offset,
+				       int len)
+{
+	struct page *page;
+	u64 pfn;
+	int r;
+
+	/*
+	 * Holds the folio lock until after checking whether it can be faulted
+	 * in, to avoid races with paths that change a folio's mappability.
+	 */
+	r = kvm_gmem_get_pfn_locked(kvm, slot, gfn, &pfn, NULL);
+	if (r)
+		return r;
+
+	page = pfn_to_page(pfn);
+
+	if (!kvm_gmem_is_mappable(kvm, gfn, gfn + 1)) {
+		r = -EPERM;
+		goto unlock;
+	}
+	memcpy(data, page_address(page) + offset, len);
+unlock:
+	if (r)
+		put_page(page);
+	else
+		kvm_release_pfn_clean(pfn);
+	unlock_page(page);
+
+	return r;
+}
+
+static int __kvm_write_guest_memfd_page(struct kvm *kvm,
+					struct kvm_memory_slot *slot,
+					gfn_t gfn, const void *data,
+					int offset, int len)
+{
+	struct page *page;
+	u64 pfn;
+	int r;
+
+	/*
+	 * Holds the folio lock until after checking whether it can be faulted
+	 * in, to avoid races with paths that change a folio's mappability.
+	 */
+	r = kvm_gmem_get_pfn_locked(kvm, slot, gfn, &pfn, NULL);
+	if (r)
+		return r;
+
+	page = pfn_to_page(pfn);
+
+	if (!kvm_gmem_is_mappable(kvm, gfn, gfn + 1)) {
+		r = -EPERM;
+		goto unlock;
+	}
+	memcpy(page_address(page) + offset, data, len);
+unlock:
+	if (r)
+		put_page(page);
+	else
+		kvm_release_pfn_dirty(pfn);
+	unlock_page(page);
+
+	return r;
+}
+#else
+static int __kvm_read_guest_memfd_page(struct kvm *kvm,
+				       struct kvm_memory_slot *slot,
+				       gfn_t gfn, void *data, int offset,
+				       int len)
+{
+	WARN_ON_ONCE(1);
+	return -EIO;
+}
+
+static int __kvm_write_guest_memfd_page(struct kvm *kvm,
+					struct kvm_memory_slot *slot,
+					gfn_t gfn, const void *data,
+					int offset, int len)
+{
+	WARN_ON_ONCE(1);
+	return -EIO;
+}
 #endif /* CONFIG_KVM_GMEM_MAPPABLE */
 
 /* Copy @len bytes from guest memory at '(@gfn * PAGE_SIZE) + @offset' to @data */
-static int __kvm_read_guest_page(struct kvm_memory_slot *slot, gfn_t gfn,
-				 void *data, int offset, int len)
+
+static int __kvm_read_guest_page(struct kvm *kvm, struct kvm_memory_slot *slot,
+				 gfn_t gfn, void *data, int offset, int len)
 {
-	int r;
 	unsigned long addr;
 
 	if (WARN_ON_ONCE(offset + len > PAGE_SIZE))
 		return -EFAULT;
 
+	if (IS_ENABLED(CONFIG_KVM_GMEM_MAPPABLE) &&
+	    kvm_slot_can_be_private(slot) &&
+	    !slot->userspace_addr) {
+		return __kvm_read_guest_memfd_page(kvm, slot, gfn, data,
+						   offset, len);
+	}
+
 	addr = gfn_to_hva_memslot_prot(slot, gfn, NULL);
 	if (kvm_is_error_hva(addr))
 		return -EFAULT;
-	r = __copy_from_user(data, (void __user *)addr + offset, len);
-	if (r)
+	if (__copy_from_user(data, (void __user *)addr + offset, len))
 		return -EFAULT;
 	return 0;
 }
@@ -3425,7 +3516,7 @@ int kvm_read_guest_page(struct kvm *kvm, gfn_t gfn, void *data, int offset,
 {
 	struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
 
-	return __kvm_read_guest_page(slot, gfn, data, offset, len);
+	return __kvm_read_guest_page(kvm, slot, gfn, data, offset, len);
 }
 EXPORT_SYMBOL_GPL(kvm_read_guest_page);
 
@@ -3434,7 +3525,7 @@ int kvm_vcpu_read_guest_page(struct kvm_vcpu *vcpu, gfn_t gfn, void *data,
 {
 	struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
 
-	return __kvm_read_guest_page(slot, gfn, data, offset, len);
+	return __kvm_read_guest_page(vcpu->kvm, slot, gfn, data, offset, len);
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_read_guest_page);
 
@@ -3511,22 +3602,30 @@ EXPORT_SYMBOL_GPL(kvm_vcpu_read_guest_atomic);
 
 /* Copy @len bytes from @data into guest memory at '(@gfn * PAGE_SIZE) + @offset' */
 static int __kvm_write_guest_page(struct kvm *kvm,
-				  struct kvm_memory_slot *memslot, gfn_t gfn,
-			          const void *data, int offset, int len)
+				  struct kvm_memory_slot *slot, gfn_t gfn,
+				  const void *data, int offset, int len)
 {
-	int r;
-	unsigned long addr;
-
 	if (WARN_ON_ONCE(offset + len > PAGE_SIZE))
 		return -EFAULT;
 
-	addr = gfn_to_hva_memslot(memslot, gfn);
-	if (kvm_is_error_hva(addr))
-		return -EFAULT;
-	r = __copy_to_user((void __user *)addr + offset, data, len);
-	if (r)
-		return -EFAULT;
-	mark_page_dirty_in_slot(kvm, memslot, gfn);
+	if (IS_ENABLED(CONFIG_KVM_GMEM_MAPPABLE) &&
+	    kvm_slot_can_be_private(slot) &&
+	    !slot->userspace_addr) {
+		int r = __kvm_write_guest_memfd_page(kvm, slot, gfn, data,
+						     offset, len);
+
+		if (r)
+			return r;
+	} else {
+		unsigned long addr = gfn_to_hva_memslot(slot, gfn);
+
+		if (kvm_is_error_hva(addr))
+			return -EFAULT;
+		if (__copy_to_user((void __user *)addr + offset, data, len))
+			return -EFAULT;
+	}
+
+	mark_page_dirty_in_slot(kvm, slot, gfn);
 	return 0;
 }
 
-- 
2.47.0.rc0.187.ge670bccf7e-goog




* [PATCH v3 06/11] KVM: guest_memfd: Add KVM capability to check if guest_memfd is host mappable
  2024-10-10  8:59 [PATCH v3 00/11] KVM: Restricted mapping of guest_memfd at the host and arm64 support Fuad Tabba
                   ` (4 preceding siblings ...)
  2024-10-10  8:59 ` [PATCH v3 05/11] KVM: guest_memfd: Add guest_memfd support to kvm_(read|write)_guest_page() Fuad Tabba
@ 2024-10-10  8:59 ` Fuad Tabba
  2024-10-15 10:30   ` Suzuki K Poulose
  2024-10-10  8:59 ` [PATCH v3 07/11] KVM: guest_memfd: Add a guest_memfd() flag to initialize it as mappable Fuad Tabba
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 28+ messages in thread
From: Fuad Tabba @ 2024-10-10  8:59 UTC (permalink / raw)
  To: kvm, linux-arm-msm, linux-mm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail, david,
	michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat, shuah,
	hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton, tabba

Add the KVM capability KVM_CAP_GUEST_MEMFD_MAPPABLE, which
reports whether the host supports mapping guest_memfd memory.
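
Userspace can probe for it like any other capability, e.g. (a
sketch; kvm_fd is the /dev/kvm or VM file descriptor):

	if (ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_GUEST_MEMFD_MAPPABLE) > 0) {
		/* guest_memfd memory can be mmap()ed while it is shared. */
	}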

Signed-off-by: Fuad Tabba <tabba@google.com>
---
 include/uapi/linux/kvm.h | 1 +
 virt/kvm/kvm_main.c      | 4 ++++
 2 files changed, 5 insertions(+)

diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 637efc055145..2c6057bab71c 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -933,6 +933,7 @@ struct kvm_enable_cap {
 #define KVM_CAP_PRE_FAULT_MEMORY 236
 #define KVM_CAP_X86_APIC_BUS_CYCLES_NS 237
 #define KVM_CAP_X86_GUEST_MODE 238
+#define KVM_CAP_GUEST_MEMFD_MAPPABLE 239
 
 struct kvm_irq_routing_irqchip {
 	__u32 irqchip;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 77e6412034b9..c2ff09197795 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -5176,6 +5176,10 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
 #ifdef CONFIG_KVM_PRIVATE_MEM
 	case KVM_CAP_GUEST_MEMFD:
 		return !kvm || kvm_arch_has_private_mem(kvm);
+#endif
+#ifdef CONFIG_KVM_GMEM_MAPPABLE
+	case KVM_CAP_GUEST_MEMFD_MAPPABLE:
+		return !kvm || kvm_arch_has_private_mem(kvm);
 #endif
 	default:
 		break;
-- 
2.47.0.rc0.187.ge670bccf7e-goog




* [PATCH v3 07/11] KVM: guest_memfd: Add a guest_memfd() flag to initialize it as mappable
  2024-10-10  8:59 [PATCH v3 00/11] KVM: Restricted mapping of guest_memfd at the host and arm64 support Fuad Tabba
                   ` (5 preceding siblings ...)
  2024-10-10  8:59 ` [PATCH v3 06/11] KVM: guest_memfd: Add KVM capability to check if guest_memfd is host mappable Fuad Tabba
@ 2024-10-10  8:59 ` Fuad Tabba
  2024-10-10  8:59 ` [PATCH v3 08/11] KVM: guest_memfd: selftests: guest_memfd mmap() test when mapping is allowed Fuad Tabba
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 28+ messages in thread
From: Fuad Tabba @ 2024-10-10  8:59 UTC (permalink / raw)
  To: kvm, linux-arm-msm, linux-mm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail, david,
	michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat, shuah,
	hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton, tabba

Not all use cases require guest_memfd() to be mappable by the
host when first created. Add a new flag,
GUEST_MEMFD_FLAG_INIT_MAPPABLE, which, when set on
KVM_CREATE_GUEST_MEMFD, initializes the memory as mappable by the
host. Otherwise, memory is private until shared by the guest with
the host.
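
For example (a sketch; vm_fd and size are placeholders), a VMM
that wants to populate guest memory via mmap() before the guest
starts running would create the guest_memfd as:

	struct kvm_create_guest_memfd args = {
		.size  = size,
		.flags = GUEST_MEMFD_FLAG_INIT_MAPPABLE,
	};
	int fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &args);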

Signed-off-by: Fuad Tabba <tabba@google.com>
---
 Documentation/virt/kvm/api.rst | 4 ++++
 include/uapi/linux/kvm.h       | 1 +
 virt/kvm/guest_memfd.c         | 6 +++++-
 3 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index e32471977d0a..c503f9443335 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6380,6 +6380,10 @@ most one mapping per page, i.e. binding multiple memory regions to a single
 guest_memfd range is not allowed (any number of memory regions can be bound to
 a single guest_memfd file, but the bound ranges must not overlap).
 
+If the capability KVM_CAP_GUEST_MEMFD_MAPPABLE is supported, then the flags
+field supports GUEST_MEMFD_FLAG_INIT_MAPPABLE, which initializes the memory
+as mappable by the host.
+
 See KVM_SET_USER_MEMORY_REGION2 for additional details.
 
 4.143 KVM_PRE_FAULT_MEMORY
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 2c6057bab71c..751f167d0f33 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1558,6 +1558,7 @@ struct kvm_memory_attributes {
 #define KVM_MEMORY_ATTRIBUTE_PRIVATE           (1ULL << 3)
 
 #define KVM_CREATE_GUEST_MEMFD	_IOWR(KVMIO,  0xd4, struct kvm_create_guest_memfd)
+#define GUEST_MEMFD_FLAG_INIT_MAPPABLE		BIT(0)
 
 struct kvm_create_guest_memfd {
 	__u64 size;
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index df3a6f05a16e..9080fa29cd8c 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -734,7 +734,8 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
 		goto err_gmem;
 	}
 
-	if (IS_ENABLED(CONFIG_KVM_GMEM_MAPPABLE)) {
+	if (IS_ENABLED(CONFIG_KVM_GMEM_MAPPABLE) &&
+	    (flags & GUEST_MEMFD_FLAG_INIT_MAPPABLE)) {
 		err = gmem_set_mappable(file_inode(file), 0, size >> PAGE_SHIFT);
 		if (err) {
 			fput(file);
@@ -763,6 +764,9 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
 	u64 flags = args->flags;
 	u64 valid_flags = 0;
 
+	if (IS_ENABLED(CONFIG_KVM_GMEM_MAPPABLE))
+		valid_flags |= GUEST_MEMFD_FLAG_INIT_MAPPABLE;
+
 	if (flags & ~valid_flags)
 		return -EINVAL;
 
-- 
2.47.0.rc0.187.ge670bccf7e-goog




* [PATCH v3 08/11] KVM: guest_memfd: selftests: guest_memfd mmap() test when mapping is allowed
  2024-10-10  8:59 [PATCH v3 00/11] KVM: Restricted mapping of guest_memfd at the host and arm64 support Fuad Tabba
                   ` (6 preceding siblings ...)
  2024-10-10  8:59 ` [PATCH v3 07/11] KVM: guest_memfd: Add a guest_memfd() flag to initialize it as mappable Fuad Tabba
@ 2024-10-10  8:59 ` Fuad Tabba
  2024-10-10  8:59 ` [PATCH v3 09/11] KVM: arm64: Skip VMA checks for slots without userspace address Fuad Tabba
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 28+ messages in thread
From: Fuad Tabba @ 2024-10-10  8:59 UTC (permalink / raw)
  To: kvm, linux-arm-msm, linux-mm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail, david,
	michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat, shuah,
	hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton, tabba

Expand the guest_memfd selftests to test mapping guest memory
when the capability is supported, and to check that the memory is
not mappable when the capability isn't supported.

Also, build the guest_memfd selftest for aarch64.

Signed-off-by: Fuad Tabba <tabba@google.com>
---
 tools/testing/selftests/kvm/Makefile          |  1 +
 .../testing/selftests/kvm/guest_memfd_test.c  | 57 +++++++++++++++++--
 2 files changed, 53 insertions(+), 5 deletions(-)

diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
index 960cf6a77198..c4937b6c2a97 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -172,6 +172,7 @@ TEST_GEN_PROGS_aarch64 += coalesced_io_test
 TEST_GEN_PROGS_aarch64 += demand_paging_test
 TEST_GEN_PROGS_aarch64 += dirty_log_test
 TEST_GEN_PROGS_aarch64 += dirty_log_perf_test
+TEST_GEN_PROGS_aarch64 += guest_memfd_test
 TEST_GEN_PROGS_aarch64 += guest_print_test
 TEST_GEN_PROGS_aarch64 += get-reg-list
 TEST_GEN_PROGS_aarch64 += kvm_create_max_vcpus
diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c b/tools/testing/selftests/kvm/guest_memfd_test.c
index ba0c8e996035..ae64027d5bd8 100644
--- a/tools/testing/selftests/kvm/guest_memfd_test.c
+++ b/tools/testing/selftests/kvm/guest_memfd_test.c
@@ -34,12 +34,55 @@ static void test_file_read_write(int fd)
 		    "pwrite on a guest_mem fd should fail");
 }
 
-static void test_mmap(int fd, size_t page_size)
+static void test_mmap_allowed(int fd, size_t total_size)
 {
+	size_t page_size = getpagesize();
+	char *mem;
+	int ret;
+	int i;
+
+	mem = mmap(NULL, total_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+	TEST_ASSERT(mem != MAP_FAILED, "mmap() of guest memory should pass.");
+
+	memset(mem, 0xaa, total_size);
+	for (i = 0; i < total_size; i++)
+		TEST_ASSERT_EQ(mem[i], 0xaa);
+
+	ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE, 0,
+			page_size);
+	TEST_ASSERT(!ret, "fallocate the first page should succeed");
+
+	for (i = 0; i < page_size; i++)
+		TEST_ASSERT_EQ(mem[i], 0x00);
+	for (; i < total_size; i++)
+		TEST_ASSERT_EQ(mem[i], 0xaa);
+
+	memset(mem, 0xaa, total_size);
+	for (i = 0; i < total_size; i++)
+		TEST_ASSERT_EQ(mem[i], 0xaa);
+
+	ret = munmap(mem, total_size);
+	TEST_ASSERT(!ret, "munmap should succeed");
+}
+
+static void test_mmap_denied(int fd, size_t total_size)
+{
+	size_t page_size = getpagesize();
 	char *mem;
 
 	mem = mmap(NULL, page_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
 	TEST_ASSERT_EQ(mem, MAP_FAILED);
+
+	mem = mmap(NULL, total_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+	TEST_ASSERT_EQ(mem, MAP_FAILED);
+}
+
+static void test_mmap(int fd, size_t total_size)
+{
+	if (kvm_has_cap(KVM_CAP_GUEST_MEMFD_MAPPABLE))
+		test_mmap_allowed(fd, total_size);
+	else
+		test_mmap_denied(fd, total_size);
 }
 
 static void test_file_size(int fd, size_t page_size, size_t total_size)
@@ -172,13 +215,17 @@ static void test_create_guest_memfd_multiple(struct kvm_vm *vm)
 
 int main(int argc, char *argv[])
 {
-	size_t page_size;
+	uint64_t flags = 0;
+	struct kvm_vm *vm;
 	size_t total_size;
+	size_t page_size;
 	int fd;
-	struct kvm_vm *vm;
 
 	TEST_REQUIRE(kvm_has_cap(KVM_CAP_GUEST_MEMFD));
 
+	if (kvm_has_cap(KVM_CAP_GUEST_MEMFD_MAPPABLE))
+		flags |= GUEST_MEMFD_FLAG_INIT_MAPPABLE;
+
 	page_size = getpagesize();
 	total_size = page_size * 4;
 
@@ -187,10 +234,10 @@ int main(int argc, char *argv[])
 	test_create_guest_memfd_invalid(vm);
 	test_create_guest_memfd_multiple(vm);
 
-	fd = vm_create_guest_memfd(vm, total_size, 0);
+	fd = vm_create_guest_memfd(vm, total_size, flags);
 
 	test_file_read_write(fd);
-	test_mmap(fd, page_size);
+	test_mmap(fd, total_size);
 	test_file_size(fd, page_size, total_size);
 	test_fallocate(fd, page_size, total_size);
 	test_invalid_punch_hole(fd, page_size, total_size);
-- 
2.47.0.rc0.187.ge670bccf7e-goog




* [PATCH v3 09/11] KVM: arm64: Skip VMA checks for slots without userspace address
  2024-10-10  8:59 [PATCH v3 00/11] KVM: Restricted mapping of guest_memfd at the host and arm64 support Fuad Tabba
                   ` (7 preceding siblings ...)
  2024-10-10  8:59 ` [PATCH v3 08/11] KVM: guest_memfd: selftests: guest_memfd mmap() test when mapping is allowed Fuad Tabba
@ 2024-10-10  8:59 ` Fuad Tabba
  2024-10-10  8:59 ` [PATCH v3 10/11] KVM: arm64: Handle guest_memfd()-backed guest page faults Fuad Tabba
  2024-10-10  8:59 ` [PATCH v3 11/11] KVM: arm64: Enable guest_memfd private memory when pKVM is enabled Fuad Tabba
  10 siblings, 0 replies; 28+ messages in thread
From: Fuad Tabba @ 2024-10-10  8:59 UTC (permalink / raw)
  To: kvm, linux-arm-msm, linux-mm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail, david,
	michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat, shuah,
	hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton, tabba

Memory slots backed by guest memory might be created with no
intention of being mapped by the host. These are recognized by
not having a userspace address in the memory slot.

VMA checks are neither possible nor necessary for this kind of
slot, so skip them.

Signed-off-by: Fuad Tabba <tabba@google.com>
---
 arch/arm64/kvm/mmu.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index a509b63bd4dd..71ceea661701 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -987,6 +987,10 @@ static void stage2_unmap_memslot(struct kvm *kvm,
 	phys_addr_t size = PAGE_SIZE * memslot->npages;
 	hva_t reg_end = hva + size;
 
+	/* Host will not map this private memory without a userspace address. */
+	if (kvm_slot_can_be_private(memslot) && !hva)
+		return;
+
 	/*
 	 * A memory region could potentially cover multiple VMAs, and any holes
 	 * between them, so iterate over all of them to find out if we should
@@ -2126,6 +2130,10 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
 	hva = new->userspace_addr;
 	reg_end = hva + (new->npages << PAGE_SHIFT);
 
+	/* Host will not map this private memory without a userspace address. */
+	if (kvm_slot_can_be_private(new) && !hva)
+		return 0;
+
 	mmap_read_lock(current->mm);
 	/*
 	 * A memory region could potentially cover multiple VMAs, and any holes
-- 
2.47.0.rc0.187.ge670bccf7e-goog



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v3 10/11] KVM: arm64: Handle guest_memfd()-backed guest page faults
  2024-10-10  8:59 [PATCH v3 00/11] KVM: Restricted mapping of guest_memfd at the host and arm64 support Fuad Tabba
                   ` (8 preceding siblings ...)
  2024-10-10  8:59 ` [PATCH v3 09/11] KVM: arm64: Skip VMA checks for slots without userspace address Fuad Tabba
@ 2024-10-10  8:59 ` Fuad Tabba
  2024-10-10  8:59 ` [PATCH v3 11/11] KVM: arm64: Enable guest_memfd private memory when pKVM is enabled Fuad Tabba
  10 siblings, 0 replies; 28+ messages in thread
From: Fuad Tabba @ 2024-10-10  8:59 UTC (permalink / raw)
  To: kvm, linux-arm-msm, linux-mm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail, david,
	michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat, shuah,
	hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton, tabba

Add arm64 support for resolving guest page faults on
guest_memfd()-backed memslots. This support is not contingent on
pKVM or any other confidential computing support, and works in
both VHE and nVHE modes.

Without confidential computing, this support is useful for
testing and debugging. In the future, it might also be useful
should a user want to use guest_memfd() for all code, whether
it's for a protected guest or not.

For now, the fault granule is restricted to PAGE_SIZE.

Signed-off-by: Fuad Tabba <tabba@google.com>
---
 arch/arm64/kvm/mmu.c | 112 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 110 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 71ceea661701..250c59f0ca5b 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1422,6 +1422,108 @@ static bool kvm_vma_mte_allowed(struct vm_area_struct *vma)
 	return vma->vm_flags & VM_MTE_ALLOWED;
 }
 
+static int guest_memfd_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
+			     struct kvm_memory_slot *memslot, bool fault_is_perm)
+{
+	struct kvm_mmu_memory_cache *memcache = &vcpu->arch.mmu_page_cache;
+	bool exec_fault = kvm_vcpu_trap_is_exec_fault(vcpu);
+	bool logging_active = memslot_is_logging(memslot);
+	struct kvm_pgtable *pgt = vcpu->arch.hw_mmu->pgt;
+	enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
+	bool write_fault = kvm_is_write_fault(vcpu);
+	struct mm_struct *mm = current->mm;
+	gfn_t gfn = gpa_to_gfn(fault_ipa);
+	struct kvm *kvm = vcpu->kvm;
+	struct page *page;
+	kvm_pfn_t pfn;
+	int ret;
+
+	/* For now, guest_memfd() only supports PAGE_SIZE granules. */
+	if (WARN_ON_ONCE(fault_is_perm &&
+			 kvm_vcpu_trap_get_perm_fault_granule(vcpu) != PAGE_SIZE)) {
+		return -EFAULT;
+	}
+
+	VM_BUG_ON(write_fault && exec_fault);
+
+	if (fault_is_perm && !write_fault && !exec_fault) {
+		kvm_err("Unexpected L2 read permission error\n");
+		return -EFAULT;
+	}
+
+	/*
+	 * Permission faults just need to update the existing leaf entry,
+	 * and so normally don't require allocations from the memcache. The
+	 * only exception to this is when dirty logging is enabled at runtime
+	 * and a write fault needs to collapse a block entry into a table.
+	 */
+	if (!fault_is_perm || (logging_active && write_fault)) {
+		ret = kvm_mmu_topup_memory_cache(memcache,
+						 kvm_mmu_cache_min_pages(vcpu->arch.hw_mmu));
+		if (ret)
+			return ret;
+	}
+
+	/*
+	 * Holds the folio lock until mapped in the guest and its refcount is
+	 * stable, to avoid races with paths that check if the folio is mapped
+	 * by the host.
+	 */
+	ret = kvm_gmem_get_pfn_locked(kvm, memslot, gfn, &pfn, NULL);
+	if (ret)
+		return ret;
+
+	page = pfn_to_page(pfn);
+
+	/*
+	 * Once it's faulted in, a guest_memfd() page will stay in memory.
+	 * Therefore, count it as locked.
+	 */
+	if (!fault_is_perm) {
+		ret = account_locked_vm(mm, 1, true);
+		if (ret)
+			goto unlock_page;
+	}
+
+	read_lock(&kvm->mmu_lock);
+	if (write_fault)
+		prot |= KVM_PGTABLE_PROT_W;
+
+	if (exec_fault)
+		prot |= KVM_PGTABLE_PROT_X;
+
+	if (cpus_have_final_cap(ARM64_HAS_CACHE_DIC))
+		prot |= KVM_PGTABLE_PROT_X;
+
+	/*
+	 * Under the premise of getting a FSC_PERM fault, we just need to relax
+	 * permissions.
+	 */
+	if (fault_is_perm)
+		ret = kvm_pgtable_stage2_relax_perms(pgt, fault_ipa, prot);
+	else
+		ret = kvm_pgtable_stage2_map(pgt, fault_ipa, PAGE_SIZE,
+					__pfn_to_phys(pfn), prot,
+					memcache,
+					KVM_PGTABLE_WALK_HANDLE_FAULT |
+					KVM_PGTABLE_WALK_SHARED);
+
+	/* Mark the page dirty only if the fault is handled successfully */
+	if (write_fault && !ret) {
+		kvm_set_pfn_dirty(pfn);
+		mark_page_dirty_in_slot(kvm, memslot, gfn);
+	}
+	read_unlock(&kvm->mmu_lock);
+
+	if (ret && !fault_is_perm)
+		account_locked_vm(mm, 1, false);
+unlock_page:
+	put_page(page);
+	unlock_page(page);
+
+	return ret != -EAGAIN ? ret : 0;
+}
+
 static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 			  struct kvm_s2_trans *nested,
 			  struct kvm_memory_slot *memslot, unsigned long hva,
@@ -1893,8 +1995,14 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
 		goto out_unlock;
 	}
 
-	ret = user_mem_abort(vcpu, fault_ipa, nested, memslot, hva,
-			     esr_fsc_is_permission_fault(esr));
+	if (kvm_slot_can_be_private(memslot)) {
+		ret = guest_memfd_abort(vcpu, fault_ipa, memslot,
+					esr_fsc_is_permission_fault(esr));
+	} else {
+		ret = user_mem_abort(vcpu, fault_ipa, nested, memslot, hva,
+				     esr_fsc_is_permission_fault(esr));
+	}
+
 	if (ret == 0)
 		ret = 1;
 out:
-- 
2.47.0.rc0.187.ge670bccf7e-goog



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v3 11/11] KVM: arm64: Enable guest_memfd private memory when pKVM is enabled
  2024-10-10  8:59 [PATCH v3 00/11] KVM: Restricted mapping of guest_memfd at the host and arm64 support Fuad Tabba
                   ` (9 preceding siblings ...)
  2024-10-10  8:59 ` [PATCH v3 10/11] KVM: arm64: Handle guest_memfd()-backed guest page faults Fuad Tabba
@ 2024-10-10  8:59 ` Fuad Tabba
  10 siblings, 0 replies; 28+ messages in thread
From: Fuad Tabba @ 2024-10-10  8:59 UTC (permalink / raw)
  To: kvm, linux-arm-msm, linux-mm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail, david,
	michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat, shuah,
	hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton, tabba

Implement kvm_arch_has_private_mem() for arm64 so that it returns
true when pKVM is enabled, and make it dependent on the
KVM_PRIVATE_MEM configuration option.

Also, now that the infrastructure is in place for arm64 to
support guest private memory, enable it in the arm64 kernel
configuration by selecting KVM_GMEM_MAPPABLE.

Signed-off-by: Fuad Tabba <tabba@google.com>
---
 arch/arm64/include/asm/kvm_host.h | 3 +++
 arch/arm64/kvm/Kconfig            | 1 +
 2 files changed, 4 insertions(+)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 94cff508874b..eec32e537097 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -1496,4 +1496,7 @@ void kvm_set_vm_id_reg(struct kvm *kvm, u32 reg, u64 val);
 	(system_supports_fpmr() &&			\
 	 kvm_has_feat((k), ID_AA64PFR2_EL1, FPMR, IMP))
 
+#define kvm_arch_has_private_mem(kvm)					\
+	(IS_ENABLED(CONFIG_KVM_PRIVATE_MEM) && is_protected_kvm_enabled())
+
 #endif /* __ARM64_KVM_HOST_H__ */
diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
index ead632ad01b4..fe3451f244b5 100644
--- a/arch/arm64/kvm/Kconfig
+++ b/arch/arm64/kvm/Kconfig
@@ -38,6 +38,7 @@ menuconfig KVM
 	select HAVE_KVM_VCPU_RUN_PID_CHANGE
 	select SCHED_INFO
 	select GUEST_PERF_EVENTS if PERF_EVENTS
+	select KVM_GMEM_MAPPABLE
 	help
 	  Support hosting virtualized guest machines.
 
-- 
2.47.0.rc0.187.ge670bccf7e-goog



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 04/11] KVM: guest_memfd: Allow host to mmap guest_memfd() pages when shared
  2024-10-10  8:59 ` [PATCH v3 04/11] KVM: guest_memfd: Allow host to mmap guest_memfd() pages when shared Fuad Tabba
@ 2024-10-10 10:14   ` Kirill A. Shutemov
  2024-10-10 10:23     ` Fuad Tabba
  2024-10-14 16:52   ` Elliot Berman
  1 sibling, 1 reply; 28+ messages in thread
From: Kirill A. Shutemov @ 2024-10-10 10:14 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe, anup,
	paul.walmsley, palmer, aou, seanjc, viro, brauner, willy, akpm,
	xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
	yu.c.zhang, isaku.yamahata, mic, vbabka, vannapurve, ackerleytng,
	mail, david, michael.roth, wei.w.wang, liam.merwick,
	isaku.yamahata, kirill.shutemov, suzuki.poulose, steven.price,
	quic_eberman, quic_mnalajal, quic_tsoni, quic_svaddagi,
	quic_cvanscha, quic_pderrin, quic_pheragu, catalin.marinas,
	james.morse, yuzenghui, oliver.upton, maz, will, qperret, keirf,
	roypat, shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd,
	jthoughton

On Thu, Oct 10, 2024 at 09:59:23AM +0100, Fuad Tabba wrote:
> +out:
> +	if (ret != VM_FAULT_LOCKED) {
> +		folio_put(folio);
> +		folio_unlock(folio);

Hm. Here and in few other places you return reference before unlocking.

I think it is safe because nobody can (or can they?) remove the page from
pagecache while the page is locked so we have at least one refcount on the
folio, but it *looks* like a use-after-free bug.

Please follow the usual pattern: _unlock() then _put().
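
For reference, the conventional ordering for that exit path would look
something like this (a minimal sketch based on the hunk quoted above,
not the exact code from the patch):

out:
	if (ret != VM_FAULT_LOCKED) {
		folio_unlock(folio);
		folio_put(folio);
	}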

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 04/11] KVM: guest_memfd: Allow host to mmap guest_memfd() pages when shared
  2024-10-10 10:14   ` Kirill A. Shutemov
@ 2024-10-10 10:23     ` Fuad Tabba
  2024-10-10 12:03       ` Jason Gunthorpe
  2024-10-10 12:20       ` Kirill A. Shutemov
  0 siblings, 2 replies; 28+ messages in thread
From: Fuad Tabba @ 2024-10-10 10:23 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe, anup,
	paul.walmsley, palmer, aou, seanjc, viro, brauner, willy, akpm,
	xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
	yu.c.zhang, isaku.yamahata, mic, vbabka, vannapurve, ackerleytng,
	mail, david, michael.roth, wei.w.wang, liam.merwick,
	isaku.yamahata, kirill.shutemov, suzuki.poulose, steven.price,
	quic_eberman, quic_mnalajal, quic_tsoni, quic_svaddagi,
	quic_cvanscha, quic_pderrin, quic_pheragu, catalin.marinas,
	james.morse, yuzenghui, oliver.upton, maz, will, qperret, keirf,
	roypat, shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd,
	jthoughton

Hi Kirill,

On Thu, 10 Oct 2024 at 11:14, Kirill A. Shutemov <kirill@shutemov.name> wrote:
>
> On Thu, Oct 10, 2024 at 09:59:23AM +0100, Fuad Tabba wrote:
> > +out:
> > +     if (ret != VM_FAULT_LOCKED) {
> > +             folio_put(folio);
> > +             folio_unlock(folio);
>
> Hm. Here and in few other places you return reference before unlocking.
>
> I think it is safe because nobody can (or can they?) remove the page from
> pagecache while the page is locked so we have at least one refcount on the
> folio, but it *looks* like a use-after-free bug.
>
> Please follow the usual pattern: _unlock() then _put().

That is deliberate, since these patches rely on the refcount to check
whether the host has any mappings, and the folio lock in order not to
race. It's not that it's not safe to decrement the refcount after
unlocking, but by doing that I cannot rely on the folio lock to ensure
that there aren't any races between the code added to check whether a
folio is mappable, and the code that checks whether the refcount is
safe. It's a tiny window, but it's there.

What do you think?

Thanks,
/fuad

> --
>   Kiryl Shutsemau / Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 04/11] KVM: guest_memfd: Allow host to mmap guest_memfd() pages when shared
  2024-10-10 10:23     ` Fuad Tabba
@ 2024-10-10 12:03       ` Jason Gunthorpe
  2024-10-10 14:27         ` Fuad Tabba
  2024-10-10 12:20       ` Kirill A. Shutemov
  1 sibling, 1 reply; 28+ messages in thread
From: Jason Gunthorpe @ 2024-10-10 12:03 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: Kirill A. Shutemov, kvm, linux-arm-msm, linux-mm, pbonzini,
	chenhuacai, mpe, anup, paul.walmsley, palmer, aou, seanjc, viro,
	brauner, willy, akpm, xiaoyao.li, yilun.xu, chao.p.peng, jarkko,
	amoorthy, dmatlack, yu.c.zhang, isaku.yamahata, mic, vbabka,
	vannapurve, ackerleytng, mail, david, michael.roth, wei.w.wang,
	liam.merwick, isaku.yamahata, kirill.shutemov, suzuki.poulose,
	steven.price, quic_eberman, quic_mnalajal, quic_tsoni,
	quic_svaddagi, quic_cvanscha, quic_pderrin, quic_pheragu,
	catalin.marinas, james.morse, yuzenghui, oliver.upton, maz, will,
	qperret, keirf, roypat, shuah, hch, rientjes, jhubbard, fvdl,
	hughd, jthoughton

On Thu, Oct 10, 2024 at 11:23:55AM +0100, Fuad Tabba wrote:
> Hi Kirill,
> 
> On Thu, 10 Oct 2024 at 11:14, Kirill A. Shutemov <kirill@shutemov.name> wrote:
> >
> > On Thu, Oct 10, 2024 at 09:59:23AM +0100, Fuad Tabba wrote:
> > > +out:
> > > +     if (ret != VM_FAULT_LOCKED) {
> > > +             folio_put(folio);
> > > +             folio_unlock(folio);
> >
> > Hm. Here and in few other places you return reference before unlocking.
> >
> > I think it is safe because nobody can (or can they?) remove the page from
> > pagecache while the page is locked so we have at least one refcount on the
> > folio, but it *looks* like a use-after-free bug.
> >
> > Please follow the usual pattern: _unlock() then _put().
> 
> That is deliberate, since these patches rely on the refcount to check
> whether the host has any mappings, and the folio lock in order not to
> race. It's not that it's not safe to decrement the refcount after
> unlocking, but by doing that I cannot rely on the folio lock to ensure
> that there aren't any races between the code added to check whether a
> folio is mappable, and the code that checks whether the refcount is
> safe. It's a tiny window, but it's there.

That seems very suspicious as the folio lock does not protect the
refcount, and we have things like speculative refcount increments in
GUP.

When we talked at LPC the notion was you could just check if the
refcount was 1 without sleeping or waiting, and somehow deal with !1
cases. Which also means you shouldn't need a lock around the refcount.
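
Roughly (purely illustrative; expected_refs is just a placeholder for
whatever baseline of references the caller itself holds, it is not a
variable from this series):

	/* expected_refs: placeholder for the caller's known baseline */
	if (folio_ref_count(folio) > expected_refs)
		return -EAGAIN;	/* fail fast instead of sleeping/waiting */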

Jason


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 04/11] KVM: guest_memfd: Allow host to mmap guest_memfd() pages when shared
  2024-10-10 10:23     ` Fuad Tabba
  2024-10-10 12:03       ` Jason Gunthorpe
@ 2024-10-10 12:20       ` Kirill A. Shutemov
  2024-10-10 14:28         ` Fuad Tabba
  1 sibling, 1 reply; 28+ messages in thread
From: Kirill A. Shutemov @ 2024-10-10 12:20 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe, anup,
	paul.walmsley, palmer, aou, seanjc, viro, brauner, willy, akpm,
	xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
	yu.c.zhang, isaku.yamahata, mic, vbabka, vannapurve, ackerleytng,
	mail, david, michael.roth, wei.w.wang, liam.merwick,
	isaku.yamahata, kirill.shutemov, suzuki.poulose, steven.price,
	quic_eberman, quic_mnalajal, quic_tsoni, quic_svaddagi,
	quic_cvanscha, quic_pderrin, quic_pheragu, catalin.marinas,
	james.morse, yuzenghui, oliver.upton, maz, will, qperret, keirf,
	roypat, shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd,
	jthoughton

On Thu, Oct 10, 2024 at 11:23:55AM +0100, Fuad Tabba wrote:
> Hi Kirill,
> 
> On Thu, 10 Oct 2024 at 11:14, Kirill A. Shutemov <kirill@shutemov.name> wrote:
> >
> > On Thu, Oct 10, 2024 at 09:59:23AM +0100, Fuad Tabba wrote:
> > > +out:
> > > +     if (ret != VM_FAULT_LOCKED) {
> > > +             folio_put(folio);
> > > +             folio_unlock(folio);
> >
> > Hm. Here and in few other places you return reference before unlocking.
> >
> > I think it is safe because nobody can (or can they?) remove the page from
> > pagecache while the page is locked so we have at least one refcount on the
> > folio, but it *looks* like a use-after-free bug.
> >
> > Please follow the usual pattern: _unlock() then _put().
> 
> That is deliberate, since these patches rely on the refcount to check
> whether the host has any mappings, and the folio lock in order not to
> race. It's not that it's not safe to decrement the refcount after
> > unlocking, but by doing that I cannot rely on the folio lock to ensure
> that there aren't any races between the code added to check whether a
> folio is mappable, and the code that checks whether the refcount is
> safe. It's a tiny window, but it's there.
> 
> What do you think?

I don't think your scheme is race-free either. gmem_clear_mappable() is
going to fail with -EPERM if there's any transient pin on the page. For
instance from any physical memory scanner.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 04/11] KVM: guest_memfd: Allow host to mmap guest_memfd() pages when shared
  2024-10-10 12:03       ` Jason Gunthorpe
@ 2024-10-10 14:27         ` Fuad Tabba
  0 siblings, 0 replies; 28+ messages in thread
From: Fuad Tabba @ 2024-10-10 14:27 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Kirill A. Shutemov, kvm, linux-arm-msm, linux-mm, pbonzini,
	chenhuacai, mpe, anup, paul.walmsley, palmer, aou, seanjc, viro,
	brauner, willy, akpm, xiaoyao.li, yilun.xu, chao.p.peng, jarkko,
	amoorthy, dmatlack, yu.c.zhang, isaku.yamahata, mic, vbabka,
	vannapurve, ackerleytng, mail, david, michael.roth, wei.w.wang,
	liam.merwick, isaku.yamahata, kirill.shutemov, suzuki.poulose,
	steven.price, quic_eberman, quic_mnalajal, quic_tsoni,
	quic_svaddagi, quic_cvanscha, quic_pderrin, quic_pheragu,
	catalin.marinas, james.morse, yuzenghui, oliver.upton, maz, will,
	qperret, keirf, roypat, shuah, hch, rientjes, jhubbard, fvdl,
	hughd, jthoughton

Hi Jason,

On Thu, 10 Oct 2024 at 13:04, Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Thu, Oct 10, 2024 at 11:23:55AM +0100, Fuad Tabba wrote:
> > Hi Kirill,
> >
> > On Thu, 10 Oct 2024 at 11:14, Kirill A. Shutemov <kirill@shutemov.name> wrote:
> > >
> > > On Thu, Oct 10, 2024 at 09:59:23AM +0100, Fuad Tabba wrote:
> > > > +out:
> > > > +     if (ret != VM_FAULT_LOCKED) {
> > > > +             folio_put(folio);
> > > > +             folio_unlock(folio);
> > >
> > > Hm. Here and in few other places you return reference before unlocking.
> > >
> > > I think it is safe because nobody can (or can they?) remove the page from
> > > pagecache while the page is locked so we have at least one refcount on the
> > > folio, but it *looks* like a use-after-free bug.
> > >
> > > Please follow the usual pattern: _unlock() then _put().
> >
> > That is deliberate, since these patches rely on the refcount to check
> > whether the host has any mappings, and the folio lock in order not to
> > race. It's not that it's not safe to decrement the refcount after
> > > unlocking, but by doing that I cannot rely on the folio lock to ensure
> > that there aren't any races between the code added to check whether a
> > folio is mappable, and the code that checks whether the refcount is
> > safe. It's a tiny window, but it's there.
>
> That seems very suspicious as the folio lock does not protect the
> refcount, and we have things like speculative refcount increments in
> GUP.
>
> When we talked at LPC the notion was you could just check if the
> refcount was 1 without sleeping or waiting, and somehow deal with !1
> cases. Which also means you shouldn't need a lock around the refcount.

The idea of the lock isn't to protect the refcount, which I know isn't
protected by the lock. It is to protect against races with the path
(added in this patch series) that checks whether the host is allowed
to map a certain page/folio. But as Kirill pointed out, there seem to
be other issues there, which I'll cover in more detail in my reply to
him.

Thank you,
/fuad


> Jason


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 04/11] KVM: guest_memfd: Allow host to mmap guest_memfd() pages when shared
  2024-10-10 12:20       ` Kirill A. Shutemov
@ 2024-10-10 14:28         ` Fuad Tabba
  2024-10-10 14:36           ` Kirill A. Shutemov
  2024-10-10 14:37           ` Jason Gunthorpe
  0 siblings, 2 replies; 28+ messages in thread
From: Fuad Tabba @ 2024-10-10 14:28 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe, anup,
	paul.walmsley, palmer, aou, seanjc, viro, brauner, willy, akpm,
	xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
	yu.c.zhang, isaku.yamahata, mic, vbabka, vannapurve, ackerleytng,
	mail, david, michael.roth, wei.w.wang, liam.merwick,
	isaku.yamahata, kirill.shutemov, suzuki.poulose, steven.price,
	quic_eberman, quic_mnalajal, quic_tsoni, quic_svaddagi,
	quic_cvanscha, quic_pderrin, quic_pheragu, catalin.marinas,
	james.morse, yuzenghui, oliver.upton, maz, will, qperret, keirf,
	roypat, shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd,
	jthoughton

On Thu, 10 Oct 2024 at 13:21, Kirill A. Shutemov <kirill@shutemov.name> wrote:
>
> On Thu, Oct 10, 2024 at 11:23:55AM +0100, Fuad Tabba wrote:
> > Hi Kirill,
> >
> > On Thu, 10 Oct 2024 at 11:14, Kirill A. Shutemov <kirill@shutemov.name> wrote:
> > >
> > > On Thu, Oct 10, 2024 at 09:59:23AM +0100, Fuad Tabba wrote:
> > > > +out:
> > > > +     if (ret != VM_FAULT_LOCKED) {
> > > > +             folio_put(folio);
> > > > +             folio_unlock(folio);
> > >
> > > Hm. Here and in few other places you return reference before unlocking.
> > >
> > > I think it is safe because nobody can (or can they?) remove the page from
> > > pagecache while the page is locked so we have at least one refcount on the
> > > folio, but it *looks* like a use-after-free bug.
> > >
> > > Please follow the usual pattern: _unlock() then _put().
> >
> > That is deliberate, since these patches rely on the refcount to check
> > whether the host has any mappings, and the folio lock in order not to
> > race. It's not that it's not safe to decrement the refcount after
> > > unlocking, but by doing that I cannot rely on the folio lock to ensure
> > that there aren't any races between the code added to check whether a
> > folio is mappable, and the code that checks whether the refcount is
> > safe. It's a tiny window, but it's there.
> >
> > What do you think?
>
> I don't think your scheme is race-free either. gmem_clear_mappable() is
> going to fail with -EPERM if there's any transient pin on the page. For
> instance from any physical memory scanner.

I remember discussing this before. One question that I have is, is it
possible to get a transient pin while the folio lock is held, or would
that have happened before taking the lock?

Thanks,
/fuad

> --
>   Kiryl Shutsemau / Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 04/11] KVM: guest_memfd: Allow host to mmap guest_memfd() pages when shared
  2024-10-10 14:28         ` Fuad Tabba
@ 2024-10-10 14:36           ` Kirill A. Shutemov
  2024-10-10 14:37           ` Jason Gunthorpe
  1 sibling, 0 replies; 28+ messages in thread
From: Kirill A. Shutemov @ 2024-10-10 14:36 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe, anup,
	paul.walmsley, palmer, aou, seanjc, viro, brauner, willy, akpm,
	xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
	yu.c.zhang, isaku.yamahata, mic, vbabka, vannapurve, ackerleytng,
	mail, david, michael.roth, wei.w.wang, liam.merwick,
	isaku.yamahata, kirill.shutemov, suzuki.poulose, steven.price,
	quic_eberman, quic_mnalajal, quic_tsoni, quic_svaddagi,
	quic_cvanscha, quic_pderrin, quic_pheragu, catalin.marinas,
	james.morse, yuzenghui, oliver.upton, maz, will, qperret, keirf,
	roypat, shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd,
	jthoughton

On Thu, Oct 10, 2024 at 03:28:38PM +0100, Fuad Tabba wrote:
> On Thu, 10 Oct 2024 at 13:21, Kirill A. Shutemov <kirill@shutemov.name> wrote:
> >
> > On Thu, Oct 10, 2024 at 11:23:55AM +0100, Fuad Tabba wrote:
> > > Hi Kirill,
> > >
> > > On Thu, 10 Oct 2024 at 11:14, Kirill A. Shutemov <kirill@shutemov.name> wrote:
> > > >
> > > > On Thu, Oct 10, 2024 at 09:59:23AM +0100, Fuad Tabba wrote:
> > > > > +out:
> > > > > +     if (ret != VM_FAULT_LOCKED) {
> > > > > +             folio_put(folio);
> > > > > +             folio_unlock(folio);
> > > >
> > > > Hm. Here and in few other places you return reference before unlocking.
> > > >
> > > > I think it is safe because nobody can (or can they?) remove the page from
> > > > pagecache while the page is locked so we have at least one refcount on the
> > > > folio, but it *looks* like a use-after-free bug.
> > > >
> > > > Please follow the usual pattern: _unlock() then _put().
> > >
> > > That is deliberate, since these patches rely on the refcount to check
> > > whether the host has any mappings, and the folio lock in order not to
> > > race. It's not that it's not safe to decrement the refcount after
> > > > unlocking, but by doing that I cannot rely on the folio lock to ensure
> > > that there aren't any races between the code added to check whether a
> > > folio is mappable, and the code that checks whether the refcount is
> > > safe. It's a tiny window, but it's there.
> > >
> > > What do you think?
> >
> > I don't think your scheme is race-free either. gmem_clear_mappable() is
> > going to fail with -EPERM if there's any transient pin on the page. For
> > instance from any physical memory scanner.
> 
> I remember discussing this before. One question that I have is, is it
> possible to get a transient pin while the folio lock is held, or would
> that have happened before taking the lock?

Yes.

The normal pattern is to get the pin on the page before attempting to lock
it.

In the case of physical scanners, it happens like this:

1. pfn_to_page()/pfn_folio()
2. get_page_unless_zero()/folio_get_nontail_page()
3. lock_page()/folio_lock() if needed
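
Put differently, a rough sketch of that sequence (not lifted from any
particular scanner, and with error handling omitted):

	struct page *page = pfn_to_page(pfn);

	if (!get_page_unless_zero(page))	/* transient pin, no lock held yet */
		return;
	lock_page(page);			/* the lock is only taken after the pin */
	/* ... inspect the page ... */
	unlock_page(page);
	put_page(page);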

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 04/11] KVM: guest_memfd: Allow host to mmap guest_memfd() pages when shared
  2024-10-10 14:28         ` Fuad Tabba
  2024-10-10 14:36           ` Kirill A. Shutemov
@ 2024-10-10 14:37           ` Jason Gunthorpe
  1 sibling, 0 replies; 28+ messages in thread
From: Jason Gunthorpe @ 2024-10-10 14:37 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: Kirill A. Shutemov, kvm, linux-arm-msm, linux-mm, pbonzini,
	chenhuacai, mpe, anup, paul.walmsley, palmer, aou, seanjc, viro,
	brauner, willy, akpm, xiaoyao.li, yilun.xu, chao.p.peng, jarkko,
	amoorthy, dmatlack, yu.c.zhang, isaku.yamahata, mic, vbabka,
	vannapurve, ackerleytng, mail, david, michael.roth, wei.w.wang,
	liam.merwick, isaku.yamahata, kirill.shutemov, suzuki.poulose,
	steven.price, quic_eberman, quic_mnalajal, quic_tsoni,
	quic_svaddagi, quic_cvanscha, quic_pderrin, quic_pheragu,
	catalin.marinas, james.morse, yuzenghui, oliver.upton, maz, will,
	qperret, keirf, roypat, shuah, hch, rientjes, jhubbard, fvdl,
	hughd, jthoughton

On Thu, Oct 10, 2024 at 03:28:38PM +0100, Fuad Tabba wrote:

> I remember discussing this before. One question that I have is, is it
> possible to get a transient pin while the folio lock is held, 

No, you can't lock a folio until you refcount it. That is a basic
requirement for any transient pin.

Jason


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 01/11] KVM: guest_memfd: Make guest mem use guest mem inodes instead of anonymous inodes
  2024-10-10  8:59 ` [PATCH v3 01/11] KVM: guest_memfd: Make guest mem use guest mem inodes instead of anonymous inodes Fuad Tabba
@ 2024-10-12  6:12   ` kernel test robot
  0 siblings, 0 replies; 28+ messages in thread
From: kernel test robot @ 2024-10-12  6:12 UTC (permalink / raw)
  To: Fuad Tabba, kvm, linux-arm-msm, linux-mm
  Cc: llvm, oe-kbuild-all, pbonzini, chenhuacai, mpe, anup,
	paul.walmsley, palmer, aou, seanjc, viro, brauner, willy, akpm,
	xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
	yu.c.zhang, isaku.yamahata, mic, vbabka, vannapurve, ackerleytng,
	mail, david, michael.roth

Hi Fuad,

kernel test robot noticed the following build errors:

[auto build test ERROR on 8cf0b93919e13d1e8d4466eb4080a4c4d9d66d7b]

url:    https://github.com/intel-lab-lkp/linux/commits/Fuad-Tabba/KVM-guest_memfd-Make-guest-mem-use-guest-mem-inodes-instead-of-anonymous-inodes/20241010-170821
base:   8cf0b93919e13d1e8d4466eb4080a4c4d9d66d7b
patch link:    https://lore.kernel.org/r/20241010085930.1546800-2-tabba%40google.com
patch subject: [PATCH v3 01/11] KVM: guest_memfd: Make guest mem use guest mem inodes instead of anonymous inodes
config: x86_64-rhel-8.3-rust (https://download.01.org/0day-ci/archive/20241012/202410121337.0ETimfvJ-lkp@intel.com/config)
compiler: clang version 18.1.8 (https://github.com/llvm/llvm-project 3b5b5c1ec4a3095ab096dd780e84d7ab81f3d7ff)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241012/202410121337.0ETimfvJ-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202410121337.0ETimfvJ-lkp@intel.com/

All errors (new ones prefixed by >>, old ones prefixed by <<):

>> ERROR: modpost: "security_inode_init_security_anon" [arch/x86/kvm/kvm.ko] undefined!

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 04/11] KVM: guest_memfd: Allow host to mmap guest_memfd() pages when shared
  2024-10-10  8:59 ` [PATCH v3 04/11] KVM: guest_memfd: Allow host to mmap guest_memfd() pages when shared Fuad Tabba
  2024-10-10 10:14   ` Kirill A. Shutemov
@ 2024-10-14 16:52   ` Elliot Berman
  2024-10-15 10:27     ` Fuad Tabba
  1 sibling, 1 reply; 28+ messages in thread
From: Elliot Berman @ 2024-10-14 16:52 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe, anup,
	paul.walmsley, palmer, aou, seanjc, viro, brauner, willy, akpm,
	xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
	yu.c.zhang, isaku.yamahata, mic, vbabka, vannapurve, ackerleytng,
	mail, david, michael.roth, wei.w.wang, liam.merwick,
	isaku.yamahata, kirill.shutemov, suzuki.poulose, steven.price,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat, shuah,
	hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton

On Thu, Oct 10, 2024 at 09:59:23AM +0100, Fuad Tabba wrote:
> Add support for mmap() and fault() for guest_memfd in the host.
> The ability to fault in a guest page is contingent on that page
> being shared with the host.
> 
> The guest_memfd PRIVATE memory attribute is not used for two
> reasons. First because it reflects the userspace expectation for
> that memory location, and therefore can be toggled by userspace.
> The second is, although each guest_memfd file has a 1:1 binding
> with a KVM instance, the plan is to allow multiple files per
> inode, e.g. to allow intra-host migration to a new KVM instance,
> without destroying guest_memfd.
> 
> The mapping is restricted to only memory explicitly shared with
> the host. KVM checks that the host doesn't have any mappings for
> private memory via the folio's refcount. To avoid races between
> paths that check mappability and paths that check whether the
> host has any mappings (via the refcount), the folio lock is held
> while either check is being performed.
> 
> This new feature is gated with a new configuration option,
> CONFIG_KVM_GMEM_MAPPABLE.
> 
> Co-developed-by: Ackerley Tng <ackerleytng@google.com>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> Co-developed-by: Elliot Berman <quic_eberman@quicinc.com>
> Signed-off-by: Elliot Berman <quic_eberman@quicinc.com>
> Signed-off-by: Fuad Tabba <tabba@google.com>
> 
> ---
> 
> Note that the functions kvm_gmem_is_mapped(),
> kvm_gmem_set_mappable(), and kvm_gmem_clear_mappable() are
> not used in this patch series. They are intended to be used in
> future patches [*], which check and toggle mappability when the
> guest shares/unshares pages with the host.
> 
> [*] https://android-kvm.googlesource.com/linux/+/refs/heads/tabba/guestmem-6.12-v3-pkvm
> 
> ---
>  include/linux/kvm_host.h |  52 +++++++++++
>  virt/kvm/Kconfig         |   4 +
>  virt/kvm/guest_memfd.c   | 185 +++++++++++++++++++++++++++++++++++++++
>  virt/kvm/kvm_main.c      | 138 +++++++++++++++++++++++++++++
>  4 files changed, 379 insertions(+)
> 
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index acf85995b582..bda7fda9945e 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2527,4 +2527,56 @@ long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
>  				    struct kvm_pre_fault_memory *range);
>  #endif
>  
> +#ifdef CONFIG_KVM_GMEM_MAPPABLE
> +bool kvm_gmem_is_mappable(struct kvm *kvm, gfn_t gfn, gfn_t end);
> +bool kvm_gmem_is_mapped(struct kvm *kvm, gfn_t start, gfn_t end);
> +int kvm_gmem_set_mappable(struct kvm *kvm, gfn_t start, gfn_t end);
> +int kvm_gmem_clear_mappable(struct kvm *kvm, gfn_t start, gfn_t end);
> +int kvm_slot_gmem_set_mappable(struct kvm_memory_slot *slot, gfn_t start,
> +			       gfn_t end);
> +int kvm_slot_gmem_clear_mappable(struct kvm_memory_slot *slot, gfn_t start,
> +				 gfn_t end);
> +bool kvm_slot_gmem_is_mappable(struct kvm_memory_slot *slot, gfn_t gfn);
> +#else
> +static inline bool kvm_gmem_is_mappable(struct kvm *kvm, gfn_t gfn, gfn_t end)
> +{
> +	WARN_ON_ONCE(1);
> +	return false;
> +}
> +static inline bool kvm_gmem_is_mapped(struct kvm *kvm, gfn_t start, gfn_t end)
> +{
> +	WARN_ON_ONCE(1);
> +	return false;
> +}
> +static inline int kvm_gmem_set_mappable(struct kvm *kvm, gfn_t start, gfn_t end)
> +{
> +	WARN_ON_ONCE(1);
> +	return -EINVAL;
> +}
> +static inline int kvm_gmem_clear_mappable(struct kvm *kvm, gfn_t start,
> +					  gfn_t end)
> +{
> +	WARN_ON_ONCE(1);
> +	return -EINVAL;
> +}
> +static inline int kvm_slot_gmem_set_mappable(struct kvm_memory_slot *slot,
> +					     gfn_t start, gfn_t end)
> +{
> +	WARN_ON_ONCE(1);
> +	return -EINVAL;
> +}
> +static inline int kvm_slot_gmem_clear_mappable(struct kvm_memory_slot *slot,
> +					       gfn_t start, gfn_t end)
> +{
> +	WARN_ON_ONCE(1);
> +	return -EINVAL;
> +}
> +static inline bool kvm_slot_gmem_is_mappable(struct kvm_memory_slot *slot,
> +					     gfn_t gfn)
> +{
> +	WARN_ON_ONCE(1);
> +	return false;
> +}
> +#endif /* CONFIG_KVM_GMEM_MAPPABLE */
> +
>  #endif
> diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> index fd6a3010afa8..2cfcb0848e37 100644
> --- a/virt/kvm/Kconfig
> +++ b/virt/kvm/Kconfig
> @@ -120,3 +120,7 @@ config HAVE_KVM_ARCH_GMEM_PREPARE
>  config HAVE_KVM_ARCH_GMEM_INVALIDATE
>         bool
>         depends on KVM_PRIVATE_MEM
> +
> +config KVM_GMEM_MAPPABLE
> +       select KVM_PRIVATE_MEM
> +       bool
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index f414646c475b..df3a6f05a16e 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -370,7 +370,184 @@ static void kvm_gmem_init_mount(void)
>  	kvm_gmem_mnt->mnt_flags |= MNT_NOEXEC;
>  }
>  
> +#ifdef CONFIG_KVM_GMEM_MAPPABLE
> +static struct folio *
> +__kvm_gmem_get_pfn(struct file *file, struct kvm_memory_slot *slot,
> +		   gfn_t gfn, kvm_pfn_t *pfn, bool *is_prepared,
> +		   int *max_order);
> +
> +static int gmem_set_mappable(struct inode *inode, pgoff_t start, pgoff_t end)
> +{
> +	struct xarray *mappable_offsets = &kvm_gmem_private(inode)->mappable_offsets;
> +	void *xval = xa_mk_value(true);
> +	pgoff_t i;
> +	bool r;
> +
> +	filemap_invalidate_lock(inode->i_mapping);
> +	for (i = start; i < end; i++) {
> +		r = xa_err(xa_store(mappable_offsets, i, xval, GFP_KERNEL));

I think it might not be strictly necessary,

> +		if (r)
> +			break;
> +	}
> +	filemap_invalidate_unlock(inode->i_mapping);
> +
> +	return r;
> +}
> +
> +static int gmem_clear_mappable(struct inode *inode, pgoff_t start, pgoff_t end)
> +{
> +	struct xarray *mappable_offsets = &kvm_gmem_private(inode)->mappable_offsets;
> +	pgoff_t i;
> +	int r = 0;
> +
> +	filemap_invalidate_lock(inode->i_mapping);
> +	for (i = start; i < end; i++) {
> +		struct folio *folio;
> +
> +		/*
> +		 * Holds the folio lock until after checking its refcount,
> +		 * to avoid races with paths that fault in the folio.
> +		 */
> +		folio = kvm_gmem_get_folio(inode, i);

We don't need to allocate the folio here. I think we can use

		folio = filemap_lock_folio(inode, i);
		if (!folio || WARN_ON_ONCE(IS_ERR(folio)))
			continue;

> +		if (WARN_ON_ONCE(IS_ERR(folio)))
> +			continue;
> +
> +		/*
> +		 * Check that the host doesn't have any mappings on clearing
> +		 * the mappable flag, because clearing the flag implies that the
> +		 * memory will be unshared from the host. Therefore, to maintain
> +		 * the invariant that the host cannot access private memory, we
> +		 * need to check that it doesn't have any mappings to that
> +		 * memory before making it private.
> +		 *
> +		 * Two references are expected because of kvm_gmem_get_folio().
> +		 */
> +		if (folio_ref_count(folio) > 2)

If we'd like to be prepared for large folios, it should be
folio_nr_pages(folio) + 1. 
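
i.e. roughly, the check would become:

	if (folio_ref_count(folio) > folio_nr_pages(folio) + 1)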

> +			r = -EPERM;
> +		else
> +			xa_erase(mappable_offsets, i);
> +
> +		folio_put(folio);
> +		folio_unlock(folio);
> +
> +		if (r)
> +			break;
> +	}
> +	filemap_invalidate_unlock(inode->i_mapping);
> +
> +	return r;
> +}
> +
> +static bool gmem_is_mappable(struct inode *inode, pgoff_t pgoff)
> +{
> +	struct xarray *mappable_offsets = &kvm_gmem_private(inode)->mappable_offsets;
> +	bool r;
> +
> +	filemap_invalidate_lock_shared(inode->i_mapping);
> +	r = xa_find(mappable_offsets, &pgoff, pgoff, XA_PRESENT);
> +	filemap_invalidate_unlock_shared(inode->i_mapping);
> +
> +	return r;
> +}
> +
> +int kvm_slot_gmem_set_mappable(struct kvm_memory_slot *slot, gfn_t start, gfn_t end)
> +{
> +	struct inode *inode = file_inode(slot->gmem.file);
> +	pgoff_t start_off = slot->gmem.pgoff + start - slot->base_gfn;
> +	pgoff_t end_off = start_off + end - start;
> +
> +	return gmem_set_mappable(inode, start_off, end_off);
> +}
> +
> +int kvm_slot_gmem_clear_mappable(struct kvm_memory_slot *slot, gfn_t start, gfn_t end)
> +{
> +	struct inode *inode = file_inode(slot->gmem.file);
> +	pgoff_t start_off = slot->gmem.pgoff + start - slot->base_gfn;
> +	pgoff_t end_off = start_off + end - start;
> +
> +	return gmem_clear_mappable(inode, start_off, end_off);
> +}
> +
> +bool kvm_slot_gmem_is_mappable(struct kvm_memory_slot *slot, gfn_t gfn)
> +{
> +	struct inode *inode = file_inode(slot->gmem.file);
> +	unsigned long pgoff = slot->gmem.pgoff + gfn - slot->base_gfn;
> +
> +	return gmem_is_mappable(inode, pgoff);
> +}
> +
> +static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
> +{
> +	struct inode *inode = file_inode(vmf->vma->vm_file);
> +	struct folio *folio;
> +	vm_fault_t ret = VM_FAULT_LOCKED;
> +
> +	/*
> +	 * Holds the folio lock until after checking whether it can be faulted
> +	 * in, to avoid races with paths that change a folio's mappability.
> +	 */
> +	folio = kvm_gmem_get_folio(inode, vmf->pgoff);
> +	if (!folio)
> +		return VM_FAULT_SIGBUS;
> +
> +	if (folio_test_hwpoison(folio)) {
> +		ret = VM_FAULT_HWPOISON;
> +		goto out;
> +	}
> +
> +	if (!gmem_is_mappable(inode, vmf->pgoff)) {
> +		ret = VM_FAULT_SIGBUS;
> +		goto out;
> +	}
> +
> +	if (!folio_test_uptodate(folio)) {
> +		unsigned long nr_pages = folio_nr_pages(folio);
> +		unsigned long i;
> +
> +		for (i = 0; i < nr_pages; i++)
> +			clear_highpage(folio_page(folio, i));
> +
> +		folio_mark_uptodate(folio);
> +	}
> +
> +	vmf->page = folio_file_page(folio, vmf->pgoff);
> +out:
> +	if (ret != VM_FAULT_LOCKED) {
> +		folio_put(folio);
> +		folio_unlock(folio);
> +	}
> +
> +	return ret;
> +}
> +
> +static const struct vm_operations_struct kvm_gmem_vm_ops = {
> +	.fault = kvm_gmem_fault,
> +};
> +
> +static int kvm_gmem_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> +	if ((vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) !=
> +	    (VM_SHARED | VM_MAYSHARE)) {
> +		return -EINVAL;
> +	}
> +
> +	file_accessed(file);
> +	vm_flags_set(vma, VM_DONTDUMP);
> +	vma->vm_ops = &kvm_gmem_vm_ops;
> +
> +	return 0;
> +}
> +#else
> +static int gmem_set_mappable(struct inode *inode, pgoff_t start, pgoff_t end)
> +{
> +	WARN_ON_ONCE(1);
> +	return -EINVAL;
> +}
> +#define kvm_gmem_mmap NULL
> +#endif /* CONFIG_KVM_GMEM_MAPPABLE */
> +
>  static struct file_operations kvm_gmem_fops = {
> +	.mmap		= kvm_gmem_mmap,
>  	.open		= generic_file_open,
>  	.release	= kvm_gmem_release,
>  	.fallocate	= kvm_gmem_fallocate,
> @@ -557,6 +734,14 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
>  		goto err_gmem;
>  	}
>  
> +	if (IS_ENABLED(CONFIG_KVM_GMEM_MAPPABLE)) {
> +		err = gmem_set_mappable(file_inode(file), 0, size >> PAGE_SHIFT);
> +		if (err) {
> +			fput(file);
> +			goto err_gmem;
> +		}
> +	}
> +
>  	kvm_get_kvm(kvm);
>  	gmem->kvm = kvm;
>  	xa_init(&gmem->bindings);
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 05cbb2548d99..aed9cf2f1685 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -3263,6 +3263,144 @@ static int next_segment(unsigned long len, int offset)
>  		return len;
>  }
>  
> +#ifdef CONFIG_KVM_GMEM_MAPPABLE
> +static bool __kvm_gmem_is_mappable(struct kvm *kvm, gfn_t start, gfn_t end)
> +{
> +	struct kvm_memslot_iter iter;
> +
> +	lockdep_assert_held(&kvm->slots_lock);
> +
> +	kvm_for_each_memslot_in_gfn_range(&iter, kvm_memslots(kvm), start, end) {
> +		struct kvm_memory_slot *memslot = iter.slot;
> +		gfn_t gfn_start, gfn_end, i;
> +
> +		gfn_start = max(start, memslot->base_gfn);
> +		gfn_end = min(end, memslot->base_gfn + memslot->npages);
> +		if (WARN_ON_ONCE(gfn_start >= gfn_end))
> +			continue;
> +
> +		for (i = gfn_start; i < gfn_end; i++) {
> +			if (!kvm_slot_gmem_is_mappable(memslot, i))
> +				return false;
> +		}
> +	}
> +
> +	return true;
> +}
> +
> +bool kvm_gmem_is_mappable(struct kvm *kvm, gfn_t start, gfn_t end)
> +{
> +	bool r;
> +
> +	mutex_lock(&kvm->slots_lock);
> +	r = __kvm_gmem_is_mappable(kvm, start, end);
> +	mutex_unlock(&kvm->slots_lock);
> +
> +	return r;
> +}
> +
> +static bool kvm_gmem_is_pfn_mapped(struct kvm *kvm, struct kvm_memory_slot *memslot, gfn_t gfn_idx)
> +{
> +	struct page *page;
> +	bool is_mapped;
> +	kvm_pfn_t pfn;
> +
> +	/*
> +	 * Holds the folio lock until after checking its refcount,
> +	 * to avoid races with paths that fault in the folio.
> +	 */
> +	if (WARN_ON_ONCE(kvm_gmem_get_pfn_locked(kvm, memslot, gfn_idx, &pfn, NULL)))
> +		return false;
> +
> +	page = pfn_to_page(pfn);
> +
> +	/* Two references are expected because of kvm_gmem_get_pfn_locked(). */
> +	is_mapped = page_ref_count(page) > 2;
> +
> +	put_page(page);
> +	unlock_page(page);
> +
> +	return is_mapped;
> +}
> +
> +static bool __kvm_gmem_is_mapped(struct kvm *kvm, gfn_t start, gfn_t end)
> +{
> +	struct kvm_memslot_iter iter;
> +
> +	lockdep_assert_held(&kvm->slots_lock);
> +
> +	kvm_for_each_memslot_in_gfn_range(&iter, kvm_memslots(kvm), start, end) {
> +		struct kvm_memory_slot *memslot = iter.slot;
> +		gfn_t gfn_start, gfn_end, i;
> +
> +		gfn_start = max(start, memslot->base_gfn);
> +		gfn_end = min(end, memslot->base_gfn + memslot->npages);
> +		if (WARN_ON_ONCE(gfn_start >= gfn_end))
> +			continue;
> +
> +		for (i = gfn_start; i < gfn_end; i++) {
> +			if (kvm_gmem_is_pfn_mapped(kvm, memslot, i))
> +				return true;
> +		}
> +	}
> +
> +	return false;
> +}
> +
> +bool kvm_gmem_is_mapped(struct kvm *kvm, gfn_t start, gfn_t end)
> +{
> +	bool r;
> +
> +	mutex_lock(&kvm->slots_lock);
> +	r = __kvm_gmem_is_mapped(kvm, start, end);
> +	mutex_unlock(&kvm->slots_lock);
> +
> +	return r;
> +}
> +
> +static int kvm_gmem_toggle_mappable(struct kvm *kvm, gfn_t start, gfn_t end,
> +				    bool is_mappable)
> +{
> +	struct kvm_memslot_iter iter;
> +	int r = 0;
> +
> +	mutex_lock(&kvm->slots_lock);
> +
> +	kvm_for_each_memslot_in_gfn_range(&iter, kvm_memslots(kvm), start, end) {
> +		struct kvm_memory_slot *memslot = iter.slot;
> +		gfn_t gfn_start, gfn_end;
> +
> +		gfn_start = max(start, memslot->base_gfn);
> +		gfn_end = min(end, memslot->base_gfn + memslot->npages);
> +		if (WARN_ON_ONCE(start >= end))
> +			continue;
> +
> +		if (is_mappable)
> +			r = kvm_slot_gmem_set_mappable(memslot, gfn_start, gfn_end);
> +		else
> +			r = kvm_slot_gmem_clear_mappable(memslot, gfn_start, gfn_end);
> +
> +		if (WARN_ON_ONCE(r))
> +			break;
> +	}
> +
> +	mutex_unlock(&kvm->slots_lock);
> +
> +	return r;
> +}
> +
> +int kvm_gmem_set_mappable(struct kvm *kvm, gfn_t start, gfn_t end)
> +{
> +	return kvm_gmem_toggle_mappable(kvm, start, end, true);
> +}
> +
> +int kvm_gmem_clear_mappable(struct kvm *kvm, gfn_t start, gfn_t end)
> +{
> +	return kvm_gmem_toggle_mappable(kvm, start, end, false);
> +}
> +
> +#endif /* CONFIG_KVM_GMEM_MAPPABLE */
> +
>  /* Copy @len bytes from guest memory at '(@gfn * PAGE_SIZE) + @offset' to @data */
>  static int __kvm_read_guest_page(struct kvm_memory_slot *slot, gfn_t gfn,
>  				 void *data, int offset, int len)
> -- 
> 2.47.0.rc0.187.ge670bccf7e-goog
> 


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 04/11] KVM: guest_memfd: Allow host to mmap guest_memfd() pages when shared
  2024-10-14 16:52   ` Elliot Berman
@ 2024-10-15 10:27     ` Fuad Tabba
  2024-10-16 16:53       ` Elliot Berman
  0 siblings, 1 reply; 28+ messages in thread
From: Fuad Tabba @ 2024-10-15 10:27 UTC (permalink / raw)
  To: Elliot Berman
  Cc: kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe, anup,
	paul.walmsley, palmer, aou, seanjc, viro, brauner, willy, akpm,
	xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
	yu.c.zhang, isaku.yamahata, mic, vbabka, vannapurve, ackerleytng,
	mail, david, michael.roth, wei.w.wang, liam.merwick,
	isaku.yamahata, kirill.shutemov, suzuki.poulose, steven.price,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat, shuah,
	hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton

Hi Elliot,

On Mon, 14 Oct 2024 at 17:53, Elliot Berman <quic_eberman@quicinc.com> wrote:
>
> On Thu, Oct 10, 2024 at 09:59:23AM +0100, Fuad Tabba wrote:
> > Add support for mmap() and fault() for guest_memfd in the host.
> > The ability to fault in a guest page is contingent on that page
> > being shared with the host.
> >
> > The guest_memfd PRIVATE memory attribute is not used for two
> > reasons. First because it reflects the userspace expectation for
> > that memory location, and therefore can be toggled by userspace.
> > The second is, although each guest_memfd file has a 1:1 binding
> > with a KVM instance, the plan is to allow multiple files per
> > inode, e.g. to allow intra-host migration to a new KVM instance,
> > without destroying guest_memfd.
> >
> > The mapping is restricted to only memory explicitly shared with
> > the host. KVM checks that the host doesn't have any mappings for
> > private memory via the folio's refcount. To avoid races between
> > paths that check mappability and paths that check whether the
> > host has any mappings (via the refcount), the folio lock is held
> > while either check is being performed.
> >
> > This new feature is gated with a new configuration option,
> > CONFIG_KVM_GMEM_MAPPABLE.
> >
> > Co-developed-by: Ackerley Tng <ackerleytng@google.com>
> > Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> > Co-developed-by: Elliot Berman <quic_eberman@quicinc.com>
> > Signed-off-by: Elliot Berman <quic_eberman@quicinc.com>
> > Signed-off-by: Fuad Tabba <tabba@google.com>
> >
> > ---
> >
> > Note that the functions kvm_gmem_is_mapped(),
> > kvm_gmem_set_mappable(), and kvm_gmem_clear_mappable() are
> > not used in this patch series. They are intended to be used in
> > future patches [*], which check and toggle mappability when the
> > guest shares/unshares pages with the host.
> >
> > [*] https://android-kvm.googlesource.com/linux/+/refs/heads/tabba/guestmem-6.12-v3-pkvm
> >
> > ---
> >  include/linux/kvm_host.h |  52 +++++++++++
> >  virt/kvm/Kconfig         |   4 +
> >  virt/kvm/guest_memfd.c   | 185 +++++++++++++++++++++++++++++++++++++++
> >  virt/kvm/kvm_main.c      | 138 +++++++++++++++++++++++++++++
> >  4 files changed, 379 insertions(+)
> >
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index acf85995b582..bda7fda9945e 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -2527,4 +2527,56 @@ long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
> >                                   struct kvm_pre_fault_memory *range);
> >  #endif
> >
> > +#ifdef CONFIG_KVM_GMEM_MAPPABLE
> > +bool kvm_gmem_is_mappable(struct kvm *kvm, gfn_t gfn, gfn_t end);
> > +bool kvm_gmem_is_mapped(struct kvm *kvm, gfn_t start, gfn_t end);
> > +int kvm_gmem_set_mappable(struct kvm *kvm, gfn_t start, gfn_t end);
> > +int kvm_gmem_clear_mappable(struct kvm *kvm, gfn_t start, gfn_t end);
> > +int kvm_slot_gmem_set_mappable(struct kvm_memory_slot *slot, gfn_t start,
> > +                            gfn_t end);
> > +int kvm_slot_gmem_clear_mappable(struct kvm_memory_slot *slot, gfn_t start,
> > +                              gfn_t end);
> > +bool kvm_slot_gmem_is_mappable(struct kvm_memory_slot *slot, gfn_t gfn);
> > +#else
> > +static inline bool kvm_gmem_is_mappable(struct kvm *kvm, gfn_t gfn, gfn_t end)
> > +{
> > +     WARN_ON_ONCE(1);
> > +     return false;
> > +}
> > +static inline bool kvm_gmem_is_mapped(struct kvm *kvm, gfn_t start, gfn_t end)
> > +{
> > +     WARN_ON_ONCE(1);
> > +     return false;
> > +}
> > +static inline int kvm_gmem_set_mappable(struct kvm *kvm, gfn_t start, gfn_t end)
> > +{
> > +     WARN_ON_ONCE(1);
> > +     return -EINVAL;
> > +}
> > +static inline int kvm_gmem_clear_mappable(struct kvm *kvm, gfn_t start,
> > +                                       gfn_t end)
> > +{
> > +     WARN_ON_ONCE(1);
> > +     return -EINVAL;
> > +}
> > +static inline int kvm_slot_gmem_set_mappable(struct kvm_memory_slot *slot,
> > +                                          gfn_t start, gfn_t end)
> > +{
> > +     WARN_ON_ONCE(1);
> > +     return -EINVAL;
> > +}
> > +static inline int kvm_slot_gmem_clear_mappable(struct kvm_memory_slot *slot,
> > +                                            gfn_t start, gfn_t end)
> > +{
> > +     WARN_ON_ONCE(1);
> > +     return -EINVAL;
> > +}
> > +static inline bool kvm_slot_gmem_is_mappable(struct kvm_memory_slot *slot,
> > +                                          gfn_t gfn)
> > +{
> > +     WARN_ON_ONCE(1);
> > +     return false;
> > +}
> > +#endif /* CONFIG_KVM_GMEM_MAPPABLE */
> > +
> >  #endif
> > diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> > index fd6a3010afa8..2cfcb0848e37 100644
> > --- a/virt/kvm/Kconfig
> > +++ b/virt/kvm/Kconfig
> > @@ -120,3 +120,7 @@ config HAVE_KVM_ARCH_GMEM_PREPARE
> >  config HAVE_KVM_ARCH_GMEM_INVALIDATE
> >         bool
> >         depends on KVM_PRIVATE_MEM
> > +
> > +config KVM_GMEM_MAPPABLE
> > +       select KVM_PRIVATE_MEM
> > +       bool
> > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> > index f414646c475b..df3a6f05a16e 100644
> > --- a/virt/kvm/guest_memfd.c
> > +++ b/virt/kvm/guest_memfd.c
> > @@ -370,7 +370,184 @@ static void kvm_gmem_init_mount(void)
> >       kvm_gmem_mnt->mnt_flags |= MNT_NOEXEC;
> >  }
> >
> > +#ifdef CONFIG_KVM_GMEM_MAPPABLE
> > +static struct folio *
> > +__kvm_gmem_get_pfn(struct file *file, struct kvm_memory_slot *slot,
> > +                gfn_t gfn, kvm_pfn_t *pfn, bool *is_prepared,
> > +                int *max_order);
> > +
> > +static int gmem_set_mappable(struct inode *inode, pgoff_t start, pgoff_t end)
> > +{
> > +     struct xarray *mappable_offsets = &kvm_gmem_private(inode)->mappable_offsets;
> > +     void *xval = xa_mk_value(true);
> > +     pgoff_t i;
> > +     bool r;
> > +
> > +     filemap_invalidate_lock(inode->i_mapping);
> > +     for (i = start; i < end; i++) {
> > +             r = xa_err(xa_store(mappable_offsets, i, xval, GFP_KERNEL));
>
> I think it might not be strictly necessary,

Sorry, but I don't quite get what isn't strictly necessary. Is it the
checking for an error?

> > +             if (r)
> > +                     break;
> > +     }
> > +     filemap_invalidate_unlock(inode->i_mapping);
> > +
> > +     return r;
> > +}
> > +
> > +static int gmem_clear_mappable(struct inode *inode, pgoff_t start, pgoff_t end)
> > +{
> > +     struct xarray *mappable_offsets = &kvm_gmem_private(inode)->mappable_offsets;
> > +     pgoff_t i;
> > +     int r = 0;
> > +
> > +     filemap_invalidate_lock(inode->i_mapping);
> > +     for (i = start; i < end; i++) {
> > +             struct folio *folio;
> > +
> > +             /*
> > +              * Holds the folio lock until after checking its refcount,
> > +              * to avoid races with paths that fault in the folio.
> > +              */
> > +             folio = kvm_gmem_get_folio(inode, i);
>
> We don't need to allocate the folio here. I think we can use
>
>                 folio = filemap_lock_folio(inode, i);
>                 if (!folio || WARN_ON_ONCE(IS_ERR(folio)))
>                         continue;

Good point (it takes an inode->i_mapping though).

>                 folio = filemap_lock_folio(inode->i_mapping, i);


> > +             if (WARN_ON_ONCE(IS_ERR(folio)))
> > +                     continue;
> > +
> > +             /*
> > +              * Check that the host doesn't have any mappings on clearing
> > +              * the mappable flag, because clearing the flag implies that the
> > +              * memory will be unshared from the host. Therefore, to maintain
> > +              * the invariant that the host cannot access private memory, we
> > +              * need to check that it doesn't have any mappings to that
> > +              * memory before making it private.
> > +              *
> > +              * Two references are expected because of kvm_gmem_get_folio().
> > +              */
> > +             if (folio_ref_count(folio) > 2)
>
> If we'd like to be prepared for large folios, it should be
> folio_nr_pages(folio) + 1.

Will do that.

Thanks!
/fuad



> > +                     r = -EPERM;
> > +             else
> > +                     xa_erase(mappable_offsets, i);
> > +
> > +             folio_put(folio);
> > +             folio_unlock(folio);
> > +
> > +             if (r)
> > +                     break;
> > +     }
> > +     filemap_invalidate_unlock(inode->i_mapping);
> > +
> > +     return r;
> > +}
> > +
> > +static bool gmem_is_mappable(struct inode *inode, pgoff_t pgoff)
> > +{
> > +     struct xarray *mappable_offsets = &kvm_gmem_private(inode)->mappable_offsets;
> > +     bool r;
> > +
> > +     filemap_invalidate_lock_shared(inode->i_mapping);
> > +     r = xa_find(mappable_offsets, &pgoff, pgoff, XA_PRESENT);
> > +     filemap_invalidate_unlock_shared(inode->i_mapping);
> > +
> > +     return r;
> > +}
> > +
> > +int kvm_slot_gmem_set_mappable(struct kvm_memory_slot *slot, gfn_t start, gfn_t end)
> > +{
> > +     struct inode *inode = file_inode(slot->gmem.file);
> > +     pgoff_t start_off = slot->gmem.pgoff + start - slot->base_gfn;
> > +     pgoff_t end_off = start_off + end - start;
> > +
> > +     return gmem_set_mappable(inode, start_off, end_off);
> > +}
> > +
> > +int kvm_slot_gmem_clear_mappable(struct kvm_memory_slot *slot, gfn_t start, gfn_t end)
> > +{
> > +     struct inode *inode = file_inode(slot->gmem.file);
> > +     pgoff_t start_off = slot->gmem.pgoff + start - slot->base_gfn;
> > +     pgoff_t end_off = start_off + end - start;
> > +
> > +     return gmem_clear_mappable(inode, start_off, end_off);
> > +}
> > +
> > +bool kvm_slot_gmem_is_mappable(struct kvm_memory_slot *slot, gfn_t gfn)
> > +{
> > +     struct inode *inode = file_inode(slot->gmem.file);
> > +     unsigned long pgoff = slot->gmem.pgoff + gfn - slot->base_gfn;
> > +
> > +     return gmem_is_mappable(inode, pgoff);
> > +}
> > +
> > +static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
> > +{
> > +     struct inode *inode = file_inode(vmf->vma->vm_file);
> > +     struct folio *folio;
> > +     vm_fault_t ret = VM_FAULT_LOCKED;
> > +
> > +     /*
> > +      * Holds the folio lock until after checking whether it can be faulted
> > +      * in, to avoid races with paths that change a folio's mappability.
> > +      */
> > +     folio = kvm_gmem_get_folio(inode, vmf->pgoff);
> > +     if (!folio)
> > +             return VM_FAULT_SIGBUS;
> > +
> > +     if (folio_test_hwpoison(folio)) {
> > +             ret = VM_FAULT_HWPOISON;
> > +             goto out;
> > +     }
> > +
> > +     if (!gmem_is_mappable(inode, vmf->pgoff)) {
> > +             ret = VM_FAULT_SIGBUS;
> > +             goto out;
> > +     }
> > +
> > +     if (!folio_test_uptodate(folio)) {
> > +             unsigned long nr_pages = folio_nr_pages(folio);
> > +             unsigned long i;
> > +
> > +             for (i = 0; i < nr_pages; i++)
> > +                     clear_highpage(folio_page(folio, i));
> > +
> > +             folio_mark_uptodate(folio);
> > +     }
> > +
> > +     vmf->page = folio_file_page(folio, vmf->pgoff);
> > +out:
> > +     if (ret != VM_FAULT_LOCKED) {
> > +             folio_put(folio);
> > +             folio_unlock(folio);
> > +     }
> > +
> > +     return ret;
> > +}
> > +
> > +static const struct vm_operations_struct kvm_gmem_vm_ops = {
> > +     .fault = kvm_gmem_fault,
> > +};
> > +
> > +static int kvm_gmem_mmap(struct file *file, struct vm_area_struct *vma)
> > +{
> > +     if ((vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) !=
> > +         (VM_SHARED | VM_MAYSHARE)) {
> > +             return -EINVAL;
> > +     }
> > +
> > +     file_accessed(file);
> > +     vm_flags_set(vma, VM_DONTDUMP);
> > +     vma->vm_ops = &kvm_gmem_vm_ops;
> > +
> > +     return 0;
> > +}
> > +#else
> > +static int gmem_set_mappable(struct inode *inode, pgoff_t start, pgoff_t end)
> > +{
> > +     WARN_ON_ONCE(1);
> > +     return -EINVAL;
> > +}
> > +#define kvm_gmem_mmap NULL
> > +#endif /* CONFIG_KVM_GMEM_MAPPABLE */
> > +
> >  static struct file_operations kvm_gmem_fops = {
> > +     .mmap           = kvm_gmem_mmap,
> >       .open           = generic_file_open,
> >       .release        = kvm_gmem_release,
> >       .fallocate      = kvm_gmem_fallocate,
> > @@ -557,6 +734,14 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
> >               goto err_gmem;
> >       }
> >
> > +     if (IS_ENABLED(CONFIG_KVM_GMEM_MAPPABLE)) {
> > +             err = gmem_set_mappable(file_inode(file), 0, size >> PAGE_SHIFT);
> > +             if (err) {
> > +                     fput(file);
> > +                     goto err_gmem;
> > +             }
> > +     }
> > +
> >       kvm_get_kvm(kvm);
> >       gmem->kvm = kvm;
> >       xa_init(&gmem->bindings);
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 05cbb2548d99..aed9cf2f1685 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -3263,6 +3263,144 @@ static int next_segment(unsigned long len, int offset)
> >               return len;
> >  }
> >
> > +#ifdef CONFIG_KVM_GMEM_MAPPABLE
> > +static bool __kvm_gmem_is_mappable(struct kvm *kvm, gfn_t start, gfn_t end)
> > +{
> > +     struct kvm_memslot_iter iter;
> > +
> > +     lockdep_assert_held(&kvm->slots_lock);
> > +
> > +     kvm_for_each_memslot_in_gfn_range(&iter, kvm_memslots(kvm), start, end) {
> > +             struct kvm_memory_slot *memslot = iter.slot;
> > +             gfn_t gfn_start, gfn_end, i;
> > +
> > +             gfn_start = max(start, memslot->base_gfn);
> > +             gfn_end = min(end, memslot->base_gfn + memslot->npages);
> > +             if (WARN_ON_ONCE(gfn_start >= gfn_end))
> > +                     continue;
> > +
> > +             for (i = gfn_start; i < gfn_end; i++) {
> > +                     if (!kvm_slot_gmem_is_mappable(memslot, i))
> > +                             return false;
> > +             }
> > +     }
> > +
> > +     return true;
> > +}
> > +
> > +bool kvm_gmem_is_mappable(struct kvm *kvm, gfn_t start, gfn_t end)
> > +{
> > +     bool r;
> > +
> > +     mutex_lock(&kvm->slots_lock);
> > +     r = __kvm_gmem_is_mappable(kvm, start, end);
> > +     mutex_unlock(&kvm->slots_lock);
> > +
> > +     return r;
> > +}
> > +
> > +static bool kvm_gmem_is_pfn_mapped(struct kvm *kvm, struct kvm_memory_slot *memslot, gfn_t gfn_idx)
> > +{
> > +     struct page *page;
> > +     bool is_mapped;
> > +     kvm_pfn_t pfn;
> > +
> > +     /*
> > +      * Holds the folio lock until after checking its refcount,
> > +      * to avoid races with paths that fault in the folio.
> > +      */
> > +     if (WARN_ON_ONCE(kvm_gmem_get_pfn_locked(kvm, memslot, gfn_idx, &pfn, NULL)))
> > +             return false;
> > +
> > +     page = pfn_to_page(pfn);
> > +
> > +     /* Two references are expected because of kvm_gmem_get_pfn_locked(). */
> > +     is_mapped = page_ref_count(page) > 2;
> > +
> > +     put_page(page);
> > +     unlock_page(page);
> > +
> > +     return is_mapped;
> > +}
> > +
> > +static bool __kvm_gmem_is_mapped(struct kvm *kvm, gfn_t start, gfn_t end)
> > +{
> > +     struct kvm_memslot_iter iter;
> > +
> > +     lockdep_assert_held(&kvm->slots_lock);
> > +
> > +     kvm_for_each_memslot_in_gfn_range(&iter, kvm_memslots(kvm), start, end) {
> > +             struct kvm_memory_slot *memslot = iter.slot;
> > +             gfn_t gfn_start, gfn_end, i;
> > +
> > +             gfn_start = max(start, memslot->base_gfn);
> > +             gfn_end = min(end, memslot->base_gfn + memslot->npages);
> > +             if (WARN_ON_ONCE(gfn_start >= gfn_end))
> > +                     continue;
> > +
> > +             for (i = gfn_start; i < gfn_end; i++) {
> > +                     if (kvm_gmem_is_pfn_mapped(kvm, memslot, i))
> > +                             return true;
> > +             }
> > +     }
> > +
> > +     return false;
> > +}
> > +
> > +bool kvm_gmem_is_mapped(struct kvm *kvm, gfn_t start, gfn_t end)
> > +{
> > +     bool r;
> > +
> > +     mutex_lock(&kvm->slots_lock);
> > +     r = __kvm_gmem_is_mapped(kvm, start, end);
> > +     mutex_unlock(&kvm->slots_lock);
> > +
> > +     return r;
> > +}
> > +
> > +static int kvm_gmem_toggle_mappable(struct kvm *kvm, gfn_t start, gfn_t end,
> > +                                 bool is_mappable)
> > +{
> > +     struct kvm_memslot_iter iter;
> > +     int r = 0;
> > +
> > +     mutex_lock(&kvm->slots_lock);
> > +
> > +     kvm_for_each_memslot_in_gfn_range(&iter, kvm_memslots(kvm), start, end) {
> > +             struct kvm_memory_slot *memslot = iter.slot;
> > +             gfn_t gfn_start, gfn_end;
> > +
> > +             gfn_start = max(start, memslot->base_gfn);
> > +             gfn_end = min(end, memslot->base_gfn + memslot->npages);
> > +             if (WARN_ON_ONCE(start >= end))
> > +                     continue;
> > +
> > +             if (is_mappable)
> > +                     r = kvm_slot_gmem_set_mappable(memslot, gfn_start, gfn_end);
> > +             else
> > +                     r = kvm_slot_gmem_clear_mappable(memslot, gfn_start, gfn_end);
> > +
> > +             if (WARN_ON_ONCE(r))
> > +                     break;
> > +     }
> > +
> > +     mutex_unlock(&kvm->slots_lock);
> > +
> > +     return r;
> > +}
> > +
> > +int kvm_gmem_set_mappable(struct kvm *kvm, gfn_t start, gfn_t end)
> > +{
> > +     return kvm_gmem_toggle_mappable(kvm, start, end, true);
> > +}
> > +
> > +int kvm_gmem_clear_mappable(struct kvm *kvm, gfn_t start, gfn_t end)
> > +{
> > +     return kvm_gmem_toggle_mappable(kvm, start, end, false);
> > +}
> > +
> > +#endif /* CONFIG_KVM_GMEM_MAPPABLE */
> > +
> >  /* Copy @len bytes from guest memory at '(@gfn * PAGE_SIZE) + @offset' to @data */
> >  static int __kvm_read_guest_page(struct kvm_memory_slot *slot, gfn_t gfn,
> >                                void *data, int offset, int len)
> > --
> > 2.47.0.rc0.187.ge670bccf7e-goog
> >
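
For reference, a minimal sketch of how gmem_clear_mappable() might end
up looking with the two changes discussed above folded in
(filemap_lock_folio() instead of kvm_gmem_get_folio(), and a
large-folio-aware refcount check). This is a sketch rather than the
code posted in the series; in particular, clearing the flag directly
for offsets that were never faulted in is an assumption about the
intended behaviour:

static int gmem_clear_mappable(struct inode *inode, pgoff_t start, pgoff_t end)
{
	struct xarray *mappable_offsets = &kvm_gmem_private(inode)->mappable_offsets;
	pgoff_t i;
	int r = 0;

	filemap_invalidate_lock(inode->i_mapping);
	for (i = start; i < end; i++) {
		struct folio *folio;

		/*
		 * Look up the folio without allocating it. If it was never
		 * faulted in, nothing can be mapped, so the offset can be
		 * made private right away.
		 */
		folio = filemap_lock_folio(inode->i_mapping, i);
		if (IS_ERR(folio)) {
			xa_erase(mappable_offsets, i);
			continue;
		}

		/*
		 * Hold the folio lock until after checking the refcount, to
		 * avoid races with paths that fault in the folio. The page
		 * cache holds one reference per page of the folio, plus one
		 * taken by filemap_lock_folio(); anything beyond that means
		 * the host still has the memory mapped.
		 */
		if (folio_ref_count(folio) > folio_nr_pages(folio) + 1)
			r = -EPERM;
		else
			xa_erase(mappable_offsets, i);

		folio_unlock(folio);
		folio_put(folio);

		if (r)
			break;
	}
	filemap_invalidate_unlock(inode->i_mapping);

	return r;
}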



* Re: [PATCH v3 06/11] KVM: guest_memfd: Add KVM capability to check if guest_memfd is host mappable
  2024-10-10  8:59 ` [PATCH v3 06/11] KVM: guest_memfd: Add KVM capability to check if guest_memfd is host mappable Fuad Tabba
@ 2024-10-15 10:30   ` Suzuki K Poulose
  2024-10-15 10:33     ` Fuad Tabba
  0 siblings, 1 reply; 28+ messages in thread
From: Suzuki K Poulose @ 2024-10-15 10:30 UTC (permalink / raw)
  To: Fuad Tabba, kvm, linux-arm-msm, linux-mm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail, david,
	michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, steven.price, quic_eberman, quic_mnalajal,
	quic_tsoni, quic_svaddagi, quic_cvanscha, quic_pderrin,
	quic_pheragu, catalin.marinas, james.morse, yuzenghui,
	oliver.upton, maz, will, qperret, keirf, roypat, shuah, hch, jgg,
	rientjes, jhubbard, fvdl, hughd, jthoughton

Hi Fuad

On 10/10/2024 09:59, Fuad Tabba wrote:
> Add the KVM capability KVM_CAP_GUEST_MEMFD_MAPPABLE, which is
> true if mapping guest memory is supported by the host.
>
> Signed-off-by: Fuad Tabba <tabba@google.com>
> ---
>   include/uapi/linux/kvm.h | 1 +
>   virt/kvm/kvm_main.c      | 4 ++++
>   2 files changed, 5 insertions(+)
>
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 637efc055145..2c6057bab71c 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -933,6 +933,7 @@ struct kvm_enable_cap {
>   #define KVM_CAP_PRE_FAULT_MEMORY 236
>   #define KVM_CAP_X86_APIC_BUS_CYCLES_NS 237
>   #define KVM_CAP_X86_GUEST_MODE 238
> +#define KVM_CAP_GUEST_MEMFD_MAPPABLE 239
>
>   struct kvm_irq_routing_irqchip {
>       __u32 irqchip;
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 77e6412034b9..c2ff09197795 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -5176,6 +5176,10 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
>   #ifdef CONFIG_KVM_PRIVATE_MEM
>       case KVM_CAP_GUEST_MEMFD:
>               return !kvm || kvm_arch_has_private_mem(kvm);
> +#endif
> +#ifdef CONFIG_KVM_GMEM_MAPPABLE
> +     case KVM_CAP_GUEST_MEMFD_MAPPABLE:
> +             return !kvm || kvm_arch_has_private_mem(kvm);

minor nit: Keying this on whether the "kvm" instance has private mem
may not be flexible enough to support other types of CC guest that
may use guestmem, but not "mappable" memory.  e.g. CCA may not
support "mappable", unless we have a way to explicitly pass down
"you can map a shared page from the guest_memfd, but it is not
sharable in place".

We could solve it when we get there, but it might be worth
considering.

Suzuki




>   #endif
>       default:
>               break;




* Re: [PATCH v3 06/11] KVM: guest_memfd: Add KVM capability to check if guest_memfd is host mappable
  2024-10-15 10:30   ` Suzuki K Poulose
@ 2024-10-15 10:33     ` Fuad Tabba
  0 siblings, 0 replies; 28+ messages in thread
From: Fuad Tabba @ 2024-10-15 10:33 UTC (permalink / raw)
  To: Suzuki K Poulose
  Cc: kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe, anup,
	paul.walmsley, palmer, aou, seanjc, viro, brauner, willy, akpm,
	xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
	yu.c.zhang, isaku.yamahata, mic, vbabka, vannapurve, ackerleytng,
	mail, david, michael.roth, wei.w.wang, liam.merwick,
	isaku.yamahata, kirill.shutemov, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat, shuah,
	hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton

Hi Suzuki,

On Tue, 15 Oct 2024 at 11:30, Suzuki K Poulose <suzuki.poulose@arm.com> wrote:
>
> Hi Fuad
>
> On 10/10/2024 09:59, Fuad Tabba wrote:
> > Add the KVM capability KVM_CAP_GUEST_MEMFD_MAPPABLE, which is
> > true if mapping guest memory is supported by the host.
> >
> > Signed-off-by: Fuad Tabba <tabba@google.com>
> > ---
> >   include/uapi/linux/kvm.h | 1 +
> >   virt/kvm/kvm_main.c      | 4 ++++
> >   2 files changed, 5 insertions(+)
> >
> > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > index 637efc055145..2c6057bab71c 100644
> > --- a/include/uapi/linux/kvm.h
> > +++ b/include/uapi/linux/kvm.h
> > @@ -933,6 +933,7 @@ struct kvm_enable_cap {
> >   #define KVM_CAP_PRE_FAULT_MEMORY 236
> >   #define KVM_CAP_X86_APIC_BUS_CYCLES_NS 237
> >   #define KVM_CAP_X86_GUEST_MODE 238
> > +#define KVM_CAP_GUEST_MEMFD_MAPPABLE 239
> >
> >   struct kvm_irq_routing_irqchip {
> >       __u32 irqchip;
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 77e6412034b9..c2ff09197795 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -5176,6 +5176,10 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
> >   #ifdef CONFIG_KVM_PRIVATE_MEM
> >       case KVM_CAP_GUEST_MEMFD:
> >               return !kvm || kvm_arch_has_private_mem(kvm);
> > +#endif
> > +#ifdef CONFIG_KVM_GMEM_MAPPABLE
> > +     case KVM_CAP_GUEST_MEMFD_MAPPABLE:
> > +             return !kvm || kvm_arch_has_private_mem(kvm);
>
> minor nit: Keying this on whether the "kvm" instance has private mem
> may not be flexible enough to support other types of CC guest that
> may use guestmem, but not "mappable" memory.  e.g. CCA may not
> support "mappable", unless we have a way to explicitly pass down
> "you can map a shared page from the guest_memfd, but it is not
> sharable in place".
>
> We could solve it when we get there, but it might be worth
> considering.

I did consider that, but I assumed that the configuration option would
be sufficient. Otherwise, we could make it dependent on the VM type.

Cheers,
/fuad

> Suzuki
>
>
>
>
> >   #endif
> >       default:
> >               break;
>
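
For illustration only, if the capability were later keyed on the VM
type rather than on private-memory support, the check could be routed
through an arch hook along these lines. The hook name is made up for
the purpose of this sketch, not an existing KVM function:

#ifdef CONFIG_KVM_GMEM_MAPPABLE
	case KVM_CAP_GUEST_MEMFD_MAPPABLE:
		/*
		 * Hypothetical: let the architecture decide, per VM type,
		 * whether guest_memfd memory can be shared in place and
		 * therefore mapped by the host.
		 */
		return !kvm || kvm_arch_gmem_supports_shared_mem(kvm);
#endif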



* Re: [PATCH v3 04/11] KVM: guest_memfd: Allow host to mmap guest_memfd() pages when shared
  2024-10-15 10:27     ` Fuad Tabba
@ 2024-10-16 16:53       ` Elliot Berman
  0 siblings, 0 replies; 28+ messages in thread
From: Elliot Berman @ 2024-10-16 16:53 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe, anup,
	paul.walmsley, palmer, aou, seanjc, viro, brauner, willy, akpm,
	xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
	yu.c.zhang, isaku.yamahata, mic, vbabka, vannapurve, ackerleytng,
	mail, david, michael.roth, wei.w.wang, liam.merwick,
	isaku.yamahata, kirill.shutemov, suzuki.poulose, steven.price,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat, shuah,
	hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton

On Tue, Oct 15, 2024 at 11:27:48AM +0100, Fuad Tabba wrote:
> Hi Elliot,
> 
> On Mon, 14 Oct 2024 at 17:53, Elliot Berman <quic_eberman@quicinc.com> wrote:
> >
> > On Thu, Oct 10, 2024 at 09:59:23AM +0100, Fuad Tabba wrote:
> > > Add support for mmap() and fault() for guest_memfd in the host.
> > > The ability to fault in a guest page is contingent on that page
> > > being shared with the host.
> > >
> > > The guest_memfd PRIVATE memory attribute is not used for two
> > > reasons. First because it reflects the userspace expectation for
> > > that memory location, and therefore can be toggled by userspace.
> > > The second is, although each guest_memfd file has a 1:1 binding
> > > with a KVM instance, the plan is to allow multiple files per
> > > inode, e.g. to allow intra-host migration to a new KVM instance,
> > > without destroying guest_memfd.
> > >
> > > The mapping is restricted to only memory explicitly shared with
> > > the host. KVM checks that the host doesn't have any mappings for
> > > private memory via the folio's refcount. To avoid races between
> > > paths that check mappability and paths that check whether the
> > > host has any mappings (via the refcount), the folio lock is held
> > > while either check is being performed.
> > >
> > > This new feature is gated with a new configuration option,
> > > CONFIG_KVM_GMEM_MAPPABLE.
> > >
> > > Co-developed-by: Ackerley Tng <ackerleytng@google.com>
> > > Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> > > Co-developed-by: Elliot Berman <quic_eberman@quicinc.com>
> > > Signed-off-by: Elliot Berman <quic_eberman@quicinc.com>
> > > Signed-off-by: Fuad Tabba <tabba@google.com>
> > >
> > > ---
> > >
> > > Note that the functions kvm_gmem_is_mapped(),
> > > kvm_gmem_set_mappable(), and int kvm_gmem_clear_mappable() are
> > > not used in this patch series. They are intended to be used in
> > > future patches [*], which check and toggle mappability when the
> > > guest shares/unshares pages with the host.
> > >
> > > [*] https://android-kvm.googlesource.com/linux/+/refs/heads/tabba/guestmem-6.12-v3-pkvm
> > >
> > > ---
> > >  include/linux/kvm_host.h |  52 +++++++++++
> > >  virt/kvm/Kconfig         |   4 +
> > >  virt/kvm/guest_memfd.c   | 185 +++++++++++++++++++++++++++++++++++++++
> > >  virt/kvm/kvm_main.c      | 138 +++++++++++++++++++++++++++++
> > >  4 files changed, 379 insertions(+)
> > >
> > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > index acf85995b582..bda7fda9945e 100644
> > > --- a/include/linux/kvm_host.h
> > > +++ b/include/linux/kvm_host.h
> > > @@ -2527,4 +2527,56 @@ long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
> > >                                   struct kvm_pre_fault_memory *range);
> > >  #endif
> > >
> > > +#ifdef CONFIG_KVM_GMEM_MAPPABLE
> > > +bool kvm_gmem_is_mappable(struct kvm *kvm, gfn_t gfn, gfn_t end);
> > > +bool kvm_gmem_is_mapped(struct kvm *kvm, gfn_t start, gfn_t end);
> > > +int kvm_gmem_set_mappable(struct kvm *kvm, gfn_t start, gfn_t end);
> > > +int kvm_gmem_clear_mappable(struct kvm *kvm, gfn_t start, gfn_t end);
> > > +int kvm_slot_gmem_set_mappable(struct kvm_memory_slot *slot, gfn_t start,
> > > +                            gfn_t end);
> > > +int kvm_slot_gmem_clear_mappable(struct kvm_memory_slot *slot, gfn_t start,
> > > +                              gfn_t end);
> > > +bool kvm_slot_gmem_is_mappable(struct kvm_memory_slot *slot, gfn_t gfn);
> > > +#else
> > > +static inline bool kvm_gmem_is_mappable(struct kvm *kvm, gfn_t gfn, gfn_t end)
> > > +{
> > > +     WARN_ON_ONCE(1);
> > > +     return false;
> > > +}
> > > +static inline bool kvm_gmem_is_mapped(struct kvm *kvm, gfn_t start, gfn_t end)
> > > +{
> > > +     WARN_ON_ONCE(1);
> > > +     return false;
> > > +}
> > > +static inline int kvm_gmem_set_mappable(struct kvm *kvm, gfn_t start, gfn_t end)
> > > +{
> > > +     WARN_ON_ONCE(1);
> > > +     return -EINVAL;
> > > +}
> > > +static inline int kvm_gmem_clear_mappable(struct kvm *kvm, gfn_t start,
> > > +                                       gfn_t end)
> > > +{
> > > +     WARN_ON_ONCE(1);
> > > +     return -EINVAL;
> > > +}
> > > +static inline int kvm_slot_gmem_set_mappable(struct kvm_memory_slot *slot,
> > > +                                          gfn_t start, gfn_t end)
> > > +{
> > > +     WARN_ON_ONCE(1);
> > > +     return -EINVAL;
> > > +}
> > > +static inline int kvm_slot_gmem_clear_mappable(struct kvm_memory_slot *slot,
> > > +                                            gfn_t start, gfn_t end)
> > > +{
> > > +     WARN_ON_ONCE(1);
> > > +     return -EINVAL;
> > > +}
> > > +static inline bool kvm_slot_gmem_is_mappable(struct kvm_memory_slot *slot,
> > > +                                          gfn_t gfn)
> > > +{
> > > +     WARN_ON_ONCE(1);
> > > +     return false;
> > > +}
> > > +#endif /* CONFIG_KVM_GMEM_MAPPABLE */
> > > +
> > >  #endif
> > > diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> > > index fd6a3010afa8..2cfcb0848e37 100644
> > > --- a/virt/kvm/Kconfig
> > > +++ b/virt/kvm/Kconfig
> > > @@ -120,3 +120,7 @@ config HAVE_KVM_ARCH_GMEM_PREPARE
> > >  config HAVE_KVM_ARCH_GMEM_INVALIDATE
> > >         bool
> > >         depends on KVM_PRIVATE_MEM
> > > +
> > > +config KVM_GMEM_MAPPABLE
> > > +       select KVM_PRIVATE_MEM
> > > +       bool
> > > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> > > index f414646c475b..df3a6f05a16e 100644
> > > --- a/virt/kvm/guest_memfd.c
> > > +++ b/virt/kvm/guest_memfd.c
> > > @@ -370,7 +370,184 @@ static void kvm_gmem_init_mount(void)
> > >       kvm_gmem_mnt->mnt_flags |= MNT_NOEXEC;
> > >  }
> > >
> > > +#ifdef CONFIG_KVM_GMEM_MAPPABLE
> > > +static struct folio *
> > > +__kvm_gmem_get_pfn(struct file *file, struct kvm_memory_slot *slot,
> > > +                gfn_t gfn, kvm_pfn_t *pfn, bool *is_prepared,
> > > +                int *max_order);
> > > +
> > > +static int gmem_set_mappable(struct inode *inode, pgoff_t start, pgoff_t end)
> > > +{
> > > +     struct xarray *mappable_offsets = &kvm_gmem_private(inode)->mappable_offsets;
> > > +     void *xval = xa_mk_value(true);
> > > +     pgoff_t i;
> > > +     bool r;
> > > +
> > > +     filemap_invalidate_lock(inode->i_mapping);
> > > +     for (i = start; i < end; i++) {
> > > +             r = xa_err(xa_store(mappable_offsets, i, xval, GFP_KERNEL));
> >
> > I think it might not be strictly necessary,
> 
> Sorry, but I don't quite get what isn't strictly necessary. Is it the
> checking for an error?
> 


Oops, I was thinking we'd need to check the folio_ref_count when setting
the mappable attribute as well. I'd started replying, then realized the
check isn't necessary. I missed deleting the start of my comment, sorry
about that :)

> > > +             if (r)
> > > +                     break;
> > > +     }
> > > +     filemap_invalidate_unlock(inode->i_mapping);
> > > +
> > > +     return r;
> > > +}
> > > +
> > > +static int gmem_clear_mappable(struct inode *inode, pgoff_t start, pgoff_t end)
> > > +{
> > > +     struct xarray *mappable_offsets = &kvm_gmem_private(inode)->mappable_offsets;
> > > +     pgoff_t i;
> > > +     int r = 0;
> > > +
> > > +     filemap_invalidate_lock(inode->i_mapping);
> > > +     for (i = start; i < end; i++) {
> > > +             struct folio *folio;
> > > +
> > > +             /*
> > > +              * Holds the folio lock until after checking its refcount,
> > > +              * to avoid races with paths that fault in the folio.
> > > +              */
> > > +             folio = kvm_gmem_get_folio(inode, i);
> >
> > We don't need to allocate the folio here. I think we can use
> >
> >                 folio = filemap_lock_folio(inode, i);
> >                 if (!folio || WARN_ON_ONCE(IS_ERR(folio)))
> >                         continue;
> 
> Good point (it takes an inode->i_mapping though).
> 
> >                 folio = filemap_lock_folio(inode->i_mapping, i);
> 
> 
> > > +             if (WARN_ON_ONCE(IS_ERR(folio)))
> > > +                     continue;
> > > +
> > > +             /*
> > > +              * Check that the host doesn't have any mappings on clearing
> > > +              * the mappable flag, because clearing the flag implies that the
> > > +              * memory will be unshared from the host. Therefore, to maintain
> > > +              * the invariant that the host cannot access private memory, we
> > > +              * need to check that it doesn't have any mappings to that
> > > +              * memory before making it private.
> > > +              *
> > > +              * Two references are expected because of kvm_gmem_get_folio().
> > > +              */
> > > +             if (folio_ref_count(folio) > 2)
> >
> > If we'd like to be prepared for large folios, it should be
> > folio_nr_pages(folio) + 1.
> 
> Will do that.
> 
> Thanks!
> /fuad
> 
> 
> 
> > > +                     r = -EPERM;
> > > +             else
> > > +                     xa_erase(mappable_offsets, i);
> > > +
> > > +             folio_put(folio);
> > > +             folio_unlock(folio);
> > > +
> > > +             if (r)
> > > +                     break;
> > > +     }
> > > +     filemap_invalidate_unlock(inode->i_mapping);
> > > +
> > > +     return r;
> > > +}
> > > +
> > > +static bool gmem_is_mappable(struct inode *inode, pgoff_t pgoff)
> > > +{
> > > +     struct xarray *mappable_offsets = &kvm_gmem_private(inode)->mappable_offsets;
> > > +     bool r;
> > > +
> > > +     filemap_invalidate_lock_shared(inode->i_mapping);
> > > +     r = xa_find(mappable_offsets, &pgoff, pgoff, XA_PRESENT);
> > > +     filemap_invalidate_unlock_shared(inode->i_mapping);
> > > +
> > > +     return r;
> > > +}
> > > +
> > > +int kvm_slot_gmem_set_mappable(struct kvm_memory_slot *slot, gfn_t start, gfn_t end)
> > > +{
> > > +     struct inode *inode = file_inode(slot->gmem.file);
> > > +     pgoff_t start_off = slot->gmem.pgoff + start - slot->base_gfn;
> > > +     pgoff_t end_off = start_off + end - start;
> > > +
> > > +     return gmem_set_mappable(inode, start_off, end_off);
> > > +}
> > > +
> > > +int kvm_slot_gmem_clear_mappable(struct kvm_memory_slot *slot, gfn_t start, gfn_t end)
> > > +{
> > > +     struct inode *inode = file_inode(slot->gmem.file);
> > > +     pgoff_t start_off = slot->gmem.pgoff + start - slot->base_gfn;
> > > +     pgoff_t end_off = start_off + end - start;
> > > +
> > > +     return gmem_clear_mappable(inode, start_off, end_off);
> > > +}
> > > +
> > > +bool kvm_slot_gmem_is_mappable(struct kvm_memory_slot *slot, gfn_t gfn)
> > > +{
> > > +     struct inode *inode = file_inode(slot->gmem.file);
> > > +     unsigned long pgoff = slot->gmem.pgoff + gfn - slot->base_gfn;
> > > +
> > > +     return gmem_is_mappable(inode, pgoff);
> > > +}
> > > +
> > > +static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
> > > +{
> > > +     struct inode *inode = file_inode(vmf->vma->vm_file);
> > > +     struct folio *folio;
> > > +     vm_fault_t ret = VM_FAULT_LOCKED;
> > > +
> > > +     /*
> > > +      * Holds the folio lock until after checking whether it can be faulted
> > > +      * in, to avoid races with paths that change a folio's mappability.
> > > +      */
> > > +     folio = kvm_gmem_get_folio(inode, vmf->pgoff);
> > > +     if (!folio)
> > > +             return VM_FAULT_SIGBUS;
> > > +
> > > +     if (folio_test_hwpoison(folio)) {
> > > +             ret = VM_FAULT_HWPOISON;
> > > +             goto out;
> > > +     }
> > > +
> > > +     if (!gmem_is_mappable(inode, vmf->pgoff)) {
> > > +             ret = VM_FAULT_SIGBUS;
> > > +             goto out;
> > > +     }
> > > +
> > > +     if (!folio_test_uptodate(folio)) {
> > > +             unsigned long nr_pages = folio_nr_pages(folio);
> > > +             unsigned long i;
> > > +
> > > +             for (i = 0; i < nr_pages; i++)
> > > +                     clear_highpage(folio_page(folio, i));
> > > +
> > > +             folio_mark_uptodate(folio);
> > > +     }
> > > +
> > > +     vmf->page = folio_file_page(folio, vmf->pgoff);
> > > +out:
> > > +     if (ret != VM_FAULT_LOCKED) {
> > > +             folio_put(folio);
> > > +             folio_unlock(folio);
> > > +     }
> > > +
> > > +     return ret;
> > > +}
> > > +
> > > +static const struct vm_operations_struct kvm_gmem_vm_ops = {
> > > +     .fault = kvm_gmem_fault,
> > > +};
> > > +
> > > +static int kvm_gmem_mmap(struct file *file, struct vm_area_struct *vma)
> > > +{
> > > +     if ((vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) !=
> > > +         (VM_SHARED | VM_MAYSHARE)) {
> > > +             return -EINVAL;
> > > +     }
> > > +
> > > +     file_accessed(file);
> > > +     vm_flags_set(vma, VM_DONTDUMP);
> > > +     vma->vm_ops = &kvm_gmem_vm_ops;
> > > +
> > > +     return 0;
> > > +}
> > > +#else
> > > +static int gmem_set_mappable(struct inode *inode, pgoff_t start, pgoff_t end)
> > > +{
> > > +     WARN_ON_ONCE(1);
> > > +     return -EINVAL;
> > > +}
> > > +#define kvm_gmem_mmap NULL
> > > +#endif /* CONFIG_KVM_GMEM_MAPPABLE */
> > > +
> > >  static struct file_operations kvm_gmem_fops = {
> > > +     .mmap           = kvm_gmem_mmap,
> > >       .open           = generic_file_open,
> > >       .release        = kvm_gmem_release,
> > >       .fallocate      = kvm_gmem_fallocate,
> > > @@ -557,6 +734,14 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
> > >               goto err_gmem;
> > >       }
> > >
> > > +     if (IS_ENABLED(CONFIG_KVM_GMEM_MAPPABLE)) {
> > > +             err = gmem_set_mappable(file_inode(file), 0, size >> PAGE_SHIFT);
> > > +             if (err) {
> > > +                     fput(file);
> > > +                     goto err_gmem;
> > > +             }
> > > +     }
> > > +
> > >       kvm_get_kvm(kvm);
> > >       gmem->kvm = kvm;
> > >       xa_init(&gmem->bindings);
> > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > index 05cbb2548d99..aed9cf2f1685 100644
> > > --- a/virt/kvm/kvm_main.c
> > > +++ b/virt/kvm/kvm_main.c
> > > @@ -3263,6 +3263,144 @@ static int next_segment(unsigned long len, int offset)
> > >               return len;
> > >  }
> > >
> > > +#ifdef CONFIG_KVM_GMEM_MAPPABLE
> > > +static bool __kvm_gmem_is_mappable(struct kvm *kvm, gfn_t start, gfn_t end)
> > > +{
> > > +     struct kvm_memslot_iter iter;
> > > +
> > > +     lockdep_assert_held(&kvm->slots_lock);
> > > +
> > > +     kvm_for_each_memslot_in_gfn_range(&iter, kvm_memslots(kvm), start, end) {
> > > +             struct kvm_memory_slot *memslot = iter.slot;
> > > +             gfn_t gfn_start, gfn_end, i;
> > > +
> > > +             gfn_start = max(start, memslot->base_gfn);
> > > +             gfn_end = min(end, memslot->base_gfn + memslot->npages);
> > > +             if (WARN_ON_ONCE(gfn_start >= gfn_end))
> > > +                     continue;
> > > +
> > > +             for (i = gfn_start; i < gfn_end; i++) {
> > > +                     if (!kvm_slot_gmem_is_mappable(memslot, i))
> > > +                             return false;
> > > +             }
> > > +     }
> > > +
> > > +     return true;
> > > +}
> > > +
> > > +bool kvm_gmem_is_mappable(struct kvm *kvm, gfn_t start, gfn_t end)
> > > +{
> > > +     bool r;
> > > +
> > > +     mutex_lock(&kvm->slots_lock);
> > > +     r = __kvm_gmem_is_mappable(kvm, start, end);
> > > +     mutex_unlock(&kvm->slots_lock);
> > > +
> > > +     return r;
> > > +}
> > > +
> > > +static bool kvm_gmem_is_pfn_mapped(struct kvm *kvm, struct kvm_memory_slot *memslot, gfn_t gfn_idx)
> > > +{
> > > +     struct page *page;
> > > +     bool is_mapped;
> > > +     kvm_pfn_t pfn;
> > > +
> > > +     /*
> > > +      * Holds the folio lock until after checking its refcount,
> > > +      * to avoid races with paths that fault in the folio.
> > > +      */
> > > +     if (WARN_ON_ONCE(kvm_gmem_get_pfn_locked(kvm, memslot, gfn_idx, &pfn, NULL)))
> > > +             return false;
> > > +
> > > +     page = pfn_to_page(pfn);
> > > +
> > > +     /* Two references are expected because of kvm_gmem_get_pfn_locked(). */
> > > +     is_mapped = page_ref_count(page) > 2;
> > > +
> > > +     put_page(page);
> > > +     unlock_page(page);
> > > +
> > > +     return is_mapped;
> > > +}
> > > +
> > > +static bool __kvm_gmem_is_mapped(struct kvm *kvm, gfn_t start, gfn_t end)
> > > +{
> > > +     struct kvm_memslot_iter iter;
> > > +
> > > +     lockdep_assert_held(&kvm->slots_lock);
> > > +
> > > +     kvm_for_each_memslot_in_gfn_range(&iter, kvm_memslots(kvm), start, end) {
> > > +             struct kvm_memory_slot *memslot = iter.slot;
> > > +             gfn_t gfn_start, gfn_end, i;
> > > +
> > > +             gfn_start = max(start, memslot->base_gfn);
> > > +             gfn_end = min(end, memslot->base_gfn + memslot->npages);
> > > +             if (WARN_ON_ONCE(gfn_start >= gfn_end))
> > > +                     continue;
> > > +
> > > +             for (i = gfn_start; i < gfn_end; i++) {
> > > +                     if (kvm_gmem_is_pfn_mapped(kvm, memslot, i))
> > > +                             return true;
> > > +             }
> > > +     }
> > > +
> > > +     return false;
> > > +}
> > > +
> > > +bool kvm_gmem_is_mapped(struct kvm *kvm, gfn_t start, gfn_t end)
> > > +{
> > > +     bool r;
> > > +
> > > +     mutex_lock(&kvm->slots_lock);
> > > +     r = __kvm_gmem_is_mapped(kvm, start, end);
> > > +     mutex_unlock(&kvm->slots_lock);
> > > +
> > > +     return r;
> > > +}
> > > +
> > > +static int kvm_gmem_toggle_mappable(struct kvm *kvm, gfn_t start, gfn_t end,
> > > +                                 bool is_mappable)
> > > +{
> > > +     struct kvm_memslot_iter iter;
> > > +     int r = 0;
> > > +
> > > +     mutex_lock(&kvm->slots_lock);
> > > +
> > > +     kvm_for_each_memslot_in_gfn_range(&iter, kvm_memslots(kvm), start, end) {
> > > +             struct kvm_memory_slot *memslot = iter.slot;
> > > +             gfn_t gfn_start, gfn_end;
> > > +
> > > +             gfn_start = max(start, memslot->base_gfn);
> > > +             gfn_end = min(end, memslot->base_gfn + memslot->npages);
> > > +             if (WARN_ON_ONCE(start >= end))
> > > +                     continue;
> > > +
> > > +             if (is_mappable)
> > > +                     r = kvm_slot_gmem_set_mappable(memslot, gfn_start, gfn_end);
> > > +             else
> > > +                     r = kvm_slot_gmem_clear_mappable(memslot, gfn_start, gfn_end);
> > > +
> > > +             if (WARN_ON_ONCE(r))
> > > +                     break;
> > > +     }
> > > +
> > > +     mutex_unlock(&kvm->slots_lock);
> > > +
> > > +     return r;
> > > +}
> > > +
> > > +int kvm_gmem_set_mappable(struct kvm *kvm, gfn_t start, gfn_t end)
> > > +{
> > > +     return kvm_gmem_toggle_mappable(kvm, start, end, true);
> > > +}
> > > +
> > > +int kvm_gmem_clear_mappable(struct kvm *kvm, gfn_t start, gfn_t end)
> > > +{
> > > +     return kvm_gmem_toggle_mappable(kvm, start, end, false);
> > > +}
> > > +
> > > +#endif /* CONFIG_KVM_GMEM_MAPPABLE */
> > > +
> > >  /* Copy @len bytes from guest memory at '(@gfn * PAGE_SIZE) + @offset' to @data */
> > >  static int __kvm_read_guest_page(struct kvm_memory_slot *slot, gfn_t gfn,
> > >                                void *data, int offset, int len)
> > > --
> > > 2.47.0.rc0.187.ge670bccf7e-goog
> > >
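
For context, a small userspace sketch (not part of the series) of what
the mmap path above permits. gmem_fd is assumed to be a guest_memfd
that was created as mappable, and size its length:

#include <err.h>
#include <stddef.h>
#include <sys/mman.h>

static void *map_guest_memfd(int gmem_fd, size_t size)
{
	/* kvm_gmem_mmap() only accepts shared mappings. */
	void *mem = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED,
			 gmem_fd, 0);
	if (mem == MAP_FAILED)
		err(1, "mmap(guest_memfd)");

	/*
	 * Faulting a mappable offset zeroes the page on first use; faulting
	 * an offset that has been made private again raises SIGBUS in the
	 * current task, as implemented by kvm_gmem_fault() above.
	 */
	((volatile char *)mem)[0] = 1;

	return mem;
}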



* Re: [PATCH v3 05/11] KVM: guest_memfd: Add guest_memfd support to kvm_(read|/write)_guest_page()
  2024-10-10  8:59 ` [PATCH v3 05/11] KVM: guest_memfd: Add guest_memfd support to kvm_(read|/write)_guest_page() Fuad Tabba
@ 2024-10-17 21:53   ` Ackerley Tng
  2024-10-18  6:57     ` Patrick Roy
  0 siblings, 1 reply; 28+ messages in thread
From: Ackerley Tng @ 2024-10-17 21:53 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe, anup,
	paul.walmsley, palmer, aou, seanjc, viro, brauner, willy, akpm,
	xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
	yu.c.zhang, isaku.yamahata, mic, vbabka, vannapurve, mail, david,
	michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat, shuah,
	hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton, tabba

Fuad Tabba <tabba@google.com> writes:

> Make kvm_(read|/write)_guest_page() capable of accessing guest
> memory for slots that don't have a userspace address, but only if
> the memory is mappable, which also indicates that it is
> accessible by the host.

Fuad explained to me that this patch removes the need for userspace to
mmap a guest_memfd fd just to provide userspace_addr when only a limited
range of shared pages are required, e.g. for kvm_clock.

Questions to anyone who might be more familiar:

1. Should we let userspace save the trouble of providing userspace_addr
   if only KVM (and not userspace) needs to access the shared pages?
2. Other than kvm_{read,write}_guest_page, are there any other parts of
   KVM that may require updates so that guest_memfd can be used directly
   from the kernel?

Patrick, does this help to answer the question of "how does KVM
internally access guest_memfd for non-CoCo VMs" that you brought up in
this other thread [*]?

[*] https://lore.kernel.org/all/6bca3ad4-3eca-4a75-a775-5f8b0467d7a3@amazon.co.uk/
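
For context, a sketch (not from the series) of the kind of userspace
setup this enables: a guest_memfd-backed memslot with no userspace_addr
at all, which KVM can still read and write through the guest_memfd path
as long as the range is mappable. The slot number, guest physical
address and size below are arbitrary:

#include <err.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static void add_gmem_only_slot(int vm_fd)
{
	struct kvm_create_guest_memfd gmem = {
		.size = 64UL << 20,	/* 64 MiB */
	};
	int gmem_fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);

	if (gmem_fd < 0)
		err(1, "KVM_CREATE_GUEST_MEMFD");

	struct kvm_userspace_memory_region2 region = {
		.slot = 0,
		.flags = KVM_MEM_GUEST_MEMFD,
		.guest_phys_addr = 0x80000000,
		.memory_size = gmem.size,
		.userspace_addr = 0,		/* no host alias provided */
		.guest_memfd = gmem_fd,
		.guest_memfd_offset = 0,
	};

	if (ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION2, &region))
		err(1, "KVM_SET_USER_MEMORY_REGION2");
}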

>
> Signed-off-by: Fuad Tabba <tabba@google.com>
> ---
>  virt/kvm/kvm_main.c | 137 ++++++++++++++++++++++++++++++++++++++------
>  1 file changed, 118 insertions(+), 19 deletions(-)
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index aed9cf2f1685..77e6412034b9 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -3399,23 +3399,114 @@ int kvm_gmem_clear_mappable(struct kvm *kvm, gfn_t start, gfn_t end)
>  	return kvm_gmem_toggle_mappable(kvm, start, end, false);
>  }
>  
> +static int __kvm_read_guest_memfd_page(struct kvm *kvm,
> +				       struct kvm_memory_slot *slot,
> +				       gfn_t gfn, void *data, int offset,
> +				       int len)
> +{
> +	struct page *page;
> +	u64 pfn;
> +	int r;
> +
> +	/*
> +	 * Holds the folio lock until after checking whether it can be faulted
> +	 * in, to avoid races with paths that change a folio's mappability.
> +	 */
> +	r = kvm_gmem_get_pfn_locked(kvm, slot, gfn, &pfn, NULL);
> +	if (r)
> +		return r;
> +
> +	page = pfn_to_page(pfn);
> +
> +	if (!kvm_gmem_is_mappable(kvm, gfn, gfn + 1)) {
> +		r = -EPERM;
> +		goto unlock;
> +	}
> +	memcpy(data, page_address(page) + offset, len);
> +unlock:
> +	if (r)
> +		put_page(page);
> +	else
> +		kvm_release_pfn_clean(pfn);
> +	unlock_page(page);
> +
> +	return r;
> +}
> +
> +static int __kvm_write_guest_memfd_page(struct kvm *kvm,
> +					struct kvm_memory_slot *slot,
> +					gfn_t gfn, const void *data,
> +					int offset, int len)
> +{
> +	struct page *page;
> +	u64 pfn;
> +	int r;
> +
> +	/*
> +	 * Holds the folio lock until after checking whether it can be faulted
> +	 * in, to avoid races with paths that change a folio's mappability.
> +	 */
> +	r = kvm_gmem_get_pfn_locked(kvm, slot, gfn, &pfn, NULL);
> +	if (r)
> +		return r;
> +
> +	page = pfn_to_page(pfn);
> +
> +	if (!kvm_gmem_is_mappable(kvm, gfn, gfn + 1)) {
> +		r = -EPERM;
> +		goto unlock;
> +	}
> +	memcpy(page_address(page) + offset, data, len);
> +unlock:
> +	if (r)
> +		put_page(page);
> +	else
> +		kvm_release_pfn_dirty(pfn);
> +	unlock_page(page);
> +
> +	return r;
> +}
> +#else
> +static int __kvm_read_guest_memfd_page(struct kvm *kvm,
> +				       struct kvm_memory_slot *slot,
> +				       gfn_t gfn, void *data, int offset,
> +				       int len)
> +{
> +	WARN_ON_ONCE(1);
> +	return -EIO;
> +}
> +
> +static int __kvm_write_guest_memfd_page(struct kvm *kvm,
> +					struct kvm_memory_slot *slot,
> +					gfn_t gfn, const void *data,
> +					int offset, int len)
> +{
> +	WARN_ON_ONCE(1);
> +	return -EIO;
> +}
>  #endif /* CONFIG_KVM_GMEM_MAPPABLE */
>  
>  /* Copy @len bytes from guest memory at '(@gfn * PAGE_SIZE) + @offset' to @data */
> -static int __kvm_read_guest_page(struct kvm_memory_slot *slot, gfn_t gfn,
> -				 void *data, int offset, int len)
> +
> +static int __kvm_read_guest_page(struct kvm *kvm, struct kvm_memory_slot *slot,
> +				 gfn_t gfn, void *data, int offset, int len)
>  {
> -	int r;
>  	unsigned long addr;
>  
>  	if (WARN_ON_ONCE(offset + len > PAGE_SIZE))
>  		return -EFAULT;
>  
> +	if (IS_ENABLED(CONFIG_KVM_GMEM_MAPPABLE) &&
> +	    kvm_slot_can_be_private(slot) &&
> +	    !slot->userspace_addr) {
> +		return __kvm_read_guest_memfd_page(kvm, slot, gfn, data,
> +						   offset, len);
> +	}
> +
>  	addr = gfn_to_hva_memslot_prot(slot, gfn, NULL);
>  	if (kvm_is_error_hva(addr))
>  		return -EFAULT;
> -	r = __copy_from_user(data, (void __user *)addr + offset, len);
> -	if (r)
> +	if (__copy_from_user(data, (void __user *)addr + offset, len))
>  		return -EFAULT;
>  	return 0;
>  }
> @@ -3425,7 +3516,7 @@ int kvm_read_guest_page(struct kvm *kvm, gfn_t gfn, void *data, int offset,
>  {
>  	struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
>  
> -	return __kvm_read_guest_page(slot, gfn, data, offset, len);
> +	return __kvm_read_guest_page(kvm, slot, gfn, data, offset, len);
>  }
>  EXPORT_SYMBOL_GPL(kvm_read_guest_page);
>  
> @@ -3434,7 +3525,7 @@ int kvm_vcpu_read_guest_page(struct kvm_vcpu *vcpu, gfn_t gfn, void *data,
>  {
>  	struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
>  
> -	return __kvm_read_guest_page(slot, gfn, data, offset, len);
> +	return __kvm_read_guest_page(vcpu->kvm, slot, gfn, data, offset, len);
>  }
>  EXPORT_SYMBOL_GPL(kvm_vcpu_read_guest_page);
>  
> @@ -3511,22 +3602,30 @@ EXPORT_SYMBOL_GPL(kvm_vcpu_read_guest_atomic);
>  
>  /* Copy @len bytes from @data into guest memory at '(@gfn * PAGE_SIZE) + @offset' */
>  static int __kvm_write_guest_page(struct kvm *kvm,
> -				  struct kvm_memory_slot *memslot, gfn_t gfn,
> -			          const void *data, int offset, int len)
> +				  struct kvm_memory_slot *slot, gfn_t gfn,
> +				  const void *data, int offset, int len)
>  {
> -	int r;
> -	unsigned long addr;
> -
>  	if (WARN_ON_ONCE(offset + len > PAGE_SIZE))
>  		return -EFAULT;
>  
> -	addr = gfn_to_hva_memslot(memslot, gfn);
> -	if (kvm_is_error_hva(addr))
> -		return -EFAULT;
> -	r = __copy_to_user((void __user *)addr + offset, data, len);
> -	if (r)
> -		return -EFAULT;
> -	mark_page_dirty_in_slot(kvm, memslot, gfn);
> +	if (IS_ENABLED(CONFIG_KVM_GMEM_MAPPABLE) &&
> +	    kvm_slot_can_be_private(slot) &&
> +	    !slot->userspace_addr) {
> +		int r = __kvm_write_guest_memfd_page(kvm, slot, gfn, data,
> +						     offset, len);
> +
> +		if (r)
> +			return r;
> +	} else {
> +		unsigned long addr = gfn_to_hva_memslot(slot, gfn);
> +
> +		if (kvm_is_error_hva(addr))
> +			return -EFAULT;
> +		if (__copy_to_user((void __user *)addr + offset, data, len))
> +			return -EFAULT;
> +	}
> +
> +	mark_page_dirty_in_slot(kvm, slot, gfn);
>  	return 0;
>  }



* Re: [PATCH v3 05/11] KVM: guest_memfd: Add guest_memfd support to kvm_(read|/write)_guest_page()
  2024-10-17 21:53   ` Ackerley Tng
@ 2024-10-18  6:57     ` Patrick Roy
  0 siblings, 0 replies; 28+ messages in thread
From: Patrick Roy @ 2024-10-18  6:57 UTC (permalink / raw)
  To: Ackerley Tng, Fuad Tabba
  Cc: kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe, anup,
	paul.walmsley, palmer, aou, seanjc, viro, brauner, willy, akpm,
	xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
	yu.c.zhang, isaku.yamahata, mic, vbabka, vannapurve, mail, david,
	michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, shuah, hch,
	jgg, rientjes, jhubbard, fvdl, hughd, jthoughton



On Thu, 2024-10-17 at 22:53 +0100, Ackerley Tng wrote:
> Fuad Tabba <tabba@google.com> writes:
> 
>> Make kvm_(read|/write)_guest_page() capable of accessing guest
>> memory for slots that don't have a userspace address, but only if
>> the memory is mappable, which also indicates that it is
>> accessible by the host.
> 
> Fuad explained to me that this patch removes the need for userspace to
> mmap a guest_memfd fd just to provide userspace_addr when only a limited
> range of shared pages are required, e.g. for kvm_clock.
> 
> Questions to anyone who might be more familiar:
> 
> 1. Should we let userspace save the trouble of providing userspace_addr
>    if only KVM (and not userspace) needs to access the shared pages?
> 2. Other than kvm_{read,write}_guest_page, are there any other parts of
>    KVM that may require updates so that guest_memfd can be used directly
>    from the kernel?
> 
> Patrick, does this help to answer the question of "how does KVM
> internally access guest_memfd for non-CoCo VMs" that you brought up in
> this other thread [*]?

Just patching kvm_{read,write}_guest sadly does not solve all the
problems on x86: there are also guest page table walks (which use
get_user/try_cmpxchg on userspace_addr directly), and kvm-clock, which
uses a special mechanism (gfn_to_pfn_cache) to translate gpas to pfns
via userspace_addr (and uses gup in the process). These aren't
impossible to also teach about gmem (details in my patch series [1],
which does all of this in the context of direct map removal), but with
direct map removal they would all additionally need to temporarily
restore the direct map, which is expensive due to TLB flushes. That's
the main reason why I was so excited about the userspace_addr approach:
it would mean no TLB flushes (copy_from_user and friends work without
direct map entries), and everything except kvm-clock will "just work"
(and massaging kvm-clock to also work in this model is easier than
making it work with on-demand direct map restoration, I think).

There's also something called `kvm_{read,write}_guest_cached` (I think
async page faulting uses it), but that's fairly easy to deal with [3].

There's another problem for x86 here: during instruction fetch for MMIO
emulation (the CPU sadly doesn't help us here, outside of modern AMD
processors [2]), there is no way to mark pages as shared/mappable ahead
of time. I guess this is only a problem for CoCo VMs with in-place
sharing on x86 that don't do TDX-style paravirtual MMIO, and I am not
sure such a configuration even exists. For my non-CoCo usecase I would
just unconditionally set all of gmem to faultable, but without direct
map entries. But then I'd need separate knobs for "private" and
"faultable".

[1]: https://lore.kernel.org/kvm/20240910163038.1298452-1-roypat@amazon.co.uk/
[2]: https://lore.kernel.org/kvm/ZkI0SCMARCB9bAfc@google.com/
[3]: https://lore.kernel.org/kvm/20240709132041.3625501-1-roypat@amazon.co.uk/T/#ma61de70f906cd6553ffba02563d1e617f7c9246b

> [*] https://lore.kernel.org/all/6bca3ad4-3eca-4a75-a775-5f8b0467d7a3@amazon.co.uk/
> 
>>
>> Signed-off-by: Fuad Tabba <tabba@google.com>
>> ---
>>  virt/kvm/kvm_main.c | 137 ++++++++++++++++++++++++++++++++++++++------
>>  1 file changed, 118 insertions(+), 19 deletions(-)
>>
>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>> index aed9cf2f1685..77e6412034b9 100644
>> --- a/virt/kvm/kvm_main.c
>> +++ b/virt/kvm/kvm_main.c
>> @@ -3399,23 +3399,114 @@ int kvm_gmem_clear_mappable(struct kvm *kvm, gfn_t start, gfn_t end)
>>       return kvm_gmem_toggle_mappable(kvm, start, end, false);
>>  }
>>
>> +static int __kvm_read_guest_memfd_page(struct kvm *kvm,
>> +                                    struct kvm_memory_slot *slot,
>> +                                    gfn_t gfn, void *data, int offset,
>> +                                    int len)
>> +{
>> +     struct page *page;
>> +     u64 pfn;
>> +     int r;
>> +
>> +     /*
>> +      * Holds the folio lock until after checking whether it can be faulted
>> +      * in, to avoid races with paths that change a folio's mappability.
>> +      */
>> +     r = kvm_gmem_get_pfn_locked(kvm, slot, gfn, &pfn, NULL);
>> +     if (r)
>> +             return r;
>> +
>> +     page = pfn_to_page(pfn);
>> +
>> +     if (!kvm_gmem_is_mappable(kvm, gfn, gfn + 1)) {
>> +             r = -EPERM;
>> +             goto unlock;
>> +     }
>> +     memcpy(data, page_address(page) + offset, len);
>> +unlock:
>> +     if (r)
>> +             put_page(page);
>> +     else
>> +             kvm_release_pfn_clean(pfn);
>> +     unlock_page(page);
>> +
>> +     return r;
>> +}
>> +
>> +static int __kvm_write_guest_memfd_page(struct kvm *kvm,
>> +                                     struct kvm_memory_slot *slot,
>> +                                     gfn_t gfn, const void *data,
>> +                                     int offset, int len)
>> +{
>> +     struct page *page;
>> +     u64 pfn;
>> +     int r;
>> +
>> +     /*
>> +      * Holds the folio lock until after checking whether it can be faulted
>> +      * in, to avoid races with paths that change a folio's mappability.
>> +      */
>> +     r = kvm_gmem_get_pfn_locked(kvm, slot, gfn, &pfn, NULL);
>> +     if (r)
>> +             return r;
>> +
>> +     page = pfn_to_page(pfn);
>> +
>> +     if (!kvm_gmem_is_mappable(kvm, gfn, gfn + 1)) {
>> +             r = -EPERM;
>> +             goto unlock;
>> +     }
>> +     memcpy(page_address(page) + offset, data, len);
>> +unlock:
>> +     if (r)
>> +             put_page(page);
>> +     else
>> +             kvm_release_pfn_dirty(pfn);
>> +     unlock_page(page);
>> +
>> +     return r;
>> +}
>> +#else
>> +static int __kvm_read_guest_memfd_page(struct kvm *kvm,
>> +                                    struct kvm_memory_slot *slot,
>> +                                    gfn_t gfn, void *data, int offset,
>> +                                    int len)
>> +{
>> +     WARN_ON_ONCE(1);
>> +     return -EIO;
>> +}
>> +
>> +static int __kvm_write_guest_memfd_page(struct kvm *kvm,
>> +                                     struct kvm_memory_slot *slot,
>> +                                     gfn_t gfn, const void *data,
>> +                                     int offset, int len)
>> +{
>> +     WARN_ON_ONCE(1);
>> +     return -EIO;
>> +}
>>  #endif /* CONFIG_KVM_GMEM_MAPPABLE */
>>
>>  /* Copy @len bytes from guest memory at '(@gfn * PAGE_SIZE) + @offset' to @data */
>> -static int __kvm_read_guest_page(struct kvm_memory_slot *slot, gfn_t gfn,
>> -                              void *data, int offset, int len)
>> +
>> +static int __kvm_read_guest_page(struct kvm *kvm, struct kvm_memory_slot *slot,
>> +                              gfn_t gfn, void *data, int offset, int len)
>>  {
>> -     int r;
>>       unsigned long addr;
>>
>>       if (WARN_ON_ONCE(offset + len > PAGE_SIZE))
>>               return -EFAULT;
>>
>> +     if (IS_ENABLED(CONFIG_KVM_GMEM_MAPPABLE) &&
>> +         kvm_slot_can_be_private(slot) &&
>> +         !slot->userspace_addr) {
>> +             return __kvm_read_guest_memfd_page(kvm, slot, gfn, data,
>> +                                                offset, len);
>> +     }
>> +
>>       addr = gfn_to_hva_memslot_prot(slot, gfn, NULL);
>>       if (kvm_is_error_hva(addr))
>>               return -EFAULT;
>> -     r = __copy_from_user(data, (void __user *)addr + offset, len);
>> -     if (r)
>> +     if (__copy_from_user(data, (void __user *)addr + offset, len))
>>               return -EFAULT;
>>       return 0;
>>  }
>> @@ -3425,7 +3516,7 @@ int kvm_read_guest_page(struct kvm *kvm, gfn_t gfn, void *data, int offset,
>>  {
>>       struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
>>
>> -     return __kvm_read_guest_page(slot, gfn, data, offset, len);
>> +     return __kvm_read_guest_page(kvm, slot, gfn, data, offset, len);
>>  }
>>  EXPORT_SYMBOL_GPL(kvm_read_guest_page);
>>
>> @@ -3434,7 +3525,7 @@ int kvm_vcpu_read_guest_page(struct kvm_vcpu *vcpu, gfn_t gfn, void *data,
>>  {
>>       struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
>>
>> -     return __kvm_read_guest_page(slot, gfn, data, offset, len);
>> +     return __kvm_read_guest_page(vcpu->kvm, slot, gfn, data, offset, len);
>>  }
>>  EXPORT_SYMBOL_GPL(kvm_vcpu_read_guest_page);
>>
>> @@ -3511,22 +3602,30 @@ EXPORT_SYMBOL_GPL(kvm_vcpu_read_guest_atomic);
>>
>>  /* Copy @len bytes from @data into guest memory at '(@gfn * PAGE_SIZE) + @offset' */
>>  static int __kvm_write_guest_page(struct kvm *kvm,
>> -                               struct kvm_memory_slot *memslot, gfn_t gfn,
>> -                               const void *data, int offset, int len)
>> +                               struct kvm_memory_slot *slot, gfn_t gfn,
>> +                               const void *data, int offset, int len)
>>  {
>> -     int r;
>> -     unsigned long addr;
>> -
>>       if (WARN_ON_ONCE(offset + len > PAGE_SIZE))
>>               return -EFAULT;
>>
>> -     addr = gfn_to_hva_memslot(memslot, gfn);
>> -     if (kvm_is_error_hva(addr))
>> -             return -EFAULT;
>> -     r = __copy_to_user((void __user *)addr + offset, data, len);
>> -     if (r)
>> -             return -EFAULT;
>> -     mark_page_dirty_in_slot(kvm, memslot, gfn);
>> +     if (IS_ENABLED(CONFIG_KVM_GMEM_MAPPABLE) &&
>> +         kvm_slot_can_be_private(slot) &&
>> +         !slot->userspace_addr) {
>> +             int r = __kvm_write_guest_memfd_page(kvm, slot, gfn, data,
>> +                                                  offset, len);
>> +
>> +             if (r)
>> +                     return r;
>> +     } else {
>> +             unsigned long addr = gfn_to_hva_memslot(slot, gfn);
>> +
>> +             if (kvm_is_error_hva(addr))
>> +                     return -EFAULT;
>> +             if (__copy_to_user((void __user *)addr + offset, data, len))
>> +                     return -EFAULT;
>> +     }
>> +
>> +     mark_page_dirty_in_slot(kvm, slot, gfn);
>>       return 0;
>>  }



end of thread

Thread overview: 28+ messages
2024-10-10  8:59 [PATCH v3 00/11] KVM: Restricted mapping of guest_memfd at the host and arm64 support Fuad Tabba
2024-10-10  8:59 ` [PATCH v3 01/11] KVM: guest_memfd: Make guest mem use guest mem inodes instead of anonymous inodes Fuad Tabba
2024-10-12  6:12   ` kernel test robot
2024-10-10  8:59 ` [PATCH v3 02/11] KVM: guest_memfd: Track mappability within a struct kvm_gmem_private Fuad Tabba
2024-10-10  8:59 ` [PATCH v3 03/11] KVM: guest_memfd: Introduce kvm_gmem_get_pfn_locked(), which retains the folio lock Fuad Tabba
2024-10-10  8:59 ` [PATCH v3 04/11] KVM: guest_memfd: Allow host to mmap guest_memfd() pages when shared Fuad Tabba
2024-10-10 10:14   ` Kirill A. Shutemov
2024-10-10 10:23     ` Fuad Tabba
2024-10-10 12:03       ` Jason Gunthorpe
2024-10-10 14:27         ` Fuad Tabba
2024-10-10 12:20       ` Kirill A. Shutemov
2024-10-10 14:28         ` Fuad Tabba
2024-10-10 14:36           ` Kirill A. Shutemov
2024-10-10 14:37           ` Jason Gunthorpe
2024-10-14 16:52   ` Elliot Berman
2024-10-15 10:27     ` Fuad Tabba
2024-10-16 16:53       ` Elliot Berman
2024-10-10  8:59 ` [PATCH v3 05/11] KVM: guest_memfd: Add guest_memfd support to kvm_(read|/write)_guest_page() Fuad Tabba
2024-10-17 21:53   ` Ackerley Tng
2024-10-18  6:57     ` Patrick Roy
2024-10-10  8:59 ` [PATCH v3 06/11] KVM: guest_memfd: Add KVM capability to check if guest_memfd is host mappable Fuad Tabba
2024-10-15 10:30   ` Suzuki K Poulose
2024-10-15 10:33     ` Fuad Tabba
2024-10-10  8:59 ` [PATCH v3 07/11] KVM: guest_memfd: Add a guest_memfd() flag to initialize it as mappable Fuad Tabba
2024-10-10  8:59 ` [PATCH v3 08/11] KVM: guest_memfd: selftests: guest_memfd mmap() test when mapping is allowed Fuad Tabba
2024-10-10  8:59 ` [PATCH v3 09/11] KVM: arm64: Skip VMA checks for slots without userspace address Fuad Tabba
2024-10-10  8:59 ` [PATCH v3 10/11] KVM: arm64: Handle guest_memfd()-backed guest page faults Fuad Tabba
2024-10-10  8:59 ` [PATCH v3 11/11] KVM: arm64: Enable guest_memfd private memory when pKVM is enabled Fuad Tabba
