* [RFC PATCH 0/4] KVM: ioctl for populating guest_memfd
@ 2024-10-24 9:54 Nikita Kalyazin
2024-10-24 9:54 ` [PATCH 1/4] KVM: guest_memfd: add generic post_populate callback Nikita Kalyazin
` (5 more replies)
0 siblings, 6 replies; 20+ messages in thread
From: Nikita Kalyazin @ 2024-10-24 9:54 UTC (permalink / raw)
To: pbonzini, corbet, kvm, linux-doc, linux-kernel
Cc: jthoughton, brijesh.singh, michael.roth, graf, jgowans, roypat,
derekmn, nsaenz, xmarcalx, kalyazin
Firecracker currently allows populating guest memory from a separate
process via UserfaultFD [1]. This helps keep the VMM codebase and
functionality concise and generic, while offloading the logic of
obtaining guest memory to another process. UserfaultFD is currently not
supported for guest_memfd, because it binds to a VMA, while guest_memfd
does not necessarily need to (or cannot) be mapped to userspace,
especially for private memory. [2] proposes an alternative to
UserfaultFD for intercepting stage-2 faults, while this series
conceptually complements it with the ability to populate guest memory
backed by guest_memfd for `KVM_X86_SW_PROTECTED_VM` VMs.
Patches 1-3 add a new ioctl, `KVM_GUEST_MEMFD_POPULATE`, that uses a
vendor-agnostic implementation of `post_populate` callback.
Patch 4 allows calling the ioctl from a separate (non-VMM) process. This
has been prohibited by [3], but I have not been able to locate the exact
justification for the requirement.
Questions:
- Does exposing a generic population interface via ioctl look
sensible in this form?
- Is there a path where the "only the VMM can call the KVM API"
requirement is relaxed? If not, what is the recommended efficient
alternative for populating guest memory from outside the VMM?
[1]: https://github.com/firecracker-microvm/firecracker/blob/main/docs/snapshotting/handling-page-faults-on-snapshot-resume.md
[2]: https://lore.kernel.org/kvm/CADrL8HUHRMwUPhr7jLLBgD9YLFAnVHc=N-C=8er-x6GUtV97pQ@mail.gmail.com/T/
[3]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6d4e4c4fca5be806b888d606894d914847e82d78
Nikita
Nikita Kalyazin (4):
KVM: guest_memfd: add generic post_populate callback
KVM: add KVM_GUEST_MEMFD_POPULATE ioctl for guest_memfd
KVM: allow KVM_GUEST_MEMFD_POPULATE in another mm
KVM: document KVM_GUEST_MEMFD_POPULATE ioctl
Documentation/virt/kvm/api.rst | 23 +++++++++++++++++++++++
include/linux/kvm_host.h | 3 +++
include/uapi/linux/kvm.h | 9 +++++++++
virt/kvm/guest_memfd.c | 28 ++++++++++++++++++++++++++++
virt/kvm/kvm_main.c | 19 ++++++++++++++++++-
5 files changed, 81 insertions(+), 1 deletion(-)
base-commit: c8d430db8eec7d4fd13a6bea27b7086a54eda6da
--
2.40.1
^ permalink raw reply [flat|nested] 20+ messages in thread
* [PATCH 1/4] KVM: guest_memfd: add generic post_populate callback
2024-10-24 9:54 [RFC PATCH 0/4] KVM: ioctl for populating guest_memfd Nikita Kalyazin
@ 2024-10-24 9:54 ` Nikita Kalyazin
2024-11-22 18:40 ` Mike Day
2024-10-24 9:54 ` [PATCH 2/4] KVM: add KVM_GUEST_MEMFD_POPULATE ioctl for guest_memfd Nikita Kalyazin
` (4 subsequent siblings)
5 siblings, 1 reply; 20+ messages in thread
From: Nikita Kalyazin @ 2024-10-24 9:54 UTC (permalink / raw)
To: pbonzini, corbet, kvm, linux-doc, linux-kernel
Cc: jthoughton, brijesh.singh, michael.roth, graf, jgowans, roypat,
derekmn, nsaenz, xmarcalx, kalyazin
This adds a generic implementation of the `post_populate` callback for
`kvm_gmem_populate`. It populates the pages with data provided by
userspace if the user pointer is not NULL; otherwise it clears the
pages.
This is intended to be used by KVM_X86_SW_PROTECTED_VM VMs.
Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
---
virt/kvm/guest_memfd.c | 21 +++++++++++++++++++++
1 file changed, 21 insertions(+)
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 8f079a61a56d..954312fac462 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -620,6 +620,27 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
EXPORT_SYMBOL_GPL(kvm_gmem_get_pfn);
#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
+static int kvm_gmem_post_populate_generic(struct kvm *kvm, gfn_t gfn_start, kvm_pfn_t pfn,
+ void __user *src, int order, void *opaque)
+{
+ int ret = 0, i;
+ int npages = (1 << order);
+ gfn_t gfn;
+
+ if (src) {
+ for (i = 0; i < npages; i++) {
+ void *vaddr = kmap_local_pfn(pfn + i);
+
+ ret = copy_from_user(vaddr, src + i * PAGE_SIZE, PAGE_SIZE);
+ kunmap_local(vaddr);
+ if (ret) {
+ ret = -EFAULT;
+ break;
+ }
+ }
+ } else {
+ for (gfn = gfn_start, i = 0; gfn < gfn_start + npages; gfn++, i++)
+ clear_highpage(pfn_to_page(pfn + i));
+ }
+
+ return ret;
+}
+
long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long npages,
kvm_gmem_populate_cb post_populate, void *opaque)
{
--
2.40.1
^ permalink raw reply related [flat|nested] 20+ messages in thread
* [PATCH 2/4] KVM: add KVM_GUEST_MEMFD_POPULATE ioctl for guest_memfd
2024-10-24 9:54 [RFC PATCH 0/4] KVM: ioctl for populating guest_memfd Nikita Kalyazin
2024-10-24 9:54 ` [PATCH 1/4] KVM: guest_memfd: add generic post_populate callback Nikita Kalyazin
@ 2024-10-24 9:54 ` Nikita Kalyazin
2024-10-24 9:54 ` [PATCH 3/4] KVM: allow KVM_GUEST_MEMFD_POPULATE in another mm Nikita Kalyazin
` (3 subsequent siblings)
5 siblings, 0 replies; 20+ messages in thread
From: Nikita Kalyazin @ 2024-10-24 9:54 UTC (permalink / raw)
To: pbonzini, corbet, kvm, linux-doc, linux-kernel
Cc: jthoughton, brijesh.singh, michael.roth, graf, jgowans, roypat,
derekmn, nsaenz, xmarcalx, kalyazin
The ioctl populates guest_memfd with userspace-provided data.
Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
---
include/linux/kvm_host.h | 3 +++
include/uapi/linux/kvm.h | 9 +++++++++
virt/kvm/guest_memfd.c | 7 +++++++
virt/kvm/kvm_main.c | 10 ++++++++++
4 files changed, 29 insertions(+)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index db567d26f7b9..5b0347783598 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2505,6 +2505,9 @@ typedef int (*kvm_gmem_populate_cb)(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
long kvm_gmem_populate(struct kvm *kvm, gfn_t gfn, void __user *src, long npages,
kvm_gmem_populate_cb post_populate, void *opaque);
+
+int kvm_gmem_guest_memfd_populate(struct kvm *kvm,
+ struct kvm_guest_memfd_populate *populate);
#endif
#ifdef CONFIG_HAVE_KVM_ARCH_GMEM_INVALIDATE
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 637efc055145..5d8073de0d96 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1573,4 +1573,13 @@ struct kvm_pre_fault_memory {
__u64 padding[5];
};
+struct kvm_guest_memfd_populate {
+ __u64 gpa;
+ __u64 size;
+ void __user *from;
+ __u64 flags;
+};
+
+#define KVM_GUEST_MEMFD_POPULATE _IOW(KVMIO, 0xd6, struct kvm_guest_memfd_populate)
+
#endif /* __LINUX_KVM_H */
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 954312fac462..08630b87f0e3 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -720,4 +720,11 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long
return ret && !i ? ret : i;
}
EXPORT_SYMBOL_GPL(kvm_gmem_populate);
+
+int kvm_gmem_guest_memfd_populate(struct kvm *kvm,
+ struct kvm_guest_memfd_populate *populate)
+{
+ return kvm_gmem_populate(kvm, populate->gpa >> PAGE_SHIFT, populate->from,
+ populate->size >> PAGE_SHIFT, kvm_gmem_post_populate_generic, NULL);
+}
#endif
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 05cbb2548d99..e5bd2c0031bf 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -5383,6 +5383,16 @@ static long kvm_vm_ioctl(struct file *filp,
r = kvm_gmem_create(kvm, &guest_memfd);
break;
}
+ case KVM_GUEST_MEMFD_POPULATE: {
+ struct kvm_guest_memfd_populate populate;
+
+ r = -EFAULT;
+ if (copy_from_user(&populate, argp, sizeof(populate)))
+ goto out;
+
+ r = kvm_gmem_guest_memfd_populate(kvm, &populate);
+ break;
+ }
#endif
default:
r = kvm_arch_vm_ioctl(filp, ioctl, arg);
--
2.40.1
^ permalink raw reply related [flat|nested] 20+ messages in thread
* [PATCH 3/4] KVM: allow KVM_GUEST_MEMFD_POPULATE in another mm
2024-10-24 9:54 [RFC PATCH 0/4] KVM: ioctl for populating guest_memfd Nikita Kalyazin
2024-10-24 9:54 ` [PATCH 1/4] KVM: guest_memfd: add generic post_populate callback Nikita Kalyazin
2024-10-24 9:54 ` [PATCH 2/4] KVM: add KVM_GUEST_MEMFD_POPULATE ioctl for guest_memfd Nikita Kalyazin
@ 2024-10-24 9:54 ` Nikita Kalyazin
2024-10-24 9:54 ` [PATCH 4/4] KVM: document KVM_GUEST_MEMFD_POPULATE ioctl Nikita Kalyazin
` (2 subsequent siblings)
5 siblings, 0 replies; 20+ messages in thread
From: Nikita Kalyazin @ 2024-10-24 9:54 UTC (permalink / raw)
To: pbonzini, corbet, kvm, linux-doc, linux-kernel
Cc: jthoughton, brijesh.singh, michael.roth, graf, jgowans, roypat,
derekmn, nsaenz, xmarcalx, kalyazin
Allow calling the KVM_GUEST_MEMFD_POPULATE ioctl from a process that
does not own the KVM context.
This enables guest_memfd population by a non-VMM process, which is
useful for isolating the memory management logic from the VMM for
security and performance reasons.
Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
---
virt/kvm/kvm_main.c | 29 ++++++++++++++++++-----------
1 file changed, 18 insertions(+), 11 deletions(-)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index e5bd2c0031bf..eb626c4bf4d7 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -5159,8 +5159,25 @@ static long kvm_vm_ioctl(struct file *filp,
void __user *argp = (void __user *)arg;
int r;
- if (kvm->mm != current->mm || kvm->vm_dead)
+ if (kvm->vm_dead)
return -EIO;
+
+#ifdef CONFIG_KVM_PRIVATE_MEM
+ if (ioctl == KVM_GUEST_MEMFD_POPULATE) {
+ struct kvm_guest_memfd_populate populate;
+
+ r = -EFAULT;
+ if (copy_from_user(&populate, argp, sizeof(populate)))
+ goto out;
+
+ r = kvm_gmem_guest_memfd_populate(kvm, &populate);
+ goto out;
+ }
+#endif
+
+ if (kvm->mm != current->mm)
+ return -EIO;
+
switch (ioctl) {
case KVM_CREATE_VCPU:
r = kvm_vm_ioctl_create_vcpu(kvm, arg);
@@ -5383,16 +5400,6 @@ static long kvm_vm_ioctl(struct file *filp,
r = kvm_gmem_create(kvm, &guest_memfd);
break;
}
- case KVM_GUEST_MEMFD_POPULATE: {
- struct kvm_guest_memfd_populate populate;
-
- r = -EFAULT;
- if (copy_from_user(&populate, argp, sizeof(populate)))
- goto out;
-
- r = kvm_gmem_guest_memfd_populate(kvm, &populate);
- break;
- }
#endif
default:
r = kvm_arch_vm_ioctl(filp, ioctl, arg);
--
2.40.1
^ permalink raw reply related [flat|nested] 20+ messages in thread
* [PATCH 4/4] KVM: document KVM_GUEST_MEMFD_POPULATE ioctl
2024-10-24 9:54 [RFC PATCH 0/4] KVM: ioctl for populating guest_memfd Nikita Kalyazin
` (2 preceding siblings ...)
2024-10-24 9:54 ` [PATCH 3/4] KVM: allow KVM_GUEST_MEMFD_POPULATE in another mm Nikita Kalyazin
@ 2024-10-24 9:54 ` Nikita Kalyazin
2024-11-20 12:09 ` [RFC PATCH 0/4] KVM: ioctl for populating guest_memfd Nikita Kalyazin
2024-11-20 13:55 ` Paolo Bonzini
5 siblings, 0 replies; 20+ messages in thread
From: Nikita Kalyazin @ 2024-10-24 9:54 UTC (permalink / raw)
To: pbonzini, corbet, kvm, linux-doc, linux-kernel
Cc: jthoughton, brijesh.singh, michael.roth, graf, jgowans, roypat,
derekmn, nsaenz, xmarcalx, kalyazin
Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
---
Documentation/virt/kvm/api.rst | 24 ++++++++++++++++++++++++
1 file changed, 24 insertions(+)
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index e32471977d0a..f192dab41bad 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6442,6 +6442,30 @@ the capability to be present.
`flags` must currently be zero.
+4.144 KVM_GUEST_MEMFD_POPULATE
+------------------------------
+
+:Capability: KVM_CAP_GUEST_MEMFD
+:Architectures: none
+:Type: vm ioctl
+:Parameters: struct kvm_guest_memfd_populate (in)
+:Returns: 0 if all requested pages populated, < 0 on error
+
+KVM_GUEST_MEMFD_POPULATE populates guest_memfd with data provided by userspace.
+
+::
+
+ struct kvm_guest_memfd_populate {
+ __u64 gpa;
+ __u64 size;
+ void __user *from;
+ __u64 flags;
+ };
+
+A gfn can only be populated once. If an attempt is made to populate a gfn
+multiple times without an intervening fallocate(FALLOC_FL_PUNCH_HOLE),
+subsequent calls will fail with EEXIST.
+If the `from` pointer is NULL, the pages are cleared.
5. The kvm_run structure
========================
--
2.40.1
^ permalink raw reply related [flat|nested] 20+ messages in thread
* Re: [RFC PATCH 0/4] KVM: ioctl for populating guest_memfd
2024-10-24 9:54 [RFC PATCH 0/4] KVM: ioctl for populating guest_memfd Nikita Kalyazin
` (3 preceding siblings ...)
2024-10-24 9:54 ` [PATCH 4/4] KVM: document KVM_GUEST_MEMFD_POPULATE ioctl Nikita Kalyazin
@ 2024-11-20 12:09 ` Nikita Kalyazin
2024-11-20 13:46 ` David Hildenbrand
2024-11-20 13:55 ` Paolo Bonzini
5 siblings, 1 reply; 20+ messages in thread
From: Nikita Kalyazin @ 2024-11-20 12:09 UTC (permalink / raw)
To: pbonzini, corbet, kvm, linux-doc, linux-kernel
Cc: jthoughton, brijesh.singh, michael.roth, graf, jgowans, roypat,
derekmn, nsaenz, xmarcalx, David Hildenbrand, Sean Christopherson,
linux-mm
On 24/10/2024 10:54, Nikita Kalyazin wrote:
> [2] proposes an alternative to
> UserfaultFD for intercepting stage-2 faults, while this series
> conceptually complements it with the ability to populate guest memory
> backed by guest_memfd for `KVM_X86_SW_PROTECTED_VM` VMs.
+David
+Sean
+mm
While measuring memory population performance of guest_memfd using this
series, I noticed that guest_memfd population takes longer than my
baseline, which is filling anonymous private memory via UFFDIO_COPY.
I am using x86_64 for my measurements and 3 GiB memory region:
- anon/private UFFDIO_COPY: 940 ms
- guest_memfd: 1371 ms (+46%)
It turns out that the effect is observable not only for guest_memfd, but
also for any type of shared memory, e.g. memfd or anonymous memory
mapped as shared.
Below are measurements of a plain mmap(MAP_POPULATE) operation:
mmap(NULL, 3ll * (1 << 30), PROT_READ | PROT_WRITE, MAP_PRIVATE |
MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
vs
mmap(NULL, 3ll * (1 << 30), PROT_READ | PROT_WRITE, MAP_SHARED |
MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
Results:
- MAP_PRIVATE: 968 ms
- MAP_SHARED: 1646 ms
I am seeing this effect on a range of kernels. The oldest I used was
5.10, the newest is the current kvm-next (for-linus-2590-gd96c77bd4eeb).
When profiling with perf, I observe the following hottest operations
(kvm-next). Attaching full distributions at the end of the email.
MAP_PRIVATE:
- 19.72% clear_page_erms, rep stos %al,%es:(%rdi)
MAP_SHARED:
- 43.94% shmem_get_folio_gfp, lock orb $0x8,(%rdi), which is atomic
setting of the PG_uptodate bit
- 10.98% clear_page_erms, rep stos %al,%es:(%rdi)
Note that MAP_PRIVATE/do_anonymous_page calls __folio_mark_uptodate,
which sets the PG_uptodate bit non-atomically, while
MAP_SHARED/shmem_get_folio_gfp calls folio_mark_uptodate, which sets the
PG_uptodate bit atomically.
While this logic is intuitive, its performance effect is more
significant than I would expect.
The questions are:
- Is this a well-known behaviour?
- Is there a way to mitigate that, i.e. make shared memory (including
guest_memfd) population faster/comparable to private memory?
Nikita
Appendix: full call tree obtained via perf
MAP_PRIVATE:
- 87.97% __mmap
entry_SYSCALL_64_after_hwframe
do_syscall_64
vm_mmap_pgoff
__mm_populate
populate_vma_page_range
- __get_user_pages
- 77.94% handle_mm_fault
- 76.90% __handle_mm_fault
- 72.70% do_anonymous_page
- 31.92% vma_alloc_folio_noprof
- 30.74% alloc_pages_mpol_noprof
- 29.60% __alloc_pages_noprof
- 28.40% get_page_from_freelist
19.72% clear_page_erms
- 3.00% __rmqueue_pcplist
__mod_zone_page_state
1.18% _raw_spin_trylock
- 20.03% __pte_offset_map_lock
- 15.96% _raw_spin_lock
1.50% preempt_count_add
- 2.27% __pte_offset_map
__rcu_read_lock
- 7.22% __folio_batch_add_and_move
- 4.68% folio_batch_move_lru
- 3.77% lru_add
+ 0.95% __mod_zone_page_state
0.86% __mod_node_page_state
0.84% folios_put_refs
0.55% check_preemption_disabled
- 2.85% folio_add_new_anon_rmap
- __folio_mod_stat
__mod_node_page_state
- 1.15% pte_offset_map_nolock
__pte_offset_map
- 7.59% follow_page_pte
- 4.56% __pte_offset_map_lock
- 2.27% _raw_spin_lock
preempt_count_add
1.13% __pte_offset_map
0.75% folio_mark_accessed
MAP_SHARED:
- 77.89% __mmap
entry_SYSCALL_64_after_hwframe
do_syscall_64
vm_mmap_pgoff
__mm_populate
populate_vma_page_range
- __get_user_pages
- 72.11% handle_mm_fault
- 71.67% __handle_mm_fault
- 69.62% do_fault
- 44.61% __do_fault
- shmem_fault
- 43.94% shmem_get_folio_gfp
- 17.20% shmem_alloc_and_add_folio.constprop.0
- 5.10% shmem_alloc_folio
- 4.58% folio_alloc_mpol_noprof
- alloc_pages_mpol_noprof
- 4.00% __alloc_pages_noprof
- 3.31% get_page_from_freelist
1.24% __rmqueue_pcplist
- 5.07% shmem_add_to_page_cache
- 1.44% __mod_node_page_state
0.61% check_preemption_disabled
0.78% xas_store
0.74% xas_find_conflict
0.66% _raw_spin_lock_irq
- 3.96% __folio_batch_add_and_move
- 2.41% folio_batch_move_lru
1.88% lru_add
- 1.56% shmem_inode_acct_blocks
- 1.24% __dquot_alloc_space
- 0.77% inode_add_bytes
_raw_spin_lock
- 0.77% shmem_recalc_inode
_raw_spin_lock
10.98% clear_page_erms
- 1.17% filemap_get_entry
0.78% xas_load
- 20.26% filemap_map_pages
- 12.23% next_uptodate_folio
- 1.27% xas_find
xas_load
- 1.16% __pte_offset_map_lock
0.59% _raw_spin_lock
- 3.48% finish_fault
- 1.28% set_pte_range
0.96% folio_add_file_rmap_ptes
- 0.91% __pte_offset_map_lock
0.54% _raw_spin_lock
0.57% pte_offset_map_nolock
- 4.11% follow_page_pte
- 2.36% __pte_offset_map_lock
- 1.32% _raw_spin_lock
preempt_count_add
0.54% __pte_offset_map
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFC PATCH 0/4] KVM: ioctl for populating guest_memfd
2024-11-20 12:09 ` [RFC PATCH 0/4] KVM: ioctl for populating guest_memfd Nikita Kalyazin
@ 2024-11-20 13:46 ` David Hildenbrand
2024-11-20 15:13 ` David Hildenbrand
0 siblings, 1 reply; 20+ messages in thread
From: David Hildenbrand @ 2024-11-20 13:46 UTC (permalink / raw)
To: kalyazin, pbonzini, corbet, kvm, linux-doc, linux-kernel
Cc: jthoughton, brijesh.singh, michael.roth, graf, jgowans, roypat,
derekmn, nsaenz, xmarcalx, Sean Christopherson, linux-mm
On 20.11.24 13:09, Nikita Kalyazin wrote:
> On 24/10/2024 10:54, Nikita Kalyazin wrote:
>> [2] proposes an alternative to
>> UserfaultFD for intercepting stage-2 faults, while this series
>> conceptually complements it with the ability to populate guest memory
>> backed by guest_memfd for `KVM_X86_SW_PROTECTED_VM` VMs.
>
> +David
> +Sean
> +mm
Hi!
>
> While measuring memory population performance of guest_memfd using this
> series, I noticed that guest_memfd population takes longer than my
> baseline, which is filling anonymous private memory via UFFDIO_COPY.
>
> I am using x86_64 for my measurements and 3 GiB memory region:
> - anon/private UFFDIO_COPY: 940 ms
> - guest_memfd: 1371 ms (+46%)
>
> It turns out that the effect is observable not only for guest_memfd, but
> also for any type of shared memory, eg memfd or anonymous memory mapped
> as shared.
> Below are measurements of a plain mmap(MAP_POPULATE) operation:
>
> mmap(NULL, 3ll * (1 << 30), PROT_READ | PROT_WRITE, MAP_PRIVATE |
> MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
> vs
> mmap(NULL, 3ll * (1 << 30), PROT_READ | PROT_WRITE, MAP_SHARED |
> MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
>
> Results:
> - MAP_PRIVATE: 968 ms
> - MAP_SHARED: 1646 ms
At least here it is expected to some degree: as soon as the page cache
is involved, map/unmap gets slower, because we are effectively
maintaining two data structures (page tables + page cache) instead of
only a single one (page tables).
Can you make sure that THP/large folios don't interfere in your
experiments (e.g., madvise(MADV_NOHUGEPAGE))?
>
> I am seeing this effect on a range of kernels. The oldest I used was
> 5.10, the newest is the current kvm-next (for-linus-2590-gd96c77bd4eeb).
>
> When profiling with perf, I observe the following hottest operations
> (kvm-next). Attaching full distributions at the end of the email.
>
> MAP_PRIVATE:
> - 19.72% clear_page_erms, rep stos %al,%es:(%rdi)
>
> MAP_SHARED:
> - 43.94% shmem_get_folio_gfp, lock orb $0x8,(%rdi), which is atomic
> setting of the PG_uptodate bit
> - 10.98% clear_page_erms, rep stos %al,%es:(%rdi)
Interesting.
>
> Note that MAP_PRIVATE/do_anonymous_page calls __folio_mark_uptodate that
> sets the PG_uptodate bit regularly.
> , while MAP_SHARED/shmem_get_folio_gfp calls folio_mark_uptodate that
> sets the PG_uptodate bit atomically.
>
> While this logic is intuitive, its performance effect is more
> significant that I would expect.
Yes. How much of the performance difference would remain if you hack out
the atomic op just to play with it? I suspect there will still be some
difference.
>
> The questions are:
> - Is this a well-known behaviour?
> - Is there a way to mitigate that, ie make shared memory (including
> guest_memfd) population faster/comparable to private memory?
Likely. But your experiment above measures something different from what
guest_memfd vs. anon does: guest_memfd doesn't update page tables, so I
would assume guest_memfd will be faster than MAP_POPULATE.
How do you end up allocating memory for guest_memfd? Using simple
fallocate()?
Note that we might improve allocation times with guest_memfd when
allocating larger folios.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFC PATCH 0/4] KVM: ioctl for populating guest_memfd
2024-10-24 9:54 [RFC PATCH 0/4] KVM: ioctl for populating guest_memfd Nikita Kalyazin
` (4 preceding siblings ...)
2024-11-20 12:09 ` [RFC PATCH 0/4] KVM: ioctl for populating guest_memfd Nikita Kalyazin
@ 2024-11-20 13:55 ` Paolo Bonzini
2024-11-20 17:41 ` Nikita Kalyazin
5 siblings, 1 reply; 20+ messages in thread
From: Paolo Bonzini @ 2024-11-20 13:55 UTC (permalink / raw)
To: Nikita Kalyazin, corbet, kvm, linux-doc, linux-kernel
Cc: jthoughton, brijesh.singh, michael.roth, graf, jgowans, roypat,
derekmn, nsaenz, xmarcalx
On 10/24/24 11:54, Nikita Kalyazin wrote:
> Firecracker currently allows populating guest memory from a separate
> process via UserfaultFD [1]. This helps keep the VMM codebase and
> functionality concise and generic, while offloading the logic of
> obtaining guest memory to another process. UserfaultFD is currently not
> supported for guest_memfd, because it binds to a VMA, while guest_memfd
> does not need to (or cannot) be necessarily mapped to userspace,
> especially for private memory. [2] proposes an alternative to
> UserfaultFD for intercepting stage-2 faults, while this series
> conceptually complements it with the ability to populate guest memory
> backed by guest_memfd for `KVM_X86_SW_PROTECTED_VM` VMs.
>
> Patches 1-3 add a new ioctl, `KVM_GUEST_MEMFD_POPULATE`, that uses a
> vendor-agnostic implementation of `post_populate` callback.
>
> Patch 4 allows calling the ioctl from a separate (non-VMM) process. It
> has been prohibited by [3], but I have not been able to locate the exact
> justification for the requirement.
The justification is that the "struct kvm" has a long-lived tie to a
host process's address space.
Invoking ioctls like KVM_SET_USER_MEMORY_REGION and KVM_RUN from
different processes would make things very messy, because it is not
clear which mm you are working with: the MMU notifier is registered for
kvm->mm, but some functions, such as get_user_pages, do not take an mm
argument and always operate on current->mm.
In your case, it should be enough to add an ioctl on the guest_memfd
instead? But the real question is, what are you using
KVM_X86_SW_PROTECTED_VM for?
Paolo
> Questions:
> - Does exposing a generic population interface via ioctl look
> sensible in this form?
> - Is there a path where "only VMM can call KVM API" requirement is
> relaxed? If not, what is the recommended efficient alternative for
> populating guest memory from outside the VMM?
>
> [1]: https://github.com/firecracker-microvm/firecracker/blob/main/docs/snapshotting/handling-page-faults-on-snapshot-resume.md
> [2]: https://lore.kernel.org/kvm/CADrL8HUHRMwUPhr7jLLBgD9YLFAnVHc=N-C=8er-x6GUtV97pQ@mail.gmail.com/T/
> [3]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6d4e4c4fca5be806b888d606894d914847e82d78
>
> Nikita
>
> Nikita Kalyazin (4):
> KVM: guest_memfd: add generic post_populate callback
> KVM: add KVM_GUEST_MEMFD_POPULATE ioctl for guest_memfd
> KVM: allow KVM_GUEST_MEMFD_POPULATE in another mm
> KVM: document KVM_GUEST_MEMFD_POPULATE ioctl
>
> Documentation/virt/kvm/api.rst | 23 +++++++++++++++++++++++
> include/linux/kvm_host.h | 3 +++
> include/uapi/linux/kvm.h | 9 +++++++++
> virt/kvm/guest_memfd.c | 28 ++++++++++++++++++++++++++++
> virt/kvm/kvm_main.c | 19 ++++++++++++++++++-
> 5 files changed, 81 insertions(+), 1 deletion(-)
>
>
> base-commit: c8d430db8eec7d4fd13a6bea27b7086a54eda6da
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFC PATCH 0/4] KVM: ioctl for populating guest_memfd
2024-11-20 13:46 ` David Hildenbrand
@ 2024-11-20 15:13 ` David Hildenbrand
2024-11-20 15:58 ` Nikita Kalyazin
0 siblings, 1 reply; 20+ messages in thread
From: David Hildenbrand @ 2024-11-20 15:13 UTC (permalink / raw)
To: kalyazin, pbonzini, corbet, kvm, linux-doc, linux-kernel
Cc: jthoughton, brijesh.singh, michael.roth, graf, jgowans, roypat,
derekmn, nsaenz, xmarcalx, Sean Christopherson, linux-mm
>>
>> The questions are:
>> - Is this a well-known behaviour?
>> - Is there a way to mitigate that, ie make shared memory (including
>> guest_memfd) population faster/comparable to private memory?
>
> Likely. But your experiment measures above something different than what
> guest_memfd vs. anon does: guest_memfd doesn't update page tables, so I
> would assume guest_memfd will be faster than MAP_POPULATE.
>
> How do you end up allocating memory for guest_memfd? Using simple
> fallocate()?
Heh, now I spot that your comment was a reply to a series.
If your ioctl is supposed to do more than "allocating memory" like
MAP_POPULATE/MADV_POPULATE+*, then POPULATE is a suboptimal choice.
Because for allocating memory, we would want to use fallocate() instead.
I assume you want to "allocate+copy"?
I'll note that, as we're moving into the direction of moving
guest_memfd.c into mm/guestmem.c, we'll likely want to avoid "KVM_*"
ioctls, and think about something generic.
Any clue how your new ioctl will interact with the WIP to have shared
memory as part of guest_memfd? For example, could it be reasonable to
"populate" the shared memory first (via VMA) and then convert that
"allocated+filled" memory to private?
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFC PATCH 0/4] KVM: ioctl for populating guest_memfd
2024-11-20 15:13 ` David Hildenbrand
@ 2024-11-20 15:58 ` Nikita Kalyazin
2024-11-20 16:20 ` David Hildenbrand
0 siblings, 1 reply; 20+ messages in thread
From: Nikita Kalyazin @ 2024-11-20 15:58 UTC (permalink / raw)
To: David Hildenbrand, pbonzini, corbet, kvm, linux-doc, linux-kernel
Cc: jthoughton, brijesh.singh, michael.roth, graf, jgowans, roypat,
derekmn, nsaenz, xmarcalx, Sean Christopherson, linux-mm
On 20/11/2024 15:13, David Hildenbrand wrote:
> Hi!
Hi! :)
>> Results:
>> - MAP_PRIVATE: 968 ms
>> - MAP_SHARED: 1646 ms
>
> At least here it is expected to some degree: as soon as the page cache
> is involved map/unmap gets slower, because we are effectively
> maintaining two datastructures (page tables + page cache) instead of
> only a single one (page cache)
>
> Can you make sure that THP/large folios don't interfere in your
> experiments (e.g., madvise(MADV_NOHUGEPAGE))?
I was using the transparent_hugepage=never command-line argument in my testing.
$ cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]
Is that sufficient to exclude the THP/large folio factor?
>> While this logic is intuitive, its performance effect is more
>> significant that I would expect.
>
> Yes. How much of the performance difference would remain if you hack out
> the atomic op just to play with it? I suspect there will still be some
> difference.
I have tried that, but could not see any noticeable difference in the
overall results.
It looks like a big portion of the bottleneck has moved from
shmem_get_folio_gfp/folio_mark_uptodate to
finish_fault/__pte_offset_map_lock somehow. I have no good explanation
for why:
Orig:
- 69.62% do_fault
+ 44.61% __do_fault
+ 20.26% filemap_map_pages
+ 3.48% finish_fault
Hacked:
- 67.39% do_fault
+ 32.45% __do_fault
+ 21.87% filemap_map_pages
+ 11.97% finish_fault
Orig:
- 3.48% finish_fault
- 1.28% set_pte_range
0.96% folio_add_file_rmap_ptes
- 0.91% __pte_offset_map_lock
0.54% _raw_spin_lock
Hacked:
- 11.97% finish_fault
- 8.59% __pte_offset_map_lock
- 6.27% _raw_spin_lock
preempt_count_add
1.00% __pte_offset_map
- 1.28% set_pte_range
- folio_add_file_rmap_ptes
__mod_node_page_state
> Note that we might improve allocation times with guest_memfd when
> allocating larger folios.
I suppose it may not always be an option depending on requirements for
consistency of the allocation latency. E.g. if a large folio isn't
available at the time, the performance would degrade to the base case
(please correct me if I'm missing something).
> Heh, now I spot that your comment was as reply to a series.
Yeah, sorry if it wasn't obvious.
> If your ioctl is supposed to do more than "allocating memory" like
> MAP_POPULATE/MADV_POPULATE+* ... then POPULATE is a suboptimal choice.
> Because for allocating memory, we would want to use fallocate() instead.
> I assume you want to "allocate+copy"?
Yes, the ultimate use case is "allocate+copy".
> I'll note that, as we're moving into the direction of moving
> guest_memfd.c into mm/guestmem.c, we'll likely want to avoid "KVM_*"
> ioctls, and think about something generic.
Good point, thanks. Are we at the stage where some concrete API has
been proposed yet? I might have missed that.
> Any clue how your new ioctl will interact with the WIP to have shared
> memory as part of guest_memfd? For example, could it be reasonable to
> "populate" the shared memory first (via VMA) and then convert that
> "allocated+filled" memory to private?
No, I can't immediately see why it shouldn't work. My main concern
would probably still be about the latency of the population stage, as I
can't see why it would improve compared to what we have now, because my
feeling is that this is linked to the sharedness property of guest_memfd.
> Cheers,
>
> David / dhildenb
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFC PATCH 0/4] KVM: ioctl for populating guest_memfd
2024-11-20 15:58 ` Nikita Kalyazin
@ 2024-11-20 16:20 ` David Hildenbrand
2024-11-20 16:44 ` David Hildenbrand
0 siblings, 1 reply; 20+ messages in thread
From: David Hildenbrand @ 2024-11-20 16:20 UTC (permalink / raw)
To: kalyazin, pbonzini, corbet, kvm, linux-doc, linux-kernel
Cc: jthoughton, brijesh.singh, michael.roth, graf, jgowans, roypat,
derekmn, nsaenz, xmarcalx, Sean Christopherson, linux-mm
On 20.11.24 16:58, Nikita Kalyazin wrote:
>
>
> On 20/11/2024 15:13, David Hildenbrand wrote:
> > Hi!
>
> Hi! :)
>
> >> Results:
> >> - MAP_PRIVATE: 968 ms
> >> - MAP_SHARED: 1646 ms
> >
> > At least here it is expected to some degree: as soon as the page cache
> > is involved map/unmap gets slower, because we are effectively
> > maintaining two datastructures (page tables + page cache) instead of
> > only a single one (page cache)
> >
> > Can you make sure that THP/large folios don't interfere in your
> > experiments (e.g., madvise(MADV_NOHUGEPAGE))?
>
> I was using transparent_hugepage=never command line argument in my testing.
>
> $ cat /sys/kernel/mm/transparent_hugepage/enabled
> always madvise [never]
>
> Is that sufficient to exclude the THP/large folio factor?
Yes!
>
> >> While this logic is intuitive, its performance effect is more
> >> significant that I would expect.
> >
> > Yes. How much of the performance difference would remain if you hack out
> > the atomic op just to play with it? I suspect there will still be some
> > difference.
>
> I have tried that, but could not see any noticeable difference in the
> overall results.
>
> It looks like a big portion of the bottleneck has moved from
> shmem_get_folio_gfp/folio_mark_uptodate to
> finish_fault/__pte_offset_map_lock somehow. I have no good explanation
> for why:
That's what I assumed. The profiling results can be rather fuzzy and
misleading with micro-benchmarks. :(
>
> Orig:
> - 69.62% do_fault
> + 44.61% __do_fault
> + 20.26% filemap_map_pages
> + 3.48% finish_fault
> Hacked:
> - 67.39% do_fault
> + 32.45% __do_fault
> + 21.87% filemap_map_pages
> + 11.97% finish_fault
>
> Orig:
> - 3.48% finish_fault
> - 1.28% set_pte_range
> 0.96% folio_add_file_rmap_ptes
> - 0.91% __pte_offset_map_lock
> 0.54% _raw_spin_lock
> Hacked:
> - 11.97% finish_fault
> - 8.59% __pte_offset_map_lock
> - 6.27% _raw_spin_lock
> preempt_count_add
> 1.00% __pte_offset_map
> - 1.28% set_pte_range
> - folio_add_file_rmap_ptes
> __mod_node_page_state
>
> > Note that we might improve allocation times with guest_memfd when
> > allocating larger folios.
>
> I suppose it may not always be an option depending on requirements for
> consistency of the allocation latency. E.g. if a large folio isn't
> available at the time, the performance would degrade to the base case
> (please correct me if I'm missing something).
Yes, there are cons to that.
>
>> Heh, now I spot that your comment was as reply to a series.
>
> Yeah, sorry if it wasn't obvious.
>
>> If your ioctl is supposed to do more than "allocating memory" like
>> MAP_POPULATE/MADV_POPULATE+* ... then POPULATE is a suboptimal choice.
>> Because for allocating memory, we would want to use fallocate() instead.
>> I assume you want to "allocate+copy"?
>
> Yes, the ultimate use case is "allocate+copy".
>
>> I'll note that, as we're moving into the direction of moving
>> guest_memfd.c into mm/guestmem.c, we'll likely want to avoid "KVM_*"
>> ioctls, and think about something generic.
>
> Good point, thanks. Are we at the stage where some concrete API has
> been proposed yet? I might have missed that.
People are working on it, and we're figuring out some remaining details
(e.g., page_type to intercept folio_put()). I assume we'll see a new
RFC soonish (famous last words), but it's not been proposed yet.
>
>> Any clue how your new ioctl will interact with the WIP to have shared
>> memory as part of guest_memfd? For example, could it be reasonable to
>> "populate" the shared memory first (via VMA) and then convert that
>> "allocated+filled" memory to private?
>
> No, I can't immediately see why it shouldn't work. My main concern
> would probably still be about the latency of the population stage as I
> can't see why it would improve compared to what we have now, because my
> feeling is this is linked with the sharedness property of guest_memfd.
If the problem is the "pagecache" overhead, then yes, it will be a
harder nut to crack. But maybe there are some low-hanging fruits to
optimize? Finding the main cause for the added overhead would be
interesting.
--
Cheers,
David / dhildenb
* Re: [RFC PATCH 0/4] KVM: ioctl for populating guest_memfd
2024-11-20 16:20 ` David Hildenbrand
@ 2024-11-20 16:44 ` David Hildenbrand
2024-11-20 17:21 ` Nikita Kalyazin
0 siblings, 1 reply; 20+ messages in thread
From: David Hildenbrand @ 2024-11-20 16:44 UTC (permalink / raw)
To: kalyazin, pbonzini, corbet, kvm, linux-doc, linux-kernel
Cc: jthoughton, brijesh.singh, michael.roth, graf, jgowans, roypat,
derekmn, nsaenz, xmarcalx, Sean Christopherson, linux-mm
>> No, I can't immediately see why it shouldn't work. My main concern
>> would probably still be about the latency of the population stage as I
>> can't see why it would improve compared to what we have now, because my
> > feeling is this is linked with the sharedness property of guest_memfd.
>
> If the problem is the "pagecache" overhead, then yes, it will be a
> harder nut to crack. But maybe there are some low-hanging fruits to
> optimize? Finding the main cause for the added overhead would be
> interesting.
Can you compare uffdio_copy() when using anonymous memory vs. shmem?
That's likely the best we could currently achieve with guest_memfd.
There is the tools/testing/selftests/mm/uffd-stress benchmark, not sure
if that is of any help; it SEGFAULTS for me right now with a (likely)
division by 0.
--
Cheers,
David / dhildenb
* Re: [RFC PATCH 0/4] KVM: ioctl for populating guest_memfd
2024-11-20 16:44 ` David Hildenbrand
@ 2024-11-20 17:21 ` Nikita Kalyazin
2024-11-20 18:29 ` David Hildenbrand
0 siblings, 1 reply; 20+ messages in thread
From: Nikita Kalyazin @ 2024-11-20 17:21 UTC (permalink / raw)
To: David Hildenbrand, pbonzini, corbet, kvm, linux-doc, linux-kernel
Cc: jthoughton, brijesh.singh, michael.roth, graf, jgowans, roypat,
derekmn, nsaenz, xmarcalx, Sean Christopherson, linux-mm
On 20/11/2024 16:44, David Hildenbrand wrote:
>> If the problem is the "pagecache" overhead, then yes, it will be a
>> harder nut to crack. But maybe there are some low-hanging fruits to
>> optimize? Finding the main cause for the added overhead would be
>> interesting.
Agreed, knowing the exact root cause would be really nice.
> Can you compare uffdio_copy() when using anonymous memory vs. shmem?
> That's likely the best we could currently achieve with guest_memfd.
Yeah, I was doing that too. It was ~28% slower in my setup, while
with guest_memfd it was ~34% slower. The variance of the data was quite
high so the difference may well be just noise. In other words, I'd be
much happier if we could bring guest_memfd (or even shmem) performance
closer to the anon/private than if we just equalised guest_memfd with
shmem (which are probably already pretty close).
> There is the tools/testing/selftests/mm/uffd-stress benchmark, not sure
> if that is of any help; it SEGFAULTS for me right now with a (likely)
> division by 0.
Thanks for the pointer, will take a look!
> Cheers,
>
> David / dhildenb
>
* Re: [RFC PATCH 0/4] KVM: ioctl for populating guest_memfd
2024-11-20 13:55 ` Paolo Bonzini
@ 2024-11-20 17:41 ` Nikita Kalyazin
0 siblings, 0 replies; 20+ messages in thread
From: Nikita Kalyazin @ 2024-11-20 17:41 UTC (permalink / raw)
To: Paolo Bonzini, corbet, kvm, linux-doc, linux-kernel
Cc: jthoughton, brijesh.singh, michael.roth, graf, jgowans, roypat,
derekmn, nsaenz, xmarcalx
On 20/11/2024 13:55, Paolo Bonzini wrote:
>> Patch 4 allows calling the ioctl from a separate (non-VMM) process. This
>> has been prohibited since [3], but I have not been able to locate the exact
>> justification for the requirement.
>
> The justification is that the "struct kvm" has a long-lived tie to a
> host process's address space.
>
> Invoking ioctls like KVM_SET_USER_MEMORY_REGION and KVM_RUN from
> different processes would make things very messy, because it is not
> clear which mm you are working with: the MMU notifier is registered for
> kvm->mm, but some functions such as get_user_pages do not take an mm for
> example and always operate on current->mm.
That's fair, thanks for the explanation.
> In your case, it should be enough to add a ioctl on the guestmemfd
> instead?
That's right. That would be sufficient indeed. Is that something that
could be considered? Would that be some non-KVM API, with guest_memfd
moving to an mm library?
> But the real question is, what are you using
> KVM_X86_SW_PROTECTED_VM for?
The concrete use case is VM restoration from a snapshot in Firecracker
[1]. In the current setup, the VMM registers a UFFD against the guest
memory and sends the UFFD handle to an external process that knows how
to obtain the snapshotted memory. We would like to preserve the
semantics, but also remove the guest memory from the direct map [2].
Mimicking this with guest_memfd would mean sending some form of
guest_memfd handle to that process, which would use it to populate the
guest_memfd.
[1]:
https://github.com/firecracker-microvm/firecracker/blob/main/docs/snapshotting/handling-page-faults-on-snapshot-resume.md#userfaultfd
[2]:
https://lore.kernel.org/kvm/20241030134912.515725-1-roypat@amazon.co.uk/T/
> Paolo
* Re: [RFC PATCH 0/4] KVM: ioctl for populating guest_memfd
2024-11-20 17:21 ` Nikita Kalyazin
@ 2024-11-20 18:29 ` David Hildenbrand
2024-11-21 16:46 ` Nikita Kalyazin
0 siblings, 1 reply; 20+ messages in thread
From: David Hildenbrand @ 2024-11-20 18:29 UTC (permalink / raw)
To: kalyazin, pbonzini, corbet, kvm, linux-doc, linux-kernel
Cc: jthoughton, brijesh.singh, michael.roth, graf, jgowans, roypat,
derekmn, nsaenz, xmarcalx, Sean Christopherson, linux-mm
On 20.11.24 18:21, Nikita Kalyazin wrote:
>
>
> On 20/11/2024 16:44, David Hildenbrand wrote:
>>> If the problem is the "pagecache" overhead, then yes, it will be a
>>> harder nut to crack. But maybe there are some low-hanging fruits to
>>> optimize? Finding the main cause for the added overhead would be
>>> interesting.
>
> Agreed, knowing the exact root cause would be really nice.
>
>> Can you compare uffdio_copy() when using anonymous memory vs. shmem?
>> That's likely the best we could currently achieve with guest_memfd.
>
> Yeah, I was doing that too. It was about ~28% slower in my setup, while
> with guest_memfd it was ~34% slower.
I looked into uffdio_copy() for shmem and we still walk+modify page
tables. In theory, we could try hacking that out: for filling the
pagecache we would only need the vma properties, not the page table
properties; that would then really resemble "only modify the pagecache".
That would likely resemble what we would expect with guest_memfd: work
only on the pagecache and not the page tables. So it's rather surprising
that guest_memfd is slower than that, as it currently doesn't mess with
user page tables at all.
> The variance of the data was quite
> high so the difference may well be just noise. In other words, I'd be
> much happier if we could bring guest_memfd (or even shmem) performance
> closer to the anon/private than if we just equalised guest_memfd with
> shmem (which are probably already pretty close).
Makes sense. Best we can do is:
anon: work only on page tables
shmem/guest_memfd: work only on page cache
So at least "only one treelike structure to update".
--
Cheers,
David / dhildenb
* Re: [RFC PATCH 0/4] KVM: ioctl for populating guest_memfd
2024-11-20 18:29 ` David Hildenbrand
@ 2024-11-21 16:46 ` Nikita Kalyazin
2024-11-26 16:04 ` Nikita Kalyazin
0 siblings, 1 reply; 20+ messages in thread
From: Nikita Kalyazin @ 2024-11-21 16:46 UTC (permalink / raw)
To: David Hildenbrand, pbonzini, corbet, kvm, linux-doc, linux-kernel
Cc: jthoughton, brijesh.singh, michael.roth, graf, jgowans, roypat,
derekmn, nsaenz, xmarcalx, Sean Christopherson, linux-mm
On 20/11/2024 18:29, David Hildenbrand wrote:
> Any clue how your new ioctl will interact with the WIP to have shared
> memory as part of guest_memfd? For example, could it be reasonable to
> "populate" the shared memory first (via VMA) and then convert that
> "allocated+filled" memory to private?
Patrick and I synced internally on this. What may actually work for
guest_memfd population is the following.
Non-CoCo use case:
- fallocate syscall to fill the page cache, no page content
initialisation (like it is now)
- pwrite syscall to initialise the content + mark up-to-date (mark
prepared), no specific preparation logic is required
The pwrite will have "once" semantics until a subsequent
fallocate(FALLOC_FL_PUNCH_HOLE), ie the next pwrite call will "see" the
page is already prepared and return EIO/ENOSPC or something.
SEV-SNP use case (no changes):
- fallocate as above
- KVM_SEV_SNP_LAUNCH_UPDATE to initialise/prepare
We don't think fallocate/pwrite have dependencies on current->mm
assumptions that Paolo mentioned in [1], so they should be safe to be
called on guest_memfd from a non-VMM process.
[1]:
https://lore.kernel.org/kvm/20241024095429.54052-1-kalyazin@amazon.com/T/#m57498f8e2fde577ad1da948ec74dd2225cd2056c
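(To make the flow concrete, a rough userspace sketch of the allocate+copy sequence. It uses an ordinary memfd as a stand-in, since guest_memfd does not support pwrite today; populate_range() and the "once" failure mode are assumptions of this proposal, not existing behaviour:)

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Populate [offset, offset + len) of the fd with snapshot data.
 * Under the proposed semantics, a second populate of the same range
 * would fail until the range is hole-punched again. */
static int populate_range(int fd, off_t offset, const void *data, size_t len)
{
	/* Reserve backing pages without initialising content. */
	if (fallocate(fd, 0, offset, len))
		return -1;
	/* Initialise the content; this is where guest_memfd would also
	 * mark the pages up-to-date (prepared). */
	ssize_t n = pwrite(fd, data, len, offset);
	return n == (ssize_t)len ? 0 : -1;
}
```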
> Makes sense. Best we can do is:
>
> anon: work only on page tables
> shmem/guest_memfd: work only on page cache
>
> So at least "only one treelike structure to update".
This seems to hold with the above reasoning.
> --
> Cheers,
>
> David / dhildenb
* Re: [PATCH 1/4] KVM: guest_memfd: add generic post_populate callback
2024-10-24 9:54 ` [PATCH 1/4] KVM: guest_memfd: add generic post_populate callback Nikita Kalyazin
@ 2024-11-22 18:40 ` Mike Day
2024-11-25 11:46 ` Nikita Kalyazin
0 siblings, 1 reply; 20+ messages in thread
From: Mike Day @ 2024-11-22 18:40 UTC (permalink / raw)
To: Nikita Kalyazin, pbonzini, corbet, kvm, linux-doc, linux-kernel
Cc: jthoughton, brijesh.singh, michael.roth, graf, jgowans, roypat,
derekmn, nsaenz, xmarcalx
On 10/24/24 04:54, Nikita Kalyazin wrote:
> This adds a generic implementation of the `post_populate` callback for
> the `kvm_gmem_populate`. The only thing it does is populates the pages
> with data provided by userspace if the user pointer is not NULL,
> otherwise it clears the pages.
> This is supposed to be used by KVM_X86_SW_PROTECTED_VM VMs.
>
> Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
> ---
> virt/kvm/guest_memfd.c | 21 +++++++++++++++++++++
> 1 file changed, 21 insertions(+)
>
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 8f079a61a56d..954312fac462 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -620,6 +620,27 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> EXPORT_SYMBOL_GPL(kvm_gmem_get_pfn);
>
> #ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
KVM_AMD_SEV can select KVM_GENERIC_PRIVATE_MEM, so to guarantee this is only for
software protection it might be good to use:
#if defined CONFIG_KVM_GENERIC_PRIVATE_MEM && !defined CONFIG_KVM_AMD_SEV
That could end up too verbose so there should probably be some more concise mechanism
to guarantee this generic callback isn't used for a hardware-protected guest.
Mike
> +static int kvm_gmem_post_populate_generic(struct kvm *kvm, gfn_t gfn_start, kvm_pfn_t pfn,
> + void __user *src, int order, void *opaque)
> +{
> + int ret = 0, i;
> + int npages = (1 << order);
> + gfn_t gfn;
> +
> + if (src) {
> + void *vaddr = kmap_local_pfn(pfn);
> +
> + ret = copy_from_user(vaddr, src, npages * PAGE_SIZE);
> + if (ret)
> + ret = -EINVAL;
> + kunmap_local(vaddr);
> + } else
> + for (gfn = gfn_start, i = 0; gfn < gfn_start + npages; gfn++, i++)
> + clear_highpage(pfn_to_page(pfn + i));
> +
> + return ret;
> +}
> +
> long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long npages,
> kvm_gmem_populate_cb post_populate, void *opaque)
> {
* Re: [PATCH 1/4] KVM: guest_memfd: add generic post_populate callback
2024-11-22 18:40 ` Mike Day
@ 2024-11-25 11:46 ` Nikita Kalyazin
0 siblings, 0 replies; 20+ messages in thread
From: Nikita Kalyazin @ 2024-11-25 11:46 UTC (permalink / raw)
To: michael.day, pbonzini, corbet, kvm, linux-doc, linux-kernel
Cc: jthoughton, brijesh.singh, michael.roth, graf, jgowans, roypat,
derekmn, nsaenz, xmarcalx
On 22/11/2024 18:40, Mike Day wrote:
> On 10/24/24 04:54, Nikita Kalyazin wrote:
>> This adds a generic implementation of the `post_populate` callback for
>> the `kvm_gmem_populate`. The only thing it does is populates the pages
>> with data provided by userspace if the user pointer is not NULL,
>> otherwise it clears the pages.
>> This is supposed to be used by KVM_X86_SW_PROTECTED_VM VMs.
>>
>> Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
>> ---
>> virt/kvm/guest_memfd.c | 21 +++++++++++++++++++++
>> 1 file changed, 21 insertions(+)
>>
>> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
>> index 8f079a61a56d..954312fac462 100644
>> --- a/virt/kvm/guest_memfd.c
>> +++ b/virt/kvm/guest_memfd.c
>> @@ -620,6 +620,27 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct
>> kvm_memory_slot *slot,
>> EXPORT_SYMBOL_GPL(kvm_gmem_get_pfn);
>>
>> #ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
>
> KVM_AMD_SEV can select KVM_GENERIC_PRIVATE_MEM, so to guarantee this is
> only for
> software protection it might be good to use:
>
> #if defined CONFIG_KVM_GENERIC_PRIVATE_MEM && !defined CONFIG_KVM_AMD_SEV
>
> That could end up too verbose so there should probably be some more
> concise mechanism
> to guarantee this generic callback isn't used for a hardware-protected
> guest.
Thanks, will make a note for myself for the next iteration.
>
> Mike
* Re: [RFC PATCH 0/4] KVM: ioctl for populating guest_memfd
2024-11-21 16:46 ` Nikita Kalyazin
@ 2024-11-26 16:04 ` Nikita Kalyazin
2024-11-28 12:11 ` David Hildenbrand
0 siblings, 1 reply; 20+ messages in thread
From: Nikita Kalyazin @ 2024-11-26 16:04 UTC (permalink / raw)
To: David Hildenbrand, pbonzini, corbet, kvm, linux-doc, linux-kernel
Cc: jthoughton, brijesh.singh, michael.roth, graf, jgowans, roypat,
derekmn, nsaenz, xmarcalx, Sean Christopherson, linux-mm
On 21/11/2024 16:46, Nikita Kalyazin wrote:
>
>
> On 20/11/2024 18:29, David Hildenbrand wrote:
> > Any clue how your new ioctl will interact with the WIP to have shared
> > memory as part of guest_memfd? For example, could it be reasonable to
> > "populate" the shared memory first (via VMA) and then convert that
> > "allocated+filled" memory to private?
>
> Patrick and I synced internally on this. What may actually work for
> guest_memfd population is the following.
>
> Non-CoCo use case:
> - fallocate syscall to fill the page cache, no page content
> initialisation (like it is now)
> - pwrite syscall to initialise the content + mark up-to-date (mark
> prepared), no specific preparation logic is required
>
> The pwrite will have "once" semantics until a subsequent
> fallocate(FALLOC_FL_PUNCH_HOLE), ie the next pwrite call will "see" the
> page is already prepared and return EIO/ENOSPC or something.
I prototyped that to see if it was possible (and it was). Actually the
write syscall can also do the allocation part, so no prior fallocate
would be required. The only caveat is the cap on how much IO can
be done in a single call (MAX_RW_COUNT) [1], but it doesn't look like a
significant problem. Does it sound like an acceptable solution?
[1]: https://elixir.bootlin.com/linux/v6.12.1/source/fs/read_write.c#L507
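(MAX_RW_COUNT only caps a single call, so on the userspace side the usual short-write loop covers arbitrarily large populations. A sketch — pwrite_all() is a hypothetical helper, shown against a regular fd:)

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Write the whole buffer at offset, retrying on short writes
 * (e.g. when len exceeds MAX_RW_COUNT) and on EINTR. */
static int pwrite_all(int fd, const void *buf, size_t len, off_t offset)
{
	const char *p = buf;

	while (len > 0) {
		ssize_t n = pwrite(fd, p, len, offset);

		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		p += n;
		offset += n;
		len -= (size_t)n;
	}
	return 0;
}
```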
>
> SEV-SNP use case (no changes):
> - fallocate as above
> - KVM_SEV_SNP_LAUNCH_UPDATE to initialise/prepare
>
> We don't think fallocate/pwrite have dependencies on current->mm
> assumptions that Paolo mentioned in [1], so they should be safe to be
> called on guest_memfd from a non-VMM process.
>
> [1]: https://lore.kernel.org/kvm/20241024095429.54052-1-
> kalyazin@amazon.com/T/#m57498f8e2fde577ad1da948ec74dd2225cd2056c
>
> > Makes sense. Best we can do is:
> >
> > anon: work only on page tables
> > shmem/guest_memfd: work only on page cache
> >
> > So at least "only one treelike structure to update".
>
> This seems to hold with the above reasoning.
>
> > --
> > Cheers,
> >
> > David / dhildenb
>
* Re: [RFC PATCH 0/4] KVM: ioctl for populating guest_memfd
2024-11-26 16:04 ` Nikita Kalyazin
@ 2024-11-28 12:11 ` David Hildenbrand
0 siblings, 0 replies; 20+ messages in thread
From: David Hildenbrand @ 2024-11-28 12:11 UTC (permalink / raw)
To: kalyazin, pbonzini, corbet, kvm, linux-doc, linux-kernel
Cc: jthoughton, brijesh.singh, michael.roth, graf, jgowans, roypat,
derekmn, nsaenz, xmarcalx, Sean Christopherson, linux-mm
On 26.11.24 17:04, Nikita Kalyazin wrote:
>
>
> On 21/11/2024 16:46, Nikita Kalyazin wrote:
>>
>>
>> On 20/11/2024 18:29, David Hildenbrand wrote:
>> > Any clue how your new ioctl will interact with the WIP to have shared
>> > memory as part of guest_memfd? For example, could it be reasonable to
>> > "populate" the shared memory first (via VMA) and then convert that
>> > "allocated+filled" memory to private?
>>
>> Patrick and I synced internally on this. What may actually work for
>> guest_memfd population is the following.
>>
>> Non-CoCo use case:
>> - fallocate syscall to fill the page cache, no page content
>> initialisation (like it is now)
>> - pwrite syscall to initialise the content + mark up-to-date (mark
>> prepared), no specific preparation logic is required
>>
>> The pwrite will have "once" semantics until a subsequent
>> fallocate(FALLOC_FL_PUNCH_HOLE), ie the next pwrite call will "see" the
>> page is already prepared and return EIO/ENOSPC or something.
>
> I prototyped that to see if it was possible (and it was). Actually the
> write syscall can also do the allocation part, so no prior fallocate
> would be required.
Right
> The only thing is there is a cap on how much IO can
> be done in a single call (MAX_RW_COUNT) [1], but it doesn't look like a
> significant problem. Does it sound like an acceptable solution?
Does sound quite clean to me. Of course, one thing to figure out is how
to enable this only for that special VM type, but that should be
resolvable.
--
Cheers,
David / dhildenb
end of thread, other threads:[~2024-11-28 12:11 UTC | newest]
Thread overview: 20+ messages:
2024-10-24 9:54 [RFC PATCH 0/4] KVM: ioctl for populating guest_memfd Nikita Kalyazin
2024-10-24 9:54 ` [PATCH 1/4] KVM: guest_memfd: add generic post_populate callback Nikita Kalyazin
2024-11-22 18:40 ` Mike Day
2024-11-25 11:46 ` Nikita Kalyazin
2024-10-24 9:54 ` [PATCH 2/4] KVM: add KVM_GUEST_MEMFD_POPULATE ioctl for guest_memfd Nikita Kalyazin
2024-10-24 9:54 ` [PATCH 3/4] KVM: allow KVM_GUEST_MEMFD_POPULATE in another mm Nikita Kalyazin
2024-10-24 9:54 ` [PATCH 4/4] KVM: document KVM_GUEST_MEMFD_POPULATE ioctl Nikita Kalyazin
2024-11-20 12:09 ` [RFC PATCH 0/4] KVM: ioctl for populating guest_memfd Nikita Kalyazin
2024-11-20 13:46 ` David Hildenbrand
2024-11-20 15:13 ` David Hildenbrand
2024-11-20 15:58 ` Nikita Kalyazin
2024-11-20 16:20 ` David Hildenbrand
2024-11-20 16:44 ` David Hildenbrand
2024-11-20 17:21 ` Nikita Kalyazin
2024-11-20 18:29 ` David Hildenbrand
2024-11-21 16:46 ` Nikita Kalyazin
2024-11-26 16:04 ` Nikita Kalyazin
2024-11-28 12:11 ` David Hildenbrand
2024-11-20 13:55 ` Paolo Bonzini
2024-11-20 17:41 ` Nikita Kalyazin