From: Sean Christopherson <seanjc@google.com>
To: Quentin Perret <qperret@google.com>
Cc: Andy Lutomirski <luto@kernel.org>,
Steven Price <steven.price@arm.com>,
Chao Peng <chao.p.peng@linux.intel.com>,
kvm list <kvm@vger.kernel.org>,
Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
Linux API <linux-api@vger.kernel.org>,
qemu-devel@nongnu.org, Paolo Bonzini <pbonzini@redhat.com>,
Jonathan Corbet <corbet@lwn.net>,
Vitaly Kuznetsov <vkuznets@redhat.com>,
Wanpeng Li <wanpengli@tencent.com>,
Jim Mattson <jmattson@google.com>, Joerg Roedel <joro@8bytes.org>,
Thomas Gleixner <tglx@linutronix.de>,
Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
the arch/x86 maintainers <x86@kernel.org>,
"H. Peter Anvin" <hpa@zytor.com>, Hugh Dickins <hughd@google.com>,
Jeff Layton <jlayton@kernel.org>,
"J . Bruce Fields" <bfields@fieldses.org>,
Andrew Morton <akpm@linux-foundation.org>,
Mike Rapoport <rppt@kernel.org>,
"Maciej S . Szmigiero" <mail@maciej.szmigiero.name>,
Vlastimil Babka <vbabka@suse.cz>,
Vishal Annapurve <vannapurve@google.com>,
Yu Zhang <yu.c.zhang@linux.intel.com>,
"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
"Nakajima, Jun" <jun.nakajima@intel.com>,
Dave Hansen <dave.hansen@intel.com>,
Andi Kleen <ak@linux.intel.com>,
David Hildenbrand <david@redhat.com>,
Marc Zyngier <maz@kernel.org>, Will Deacon <will@kernel.org>
Subject: Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
Date: Tue, 5 Apr 2022 18:03:21 +0000
Message-ID: <YkyEaYiL0BrDYcZv@google.com>
In-Reply-To: <Ykwbqv90C7+8K+Ao@google.com>
On Tue, Apr 05, 2022, Quentin Perret wrote:
> On Monday 04 Apr 2022 at 15:04:17 (-0700), Andy Lutomirski wrote:
> > >> - it can be very useful for protected VMs to do shared=>private
> > >> conversions. Think of a VM receiving some data from the host in a
> > >> shared buffer, and then it wants to operate on that buffer without
> > >> risking leaking confidential information in a transient state. In
> > >> that case the most logical thing to do is to convert the buffer back
> > >> to private, do whatever needs to be done on that buffer (decrypting a
> > >> frame, ...), and then share it back with the host to consume it;
> > >
> > > If performance is a motivation, why would the guest want to do two
> > > conversions instead of just doing internal memcpy() to/from a private
> > > page? I would be quite surprised if multiple exits and TLB shootdowns are
> > > actually faster, especially at any kind of scale where zapping stage-2
> > > PTEs will cause lock contention and IPIs.
> >
> > I don't know the numbers or all the details, but this is arm64, which is a
> > rather better architecture than x86 in this regard. So maybe it's not so
> > bad, at least in very simple cases, ignoring all implementation details.
> > (But see below.) Also the systems in question tend to have fewer CPUs than
> > some of the massive x86 systems out there.
>
> Yep. I can try and do some measurements if that's really necessary, but
> I'm really convinced the cost of the TLBI for the shared->private
> conversion is going to be significantly smaller than the cost of
> memcpy'ing the buffer twice in the guest for us.
It's not just the TLB shootdown; the VM-Exits aren't free either. And barring non-trivial
improvements to KVM's MMU, e.g. sharding of mmu_lock, modifying the page tables will
block all other updates and MMU operations. Taking mmu_lock for read, should arm64
ever convert to a rwlock, is not an option because KVM needs to block other
conversions to avoid races.

Hmm, though batching multiple pages into a single request would mitigate most of
the overhead.
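
To illustrate the batching idea, here is a rough sketch of coalescing a list
of guest frame numbers into contiguous ranges, so that each range could be
converted with one request (one exit, one mmu_lock acquisition) instead of one
per page. Nothing here is from the posted series: struct gfn_range and
coalesce_gfns() are hypothetical names, and the input is assumed to be sorted
in ascending order with no duplicates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one batched conversion request. */
struct gfn_range {
	uint64_t start;    /* first guest frame number in the range */
	uint64_t nr_pages; /* number of contiguous pages */
};

/*
 * Coalesce gfns[0..n) into contiguous ranges.  Assumes the input is sorted
 * ascending without duplicates and that 'out' has room for n entries.
 * Returns the number of ranges written, i.e. the number of requests needed.
 */
static size_t coalesce_gfns(const uint64_t *gfns, size_t n,
			    struct gfn_range *out)
{
	size_t nr_ranges = 0;

	assert(n == 0 || (gfns != NULL && out != NULL));

	for (size_t i = 0; i < n; i++) {
		struct gfn_range *last = nr_ranges ? &out[nr_ranges - 1] : NULL;

		if (last && last->start + last->nr_pages == gfns[i]) {
			/* Extends the previous range, no extra request. */
			last->nr_pages++;
		} else {
			/* Gap in the gfn space, start a new request. */
			out[nr_ranges].start = gfns[i];
			out[nr_ranges].nr_pages = 1;
			nr_ranges++;
		}
	}
	return nr_ranges;
}
```

With a mostly contiguous buffer, the number of requests drops from the number
of pages to the number of ranges. The per-page TLB invalidation work is still
there, but the exits and lock acquisitions are amortized.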
> There are variations of that idea: e.g. allow userspace to mmap the
> entire private fd but w/o taking a reference on pages mapped with
> PROT_NONE. And then the VMM can use mprotect() in response to
> share/unshare requests. I think Marc liked that idea as it keeps the
> userspace API closer to normal KVM -- there actually is a
> straightforward gpa->hva relation. Not sure how much that would impact
> the implementation at this point.
>
> For the shared=>private conversion, this would be something like so:
>
> - the guest issues a hypercall to unshare a page;
>
> - the hypervisor forwards the request to the host;
>
> - the host kernel forwards the request to userspace;
>
> - userspace then munmap()s the shared page;
>
> - KVM then tries to take a reference to the page. If it succeeds, it
> re-enters the guest with a flag of some sort saying that the share
> succeeded, and the hypervisor will adjust pgtables accordingly. If
> KVM failed to take a reference, it flags this and the hypervisor will
> be responsible for communicating that back to the guest. This means
> the guest must handle failures (possibly fatal).
>
> (There are probably many ways in which we can optimize this, e.g. by
> having the host proactively munmap() pages it no longer needs so that
> the unshare hypercall from the guest doesn't need to exit all the way
> back to host userspace.)
...
> > Maybe there could be a special mode for the private memory fds in which
> > specific pages are marked as "managed by this fd but actually shared".
> > pread() and pwrite() would work on those pages, but not mmap(). (Or maybe
> > mmap() but the resulting mappings would not permit GUP.)
Unless I misunderstand what you intend by pread()/pwrite(), I think we'd need to
allow mmap(), otherwise e.g. uaccess from the kernel wouldn't work.
> > And transitioning them would be a special operation on the fd that is
> > specific to pKVM and wouldn't work on TDX or SEV.
To keep things feature agnostic (IMO, baking TDX vs SEV vs pKVM info into private-fd
is a really bad idea), this could be handled by adding a flag and/or callback into
the notifier/client stating whether or not it supports mapping a private-fd, and then
mapping would be allowed if and only if all consumers support/allow mapping.
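
As a rough sketch of what I mean (all names here are made up for
illustration, none of them exist in the posted series): each consumer
advertises a property flag, and the backing store allows mmap() only if every
registered consumer sets it. What to do when no consumer is registered is an
open choice; this sketch permits mapping in that case.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical property flag, not part of the posted series. */
#define PRIVATE_FD_CONSUMER_ALLOWS_MMAP (1u << 0)

/* Hypothetical stand-in for a notifier/client registered on a private fd. */
struct private_fd_consumer {
	unsigned int flags;
};

/*
 * Mapping is permitted if and only if every registered consumer allows it.
 * The check is on a generic property, so the backing store needs no
 * knowledge of TDX vs. SEV vs. pKVM.
 */
static bool private_fd_may_mmap(const struct private_fd_consumer *consumers,
				size_t nr_consumers)
{
	assert(nr_consumers == 0 || consumers != NULL);

	for (size_t i = 0; i < nr_consumers; i++) {
		if (!(consumers[i].flags & PRIVATE_FD_CONSUMER_ALLOWS_MMAP))
			return false;
	}
	return true;
}
```

A single consumer that clears the flag is enough to veto mapping for the whole
fd, which is what makes this scale to multiple consumers.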
> > Hmm. Sean and Chao, are we making a bit of a mistake by making these fds
> > technology-agnostic? That is, would we want to distinguish between a TDX
> > backing fd, a SEV backing fd, a software-based backing fd, etc? API-wise
> > this could work by requiring the fd to be bound to a KVM VM instance and
> > possibly even configured a bit before any other operations would be
> > allowed.
I really don't want to distinguish between each exact feature, but I've
no objection to adding flags/callbacks to track specific properties of the
downstream consumers, e.g. "can this memory be accessed by userspace" is a fine
abstraction. It also scales to multiple consumers (see above).