From: Quentin Perret <qperret@google.com>
To: David Hildenbrand <david@redhat.com>
Cc: Will Deacon <will@kernel.org>,
Sean Christopherson <seanjc@google.com>,
Vishal Annapurve <vannapurve@google.com>,
Matthew Wilcox <willy@infradead.org>,
Fuad Tabba <tabba@google.com>,
kvm@vger.kernel.org, kvmarm@lists.linux.dev, pbonzini@redhat.com,
chenhuacai@kernel.org, mpe@ellerman.id.au, anup@brainfault.org,
paul.walmsley@sifive.com, palmer@dabbelt.com,
aou@eecs.berkeley.edu, viro@zeniv.linux.org.uk,
brauner@kernel.org, akpm@linux-foundation.org,
xiaoyao.li@intel.com, yilun.xu@intel.com,
chao.p.peng@linux.intel.com, jarkko@kernel.org,
amoorthy@google.com, dmatlack@google.com,
yu.c.zhang@linux.intel.com, isaku.yamahata@intel.com,
mic@digikod.net, vbabka@suse.cz, ackerleytng@google.com,
mail@maciej.szmigiero.name, michael.roth@amd.com,
wei.w.wang@intel.com, liam.merwick@oracle.com,
isaku.yamahata@gmail.com, kirill.shutemov@linux.intel.com,
suzuki.poulose@arm.com, steven.price@arm.com,
quic_mnalajal@quicinc.com, quic_tsoni@quicinc.com,
quic_svaddagi@quicinc.com, quic_cvanscha@quicinc.com,
quic_pderrin@quicinc.com, quic_pheragu@quicinc.com,
catalin.marinas@arm.com, james.morse@arm.com,
yuzenghui@huawei.com, oliver.upton@linux.dev, maz@kernel.org,
keirf@google.com, linux-mm@kvack.org
Subject: Re: folio_mmapped
Date: Thu, 28 Mar 2024 10:10:20 +0000 [thread overview]
Message-ID: <ZgVCDPoQbbXjTBQp@google.com> (raw)
In-Reply-To: <d0500f89-df3b-42cd-aa5a-5b3005f67638@redhat.com>
Hey David,
I'll try to pick up the baton while Will is away :-)
On Thursday 28 Mar 2024 at 10:06:52 (+0100), David Hildenbrand wrote:
> On 27.03.24 20:34, Will Deacon wrote:
> > I suppose the main thing is that the architecture backend can deal with
> > these states, so the core code shouldn't really care as long as it's
> > aware that shared memory may be pinned.
>
> So IIUC, the states are:
>
> (1) Private: inaccesible by the host, accessible by the guest, "owned by
> the guest"
>
> (2) Host Shared: accessible by the host + guest, "owned by the host"
>
> (3) Guest Shared: accessible by the host, "owned by the guest"
Yup.
> Memory ballooning is simply transitioning from (3) to (2), and then
> discarding the memory.
Well, not quite actually, see below.
> Any state I am missing?
So there is probably state (0) which is 'owned only by the host'. It's a
bit obvious, but I'll make it explicit because it has its importance for
the rest of the discussion.
And while at it, there are other cases (memory shared/owned with/by the
hypervisor and/or TrustZone) but they're somewhat irrelevant to this
discussion. These pages are usually backed by kernel allocations, so
much less problematic to deal with. So let's ignore those.
> Which transitions are possible?
Basically a page must be in the 'exclusively owned' state for an owner
to initiate a share or donation. So e.g. a shared page must be unshared
before it can be donated to someone else (that is true regardless of the
owner, host, guest, hypervisor, ...). That simplifies significantly the
state tracking in pKVM.
> (1) <-> (2) ? Not sure if the direct transition is possible.
Yep, not possible.
> (2) <-> (3) ? IIUC yes.
Actually it's not directly possible as is. The ballooning procedure is
essentially a (1) -> (0) transition. (We also tolerate (3) -> (0) in a
single hypercall when doing ballooning, but it's technically just a
(3) -> (1) -> (0) sequence that has been micro-optimized).
Note that state (2) is actually never used for protected VMs. It's
mainly used to implement standard non-protected VMs. The biggest
difference in pKVM between protected and non-protected VMs is basically
that in the former case, in the fault path KVM does a (0) -> (1)
transition, but in the latter it's (0) -> (2). That implies that in the
unprotected case, the host remains the page owner and is allowed to
decide to unshare arbitrary pages, to restrict the guest permissions for
the shared pages etc, which paves the way for implementing migration,
swap, ... relatively easily.
> (1) <-> (3) ? IIUC yes.
Yep.
<snip>
> > I agree on all of these and, yes, (3) is the problem for us. We've also
> > been thinking a bit about CoW recently and I suspect the use of
> > vm_normal_page() in do_wp_page() could lead to issues similar to those
> > we hit with GUP. There are various ways to approach that, but I'm not
> > sure what's best.
>
> Would COW be required or is that just the nasty side-effect of trying to use
> anonymous memory?
That'd qualify as an undesirable side effect I think.
> >
> > > I'm curious, may there be a requirement in the future that shared memory
> > > could be mapped into other processes? (thinking vhost-user and such things).
> >
> > It's not impossible. We use crosvm as our VMM, and that has a
> > multi-process sandbox mode which I think relies on just that...
> >
>
> Okay, so basing the design on anonymous memory might not be the best choice
> ... :/
So, while we're at this stage, let me throw another idea at the wall to
see if it sticks :-)
One observation is that a standard memfd would work relatively well for
pKVM if we had a way to enforce that all mappings to it are MAP_SHARED.
KVM would still need to take an 'exclusive GUP' from the fault path
(which may fail in case of a pre-existing GUP, but that's fine), but
then CoW and friends largely become a non-issue by construction I think.
Is there any way we could enforce that cleanly? Perhaps introducing a
sort of 'mmap notifier' would do the trick? By that I mean something a
bit similar to an MMU notifier offered by memfd that KVM could register
against whenever the memfd is attached to a protected VM memslot.
One of the nice things here is that we could retain an entire mapping of
the whole of guest memory in userspace, conversions wouldn't require any
additional efforts from userspace. A bad thing is that a process that is
being passed such a memfd may not expect the new semantic and the
inability to map !MAP_SHARED. But I guess a process that receives a
handle to private memory must be enlightened regardless of the type of
fd, so maybe it's not so bad.
Thoughts?
Thanks,
Quentin
next prev parent reply other threads:[~2024-03-28 10:10 UTC|newest]
Thread overview: 48+ messages / expand[flat|nested] mbox.gz Atom feed top
[not found] <20240222161047.402609-1-tabba@google.com>
[not found] ` <20240222141602976-0800.eberman@hu-eberman-lv.qualcomm.com>
2024-02-23 0:35 ` folio_mmapped Matthew Wilcox
2024-02-26 9:28 ` folio_mmapped David Hildenbrand
2024-02-26 21:14 ` folio_mmapped Elliot Berman
2024-02-27 14:59 ` folio_mmapped David Hildenbrand
2024-02-28 10:48 ` folio_mmapped Quentin Perret
2024-02-28 11:11 ` folio_mmapped David Hildenbrand
2024-02-28 12:44 ` folio_mmapped Quentin Perret
2024-02-28 13:00 ` folio_mmapped David Hildenbrand
2024-02-28 13:34 ` folio_mmapped Quentin Perret
2024-02-28 18:43 ` folio_mmapped Elliot Berman
2024-02-28 18:51 ` Quentin Perret
2024-02-29 10:04 ` folio_mmapped David Hildenbrand
2024-02-29 19:01 ` folio_mmapped Fuad Tabba
2024-03-01 0:40 ` folio_mmapped Elliot Berman
2024-03-01 11:16 ` folio_mmapped David Hildenbrand
2024-03-04 12:53 ` folio_mmapped Quentin Perret
2024-03-04 20:22 ` folio_mmapped David Hildenbrand
2024-03-01 11:06 ` folio_mmapped David Hildenbrand
2024-03-04 12:36 ` folio_mmapped Quentin Perret
2024-03-04 19:04 ` folio_mmapped Sean Christopherson
2024-03-04 20:17 ` folio_mmapped David Hildenbrand
2024-03-04 21:43 ` folio_mmapped Elliot Berman
2024-03-04 21:58 ` folio_mmapped David Hildenbrand
2024-03-19 9:47 ` folio_mmapped Quentin Perret
2024-03-19 9:54 ` folio_mmapped David Hildenbrand
2024-03-18 17:06 ` folio_mmapped Vishal Annapurve
2024-03-18 22:02 ` folio_mmapped David Hildenbrand
[not found] ` <CAGtprH8B8y0Khrid5X_1twMce7r-Z7wnBiaNOi-QwxVj4D+L3w@mail.gmail.com>
2024-03-19 0:10 ` folio_mmapped Sean Christopherson
2024-03-19 10:26 ` folio_mmapped David Hildenbrand
2024-03-19 13:19 ` folio_mmapped David Hildenbrand
2024-03-19 14:31 ` folio_mmapped Will Deacon
2024-03-19 23:54 ` folio_mmapped Elliot Berman
2024-03-22 16:36 ` Will Deacon
2024-03-22 18:46 ` Elliot Berman
2024-03-27 19:31 ` Will Deacon
[not found] ` <2d6fc3c0-a55b-4316-90b8-deabb065d007@redhat.com>
2024-03-22 21:21 ` folio_mmapped David Hildenbrand
2024-03-26 22:04 ` folio_mmapped Elliot Berman
2024-03-27 19:34 ` folio_mmapped Will Deacon
2024-03-28 9:06 ` folio_mmapped David Hildenbrand
2024-03-28 10:10 ` Quentin Perret [this message]
2024-03-28 10:32 ` folio_mmapped David Hildenbrand
2024-03-28 10:58 ` folio_mmapped Quentin Perret
2024-03-28 11:41 ` folio_mmapped David Hildenbrand
2024-03-29 18:38 ` folio_mmapped Vishal Annapurve
2024-04-04 0:15 ` folio_mmapped Sean Christopherson
2024-03-19 15:04 ` folio_mmapped Sean Christopherson
2024-03-22 17:16 ` folio_mmapped David Hildenbrand
2024-02-26 9:03 ` [RFC PATCH v1 00/26] KVM: Restricted mapping of guest_memfd at the host and pKVM/arm64 support Fuad Tabba
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ZgVCDPoQbbXjTBQp@google.com \
--to=qperret@google.com \
--cc=ackerleytng@google.com \
--cc=akpm@linux-foundation.org \
--cc=amoorthy@google.com \
--cc=anup@brainfault.org \
--cc=aou@eecs.berkeley.edu \
--cc=brauner@kernel.org \
--cc=catalin.marinas@arm.com \
--cc=chao.p.peng@linux.intel.com \
--cc=chenhuacai@kernel.org \
--cc=david@redhat.com \
--cc=dmatlack@google.com \
--cc=isaku.yamahata@gmail.com \
--cc=isaku.yamahata@intel.com \
--cc=james.morse@arm.com \
--cc=jarkko@kernel.org \
--cc=keirf@google.com \
--cc=kirill.shutemov@linux.intel.com \
--cc=kvm@vger.kernel.org \
--cc=kvmarm@lists.linux.dev \
--cc=liam.merwick@oracle.com \
--cc=linux-mm@kvack.org \
--cc=mail@maciej.szmigiero.name \
--cc=maz@kernel.org \
--cc=mic@digikod.net \
--cc=michael.roth@amd.com \
--cc=mpe@ellerman.id.au \
--cc=oliver.upton@linux.dev \
--cc=palmer@dabbelt.com \
--cc=paul.walmsley@sifive.com \
--cc=pbonzini@redhat.com \
--cc=quic_cvanscha@quicinc.com \
--cc=quic_mnalajal@quicinc.com \
--cc=quic_pderrin@quicinc.com \
--cc=quic_pheragu@quicinc.com \
--cc=quic_svaddagi@quicinc.com \
--cc=quic_tsoni@quicinc.com \
--cc=seanjc@google.com \
--cc=steven.price@arm.com \
--cc=suzuki.poulose@arm.com \
--cc=tabba@google.com \
--cc=vannapurve@google.com \
--cc=vbabka@suse.cz \
--cc=viro@zeniv.linux.org.uk \
--cc=wei.w.wang@intel.com \
--cc=will@kernel.org \
--cc=willy@infradead.org \
--cc=xiaoyao.li@intel.com \
--cc=yilun.xu@intel.com \
--cc=yu.c.zhang@linux.intel.com \
--cc=yuzenghui@huawei.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).