From: David Hildenbrand <david@redhat.com>
To: Will Deacon <will@kernel.org>
Cc: Sean Christopherson <seanjc@google.com>,
Vishal Annapurve <vannapurve@google.com>,
Quentin Perret <qperret@google.com>,
Matthew Wilcox <willy@infradead.org>,
Fuad Tabba <tabba@google.com>,
kvm@vger.kernel.org, kvmarm@lists.linux.dev, pbonzini@redhat.com,
chenhuacai@kernel.org, mpe@ellerman.id.au, anup@brainfault.org,
paul.walmsley@sifive.com, palmer@dabbelt.com,
aou@eecs.berkeley.edu, viro@zeniv.linux.org.uk,
brauner@kernel.org, akpm@linux-foundation.org,
xiaoyao.li@intel.com, yilun.xu@intel.com,
chao.p.peng@linux.intel.com, jarkko@kernel.org,
amoorthy@google.com, dmatlack@google.com,
yu.c.zhang@linux.intel.com, isaku.yamahata@intel.com,
mic@digikod.net, vbabka@suse.cz, ackerleytng@google.com,
mail@maciej.szmigiero.name, michael.roth@amd.com,
wei.w.wang@intel.com, liam.merwick@oracle.com,
isaku.yamahata@gmail.com, kirill.shutemov@linux.intel.com,
suzuki.poulose@arm.com, steven.price@arm.com,
quic_mnalajal@quicinc.com, quic_tsoni@quicinc.com,
quic_svaddagi@quicinc.com, quic_cvanscha@quicinc.com,
quic_pderrin@quicinc.com, quic_pheragu@quicinc.com,
catalin.marinas@arm.com, james.morse@arm.com,
yuzenghui@huawei.com, oliver.upton@linux.dev, maz@kernel.org,
keirf@google.com, linux-mm@kvack.org
Subject: Re: folio_mmapped
Date: Thu, 28 Mar 2024 10:06:52 +0100
Message-ID: <d0500f89-df3b-42cd-aa5a-5b3005f67638@redhat.com>
In-Reply-To: <20240327193454.GB11880@willie-the-truck>

On 27.03.24 20:34, Will Deacon wrote:
> Hi again, David,
>
> On Fri, Mar 22, 2024 at 06:52:14PM +0100, David Hildenbrand wrote:
>> On 19.03.24 15:31, Will Deacon wrote:
>> sorry for the late reply!
>
> Bah, you and me both!
This time I'm faster! :)
>
>>> On Tue, Mar 19, 2024 at 11:26:05AM +0100, David Hildenbrand wrote:
>>>> On 19.03.24 01:10, Sean Christopherson wrote:
>>>>> On Mon, Mar 18, 2024, Vishal Annapurve wrote:
>>>>>> On Mon, Mar 18, 2024 at 3:02 PM David Hildenbrand <david@redhat.com> wrote:
>>> From the pKVM side, we're working on guest_memfd primarily to avoid
>>> diverging from what other CoCo solutions end up using, but if it gets
>>> de-featured (e.g. no huge pages, no GUP, no mmap) compared to what we do
>>> today with anonymous memory, then it's a really hard sell to switch over
>>> from what we have in production. We're also hoping that, over time,
>>> guest_memfd will become more closely integrated with the mm subsystem to
>>> enable things like hypervisor-assisted page migration, which we would
>>> love to have.
>>
>> Reading Sean's reply, he has a different view on that. And I think that's
>> the main issue: there are too many different use cases and too many
>> different requirements that could turn guest_memfd into something that maybe
>> it really shouldn't be.
>
> No argument there, and we're certainly not tied to any specific
> mechanism on the pKVM side. Maybe Sean can chime in, but we've
> definitely spoken about migration being a goal in the past, so I guess
> something changed since then on the guest_memfd side.
>
> Regardless, from our point of view, we just need to make sure that
> whatever we settle on for pKVM does the things we need it to do (or can
> at least be extended to do them) and we're happy to implement that in
> whatever way works best for upstream, guest_memfd or otherwise.
>
>>> We're happy to pursue alternative approaches using anonymous memory if
>>> you'd prefer to keep guest_memfd limited in functionality (e.g.
>>> preventing GUP of private pages by extending mapping_flags as per [1]),
>>> but we're equally willing to contribute to guest_memfd if extensions are
>>> welcome.
>>>
>>> What do you prefer?
>>
>> Let me summarize the history:
>
> First off, thanks for piecing together the archaeology...
>
>> AMD had its thing running and it worked for them (but I recall it was hacky
>> :) ).
>>
>> TDX made it possible to crash the machine when accessing secure memory from
>> user space (MCE).
>>
>> So secure memory must not be mapped into user space -- no page tables.
>> Prototypes with anonymous memory existed (and I didn't hate them, although
>> hacky), but one of the other selling points of guest_memfd was that we could
>> create VMs that wouldn't need any page tables at all, which I found
>> interesting.
>
> Are the prototypes you refer to here based on the old stuff from Kirill?
Yes.
> We followed that work at the time, thinking we were going to be using
> that before guest_memfd came along, so we've sadly been collecting
> out-of-tree patches for a little while :/
:/
>
>> There was a bit more to that (easier conversion, avoiding GUP, specifying on
>> allocation that the memory was unmovable ...), but I'll get to that later.
>>
>> The design principle was: nasty private memory (unmovable, unswappable,
>> inaccessible, un-GUPable) is allocated from guest_memfd, ordinary "shared"
>> memory is allocated from an ordinary memfd.
>>
>> This makes sense: shared memory is neither nasty nor special. You can
>> migrate it, swap it out, map it into page tables, GUP it, ... without any
>> issues.
>
> Slight aside and not wanting to derail the discussion, but we have a few
> different types of sharing which we'll have to consider:
Thanks for sharing!
>
> * Memory shared from the host to the guest. This remains owned by the
> host and the normal mm stuff can be made to work with it.
Okay, host and guest can access it. We can just migrate memory around,
swap it out ... like ordinary guest memory today.
>
> * Memory shared from the guest to the host. This remains owned by the
> guest, so there's a pin on the pages and the normal mm stuff can't
> work without co-operation from the guest (see next point).
Okay, host and guest can access it, but we cannot migrate memory around
or swap it out ... like ordinary guest memory today that is longterm pinned.
>
> * Memory relinquished from the guest to the host. This actually unmaps
> the pages from the host and transfers ownership back to the host,
> after which the pin is dropped and the normal mm stuff can work. We
> use this to implement ballooning.
>
Okay, so this is essentially just a state transition between the two above.
> I suppose the main thing is that the architecture backend can deal with
> these states, so the core code shouldn't really care as long as it's
> aware that shared memory may be pinned.
So IIUC, the states are:
(1) Private: inaccessible by the host, accessible by the guest, "owned
    by the guest"
(2) Host Shared: accessible by the host + guest, "owned by the host"
(3) Guest Shared: accessible by the host + guest, "owned by the guest"
Memory ballooning is simply transitioning from (3) to (2), and then
discarding the memory.
Any state I am missing?
Which transitions are possible?
(1) <-> (2) ? Not sure if the direct transition is possible.
(2) <-> (3) ? IIUC yes.
(1) <-> (3) ? IIUC yes.
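To make my question more concrete, here is a tiny sketch of how I
currently think about these states and transitions. Purely
illustrative, all names are made up:

#include <stdbool.h>

/* The three states from above. */
enum guest_page_state {
	GP_PRIVATE,	/* (1): inaccessible by the host, guest-owned */
	GP_HOST_SHARED,	/* (2): accessible by host + guest, host-owned */
	GP_GUEST_SHARED,/* (3): accessible by host + guest, guest-owned */
};

/* My current understanding of which direct transitions are possible. */
static bool transition_possible(enum guest_page_state from,
				enum guest_page_state to)
{
	switch (from) {
	case GP_PRIVATE:
		/* (1) <-> (3) yes; (1) <-> (2) directly is unclear to me. */
		return to == GP_GUEST_SHARED;
	case GP_GUEST_SHARED:
		/* (3) <-> (1), and (3) <-> (2), e.g., for ballooning. */
		return to == GP_PRIVATE || to == GP_HOST_SHARED;
	case GP_HOST_SHARED:
		/* (2) <-> (3). */
		return to == GP_GUEST_SHARED;
	}
	return false;
}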
There is ongoing work on longterm-pinning memory from a memfd/shmem. So,
thinking in terms of my vague "guest_memfd + memfd" fd pair, that
approach could look like the following:
(1) guest_memfd (could be "with longterm pin")
(2) memfd
(3) memfd with a longterm pin
But again, this is just one possible idea for making it work with
guest_memfd.
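For example, ballooning under that model would be "transition (3) ->
(2), then discard". Host-side, the discard step could then be plain
memfd hole punching; a minimal userspace sketch (error handling
omitted, offset/page size are placeholders):

#define _GNU_SOURCE
#include <fcntl.h>

/*
 * Once a page went from "guest shared" back to "host shared" (i.e.,
 * the longterm pin is gone), the host can simply free the backing
 * page while keeping the file size unchanged.
 */
static void discard_shared_page(int memfd, off_t offset, off_t pagesize)
{
	fallocate(memfd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		  offset, pagesize);
}

The memfd itself would have been created the usual way via
memfd_create() + ftruncate().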
>
>> So if I would describe some key characteristics of guest_memfd as of today,
>> it would probably be:
>>
>> 1) Memory is unmovable and unswappable. Right from the beginning, it is
>> allocated as unmovable (e.g., not placed on ZONE_MOVABLE, CMA, ...).
>> 2) Memory is inaccessible. It cannot be read from user space or the
>> kernel, and it cannot be GUP'ed ... only some mechanisms (e.g.,
>> hibernation, /proc/kcore) might end up touching it "by accident",
>> and we can usually handle these cases.
>> 3) Memory can be discarded at page granularity. There should be no
>> case where you cannot discard memory; otherwise, private pages that
>> have been replaced by shared pages would result in over-allocation.
>> 4) Page tables are not required (well, it's a memfd), and the fd could
>> in theory be passed to other processes.
>>
>> Having "ordinary shared" memory in there implies that 1) and 2) will have to
>> be adjusted for them, which kind-of turns it "partially" into ordinary shmem
>> again.
>
> Yes, and we'd also need a way to establish hugepages (where possible)
> even for the *private* memory so as to reduce the depth of the guest's
> stage-2 walk.
>
Understood, and as discussed, that's a bit more "hairy".
>> Going back to the beginning: with pKVM, we likely want the following
>>
>> 1) Convert pages private<->shared in-place
>> 2) Stop user space + kernel from accessing private memory in process
>> context. Likely for pKVM we would only crash the process, which
>> would be acceptable.
>> 3) Prevent GUP to private memory. Otherwise we could crash the kernel.
>> 4) Prevent private pages from swapout+migration until supported.
>>
>>
>> I suspect your current solution with anonymous memory gets all but 3) sorted
>> out, correct?
>
> I agree on all of these and, yes, (3) is the problem for us. We've also
> been thinking a bit about CoW recently and I suspect the use of
> vm_normal_page() in do_wp_page() could lead to issues similar to those
> we hit with GUP. There are various ways to approach that, but I'm not
> sure what's best.
Would CoW be required, or is that just a nasty side effect of trying to
use anonymous memory?
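FWIW, the "prevent GUP via mapping_flags" approach from [1] would
conceptually boil down to something like the following kernel-side
sketch. Not a real patch, and the flag name/bit are made up:

/* Hypothetical flag on the guest_memfd mapping. */
#define AS_INACCESSIBLE	8	/* made-up bit */

static inline bool mapping_inaccessible(struct address_space *mapping)
{
	return mapping && test_bit(AS_INACCESSIBLE, &mapping->flags);
}

GUP (and similar paths) would then refuse to take references on any
folio whose mapping has that flag set, instead of crashing later.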
>
>> I'm curious, may there be a requirement in the future that shared memory
>> could be mapped into other processes? (thinking vhost-user and such things).
>
> It's not impossible. We use crosvm as our VMM, and that has a
> multi-process sandbox mode which I think relies on just that...
>
Okay, so basing the design on anonymous memory might not be the best
choice ... :/
> Cheers,
>
> Will
>
> (btw: I'm getting some time away from the computer over Easter, so I'll be
> a little slow on email again. Nothing personal!).
Sure, no worries! Enjoy!
--
Cheers,
David / dhildenb