From: David Hildenbrand <david@redhat.com>
To: Patrick Roy <roypat@amazon.co.uk>,
tabba@google.com, quic_eberman@quicinc.com, seanjc@google.com,
pbonzini@redhat.com, jthoughton@google.com,
ackerleytng@google.com, vannapurve@google.com, rppt@kernel.org
Cc: graf@amazon.com, jgowans@amazon.com, derekmn@amazon.com,
kalyazin@amazon.com, xmarcalx@amazon.com, linux-mm@kvack.org,
corbet@lwn.net, catalin.marinas@arm.com, will@kernel.org,
chenhuacai@kernel.org, kernel@xen0n.name,
paul.walmsley@sifive.com, palmer@dabbelt.com,
aou@eecs.berkeley.edu, hca@linux.ibm.com, gor@linux.ibm.com,
agordeev@linux.ibm.com, borntraeger@linux.ibm.com,
svens@linux.ibm.com, gerald.schaefer@linux.ibm.com,
tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com,
luto@kernel.org, peterz@infradead.org, rostedt@goodmis.org,
mhiramat@kernel.org, mathieu.desnoyers@efficios.com,
shuah@kernel.org, kvm@vger.kernel.org, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org, loongarch@lists.linux.dev,
linux-riscv@lists.infradead.org, linux-s390@vger.kernel.org,
linux-trace-kernel@vger.kernel.org,
linux-kselftest@vger.kernel.org
Subject: Re: [RFC PATCH v3 0/6] Direct Map Removal for guest_memfd
Date: Mon, 4 Nov 2024 22:30:53 +0100
Message-ID: <10e4d078-3cdb-4d1c-a1a3-80e91b247217@redhat.com>
In-Reply-To: <90c9d8c0-814e-4c86-86ef-439cb5552cb6@amazon.co.uk>
>> We talked about shared (faultable) vs. private (unfaultable), and how it
>> would interact with the directmap patches here.
>>
>> As discussed, having private (unfaultable) memory with the direct-map
>> removed and shared (faultable) memory with the direct-mapping can make
>> sense for non-TDX/AMD-SEV/... non-CoCo use cases. Not sure about CoCo,
>> the discussion here seems to indicate that it might currently not be
>> required.
>>
>> So one thing we could do is that shared (faultable) will have a direct
>> mapping and be gup-able and private (unfaultable) memory will not have a
>> direct mapping and is, by design, not gup-able.
>>
>> Maybe it could make sense to not have a direct map for all guest_memfd
>> memory, making it behave like secretmem (and it would be easy to
>> implement)? But I'm not sure if that is really desirable in VM context.
>
> This would work for us (in this scenario, the swiotlb areas would be
> "traditional" memory, e.g. set to shared via mem attributes instead of
> "shared" inside KVM), it's kinda what I had prototyped in my v1 of this
> series (well, we'd need to figure out how to get the mappings of gmem
> back into KVM, since in this setup, short-circuiting it into
> userspace_addr wouldn't work, unless we banish swiotlb into a different
> memslot altogether somehow).
Right.
> But I don't think it'd work for pKVM, iirc
> they need GUP on gmem, and also want direct map removal (... but maybe,
> the gmem VMA for non-CoCo usecase and the gmem VMA for pKVM could
> behave differently? non-CoCo gets essentially memfd_secret, pKVM gets
> GUP+no faults of private mem).
Good question. So far my perception was that the directmap removal on
"private/unfaultable" would be sufficient.
>
>> Having a mixture of "has directmap" and "has no directmap" for shared
>> (faultable) memory should not be done. Similarly, private memory really
>> should stay "unfaultable".
>
> You've convinced me that having both GUP-able and non GUP-able
> memory in the same VMA will be tricky. However, I'm less convinced on
> why private memory should stay unfaultable; only that it shouldn't be
> faultable into a VMA that also allows GUP. Can we have two VMAs? One
> that disallows GUP, but allows userspace access to shared and private,
> and one that allows GUP, but disallows accessing private memory? Maybe
> via some `PROT_NOGUP` flag to `mmap`? I guess this is a slightly
> different spin of the above idea.
What we are trying to achieve is making guest_memfd not behave
completely differently on that level for different "types" of VMs. So one
of the goals should be to try to unify it as much as possible.

shared -> faultable: GUP-able
private -> unfaultable: unGUP-able

And it makes sense, because a lot of future work will rely on some
important properties: for example, if private memory cannot be faulted
in + GUPed, core-MM will never have obtained valid references to such a
page. There is no need to split large folios into smaller ones for
tracking purposes; there is no need to maintain per-page refcounts and
pincounts ...
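
To make the rule above concrete, here is a minimal userspace toy model in C.
It is only a sketch: every name in it (gmem_state, toy_folio, toy_gup, ...)
is hypothetical and none of it is kernel API. The check in
toy_convert_to_private() is an assumption added for illustration; it shows
why "private implies no outstanding references" is a convenient invariant.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model, not kernel code. */
enum gmem_state { GMEM_SHARED, GMEM_PRIVATE };

struct toy_folio {
	enum gmem_state state;
	int refcount;	/* references handed out to "core-MM" users */
};

/* shared -> faultable, private -> unfaultable */
static inline bool gmem_can_fault(const struct toy_folio *f)
{
	return f->state == GMEM_SHARED;
}

/* GUP-ability simply follows faultability in this model. */
static inline bool gmem_can_gup(const struct toy_folio *f)
{
	return gmem_can_fault(f);
}

/* Try to take a GUP-style reference; fails for private memory. */
static inline bool toy_gup(struct toy_folio *f)
{
	if (!gmem_can_gup(f))
		return false;
	f->refcount++;
	return true;
}

/* Drop a reference previously taken with toy_gup(). */
static inline void toy_unpin(struct toy_folio *f)
{
	assert(f->refcount > 0);
	f->refcount--;
}

/*
 * Assumption for illustration: conversion to private is refused while
 * references are outstanding, so a private folio never has any.
 */
static inline bool toy_convert_to_private(struct toy_folio *f)
{
	if (f->refcount != 0)
		return false;
	f->state = GMEM_PRIVATE;
	return true;
}
```

In this model a private folio always has refcount == 0, which is the
property that would make per-page refcount/pincount tracking unnecessary.
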
It doesn't mean that we cannot consider it if really required, but there
really has to be a strong case for it, because it will all get really messy.

For example, one issue is that a folio only has a single mapping
(folio->mapping), and that is used in the GUP-fast path (no VMA) to
determine whether GUP-fast is allowed or not.

So you'd have to force everything through GUP-slow, where you could
consider VMA properties :( It sounds quite suboptimal.

I don't think multiple VMAs are what we really want. See below.
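
The GUP-fast limitation described above can be illustrated with a small
userspace sketch in C. All types and helpers here (toy_mapping, toy_vma,
the nogup flag, ...) are hypothetical stand-ins, not the real kernel
structures: the point is only that the fast path sees the folio's single
mapping and nothing else, so it cannot honor a per-VMA policy.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the single folio->mapping. */
struct toy_mapping {
	bool gup_allowed;
};

struct toy_folio {
	const struct toy_mapping *mapping;
};

/* Hypothetical per-VMA policy, e.g. what a PROT_NOGUP-style flag would set. */
struct toy_vma {
	bool nogup;
};

/* Fast path: no VMA is available, only the folio's mapping can be checked. */
static inline bool toy_gup_fast_allowed(const struct toy_folio *folio)
{
	assert(folio != NULL);
	return folio->mapping != NULL && folio->mapping->gup_allowed;
}

/* Slow path: the VMA is known, so a per-VMA policy can be consulted too. */
static inline bool toy_gup_slow_allowed(const struct toy_vma *vma,
					const struct toy_folio *folio)
{
	assert(vma != NULL);
	return !vma->nogup && toy_gup_fast_allowed(folio);
}
```

With one folio mapped through two such VMAs, toy_gup_fast_allowed() gives
the same answer for both, while only toy_gup_slow_allowed() can tell them
apart. That is why a per-VMA scheme would push everything to the slow path.
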
>
>> I think one of the points raised during the bi-weekly call was that
>> using a viommu/swiotlb might be the right call, such that all memory can
>> be considered private (unfaultable) that is not explicitly
>> shared/expected to be modified by the hypervisor (-> faultable, ->
>> GUP-able).
>>
>> Further, I think Sean had some good points why we should explore that
>> direction, but I recall that there were some issues to be sorted out
>> (interpreted instructions requiring direct map when accessing "private"
>> memory?), not sure if that is already working/can be made working in KVM.
>
> Yeah, the big one is MMIO instruction emulation on x86, which does guest
> page table walks and instruction fetch (and particularly the latter
> cannot be known ahead-of-time by the guest, aka cannot be explicitly
> "shared"). That's what the majority of my v2 series was about. For
> traditional memslots, KVM handles these via get_user and friends, but if
> we don't have a VMA that allows faulting all of gmem, then that's
> impossible, and we're in "temporarily restore direct map" land. Which
> comes with significant performance penalties due to TLB flushes.
Agreed.
>
>> What's your opinion after the call and the next step for use cases like
>> you have in mind (IIRC firecracker, which wants to not have the
>> direct-map for guest memory where it can be avoided)?
>
> Yea, the usecase is for Firecracker to not have direct map entries for
> guest memory, unless needed for I/O (-> swiotlb).
>
> As for next steps, let's determine once and for all if we can do the
> KVM-internal guest memory accesses for MMIO emulation through userspace
> mappings (although if we can't I'll have some serious soul-searching to
> do, because all other solutions we talked about so far also have fairly
> big drawbacks; on-demand direct map reinsertion has terrible
> performance
So IIUC, KVM would have to access "unfaultable" guest_memfd memory using
fd+offset, and that's problematic because "no-directmap".

So you'd have to map+unmap the directmap repeatedly, and still expose it
temporarily in the direct map to others. I see how that is undesirable,
even when trying to cache hotspots (partly destroying the purpose of the
directmap removal).
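
As a rough illustration of that trade-off, here is a userspace toy model in
C of on-demand direct map restoration with a small hotspot cache. It is a
sketch under stated assumptions: the names, the two-slot cache, the
round-robin eviction and the "one TLB flush per removal" accounting are all
made up for illustration and do not describe any existing implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_CACHE_SLOTS 2	/* hypothetical hotspot cache size */

struct toy_dm_cache {
	long pfn[TOY_CACHE_SLOTS];	/* pages currently restored, -1 = empty */
	size_t next;			/* round-robin eviction cursor */
	unsigned long tlb_flushes;	/* one flush per removal from the direct map */
};

static inline void toy_dm_cache_init(struct toy_dm_cache *c)
{
	for (size_t i = 0; i < TOY_CACHE_SLOTS; i++)
		c->pfn[i] = -1;
	c->next = 0;
	c->tlb_flushes = 0;
}

/* Access a page via the direct map, restoring it on demand. True on a hit. */
static inline bool toy_dm_access(struct toy_dm_cache *c, long pfn)
{
	assert(pfn >= 0);
	for (size_t i = 0; i < TOY_CACHE_SLOTS; i++)
		if (c->pfn[i] == pfn)
			return true;
	if (c->pfn[c->next] != -1)
		c->tlb_flushes++;	/* evicting means unmap + TLB flush */
	c->pfn[c->next] = pfn;
	c->next = (c->next + 1) % TOY_CACHE_SLOTS;
	return false;
}

/* Pages that are visible in the direct map to everyone right now. */
static inline size_t toy_dm_exposed(const struct toy_dm_cache *c)
{
	size_t n = 0;

	for (size_t i = 0; i < TOY_CACHE_SLOTS; i++)
		if (c->pfn[i] != -1)
			n++;
	return n;
}
```

A larger cache lowers the flush count but keeps more guest pages exposed in
the direct map, which is the tension described above.
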
Would a per-MM kernel mapping of these pages work, so KVM can access them?

It sounds a bit like what is required for clean per-MM allocations [1]:
establish a per-MM kernel mapping of (selected?) pages. Not necessarily
all of them.

Yes, we'd be avoiding VMAs, GUP, mapcounts, pincounts and everything
involved with ordinary user mappings for these private/unfaultable
thingies. Just as discussed in, and similar to, [1].

Just throwing it out there, maybe we really want to avoid the directmap
(keep it unmapped) and maintain a per-mm mapping for a bunch of folios
that can be easily removed when guest_memfd requires it (ftruncate,
conversion private->shared).
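
To sketch what such bookkeeping might look like, here is a userspace toy
model in C of a per-MM set of mapped pages with range-based teardown. This
is purely illustrative: the structure, the fixed-size array and all
function names are hypothetical, and real per-MM kernel mappings would of
course involve page tables rather than a list of page frame numbers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_MAPPED 8	/* hypothetical limit, "not necessarily all of them" */

/* Pages that have a kernel mapping visible only within one mm. */
struct toy_mm_kmap {
	long pfn[TOY_MAX_MAPPED];
	size_t count;
};

static inline bool toy_mm_kmap_has(const struct toy_mm_kmap *m, long pfn)
{
	for (size_t i = 0; i < m->count; i++)
		if (m->pfn[i] == pfn)
			return true;
	return false;
}

/* Establish a per-mm mapping for one page; idempotent, fails when full. */
static inline bool toy_mm_kmap_add(struct toy_mm_kmap *m, long pfn)
{
	assert(m->count <= TOY_MAX_MAPPED);
	if (toy_mm_kmap_has(m, pfn))
		return true;
	if (m->count == TOY_MAX_MAPPED)
		return false;
	m->pfn[m->count++] = pfn;
	return true;
}

/*
 * Tear down all mappings in [start, end), as guest_memfd would request on
 * ftruncate or private->shared conversion. Returns the number removed.
 */
static inline size_t toy_mm_kmap_zap_range(struct toy_mm_kmap *m,
					   long start, long end)
{
	size_t kept = 0, removed = 0;

	for (size_t i = 0; i < m->count; i++) {
		if (m->pfn[i] >= start && m->pfn[i] < end)
			removed++;
		else
			m->pfn[kept++] = m->pfn[i];
	}
	m->count = kept;
	return removed;
}
```

No VMAs, mapcounts or pincounts appear in this model; the only state is
which pages are mapped for the mm, and teardown is a single range walk.
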
[1] https://lore.kernel.org/all/20240911143421.85612-1-faresx@amazon.de/T/#u
--
Cheers,
David / dhildenb