Re: [RFC PATCH v3 0/6] Direct Map Removal for guest_memfd

linux-trace-kernel.vger.kernel.org archive mirror

From: David Hildenbrand <david@redhat.com>
To: Patrick Roy <roypat@amazon.co.uk>,
	tabba@google.com, quic_eberman@quicinc.com, seanjc@google.com,
	pbonzini@redhat.com, jthoughton@google.com,
	ackerleytng@google.com, vannapurve@google.com, rppt@kernel.org
Cc: graf@amazon.com, jgowans@amazon.com, derekmn@amazon.com,
	kalyazin@amazon.com, xmarcalx@amazon.com, linux-mm@kvack.org,
	corbet@lwn.net, catalin.marinas@arm.com, will@kernel.org,
	chenhuacai@kernel.org, kernel@xen0n.name,
	paul.walmsley@sifive.com, palmer@dabbelt.com,
	aou@eecs.berkeley.edu, hca@linux.ibm.com, gor@linux.ibm.com,
	agordeev@linux.ibm.com, borntraeger@linux.ibm.com,
	svens@linux.ibm.com, gerald.schaefer@linux.ibm.com,
	tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com,
	luto@kernel.org, peterz@infradead.org, rostedt@goodmis.org,
	mhiramat@kernel.org, mathieu.desnoyers@efficios.com,
	shuah@kernel.org, kvm@vger.kernel.org, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, loongarch@lists.linux.dev,
	linux-riscv@lists.infradead.org, linux-s390@vger.kernel.org,
	linux-trace-kernel@vger.kernel.org,
	linux-kselftest@vger.kernel.org
Subject: Re: [RFC PATCH v3 0/6] Direct Map Removal for guest_memfd
Date: Mon, 4 Nov 2024 22:30:53 +0100
Message-ID: <10e4d078-3cdb-4d1c-a1a3-80e91b247217@redhat.com>
In-Reply-To: <90c9d8c0-814e-4c86-86ef-439cb5552cb6@amazon.co.uk>

>> We talked about shared (faultable) vs. private (unfaultable), and how it
>> would interact with the directmap patches here.
>>
>> As discussed, having private (unfaultable) memory with the direct-map
>> removed and shared (faultable) memory with the direct-mapping can make
>> sense for non-TDX/AMD-SEV/... non-CoCo use cases. Not sure about CoCo,
>> the discussion here seems to indicate that it might currently not be
>> required.
>>
>> So one thing we could do is that shared (faultable) will have a direct
>> mapping and be gup-able and private (unfaultable) memory will not have a
>> direct mapping and is, by design, not gup-able.
>>
>> Maybe it could make sense to not have a direct map for all guest_memfd
>> memory, making it behave like secretmem (and it would be easy to
>> implement)? But I'm not sure if that is really desirable in VM context.
> 
> This would work for us (in this scenario, the swiotlb areas would be
> "traditional" memory, e.g. set to shared via mem attributes instead of
> "shared" inside KVM), it's kinda what I had prototyped in my v1 of this
> series (well, we'd need to figure out how to get the mappings of gmem
> back into KVM, since in this setup, short-circuiting it into
> userspace_addr wouldn't work, unless we banish swiotlb into a different
> memslot altogether somehow).

Right.

> But I don't think it'd work for pKVM, iirc
> they need GUP on gmem, and also want direct map removal (... but maybe,
> the gmem VMA for non-CoCo usecase and the gmem VMA for pKVM could
> behave differently?  non-CoCo gets essentially memfd_secret, pKVM gets
> GUP+no faults of private mem).

Good question. So far my perception was that the directmap removal on 
"private/unfaultable" would be sufficient.

> 
>> Having a mixture of "has directmap" and "has no directmap" for shared
>> (faultable) memory should not be done. Similarly, private memory really
>> should stay "unfaultable".
> 
> You've convinced me that having both GUP-able and non GUP-able
> memory in the same VMA will be tricky. However, I'm less convinced on
> why private memory should stay unfaultable; only that it shouldn't be
> faultable into a VMA that also allows GUP. Can we have two VMAs? One
> that disallows GUP, but allows userspace access to shared and private,
> and one that allows GUP, but disallows accessing private memory? Maybe
> via some `PROT_NOGUP` flag to `mmap`? I guess this is a slightly
> different spin of the above idea.

What we are trying to achieve is making guest_memfd not behave 
completely differently on that level for different "types" of VMs. So one 
of the goals should be to try to unify it as much as possible.

shared -> faultable: GUP-able
private -> unfaultable: unGUP-able


And it makes sense, because a lot of future work will rely on some 
important properties: for example, if private memory cannot be faulted 
in + GUPed, core-MM will never have obtained valid references to such a 
page. There is no need to split large folios into smaller ones for 
tracking purposes; there is no need to maintain per-page refcounts and 
pincounts ...
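
To make the rule above concrete, here is a toy userspace model in Python (not kernel code). The names `State`, `Policy`, `POLICY` and `needs_per_page_tracking` are made up for illustration, and the direct-map column follows the "shared has a direct mapping, private does not" option quoted earlier in this mail, which is only one possibility under discussion.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    SHARED = "shared"
    PRIVATE = "private"


@dataclass(frozen=True)
class Policy:
    faultable: bool   # may be faulted into a userspace mapping
    gup_able: bool    # may be pinned through GUP
    direct_map: bool  # present in the kernel direct map


# Hypothetical table: one policy per state, the same for every VM "type".
POLICY = {
    State.SHARED: Policy(faultable=True, gup_able=True, direct_map=True),
    State.PRIVATE: Policy(faultable=False, gup_able=False, direct_map=False),
}


def needs_per_page_tracking(state: State) -> bool:
    """Per-page refcounts/pincounts only matter if core-MM can ever
    obtain a reference, i.e. if the memory can be faulted in or GUPed."""
    policy = POLICY[state]
    return policy.faultable or policy.gup_able
```

In this model, `needs_per_page_tracking(State.PRIVATE)` is false, which is the property the paragraph above relies on: nothing outside guest_memfd can hold a reference to a private page, so large folios need not be split for tracking.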

It doesn't mean that we cannot consider it if really required, but there 
really has to be a strong case for it, because it will all get really messy.

For example, one issue is that a folio only has a single mapping 
(folio->mapping), and that is used in the GUP-fast path (no VMA) to 
determine whether GUP-fast is allowed or not.

So you'd have to force everything through GUP-slow, where you could 
consider VMA properties :( It sounds quite suboptimal.
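
A rough way to see the problem, again as a toy Python model rather than kernel code: the fast path only has the folio, so it can only consult a per-mapping property, while a per-VMA rule needs the slow path. `Mapping.allows_gup` and `Vma.allows_gup` are hypothetical flags (the latter standing in for something like the `PROT_NOGUP` idea above), and the real GUP-fast to GUP-slow fallback logic is deliberately left out.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Mapping:
    allows_gup: bool  # hypothetical per-mapping property


@dataclass
class Folio:
    mapping: Mapping  # a folio has exactly one mapping


@dataclass
class Vma:
    allows_gup: bool  # hypothetical per-VMA property


def gup_fast(folio: Folio) -> bool:
    # No VMA is available here; only folio.mapping can be checked.
    return folio.mapping.allows_gup


def gup_slow(folio: Folio, vma: Vma) -> bool:
    # The VMA is known here, so a per-VMA rule can be enforced.
    return folio.mapping.allows_gup and vma.allows_gup


def gup(folio: Folio, vma: Vma, policy_is_per_vma: bool) -> Tuple[str, bool]:
    """Return (path taken, whether the pin is allowed)."""
    if policy_is_per_vma:
        # The fast path cannot see the VMA, so it has to be skipped entirely.
        return ("slow", gup_slow(folio, vma))
    return ("fast", gup_fast(folio))
```

With a folio whose mapping allows GUP but which is reached through a VMA that forbids it, the fast path would say "allowed"; only the slow path gets the answer right, which is why a per-VMA rule would push everything onto it.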

I don't think multiple VMAs are what we really want. See below.

> 
>> I think one of the points raised during the bi-weekly call was that
>> using a viommu/swiotlb might be the right call, such that all memory can
>> be considered private (unfaultable) that is not explicitly
>> shared/expected to be modified by the hypervisor (-> faultable, ->
>> GUP-able).
>>
>> Further, I think Sean had some good points why we should explore that
>> direction, but I recall that there were some issues to be sorted out
>> (interpreted instructions requiring direct map when accessing "private"
>> memory?), not sure if that is already working/can be made working in KVM.
> 
> Yeah, the big one is MMIO instruction emulation on x86, which does guest
> page table walks and instruction fetch (and particularly the latter
> cannot be known ahead-of-time by the guest, aka cannot be explicitly
> "shared"). That's what the majority of my v2 series was about. For
> traditional memslots, KVM handles these via get_user and friends, but if
> we don't have a VMA that allows faulting all of gmem, then that's
> impossible, and we're in "temporarily restore direct map" land. Which
> comes with significant performance penalties due to TLB flushes.

Agreed.

> 
>> What's your opinion after the call and the next step for use cases like
>> you have in mind (IIRC firecracker, which wants to not have the
>> direct-map for guest memory where it can be avoided)?
> 
> Yea, the usecase is for Firecracker to not have direct map entries for
> guest memory, unless needed for I/O (-> swiotlb).
> 
> As for next steps, let's determine once and for all if we can do the
> KVM-internal guest memory accesses for MMIO emulation through userspace
> mappings (although if we can't I'll have some serious soul-searching to
> do, because all other solutions we talked about so far also have fairly
> big drawbacks; on-demand direct map reinsertion has terrible
> performance

So IIUC, KVM would have to access "unfaultable" guest_memfd memory using 
fd+offset, and that's problematic because "no-directmap".

So you'd have to map+unmap the directmap repeatedly, and still expose it 
temporarily in the direct map to others. I see how that is undesirable, 
even when trying to cache hotspots (partly destroying the purpose of the 
directmap removal).
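
As a toy illustration of that cost (Python, not kernel code; `DirectMap` and `temporarily_mapped` are invented names, and the model assumes every removal from the direct map is followed by a TLB flush):

```python
from contextlib import contextmanager


class DirectMap:
    """Toy stand-in for the kernel direct map, tracking present PFNs."""

    def __init__(self):
        self.present = set()
        self.tlb_flushes = 0

    @contextmanager
    def temporarily_mapped(self, pfn):
        # While mapped, the page is visible to everything using the
        # direct map, not just to the one access that needed it.
        self.present.add(pfn)
        try:
            yield
        finally:
            self.present.discard(pfn)
            # Assumption of this model: each removal costs one flush.
            self.tlb_flushes += 1
```

Every emulated access wrapped in `temporarily_mapped()` both opens a window in which the page is globally visible and adds a flush on the way out, which are the two drawbacks described above.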


Would a per-MM kernel mapping of these pages work, so KVM can access them?

It sounds a bit like what is required for clean per-MM allocations [1]: 
establish a per-MM kernel mapping of (selected?) pages. Not necessarily 
all of them.

Yes, we'd be avoiding VMAs, GUP, mapcounts, pincounts and everything 
involved with ordinary user mappings for these private/unfaultable 
thingies. Just as discussed in, and similar to, [1].

Just throwing it out there, maybe we really want to avoid the directmap 
(keep it unmapped) and maintain a per-mm mapping for a bunch of folios 
that can be easily removed when required by guest_memfd (ftruncate, 
conversion private->shared) on request.
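
A very rough sketch of that idea as a toy Python model (not kernel code). `PerMmKernelMapping`, `GuestMemfd` and all of their methods are hypothetical names for illustration only; nothing here reflects an existing interface.

```python
class PerMmKernelMapping:
    """Hypothetical per-MM table of folio indices mapped for one MM only."""

    def __init__(self):
        self._mapped = set()

    def map(self, index):
        self._mapped.add(index)

    def can_access(self, index):
        return index in self._mapped

    def unmap_range(self, start, end):
        # Drop every mapped index in [start, end).
        self._mapped -= set(range(start, end))


class GuestMemfd:
    """Toy guest_memfd that can revoke per-MM mappings on request."""

    def __init__(self, nr_folios):
        self.nr_folios = nr_folios
        self.mappings = []

    def attach(self, mm_mapping):
        self.mappings.append(mm_mapping)

    def truncate(self, new_nr_folios):
        # ftruncate-like shrink: revoke everything past the new end.
        for m in self.mappings:
            m.unmap_range(new_nr_folios, self.nr_folios)
        self.nr_folios = new_nr_folios

    def convert_to_shared(self, index):
        # private->shared conversion: the per-MM mapping is no longer needed.
        for m in self.mappings:
            m.unmap_range(index, index + 1)
```

The point of the model is only that the direct map never comes into play: the pages stay out of it, and guest_memfd remains in control of when the per-MM view goes away.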

[1] https://lore.kernel.org/all/20240911143421.85612-1-faresx@amazon.de/T/#u

-- 
Cheers,

David / dhildenb

