From: "Brendan Jackman" <brendan.jackman@linux.dev>
To: "Ackerley Tng" <ackerleytng@google.com>,
"Takahiro Itazuri" <itazur@amazon.com>, <fvdl@google.com>,
<seanjc@google.com>, <ljs@kernel.org>
Cc: <Liam.Howlett@oracle.com>, <agordeev@linux.ibm.com>,
<ajones@ventanamicro.com>, <akpm@linux-foundation.org>,
<alex@ghiti.fr>, <andrii@kernel.org>, <aou@eecs.berkeley.edu>,
<ast@kernel.org>, <baolu.lu@linux.intel.com>, <bp@alien8.de>,
<chenhuacai@kernel.org>, <corbet@lwn.net>, <coxu@redhat.com>,
<daniel@iogearbox.net>, <dave.hansen@linux.intel.com>,
<david@kernel.org>, <derekmn@amazon.com>, <dev.jain@arm.com>,
<eddyz87@gmail.com>, <gerald.schaefer@linux.ibm.com>,
<gor@linux.ibm.com>, <haoluo@google.com>, <hca@linux.ibm.com>,
<hpa@zytor.com>, <itazur@amazon.co.uk>, <jackabt@amazon.co.uk>,
<jackmanb@google.com>, <jannh@google.com>, <jgg@ziepe.ca>,
<jgross@suse.com>, <jhubbard@nvidia.com>,
<jiayuan.chen@shopee.com>, <jmattson@google.com>,
<joey.gouly@arm.com>, <john.fastabend@gmail.com>,
<jolsa@kernel.org>, <jthoughton@google.com>, <kpsingh@kernel.org>,
<kvm@vger.kernel.org>, <kvmarm@lists.linux.dev>,
<lenb@kernel.org>, <linux-kernel@vger.kernel.org>,
<linux-mm@kvack.org>, <lorenzo.stoakes@oracle.com>,
<luto@kernel.org>, <maobibo@loongson.cn>, <martin.lau@linux.dev>,
<maz@kernel.org>, <mhocko@suse.com>, <mingo@redhat.com>,
<mlevitsk@redhat.com>, <nikita.kalyazin@linux.dev>,
<oupton@kernel.org>, <palmer@dabbelt.com>,
<patrick.roy@linux.dev>, <pavel@kernel.org>,
<pbonzini@redhat.com>, <peterx@redhat.com>,
<peterz@infradead.org>, <pfalcato@suse.de>, <pjw@kernel.org>,
<prsampat@amd.com>, <rafael@kernel.org>, <riel@surriel.com>,
<rppt@kernel.org>, <ryan.roberts@arm.com>, <sdf@fomichev.me>,
<shijie@os.amperecomputing.com>, <skhan@linuxfoundation.org>,
<song@kernel.org>, <surenb@google.com>, <suzuki.poulose@arm.com>,
<svens@linux.ibm.com>, <tabba@google.com>, <tglx@kernel.org>,
<thuth@redhat.com>, <urezki@gmail.com>, <vannapurve@google.com>,
<vbabka@kernel.org>, <will@kernel.org>, <willy@infradead.org>,
<wu.fei9@sanechips.com.cn>, <x86@kernel.org>,
<yang@os.amperecomputing.com>, <yangyicong@hisilicon.com>,
<yonghong.song@linux.dev>, <yosry@kernel.org>,
<yu-cheng.yu@intel.com>, <yuzenghui@huawei.com>,
<zhengqi.arch@bytedance.com>, <zulinx86@gmai.com>
Subject: Re: [PATCH v12 10/16] KVM: guest_memfd: Add flag to remove from direct map
Date: Fri, 03 Jul 2026 17:25:05 +0000
Message-ID: <DJP40SEE38XA.3BXJN4U0VDIOS@linux.dev>
In-Reply-To: <CAEvNRgG07EMrx-SpMaO3gHmdGVwOb75XNy7_RARBo0chidn7Yg@mail.gmail.com>
Alright I think I'm finally getting a bit more up to speed on the
important questions here.
On Thu May 14, 2026 at 4:45 PM UTC, Ackerley Tng wrote:
> Takahiro Itazuri <itazur@amazon.com> writes:
>
>>
>> [...snip...]
>>
>
> Brought this topic up on the guest_memfd biweekly today!
>
>>
>> Agreed with both of you. I'll adopt the filemap-level approach:
>>
>> - Move the zap/restore hooks from guest_memfd into filemap_add_folio()
>> / filemap_remove_folio().
>> - Tighten AS_NO_DIRECT_MAP semantics so that, for folios in such a
>> mapping, the direct map is invalid for the entire time the folio
>> resides in the page cache.
>> - Drop the per-folio KVM_GMEM_FOLIO_NO_DIRECT_MAP bookkeeping in
>> folio->private, since the existence of the folio in the mapping is
>> itself the state.
Yeah, so I prototyped this and I think it's fine... except for zeroing.
>> On each guest memory population path,
>>
>> - memcpy-based population from userspace goes through the userspace
>> mapping of guest_memfd, not through the kernel direct map, so the
>> filemap-level invariant doesn't affect it. But this is slow, which
>> is what motivated the write() syscall support.
>>
>> - write(): meant to speed up the userspace-memcpy case above by doing
>> the copy in the kernel. I believe Brendan's __GFP_UNMAPPED/mermap
>> work [1] would give us a low-overhead way to get temporary kernel
>> access to an AS_NO_DIRECT_MAP. Landing mermap may take a while, but
>> this series does not introduce the write() path, so mermap is not a
>> blocker for now.
>>
>> - kvm_gmem_populate(): this is a TDX/SNP-only path, and NO_DIRECT_MAP
>> is not available on those VM types —
>> kvm_arch_gmem_supports_no_direct_map() returns false for
>> KVM_X86_TDX_VM and KVM_X86_SNP_VM, which are its only callers
>> today. So it doesn't interact with the filemap invariant IIUC.
There are also the fault paths though; if the pages are nonpresent in
the direct map for the duration of their life in the page cache (and I
think they should be) then by the time we get to
kvm_mmu_faultin_pfn_gmem() or kvm_gmem_fault_user_mapping() we've lost
the ability to zero them.

My original answer for this was "that's fine, we'll use __GFP_ZERO
(which will probably use the mermap under the hood)", but now I've
realised there's a good reason we don't set __GFP_ZERO at the moment,
namely that it's wasted if we end up doing kvm_gmem_populate().

(Continued below...)
> I'm a little bit uncomfortable this statement since it seems to say TDX
> and SNP aren't taken care of. Would just like to discuss (for
> a line of sight to SNP and TDX support):
Are you saying we need NO_DIRECT_MAP support for TDX/SNP? I think that
would be doable but what's the value? So that we can get a #PF instead
of #MCE if we screw up?
> For non-in-place population where the source physical page is different
> from the destination physical page,
>
> + TDX: the TDX module does the population and works with physical
> addresses, so no issue with populate? Other parts of TDX may have
> trouble though, but that can be handled later.
> + SNP: sev_gmem_post_populate() does a memcpy() after using
> kmap_local_page()
>
> Would mermap be a drop in replacement for kmap_local_page() here?
Yeah basically.
> Would guest_memfd need to force a TLB flush after mermap+memcpy?
It's not required for correctness, no (mermap does those flushes
internally). For security, I dunno; this comes back to my confusion
above about why we'd want NO_DIRECT_MAP for TDX at all. Maybe best to
chat face-to-face about that and then follow up here with a summary.
====

ANYWAY, here is how I would ultimately see all of this working, at least
for non-CoCo cases:

- AS_NO_DIRECT_MAP causes filemap.c to set ALLOC_UNMAPPED (that's what
  the next iteration of __GFP_UNMAPPED will be called) so you get pages
  directly from the page allocator that are already fully zapped.

- Where guest_memfd.c currently does clear_highpage(), it now instead
  does something a bit like clear_page_mermap() from
  https://lore.kernel.org/all/20260320-page_alloc-unmapped-v2-20-28bf1bd54f41@google.com/

- The write() path does something similar with the mermap.

- Those mermap operations would leave behind stale TLB entries that
  could be exploited by the VMM for CPU vulns. To prevent that we need
  to force a TLB flush before freeing the physical pages they point to.
  Luckily now that all the folio allocations are pushed into
  mm/filemap.c we can just do that in kvm_gmem_free_folio(), preventing
  bugs like the one I had here (bottom of the mail):
  https://lore.kernel.org/all/DHH1NTVNTA8W.2313NYMA29J42@google.com/

Note there's no need for the page allocator to support ALLOC_UNMAPPED
with __GFP_ZERO in this design, which is nice.
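
To make the ordering in the list above concrete, here is a toy userspace
model in C. It is only a sketch and not kernel code: every name in it
(struct toy_folio, toy_alloc_unmapped() and so on) is hypothetical, and the
"TLB" is just a boolean. It assumes that a temporary mapping can leave a
stale translation behind, and it shows the one invariant that matters:
zeroing never goes through the direct map, and the free hook flushes before
the page is handed back.

```c
#include <stdbool.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096

/* Toy stand-in for a folio; none of these names exist in the kernel. */
struct toy_folio {
	unsigned char data[TOY_PAGE_SIZE];
	bool in_direct_map;	/* stays false while in the page cache */
	bool stale_tlb;		/* a temporary mapping may still be cached */
};

/* Models an ALLOC_UNMAPPED-style allocation: the page arrives zapped. */
static void toy_alloc_unmapped(struct toy_folio *f)
{
	memset(f->data, 0xAA, sizeof(f->data));	/* pretend old contents */
	f->in_direct_map = false;
	f->stale_tlb = false;
}

/* Models zeroing through a short-lived mapping, not the direct map. */
static void toy_clear_via_temp_mapping(struct toy_folio *f)
{
	memset(f->data, 0, sizeof(f->data));
	/* Assumption: a translation may outlive the temporary mapping. */
	f->stale_tlb = true;
}

/* Models a TLB flush covering this page. */
static void toy_flush_tlb(struct toy_folio *f)
{
	f->stale_tlb = false;
}

/*
 * Models the free hook: flush first, then report whether the page is
 * safe to return (no stale translation points at it any more).
 */
static bool toy_free_folio(struct toy_folio *f)
{
	if (f->stale_tlb)
		toy_flush_tlb(f);
	return !f->stale_tlb;
}
```

A caller would run toy_alloc_unmapped(), then toy_clear_via_temp_mapping(),
and finally toy_free_folio(). The point of putting the flush in a single
free hook is that no individual access path has to remember to do it.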
====
NOW, the thing I'm stuck on (again lol) is the patchset-fu. Here's all
the parts we need, with dependencies indented:
0. efficient GUEST_MEMFD_FLAG_NO_DIRECT_MAP
1. AS_NO_DIRECT_MAP
2. ALLOC_UNMAPPED (formerly known as __GFP_UNMAPPED)
3. alloc_flags arg to the page allocator (I'm sneakily introducing this
in [1])
4. freetype_t
5. The mermap
6. The mm-local region

I originally posted all of those in [0], except part 3. Doing all of
that together in one series would be a bit too much though. Approaches I
can see to avoid that:

Approach X:

- Do parts 1, 2 and 4 as a standalone series. The only beneficiary of
  AS_NO_DIRECT_MAP would be secretmem.

- Then another series that fills in 0, 5 and 6.

Approach Y:

- One series that does parts 0, 1, 5, and 6. AS_NO_DIRECT_MAP is
  implemented by having filemap.c itself call folio_zap_direct_map(),
  then guest_memfd.c zeroes it via the mermap. It works but it's really
  slow.

- Then another series that fills in parts 2 and 4, switches filemap.c
  over from manual folio_zap_direct_map() to ALLOC_UNMAPPED, making
  things fast.

Approach X seems natural from a code progression perspective but leaves
us with an interim phase where we have a bunch of complexity just to
"optimise secretmem" which nobody cares about.

Approach Y seems natural from a feature progression perspective but
leaves us with an interim phase where we expensively zap a page, only to
then immediately do this complex mermap dance to access it right
afterwards.

Any thoughts / other ideas? Personally I think I prefer X.

[0] https://lore.kernel.org/all/DHH1NTVNTA8W.2313NYMA29J42@google.com/
[1] https://lore.kernel.org/all/20260703-alloc-trylock-v5-0-c87b714e19d3@google.com/
Apologies for the Friday night essay :D
Brendan