From: Takahiro Itazuri <itazur@amazon.com>
To: <fvdl@google.com>, <seanjc@google.com>, <ljs@kernel.org>
Cc: <Liam.Howlett@oracle.com>, <ackerleytng@google.com>,
<agordeev@linux.ibm.com>, <ajones@ventanamicro.com>,
<akpm@linux-foundation.org>, <alex@ghiti.fr>, <andrii@kernel.org>,
<aou@eecs.berkeley.edu>, <ast@kernel.org>,
<baolu.lu@linux.intel.com>, <borntraeger@linux.ibm.com>,
<bp@alien8.de>, <bpf@vger.kernel.org>, <catalin.marinas@arm.com>,
<chenhuacai@kernel.org>, <corbet@lwn.net>, <coxu@redhat.com>,
<daniel@iogearbox.net>, <dave.hansen@linux.intel.com>,
<david@kernel.org>, <derekmn@amazon.com>, <dev.jain@arm.com>,
<eddyz87@gmail.com>, <gerald.schaefer@linux.ibm.com>,
<gor@linux.ibm.com>, <haoluo@google.com>, <hca@linux.ibm.com>,
<hpa@zytor.com>, <itazur@amazon.co.uk>, <jackabt@amazon.co.uk>,
<jackmanb@google.com>, <jannh@google.com>, <jgg@ziepe.ca>,
<jgross@suse.com>, <jhubbard@nvidia.com>,
<jiayuan.chen@shopee.com>, <jmattson@google.com>,
<joey.gouly@arm.com>, <john.fastabend@gmail.com>,
<jolsa@kernel.org>, <jthoughton@google.com>,
<kalyazin@amazon.co.uk>, <kas@kernel.org>, <kernel@xen0n.name>,
<kpsingh@kernel.org>, <kvm@vger.kernel.org>,
<kvmarm@lists.linux.dev>, <lenb@kernel.org>,
<linux-arm-kernel@lists.infradead.org>,
<linux-doc@vger.kernel.org>, <linux-fsdevel@vger.kernel.org>,
<linux-kernel@vger.kernel.org>, <linux-kselftest@vger.kernel.org>,
<linux-mm@kvack.org>, <linux-pm@vger.kernel.org>,
<linux-riscv@lists.infradead.org>, <linux-s390@vger.kernel.org>,
<loongarch@lists.linux.dev>, <lorenzo.stoakes@oracle.com>,
<luto@kernel.org>, <maobibo@loongson.cn>, <martin.lau@linux.dev>,
<maz@kernel.org>, <mhocko@suse.com>, <mingo@redhat.com>,
<mlevitsk@redhat.com>, <nikita.kalyazin@linux.dev>,
<oupton@kernel.org>, <palmer@dabbelt.com>,
<patrick.roy@linux.dev>, <pavel@kernel.org>,
<pbonzini@redhat.com>, <peterx@redhat.com>,
<peterz@infradead.org>, <pfalcato@suse.de>, <pjw@kernel.org>,
<prsampat@amd.com>, <rafael@kernel.org>, <riel@surriel.com>,
<rppt@kernel.org>, <ryan.roberts@arm.com>, <sdf@fomichev.me>,
<shijie@os.amperecomputing.com>, <skhan@linuxfoundation.org>,
<song@kernel.org>, <surenb@google.com>, <suzuki.poulose@arm.com>,
<svens@linux.ibm.com>, <tabba@google.com>, <tglx@kernel.org>,
<thuth@redhat.com>, <urezki@gmail.com>, <vannapurve@google.com>,
<vbabka@kernel.org>, <will@kernel.org>, <willy@infradead.org>,
<wu.fei9@sanechips.com.cn>, <x86@kernel.org>,
<yang@os.amperecomputing.com>, <yangyicong@hisilicon.com>,
<yonghong.song@linux.dev>, <yosry@kernel.org>,
<yu-cheng.yu@intel.com>, <yuzenghui@huawei.com>,
<zhengqi.arch@bytedance.com>, <zulinx86@gmai.com>
Subject: Re: [PATCH v12 10/16] KVM: guest_memfd: Add flag to remove from direct map
Date: Fri, 8 May 2026 08:18:10 +0000
Message-ID: <20260508081812.12345-1-itazur@amazon.com>
In-Reply-To: <CAPTztWb67XZvfcMVnbegDNNW0LJa9UsaTGx3M898xJUJrekk0w@mail.gmail.com>
Hi Sean, Frank, Lorenzo,
On Tue, Apr 21, 2026 at 10:08:48AM -0700, Frank van der Linden wrote:
> On Tue, Apr 21, 2026 at 9:31 AM Sean Christopherson <seanjc@google.com> wrote:
> > Making guest_memfd responsible for zapping and restoring the direct map on a per-
> > folio basis feels wrong given the addition of AS_NO_DIRECT_MAP. I especially don't
> > like that the "rules" for when an AS_NO_DIRECT_MAP folio has a direct map will vary
> > based on the owner, and even within an owner (e.g. guest_memfd) will be ad hoc.
> >
> > E.g. as per the series to add guest_memfd write() support[*]:
> >
> > When direct map removal is implemented [2]
> > - write() will not be allowed to access pages that have already
> > been removed from direct map
> > - on completion, write() will remove the populated pages from
> > direct map
> >
> > That's pretty gross ABI, because with KVM_GMEM_FOLIO_NO_DIRECT_MAP, userspace can
> > write() exactly once. To re-write memory, I assume userspace would need to do a
> > PUNCH_HOLE or truncate.
> >
> > What's preventing us from handling this automagically in e.g. filemap_add_folio()
> > and filemap_remove_folio()? Then the usage rules are pretty straightforward: the
> > kernel must *always* assume the direct map is invalid for folios from
> > AS_NO_DIRECT_MAP mappings.
> >
> > Then if KVM needs to utilize a kernel mapping, e.g. in kvm_gmem_populate(), KVM
> > could use dedicated variants of kmap_local_xxx() to deal with a local mapping for
> > a folio/page without a direct map. Or, KVM could simply disallow the specific
> > sequence that would require KVM to do the memcpy (I'm pretty sure we can do that
> > with in-place shared=>private conversion support).
> >
> > I realize that could throw a big wrench into write() performance, but IMO, before
> > merging either series, we need a complete story for exactly how this will all fit
> > together, in a maintainable fashion and with sane ABI.
>
> I agree with this - this approach would also allow for memory that was
> never in the direct map to begin with, or has been taken out already
> (for which I happen to have a use case :-)). guest_memfd and other
> code can then assume that AS_NO_DIRECT_MAP means they have to take
> explicit action to map it if needed. It's a clean, simple ABI.
>
> With the current set of patches, it seems like this couldn't be done
> in a clean manner.

Agreed with both of you. I'll adopt the filemap-level approach:

- Move the zap/restore hooks from guest_memfd into filemap_add_folio()
  / filemap_remove_folio().
- Tighten AS_NO_DIRECT_MAP semantics so that, for folios in such a
  mapping, the direct map is invalid for the entire time the folio
  resides in the page cache.
- Drop the per-folio KVM_GMEM_FOLIO_NO_DIRECT_MAP bookkeeping in
  folio->private, since the existence of the folio in the mapping is
  itself the state.
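
To make the intended invariant concrete, here is a toy userspace model of
the three points above. It is only a sketch: the toy_* types and functions
are hypothetical stand-ins, not the kernel's struct folio / struct
address_space or the real filemap code, and it ignores locking, TLB
flushing and error handling entirely.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an address_space; not the kernel type. */
struct toy_mapping {
	bool no_direct_map;		/* models AS_NO_DIRECT_MAP */
};

/* Hypothetical stand-in for a folio; not the kernel type. */
struct toy_folio {
	struct toy_mapping *mapping;	/* NULL while outside the page cache */
	bool direct_map_valid;		/* models the direct map entry state */
};

/* Models the zap hook moving into filemap_add_folio(). */
static int toy_filemap_add_folio(struct toy_mapping *mapping,
				 struct toy_folio *folio)
{
	if (folio->mapping)
		return -1;		/* already in some mapping */

	folio->mapping = mapping;
	if (mapping->no_direct_map)
		folio->direct_map_valid = false;	/* zap */
	return 0;
}

/* Models the restore hook moving into filemap_remove_folio(). */
static void toy_filemap_remove_folio(struct toy_folio *folio)
{
	if (!folio->mapping)
		return;

	if (folio->mapping->no_direct_map)
		folio->direct_map_valid = true;		/* restore */
	folio->mapping = NULL;
}

/*
 * No per-folio flag is consulted here: membership in a no_direct_map
 * mapping is itself the state.
 */
static bool toy_folio_has_direct_map(const struct toy_folio *folio)
{
	return !(folio->mapping && folio->mapping->no_direct_map);
}
```

In this model, toy_folio_has_direct_map() and direct_map_valid can never
disagree, which is the property that lets the folio->private bookkeeping
go away.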

Here is how each guest memory population path interacts with that
invariant:

- memcpy-based population from userspace goes through the userspace
  mapping of guest_memfd, not through the kernel direct map, so the
  filemap-level invariant doesn't affect it. But this is slow, which
  is what motivated the write() syscall support.
- write(): meant to speed up the userspace-memcpy case above by doing
  the copy in the kernel. I believe Brendan's __GFP_UNMAPPED/mermap
  work [1] would give us a low-overhead way to get temporary kernel
  access to an AS_NO_DIRECT_MAP folio. Landing mermap may take a
  while, but this series does not introduce the write() path, so
  mermap is not a blocker for now.
- kvm_gmem_populate(): this is a TDX/SNP-only path, and NO_DIRECT_MAP
  is not available on those VM types:
  kvm_arch_gmem_supports_no_direct_map() returns false for
  KVM_X86_TDX_VM and KVM_X86_SNP_VM, which are the only VM types that
  call kvm_gmem_populate() today. So it doesn't interact with the
  filemap invariant IIUC.

So, unless I'm missing any path, adopting the filemap-level approach in
this series should be fine.
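
The kvm_gmem_populate() argument above boils down to two predicates that
are never true for the same VM type. A toy model of that reasoning follows.
The TOY_* enum and toy_* helpers are hypothetical names, not KVM's, and
treating every non-TDX/SNP type as supporting NO_DIRECT_MAP is a
simplifying assumption of the sketch, not a claim about the real arch hook.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the x86 VM types; values are arbitrary. */
enum toy_vm_type {
	TOY_DEFAULT_VM,
	TOY_SW_PROTECTED_VM,
	TOY_SNP_VM,
	TOY_TDX_VM,
};

/*
 * Models kvm_arch_gmem_supports_no_direct_map() as described above:
 * false for TDX and SNP. Returning true for everything else is an
 * assumption made for this sketch only.
 */
static bool toy_gmem_supports_no_direct_map(enum toy_vm_type type)
{
	return type != TOY_TDX_VM && type != TOY_SNP_VM;
}

/* Models "kvm_gmem_populate() is only reached for TDX and SNP". */
static bool toy_vm_uses_gmem_populate(enum toy_vm_type type)
{
	return type == TOY_TDX_VM || type == TOY_SNP_VM;
}

/* True only if populate could ever run on a NO_DIRECT_MAP guest_memfd. */
static bool toy_populate_sees_no_direct_map(enum toy_vm_type type)
{
	return toy_vm_uses_gmem_populate(type) &&
	       toy_gmem_supports_no_direct_map(type);
}
```

toy_populate_sees_no_direct_map() is false for every type in the enum. If a
future VM type both used the populate path and supported NO_DIRECT_MAP,
this is the check that would start failing.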

Before posting the next version, I'd like to ask how you would prefer
me to proceed. In a separate reply on the cover letter thread [2],
Lorenzo and Sean suggested that the mm pieces should go through the mm
tree:

On Tue, Apr 21, 2026 at 04:36:00PM +0000, Sean Christopherson wrote:
> Yeah, when the time comes, the mm pieces definitely need to go through the mm
> tree. Ideally, I think this would be merged in two separate parts, with all mm
> changes going through the mm tree, and then the KVM changes through the KVM tree
> using a stable topic branch/tag from Andrew.
I see two reasonable paths to get there, and would appreciate your
input on which you prefer:

Path A: validate on the KVM side first, then split:
- Post v13 as a single series on the KVM list, gather feedback and
  make sure the design is acceptable to KVM reviewers.
- Once v13 looks good ("the time comes"), do the MM/KVM split,
  rebase the MM part onto the appropriate MM branch, and post the
  MM part to linux-mm to build consensus with MM maintainers.

Path B: split early and seek MM consensus in parallel:
- With the filemap rework already in place, do the MM/KVM split
  now and post the MM part to linux-mm directly. The KVM part follows
  on top of a stable topic branch from MM.

Which of the two would you rather see? Happy to go either way.
[1] https://lore.kernel.org/all/20260320-page_alloc-unmapped-v2-0-28bf1bd54f41@google.com/
[2] https://lore.kernel.org/all/20260506080753.14517-1-itazur@amazon.com/
Takahiro