The Linux Kernel Mailing List
* Re: [PATCH v12 10/16] KVM: guest_memfd: Add flag to remove from direct map
From: Takahiro Itazuri @ 2026-05-08  8:18 UTC
  To: fvdl, seanjc, ljs
  Cc: Liam.Howlett, ackerleytng, agordeev, ajones, akpm, alex, andrii,
	aou, ast, baolu.lu, borntraeger, bp, bpf, catalin.marinas,
	chenhuacai, corbet, coxu, daniel, dave.hansen, david, derekmn,
	dev.jain, eddyz87, gerald.schaefer, gor, haoluo, hca, hpa, itazur,
	jackabt, jackmanb, jannh, jgg, jgross, jhubbard, jiayuan.chen,
	jmattson, joey.gouly, john.fastabend, jolsa, jthoughton, kalyazin,
	kas, kernel, kpsingh, kvm, kvmarm, lenb, linux-arm-kernel,
	linux-doc, linux-fsdevel, linux-kernel, linux-kselftest, linux-mm,
	linux-pm, linux-riscv, linux-s390, loongarch, lorenzo.stoakes,
	luto, maobibo, martin.lau, maz, mhocko, mingo, mlevitsk,
	nikita.kalyazin, oupton, palmer, patrick.roy, pavel, pbonzini,
	peterx, peterz, pfalcato, pjw, prsampat, rafael, riel, rppt,
	ryan.roberts, sdf, shijie, skhan, song, surenb, suzuki.poulose,
	svens, tabba, tglx, thuth, urezki, vannapurve, vbabka, will,
	willy, wu.fei9, x86, yang, yangyicong, yonghong.song, yosry,
	yu-cheng.yu, yuzenghui, zhengqi.arch, zulinx86

Hi Sean, Frank, Lorenzo,

On Tue, Apr 21, 2026 at 10:08:48AM -0700, Frank van der Linden wrote:
> On Tue, Apr 21, 2026 at 9:31 AM Sean Christopherson <seanjc@google.com> wrote:
> > Making guest_memfd responsible for zapping and restoring the direct map on a per-
> > folio basis feels wrong given the addition of AS_NO_DIRECT_MAP.  I especially don't
> > like that the "rules" for when an AS_NO_DIRECT_MAP folio has a direct map will vary
> > based on the owner, and even within an owner (e.g. guest_memfd) will be ad hoc.
> >
> > E.g. as per the series to add guest_memfd write() support[*]:
> >
> >   When direct map removal is implemented [2]
> >    - write() will not be allowed to access pages that have already
> >      been removed from direct map
> >    - on completion, write() will remove the populated pages from
> >      direct map
> >
> > That's pretty gross ABI, because with KVM_GMEM_FOLIO_NO_DIRECT_MAP, userspace can
> > write() exactly once.  To re-write memory, I assume userspace would need to do a
> > PUNCH_HOLE or truncate.
> >
> > What's preventing us from handling this automagically in e.g. filemap_add_folio()
> > and filemap_remove_folio()?  Then the usage rules are pretty straightforward: the
> > kernel must *always* assume the direct map is invalid for folios from
> > AS_NO_DIRECT_MAP mappings.
> >
> > Then if KVM needs to utilize a kernel mapping, e.g. in kvm_gmem_populate(), KVM
> > could use dedicated variants of kmap_local_xxx() to deal with a local mapping for
> > a folio/page without a direct map.  Or, KVM could simply disallow the specific
> > sequence that would require KVM to do the memcpy (I'm pretty sure we can do that
> > with in-place shared=>private conversion support).
> >
> > I realize that could throw a big wrench into write() performance, but IMO, before
> > merging either series, we need a complete story for exactly how this will all fit
> > together, in a maintainable fashion and with sane ABI.
>
> I agree with this - this approach would also allow for memory that was
> never in the direct map to begin with, or has been taken out already
> (for which I happen to have a use case :-)). guest_memfd and other
> code can then assume that AS_NO_DIRECT_MAP means they have to take
> explicit action to map it if needed. It's a clean, simple ABI.
>
> With the current set of patches, it seems like this couldn't be done
> in a clean manner.

Agreed with both of you.  I'll adopt the filemap-level approach:

- Move the zap/restore hooks from guest_memfd into filemap_add_folio()
  / filemap_remove_folio().
- Tighten AS_NO_DIRECT_MAP semantics so that, for folios in such a
  mapping, the direct map is invalid for the entire time the folio
  resides in the page cache.
- Drop the per-folio KVM_GMEM_FOLIO_NO_DIRECT_MAP bookkeeping in
  folio->private, since the existence of the folio in the mapping is
  itself the state.
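
To make the intended invariant concrete, here is a toy userspace model
of the three points above.  It is only a sketch: every toy_* name is
hypothetical, the bool stands in for the real direct map state, and
nothing here is the actual filemap or set_direct_map code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy userspace model; none of these types are the kernel's. */
struct toy_mapping {
	bool no_direct_map;		/* stands in for AS_NO_DIRECT_MAP */
};

struct toy_folio {
	struct toy_mapping *mapping;	/* NULL while not in a page cache */
	bool direct_map_valid;		/* stands in for the direct map entry */
};

/* Models the proposed hook in filemap_add_folio(): zap on insertion. */
static void toy_filemap_add_folio(struct toy_mapping *m, struct toy_folio *f)
{
	assert(f->mapping == NULL);
	f->mapping = m;
	if (m->no_direct_map)
		f->direct_map_valid = false;
}

/* Models the proposed hook in filemap_remove_folio(): restore on removal. */
static void toy_filemap_remove_folio(struct toy_folio *f)
{
	assert(f->mapping != NULL);
	if (f->mapping->no_direct_map)
		f->direct_map_valid = true;
	f->mapping = NULL;
}

/* No per-folio flag: membership in the mapping is itself the state. */
static bool toy_folio_has_direct_map(const struct toy_folio *f)
{
	return !(f->mapping && f->mapping->no_direct_map);
}
```

In this model a caller never inspects per-folio bookkeeping; it asks
whether the folio currently sits in an AS_NO_DIRECT_MAP mapping and, if
so, must assume there is no direct map entry.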

For each guest memory population path:

- memcpy-based population from userspace goes through the userspace
  mapping of guest_memfd, not through the kernel direct map, so the
  filemap-level invariant doesn't affect it.  But this is slow, which
  is what motivated the write() syscall support.

- write(): meant to speed up the userspace-memcpy case above by doing
  the copy in the kernel.  I believe Brendan's __GFP_UNMAPPED/mermap
  work [1] would give us a low-overhead way to get temporary kernel
  access to an AS_NO_DIRECT_MAP folio.  Landing mermap may take a
  while, but this series does not introduce the write() path, so
  mermap is not a blocker for now.

- kvm_gmem_populate(): this is a TDX/SNP-only path, and NO_DIRECT_MAP
  is not available on those VM types:
  kvm_arch_gmem_supports_no_direct_map() returns false for
  KVM_X86_TDX_VM and KVM_X86_SNP_VM, the only VM types that reach
  kvm_gmem_populate() today.  So, IIUC, it never sees an
  AS_NO_DIRECT_MAP mapping and doesn't interact with the filemap
  invariant.
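
The kvm_gmem_populate() argument boils down to two predicates that are
never true for the same VM type.  A minimal sketch, assuming a
hypothetical toy enum (not the real KVM_X86_*_VM values) and assuming
every other VM type supports NO_DIRECT_MAP, which in practice also
depends on the architecture:

```c
#include <stdbool.h>

/* Hypothetical stand-ins; not the real KVM_X86_*_VM constants. */
enum toy_vm_type {
	TOY_DEFAULT_VM,
	TOY_SW_PROTECTED_VM,
	TOY_SNP_VM,
	TOY_TDX_VM,
	TOY_NR_VM_TYPES,
};

/* Models kvm_arch_gmem_supports_no_direct_map(): false for TDX and SNP. */
static bool toy_supports_no_direct_map(enum toy_vm_type t)
{
	return t != TOY_TDX_VM && t != TOY_SNP_VM;
}

/* Models "kvm_gmem_populate() is only reached for TDX and SNP". */
static bool toy_uses_populate(enum toy_vm_type t)
{
	return t == TOY_TDX_VM || t == TOY_SNP_VM;
}
```

If a future VM type made both predicates true, kvm_gmem_populate() would
need one of the options Sean listed (a dedicated local mapping helper,
or disallowing the memcpy sequence).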

So, unless I'm missing any path, adopting the filemap-level approach in
this series should be fine.


Before respinning, I'd also like to ask how you would prefer me to
proceed.  In a separate reply on the cover letter thread [2], Lorenzo
and Sean suggested that the mm pieces should go through the mm tree:

On Tue, Apr 21, 2026 at 04:36:00PM +0000, Sean Christopherson wrote:
> Yeah, when the time comes, the mm pieces definitely need to go through the mm
> tree.  Ideally, I think this would be merged in two separate parts, with all mm
> changes going through the mm tree, and then the KVM changes through the KVM tree
> using a stable topic branch/tag from Andrew.

I see two reasonable paths to get there, and would appreciate your
input on which you prefer:

Path A — validate on KVM side first, then split:
  - Post v13 as a single series on the KVM list, gather feedback and
    make sure the design is acceptable to KVM reviewers.
  - Once v13 looks good ("the time comes"), do the MM/KVM split,
    rebase the MM part onto the appropriate MM branch, and post the
    MM part to linux-mm to build consensus with MM maintainers.

Path B — split early and seek MM consensus in parallel:
  - With the filemap rework already in place, do the MM/KVM split
    now and post the MM part to linux-mm directly.  The KVM part follows
    on top of a stable topic from MM.

Which of the two would you rather see?  Happy to go either way.


[1] https://lore.kernel.org/all/20260320-page_alloc-unmapped-v2-0-28bf1bd54f41@google.com/
[2] https://lore.kernel.org/all/20260506080753.14517-1-itazur@amazon.com/

Takahiro
