From: Sean Christopherson <seanjc@google.com>
To: David Stevens <stevensd@chromium.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>,
	Yu Zhang <yu.c.zhang@linux.intel.com>,
	 Isaku Yamahata <isaku.yamahata@gmail.com>,
	Zhi Wang <zhi.wang.linux@gmail.com>,
	 Maxim Levitsky <mlevitsk@redhat.com>,
	kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org,
	 kvm@vger.kernel.org, Axel Rasmussen <axelrasmussen@google.com>
Subject: Re: [PATCH v11 0/8] KVM: allow mapping non-refcounted pages
Date: Fri, 15 Mar 2024 10:59:05 -0700
Message-ID: <ZfSMaUFa5hsPP-eR@google.com>
In-Reply-To: <ZfMxj_e7M_toVR3a@google.com>

On Thu, Mar 14, 2024, Sean Christopherson wrote:
> +Alex, who is looking at the huge-VM_PFNMAP angle in particular.

Oof, *Axel*.  Sorry Axel.

> On Thu, Mar 14, 2024, Sean Christopherson wrote:
> > -Christ{oph,ian} to avoid creating more noise...
> > 
> > On Thu, Mar 14, 2024, David Stevens wrote:
> > > Because of that, the specific type of pfns that don't work right now are
> > > pfn_valid() && !PG_Reserved && !page_ref_count() - what I called the
> > > non-refcounted pages in a bad choice of words. If that's correct, then
> > > perhaps this series should go a little bit further in modifying
> > > hva_to_pfn_remapped, but it isn't fundamentally wrong.
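
For the record, spelled out as code, the class of pfns being described is
essentially the below (just a sketch, example_is_unrefcounted_pfn() is a
made-up name):

static bool example_is_unrefcounted_pfn(kvm_pfn_t pfn)
{
	struct page *page;

	if (!pfn_valid(pfn))
		return false;

	page = pfn_to_page(pfn);

	/* A struct page that is neither reserved nor holds any references. */
	return !PageReserved(page) && !page_count(page);
}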
> > 
> > Loosely related to all of this, I have a mildly ambitious idea.  Well, one mildly
> > ambitious idea, and one crazy ambitious idea.  Crazy ambitious idea first...
> > 
> > Something we (GCE side of Google) have been eyeballing is adding support for huge
> > VM_PFNMAP memory, e.g. for mapping large amounts of device (a.k.a. GPU) memory
> > into guests using hugepages.  One of the hiccups is that follow_pte() doesn't play
> > nice with hugepages, at all, e.g. even has a "VM_BUG_ON(pmd_trans_huge(*pmd))".
> > Teaching follow_pte() to play nice with hugepages is probably doable, but making
> > sure all existing users are aware, maybe not so much.
> > 
> > My first (half baked, crazy ambitious) idea is to move away from follow_pte() and
> > get_user_page_fast_only() for mmu_notifier-aware lookups, i.e. that don't need
> > to grab references, and replace them with a new converged API that locklessly walks
> > host userspace page tables, and grabs the hugepage size along the way, e.g. so that
> > arch code wouldn't have to do a second walk of the page tables just to get the
> > hugepage size.
> > 
> > In other words, for the common case (mmu_notifier integration, no reference needed),
> > route hva_to_pfn_fast() into the new API and walk the userspace page tables (probably
> > only for write faults, to avoid CoW complications) before doing anything else.
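
Very roughly, the shape I'm imagining is something like this (names completely
made up, this isn't an existing mm/ API, just a sketch of the contract):

struct mm_pfn_lookup {
	unsigned long pfn;	/* host PFN backing the address */
	int level;		/* mapping level found during the walk (PTE/PMD/PUD) */
	bool writable;		/* true if the entry allows writes */
};

/*
 * Hypothetical: locklessly walk @mm's page tables for @addr, filling @out
 * with the PFN and the hugepage level so that arch code doesn't need a
 * second walk just to get the mapping size.  Returns -EAGAIN if the walk
 * loses a race with a concurrent teardown, in which case an
 * mmu_notifier-protected caller simply retries the fault.
 */
int mm_follow_pfn_lockless(struct mm_struct *mm, unsigned long addr,
			   bool write, struct mm_pfn_lookup *out);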
> > 
> > Uses of hva_to_pfn() that need to get a reference to the struct page couldn't be
> > converted, e.g. when stuffing physical addresses into the VMCS for nested virtualization.
> > But for everything else, grabbing a reference is a non-goal, i.e. actually "getting"
> > a user page is wasted effort and actively gets in the way.
> > 
> > I was initially hoping we could go super simple and use something like x86's
> > host_pfn_mapping_level(), but there are too many edge cases in gup() that need to
> > be respected, e.g. to avoid mapping memfd_secret pages into KVM guests.  I.e. the
> > API would need to be a formal mm-owned thing, not some homebrewed KVM implementation.
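
To illustrate one of those edge cases: secretmem VMAs must never end up mapped
into a guest, and gup() rejects them internally, whereas a naive page-table
walk would happily hand back the PFN.  A rough sketch of the kind of policy
check the mm-owned API would need to fold in (illustrative only, made-up helper
name; vma_is_secretmem() comes from <linux/secretmem.h>):

/* One of the gup() policies a homebrew walker would silently miss. */
static bool example_vma_allows_lockless_pfn(struct vm_area_struct *vma)
{
	if (vma_is_secretmem(vma))
		return false;

	return true;
}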
> > 
> > I can't tell if the payoff would be big enough to justify the effort involved, i.e.
> > having a single unified API for grabbing PFNs from the primary MMU might just be a
> > pie-in-the-sky type idea.
> > 
> > My second, less ambitious idea: the previously linked LWN[*] article about the
> > writeback issues reminded me of something that has bugged me for a long time.  IIUC,
> > getting a writable mapping from the primary MMU marks the page/folio dirty, and that
> > page/folio stays dirty until the data is written back and the mapping is made read-only.
> > And because KVM is tapped into the mmu_notifiers, KVM will be notified *before* the
> > RW=>RO conversion completes, i.e. before the page/folio is marked clean.
> > 
> > I _think_ that means that calling kvm_set_page_dirty() when zapping a SPTE (or
> > dropping any mmu_notifier-aware mapping) is completely unnecessary.  If that is the
> > case, _and_ we can weasel our way out of calling kvm_set_page_accessed() too, then
> > with FOLL_GET plumbed into hva_to_pfn(), we can:
> > 
> >   - Drop kvm_{set,release}_pfn_{accessed,dirty}(), because all callers of hva_to_pfn()
> >     that aren't tied into mmu_notifiers, i.e. aren't guaranteed to drop mappings
> >     before the page/folio is cleaned, will *know* that they hold a refcounted struct
> >     page.
> > 
> >   - Skip "KVM: x86/mmu: Track if sptes refer to refcounted pages" entirely, because
> >     KVM never needs to know if a SPTE points at a refcounted page.
> > 
> > In other words, double down on immediately doing put_page() after gup() if FOLL_GET
> > isn't specified, and naturally make all KVM MMUs compatible with pfn_valid() PFNs
> > that are acquired by follow_pte().
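
Roughly, hand-waving away the slow path and error handling (a sketch only, not
what the series actually does; hva_to_pfn_fast_sketch() is a made-up name):

static bool hva_to_pfn_fast_sketch(unsigned long hva, unsigned int foll,
				   kvm_pfn_t *pfn)
{
	struct page *page;

	if (!get_user_page_fast_only(hva, FOLL_WRITE, &page))
		return false;

	*pfn = page_to_pfn(page);

	/*
	 * mmu_notifier-protected callers don't need a pin; only keep the
	 * reference if the caller explicitly asked for one.
	 */
	if (!(foll & FOLL_GET))
		put_page(page);

	return true;
}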
> > 
> > I suspect we can simply mark pages as accessed when a page is retrieved from the primary
> > MMU, as marking a page accessed when it's *removed* from the guest is rather nonsensical.
> > E.g. if a page is mapped into the guest for a long time and it gets swapped out, marking
> > the page accessed when KVM drops its SPTEs in response to the swap adds no value.  And
> > through the mmu_notifiers, KVM already plays nice with setups that use idle page
> > tracking to make reclaim decisions.
> > 
> > [*] https://lwn.net/Articles/930667
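
I.e. marking the page accessed at map time would look something like this
(again just a sketch, example_mark_accessed_on_map() is a made-up helper):

/*
 * Mark the page accessed when KVM acquires the PFN from the primary MMU,
 * so that zapping the SPTE later (e.g. in response to swap or reclaim)
 * doesn't need to touch the struct page at all.
 */
static void example_mark_accessed_on_map(struct page *page)
{
	/*
	 * The guest is about to use the page, so this is where "accessed"
	 * is meaningful.  Idle page tracking still works because KVM's
	 * mmu_notifier hooks report young SPTEs on demand.
	 */
	mark_page_accessed(page);
}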

Thread overview: 37+ messages
2024-02-29  2:57 [PATCH v11 0/8] KVM: allow mapping non-refcounted pages David Stevens
2024-02-29  2:57 ` [PATCH v11 1/8] KVM: Assert that a page's refcount is elevated when marking accessed/dirty David Stevens
2024-02-29  2:57 ` [PATCH v11 2/8] KVM: Relax BUG_ON argument validation David Stevens
2024-02-29  2:57 ` [PATCH v11 3/8] KVM: mmu: Introduce kvm_follow_pfn() David Stevens
2024-02-29  2:57 ` [PATCH v11 4/8] KVM: mmu: Improve handling of non-refcounted pfns David Stevens
2024-02-29  2:57 ` [PATCH v11 5/8] KVM: Migrate kvm_vcpu_map() to kvm_follow_pfn() David Stevens
2024-02-29  2:57 ` [PATCH v11 6/8] KVM: x86: Migrate " David Stevens
2024-02-29  2:57 ` [PATCH v11 7/8] KVM: x86/mmu: Track if sptes refer to refcounted pages David Stevens
2024-02-29  2:57 ` [PATCH v11 8/8] KVM: x86/mmu: Handle non-refcounted pages David Stevens
2024-04-04 16:03   ` Dmitry Osipenko
2024-04-15  7:28     ` David Stevens
2024-04-15  9:36       ` Paolo Bonzini
2024-02-29 13:36 ` [PATCH v11 0/8] KVM: allow mapping " Christoph Hellwig
2024-03-13  4:55   ` David Stevens
2024-03-13  9:55     ` Christian König
2024-03-13 13:34       ` Sean Christopherson
2024-03-13 14:37         ` Christian König
2024-03-13 14:48           ` Sean Christopherson
     [not found]             ` <9e604f99-5b63-44d7-8476-00859dae1dc4@amd.com>
2024-03-13 15:09               ` Christian König
2024-03-13 15:47               ` Sean Christopherson
     [not found]                 ` <93df19f9-6dab-41fc-bbcd-b108e52ff50b@amd.com>
2024-03-13 17:26                   ` Sean Christopherson
     [not found]                     ` <c84fcf0a-f944-4908-b7f6-a1b66a66a6bc@amd.com>
2024-03-14  9:20                       ` Christian König
2024-03-14 11:31                         ` David Stevens
2024-03-14 11:51                           ` Christian König
2024-03-14 14:45                             ` Sean Christopherson
2024-03-18  1:26                             ` Christoph Hellwig
2024-03-18 13:10                               ` Paolo Bonzini
2024-03-18 23:20                                 ` Christoph Hellwig
2024-03-14 16:17                           ` Sean Christopherson
2024-03-14 17:19                             ` Sean Christopherson
2024-03-15 17:59                               ` Sean Christopherson [this message]
2024-03-20 20:54                                 ` Axel Rasmussen
2024-03-13 13:33     ` Christoph Hellwig
2024-06-21 18:32 ` Sean Christopherson
2024-07-31 11:41   ` Alex Bennée
2024-07-31 15:01     ` Sean Christopherson
2024-08-05 23:44     ` David Stevens
