kvm.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Sean Christopherson <seanjc@google.com>
To: David Matlack <dmatlack@google.com>
Cc: Peter Xu <peterx@redhat.com>, Paolo Bonzini <pbonzini@redhat.com>,
	kvm@vger.kernel.org, Ben Gardon <bgardon@google.com>,
	Joerg Roedel <joro@8bytes.org>, Jim Mattson <jmattson@google.com>,
	Wanpeng Li <wanpengli@tencent.com>,
	Vitaly Kuznetsov <vkuznets@redhat.com>,
	Janis Schoetterl-Glausch <scgl@linux.vnet.ibm.com>,
	Junaid Shahid <junaids@google.com>,
	Oliver Upton <oupton@google.com>,
	Harish Barathvajasankar <hbarath@google.com>,
	Peter Shier <pshier@google.com>
Subject: Re: [RFC PATCH 12/15] KVM: x86/mmu: Split large pages when dirty logging is enabled
Date: Wed, 1 Dec 2021 23:37:40 +0000	[thread overview]
Message-ID: <YagHRESjukJoS7NQ@google.com> (raw)
In-Reply-To: <CALzav=c9F+f=UqBjQD9sotNC72j2Gq1Fa=cdLoz2xOjRd5hypg@mail.gmail.com>

On Wed, Dec 01, 2021, David Matlack wrote:
> On Wed, Dec 1, 2021 at 10:29 AM Sean Christopherson <seanjc@google.com> wrote:
> > Hmm, in this particular case, I think using the caches is the wrong approach.  The
> > behavior of pre-filling the caches makes sense for vCPUs because faults may need
> > multiple objects and filling the cache ensures the entire fault can be handled
> > without dropping mmu_lock.  And any extra/unused objects can be used by future
> > faults.  For page splitting, neither of those really holds true.  If there are a
> > lot of pages to split, KVM will have to drop mmu_lock to refill the cache.  And if
> > there are few pages to split, or the caches are refilled toward the end of the walk,
> > KVM may end up with a pile of unused objects it needs to free.
> >
> > Since this code already needs to handle failure, and more importantly, it's a
> > best-effort optimization, I think trying to use the caches is a square peg, round
> > hole scenario.
> >
> > Rather than use the caches, we could do allocation 100% on-demand and never drop
> > mmu_lock to do allocation.  The one caveat is that direct reclaim would need to be
> > disallowed so that the allocation won't sleep.  That would mean that eager splitting
> > would fail under heavy memory pressure when it otherwise might succeed by reclaiming.
> > That would mean vCPUs get penalized as they'd need to do the splitting on fault and
> > potentially do direct reclaim as well.  It's not obvious that that would be a problem
> > in practice, e.g. the vCPU is probably already seeing a fair amount of disruption due
> > to memory pressure, and slowing down vCPUs might alleviate some of that pressure.
> 
> Not necessarily. The vCPUs might be running just fine in the VM being
> split because they are in their steady state and not faulting in any
> new memory. (Memory pressure might be coming from another VM landing
> on the host.)

Hrm, true.

> IMO, if we have an opportunity to avoid doing direct reclaim in the
> critical path of customer execution we should take it.
>
> 
> The on-demand approach will also increase the amount of time we have
> to hold the MMU lock to page splitting. This is not too terrible for
> the TDP MMU since we are holding the MMU lock in read mode, but is
> going to become a problem when we add page splitting support for the
> shadow MMU.
> 
> I do agree that the caches approach, as implemented, will inevitably
> end up with a pile of unused objects at the end that need to be freed.
> I'd be happy to take a look and see if there's anyway to reduce the
> amount of unused objects at the end with a bit smarter top-up logic.

It's not just the extra objects, it's the overall complexity that bothers me.
Complexity isn't really the correct word, it's more that as written, the logic
is spread over several files and is disingenuous from the perspective that the
split_cache is in kvm->arch, which implies persistence, but the cache are
completely torn down after evey memslot split.

I suspect part of the problem is that the code is trying to plan for a future
where nested MMUs also support splitting large pages.  Usually I'm all for that
sort of thing, but in this case it creates a lot of APIs that should not exist,
either because the function is not needed at all, or because it's a helper buried
in tdp_mmu.c.  E.g. assert_split_caches_invariants() is overkill.

That's solvable by refactoring and shuffling code, but using kvm_mmu_memory_cache
still feels wrong.  The caches don't fully solve the might_sleep() problem since
the loop still has to drop mmu_lock purely because it needs to allocate memory,
and at the same time the caches are too agressive because we can theoretically get
false positives on OOM scenarios, e.g. a topup could fail when trying to allocate
25 objects, when only 1 is needed.  We could enhance the cache code, which is
pretty rudimentary, but it still feels forced.

One thing we can take advantage of is that remote TLB flushes can be deferred
until after all roots are done, and don't need to be serviced if mmu_lock is
dropped.  Changes from a hugepage to a collection of smaller pages is atomic, no
memory is freed, and there are no changes in gfn=>pfn made by the split.  If
something else comes along and modifies the newly created sp or its children,
then it will flush accordingly.  Similar to write-protecting the page, the only
requirement is that all vCPUs see the small pages before the ioctl() returns,
i.e. before userspace can query the dirty log.  Never needing to flush is one
less reason to use a variant of tdp_mmu_iter_cond_resched(). 

So, what if we do something like this?  Try to allocate on-demand without dropping
mmu_lock.  In the happy case, it will succeed and there's no need to drop mmu_lock.
If allocation fails, drop RCU and mmu_lock and retry with direct relcaim allowed.

Some ugly gotos to reduce indentation, there's probably a better way to dress
this up.  Comments obviously needed.  This also doesn't track whether or not a
flush is needed, that will sadly need to be an in/out param, assuming we want to
return success/failure.

static struct kvm_mmu_page *tdp_mmu_alloc_sp(gfp_t allow_direct_reclaim)
{
	gfp_t gfp = GFP_KERNEL_ACCOUNT | __GFP_ZERO | allow_direct_reclaim;
	struct kvm_mmu_page *sp;
	u64 *spt;

	spt = (void *)__get_free_page(gfp);
	if (!spt)
		return NULL;

	sp = kmem_cache_alloc(mmu_page_header_cache, gfp);
	if (!sp) {
		free_page((unsigned long)spt);
		return NULL;
	}

	sp->spt = spt;

	return sp;
}

static int tdp_mmu_split_large_pages(struct kvm *kvm, struct kvm_mmu_page *root,
				     gfn_t start, gfn_t end, int target_level)
{
	struct kvm_mmu_page *sp = NULL;
	struct tdp_iter iter;

	rcu_read_lock();

	for_each_tdp_pte_min_level(iter, root, target_level + 1, start, end) {
retry:
		if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
			continue;

		if (!is_shadow_present_pte(iter.old_spte || !is_large_pte(pte))
			continue;

		if (likely(sp))
			goto do_split;

		sp = tdp_mmu_alloc_sp(0);
		if (!sp) {
			rcu_read_unlock();
			read_unlock(&kvm->mmu_lock);

			sp = tdp_mmu_alloc_sp(__GFP_DIRECT_RECLAIM);

			read_lock(&kvm->mmu_lock);

			if (!sp)
				return -ENOMEM;

			rcu_read_lock();
			tdp_iter_restart(iter);
			continue;
		}

do_split:
		init_tdp_mmu_page(sp, iter->gfn, get_child_page_role(&iter));

		if (!tdp_mmu_split_large_page(kvm, &iter, sp))
			goto retry;

		sp = NULL;
	}

	rcu_read_unlock();

	return 0;
}


  reply	other threads:[~2021-12-01 23:37 UTC|newest]

Thread overview: 77+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-11-19 23:57 [RFC PATCH 00/15] KVM: x86/mmu: Eager Page Splitting for the TDP MMU David Matlack
2021-11-19 23:57 ` [RFC PATCH 01/15] KVM: x86/mmu: Rename rmap_write_protect to kvm_vcpu_write_protect_gfn David Matlack
2021-11-22 18:52   ` Ben Gardon
2021-11-26 12:18   ` Peter Xu
2021-11-19 23:57 ` [RFC PATCH 02/15] KVM: x86/mmu: Rename __rmap_write_protect to rmap_write_protect David Matlack
2021-11-22 18:52   ` Ben Gardon
2021-11-26 12:18   ` Peter Xu
2021-11-19 23:57 ` [RFC PATCH 03/15] KVM: x86/mmu: Automatically update iter->old_spte if cmpxchg fails David Matlack
2021-11-22 18:52   ` Ben Gardon
2021-11-30 23:25     ` David Matlack
2021-11-19 23:57 ` [RFC PATCH 04/15] KVM: x86/mmu: Factor out logic to atomically install a new page table David Matlack
2021-11-22 18:52   ` Ben Gardon
2021-11-30 23:27     ` David Matlack
2021-12-01 19:13   ` Sean Christopherson
2021-12-01 21:52     ` David Matlack
2021-11-19 23:57 ` [RFC PATCH 05/15] KVM: x86/mmu: Abstract mmu caches out to a separate struct David Matlack
2021-11-22 18:55   ` Ben Gardon
2021-11-22 18:55     ` Ben Gardon
2021-11-30 23:28     ` David Matlack
2021-11-19 23:57 ` [RFC PATCH 06/15] KVM: x86/mmu: Derive page role from parent David Matlack
2021-11-20 12:53   ` Paolo Bonzini
2021-11-27  2:07     ` Lai Jiangshan
2021-11-27 10:26       ` Paolo Bonzini
2021-11-30 23:31     ` David Matlack
2021-12-01  0:45       ` Sean Christopherson
2021-12-01 21:56         ` David Matlack
2021-11-19 23:57 ` [RFC PATCH 07/15] KVM: x86/mmu: Pass in vcpu->arch.mmu_caches instead of vcpu David Matlack
2021-11-22 18:56   ` Ben Gardon
2021-11-19 23:57 ` [RFC PATCH 08/15] KVM: x86/mmu: Helper method to check for large and present sptes David Matlack
2021-11-22 18:56   ` Ben Gardon
2021-12-01 18:34   ` Sean Christopherson
2021-12-01 21:13     ` David Matlack
2021-11-19 23:57 ` [RFC PATCH 09/15] KVM: x86/mmu: Move restore_acc_track_spte to spte.c David Matlack
2021-11-22 18:56   ` Ben Gardon
2021-11-19 23:57 ` [RFC PATCH 10/15] KVM: x86/mmu: Abstract need_resched logic from tdp_mmu_iter_cond_resched David Matlack
2021-11-22 18:56   ` Ben Gardon
2021-11-19 23:57 ` [RFC PATCH 11/15] KVM: x86/mmu: Refactor tdp_mmu iterators to take kvm_mmu_page root David Matlack
2021-11-22 18:56   ` Ben Gardon
2021-11-19 23:57 ` [RFC PATCH 12/15] KVM: x86/mmu: Split large pages when dirty logging is enabled David Matlack
2021-11-22  5:05   ` Nikunj A. Dadhania
2021-11-30 23:33     ` David Matlack
2021-11-22 19:30   ` Ben Gardon
2021-11-30 23:44     ` David Matlack
2021-11-26 12:01   ` Peter Xu
2021-11-30 23:56     ` David Matlack
2021-12-01  1:00       ` Sean Christopherson
2021-12-01  1:29         ` David Matlack
2021-12-01  2:29           ` Peter Xu
2021-12-01 18:29             ` Sean Christopherson
2021-12-01 21:36               ` David Matlack
2021-12-01 23:37                 ` Sean Christopherson [this message]
2021-12-02 17:41                   ` David Matlack
2021-12-02 18:42                     ` Sean Christopherson
2021-12-03  0:00                       ` David Matlack
2021-12-03  1:07                         ` Sean Christopherson
2021-12-03 17:22                           ` David Matlack
2021-11-19 23:57 ` [RFC PATCH 13/15] KVM: x86/mmu: Split large pages during CLEAR_DIRTY_LOG David Matlack
2021-11-26 12:17   ` Peter Xu
2021-12-01  0:16     ` David Matlack
2021-12-01  0:17       ` David Matlack
2021-12-01  4:03         ` Peter Xu
2021-12-01 22:14           ` David Matlack
2021-12-03  4:57             ` Peter Xu
2021-12-01 19:22   ` Sean Christopherson
2021-12-01 19:49     ` Ben Gardon
2021-12-01 20:16       ` Sean Christopherson
2021-12-01 22:11         ` Ben Gardon
2021-12-01 22:17     ` David Matlack
2021-11-19 23:57 ` [RFC PATCH 14/15] KVM: x86/mmu: Add tracepoint for splitting large pages David Matlack
2021-11-19 23:57 ` [RFC PATCH 15/15] KVM: x86/mmu: Update page stats when " David Matlack
2021-12-01 19:36   ` Sean Christopherson
2021-12-01 21:11     ` David Matlack
2021-11-26 14:13 ` [RFC PATCH 00/15] KVM: x86/mmu: Eager Page Splitting for the TDP MMU Peter Xu
2021-11-30 23:22   ` David Matlack
2021-12-01  4:10     ` Peter Xu
2021-12-01  4:19       ` Peter Xu
2021-12-01 21:46       ` David Matlack

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=YagHRESjukJoS7NQ@google.com \
    --to=seanjc@google.com \
    --cc=bgardon@google.com \
    --cc=dmatlack@google.com \
    --cc=hbarath@google.com \
    --cc=jmattson@google.com \
    --cc=joro@8bytes.org \
    --cc=junaids@google.com \
    --cc=kvm@vger.kernel.org \
    --cc=oupton@google.com \
    --cc=pbonzini@redhat.com \
    --cc=peterx@redhat.com \
    --cc=pshier@google.com \
    --cc=scgl@linux.vnet.ibm.com \
    --cc=vkuznets@redhat.com \
    --cc=wanpengli@tencent.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).