From: Sean Christopherson <seanjc@google.com>
To: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
Paolo Bonzini <pbonzini@redhat.com>,
Vitaly Kuznetsov <vkuznets@redhat.com>,
Maxim Levitsky <mlevitsk@redhat.com>,
David Matlack <dmatlack@google.com>,
Lai Jiangshan <jiangshan.ljs@antgroup.com>
Subject: Re: [PATCH V3 04/12] KVM: X86/MMU: Add local shadow pages
Date: Wed, 20 Jul 2022 00:35:04 +0000 [thread overview]
Message-ID: <YtdNuLiGvzp7ZvoQ@google.com> (raw)
In-Reply-To: <20220521131700.3661-5-jiangshanlai@gmail.com>
On Sat, May 21, 2022, Lai Jiangshan wrote:
> +static struct kvm_mmu_page *
> +kvm_mmu_alloc_local_shadow_page(struct kvm_vcpu *vcpu, union kvm_mmu_page_role role)
Don't split the function name to a new line, even if it means running (well) over
the 80 char soft limit.
static struct kvm_mmu_page *kvm_mmu_alloc_per_vcpu_shadow_page(struct kvm_vcpu *vcpu,
union kvm_mmu_page_role role)
> +{
> + struct kvm_mmu_page *sp;
> +
> + sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
> + sp->gfn = 0;
Why explicitly zero gfn but not gfns? Either rely on __GFP_ZERO or don't, mixing
behavior is confusing. If there's an assumption that "gfn" be zero, e.g. due to
masking, then that would be a good WARN candidate.
> + sp->role = role;
> + /*
> + * Use the preallocated mmu->pae_root when the shadow page's
> + * level is PT32E_ROOT_LEVEL which may need to be put in the 32 bits
> + * CR3 (even in x86_64) or decrypted. The preallocated one is prepared
> + * for the requirements.
> + */
> + if (role.level == PT32E_ROOT_LEVEL &&
> + !WARN_ON_ONCE(!vcpu->arch.mmu->pae_root))
> + sp->spt = vcpu->arch.mmu->pae_root;
> + else
> + sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
> + /* sp->gfns is not used for local shadow page */
This comment isn't helpful as it doesn't provide any information as to _why_ gfns
isn't used. For simple enforcement, a KVM_BUG_ON() is much more effective as it
documents the underlying assumption, e.g.
KVM_BUG_ON(sp_has_gptes(sp), vcpu->kvm);
but I'm fairly confident that won't actually work, because sp_has_gptes() will
return true for pages that are backed by pae_root, i.e. are not passthrough.
In other words, this all subtly relies on the PDPTEs not being write-protected
and not being reachable through things like mmu_page_hash. I don't know that we
need to add a dedicated flag for these pages, but we need _something_ to document
what's going on.
Hmm, but if we do add kvm_mmu_page_role.per_vcpu, it would allow for code
consolidation, and I think it will yield more intuitive code. And sp_has_gptes()
is easy to fix.
> + set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
> + sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen;
I would prefer that kvm_mmu_alloc_per_vcpu_shadow_page() and kvm_mmu_alloc_page()
share common bits, and then add comments for the differences. For example, this
path fails to invoke kvm_mod_used_mmu_pages(), which arguably it should do when
not using pae_root, i.e. when it actually "allocates" a page.
I've always found it annoying/odd that kvm_mmu_alloc_page() adds the page to
active_mmu_pages, but the caller adds the page to mmu_page_hash. This is a good
excuse to fix that.
If role.per_vcpu is a thing, and is tracked in vcpu->arch.mmu->root_role, then
we can do:
if (level < PT32E_ROOT_LEVEL)
role.per_vcpu = 0;
/* Per-vCPU roots are (obviously) not tracked in the per-VM lists. */
if (unlikely(role.per_vcpu))
return kvm_mmu_alloc_page(vcpu, role, true, gfn);
sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
for_each_valid_sp(vcpu->kvm, sp, sp_list) {
...
}
++vcpu->kvm->stat.mmu_cache_miss;
sp = kvm_mmu_alloc_page(vcpu, role, gfn);
and kvm_mmu_alloc_page() becomes something like (completely untested, and I'm not
at all confident about the gfn logic).
static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu,
union kvm_mmu_page_role role,
gfn_t gfn)
{
struct kvm_mmu_page *sp;
sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
/*
* Use the preallocated mmu->pae_root when the shadow page's level is
* PT32E_ROOT_LEVEL. When using PAE paging, the backing page may need
* to have a 32-bit physical address (to go into a 32-bit CR3), and/or
* may need to be decrypted (!TDP + SME). The preallocated pae_root
* is prepared for said requirements.
*/
if (role.per_vcpu && role.level == PT32E_ROOT_LEVEL) {
sp->spt = vcpu->arch.mmu->pae_root;
memset(sp->spt, 0, sizeof(u64) * 4);
} else {
sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
}
if (sp_has_gptes(sp))
sp->gfns = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_gfn_array_cache);
WARN_ON_ONCE(role.per_vcpu && gfn);
sp->gfn = gfn;
sp->role = role;
set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
/*
* active_mmu_pages must be a FIFO list, as kvm_zap_obsolete_pages()
* depends on valid pages being added to the head of the list. See
* comments in kvm_zap_obsolete_pages().
*/
sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen;
kvm_mod_used_mmu_pages(vcpu->kvm, +1);
return sp;
}
> +
> + return sp;
> +}
> +
> static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu, int direct)
> {
> struct kvm_mmu_page *sp;
> @@ -2121,6 +2191,9 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> if (level <= vcpu->arch.mmu->cpu_role.base.level)
> role.passthrough = 0;
>
> + if (unlikely(level >= PT32E_ROOT_LEVEL && using_local_root_page(vcpu->arch.mmu)))
> + return kvm_mmu_alloc_local_shadow_page(vcpu, role);
> +
> sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
> for_each_valid_sp(vcpu->kvm, sp, sp_list) {
> if (sp->gfn != gfn) {
> @@ -3351,6 +3424,37 @@ static void mmu_free_root_page(struct kvm *kvm, hpa_t *root_hpa,
> *root_hpa = INVALID_PAGE;
> }
>
> +static void mmu_free_local_root_page(struct kvm *kvm, struct kvm_mmu *mmu)
> +{
> + u64 spte = mmu->root.hpa;
> + struct kvm_mmu_page *sp = to_shadow_page(spte & PT64_BASE_ADDR_MASK);
> + int i;
> +
> + /* Free level 5 or 4 roots for shadow NPT for 32 bit L1 */
> + while (sp->role.level > PT32E_ROOT_LEVEL)
Maybe a for-loop?
/* Free level 5 or 4 roots for shadow NPT for 32 bit L1 */
for (sp = to_shadow_page(mmu->root.hpa & PT64_BASE_ADDR_MASK);
sp->role.level > PT32E_ROOT_LEVEL;
sp = to_shadow_page(spte & PT64_BASE_ADDR_MASK)) {
> + {
> + spte = sp->spt[0];
> + mmu_page_zap_pte(kvm, sp, sp->spt + 0, NULL);
> + free_page((unsigned long)sp->spt);
> + kmem_cache_free(mmu_page_header_cache, sp);
Probably worth a helper for free_page()+kmem_cache_free(), especially if the
!pae_root case is accounted. And then we can combine with tdp_mmu_free_sp() if
we ever decide to fully account TDP MMU pages (to play nice with reclaim).
E.g.
static void __mmu_free_per_vcpu_root_page(struct kvm *kvm,
struct kvm_mmu_page *sp)
{
if (sp->spt != mmu->pae_root) {
free_page((unsigned long)sp->spt);
kvm_mod_used_mmu_pages(kvm, -1);
}
kmem_cache_free(mmu_page_header_cache, sp);
}
static void mmu_free_per_vcpu_root_page(struct kvm *kvm, struct kvm_mmu *mmu)
{
struct kvm_mmu_page *sp;
u64 spte;
int i;
/* Free level 5 or 4 roots for shadow NPT for 32 bit L1 */
for (sp = to_shadow_page(mmu->root.hpa & PT64_BASE_ADDR_MASK);
sp->role.level > PT32E_ROOT_LEVEL;
sp = to_shadow_page(spte & PT64_BASE_ADDR_MASK)) {
spte = sp->spt[0];
mmu_page_zap_pte(kvm, sp, sp->spt + 0, NULL);
__mmu_free_per_vcpu_root_page(kvm, sp);
if (!is_shadow_present_pte(spte))
return;
}
if (WARN_ON_ONCE(sp->role.level != PT32E_ROOT_LEVEL))
return;
/* Disconnect PAE root from the 4 PAE page directories */
for (i = 0; i < 4; i++)
mmu_page_zap_pte(kvm, sp, sp->spt + i, NULL);
__mmu_free_per_vcpu_root_page(kvm, sp);
}
> + if (!is_shadow_present_pte(spte))
> + return;
> + sp = to_shadow_page(spte & PT64_BASE_ADDR_MASK);
> + }
next prev parent reply other threads:[~2022-07-20 0:35 UTC|newest]
Thread overview: 30+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-05-21 13:16 [PATCH V3 00/12] KVM: X86/MMU: Use one-off local shadow page for special roots Lai Jiangshan
2022-05-21 13:16 ` [PATCH V3 01/12] KVM: X86/MMU: Verify PDPTE for nested NPT in PAE paging mode when page fault Lai Jiangshan
2022-07-19 21:17 ` Sean Christopherson
2022-05-21 13:16 ` [PATCH V3 02/12] KVM: X86/MMU: Add using_local_root_page() Lai Jiangshan
2022-05-26 21:28 ` David Matlack
2022-05-26 21:38 ` Sean Christopherson
2022-07-19 22:03 ` Sean Christopherson
2022-05-21 13:16 ` [PATCH V3 03/12] KVM: X86/MMU: Reduce a check in using_local_root_page() for common cases Lai Jiangshan
2022-05-21 13:16 ` [PATCH V3 04/12] KVM: X86/MMU: Add local shadow pages Lai Jiangshan
2022-05-26 21:38 ` David Matlack
2022-05-26 22:01 ` David Matlack
2022-07-20 0:35 ` Sean Christopherson [this message]
2022-05-21 13:16 ` [PATCH V3 05/12] KVM: X86/MMU: Link PAE root pagetable with its children Lai Jiangshan
2022-07-19 22:21 ` Sean Christopherson
2022-05-21 13:16 ` [PATCH V3 06/12] KVM: X86/MMU: Activate local shadow pages and remove old logic Lai Jiangshan
2022-05-21 13:16 ` [PATCH V3 07/12] KVM: X86/MMU: Remove the check of the return value of to_shadow_page() Lai Jiangshan
2022-07-19 22:42 ` Sean Christopherson
2022-05-21 13:16 ` [PATCH V3 08/12] KVM: X86/MMU: Allocate mmu->pae_root for PAE paging on-demand Lai Jiangshan
2022-07-19 23:08 ` Sean Christopherson
2022-07-20 0:07 ` Sean Christopherson
2022-05-21 13:16 ` [PATCH V3 09/12] KVM: X86/MMU: Move the verifying of NPT's PDPTE in FNAME(fetch) Lai Jiangshan
2022-07-19 23:21 ` Sean Christopherson
2022-05-21 13:16 ` [PATCH V3 10/12] KVM: X86/MMU: Remove unused INVALID_PAE_ROOT and IS_VALID_PAE_ROOT Lai Jiangshan
2022-07-19 23:11 ` Sean Christopherson
2022-05-21 13:16 ` [PATCH V3 11/12] KVM: X86/MMU: Don't use mmu->pae_root when shadowing PAE NPT in 64-bit host Lai Jiangshan
2022-07-19 23:26 ` Sean Christopherson
2022-07-19 23:27 ` Sean Christopherson
2022-05-21 13:17 ` [PATCH V3 12/12] KVM: X86/MMU: Remove mmu_alloc_special_roots() Lai Jiangshan
2022-05-26 8:49 ` [PATCH V3 00/12] KVM: X86/MMU: Use one-off local shadow page for special roots Lai Jiangshan
2022-05-26 20:27 ` David Matlack
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=YtdNuLiGvzp7ZvoQ@google.com \
--to=seanjc@google.com \
--cc=dmatlack@google.com \
--cc=jiangshan.ljs@antgroup.com \
--cc=jiangshanlai@gmail.com \
--cc=kvm@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=mlevitsk@redhat.com \
--cc=pbonzini@redhat.com \
--cc=vkuznets@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.