Re: [PATCH v2 13/16] KVM: arm64: nv: Invalidate TLBs based on shadow S2 TTL-like information

From: Marc Zyngier <maz@kernel.org>
To: Oliver Upton <oliver.upton@linux.dev>
Cc: kvmarm@lists.linux.dev, kvm@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	James Morse <james.morse@arm.com>,
	Suzuki K Poulose <suzuki.poulose@arm.com>,
	Zenghui Yu <yuzenghui@huawei.com>,
	Joey Gouly <joey.gouly@arm.com>,
	Alexandru Elisei <alexandru.elisei@arm.com>,
	Christoffer Dall <christoffer.dall@arm.com>,
	Ganapatrao Kulkarni <gankulkarni@os.amperecomputing.com>
Subject: Re: [PATCH v2 13/16] KVM: arm64: nv: Invalidate TLBs based on shadow S2 TTL-like information
Date: Tue, 04 Jun 2024 08:59:49 +0100	[thread overview]
Message-ID: <87frtt2l56.wl-maz@kernel.org> (raw)
In-Reply-To: <Zl4NScV0E_YV7GR2@linux.dev>

On Mon, 03 Jun 2024 19:36:57 +0100,
Oliver Upton <oliver.upton@linux.dev> wrote:
> 
> On Wed, May 29, 2024 at 03:56:25PM +0100, Marc Zyngier wrote:

[...]

> > +/*
> > + * Compute the equivalent of the TTL field by parsing the shadow PT.  The
> > + * granule size is extracted from the cached VTCR_EL2.TG0 while the level is
> > + * retrieved from first entry carrying the level as a tag.
> > + */
> > +static u8 get_guest_mapping_ttl(struct kvm_s2_mmu *mmu, u64 addr)
> > +{
> 
> Can you add a lockdep assertion that the MMU lock is held for write
> here? At least for me this is far enough away from the 'real' page table
> walk that it wasn't clear what locks were held at this point.

Sure thing.

> 
> > +	u64 tmp, sz = 0, vtcr = mmu->tlb_vtcr;
> > +	kvm_pte_t pte;
> > +	u8 ttl, level;
> > +
> > +	switch (vtcr & VTCR_EL2_TG0_MASK) {
> > +	case VTCR_EL2_TG0_4K:
> > +		ttl = (TLBI_TTL_TG_4K << 2);
> > +		break;
> > +	case VTCR_EL2_TG0_16K:
> > +		ttl = (TLBI_TTL_TG_16K << 2);
> > +		break;
> > +	case VTCR_EL2_TG0_64K:
> > +	default:	    /* IMPDEF: treat any other value as 64k */
> > +		ttl = (TLBI_TTL_TG_64K << 2);
> > +		break;
> > +	}
> > +
> > +	tmp = addr;
> > +
> > +again:
> > +	/* Iteratively compute the block sizes for a particular granule size */
> > +	switch (vtcr & VTCR_EL2_TG0_MASK) {
> > +	case VTCR_EL2_TG0_4K:
> > +		if	(sz < SZ_4K)	sz = SZ_4K;
> > +		else if (sz < SZ_2M)	sz = SZ_2M;
> > +		else if (sz < SZ_1G)	sz = SZ_1G;
> > +		else			sz = 0;
> > +		break;
> > +	case VTCR_EL2_TG0_16K:
> > +		if	(sz < SZ_16K)	sz = SZ_16K;
> > +		else if (sz < SZ_32M)	sz = SZ_32M;
> > +		else			sz = 0;
> > +		break;
> > +	case VTCR_EL2_TG0_64K:
> > +	default:	    /* IMPDEF: treat any other value as 64k */
> > +		if	(sz < SZ_64K)	sz = SZ_64K;
> > +		else if (sz < SZ_512M)	sz = SZ_512M;
> > +		else			sz = 0;
> > +		break;
> > +	}
> > +
> > +	if (sz == 0)
> > +		return 0;
> > +
> > +	tmp &= ~(sz - 1);
> > +	if (kvm_pgtable_get_leaf(mmu->pgt, tmp, &pte, NULL))
> > +		goto again;
> 
> Assuming we're virtualizing a larger TG than what's in use at L0 this
> may not actually find a valid leaf that exists within the span of a
> single virtual TLB entry.
> 
> For example, if we're using 4K at L0 and 16K at L1, we could have:
> 
> 	[ ----- valid 16K entry ------- ]
> 
> mapped as:
> 
> 	[ ----- | ----- | valid | ----- ]
> 
> in the shadow S2. kvm_pgtable_get_leaf() will always return the first
> splintered page, which could be invalid.
> 
> What I'm getting at is: should this use a bespoke table walker that
> scans for a valid TTL in the range of [addr, addr + sz)? It may make
> sense to back off a bit more aggressively and switch to a conservative,
> unscoped TLBI to avoid visiting too many PTEs.

I had something along those lines at some point (circa 2019), and
quickly dropped it as it had a horrible "look-around" behaviour,
specially if the L1 S2 granule size is much larger than L0's (64k vs
4k). As you pointed out, it needs heuristics to limit the look-around,
which I don't find very satisfying.

Which is why the current code limits the search to be in depth only,
hoping for the head descriptor to be valid, and quickly backs off to
do a level-0 invalidation.

My preferred option would be to allow the use of non-valid entries to
cache the level (always using the first L0 entry that would map the L1
descriptor), but this opens another can of worms: you could end-up
with page table pages containing only invalid descriptors, except for
the presence of a level annotation, which screws the refcounting. I'd
very much like to see this rather than the look-around option.

Now, it is important to consider how useful this is. I expect modern
hypervisors to use either TTL-hinted (which we emulate even if the HW
doesn't support it) or Range-based invalidation in the vast majority
of the cases, so this would only help SW that hasn't got on with the
program.

Thoughts?

	M.

-- 
Without deviation from the norm, progress is not possible.

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel