From: Marc Zyngier <maz@kernel.org>
To: Ryan Roberts <ryan.roberts@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>,
Will Deacon <will@kernel.org>,
Oliver Upton <oliver.upton@linux.dev>,
Suzuki K Poulose <suzuki.poulose@arm.com>,
James Morse <james.morse@arm.com>,
Zenghui Yu <yuzenghui@huawei.com>,
Ard Biesheuvel <ardb@kernel.org>,
Anshuman Khandual <anshuman.khandual@arm.com>,
linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev
Subject: Re: [PATCH v4 02/12] arm64/mm: Update range-based tlb invalidation routines for FEAT_LPA2
Date: Thu, 19 Oct 2023 22:06:07 +0100 [thread overview]
Message-ID: <87y1fy5nls.wl-maz@kernel.org> (raw)
In-Reply-To: <20231009185008.3803879-3-ryan.roberts@arm.com>
On Mon, 09 Oct 2023 19:49:58 +0100,
Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> The BADDR field of the range-based tlbi instructions is specified in
> 64KB units when LPA2 is in use (TCR.DS=1), whereas it is in page units
> otherwise.
>
> When LPA2 is enabled, use the non-range tlbi instructions to forward
> align to a 64KB boundary first, then use range-based tlbi from there on,
> until we have either invalidated all pages or we have a single page
> remaining. If the latter, that is done with non-range tlbi. (Previously
> we invalidated a single odd page first, but we can no longer do this
> because it could wreck our 64KB alignment). When LPA2 is not in use, we
> don't need the initial alignemnt step. However, the bigger impact is
> that we can no longer use the previous method of iterating from smallest
> to largest 'scale', since this would likely unalign the boundary again
> for the LPA2 case. So instead we iterate from highest to lowest scale,
> which guarrantees that we remain 64KB aligned until the last op (at
> scale=0).
>
> The original commit (d1d3aa98 "arm64: tlb: Use the TLBI RANGE feature in
> arm64") stated this as the reason for incrementing scale:
>
> However, in most scenarios, the pages = 1 when flush_tlb_range() is
> called. Start from scale = 3 or other proper value (such as scale
> =ilog2(pages)), will incur extra overhead. So increase 'scale' from 0
> to maximum, the flush order is exactly opposite to the example.
>
> But pages=1 is already special cased by the non-range invalidation path,
> which will take care of it the first time through the loop (both in the
> original commit and in my change), so I don't think switching to
> decrement scale should have any extra performance impact after all.
Surely this can be benchmarked. After all, HW supporting range
invalidation is common enough these days.
>
> Note: This patch uses LPA2 range-based tlbi based on the new lpa2 param
> passed to __flush_tlb_range_op(). This allows both KVM and the kernel to
> opt-in/out of LPA2 usage independently. But once both are converted over
> (and keyed off the same static key), the parameter could be dropped and
> replaced by the static key directly in the macro.
Why can't this be done right away? Have a patch common to the two
series that exposes the static key, and use that from the start. This
would avoid the current (and rather ugly) extra parameter that I find
unnecessarily hard to parse.
And if the 64kB alignment above is cheap enough, maybe this could
become the one true way?
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
> arch/arm64/include/asm/tlb.h | 6 +++-
> arch/arm64/include/asm/tlbflush.h | 46 ++++++++++++++++++++-----------
> arch/arm64/kvm/hyp/nvhe/tlb.c | 2 +-
> arch/arm64/kvm/hyp/vhe/tlb.c | 2 +-
> 4 files changed, 37 insertions(+), 19 deletions(-)
>
> diff --git a/arch/arm64/include/asm/tlb.h b/arch/arm64/include/asm/tlb.h
> index 93c537635dbb..396ba9b4872c 100644
> --- a/arch/arm64/include/asm/tlb.h
> +++ b/arch/arm64/include/asm/tlb.h
> @@ -25,7 +25,6 @@ static void tlb_flush(struct mmu_gather *tlb);
> * get the tlbi levels in arm64. Default value is TLBI_TTL_UNKNOWN if more than
> * one of cleared_* is set or neither is set - this elides the level hinting to
> * the hardware.
> - * Arm64 doesn't support p4ds now.
> */
> static inline int tlb_get_level(struct mmu_gather *tlb)
> {
> @@ -48,6 +47,11 @@ static inline int tlb_get_level(struct mmu_gather *tlb)
> tlb->cleared_p4ds))
> return 1;
>
> + if (tlb->cleared_p4ds && !(tlb->cleared_ptes ||
> + tlb->cleared_pmds ||
> + tlb->cleared_puds))
> + return 0;
> +
> return TLBI_TTL_UNKNOWN;
> }
>
> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
> index e688246b3b13..4d34035fe7d6 100644
> --- a/arch/arm64/include/asm/tlbflush.h
> +++ b/arch/arm64/include/asm/tlbflush.h
> @@ -136,10 +136,14 @@ static inline unsigned long get_trans_granule(void)
> * The address range is determined by below formula:
> * [BADDR, BADDR + (NUM + 1) * 2^(5*SCALE + 1) * PAGESIZE)
> *
> + * If LPA2 is in use, BADDR holds addr[52:16]. Else BADDR holds page number.
> + * See ARM DDI 0487I.a C5.5.21.
Please update this to the latest published ARM ARM. I know it will be
obsolete quickly enough, but still. Also, "page number" is rather
imprecise, and doesn't match the language of the architecture.
> + *
> */
> -#define __TLBI_VADDR_RANGE(addr, asid, scale, num, ttl) \
> +#define __TLBI_VADDR_RANGE(addr, asid, scale, num, ttl, lpa2) \
> ({ \
> - unsigned long __ta = (addr) >> PAGE_SHIFT; \
> + unsigned long __addr_shift = lpa2 ? 16 : PAGE_SHIFT; \
> + unsigned long __ta = (addr) >> __addr_shift; \
> unsigned long __ttl = (ttl >= 1 && ttl <= 3) ? ttl : 0; \
> __ta &= GENMASK_ULL(36, 0); \
> __ta |= __ttl << 37; \
> @@ -354,34 +358,44 @@ static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
> * @tlb_level: Translation Table level hint, if known
> * @tlbi_user: If 'true', call an additional __tlbi_user()
> * (typically for user ASIDs). 'flase' for IPA instructions
> + * @lpa2: If 'true', the lpa2 scheme is used as set out below
> *
> * When the CPU does not support TLB range operations, flush the TLB
> * entries one by one at the granularity of 'stride'. If the TLB
> * range ops are supported, then:
> *
> - * 1. If 'pages' is odd, flush the first page through non-range
> - * operations;
> + * 1. If FEAT_LPA2 is in use, the start address of a range operation
> + * must be 64KB aligned, so flush pages one by one until the
> + * alignment is reached using the non-range operations. This step is
> + * skipped if LPA2 is not in use.
> *
> * 2. For remaining pages: the minimum range granularity is decided
> * by 'scale', so multiple range TLBI operations may be required.
> - * Start from scale = 0, flush the corresponding number of pages
> - * ((num+1)*2^(5*scale+1) starting from 'addr'), then increase it
> - * until no pages left.
> + * Start from scale = 3, flush the corresponding number of pages
> + * ((num+1)*2^(5*scale+1) starting from 'addr'), then descrease it
> + * until one or zero pages are left. We must start from highest scale
> + * to ensure 64KB start alignment is maintained in the LPA2 case.
Surely the algorithm is a bit more subtle than this, because always
starting with scale==3 means that you're invalidating at least 64k
*pages*, which is an awful lot (a minimum of 256MB?).
> + *
> + * 3. If there is 1 page remaining, flush it through non-range
> + * operations. Range operations can only span an even number of
> + * pages. We save this for last to ensure 64KB start alignment is
> + * maintained for the LPA2 case.
> *
> * Note that certain ranges can be represented by either num = 31 and
> * scale or num = 0 and scale + 1. The loop below favours the latter
> * since num is limited to 30 by the __TLBI_RANGE_NUM() macro.
> */
> #define __flush_tlb_range_op(op, start, pages, stride, \
> - asid, tlb_level, tlbi_user) \
> + asid, tlb_level, tlbi_user, lpa2) \
> do { \
> int num = 0; \
> - int scale = 0; \
> + int scale = 3; \
> unsigned long addr; \
> \
> while (pages > 0) { \
Not an issue with your patch, but we could be more robust here. If
'pages' is an unsigned quantity and what we have a bug in converging
to 0 below, we'll be looping for a long time. Not to mention the side
effects on pages and start.
> if (!system_supports_tlb_range() || \
> - pages % 2 == 1) { \
> + pages == 1 || \
> + (lpa2 && start != ALIGN(start, SZ_64K))) { \
> addr = __TLBI_VADDR(start, asid); \
> __tlbi_level(op, addr, tlb_level); \
> if (tlbi_user) \
> @@ -394,19 +408,19 @@ do { \
> num = __TLBI_RANGE_NUM(pages, scale); \
> if (num >= 0) { \
> addr = __TLBI_VADDR_RANGE(start, asid, scale, \
> - num, tlb_level); \
> + num, tlb_level, lpa2); \
> __tlbi(r##op, addr); \
> if (tlbi_user) \
> __tlbi_user(r##op, addr); \
> start += __TLBI_RANGE_PAGES(num, scale) << PAGE_SHIFT; \
> pages -= __TLBI_RANGE_PAGES(num, scale); \
> } \
> - scale++; \
> + scale--; \
> } \
> } while (0)
>
> -#define __flush_s2_tlb_range_op(op, start, pages, stride, tlb_level) \
> - __flush_tlb_range_op(op, start, pages, stride, 0, tlb_level, false)
> +#define __flush_s2_tlb_range_op(op, start, pages, stride, tlb_level, lpa2) \
> + __flush_tlb_range_op(op, start, pages, stride, 0, tlb_level, false, lpa2)
>
> static inline void __flush_tlb_range(struct vm_area_struct *vma,
> unsigned long start, unsigned long end,
> @@ -436,9 +450,9 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma,
> asid = ASID(vma->vm_mm);
>
> if (last_level)
> - __flush_tlb_range_op(vale1is, start, pages, stride, asid, tlb_level, true);
> + __flush_tlb_range_op(vale1is, start, pages, stride, asid, tlb_level, true, false);
> else
> - __flush_tlb_range_op(vae1is, start, pages, stride, asid, tlb_level, true);
> + __flush_tlb_range_op(vae1is, start, pages, stride, asid, tlb_level, true, false);
>
> dsb(ish);
> mmu_notifier_arch_invalidate_secondary_tlbs(vma->vm_mm, start, end);
> diff --git a/arch/arm64/kvm/hyp/nvhe/tlb.c b/arch/arm64/kvm/hyp/nvhe/tlb.c
> index 1b265713d6be..d42b72f78a9b 100644
> --- a/arch/arm64/kvm/hyp/nvhe/tlb.c
> +++ b/arch/arm64/kvm/hyp/nvhe/tlb.c
> @@ -198,7 +198,7 @@ void __kvm_tlb_flush_vmid_range(struct kvm_s2_mmu *mmu,
> /* Switch to requested VMID */
> __tlb_switch_to_guest(mmu, &cxt, false);
>
> - __flush_s2_tlb_range_op(ipas2e1is, start, pages, stride, 0);
> + __flush_s2_tlb_range_op(ipas2e1is, start, pages, stride, 0, false);
>
> dsb(ish);
> __tlbi(vmalle1is);
> diff --git a/arch/arm64/kvm/hyp/vhe/tlb.c b/arch/arm64/kvm/hyp/vhe/tlb.c
> index 46bd43f61d76..6041c6c78984 100644
> --- a/arch/arm64/kvm/hyp/vhe/tlb.c
> +++ b/arch/arm64/kvm/hyp/vhe/tlb.c
> @@ -161,7 +161,7 @@ void __kvm_tlb_flush_vmid_range(struct kvm_s2_mmu *mmu,
> /* Switch to requested VMID */
> __tlb_switch_to_guest(mmu, &cxt);
>
> - __flush_s2_tlb_range_op(ipas2e1is, start, pages, stride, 0);
> + __flush_s2_tlb_range_op(ipas2e1is, start, pages, stride, 0, false);
>
> dsb(ish);
> __tlbi(vmalle1is);
Thanks,
M.
--
Without deviation from the norm, progress is not possible.
next prev parent reply other threads:[~2023-10-19 21:06 UTC|newest]
Thread overview: 38+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-10-09 18:49 [PATCH v4 00/12] KVM: arm64: Support FEAT_LPA2 at hyp s1 and vm s2 Ryan Roberts
2023-10-09 18:49 ` [PATCH v4 01/12] arm64/mm: Update non-range tlb invalidation routines for FEAT_LPA2 Ryan Roberts
2023-10-19 8:03 ` Marc Zyngier
2023-10-19 9:22 ` Ryan Roberts
2023-10-20 8:05 ` Marc Zyngier
2023-10-20 12:39 ` Ryan Roberts
2023-10-20 13:02 ` Marc Zyngier
2023-10-20 13:21 ` Ryan Roberts
2023-10-20 13:41 ` Marc Zyngier
2023-10-09 18:49 ` [PATCH v4 02/12] arm64/mm: Update range-based " Ryan Roberts
2023-10-19 21:06 ` Marc Zyngier [this message]
2023-10-20 14:55 ` Ryan Roberts
2023-10-09 18:49 ` [PATCH v4 03/12] arm64/mm: Add FEAT_LPA2 specific ID_AA64MMFR0.TGRAN[2] Ryan Roberts
2023-10-09 18:50 ` [PATCH v4 04/12] KVM: arm64: Add ARM64_HAS_LPA2 CPU capability Ryan Roberts
2023-10-20 8:16 ` Marc Zyngier
2023-10-20 15:03 ` Ryan Roberts
2023-10-23 9:34 ` Marc Zyngier
2023-11-13 11:57 ` Ryan Roberts
2023-11-13 12:16 ` Marc Zyngier
2023-10-09 18:50 ` [PATCH v4 05/12] KVM: arm64: Add new (V)TCR_EL2 field definitions for FEAT_LPA2 Ryan Roberts
2023-10-09 18:50 ` [PATCH v4 06/12] KVM: arm64: Use LPA2 page-tables for stage2 and hyp stage1 Ryan Roberts
2023-10-20 9:16 ` Marc Zyngier
2023-10-20 15:06 ` Ryan Roberts
2023-10-23 9:36 ` Marc Zyngier
2023-10-09 18:50 ` [PATCH v4 07/12] KVM: arm64: Prepare TCR_EL2.PS in cpu_prepare_hyp_mode() Ryan Roberts
2023-10-20 9:21 ` Marc Zyngier
2023-10-20 15:07 ` Ryan Roberts
2023-10-09 18:50 ` [PATCH v4 08/12] KVM: arm64: Convert translation level parameter to s8 Ryan Roberts
2023-10-20 10:42 ` Marc Zyngier
2023-10-20 15:11 ` Ryan Roberts
2023-10-09 18:50 ` [PATCH v4 09/12] KVM: arm64: Support up to 5 levels of translation in kvm_pgtable Ryan Roberts
2023-10-09 18:50 ` [PATCH v4 10/12] KVM: arm64: Allow guests with >48-bit IPA size on FEAT_LPA2 systems Ryan Roberts
2023-10-09 18:50 ` [PATCH v4 11/12] KVM: selftests: arm64: Determine max ipa size per-page size Ryan Roberts
2023-10-09 18:50 ` [PATCH v4 12/12] KVM: selftests: arm64: Support P52V48 4K and 16K guest_modes Ryan Roberts
2023-10-20 10:54 ` [PATCH v4 00/12] KVM: arm64: Support FEAT_LPA2 at hyp s1 and vm s2 Marc Zyngier
2023-10-20 15:22 ` Ryan Roberts
2023-10-23 9:42 ` Marc Zyngier
2023-10-23 15:00 ` Ryan Roberts
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=87y1fy5nls.wl-maz@kernel.org \
--to=maz@kernel.org \
--cc=anshuman.khandual@arm.com \
--cc=ardb@kernel.org \
--cc=catalin.marinas@arm.com \
--cc=james.morse@arm.com \
--cc=kvmarm@lists.linux.dev \
--cc=linux-arm-kernel@lists.infradead.org \
--cc=oliver.upton@linux.dev \
--cc=ryan.roberts@arm.com \
--cc=suzuki.poulose@arm.com \
--cc=will@kernel.org \
--cc=yuzenghui@huawei.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox