Re: [PATCH] arm64: tlb: Fix TLBI RANGE operand

From: Marc Zyngier <maz@kernel.org>
To: Gavin Shan <gshan@redhat.com>
Cc: linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, catalin.marinas@arm.com,
	will@kernel.org, akpm@linux-foundation.org, apopple@nvidia.com,
	mark.rutland@arm.com, ryan.roberts@arm.com, rananta@google.com,
	yangyicong@hisilicon.com, v-songbaohua@oppo.com,
	yezhenyu2@huawei.com, yihyu@redhat.com, shan.gavin@gmail.com
Subject: Re: [PATCH] arm64: tlb: Fix TLBI RANGE operand
Date: Wed, 03 Apr 2024 14:44:21 +0100	[thread overview]
Message-ID: <87jzlesgqy.wl-maz@kernel.org> (raw)
In-Reply-To: <20134458-508b-4853-a962-c1b44009e8ed@redhat.com>

On Wed, 03 Apr 2024 12:37:30 +0100,
Gavin Shan <gshan@redhat.com> wrote:
> 
> On 4/3/24 18:58, Marc Zyngier wrote:
> > On Wed, 03 Apr 2024 07:49:29 +0100,
> > Gavin Shan <gshan@redhat.com> wrote:
> >> 
> >> KVM/arm64 relies on TLBI RANGE feature to flush TLBs when the dirty
> >> bitmap is collected by VMM and the corresponding PTEs need to be
> >> write-protected again. Unfortunately, the operand passed to the TLBI
> >> RANGE instruction isn't correctly sorted out by commit d1d3aa98b1d4
> >> ("arm64: tlb: Use the TLBI RANGE feature in arm64"). It leads to
> >> crash on the destination VM after live migration because some of the
> >> dirty pages are missed.
> >> 
> >> For example, I have a VM where 8GB memory is assigned, starting from
> >> 0x40000000 (1GB). Note that the host has 4KB as the base page size.
> >> All TLBs for VM can be covered by one TLBI RANGE operation. However,
> >> I receives 0xffff708000040000 as the operand, which is wrong and the
> >> correct one should be 0x00007f8000040000. From the wrong operand, we
> >> have 3 and 1 for SCALE (bits[45:44) and NUM (bits943:39], only 1GB
> >> instead of 8GB memory is covered.
> >> 
> >> Fix the macro __TLBI_RANGE_NUM() so that the correct NUM and TLBI
> >> RANGE operand are provided.
> >> 
> >> Fixes: d1d3aa98b1d4 ("arm64: tlb: Use the TLBI RANGE feature in arm64")
> >> Cc: stable@kernel.org # v5.10+
> >> Reported-by: Yihuang Yu <yihyu@redhat.com>
> >> Signed-off-by: Gavin Shan <gshan@redhat.com>
> >> ---
> >>   arch/arm64/include/asm/tlbflush.h | 2 +-
> >>   1 file changed, 1 insertion(+), 1 deletion(-)
> >> 
> >> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
> >> index 3b0e8248e1a4..07c4fb4b82b4 100644
> >> --- a/arch/arm64/include/asm/tlbflush.h
> >> +++ b/arch/arm64/include/asm/tlbflush.h
> >> @@ -166,7 +166,7 @@ static inline unsigned long get_trans_granule(void)
> >>    */
> >>   #define TLBI_RANGE_MASK			GENMASK_ULL(4, 0)
> >>   #define __TLBI_RANGE_NUM(pages, scale)	\
> >> -	((((pages) >> (5 * (scale) + 1)) & TLBI_RANGE_MASK) - 1)
> >> +	((((pages) >> (5 * (scale) + 1)) - 1) & TLBI_RANGE_MASK)
> >>     /*
> >>    *	TLB Invalidation
> > 
> > This looks pretty wrong, by the very definition of the comment that's
> > just above:
> > 
> > <quote>
> > /*
> >   * Generate 'num' values from -1 to 30 with -1 rejected by the
> >   * __flush_tlb_range() loop below.
> >   */
> > </quote>
> > 
> > With your change, num can't ever be negative, and that breaks
> > __flush_tlb_range_op():
> > 
> > <quote>
> > 		num = __TLBI_RANGE_NUM(pages, scale);			\
> > 		if (num >= 0) {						\
> > 			addr = __TLBI_VADDR_RANGE(start >> shift, asid, \
> > 						scale, num, tlb_level);	\
> > 			__tlbi(r##op, addr);				\
> > 			if (tlbi_user)					\
> > 				__tlbi_user(r##op, addr);		\
> > 			start += __TLBI_RANGE_PAGES(num, scale) << PAGE_SHIFT; \
> > 			pages -= __TLBI_RANGE_PAGES(num, scale);	\
> > 		}							\
> > 		scale--;						\
> > </quote>
> > 
> > We'll then shove whatever value we've found in the TLBI operation,
> > leading to unknown results instead of properly adjusting the scale to
> > issue a smaller invalidation.
> > 
> 
> Marc, thanks for your review and comments.
> 
> Indeed, this patch is incomplete at least. I think we need __TLBI_RANGE_NUM()
> to return [-1 31] instead of [-1 30], to be consistent with MAX_TLBI_RANGE_PAGES.
> -1 will be rejected in the following loop. I'm not 100% sure if I did the correct
> calculation though.
> 
> /*
>  * Generate 'num' values in range [-1 31], but -1 will be rejected
>  * by the __flush_tlb_range() loop below.
>  */
> #define __TLBI_RANGE_NUM(pages, scale)                                          \
>         ({                                                                      \
>                 int __next = (pages) & (1ULL << (5 * (scale) + 6));             \
>                 int __mask = ((pages) >> (5 * (scale) + 1)) & TLBI_RANGE_MASK;  \
>                 int __num = (((pages) >> (5 * (scale) + 1)) - 1) &              \
>                             TLBI_RANGE_MASK;                                    \
>                 (__next || __mask) ? __num : -1;                                \
>         })

I'm afraid I don't follow the logic here, and it looks awfully
complex.  I came up with something simpler with this:

diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index 3b0e8248e1a4..b3f1a9c61189 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -161,12 +161,18 @@ static inline unsigned long get_trans_granule(void)
 #define MAX_TLBI_RANGE_PAGES		__TLBI_RANGE_PAGES(31, 3)
 
 /*
- * Generate 'num' values from -1 to 30 with -1 rejected by the
+ * Generate 'num' values from -1 to 31 with -1 rejected by the
  * __flush_tlb_range() loop below.
  */
 #define TLBI_RANGE_MASK			GENMASK_ULL(4, 0)
-#define __TLBI_RANGE_NUM(pages, scale)	\
-	((((pages) >> (5 * (scale) + 1)) & TLBI_RANGE_MASK) - 1)
+#define __TLBI_RANGE_NUM(pages, scale)					\
+	({								\
+		int __pages = min((pages),				\
+				  __TLBI_RANGE_PAGES(31, (scale)));	\
+		int __numplus1 = __pages >> (5 * (scale) + 1);		\
+									\
+		(__numplus1 - 1);					\
+	})
 
 /*
  *	TLB Invalidation
@@ -379,10 +385,6 @@ static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
  * 3. If there is 1 page remaining, flush it through non-range operations. Range
  *    operations can only span an even number of pages. We save this for last to
  *    ensure 64KB start alignment is maintained for the LPA2 case.
- *
- * Note that certain ranges can be represented by either num = 31 and
- * scale or num = 0 and scale + 1. The loop below favours the latter
- * since num is limited to 30 by the __TLBI_RANGE_NUM() macro.
  */
 #define __flush_tlb_range_op(op, start, pages, stride,			\
 				asid, tlb_level, tlbi_user, lpa2)	\

> 
> Alternatively, we can also limit the number of pages to be invalidated from
> arch/arm64/kvm/hyp/pgtable.c::kvm_tlb_flush_vmid_range() because the maximal
> capacity is (MAX_TLBI_RANGE_PAGES - 1) instead of MAX_TLBI_RANGE_PAGES, as
> the comments for __flush_tlb_range_nosync() say.
> 
> -               inval_pages = min(pages, MAX_TLBI_RANGE_PAGES);
> +               inval_pages = min(pages, MAX_TLBI_RANGE_PAGES - 1);
> 
> 
> static inline void __flush_tlb_range_nosync(...)
> {
>         :
>        /*
>          * When not uses TLB range ops, we can handle up to
>          * (MAX_DVM_OPS - 1) pages;
>          * When uses TLB range ops, we can handle up to
>          * (MAX_TLBI_RANGE_PAGES - 1) pages.
>          */
>         if ((!system_supports_tlb_range() &&
>              (end - start) >= (MAX_DVM_OPS * stride)) ||
>             pages >= MAX_TLBI_RANGE_PAGES) {
>                 flush_tlb_mm(vma->vm_mm);
>                 return;
>         }
> }
> 
> Please let me know which way is better.

I would really prefer to fix the range stuff itself instead of
papering over the problem by reducing the reach of the range
invalidation.

> 
> > I think the problem is that you are triggering NUM=31 and SCALE=3,
> > which the current code cannot handle as per the comment above
> > __flush_tlb_range_op() (we can't do NUM=30 and SCALE=4, obviously).
> > 
> 
> Yes, exactly.
> 
> > Can you try the untested patch below?
> > 
> > 
> > diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
> > index 3b0e8248e1a4..b71a1cece802 100644
> > --- a/arch/arm64/include/asm/tlbflush.h
> > +++ b/arch/arm64/include/asm/tlbflush.h
> > @@ -379,10 +379,6 @@ static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
> >    * 3. If there is 1 page remaining, flush it through non-range operations. Range
> >    *    operations can only span an even number of pages. We save this for last to
> >    *    ensure 64KB start alignment is maintained for the LPA2 case.
> > - *
> > - * Note that certain ranges can be represented by either num = 31 and
> > - * scale or num = 0 and scale + 1. The loop below favours the latter
> > - * since num is limited to 30 by the __TLBI_RANGE_NUM() macro.
> >    */
> >   #define __flush_tlb_range_op(op, start, pages, stride,			\
> >   				asid, tlb_level, tlbi_user, lpa2)	\
> > @@ -407,6 +403,7 @@ do {									\
> >   									\
> >   		num = __TLBI_RANGE_NUM(pages, scale);			\
> >   		if (num >= 0) {						\
> > +			num += 1;					\
> >   			addr = __TLBI_VADDR_RANGE(start >> shift, asid, \
> >   						scale, num, tlb_level);	\
> >   			__tlbi(r##op, addr);				\
> > 
> 
> Thanks, but I don't think it's going to work. The loop will be running infinitely
> because the condition 'if (num >= 0)' can't be met when @pages is 0x200000 when
> @scale is 3/2/1/0 until @scale becomes negative and positive again, but @scale
> isn't in range [0 3]. I ported the chunk of code to user-space and I can see this
> with added printf() messages.

Yeah, we lose num==0, which is silly. Hopefully the hack above helps a
bit.

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel