Re: [PATCH v3 1/3] arm64: tlb: Fix TLBI RANGE operand

From: Marc Zyngier <maz@kernel.org>
To: Ryan Roberts <ryan.roberts@arm.com>
Cc: Gavin Shan <gshan@redhat.com>,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, catalin.marinas@arm.com,
	will@kernel.org, akpm@linux-foundation.org,
	oliver.upton@linux.dev, apopple@nvidia.com, rananta@google.com,
	mark.rutland@arm.com, v-songbaohua@oppo.com,
	yangyicong@hisilicon.com, shahuang@redhat.com, yihyu@redhat.com,
	shan.gavin@gmail.com
Subject: Re: [PATCH v3 1/3] arm64: tlb: Fix TLBI RANGE operand
Date: Wed, 10 Apr 2024 09:45:37 +0100	[thread overview]
Message-ID: <86wmp5sj0u.wl-maz@kernel.org> (raw)
In-Reply-To: <7a929104-5f09-4ff6-8792-4a9e93bc0894@arm.com>

On Mon, 08 Apr 2024 09:29:31 +0100,
Ryan Roberts <ryan.roberts@arm.com> wrote:
> 
> On 05/04/2024 04:58, Gavin Shan wrote:
> > KVM/arm64 relies on TLBI RANGE feature to flush TLBs when the dirty
> > pages are collected by VMM and the page table entries become write
> > protected during live migration. Unfortunately, the operand passed
> > to the TLBI RANGE instruction isn't correctly sorted out due to the
> > commit 117940aa6e5f ("KVM: arm64: Define kvm_tlb_flush_vmid_range()").
> > It leads to crash on the destination VM after live migration because
> > TLBs aren't flushed completely and some of the dirty pages are missed.
> > 
> > For example, I have a VM where 8GB memory is assigned, starting from
> > 0x40000000 (1GB). Note that the host has 4KB as the base page size.
> > In the middile of migration, kvm_tlb_flush_vmid_range() is executed
> > to flush TLBs. It passes MAX_TLBI_RANGE_PAGES as the argument to
> > __kvm_tlb_flush_vmid_range() and __flush_s2_tlb_range_op(). SCALE#3
> > and NUM#31, corresponding to MAX_TLBI_RANGE_PAGES, isn't supported
> > by __TLBI_RANGE_NUM(). In this specific case, -1 has been returned
> > from __TLBI_RANGE_NUM() for SCALE#3/2/1/0 and rejected by the loop
> > in the __flush_tlb_range_op() until the variable @scale underflows
> > and becomes -9, 0xffff708000040000 is set as the operand. The operand
> > is wrong since it's sorted out by __TLBI_VADDR_RANGE() according to
> > invalid @scale and @num.
> > 
> > Fix it by extending __TLBI_RANGE_NUM() to support the combination of
> > SCALE#3 and NUM#31. With the changes, [-1 31] instead of [-1 30] can
> > be returned from the macro, meaning the TLBs for 0x200000 pages in the
> > above example can be flushed in one shoot with SCALE#3 and NUM#31. The
> > macro TLBI_RANGE_MASK is dropped since no one uses it any more. The
> > comments are also adjusted accordingly.
> 
> Perhaps I'm being overly pedantic, but I don't think the bug is
> __TLBI_RANGE_NUM() not being able to return 31; It is clearly documented that it
> can only return in the range [-1, 30] and a maximum of (MAX_TLBI_RANGE_PAGES -
> 1) pages are supported.

I guess "clearly" is pretty relative. I find it misleading that we
don't support the full range of what the architecture offers and have
these odd limitations.

> The bug is in the kvm caller, which tries to call __flush_tlb_range_op() with
> MAX_TLBI_RANGE_PAGES; clearly out-of-bounds.

Nobody disputes that point, hence the Fixes: tag pointing to the KVM
patch. But there are two ways to fix it: either reduce the amount KVM
can use for range invalidation, or fix the helper to allow the full
range offered by the architecture.

> So personally, I would prefer to fix the bug first. Then separately
> enhance the infrastructure to support NUM=31.

I don't think this buys us much, apart from making it harder for
people to know what they need/want/randomly-elect to backport.

> > Fixes: 117940aa6e5f ("KVM: arm64: Define kvm_tlb_flush_vmid_range()")
> 
> I would argue that the bug was actually introduced by commit 360839027a6e
> ("arm64: tlb: Refactor the core flush algorithm of __flush_tlb_range"), which
> separated the tlbi loop from the range size validation in __flush_tlb_range().
> Before this, all calls would have to go through __flush_tlb_range() and
> therefore anything bigger than (MAX_TLBI_RANGE_PAGES - 1) pages would cause the
> whole mm to be flushed. Although I get that bisect will lead to this one, so
> that's probably the right one to highlight.

I haven't tried to bisect it, only picked this as the obviously
culprit.

To your point, using __flush_tlb_range() made little sense for KVM --
what would be the vma here? Splitting the helper out was, I think the
correct decision. But we of course lost sight of the __TLBI_RANGE_NUM
limitation in the process.

> I get why it was split, but perhaps it should have been split at a higher level;
> If tlbi range is not supported, then KVM will flush the whole vmid. Would it be
> better for KVM to follow the same pattern as __flush_tlb_range_nosync() and
> issue per-block tlbis upto a max of MAX_DVM_OPS before falling back to the whole
> vmid? And if tlbi range is supported, KVM uses it regardless of the size of the
> range, whereas __flush_tlb_range_nosync() falls back to flush_tlb_mm() at a
> certain size. It's not clear why this divergence is useful?

That'd be a definitive improvement indeed, and would bring back some
much needed consistency.

> > Cc: stable@kernel.org # v6.6+
> > Reported-by: Yihuang Yu <yihyu@redhat.com>
> > Suggested-by: Marc Zyngier <maz@kernel.org>
> > Signed-off-by: Gavin Shan <gshan@redhat.com>
> 
> Anyway, the implementation looks correct, so:
> 
> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>

Thanks for that!

	M.

-- 
Without deviation from the norm, progress is not possible.

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel