Date: Wed, 10 Apr 2024 09:45:37 +0100
Message-ID: <86wmp5sj0u.wl-maz@kernel.org>
From: Marc Zyngier <maz@kernel.org>
To: Ryan Roberts <ryan.roberts@arm.com>
Cc: Gavin Shan <gshan@redhat.com>, linux-arm-kernel@lists.infradead.org,
 linux-kernel@vger.kernel.org, catalin.marinas@arm.com, will@kernel.org,
 akpm@linux-foundation.org, oliver.upton@linux.dev, apopple@nvidia.com,
 rananta@google.com, mark.rutland@arm.com, v-songbaohua@oppo.com,
 yangyicong@hisilicon.com, shahuang@redhat.com, yihyu@redhat.com,
 shan.gavin@gmail.com
Subject: Re: [PATCH v3 1/3] arm64: tlb: Fix TLBI RANGE operand
In-Reply-To: <7a929104-5f09-4ff6-8792-4a9e93bc0894@arm.com>
References: <20240405035852.1532010-1-gshan@redhat.com>
 <20240405035852.1532010-2-gshan@redhat.com>
 <7a929104-5f09-4ff6-8792-4a9e93bc0894@arm.com>
On Mon, 08 Apr 2024 09:29:31 +0100,
Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 05/04/2024 04:58, Gavin Shan wrote:
> > KVM/arm64 relies on the TLBI RANGE feature to flush TLBs when the
> > dirty pages are collected by the VMM and the page table entries
> > become write-protected during live migration. Unfortunately, the
> > operand passed to the TLBI RANGE instruction isn't correctly
> > computed, due to commit 117940aa6e5f ("KVM: arm64: Define
> > kvm_tlb_flush_vmid_range()"). This leads to a crash on the
> > destination VM after live migration because the TLBs aren't flushed
> > completely and some of the dirty pages are missed.
> >
> > For example, I have a VM with 8GB of memory assigned, starting from
> > 0x40000000 (1GB). Note that the host has 4KB as the base page size.
> > In the middle of migration, kvm_tlb_flush_vmid_range() is executed
> > to flush TLBs. It passes MAX_TLBI_RANGE_PAGES as the argument to
> > __kvm_tlb_flush_vmid_range() and __flush_s2_tlb_range_op(). SCALE#3
> > and NUM#31, corresponding to MAX_TLBI_RANGE_PAGES, aren't supported
> > by __TLBI_RANGE_NUM(). In this specific case, -1 is returned from
> > __TLBI_RANGE_NUM() for SCALE#3/2/1/0 and rejected by the loop in
> > __flush_tlb_range_op() until the variable @scale underflows to -9,
> > at which point 0xffff708000040000 is set as the operand. The
> > operand is wrong since it's computed by __TLBI_VADDR_RANGE() from
> > the invalid @scale and @num.
> >
> > Fix it by extending __TLBI_RANGE_NUM() to support the combination
> > of SCALE#3 and NUM#31. With the changes, [-1, 31] instead of
> > [-1, 30] can be returned from the macro, meaning the TLBs for
> > 0x200000 pages in the above example can be flushed in one shot with
> > SCALE#3 and NUM#31. The macro TLBI_RANGE_MASK is dropped since
> > nobody uses it any more. The comments are also adjusted accordingly.
>
> Perhaps I'm being overly pedantic, but I don't think the bug is
> __TLBI_RANGE_NUM() not being able to return 31; it is clearly
> documented that it can only return values in the range [-1, 30] and
> that a maximum of (MAX_TLBI_RANGE_PAGES - 1) pages is supported.

I guess "clearly" is pretty relative. I find it misleading that we
don't support the full range of what the architecture offers and have
these odd limitations.

> The bug is in the KVM caller, which tries to call
> __flush_tlb_range_op() with MAX_TLBI_RANGE_PAGES; clearly
> out of bounds.

Nobody disputes that point, hence the Fixes: tag pointing to the KVM
patch. But there are two ways to fix it: either reduce the amount KVM
can use for range invalidation, or fix the helper to allow the full
range offered by the architecture.

> So personally, I would prefer to fix the bug first, and then
> separately enhance the infrastructure to support NUM=31.

I don't think this buys us much, apart from making it harder for
people to know what they need/want/randomly-elect to backport.

> > Fixes: 117940aa6e5f ("KVM: arm64: Define kvm_tlb_flush_vmid_range()")
>
> I would argue that the bug was actually introduced by commit
> 360839027a6e ("arm64: tlb: Refactor the core flush algorithm of
> __flush_tlb_range"), which separated the tlbi loop from the range
> size validation in __flush_tlb_range(). Before this, all calls had
> to go through __flush_tlb_range(), and therefore anything bigger
> than (MAX_TLBI_RANGE_PAGES - 1) pages would cause the whole mm to be
> flushed. Although I get that bisect will lead to this one, so that's
> probably the right one to highlight.

I haven't tried to bisect it, only picked this as the obvious culprit.

To your point, using __flush_tlb_range() made little sense for KVM --
what would be the vma here? Splitting the helper out was, I think, the
correct decision. But we of course lost sight of the __TLBI_RANGE_NUM()
limitation in the process.
> I get why it was split, but perhaps it should have been split at a
> higher level; if TLBI range is not supported, then KVM will flush the
> whole vmid. Would it be better for KVM to follow the same pattern as
> __flush_tlb_range_nosync() and issue per-block tlbis up to a max of
> MAX_DVM_OPS before falling back to the whole vmid? And if TLBI range
> is supported, KVM uses it regardless of the size of the range,
> whereas __flush_tlb_range_nosync() falls back to flush_tlb_mm() at a
> certain size. It's not clear why this divergence is useful.

That'd be a definite improvement indeed, and would bring back some
much-needed consistency.

> > Cc: stable@kernel.org # v6.6+
> > Reported-by: Yihuang Yu <yihyu@redhat.com>
> > Suggested-by: Marc Zyngier <maz@kernel.org>
> > Signed-off-by: Gavin Shan <gshan@redhat.com>
>
> Anyway, the implementation looks correct, so:
>
> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>

Thanks for that!

	M.

-- 
Without deviation from the norm, progress is not possible.