Re: [PATCH v3 0/1] arm64: Add workaround for Fujitsu A64FX erratum 010001

From: Catalin Marinas <catalin.marinas@arm.com>
To: "Zhang, Lei" <zhang.lei@jp.fujitsu.com>
Cc: 'Mark Rutland' <mark.rutland@arm.com>,
	"'james.morse@arm.com'" <james.morse@arm.com>,
	"'will.deacon@arm.com'" <will.deacon@arm.com>,
	"'linux-kernel@vger.kernel.org'" <linux-kernel@vger.kernel.org>,
	"'linux-arm-kernel@lists.infradead.org'"
	<linux-arm-kernel@lists.infradead.org>
Subject: Re: [PATCH v3 0/1] arm64: Add workaround for Fujitsu A64FX erratum 010001
Date: Tue, 29 Jan 2019 18:10:32 +0000
Message-ID: <20190129181032.GC224095@arrakis.emea.arm.com>
In-Reply-To: <8898674D84E3B24BA3A2D289B872026A6A2C04E6@G01JPEXMBKW03>

Hi,

Could you please copy the whole description from the cover letter to the
actual patch and only send one email (full description as in here
together with the patch)? If we commit this to the kernel, it would be
useful to have the information in the log for reference later on.

More comments below:

On Tue, Jan 29, 2019 at 12:29:58PM +0000, Zhang, Lei wrote:
> On some variants of the Fujitsu-A64FX cores (versions 1.0 and 1.1),
> memory accesses may cause an undefined fault (Data abort, DFSC=0b111111).
> This problem will be fixed in the next version of the Fujitsu-A64FX.
> 
> This fault occurs under a specific hardware condition
> when a load/store instruction performs an address translation using:
>   case-1  TTBR0_EL1 with TCR_EL1.NFD0 == 1.
>   case-2  TTBR0_EL2 with TCR_EL2.NFD0 == 1.
>   case-3  TTBR1_EL1 with TCR_EL1.NFD1 == 1.
>   case-4  TTBR1_EL2 with TCR_EL2.NFD1 == 1.
> The fault is completely spurious.

So this looks like new information on the hardware behaviour since the
v2 of the patch. Can this fault occur for any type of instruction
accessing the memory or only for SVE instructions?

> Since the kernel sets TCR_ELx.NFD1 to '1' in versions
> past 4.17, case-3 or case-4 may occur.
> 
> This fault can be taken only at stage-1, 
> so this fault is taken from EL0 to EL1/EL2, from EL1 to EL1, 
> or from EL2 to EL2.
> 
> I would like to post a workaround to avoid this problem on
> the existing Fujitsu-A64FX versions.

How likely is it to trigger this erratum? In other words, aren't we
better off with a spurious fault that we ignore rather than toggling the
TCR_ELx.NFD1 bit?
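
For illustration only, recognising the spurious fault so that it could be
ignored might look roughly like the sketch below. The helper name is
hypothetical and nothing here comes from the patch itself. The field
positions (EC in ESR_ELx bits [31:26], DFSC in bits [5:0]) are the
architectural encodings, and the DFSC value is the one quoted in the cover
letter. A real handler would presumably also check that the CPU is an
affected A64FX revision; that part is not shown.

```c
#include <stdbool.h>
#include <stdint.h>

/* Architectural ESR_ELx field encodings. */
#define ESR_EC_SHIFT        26
#define ESR_EC_MASK         0x3fULL
#define ESR_EC_DABT_LOW     0x24ULL /* data abort taken from a lower EL */
#define ESR_EC_DABT_CUR     0x25ULL /* data abort taken from the current EL */
#define ESR_DFSC_MASK       0x3fULL /* ISS bits [5:0] */

/* DFSC reported for the erratum according to the cover letter. */
#define A64FX_SPURIOUS_DFSC 0x3fULL

/*
 * Hypothetical helper: returns true if the syndrome value is a data abort
 * carrying the DFSC that the erratum description associates with the
 * spurious fault. It does not check which CPU took the fault.
 */
static bool a64fx_is_spurious_dabt(uint64_t esr)
{
	uint64_t ec = (esr >> ESR_EC_SHIFT) & ESR_EC_MASK;

	if (ec != ESR_EC_DABT_LOW && ec != ESR_EC_DABT_CUR)
		return false;

	return (esr & ESR_DFSC_MASK) == A64FX_SPURIOUS_DFSC;
}
```

With such a check the fault path could simply return and let the access be
retried, assuming the fault really is spurious and rare.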

> There are 2 points in this workaround.
> Point1: trap from EL1 to EL1, EL2 to EL2
> Set TCR_ELx.NFD1 to '0' at kernel entry,
> and set it back to '1' at kernel exit.
> 
> From the viewpoint of the ARM specification, there is no problem in
> resetting TCR_ELx.{NFD0,NFD1} while in EL1/EL2, because
> TCR_ELx.{NFD0,NFD1} controls whether to perform a translation
> table walk in response to an access from EL0.

The problem is that this bit may be cached in the TLB (I haven't checked
the ARM ARM but that's usually the case with the TCR_ELx bits). If
that's the case, you can't guarantee a change unless you also perform a
TLBI VMALL. Arguably, if Fujitsu's microarchitecture doesn't cache the
NFD bits in the TLB, we could apply the workaround but I'd rather have
the spurious trap if it doesn't happen too often.
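
To make the toggle concrete, here is a minimal sketch of the bit
manipulation involved, operating on a plain 64-bit value rather than the
real register so it can be compiled anywhere. The helper names are
hypothetical. The bit positions (NFD0 at bit 53, NFD1 at bit 54) are those
of TCR_EL1; the privileged steps are described only in the comment, and
whether the TLB invalidation is actually required on A64FX is exactly the
open question above.

```c
#include <stdint.h>

/* TCR_EL1 non-fault translation table walk disable bits. */
#define TCR_NFD0 (UINT64_C(1) << 53)
#define TCR_NFD1 (UINT64_C(1) << 54)

/*
 * Hypothetical helpers computing the new TCR value for kernel entry/exit.
 * On hardware the result would be written back with an MSR to TCR_EL1
 * followed by an ISB. If the NFD bits can be cached in the TLB, the write
 * would additionally need a TLB invalidation (for example TLBI VMALLE1
 * followed by DSB and ISB) before the change is guaranteed to take effect.
 */
static uint64_t tcr_clear_nfd1(uint64_t tcr)
{
	return tcr & ~TCR_NFD1;
}

static uint64_t tcr_set_nfd1(uint64_t tcr)
{
	return tcr | TCR_NFD1;
}
```

The cost of the extra invalidation on every kernel entry and exit is the
reason a rare spurious trap may be the cheaper option.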

> I confirmed that:
> ・There is no load/store instruction between 
>   tramp_ventry and setting TCR_ELx.NFD1 to '0'.
> ・There is no load/store instruction between 
>   setting TCR_ELx.NFD1 to '1' and tramp_exit.

Could speculative loads also trigger this? Another option would be to
toggle it during kernel_neon_begin/end (with the caveat of TLBI as
mentioned above).

-- 
Catalin
