From mboxrd@z Thu Jan 1 00:00:00 1970
From: dingtianhong@huawei.com (Ding Tianhong)
Date: Tue, 26 Jan 2016 19:33:17 +0800
Subject: Unhandled level 2 translation fault on A72 board.
In-Reply-To: <20160126110358.GA23579@localhost.localdomain>
References: <56A72246.4050105@huawei.com> <20160126110358.GA23579@localhost.localdomain>
Message-ID: <56A7597D.6020609@huawei.com>
To: linux-arm-kernel@lists.infradead.org
List-Id: linux-arm-kernel.lists.infradead.org

On 2016/1/26 19:03, Catalin Marinas wrote:
> On Tue, Jan 26, 2016 at 03:37:42PM +0800, Ding Tianhong wrote:
>> I met this problem when running the hackbench test on A72 chip board:
>>
>> sh[4779]: unhandled level 2 translation fault (11) at 0x7f96be0c80, esr 0x83000006
>> pgd = ffffffc01a1f0000
>> [7f96be0c80] *pgd=0000000084a20003, *pud=0000000084a20003, *pmd=0000000000000000
>>
>> CPU: 1 PID: 4779 Comm: sh Tainted: G O 4.1.15+ #21
>> Hardware name: Hisilicon PhosphorHi1382 EVB (DT)
>> task: ffffffc0163cc500 ti: ffffffc083abc000 task.ti: ffffffc083abc000
>> PC is at 0x7f96be0c80
>> LR is at 0x7fb2684eb4
>> pc : [<0000007f96be0c80>] lr : [<0000007fb2684eb4>] pstate: 60000000
>
> So here it's user space trying to execute from 0x7f96be0c80 (instruction
> abort).
>
>> sh[4963]: unhandled level 2 translation fault (11) at 0x00000000, esr 0x92000006
>> pgd = ffffffc0180c6000
>> [00000000] *pgd=0000000015157003, *pud=0000000015157003, *pmd=0000000000000000
>>
>> CPU: 0 PID: 4963 Comm: sh Tainted: G O 4.1.15+ #21
>> Hardware name: Hisilicon PhosphorHi1382 EVB (DT)
>> task: ffffffc0163cb980 ti: ffffffc0840c8000 task.ti: ffffffc0840c8000
>> PC is at 0x42c0c8
>> LR is at 0x42c03c
>> pc : [<000000000042c0c8>] lr : [<000000000042c03c>] pstate: 80000000
>
> And here you have a null pointer dereference.
>
>> if I run the benchmark only on the core which is in the same cluster,
>> it looks fine and no error happened, but if I enable the core which in
>> the different cluster, it will happened.
>>
>> I remember that I met the same problem on the A57 and fix it by enable
>> the [bit6] of the CPUECTLR_EL1 and enable MN, But this time, I enable
>> the same setting and looks no effort, I have no idea about this
>> problem, does A57 and A72 has so big difference on TLB?
>
> I can't tell for sure it's a TLB issue. The kernel page table dump shows
> *pmd being 0, so the fault is correctly called "level 2 translation
> fault". It also seems that there is no vma at this address, hence the
> kernel reports it as unhandled. It looks like data corruption which
> could be caused by cache or TLB incoherence. Just make sure the
> interconnect linking the two clusters is configured correctly by
> _firmware_ before Linux starts.
>

Hi Catalin:

Thanks for the reply. I tried the following patch as a test:

---
 arch/arm64/kernel/process.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
index 6391485..d7d8439 100644
--- a/arch/arm64/kernel/process.c
+++ b/arch/arm64/kernel/process.c
@@ -283,6 +283,13 @@ static void tls_thread_switch(struct task_struct *next)
 	: : "r" (tpidr), "r" (tpidrro));
 }
 
+static void tlb_flush_thread(struct task_struct *prev)
+{
+	/* Flush the prev task's TLB entries */
+	if (prev->mm)
+		flush_tlb_mm(prev->mm);
+}
+
 /*
  * Thread switching.
  */
@@ -296,6 +303,8 @@ struct task_struct *__switch_to(struct task_struct *prev,
 	hw_breakpoint_thread_switch(next);
 	contextidr_thread_switch(next);
 
+	tlb_flush_thread(prev);
+
 	/*
 	 * Complete any pending TLB or cache maintenance on this CPU in case
 	 * the thread migrates to a different CPU.

hackbench runs fine with this patch applied, so I guess the previous
thread's TLB entries are not being invalidated in time, but I don't know
why. Everything is fine on A57. Am I missing something?

Ding