* [RESEND PATCH 0/1] Fix the "hard LOCKUP" when running a heavy loading
@ 2015-11-03 8:10 Caesar Wang
2015-11-03 8:10 ` [RESEND PATCH] ARM: errata: Workaround for Cortex-A12 erratum 818325 Caesar Wang
` (3 more replies)
0 siblings, 4 replies; 12+ messages in thread
From: Caesar Wang @ 2015-11-03 8:10 UTC (permalink / raw)
To: linux-arm-kernel
The following log is from a case
where we experience a CPU hard lockup. The assembly code (disassembled by gdb):
0xc06c6e90 <__tcp_select_window+148>: beq 0xc06c6eb0 <__tcp_select_window+180>
0xc06c6e94 <__tcp_select_window+152>: mov r2, #1008 ; 0x3f0
0xc06c6e98 <__tcp_select_window+156>: ldr r5, [r0, #1004] ; 0x3ec
0xc06c6e9c <__tcp_select_window+160>: ldrh r2, [r0, r2]
....
0xc06c6ee0 <__tcp_select_window+228>: addne r0, r0, #1
0xc06c6ee4 <__tcp_select_window+232>: lslne r0, r0, r2
0xc06c6ee8 <__tcp_select_window+236>: ldmne sp, {r4, r5, r11, sp, pc}
Could either the 'strhi'/'strlo' pair, or the lslne/ldmne pair, be
tripping over erratum 818325, or a similar erratum?
0xc06c6eec <__tcp_select_window+240>: b 0xc06c6f40 <__tcp_select_window+324>
This patch can fix the *hard lockup* in some cases.
As Russell said:
"in other words, which can be handled by updating a control register in the firmware or
boot loader"
Maybe the better solution is in firmware.
Also, I'm not sure whether this workaround patch can be accepted.
I am resending this patch to get some suggestions from you.
--
Thanks!
Huang Tao (1):
ARM: errata: Workaround for Cortex-A12 erratum 818325
arch/arm/Kconfig | 13 +++++++++++++
arch/arm/mm/proc-v7.S | 12 ++++++++++++
2 files changed, 25 insertions(+)
--
1.9.1
^ permalink raw reply [flat|nested] 12+ messages in thread
* [RESEND PATCH] ARM: errata: Workaround for Cortex-A12 erratum 818325
2015-11-03 8:10 [RESEND PATCH 0/1] Fix the "hard LOCKUP" when running a heavy loading Caesar Wang
@ 2015-11-03 8:10 ` Caesar Wang
2015-11-03 8:45 ` Arnd Bergmann
2015-11-03 10:21 ` kbuild test robot
2015-11-03 10:41 ` [PATCH v1] " Caesar Wang
` (2 subsequent siblings)
3 siblings, 2 replies; 12+ messages in thread
From: Caesar Wang @ 2015-11-03 8:10 UTC (permalink / raw)
To: linux-arm-kernel
From: Huang Tao <huangtao@rock-chips.com>

On Cortex-A12 (r0p0..r0p1-00lac0-rc11), when a CPU executes a sequence of
two conditional store instructions with opposite condition code and
updating the same register, the system might enter a deadlock if the
second conditional instruction is an UNPREDICTABLE STR or STM
instruction. This workaround setting bit[12] of the Feature Register
prevents the erratum. This bit disables an optimisation applied to a
sequence of 2 instructions that use opposing condition codes.

Signed-off-by: Huang Tao <huangtao@rock-chips.com>
Signed-off-by: Kever Yang <kever.yang@rock-chips.com>
Signed-off-by: Caesar Wang <wxt@rock-chips.com>
---
 arch/arm/Kconfig      | 13 +++++++++++++
 arch/arm/mm/proc-v7.S | 12 ++++++++++++
 2 files changed, 25 insertions(+)

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 639411f..554b57a 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -1263,6 +1263,19 @@ config ARM_ERRATA_773022
 	  loop buffer may deliver incorrect instructions. This
 	  workaround disables the loop buffer to avoid the erratum.
 
+config ARM_ERRATA_818325
+	bool "ARM errata: Execution of an UNPREDICTABLE STR or STM instruction might deadlock"
+	depends on CPU_V7
+	help
+	  This option enables the workaround for the 818325 Cortex-A12
+	  (r0p0..r0p1-00lac0-rc11) erratum. When a CPU executes a sequence of
+	  two conditional store instructions with opposite condition code and
+	  updating the same register, the system might enter a deadlock if the
+	  second conditional instruction is an UNPREDICTABLE STR or STM
+	  instruction. This workaround setting bit[12] of the Feature Register
+	  prevents the erratum. This bit disables an optimisation applied to a
+	  sequence of 2 instructions that use opposing condition codes.
+
 endmenu
 
 source "arch/arm/common/Kconfig"
diff --git a/arch/arm/mm/proc-v7.S b/arch/arm/mm/proc-v7.S
index de2b246..2b338ec 100644
--- a/arch/arm/mm/proc-v7.S
+++ b/arch/arm/mm/proc-v7.S
@@ -439,6 +439,18 @@ __v7_setup_cont:
 	teq	r0, r10
 	beq	__ca9_errata
 
+	/* Cortex-A12 Errata */
+	ldr	r10, =0x00000c0d		@ Cortex-A12 primary part number
+	teq	r0, r10
+	bne	5f
+#ifdef CONFIG_ARM_ERRATA_818325
+	teq	r6, #0x00			@ present in r0p0
+	teqne	r6, #0x01			@ present in r0p1-00lac0-rc11
+	mrceq	p15, 0, r10, c15, c0, 1		@ read diagnostic register
+	orreq	r10, r10, #1 << 12		@ set bit #12
+	mcreq	p15, 0, r10, c15, c0, 1		@ write diagnostic register
+	isb
+#endif
 	/* Cortex-A15 Errata */
 	ldr	r10, =0x00000c0f		@ Cortex-A15 primary part number
 	teq	r0, r10
-- 
1.9.1
^ permalink raw reply related [flat|nested] 12+ messages in thread
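The revision check in the proc-v7.S hunk above may be easier to follow as a small model outside the kernel. The Python sketch below is illustrative only and is not part of the patch. It assumes the standard ARMv7 MIDR layout (variant in bits [23:20], primary part number in bits [15:4], revision in bits [3:0]) and assumes that r6 in the hunk holds (variant << 4) | revision. The function names are hypothetical. It cannot tell the r0p1-00lac0-rc11 release apart from other r0p1 parts, and neither can the assembly.

```python
CORTEX_A12_PART = 0xC0D        # primary part number the patch compares against
ERRATUM_818325_BIT = 1 << 12   # bit[12] of the diagnostic register, per the patch


def decode_midr(midr):
    """Split a Main ID Register value into (part, variant, revision)."""
    part = (midr >> 4) & 0xFFF
    variant = (midr >> 20) & 0xF
    revision = midr & 0xF
    return part, variant, revision


def needs_818325_workaround(midr):
    """Mirror the teq/teqne test: Cortex-A12 at r0p0 or r0p1 only."""
    part, variant, revision = decode_midr(midr)
    # Assumption: r6 in the hunk is (variant << 4) | revision.
    combined = (variant << 4) | revision
    return part == CORTEX_A12_PART and combined in (0x00, 0x01)


def apply_workaround(diag_value):
    """Model the read-modify-write of the diagnostic register."""
    return diag_value | ERRATUM_818325_BIT
```

For example, a MIDR of 0x410FC0D1 decodes to part 0xC0D, variant 0, revision 1 (r0p1), so the model reports that the bit would be set. A Cortex-A15 MIDR such as 0x410FC0F1 is skipped.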
* [RESEND PATCH] ARM: errata: Workaround for Cortex-A12 erratum 818325
2015-11-03 8:10 ` [RESEND PATCH] ARM: errata: Workaround for Cortex-A12 erratum 818325 Caesar Wang
@ 2015-11-03 8:45 ` Arnd Bergmann
2015-11-03 9:04 ` Caesar Wang
2015-11-03 10:21 ` kbuild test robot
1 sibling, 1 reply; 12+ messages in thread
From: Arnd Bergmann @ 2015-11-03 8:45 UTC (permalink / raw)
To: linux-arm-kernel

On Tuesday 03 November 2015 16:10:09 Caesar Wang wrote:
>
> +	/* Cortex-A12 Errata */
> +	ldr	r10, =0x00000c0d		@ Cortex-A12 primary part number
> +	teq	r0, r10
> +	bne	5f
> +#ifdef CONFIG_ARM_ERRATA_818325
> +	teq	r6, #0x00			@ present in r0p0
> +	teqne	r6, #0x01			@ present in r0p1-00lac0-rc11
> +	mrceq	p15, 0, r10, c15, c0, 1		@ read diagnostic register
> +	orreq	r10, r10, #1 << 12		@ set bit #12
> +	mcreq	p15, 0, r10, c15, c0, 1		@ write diagnostic register
> +	isb
> +#endif
> 	/* Cortex-A15 Errata */
>

Does this still build? You seem to have lost the '5:' label.

	Arnd
^ permalink raw reply [flat|nested] 12+ messages in thread
* [RESEND PATCH] ARM: errata: Workaround for Cortex-A12 erratum 818325
2015-11-03 8:45 ` Arnd Bergmann
@ 2015-11-03 9:04 ` Caesar Wang
0 siblings, 0 replies; 12+ messages in thread
From: Caesar Wang @ 2015-11-03 9:04 UTC (permalink / raw)
To: linux-arm-kernel

On 2015-11-03 16:45, Arnd Bergmann wrote:
> On Tuesday 03 November 2015 16:10:09 Caesar Wang wrote:
>> +	/* Cortex-A12 Errata */
>> +	ldr	r10, =0x00000c0d		@ Cortex-A12 primary part number
>> +	teq	r0, r10
>> +	bne	5f

beq __ca15_errata:

>> +#ifdef CONFIG_ARM_ERRATA_818325
>> +	teq	r6, #0x00			@ present in r0p0
>> +	teqne	r6, #0x01			@ present in r0p1-00lac0-rc11
>> +	mrceq	p15, 0, r10, c15, c0, 1		@ read diagnostic register
>> +	orreq	r10, r10, #1 << 12		@ set bit #12
>> +	mcreq	p15, 0, r10, c15, c0, 1		@ write diagnostic register
>> +	isb
>> +#endif
>> 	/* Cortex-A15 Errata */
>>
> Does this still build? You seem to have lost the '5:' label.

No, I had not built it on the next kernel.
Yes, the patch needs a small change because of this commit:

commit 17e7bf86690eaad4906d2295f0bd171cc194633b
Author: Russell King <rmk+kernel@arm.linux.org.uk>
Date: Sat Apr 4 21:34:33 2015 +0100

    ARM: proc-v7: move CPU errata out of line

-----
Original patch:
https://patchwork.kernel.org/patch/4735341/

Applied and verified on kernel V3.14.

>
> Arnd

--
Thanks,
Caesar
^ permalink raw reply [flat|nested] 12+ messages in thread
* [RESEND PATCH] ARM: errata: Workaround for Cortex-A12 erratum 818325
2015-11-03 8:10 ` [RESEND PATCH] ARM: errata: Workaround for Cortex-A12 erratum 818325 Caesar Wang
2015-11-03 8:45 ` Arnd Bergmann
@ 2015-11-03 10:21 ` kbuild test robot
1 sibling, 0 replies; 12+ messages in thread
From: kbuild test robot @ 2015-11-03 10:21 UTC (permalink / raw)
To: linux-arm-kernel

Hi Huang,

[auto build test ERROR on mvebu/for-next -- if it's inappropriate base, please suggest rules for selecting the more suitable base]

url: https://github.com/0day-ci/linux/commits/Caesar-Wang/ARM-errata-Workaround-for-Cortex-A12-erratum-818325/20151103-163417
config: arm-prima2_defconfig (attached as .config)
reproduce:
        wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=arm

All errors (new ones prefixed by >>):

   /tmp/ccjF0uyl.s: Assembler messages:
>> /tmp/ccjF0uyl.s: Error: local label `"5" (instance number 1 of a fb label)' is not defined

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
-------------- next part --------------
A non-text attachment was scrubbed...
Name: .config.gz
Type: application/octet-stream
Size: 12215 bytes
Desc: not available
URL: <http://lists.infradead.org/pipermail/linux-arm-kernel/attachments/20151103/a7c6ec6c/attachment.obj>
^ permalink raw reply [flat|nested] 12+ messages in thread
* [PATCH v1] ARM: errata: Workaround for Cortex-A12 erratum 818325
2015-11-03 8:10 [RESEND PATCH 0/1] Fix the "hard LOCKUP" when running a heavy loading Caesar Wang
2015-11-03 8:10 ` [RESEND PATCH] ARM: errata: Workaround for Cortex-A12 erratum 818325 Caesar Wang
@ 2015-11-03 10:41 ` Caesar Wang
2015-11-03 11:14 ` [RESEND PATCH 0/1] Fix the "hard LOCKUP" when running a heavy loading Russell King - ARM Linux
2015-11-03 11:30 ` Will Deacon
3 siblings, 0 replies; 12+ messages in thread
From: Caesar Wang @ 2015-11-03 10:41 UTC (permalink / raw)
To: linux-arm-kernel
From: Huang Tao <huangtao@rock-chips.com>

On Cortex-A12 (r0p0..r0p1-00lac0-rc11), when a CPU executes a sequence of
two conditional store instructions with opposite condition code and
updating the same register, the system might enter a deadlock if the
second conditional instruction is an UNPREDICTABLE STR or STM
instruction. This workaround setting bit[12] of the Feature Register
prevents the erratum. This bit disables an optimisation applied to a
sequence of 2 instructions that use opposing condition codes.

Signed-off-by: Huang Tao <huangtao@rock-chips.com>
Signed-off-by: Kever Yang <kever.yang@rock-chips.com>
Signed-off-by: Caesar Wang <wxt@rock-chips.com>
---
Changes in v1:
- fix the build error.

 arch/arm/Kconfig      | 13 +++++++++++++
 arch/arm/mm/proc-v7.S | 16 ++++++++++++++++
 2 files changed, 29 insertions(+)

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 639411f..554b57a 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -1263,6 +1263,19 @@ config ARM_ERRATA_773022
 	  loop buffer may deliver incorrect instructions. This
 	  workaround disables the loop buffer to avoid the erratum.
 
+config ARM_ERRATA_818325
+	bool "ARM errata: Execution of an UNPREDICTABLE STR or STM instruction might deadlock"
+	depends on CPU_V7
+	help
+	  This option enables the workaround for the 818325 Cortex-A12
+	  (r0p0..r0p1-00lac0-rc11) erratum. When a CPU executes a sequence of
+	  two conditional store instructions with opposite condition code and
+	  updating the same register, the system might enter a deadlock if the
+	  second conditional instruction is an UNPREDICTABLE STR or STM
+	  instruction. This workaround setting bit[12] of the Feature Register
+	  prevents the erratum. This bit disables an optimisation applied to a
+	  sequence of 2 instructions that use opposing condition codes.
+
 endmenu
 
 source "arch/arm/common/Kconfig"
diff --git a/arch/arm/mm/proc-v7.S b/arch/arm/mm/proc-v7.S
index de2b246..e95c83c 100644
--- a/arch/arm/mm/proc-v7.S
+++ b/arch/arm/mm/proc-v7.S
@@ -351,6 +351,17 @@ __ca9_errata:
 #endif
 	b	__errata_finish
 
+__ca12_errata:
+#ifdef CONFIG_ARM_ERRATA_818325
+	teq	r6, #0x00			@ present in r0p0
+	teqne	r6, #0x01			@ present in r0p1-00lac0-rc11
+	mrceq	p15, 0, r10, c15, c0, 1		@ read diagnostic register
+	orreq	r10, r10, #1 << 12		@ set bit #12
+	mcreq	p15, 0, r10, c15, c0, 1		@ write diagnostic register
+	isb
+#endif
+	b	__errata_finish
+
 __ca15_errata:
 #ifdef CONFIG_ARM_ERRATA_773022
 	cmp	r6, #0x4			@ only present up to r0p4
@@ -439,6 +450,11 @@ __v7_setup_cont:
 	teq	r0, r10
 	beq	__ca9_errata
 
+	/* Cortex-A12 Errata */
+	ldr	r10, =0x00000c0d		@ Cortex-A12 primary part number
+	teq	r0, r10
+	beq	__ca12_errata
+
 	/* Cortex-A15 Errata */
 	ldr	r10, =0x00000c0f		@ Cortex-A15 primary part number
 	teq	r0, r10
-- 
1.9.1
^ permalink raw reply related [flat|nested] 12+ messages in thread
* [RESEND PATCH 0/1] Fix the "hard LOCKUP" when running a heavy loading
2015-11-03 8:10 [RESEND PATCH 0/1] Fix the "hard LOCKUP" when running a heavy loading Caesar Wang
2015-11-03 8:10 ` [RESEND PATCH] ARM: errata: Workaround for Cortex-A12 erratum 818325 Caesar Wang
2015-11-03 10:41 ` [PATCH v1] " Caesar Wang
@ 2015-11-03 11:14 ` Russell King - ARM Linux
2015-11-03 12:00 ` Huang, Tao
2015-11-03 11:30 ` Will Deacon
3 siblings, 1 reply; 12+ messages in thread
From: Russell King - ARM Linux @ 2015-11-03 11:14 UTC (permalink / raw)
To: linux-arm-kernel

On Tue, Nov 03, 2015 at 04:10:08PM +0800, Caesar Wang wrote:
> As Russell said:
> "in other words, which can be handled by updating a control register in
> the firmware or boot loader"
> Maybe the better solution is in firmware.

The full quote is:

"I think we're at the point where we start insisting that workarounds
which are simple enable/disable feature bit operations (in other words,
which can be handled by updating a control register in the firmware or
boot loader) must be done that way, and we are not going to add such
workarounds to the kernel anymore."

The position hasn't changed. Workarounds such as this should be handled
in the firmware/boot loader before control is passed to the kernel.

The reason is very simple: if the C compiler can generate code which
triggers the bug, it can generate code which triggers the bug in the
boot loader. So, the only place such workarounds can be done is before
any C code gets executed. Putting such workarounds in the kernel is
completely inappropriate.

Sorry, I'm not going to accept this workaround into the kernel.

--
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.
^ permalink raw reply [flat|nested] 12+ messages in thread
* [RESEND PATCH 0/1] Fix the "hard LOCKUP" when running a heavy loading
2015-11-03 11:14 ` [RESEND PATCH 0/1] Fix the "hard LOCKUP" when running a heavy loading Russell King - ARM Linux
@ 2015-11-03 12:00 ` Huang, Tao
0 siblings, 0 replies; 12+ messages in thread
From: Huang, Tao @ 2015-11-03 12:00 UTC (permalink / raw)
To: linux-arm-kernel

Hello Russell:

On 2015-11-03 19:14, Russell King - ARM Linux wrote:
> On Tue, Nov 03, 2015 at 04:10:08PM +0800, Caesar Wang wrote:
>> As Russell said:
>> "in other words, which can be handled by updating a control register in
>> the firmware or boot loader"
>> Maybe the better solution is in firmware.
>
> The full quote is:
>
> "I think we're at the point where we start insisting that workarounds
> which are simple enable/disable feature bit operations (in other words,
> which can be handled by updating a control register in the firmware or
> boot loader) must be done that way, and we are not going to add such
> workarounds to the kernel anymore."
>
> The position hasn't changed. Workarounds such as this should be handled
> in the firmware/boot loader before control is passed to the kernel.
>
> The reason is very simple: if the C compiler can generate code which
> triggers the bug, it can generate code which triggers the bug in the
> boot loader. So, the only place such workarounds can be done is before
> any C code gets executed. Putting such workarounds in the kernel is
> completely inappropriate.

I agree with your reasoning for CPU0. But what about CPU1~3 if we don't
use any firmware, such as ARM Trusted Firmware, to take control of CPU
power-on? What if CPU1~3 run Linux code from their very first
instruction?

BTW, I don't want to argue with you about whether the workaround is
right or wrong, because I know the erratum only happens on r0p0, not
r0p1.

>
> Sorry, I'm not going to accept this workaround into the kernel.

It seems we should introduce some code outside the kernel to do such
initialization?
^ permalink raw reply [flat|nested] 12+ messages in thread
* [RESEND PATCH 0/1] Fix the "hard LOCKUP" when running a heavy loading
2015-11-03 8:10 [RESEND PATCH 0/1] Fix the "hard LOCKUP" when running a heavy loading Caesar Wang
` (2 preceding siblings ...)
2015-11-03 11:14 ` [RESEND PATCH 0/1] Fix the "hard LOCKUP" when running a heavy loading Russell King - ARM Linux
@ 2015-11-03 11:30 ` Will Deacon
2015-11-03 19:00 ` Doug Anderson
3 siblings, 1 reply; 12+ messages in thread
From: Will Deacon @ 2015-11-03 11:30 UTC (permalink / raw)
To: linux-arm-kernel

On Tue, Nov 03, 2015 at 04:10:08PM +0800, Caesar Wang wrote:
> The following log is from a case
> where we experience a CPU hard lockup. The assembly code (disassembled by gdb):
>
> 0xc06c6e90 <__tcp_select_window+148>: beq 0xc06c6eb0 <__tcp_select_window+180>
> 0xc06c6e94 <__tcp_select_window+152>: mov r2, #1008 ; 0x3f0
> 0xc06c6e98 <__tcp_select_window+156>: ldr r5, [r0, #1004] ; 0x3ec
> 0xc06c6e9c <__tcp_select_window+160>: ldrh r2, [r0, r2]
> ....
>
> 0xc06c6ee0 <__tcp_select_window+228>: addne r0, r0, #1
> 0xc06c6ee4 <__tcp_select_window+232>: lslne r0, r0, r2
> 0xc06c6ee8 <__tcp_select_window+236>: ldmne sp, {r4, r5, r11, sp, pc}
>
> Could either the 'strhi'/'strlo' pair, or the lslne/ldmne pair, be
> tripping over erratum 818325, or a similar erratum?

No. One of the conditions for #818325 is:

  The second instruction is an UNPREDICTABLE STR or STM (maximum two
  registers in the list) with write-back and the write-back register is
  in the list of stored registers.

I don't see either of those in your code snippet above, but then I don't
see your strhi/strlo either. What's going on?

> 0xc06c6eec <__tcp_select_window+240>: b 0xc06c6f40 <__tcp_select_window+324>
>
> This patch can fix the *hard lockup* in some cases.
>
> As Russell said:
> "in other words, which can be handled by updating a control register in the firmware or
> boot loader"

Russell is completely correct: this should be worked around in firmware.
There are a number of reasons for that:

  (1) You want the workaround enabled for all privilege and security
      levels, which means applying it before you enter the kernel.

  (2) If Linux boots in non-secure, then the workaround may silently
      fail to apply.

  (3) The CPU may have an ECO fix, in which case we wouldn't want to
      enable the workaround.

  (4) Some workarounds (albeit not this one, afaict) require changing
      CPU configuration that can only be done very early on, e.g.
      whilst "the memory system is idle".

Now, I appreciate that doing this in the kernel may be the easiest thing
for your particular SoC, but that doesn't necessarily mean that it's the
best thing to do in the mainline kernel. Whilst there *is* precedent for
this already, we've been trying to move away from setting these bits in
the kernel for the reasons mentioned above.

Will
^ permalink raw reply [flat|nested] 12+ messages in thread
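The condition Will quotes can be read as a rough filter over pairs of adjacent instructions. The Python sketch below is an illustration under stated assumptions and is not ARM's definition of the erratum. It models only two conditions named in this thread (opposing condition codes, and a second instruction that is an STR or STM with write-back whose write-back register is in a stored-register list of at most two). The Insn type, its field names, and the function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# hs/lo are assembler aliases for cs/cc.
ALIASES = {"hs": "cs", "lo": "cc"}

# Each ARM condition code and its logical opposite.
OPPOSITE = {
    "eq": "ne", "ne": "eq", "cs": "cc", "cc": "cs",
    "mi": "pl", "pl": "mi", "vs": "vc", "vc": "vs",
    "hi": "ls", "ls": "hi", "ge": "lt", "lt": "ge",
    "gt": "le", "le": "gt",
}


@dataclass(frozen=True)
class Insn:
    op: str                               # base mnemonic, e.g. "str", "ldm"
    cond: Optional[str] = None            # condition suffix, e.g. "ne"
    writeback_reg: Optional[str] = None   # base register updated by write-back
    reg_list: Tuple[str, ...] = ()        # registers stored or loaded


def second_insn_matches(insn):
    """The quoted condition on the second instruction of the pair."""
    return (
        insn.op in ("str", "stm")
        and insn.writeback_reg is not None
        and insn.writeback_reg in insn.reg_list
        and len(insn.reg_list) <= 2
    )


def pair_could_match(first, second):
    """Necessary (not sufficient) conditions taken from this thread."""
    if first.cond is None or second.cond is None:
        return False
    c1 = ALIASES.get(first.cond, first.cond)
    c2 = ALIASES.get(second.cond, second.cond)
    return OPPOSITE.get(c1) == c2 and second_insn_matches(second)


# The lslne/ldmne pair from the snippet: same condition, and a load.
lslne = Insn("lsl", cond="ne")
ldmne = Insn("ldm", cond="ne", reg_list=("r4", "r5", "r11", "sp", "pc"))
```

Under this model the lslne/ldmne pair from the cover letter is rejected twice over: both instructions use the same condition, and the second is a load rather than a store. Note also that hi and lo are not opposing conditions (the opposite of hi is ls), so a strhi/strlo pair would not pass this filter either.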
* [RESEND PATCH 0/1] Fix the "hard LOCKUP" when running a heavy loading
2015-11-03 11:30 ` Will Deacon
@ 2015-11-03 19:00 ` Doug Anderson
2015-11-06 12:17 ` Will Deacon
0 siblings, 1 reply; 12+ messages in thread
From: Doug Anderson @ 2015-11-03 19:00 UTC (permalink / raw)
To: linux-arm-kernel

Hi,

On Tue, Nov 3, 2015 at 3:30 AM, Will Deacon <will.deacon@arm.com> wrote:
> On Tue, Nov 03, 2015 at 04:10:08PM +0800, Caesar Wang wrote:
>> The following log is from a case
>> where we experience a CPU hard lockup. The assembly code (disassembled by gdb):
>>
>> 0xc06c6e90 <__tcp_select_window+148>: beq 0xc06c6eb0 <__tcp_select_window+180>
>> 0xc06c6e94 <__tcp_select_window+152>: mov r2, #1008 ; 0x3f0
>> 0xc06c6e98 <__tcp_select_window+156>: ldr r5, [r0, #1004] ; 0x3ec
>> 0xc06c6e9c <__tcp_select_window+160>: ldrh r2, [r0, r2]
>> ....
>>
>> 0xc06c6ee0 <__tcp_select_window+228>: addne r0, r0, #1
>> 0xc06c6ee4 <__tcp_select_window+232>: lslne r0, r0, r2
>> 0xc06c6ee8 <__tcp_select_window+236>: ldmne sp, {r4, r5, r11, sp, pc}
>>
>> Could either the 'strhi'/'strlo' pair, or the lslne/ldmne pair, be
>> tripping over erratum 818325, or a similar erratum?
>
> No. One of the conditions for #818325 is:
>
> The second instruction is an UNPREDICTABLE STR or STM (maximum two
> registers in the list) with write-back and the write-back register is
> in the list of stored registers.
>
> I don't see either of those in your code snippet above, but then I don't
> see your strhi/strlo either. What's going on?

It looks like Caesar is proposing that this erratum is the root cause
for some hard lockups we're seeing on rk3288 Chromebooks. I agree with
folks here that say this isn't terribly likely, but I always like to be
proven wrong. ;)

We've got code that samples / prints CPU_DBGPCSR at the time of a hard
lockup. That register isn't 100% accurate about where a CPU is, but
it's better than nothing (technically there may be ways to actually use
the DBG registers to stop the remote CPU and maybe give more info, but I
digress). When CPUs are hard locked up, they are often found at:

<c0117c8c> v7_coherent_kern_range+0x58/0x74
or
<c0118278> v7wbi_flush_user_tlb_range+0x30/0x38

That made me think that an erratum might be the root cause of our hard
lockups, since ARM errata often trigger in cache/tlb functions. I
think Caesar dug up this old erratum fix in response to my suggestion.

If you know of any ARM errata that might trigger hard lockups like
this, I'd certainly be all ears. It's also possible that we've got
something running at too low of a voltage or we've got clock dividers
or cache timings programmed incorrectly somewhere. To give a more
full disassembly of one of the crashes:

<4>[ 1623.480846] SMP: failed to stop secondary CPUs
<3>[ 1623.480862] CPU1 PC: <c01827e8> __unqueue_futex+0x68/0x88
<3>[ 1623.480879] CPU2 PC: <c0117c8c> v7_coherent_kern_range+0x58/0x74
<3>[ 1623.480895] CPU3 PC: <c0118268> v7wbi_flush_user_tlb_range+0x20/0x38

---

   c01827dc:  e2841010  add    r1, r4, #16
   c01827e0:  e2445004  sub    r5, r4, #4
   c01827e4:  eb068d33  bl     c0325cb8 <plist_del> (File Offset: 0x235cb8)
=> c01827e8:  f595f000  pldw   [r5]
   c01827ec:  e1953f9f  ldrex  r3, [r5]
   c01827f0:  e2433001  sub    r3, r3, #1
   c01827f4:  e1852f93  strex  r2, r3, [r5]
   c01827f8:  e3320000  teq    r2, #0
   c01827fc:  1afffffa  bne    c01827ec <__unqueue_futex+0x6c> (File Offset: 0x927ec)
   c0182800:  e89da830  ldm    sp, {r4, r5, fp, sp, pc}

---

   c0117c80:  e08cc002  add    ip, ip, r2
   c0117c84:  e15c0001  cmp    ip, r1
   c0117c88:  3afffffb  bcc    c0117c7c <v7_coherent_kern_range+0x48> (File Offset: 0x27c7c)
=> c0117c8c:  e3a00000  mov    r0, #0
   c0117c90:  ee070fd1  mcr    15, 0, r0, cr7, cr1, {6}
   c0117c94:  f57ff04a  dsb    ishst
   c0117c98:  f57ff06f  isb    sy
   c0117c9c:  e1a0f00e  mov    pc, lr

---

   c0118260:  e1830600  orr    r0, r3, r0, lsl #12
   c0118264:  e1a01601  lsl    r1, r1, #12
=> c0118268:  ee080f33  mcr    15, 0, r0, cr8, cr3, {1}
   c011826c:  e2800a01  add    r0, r0, #4096 ; 0x1000
   c0118270:  e1500001  cmp    r0, r1
   c0118274:  3afffffb  bcc    c0118268 <v7wbi_flush_user_tlb_range+0x20> (File Offset: 0x28268)
   c0118278:  f57ff04b  dsb    ish
   c011827c:  e1a0f00e  mov    pc, lr
^ permalink raw reply [flat|nested] 12+ messages in thread
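When comparing several such lockup reports, it can help to turn the "CPUn PC:" lines into structured records. The Python sketch below is one possible way to do that, assuming the lines look exactly like the three in Doug's log (the format comes from the Chromium OS code he mentions, so other kernels may print something different). The function and field names are hypothetical.

```python
import re

# Matches e.g. "<3>[ 1623.480862] CPU1 PC: <c01827e8> __unqueue_futex+0x68/0x88"
PC_LINE = re.compile(
    r"CPU(?P<cpu>\d+) PC: <(?P<pc>[0-9a-fA-F]+)> "
    r"(?P<symbol>\w+)\+0x(?P<offset>[0-9a-fA-F]+)/0x(?P<size>[0-9a-fA-F]+)"
)


def parse_pc_line(line):
    """Return a dict describing one 'CPUn PC:' line, or None if it does not match."""
    match = PC_LINE.search(line)
    if match is None:
        return None
    pc = int(match.group("pc"), 16)
    offset = int(match.group("offset"), 16)
    return {
        "cpu": int(match.group("cpu")),
        "pc": pc,
        "symbol": match.group("symbol"),
        "offset": offset,
        "size": int(match.group("size"), 16),
        # Start address of the function, useful for lining up with objdump output.
        "symbol_start": pc - offset,
    }
```

For the CPU1 line this gives a function start of 0xc0182780, which agrees with the `bne c01827ec <__unqueue_futex+0x6c>` target in the disassembly above (0xc01827ec - 0x6c).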
* [RESEND PATCH 0/1] Fix the "hard LOCKUP" when running a heavy loading
2015-11-03 19:00 ` Doug Anderson
@ 2015-11-06 12:17 ` Will Deacon
2015-11-09 4:39 ` Doug Anderson
0 siblings, 1 reply; 12+ messages in thread
From: Will Deacon @ 2015-11-06 12:17 UTC (permalink / raw)
To: linux-arm-kernel

On Tue, Nov 03, 2015 at 11:00:20AM -0800, Doug Anderson wrote:
> Hi,

Hey Doug,

> When CPUs are hard locked up, they are often found at:
>
> <c0117c8c> v7_coherent_kern_range+0x58/0x74
> or
> <c0118278> v7wbi_flush_user_tlb_range+0x30/0x38
>
> That made me think that an erratum might be the root cause of our hard
> lockups, since ARM errata often trigger in cache/tlb functions. I
> think Caesar dug up this old erratum fix in response to my suggestion.

I still don't see how 818325 is related, since there aren't any conditional
stores in the sequences below.

> If you know of any ARM errata that might trigger hard lockups like
> this, I'd certainly be all ears. It's also possible that we've got
> something running at too low of a voltage or we've got clock dividers
> or cache timings programmed incorrectly somewhere. To give a more
> full disassembly of one of the crashes:
>
> <4>[ 1623.480846] SMP: failed to stop secondary CPUs
> <3>[ 1623.480862] CPU1 PC: <c01827e8> __unqueue_futex+0x68/0x88
> <3>[ 1623.480879] CPU2 PC: <c0117c8c> v7_coherent_kern_range+0x58/0x74
> <3>[ 1623.480895] CPU3 PC: <c0118268> v7wbi_flush_user_tlb_range+0x20/0x38
>
> ---

Do you have any register values for these CPUs?

>    c01827dc:  e2841010  add    r1, r4, #16
>    c01827e0:  e2445004  sub    r5, r4, #4
>    c01827e4:  eb068d33  bl     c0325cb8 <plist_del> (File
> Offset: 0x235cb8)
> => c01827e8:  f595f000  pldw   [r5]
>    c01827ec:  e1953f9f  ldrex  r3, [r5]
>    c01827f0:  e2433001  sub    r3, r3, #1
>    c01827f4:  e1852f93  strex  r2, r3, [r5]
>    c01827f8:  e3320000  teq    r2, #0
>    c01827fc:  1afffffa  bne    c01827ec
> <__unqueue_futex+0x6c> (File Offset: 0x927ec)
>    c0182800:  e89da830  ldm    sp, {r4, r5, fp, sp, pc}

For example, the futex address in r5 ...

>    c0117c80:  e08cc002  add    ip, ip, r2
>    c0117c84:  e15c0001  cmp    ip, r1
>    c0117c88:  3afffffb  bcc    c0117c7c
> <v7_coherent_kern_range+0x48> (File Offset: 0x27c7c)
> => c0117c8c:  e3a00000  mov    r0, #0
>    c0117c90:  ee070fd1  mcr    15, 0, r0, cr7, cr1, {6}
>    c0117c94:  f57ff04a  dsb    ishst
>    c0117c98:  f57ff06f  isb    sy
>    c0117c9c:  e1a0f00e  mov    pc, lr

... the address in r0 for the cache maintenance ...

>    c0118260:  e1830600  orr    r0, r3, r0, lsl #12
>    c0118264:  e1a01601  lsl    r1, r1, #12
> => c0118268:  ee080f33  mcr    15, 0, r0, cr8, cr3, {1}
>    c011826c:  e2800a01  add    r0, r0, #4096 ; 0x1000
>    c0118270:  e1500001  cmp    r0, r1
>    c0118274:  3afffffb  bcc    c0118268
> <v7wbi_flush_user_tlb_range+0x20> (File Offset: 0x28268)
>    c0118278:  f57ff04b  dsb    ish
>    c011827c:  e1a0f00e  mov    pc, lr

... and the address in r0 for the TLBI.

Are the cores executing instructions at this point, or by "hard LOCKUP"
do you mean that they're deadlocked in hardware?

Will
^ permalink raw reply [flat|nested] 12+ messages in thread
* [RESEND PATCH 0/1] Fix the "hard LOCKUP" when running a heavy loading
2015-11-06 12:17 ` Will Deacon
@ 2015-11-09 4:39 ` Doug Anderson
0 siblings, 0 replies; 12+ messages in thread
From: Doug Anderson @ 2015-11-09 4:39 UTC (permalink / raw)
To: linux-arm-kernel

Will,

On Fri, Nov 6, 2015 at 4:17 AM, Will Deacon <will.deacon@arm.com> wrote:
> On Tue, Nov 03, 2015 at 11:00:20AM -0800, Doug Anderson wrote:
>> Hi,
>
> Hey Doug,
>
>> When CPUs are hard locked up, they are often found at:
>>
>> <c0117c8c> v7_coherent_kern_range+0x58/0x74
>> or
>> <c0118278> v7wbi_flush_user_tlb_range+0x30/0x38
>>
>> That made me think that an erratum might be the root cause of our hard
>> lockups, since ARM errata often trigger in cache/tlb functions. I
>> think Caesar dug up this old erratum fix in response to my suggestion.
>
> I still don't see how 818325 is related, since there aren't any conditional
> stores in the sequences below.
>
>> If you know of any ARM errata that might trigger hard lockups like
>> this, I'd certainly be all ears. It's also possible that we've got
>> something running at too low of a voltage or we've got clock dividers
>> or cache timings programmed incorrectly somewhere. To give a more
>> full disassembly of one of the crashes:
>>
>> <4>[ 1623.480846] SMP: failed to stop secondary CPUs
>> <3>[ 1623.480862] CPU1 PC: <c01827e8> __unqueue_futex+0x68/0x88
>> <3>[ 1623.480879] CPU2 PC: <c0117c8c> v7_coherent_kern_range+0x58/0x74
>> <3>[ 1623.480895] CPU3 PC: <c0118268> v7wbi_flush_user_tlb_range+0x20/0x38
>>
>> ---
>
> Do you have any register values for these CPUs?

No, unfortunately not. The only reason I have the PCs is because we
have code to sample CPU_DBGPCSR at hard lockup time (actually any panic
time). There's no equivalent for other registers. The code does try
to sample a number of times, so the fact that we only have one PC value
for each of the other CPUs implies that they are either totally stuck or
running in a very tight loop (from experimentation, if you are running
in a tight loop of just a few instructions the CPU_DBGPCSR for a CPU may
or may not update).

If you're curious, you can see rockchip_panic_notify() in
<https://chromium.googlesource.com/chromiumos/third_party/kernel/+/chromeos-3.14/arch/arm/mach-rockchip/rockchip.c>.
It's basically some code that's been ported forward from code in an old
Android tree and it's not beautiful, but it's better than nothing. The
code only runs if the panic notifier failed to stop the other CPUs in a
normal way.

Technically (I think) I saw something in the CPU debug registers that
would actually allow me to force another CPU to stop. That might let
me gain control over it and inspect the other registers. Doing that is
probably beyond what I have time for right now, though.

>>    c01827dc:  e2841010  add    r1, r4, #16
>>    c01827e0:  e2445004  sub    r5, r4, #4
>>    c01827e4:  eb068d33  bl     c0325cb8 <plist_del> (File
>> Offset: 0x235cb8)
>> => c01827e8:  f595f000  pldw   [r5]
>>    c01827ec:  e1953f9f  ldrex  r3, [r5]
>>    c01827f0:  e2433001  sub    r3, r3, #1
>>    c01827f4:  e1852f93  strex  r2, r3, [r5]
>>    c01827f8:  e3320000  teq    r2, #0
>>    c01827fc:  1afffffa  bne    c01827ec
>> <__unqueue_futex+0x6c> (File Offset: 0x927ec)
>>    c0182800:  e89da830  ldm    sp, {r4, r5, fp, sp, pc}
>
> For example, the futex address in r5 ...
>
>>    c0117c80:  e08cc002  add    ip, ip, r2
>>    c0117c84:  e15c0001  cmp    ip, r1
>>    c0117c88:  3afffffb  bcc    c0117c7c
>> <v7_coherent_kern_range+0x48> (File Offset: 0x27c7c)
>> => c0117c8c:  e3a00000  mov    r0, #0
>>    c0117c90:  ee070fd1  mcr    15, 0, r0, cr7, cr1, {6}
>>    c0117c94:  f57ff04a  dsb    ishst
>>    c0117c98:  f57ff06f  isb    sy
>>    c0117c9c:  e1a0f00e  mov    pc, lr
>
> ... the address in r0 for the cache maintenance ...
>
>>    c0118260:  e1830600  orr    r0, r3, r0, lsl #12
>>    c0118264:  e1a01601  lsl    r1, r1, #12
>> => c0118268:  ee080f33  mcr    15, 0, r0, cr8, cr3, {1}
>>    c011826c:  e2800a01  add    r0, r0, #4096 ; 0x1000
>>    c0118270:  e1500001  cmp    r0, r1
>>    c0118274:  3afffffb  bcc    c0118268
>> <v7wbi_flush_user_tlb_range+0x20> (File Offset: 0x28268)
>>    c0118278:  f57ff04b  dsb    ish
>>    c011827c:  e1a0f00e  mov    pc, lr
>
> ... and the address in r0 for the TLBI.
>
> Are the cores executing instructions at this point, or by "hard LOCKUP"
> do you mean that they're deadlocked in hardware?

If they are executing, they aren't executing much. That means it's
likely they're deadlocked in hardware.

-Doug
^ permalink raw reply [flat|nested] 12+ messages in thread
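Doug's inference from repeated CPU_DBGPCSR samples can be written down as a small rule. The Python sketch below is only a model of that reasoning and is not the rockchip_panic_notify() code. The function name and the returned strings are hypothetical, and the input is assumed to be a list of PC values read for one CPU.

```python
def classify_pc_samples(samples):
    """Interpret several PC samples taken from one CPU during a panic.

    Identical samples cannot distinguish a hardware deadlock from a very
    tight loop, since the sampled PC may or may not update in a loop of
    only a few instructions.
    """
    if not samples:
        return "no samples"
    if len(set(samples)) == 1:
        return "stuck or in a very tight loop"
    return "still executing"
```

With five identical samples of 0xc0117c8c the model returns "stuck or in a very tight loop", which is as far as the sampled PCs alone can take the diagnosis.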
end of thread, other threads:[~2015-11-09 4:39 UTC | newest]
Thread overview: 12+ messages
2015-11-03 8:10 [RESEND PATCH 0/1] Fix the "hard LOCKUP" when running a heavy loading Caesar Wang
2015-11-03 8:10 ` [RESEND PATCH] ARM: errata: Workaround for Cortex-A12 erratum 818325 Caesar Wang
2015-11-03 8:45 ` Arnd Bergmann
2015-11-03 9:04 ` Caesar Wang
2015-11-03 10:21 ` kbuild test robot
2015-11-03 10:41 ` [PATCH v1] " Caesar Wang
2015-11-03 11:14 ` [RESEND PATCH 0/1] Fix the "hard LOCKUP" when running a heavy loading Russell King - ARM Linux
2015-11-03 12:00 ` Huang, Tao
2015-11-03 11:30 ` Will Deacon
2015-11-03 19:00 ` Doug Anderson
2015-11-06 12:17 ` Will Deacon
2015-11-09 4:39 ` Doug Anderson