[RESEND PATCH 0/1] Fix the "hard LOCKUP" when running a heavy loading

* [RESEND PATCH 0/1] Fix the "hard LOCKUP" when running a heavy loading
@ 2015-11-03  8:10 Caesar Wang
  2015-11-03  8:10 ` [RESEND PATCH] ARM: errata: Workaround for Cortex-A12 erratum 818325 Caesar Wang
                   ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: Caesar Wang @ 2015-11-03  8:10 UTC (permalink / raw)
  To: linux-arm-kernel

The following is from a case where we experience a CPU hard lockup. The
assembly code (disassembled by gdb):

0xc06c6e90 <__tcp_select_window+148>:        beq     0xc06c6eb0 <__tcp_select_window+180>
0xc06c6e94 <__tcp_select_window+152>:        mov     r2, #1008 ; 0x3f0
0xc06c6e98 <__tcp_select_window+156>:        ldr     r5, [r0,#1004] ; 0x3ec
0xc06c6e9c <__tcp_select_window+160>:        ldrh    r2, [r0,r2]
....

0xc06c6ee0 <__tcp_select_window+228>:        addne   r0, r0, #1
0xc06c6ee4 <__tcp_select_window+232>:        lslne   r0, r0, r2
0xc06c6ee8 <__tcp_select_window+236>:        ldmne   sp, {r4, r5, r11, sp, pc}

Could either the "strhi"/"strlo" pair, or the lslne/ldmne pair, be
tripping over erratum 818325, or a similar erratum?

0xc06c6eec <__tcp_select_window+240>:        b       0xc06c6f40 <__tcp_select_window+324>

This patch can fix the *hard lockup* in some cases.

As Russell said:
"in other words, which can be handled by updating a control register in the firmware or
boot loader"
Maybe the better solution is in firmware.

Also, I'm not sure whether this workaround patch can be accepted.

I am resending this patch to get some suggestions from you.

--
Thanks!

Huang Tao (1):
  ARM: errata: Workaround for Cortex-A12 erratum 818325

 arch/arm/Kconfig      | 13 +++++++++++++
 arch/arm/mm/proc-v7.S | 12 ++++++++++++
 2 files changed, 25 insertions(+)

-- 
1.9.1

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [RESEND PATCH] ARM: errata: Workaround for Cortex-A12 erratum 818325
  2015-11-03  8:10 [RESEND PATCH 0/1] Fix the "hard LOCKUP" when running a heavy loading Caesar Wang
@ 2015-11-03  8:10 ` Caesar Wang
  2015-11-03  8:45   ` Arnd Bergmann
  2015-11-03 10:21   ` kbuild test robot
  2015-11-03 10:41 ` [PATCH v1] " Caesar Wang
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 12+ messages in thread
From: Caesar Wang @ 2015-11-03  8:10 UTC (permalink / raw)
  To: linux-arm-kernel

From: Huang Tao <huangtao@rock-chips.com>

On Cortex-A12 (r0p0..r0p1-00lac0-rc11), when a CPU executes a sequence of
two conditional store instructions with opposite condition codes that
update the same register, the system might enter a deadlock if the
second conditional instruction is an UNPREDICTABLE STR or STM
instruction. The workaround sets bit[12] of the Feature Register, which
prevents the erratum. This bit disables an optimisation applied to a
sequence of 2 instructions that use opposing condition codes.

Signed-off-by: Huang Tao <huangtao@rock-chips.com>
Signed-off-by: Kever Yang <kever.yang@rock-chips.com>
Signed-off-by: Caesar Wang <wxt@rock-chips.com>

---

 arch/arm/Kconfig      | 13 +++++++++++++
 arch/arm/mm/proc-v7.S | 12 ++++++++++++
 2 files changed, 25 insertions(+)

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 639411f..554b57a 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -1263,6 +1263,19 @@ config ARM_ERRATA_773022
 	  loop buffer may deliver incorrect instructions. This
 	  workaround disables the loop buffer to avoid the erratum.
 
+config ARM_ERRATA_818325
+	bool "ARM errata: Execution of an UNPREDICTABLE STR or STM instruction might deadlock"
+	depends on CPU_V7
+	help
+	  This option enables the workaround for the 818325 Cortex-A12
+	  (r0p0..r0p1-00lac0-rc11) erratum. When a CPU executes a sequence of
+	  two conditional store instructions with opposite condition code and
+	  updating the same register, the system might enter a deadlock if the
+	  second conditional instruction is an UNPREDICTABLE STR or STM
+	  instruction. This workaround setting bit[12] of the Feature Register
+	  prevents the erratum. This bit disables an optimisation applied to a
+	  sequence of 2 instructions that use opposing condition codes.
+
 endmenu
 
 source "arch/arm/common/Kconfig"
diff --git a/arch/arm/mm/proc-v7.S b/arch/arm/mm/proc-v7.S
index de2b246..2b338ec 100644
--- a/arch/arm/mm/proc-v7.S
+++ b/arch/arm/mm/proc-v7.S
@@ -439,6 +439,18 @@ __v7_setup_cont:
 	teq	r0, r10
 	beq	__ca9_errata
 
+	/* Cortex-A12 Errata */
+	ldr	r10, =0x00000c0d		@ Cortex-A12 primary part number
+	teq	r0, r10
+	bne	5f
+#ifdef CONFIG_ARM_ERRATA_818325
+	teq	r6, #0x00			@ present in r0p0
+	teqne	r6, #0x01			@ present in r0p1-00lac0-rc11
+	mrceq	p15, 0, r10, c15, c0, 1		@ read diagnostic register
+	orreq	r10, r10, #1 << 12		@ set bit #12
+	mcreq	p15, 0, r10, c15, c0, 1		@ write diagnostic register
+	isb
+#endif
 	/* Cortex-A15 Errata */
 	ldr	r10, =0x00000c0f		@ Cortex-A15 primary part number
 	teq	r0, r10
-- 
1.9.1


* [RESEND PATCH] ARM: errata: Workaround for Cortex-A12 erratum 818325
  2015-11-03  8:10 ` [RESEND PATCH] ARM: errata: Workaround for Cortex-A12 erratum 818325 Caesar Wang
@ 2015-11-03  8:45   ` Arnd Bergmann
  2015-11-03  9:04     ` Caesar Wang
  2015-11-03 10:21   ` kbuild test robot
  1 sibling, 1 reply; 12+ messages in thread
From: Arnd Bergmann @ 2015-11-03  8:45 UTC (permalink / raw)
  To: linux-arm-kernel

On Tuesday 03 November 2015 16:10:09 Caesar Wang wrote:
> 
> +       /* Cortex-A12 Errata */
> +       ldr     r10, =0x00000c0d                @ Cortex-A12 primary part number
> +       teq     r0, r10
> +       bne     5f
> +#ifdef CONFIG_ARM_ERRATA_818325
> +       teq     r6, #0x00                       @ present in r0p0
> +       teqne   r6, #0x01                       @ present in r0p1-00lac0-rc11
> +       mrceq   p15, 0, r10, c15, c0, 1         @ read diagnostic register
> +       orreq   r10, r10, #1 << 12              @ set bit #12
> +       mcreq   p15, 0, r10, c15, c0, 1         @ write diagnostic register
> +       isb
> +#endif
>         /* Cortex-A15 Errata */
> 

Does this still build? You seem to have lost the '5:' label.

	Arnd


* [RESEND PATCH] ARM: errata: Workaround for Cortex-A12 erratum 818325
  2015-11-03  8:45   ` Arnd Bergmann
@ 2015-11-03  9:04     ` Caesar Wang
  0 siblings, 0 replies; 12+ messages in thread
From: Caesar Wang @ 2015-11-03  9:04 UTC (permalink / raw)
  To: linux-arm-kernel



On 2015-11-03 16:45, Arnd Bergmann wrote:
> On Tuesday 03 November 2015 16:10:09 Caesar Wang wrote:
>> +       /* Cortex-A12 Errata */
>> +       ldr     r10, =0x00000c0d                @ Cortex-A12 primary part number
>> +       teq     r0, r10
>> +       bne     5f

beq  __ca15_errata:
>> +#ifdef CONFIG_ARM_ERRATA_818325
>> +       teq     r6, #0x00                       @ present in r0p0
>> +       teqne   r6, #0x01                       @ present in r0p1-00lac0-rc11
>> +       mrceq   p15, 0, r10, c15, c0, 1         @ read diagnostic register
>> +       orreq   r10, r10, #1 << 12              @ set bit #12
>> +       mcreq   p15, 0, r10, c15, c0, 1         @ write diagnostic register
>> +       isb
>> +#endif
>>          /* Cortex-A15 Errata */
>>
> Does this still build? You seem to have lost the '5:' label.

No, I did not build it against the next kernel.

Yup, the patch needs a small change because of the following commit:

commit 17e7bf86690eaad4906d2295f0bd171cc194633b
Author: Russell King <rmk+kernel@arm.linux.org.uk>
Date:   Sat Apr 4 21:34:33 2015 +0100

     ARM: proc-v7: move CPU errata out of line



-----
Original patch:
https://patchwork.kernel.org/patch/4735341/

Applied and verified on kernel V3.14.



>
> 	Arnd
>


-- 
Thanks,
Caesar


* [RESEND PATCH] ARM: errata: Workaround for Cortex-A12 erratum 818325
  2015-11-03  8:10 ` [RESEND PATCH] ARM: errata: Workaround for Cortex-A12 erratum 818325 Caesar Wang
  2015-11-03  8:45   ` Arnd Bergmann
@ 2015-11-03 10:21   ` kbuild test robot
  1 sibling, 0 replies; 12+ messages in thread
From: kbuild test robot @ 2015-11-03 10:21 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Huang,

[auto build test ERROR on mvebu/for-next -- if it's inappropriate base, please suggest rules for selecting the more suitable base]

url:    https://github.com/0day-ci/linux/commits/Caesar-Wang/ARM-errata-Workaround-for-Cortex-A12-erratum-818325/20151103-163417
config: arm-prima2_defconfig (attached as .config)
reproduce:
        wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=arm 

All errors (new ones prefixed by >>):

   /tmp/ccjF0uyl.s: Assembler messages:
>> /tmp/ccjF0uyl.s: Error: local label `"5" (instance number 1 of a fb label)' is not defined

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation


* [PATCH v1] ARM: errata: Workaround for Cortex-A12 erratum 818325
  2015-11-03  8:10 [RESEND PATCH 0/1] Fix the "hard LOCKUP" when running a heavy loading Caesar Wang
  2015-11-03  8:10 ` [RESEND PATCH] ARM: errata: Workaround for Cortex-A12 erratum 818325 Caesar Wang
@ 2015-11-03 10:41 ` Caesar Wang
  2015-11-03 11:14 ` [RESEND PATCH 0/1] Fix the "hard LOCKUP" when running a heavy loading Russell King - ARM Linux
  2015-11-03 11:30 ` Will Deacon
  3 siblings, 0 replies; 12+ messages in thread
From: Caesar Wang @ 2015-11-03 10:41 UTC (permalink / raw)
  To: linux-arm-kernel

From: Huang Tao <huangtao@rock-chips.com>

On Cortex-A12 (r0p0..r0p1-00lac0-rc11), when a CPU executes a sequence of
two conditional store instructions with opposite condition codes that
update the same register, the system might enter a deadlock if the
second conditional instruction is an UNPREDICTABLE STR or STM
instruction. The workaround sets bit[12] of the Feature Register, which
prevents the erratum. This bit disables an optimisation applied to a
sequence of 2 instructions that use opposing condition codes.

Signed-off-by: Huang Tao <huangtao@rock-chips.com>
Signed-off-by: Kever Yang <kever.yang@rock-chips.com>
Signed-off-by: Caesar Wang <wxt@rock-chips.com>

---

Changes in v1:
- fix the build error.

 arch/arm/Kconfig      | 13 +++++++++++++
 arch/arm/mm/proc-v7.S | 16 ++++++++++++++++
 2 files changed, 29 insertions(+)

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 639411f..554b57a 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -1263,6 +1263,19 @@ config ARM_ERRATA_773022
 	  loop buffer may deliver incorrect instructions. This
 	  workaround disables the loop buffer to avoid the erratum.
 
+config ARM_ERRATA_818325
+	bool "ARM errata: Execution of an UNPREDICTABLE STR or STM instruction might deadlock"
+	depends on CPU_V7
+	help
+	  This option enables the workaround for the 818325 Cortex-A12
+	  (r0p0..r0p1-00lac0-rc11) erratum. When a CPU executes a sequence of
+	  two conditional store instructions with opposite condition code and
+	  updating the same register, the system might enter a deadlock if the
+	  second conditional instruction is an UNPREDICTABLE STR or STM
+	  instruction. This workaround setting bit[12] of the Feature Register
+	  prevents the erratum. This bit disables an optimisation applied to a
+	  sequence of 2 instructions that use opposing condition codes.
+
 endmenu
 
 source "arch/arm/common/Kconfig"
diff --git a/arch/arm/mm/proc-v7.S b/arch/arm/mm/proc-v7.S
index de2b246..e95c83c 100644
--- a/arch/arm/mm/proc-v7.S
+++ b/arch/arm/mm/proc-v7.S
@@ -351,6 +351,17 @@ __ca9_errata:
 #endif
 	b	__errata_finish
 
+__ca12_errata:
+#ifdef CONFIG_ARM_ERRATA_818325
+	teq	r6, #0x00			@ present in r0p0
+	teqne	r6, #0x01			@ present in r0p1-00lac0-rc11
+	mrceq	p15, 0, r10, c15, c0, 1		@ read diagnostic register
+	orreq	r10, r10, #1 << 12		@ set bit #12
+	mcreq	p15, 0, r10, c15, c0, 1		@ write diagnostic register
+	isb
+#endif
+	b	__errata_finish
+
 __ca15_errata:
 #ifdef CONFIG_ARM_ERRATA_773022
 	cmp	r6, #0x4			@ only present up to r0p4
@@ -439,6 +450,11 @@ __v7_setup_cont:
 	teq	r0, r10
 	beq	__ca9_errata
 
+	/* Cortex-A12 Errata */
+	ldr	r10, =0x00000c0d		@ Cortex-A12 primary part number
+	teq	r0, r10
+	beq	__ca12_errata
+
 	/* Cortex-A15 Errata */
 	ldr	r10, =0x00000c0f		@ Cortex-A15 primary part number
 	teq	r0, r10
-- 
1.9.1
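
The conditional assembly in __ca12_errata can be read as the following
minimal C sketch. It assumes the usual MIDR layout (implementer in bits
[31:24], variant in [23:20], primary part number in [15:4], revision in
[3:0]) and assumes that r6 holds (variant << 4) | revision, as set up by
kernel code not shown in the diff. The implementer check is likewise
assumed from that surrounding code. The helper names are hypothetical, and
the sketch only models the register value; it does not access CP15.

```c
#include <stdbool.h>
#include <stdint.h>

#define ARM_IMPLEMENTER     0x41u       /* assumed: checked before the part number */
#define CORTEX_A12_PART     0xc0du      /* primary part number used in the patch */
#define ERRATUM_818325_BIT  (1u << 12)  /* bit[12], as set by the patch */

/* Hypothetical helper: true when the patch would apply the workaround. */
static bool erratum_818325_applies(uint32_t midr)
{
    uint32_t implementer = midr >> 24;
    uint32_t variant     = (midr >> 20) & 0xfu;
    uint32_t part        = (midr >> 4) & 0xfffu;
    uint32_t revision    = midr & 0xfu;
    uint32_t rev         = (variant << 4) | revision;  /* models r6 */

    return implementer == ARM_IMPLEMENTER &&
           part == CORTEX_A12_PART &&
           (rev == 0x00u || rev == 0x01u);              /* r0p0 or r0p1 */
}

/* Hypothetical helper: register value after the read-modify-write. */
static uint32_t diag_after_workaround(uint32_t midr, uint32_t diag)
{
    return erratum_818325_applies(midr) ? (diag | ERRATUM_818325_BIT) : diag;
}
```

A check of this shape compares only the variant and revision fields, so it
cannot tell r0p1-00lac0-rc11 apart from any other r0p1 build.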


* [RESEND PATCH 0/1] Fix the "hard LOCKUP" when running a heavy loading
  2015-11-03  8:10 [RESEND PATCH 0/1] Fix the "hard LOCKUP" when running a heavy loading Caesar Wang
  2015-11-03  8:10 ` [RESEND PATCH] ARM: errata: Workaround for Cortex-A12 erratum 818325 Caesar Wang
  2015-11-03 10:41 ` [PATCH v1] " Caesar Wang
@ 2015-11-03 11:14 ` Russell King - ARM Linux
  2015-11-03 12:00   ` Huang, Tao
  2015-11-03 11:30 ` Will Deacon
  3 siblings, 1 reply; 12+ messages in thread
From: Russell King - ARM Linux @ 2015-11-03 11:14 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Nov 03, 2015 at 04:10:08PM +0800, Caesar Wang wrote:
> As Russell said:
> "in other words, which can be handled by updating a control register in
> the firmware or boot loader"
> Maybe the better solution is in firmware.

The full quote is:

"I think we're at the point where we start insisting that workarounds
which are simple enable/disable feature bit operations (in other words,
which can be handled by updating a control register in the firmware or
boot loader) must be done that way, and we are not going to add such
workarounds to the kernel anymore."

The position hasn't changed.  Workarounds such as this should be handled
in the firmware/boot loader before control is passed to the kernel.

The reason is very simple: if the C compiler can generate code which
triggers the bug, it can generate code which triggers the bug in the
boot loader.  So, the only place such workarounds can be done is before
any C code gets executed.  Putting such workarounds in the kernel is
completely inappropriate.

Sorry, I'm not going to accept this workaround into the kernel.

-- 
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.


* [RESEND PATCH 0/1] Fix the "hard LOCKUP" when running a heavy loading
  2015-11-03 11:14 ` [RESEND PATCH 0/1] Fix the "hard LOCKUP" when running a heavy loading Russell King - ARM Linux
@ 2015-11-03 12:00   ` Huang, Tao
  0 siblings, 0 replies; 12+ messages in thread
From: Huang, Tao @ 2015-11-03 12:00 UTC (permalink / raw)
  To: linux-arm-kernel

Hello Russell:

On 2015-11-03 19:14, Russell King - ARM Linux wrote:
> On Tue, Nov 03, 2015 at 04:10:08PM +0800, Caesar Wang wrote:
>> As Russell said:
>> "in other words, which can be handled by updating a control register in
>> the firmware or boot loader"
>> Maybe the better solution is in firmware.
> 
> The full quote is:
> 
> "I think we're at the point where we start insisting that workarounds
> which are simple enable/disable feature bit operations (in other words,
> which can be handled by updating a control register in the firmware or
> boot loader) must be done that way, and we are not going to add such
> workarounds to the kernel anymore."
> 
> The position hasn't changed.  Workarounds such as this should be handled
> in the firmware/boot loader before control is passed to the kernel.
> 
> The reason is very simple: if the C compiler can generate code which
> triggers the bug, it can generate code which triggers the bug in the
> boot loader.  So, the only place such workarounds can be done is before
> any C code gets executed.  Putting such workarounds in the kernel is
> completely inappropriate.

I agree with your reasoning for CPU0. But what about CPU1~3 if we don't use
any firmware, such as ARM Trusted Firmware, to take control of CPU power-on?
What if CPU1~3 run Linux code from their very first instruction?

BTW, I don't want to argue about whether the workaround is right or wrong,
because I know the erratum only happens on r0p0, not r0p1.

> 
> Sorry, I'm not going to accept this workaround into the kernel.

It seems we should introduce some code outside the kernel to do such
initialization?


* [RESEND PATCH 0/1] Fix the "hard LOCKUP" when running a heavy loading
  2015-11-03  8:10 [RESEND PATCH 0/1] Fix the "hard LOCKUP" when running a heavy loading Caesar Wang
                   ` (2 preceding siblings ...)
  2015-11-03 11:14 ` [RESEND PATCH 0/1] Fix the "hard LOCKUP" when running a heavy loading Russell King - ARM Linux
@ 2015-11-03 11:30 ` Will Deacon
  2015-11-03 19:00   ` Doug Anderson
  3 siblings, 1 reply; 12+ messages in thread
From: Will Deacon @ 2015-11-03 11:30 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Nov 03, 2015 at 04:10:08PM +0800, Caesar Wang wrote:
> The following is from a case where we experience a CPU hard lockup. The
> assembly code (disassembled by gdb):
> 
> 0xc06c6e90 <__tcp_select_window+148>:        beq     0xc06c6eb0 <__tcp_select_window+180>
> 0xc06c6e94 <__tcp_select_window+152>:        mov     r2, #1008 ; 0x3f0
> 0xc06c6e98 <__tcp_select_window+156>:        ldr     r5, [r0,#1004] ; 0x3ec
> 0xc06c6e9c <__tcp_select_window+160>:        ldrh    r2, [r0,r2]
> ....
> 
> 0xc06c6ee0 <__tcp_select_window+228>:        addne   r0, r0, #1
> 0xc06c6ee4 <__tcp_select_window+232>:        lslne   r0, r0, r2
> 0xc06c6ee8 <__tcp_select_window+236>:        ldmne   sp, {r4, r5, r11, sp, pc}
> 
> Could either the "strhi"/"strlo" pair, or the lslne/ldmne pair, be
> tripping over erratum 818325, or a similar erratum?

No. One of the conditions for #818325 is:

  The second instruction is an UNPREDICTABLE STR or STM (maximum two
  registers in the list) with write-back and the write-back register is
  in the list of stored registers.

I don't see either of those in your code snippet above, but then I don't
see your strhi/strlo either. What's going on?

> 0xc06c6eec <__tcp_select_window+240>:        b       0xc06c6f40 <__tcp_select_window+324>
> 
> This patch can fix the *hard lockup* in some cases.
> 
> As Russell said:
> "in other words, which can be handled by updating a control register in the firmware or
> boot loader"

Russell is completely correct: this should be worked around in firmware.
There are a number of reasons for that:

  (1) You want the workaround enabled for all privilege and security
      levels, which means applying it before you enter the kernel.

  (2) If Linux boots in non-secure, then the workaround may silently
      fail to apply.

  (3) The CPU may have an ECO fix, in which case we wouldn't want to
      enable the workaround.

  (4) Some workarounds (albeit not this one, afaict) require changing
      CPU configuration that can only be done very early on, e.g. whilst
      "the memory system is idle".

Now, I appreciate that doing this in the kernel may be the easiest thing
for your particular SoC, but that doesn't necessarily mean that it's the
best thing to do in the mainline kernel. Whilst there *is* precedent for
this already, we've been trying to move away from setting these bits in
the kernel for the reasons mentioned above.

Will
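
The condition quoted above can be turned into a rough encoding check. The
following C sketch is only an illustration under stated assumptions: it
looks at A32 (ARM state) encodings only, it covers the STM form but not
STR, and it tests the quoted shape (store multiple, write-back, base
register in the list, at most two registers) without judging whether the
architecture classifies a given encoding as UNPREDICTABLE. It also does
not check that both instructions update the same register. The function
names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Condition field of an A32 instruction, bits [31:28]. */
static unsigned a32_cond(uint32_t insn)
{
    return insn >> 28;
}

/*
 * Opposite pairs (EQ/NE, CS/CC, ..., GT/LE) differ only in bit 0 of the
 * condition field. AL (0xe) and 0xf have no opposite.
 */
static bool a32_conds_opposite(uint32_t first, uint32_t second)
{
    unsigned c1 = a32_cond(first);
    unsigned c2 = a32_cond(second);

    return c1 < 0xeu && c2 < 0xeu && (c1 ^ 1u) == c2;
}

/* STM with write-back, base register in the list, at most two registers. */
static bool a32_stm_writeback_base_in_list(uint32_t insn)
{
    bool     ldm_stm   = ((insn >> 25) & 0x7u) == 0x4u;  /* bits [27:25] = 100 */
    bool     store     = ((insn >> 20) & 1u) == 0;       /* L bit clear */
    bool     writeback = ((insn >> 21) & 1u) != 0;       /* W bit set */
    unsigned rn        = (insn >> 16) & 0xfu;            /* base register */
    uint32_t list      = insn & 0xffffu;                 /* register list */
    unsigned count     = 0;

    /* Count the registers named in the list. */
    for (uint32_t bits = list; bits != 0; bits >>= 1)
        count += bits & 1u;

    return a32_cond(insn) != 0xfu && ldm_stm && store && writeback &&
           ((list >> rn) & 1u) != 0 && count <= 2;
}
```

For example, 0x19200003 (stmdbne r0!, {r0, r1}) matches the STM check,
while 0x189da830, which is the ldm sp, {r4, r5, fp, sp, pc} encoding
0xe89da830 with its condition field changed to NE, does not: it is a load
and has no write-back.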


* [RESEND PATCH 0/1] Fix the "hard LOCKUP" when running a heavy loading
  2015-11-03 11:30 ` Will Deacon
@ 2015-11-03 19:00   ` Doug Anderson
  2015-11-06 12:17     ` Will Deacon
  0 siblings, 1 reply; 12+ messages in thread
From: Doug Anderson @ 2015-11-03 19:00 UTC (permalink / raw)
  To: linux-arm-kernel

Hi,

On Tue, Nov 3, 2015 at 3:30 AM, Will Deacon <will.deacon@arm.com> wrote:
> On Tue, Nov 03, 2015 at 04:10:08PM +0800, Caesar Wang wrote:
>> The following is from a case where we experience a CPU hard lockup. The
>> assembly code (disassembled by gdb):
>>
>> 0xc06c6e90 <__tcp_select_window+148>:        beq     0xc06c6eb0 <__tcp_select_window+180>
>> 0xc06c6e94 <__tcp_select_window+152>:        mov     r2, #1008 ; 0x3f0
>> 0xc06c6e98 <__tcp_select_window+156>:        ldr     r5, [r0,#1004] ; 0x3ec
>> 0xc06c6e9c <__tcp_select_window+160>:        ldrh    r2, [r0,r2]
>> ....
>>
>> 0xc06c6ee0 <__tcp_select_window+228>:        addne   r0, r0, #1
>> 0xc06c6ee4 <__tcp_select_window+232>:        lslne   r0, r0, r2
>> 0xc06c6ee8 <__tcp_select_window+236>:        ldmne   sp, {r4, r5, r11, sp, pc}
>>
>> Could either the "strhi"/"strlo" pair, or the lslne/ldmne pair, be
>> tripping over erratum 818325, or a similar erratum?
>
> No. One of the conditions for #818325 is:
>
>   The second instruction is an UNPREDICTABLE STR or STM (maximum two
>   registers in the list) with write-back and the write-back register is
>   in the list of stored registers.
>
> I don't see either of those in your code snippet above, but then I don't
> see your strhi/strlo either. What's going on?

It looks like Caesar is proposing that this errata is the root cause
for some hard lockups we're seeing on rk3288 Chromebooks.  I agree
with folks here that say this isn't terribly likely, but I always like
to be proven wrong.  ;)

We've got code that samples / prints CPU_DBGPCSR at the time of a hard
lockup.  That register isn't 100% accurate about where a CPU is, but
it's better than nothing (technically there may be ways to actually
use the DBG registers to stop the remote CPU and maybe give more info,
but I digress).

When CPUs are hard locked up, they are often found at:

<c0117c8c> v7_coherent_kern_range+0x58/0x74
  or
<c0118278> v7wbi_flush_user_tlb_range+0x30/0x38

That made me think that an errata might be the root cause of our hard
lockups, since ARM errata often trigger in cache/tlb functions.  I
think Caesar dug up this old errata fix in response to my suggestion.

If you know of any ARM errata that might trigger hard lockups like
this, I'd certainly be all ears.  It's also possible that we've got
something running at too low of a voltage or we've got clock dividers
or cache timings programmed incorrectly somewhere.  To give a more
full disassembly of one of the crashes:

  <4>[ 1623.480846] SMP: failed to stop secondary CPUs
  <3>[ 1623.480862] CPU1 PC: <c01827e8> __unqueue_futex+0x68/0x88
  <3>[ 1623.480879] CPU2 PC: <c0117c8c> v7_coherent_kern_range+0x58/0x74
  <3>[ 1623.480895] CPU3 PC: <c0118268> v7wbi_flush_user_tlb_range+0x20/0x38

---

c01827dc:       e2841010        add     r1, r4, #16
c01827e0:       e2445004        sub     r5, r4, #4
c01827e4:       eb068d33        bl      c0325cb8 <plist_del> (File Offset: 0x235cb8)
=> c01827e8:    f595f000        pldw    [r5]
c01827ec:       e1953f9f        ldrex   r3, [r5]
c01827f0:       e2433001        sub     r3, r3, #1
c01827f4:       e1852f93        strex   r2, r3, [r5]
c01827f8:       e3320000        teq     r2, #0
c01827fc:       1afffffa        bne     c01827ec <__unqueue_futex+0x6c> (File Offset: 0x927ec)
c0182800:       e89da830        ldm     sp, {r4, r5, fp, sp, pc}

---

c0117c80:       e08cc002        add     ip, ip, r2
c0117c84:       e15c0001        cmp     ip, r1
c0117c88:       3afffffb        bcc     c0117c7c <v7_coherent_kern_range+0x48> (File Offset: 0x27c7c)
=> c0117c8c:    e3a00000        mov     r0, #0
c0117c90:       ee070fd1        mcr     15, 0, r0, cr7, cr1, {6}
c0117c94:       f57ff04a        dsb     ishst
c0117c98:       f57ff06f        isb     sy
c0117c9c:       e1a0f00e        mov     pc, lr

---

c0118260:       e1830600        orr     r0, r3, r0, lsl #12
c0118264:       e1a01601        lsl     r1, r1, #12
=> c0118268:    ee080f33        mcr     15, 0, r0, cr8, cr3, {1}
c011826c:       e2800a01        add     r0, r0, #4096   ; 0x1000
c0118270:       e1500001        cmp     r0, r1
c0118274:       3afffffb        bcc     c0118268 <v7wbi_flush_user_tlb_range+0x20> (File Offset: 0x28268)
c0118278:       f57ff04b        dsb     ish
c011827c:       e1a0f00e        mov     pc, lr


* [RESEND PATCH 0/1] Fix the "hard LOCKUP" when running a heavy loading
  2015-11-03 19:00   ` Doug Anderson
@ 2015-11-06 12:17     ` Will Deacon
  2015-11-09  4:39       ` Doug Anderson
  0 siblings, 1 reply; 12+ messages in thread
From: Will Deacon @ 2015-11-06 12:17 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Nov 03, 2015 at 11:00:20AM -0800, Doug Anderson wrote:
> Hi,

Hey Doug,

> When CPUs are hard locked up, they are often found at:
> 
> <c0117c8c> v7_coherent_kern_range+0x58/0x74
>   or
> <c0118278> v7wbi_flush_user_tlb_range+0x30/0x38
> 
> That made me think that an errata might be the root cause of our hard
> lockups, since ARM errata often trigger in cache/tlb functions.  I
> think Caesar dug up this old errata fix in response to my suggestion.

I still don't see how 818325 is related, since there aren't any conditional
stores in the sequences below.

> If you know of any ARM errata that might trigger hard lockups like
> this, I'd certainly be all ears.  It's also possible that we've got
> something running at too low of a voltage or we've got clock dividers
> or cache timings programmed incorrectly somewhere.  To give a more
> full disassembly of one of the crashes:
> 
>   <4>[ 1623.480846] SMP: failed to stop secondary CPUs
>   <3>[ 1623.480862] CPU1 PC: <c01827e8> __unqueue_futex+0x68/0x88
>   <3>[ 1623.480879] CPU2 PC: <c0117c8c> v7_coherent_kern_range+0x58/0x74
>   <3>[ 1623.480895] CPU3 PC: <c0118268> v7wbi_flush_user_tlb_range+0x20/0x38
> 
> ---

Do you have any register values for these CPUs?

> c01827dc:       e2841010        add     r1, r4, #16
> c01827e0:       e2445004        sub     r5, r4, #4
> c01827e4:       eb068d33        bl      c0325cb8 <plist_del> (File Offset: 0x235cb8)
> => c01827e8:    f595f000        pldw    [r5]
> c01827ec:       e1953f9f        ldrex   r3, [r5]
> c01827f0:       e2433001        sub     r3, r3, #1
> c01827f4:       e1852f93        strex   r2, r3, [r5]
> c01827f8:       e3320000        teq     r2, #0
> c01827fc:       1afffffa        bne     c01827ec <__unqueue_futex+0x6c> (File Offset: 0x927ec)
> c0182800:       e89da830        ldm     sp, {r4, r5, fp, sp, pc}

For example, the futex address in r5 ...

> c0117c80:       e08cc002        add     ip, ip, r2
> c0117c84:       e15c0001        cmp     ip, r1
> c0117c88:       3afffffb        bcc     c0117c7c <v7_coherent_kern_range+0x48> (File Offset: 0x27c7c)
> => c0117c8c:    e3a00000        mov     r0, #0
> c0117c90:       ee070fd1        mcr     15, 0, r0, cr7, cr1, {6}
> c0117c94:       f57ff04a        dsb     ishst
> c0117c98:       f57ff06f        isb     sy
> c0117c9c:       e1a0f00e        mov     pc, lr

... the address in r0 for the cache maintenance ...

> c0118260:       e1830600        orr     r0, r3, r0, lsl #12
> c0118264:       e1a01601        lsl     r1, r1, #12
> => c0118268:    ee080f33        mcr     15, 0, r0, cr8, cr3, {1}
> c011826c:       e2800a01        add     r0, r0, #4096   ; 0x1000
> c0118270:       e1500001        cmp     r0, r1
> c0118274:       3afffffb        bcc     c0118268 <v7wbi_flush_user_tlb_range+0x20> (File Offset: 0x28268)
> c0118278:       f57ff04b        dsb     ish
> c011827c:       e1a0f00e        mov     pc, lr

... and the address in r0 for the TLBI.

Are the cores executing instructions at this point, or by "hard LOCKUP"
do you mean that they're deadlocked in hardware?

Will


* [RESEND PATCH 0/1] Fix the "hard LOCKUP" when running a heavy loading
  2015-11-06 12:17     ` Will Deacon
@ 2015-11-09  4:39       ` Doug Anderson
  0 siblings, 0 replies; 12+ messages in thread
From: Doug Anderson @ 2015-11-09  4:39 UTC (permalink / raw)
  To: linux-arm-kernel

Will,

On Fri, Nov 6, 2015 at 4:17 AM, Will Deacon <will.deacon@arm.com> wrote:
> On Tue, Nov 03, 2015 at 11:00:20AM -0800, Doug Anderson wrote:
>> Hi,
>
> Hey Doug,
>
>> When CPUs are hard locked up, they are often found at:
>>
>> <c0117c8c> v7_coherent_kern_range+0x58/0x74
>>   or
>> <c0118278> v7wbi_flush_user_tlb_range+0x30/0x38
>>
>> That made me think that an errata might be the root cause of our hard
>> lockups, since ARM errata often trigger in cache/tlb functions.  I
>> think Caesar dug up this old errata fix in response to my suggestion.
>
> I still don't see how 818325 is related, since there aren't any conditional
> stores in the sequences below.
>
>> If you know of any ARM errata that might trigger hard lockups like
>> this, I'd certainly be all ears.  It's also possible that we've got
>> something running at too low of a voltage or we've got clock dividers
>> or cache timings programmed incorrectly somewhere.  To give a more
>> full disassembly of one of the crashes:
>>
>>   <4>[ 1623.480846] SMP: failed to stop secondary CPUs
>>   <3>[ 1623.480862] CPU1 PC: <c01827e8> __unqueue_futex+0x68/0x88
>>   <3>[ 1623.480879] CPU2 PC: <c0117c8c> v7_coherent_kern_range+0x58/0x74
>>   <3>[ 1623.480895] CPU3 PC: <c0118268> v7wbi_flush_user_tlb_range+0x20/0x38
>>
>> ---
>
> Do you have any register values for these CPUs?

No, unfortunately not.  The only reason I have the PCs is because we
have code to sample CPU_DBGPCSR at hard lockup time (actually any
panic time).  There's no equivalent for other registers.  The code
does try to sample a number of times, so the fact that we only have
one PC value for each of the other CPUs implies that they are either
totally stuck or running in a very tight loop (from experimentation,
if you are running in a tight loop of just a few instructions the
CPU_DBGPCSR for a CPU may or may not update).

If you're curious, you can see rockchip_panic_notify() in
<https://chromium.googlesource.com/chromiumos/third_party/kernel/+/chromeos-3.14/arch/arm/mach-rockchip/rockchip.c>.
It's basically some code that's been ported forward from code in an
old Android tree and it's not beautiful, but it's better than nothing.
The code only runs if the panic notifier failed to stop the other CPUs
in a normal way.

Technically (I think) I saw something in the CPU debug registers that
would actually allow me to force another CPU to stop.  That might let
me gain control over it and inspect the other registers.  Doing that
is probably beyond what I have time for right now, though.


>> c01827dc:       e2841010        add     r1, r4, #16
>> c01827e0:       e2445004        sub     r5, r4, #4
>> c01827e4:       eb068d33        bl      c0325cb8 <plist_del> (File Offset: 0x235cb8)
>> => c01827e8:    f595f000        pldw    [r5]
>> c01827ec:       e1953f9f        ldrex   r3, [r5]
>> c01827f0:       e2433001        sub     r3, r3, #1
>> c01827f4:       e1852f93        strex   r2, r3, [r5]
>> c01827f8:       e3320000        teq     r2, #0
>> c01827fc:       1afffffa        bne     c01827ec <__unqueue_futex+0x6c> (File Offset: 0x927ec)
>> c0182800:       e89da830        ldm     sp, {r4, r5, fp, sp, pc}
>
> For example, the futex address in r5 ...
>
>> c0117c80:       e08cc002        add     ip, ip, r2
>> c0117c84:       e15c0001        cmp     ip, r1
>> c0117c88:       3afffffb        bcc     c0117c7c <v7_coherent_kern_range+0x48> (File Offset: 0x27c7c)
>> => c0117c8c:    e3a00000        mov     r0, #0
>> c0117c90:       ee070fd1        mcr     15, 0, r0, cr7, cr1, {6}
>> c0117c94:       f57ff04a        dsb     ishst
>> c0117c98:       f57ff06f        isb     sy
>> c0117c9c:       e1a0f00e        mov     pc, lr
>
> ... the address in r0 for the cache maintenance ...
>
>> c0118260:       e1830600        orr     r0, r3, r0, lsl #12
>> c0118264:       e1a01601        lsl     r1, r1, #12
>> => c0118268:    ee080f33        mcr     15, 0, r0, cr8, cr3, {1}
>> c011826c:       e2800a01        add     r0, r0, #4096   ; 0x1000
>> c0118270:       e1500001        cmp     r0, r1
>> c0118274:       3afffffb        bcc     c0118268 <v7wbi_flush_user_tlb_range+0x20> (File Offset: 0x28268)
>> c0118278:       f57ff04b        dsb     ish
>> c011827c:       e1a0f00e        mov     pc, lr
>
> ... and the address in r0 for the TLBI.
>
> Are the cores executing instructions at this point, or by "hard LOCKUP"
> do you mean that they're deadlocked in hardware?

If they are executing, they aren't executing much.  That means it's
likely they're deadlocked in hardware.

-Doug
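
The repeated-sampling idea described above (read a CPU's PC several times
and see whether it ever changes) can be sketched in C as follows. This is
not the rockchip_panic_notify() code; the callback type, function names
and toy readers are hypothetical, and a real reader would fetch
CPU_DBGPCSR for the CPU in question. As noted above, identical samples
cannot distinguish a wedged CPU from one spinning in a very tight loop.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback: returns one sample of a remote CPU's PC. */
typedef uint32_t (*pc_reader_fn)(void *ctx);

/*
 * Take n samples and report whether they were all identical. The first
 * sample is stored in *pc_out when pc_out is not NULL. With n == 0 no
 * sample is taken and the result is false.
 */
static bool pc_looks_stuck(pc_reader_fn read_pc, void *ctx, size_t n,
                           uint32_t *pc_out)
{
    if (n == 0)
        return false;

    uint32_t first = read_pc(ctx);
    if (pc_out != NULL)
        *pc_out = first;

    for (size_t i = 1; i < n; i++) {
        if (read_pc(ctx) != first)
            return false;   /* the PC moved, so the CPU is executing */
    }
    return true;
}

/* Toy reader: always reports the PC stored in *ctx. */
static uint32_t fixed_pc(void *ctx)
{
    return *(uint32_t *)ctx;
}

/* Toy reader: reports the PC in *ctx, then advances it by 4 bytes. */
static uint32_t advancing_pc(void *ctx)
{
    uint32_t *pc = ctx;
    uint32_t value = *pc;

    *pc += 4;
    return value;
}
```

With fixed_pc and a context holding 0xc0117c8c, pc_looks_stuck() returns
true for any non-zero sample count; with advancing_pc it returns false as
soon as a second sample is taken.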

