linux-arm-kernel.lists.infradead.org archive mirror
* arm64: kdump broken on a large CPU system
@ 2018-12-10 22:30 Qian Cai
From: Qian Cai @ 2018-12-10 22:30 UTC
  To: Marc Zyngier, Ard Biesheuvel, Catalin Marinas, Will Deacon
  Cc: linux-arm-kernel

On this HPE Apollo 70 arm64 server with 256 CPUs, triggering a crash dump just
hung (on 4.20-rc6 as well as 4.18). I confirmed that execution got as far as
entering __cpu_soft_restart():

__crash_kexec
  machine_kexec
    cpu_soft_restart
      restart
        __cpu_soft_restart

The earlycon was enabled but there was no output from the 2nd kernel, so it was
most likely stuck somewhere in the assembly code in arch/arm64/kernel/head.S or
in the early part of start_kernel() before earlycon was initialized.

It turned out that this is related to nr_cpus in the 1st kernel, although the
2nd kernel always has nr_cpus=1 [1]. It was tested with both crashkernel=512M
and crashkernel=768M.

nr_cpus <= 96  GOOD (2nd kernel was up in 2-3 mins.)
nr_cpus=256    BAD  (2nd kernel was NOT up after 1 hour.)
nr_cpus=127    BAD  (2nd kernel was NOT up after 10 mins.)
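
The results above leave the failure threshold somewhere between nr_cpus=96
(good) and nr_cpus=127 (bad). A minimal sketch of how the next value to try
could be picked by bisection; next_nr_cpus is a hypothetical helper, not part
of any existing tool, and each candidate still has to be booted and tested by
hand:

```shell
# Hypothetical helper: given the largest nr_cpus known to work and the
# smallest nr_cpus known to fail, print the next value to try.
next_nr_cpus() {
    good=$1
    bad=$2
    if [ $((bad - good)) -le 1 ]; then
        # The range cannot be narrowed further; "bad" is the smallest
        # failing value.
        echo "$bad"
        return 0
    fi
    # Integer midpoint of the remaining range.
    echo $(( (good + bad) / 2 ))
}

next_nr_cpus 96 127
```

Booting the 1st kernel with the printed value and feeding the outcome back in
as the new good or bad bound narrows the range by half each round.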

I also tested with and without CONFIG_ARM64_VHE (i.e., el2_switch), which made
no difference.

[1] KDUMP_COMMANDLINE_APPEND="irqpoll nr_cpus=1 swiotlb=noforce reset_devices"
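
Since the behaviour depends on nr_cpus in the 1st kernel while the 2nd kernel
gets nr_cpus=1, it can help to read the value back from the command line that
is actually in effect. A rough sketch, assuming a POSIX shell; cmdline_param is
a hypothetical helper name:

```shell
# Hypothetical helper: print the value of a key=value parameter found in a
# kernel command line string. Returns non-zero if the parameter is absent
# or is a bare flag without "=value".
cmdline_param() {
    key=$1
    cmdline=$2
    # Unquoted on purpose: split the command line on whitespace.
    for word in $cmdline; do
        case $word in
            "$key"=*)
                echo "${word#*=}"
                return 0
                ;;
        esac
    done
    return 1
}

append="irqpoll nr_cpus=1 swiotlb=noforce reset_devices"
cmdline_param nr_cpus "$append"
cmdline_param swiotlb "$append"
```

The same function can be pointed at the running kernel by passing the contents
of /proc/cmdline as the second argument.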

I am still trying to find a way to debug the assembly code and see where it
actually hung, but the server is hooked up to a conserver that cannot generate
any sysrq, and I have no shell access to the conserver, so it seems a bit
difficult to use kgdb or kdb in this case.

CPU information,

# lscpu
Architecture:        aarch64
Byte Order:          Little Endian
CPU(s):              256
On-line CPU(s) list: 0-255
Thread(s) per core:  4
Core(s) per socket:  32
Socket(s):           2
NUMA node(s):        2
Vendor ID:           Cavium
Model:               1
Model name:          ThunderX2 99xx
Stepping:            0x1
BogoMIPS:            400.00
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            32768K
NUMA node0 CPU(s):   0-127
NUMA node1 CPU(s):   128-255
Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics cpuid asimdrdm
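
As a quick sanity check, the topology fields in the lscpu output above should
multiply out to the reported CPU count. A small sketch, assuming lscpu-style
"Name: value" lines on stdin; expected_cpus is a hypothetical helper name:

```shell
# Hypothetical helper: multiply sockets x cores-per-socket x threads-per-core
# from lscpu-style text read on stdin.
expected_cpus() {
    awk -F: '
        /^Thread\(s\) per core/ { t = $2 + 0 }
        /^Core\(s\) per socket/ { c = $2 + 0 }
        /^Socket\(s\)/          { s = $2 + 0 }
        END { print t * c * s }
    '
}

printf '%s\n' \
    'Thread(s) per core:  4' \
    'Core(s) per socket:  32' \
    'Socket(s):           2' | expected_cpus
```

For the values shown here the product is 256, matching the CPU(s) line.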



Thread overview: 17+ messages
2018-12-10 22:30 arm64: kdump broken on a large CPU system Qian Cai
2018-12-11 10:09 ` Marc Zyngier
2018-12-11 11:34   ` James Morse
2018-12-12  2:51     ` AKASHI, Takahiro
2018-12-12  4:39       ` Qian Cai
2018-12-12 22:37         ` Qian Cai
2018-12-13  5:22           ` [PATCH] arm64: invalidate TLB before turning MMU on Qian Cai
2018-12-13  5:40             ` Bhupesh Sharma
2018-12-13 13:39               ` Qian Cai
2018-12-13 10:44             ` James Morse
2018-12-13 13:44               ` Qian Cai
2018-12-14  4:08             ` [PATCH v2] arm64: invalidate TLB just " Qian Cai
2018-12-14  5:01               ` Bhupesh Sharma
2018-12-14 12:54                 ` Qian Cai
2018-12-14  7:23               ` Ard Biesheuvel
2018-12-15  1:53                 ` Qian Cai
2019-01-10 20:00                   ` Bhupesh Sharma
