Re: arm64: kdump broken on a large CPU system

From: Qian Cai <cai@lca.pw>
To: "AKASHI, Takahiro" <takahiro.akashi@linaro.org>,
	James Morse <james.morse@arm.com>,
	Marc Zyngier <marc.zyngier@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>,
	Will Deacon <will.deacon@arm.com>,
	kexec@lists.infradead.org, linux-arm-kernel@lists.infradead.org,
	Ard Biesheuvel <ard.biesheuvel@linaro.org>
Subject: Re: arm64: kdump broken on a large CPU system
Date: Wed, 12 Dec 2018 17:37:04 -0500
Message-ID: <1544654224.18411.11.camel@lca.pw>
In-Reply-To: <e4a3456b-6a75-4564-a49f-0532d0b35726@lca.pw>

On Tue, 2018-12-11 at 23:39 -0500, Qian Cai wrote:
> [+ kexec@lists.infradead.org]
> 
> The debugging progress so far...
> 
> Waiting up to 5 minutes for other CPUs to stop in crash_smp_send_stop() made
> no difference.
> 
> With the "dev" branch of this tree [1], it is possible to print out messages
> from purgatory when passing something like "--port=0x602B0000
> --port-lsr=0x602B0000,0x80" to kexec. However, even enable_dcache() in
> setup_arch() hangs forever on this machine (it works fine on another
> arm64 server - Cortex-A72). After removing only enable_dcache() /
> disable_dcache() from setup_arch() etc. without removing the printf() lines,
> it did print out,
> 
> I'm in purgatory
> purgatory: entry=0000000090080000
> purgatory: dtb=0000000092d50000
> purgatory: D-cache Enabled before SHA verification
> purgatory: D-cache Disabled after SHA verification
> 
> So, this confirmed that it must hang somewhere in arm64/kernel/head.S (.stext)
> or the early part of start_kernel() before earlycon was initialized.
> 
> Also confirmed that passing nr_cpus=64 in the first kernel would again make
> everything work fine with this new kexec.
> 
> Since enable_dcache() hangs as well, I suspect this has something to do
> with enabling the MMU (i.e., .stext -> __primary_switch -> __enable_mmu)
> coupled with some sort of per-CPU data where the number of CPUs matters.

I am still debugging a hang when enabling the MMU (enable_dcache()) in
purgatory [1], which may provide some clues about the later hang in the 2nd
kernel.

dsb	nshst
tlbi	alle2
dsb	nsh
isb
bl	get_ips_bits
lsl	x1, x0, #TCR_IPS_EL2_SHIFT
orr	x1, x1, x7
mov	x0, x6
ldr	x2, =MEMORY_ATTRIBUTES
msr	mair_el2, x2
msr	tcr_el2, x1
msr	ttbr0_el2, x0
isb
mrs	x0, sctlr_el2
ldr	x3, =SCTLR_ELx_FLAGS
orr	x0, x0, x3
msr	sctlr_el2, x0     <--- it hangs right on this instruction.

Without CONFIG_ARM64_VHE (i.e., running in EL1), enable_dcache() completes,
but the 2nd kernel still hangs somewhere later on.

dsb	nshst
tlbi	vmalle1
dsb	nsh
isb
bl	get_ips_bits
lsl	x1, x0, #TCR_IPS_EL1_SHIFT
orr	x1, x1, x7
mov	x0, x6
ldr	x2, =MEMORY_ATTRIBUTES
msr	mair_el1, x2
msr	tcr_el1, x1
msr	ttbr0_el1, x0
isb
mrs	x0, sctlr_el1
ldr	x3, =SCTLR_ELx_FLAGS
orr	x0, x0, x3
msr	sctlr_el1, x0
isb
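
For reference, the final read-modify-write on SCTLR in both sequences can be
modelled as below. This is only a sketch: the M, C and I bit positions are the
architectural ones, but what SCTLR_ELx_FLAGS actually contains in the purgatory
tree is an assumption here, and ASSUMED_FLAGS, enable_flags() and mmu_enabled()
are hypothetical names used for illustration.

```python
# Architectural SCTLR_ELx bit positions.
SCTLR_M = 1 << 0    # MMU enable
SCTLR_C = 1 << 2    # data cache enable
SCTLR_I = 1 << 12   # instruction cache enable

# Assumption: the flags OR'ed in by enable_dcache() include at least these bits.
ASSUMED_FLAGS = SCTLR_M | SCTLR_C | SCTLR_I


def enable_flags(sctlr, flags=ASSUMED_FLAGS):
    """Model 'orr x0, x0, x3': set the flag bits and keep all other bits."""
    return sctlr | flags


def mmu_enabled(sctlr):
    """Return True if the M bit is set in the given SCTLR value."""
    return bool(sctlr & SCTLR_M)


if __name__ == "__main__":
    before = 0x0
    after = enable_flags(before)
    print(hex(after), mmu_enabled(before), mmu_enabled(after))
```

If the flags do include M, the MMU goes from off to on in the same msr that
enables the caches, so any stale TLB or cache state would take effect right at
that write.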

One data point about this system is that it has 4 threads per core. Every 2
cores share the same L1 and L2 caches, so each set of those caches is shared
by 8 CPUs. All CPUs share the same L3 cache.
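
The sharing arithmetic can be checked with a quick sketch. The thread, core
and socket counts are taken from the lscpu output quoted below; the "2 cores
per L1/L2 group" figure is the description given above rather than something
lscpu reports, and the variable names are only illustrative.

```python
threads_per_core = 4        # lscpu: Thread(s) per core
cores_per_socket = 32       # lscpu: Core(s) per socket
sockets = 2                 # lscpu: Socket(s)
cores_per_cache_group = 2   # assumption from the description above

# Total logical CPUs seen by the 1st kernel.
total_cpus = threads_per_core * cores_per_socket * sockets

# Logical CPUs sharing one set of L1/L2 caches.
cpus_per_cache_group = threads_per_core * cores_per_cache_group

# Number of such L1/L2 sharing groups in the whole system.
cache_groups = total_cpus // cpus_per_cache_group

print(total_cpus, cpus_per_cache_group, cache_groups)
```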

Hence, I wonder if this is caused by incomplete cache/TLB invalidation, which
left stale entries (or uninitialised junk that just happens to look valid)
present before the MMU was turned on.

[1] https://github.com/pratyushanand/kexec-tools/blob/devel/purgatory/arch/arm64/cache.S

> 
> Right now, I think I need to find a way to print directly to the pl011 serial
> console while debugging this assembly code, like CONFIG_DEBUG_LL for arm64, so
> it can be used to locate where exactly it hung. Otherwise, I am shooting in
> the dark.
> 
> [1] https://github.com/pratyushanand/kexec-tools
> 
> === original email ===
> 
> On this HPE Apollo 70 arm64 server with 256 CPUs, triggering a crash dump just
> hung (4.20-rc6 as well as 4.18). It was confirmed that execution went as far
> as entering __cpu_soft_restart(),
> 
> __crash_kexec
>   machine_kexec
>     cpu_soft_restart
>       restart
>         __cpu_soft_restart
> 
> The earlycon was enabled but had no output from the 2nd kernel, so it was
> pretty much stuck in the assembly code in arm64/kernel/head.S or the early part
> of start_kernel() before earlycon was initialized.
> 
> It turned out this has something to do with nr_cpus in the 1st kernel,
> although the 2nd kernel always has nr_cpus=1 [1]. It was tested with both
> crashkernel=512M and 768M.
> 
> nr_cpus <= 96  GOOD (2nd kernel was up in 2-3 mins.)
> nr_cpus=256    BAD  (2nd kernel was NOT up after 1 hour.)
> nr_cpus=127    BAD  (2nd kernel was NOT up after 10 mins.)
> 
> Testing with and without CONFIG_ARM64_VHE (i.e., el2_switch) made no
> difference.
> 
> [1] KDUMP_COMMANDLINE_APPEND="irqpoll nr_cpus=1 swiotlb=noforce reset_devices"
> 
> I am still figuring out a way to debug this assembly code to find where it
> actually hung, but the server is hooked up to a conserver that is not able to
> generate any sysrq and I have no shell access to the conserver, so it seems a
> bit difficult to use kgdb or kdb in this case.
> 
> CPU information,
> 
> # lscpu
> Architecture:        aarch64
> Byte Order:          Little Endian
> CPU(s):              256
> On-line CPU(s) list: 0-255
> Thread(s) per core:  4
> Core(s) per socket:  32
> Socket(s):           2
> NUMA node(s):        2
> Vendor ID:           Cavium
> Model:               1
> Model name:          ThunderX2 99xx
> Stepping:            0x1
> BogoMIPS:            400.00
> L1d cache:           32K
> L1i cache:           32K
> L2 cache:            256K
> L3 cache:            32768K
> NUMA node0 CPU(s):   0-127
> NUMA node1 CPU(s):   128-255
> Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics cpuid asimdrdm
