From: Marc Zyngier <marc.zyngier@arm.com>
To: Qian Cai <cai@lca.pw>, Ard Biesheuvel <ard.biesheuvel@linaro.org>,
    Catalin Marinas <catalin.marinas@arm.com>,
    Will Deacon <will.deacon@arm.com>
Cc: "AKASHI, Takahiro" <takahiro.akashi@linaro.org>,
    James Morse <James.Morse@arm.com>,
    linux-arm-kernel@lists.infradead.org
Subject: Re: arm64: kdump broken on a large CPU system
Date: Tue, 11 Dec 2018 10:09:59 +0000
Message-ID: <29f74c6d-dd21-dcee-6c62-914f018c4e4e@arm.com>
In-Reply-To: <113776f1-5633-e397-96eb-c533ea79671d@lca.pw>

[+ James and Takahiro]

Hi Qian,

On 10/12/2018 22:30, Qian Cai wrote:
> On this HPE Apollo 70 arm64 server with 256 CPUs, triggering a crash dump just
> hung (4.20-rc6 as well as 4.18). It was confirmed that execution got as far
> as entering __cpu_soft_restart(),

You can forget about 4.18 altogether; it will never kexec correctly.
I've used 4.20 + kexec on a TX2 system though, and although it takes
absolutely ages, it works reliably.

>
> __crash_kexec
> machine_kexec
> cpu_soft_restart
> restart
> __cpu_soft_restart
>
> The earlycon was enabled but there was no output from the 2nd kernel, so it was
> most likely stuck somewhere in the assembly code in arm64/kernel/head.S or in
> the early part of start_kernel() before earlycon was initialized.

Could it instead be in the purgatory code provided by userspace?

>
> It turned out this has something to do with nr_cpus in the 1st kernel, although
> the 2nd kernel always has nr_cpus=1 [1]. It was tested with both
> crashkernel=512M or 768M.

James was saying something about a timeout, which may or may not be long
enough.

>
> nr_cpus <= 96   GOOD (2nd kernel was up in 2-3 mins.)
> nr_cpus=256     BAD  (2nd kernel was NOT up after 1 hour.)
> nr_cpus=127     BAD  (2nd kernel was NOT up after 10 mins.)
>
> I also tested with and without CONFIG_ARM64_VHE (i.e., el2_switch); it made no
> difference.
>
> [1] KDUMP_COMMANDLINE_APPEND="irqpoll nr_cpus=1 swiotlb=noforce reset_devices"
>
> I am still figuring out a way to debug that assembly code to find where it
> actually hung, but the server is hooked up to a conserver that is not able to
> generate any sysrq and I have no shell access to the conserver, so it seems a
> bit difficult to use kgdb or kdb in this case.
>
> CPU information,
>
> # lscpu
> Architecture:         aarch64
> Byte Order:           Little Endian
> CPU(s):               256
> On-line CPU(s) list:  0-255
> Thread(s) per core:   4
> Core(s) per socket:   32
> Socket(s):            2
> NUMA node(s):         2
> Vendor ID:            Cavium
> Model:                1
> Model name:           ThunderX2 99xx
> Stepping:             0x1
> BogoMIPS:             400.00
> L1d cache:            32K
> L1i cache:            32K
> L2 cache:             256K
> L3 cache:             32768K
> NUMA node0 CPU(s):    0-127
> NUMA node1 CPU(s):    128-255
> Flags:                fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics cpuid
>                       asimdrdm
>
Thanks,
M.
--
Jazz is not dead. It just smells funny...