From: Marc Zyngier <marc.zyngier@arm.com>
To: Qian Cai <cai@lca.pw>, Ard Biesheuvel <ard.biesheuvel@linaro.org>,
	Catalin Marinas <catalin.marinas@arm.com>,
	Will Deacon <will.deacon@arm.com>
Cc: "AKASHI, Takahiro" <takahiro.akashi@linaro.org>,
	James Morse <James.Morse@arm.com>,
	linux-arm-kernel@lists.infradead.org
Subject: Re: arm64: kdump broken on a large CPU system
Date: Tue, 11 Dec 2018 10:09:59 +0000
Message-ID: <29f74c6d-dd21-dcee-6c62-914f018c4e4e@arm.com>
In-Reply-To: <113776f1-5633-e397-96eb-c533ea79671d@lca.pw>

[+ James and Takahiro]

Hi Qian,

On 10/12/2018 22:30, Qian Cai wrote:
> On this HPE Apollo 70 arm64 server with 256 CPUs, triggering a crash dump just
> hung (4.20-rc6 as well as 4.18). It was confirmed that execution went as far
> as entering __cpu_soft_restart(),

You can forget about 4.18 altogether, it will never correctly kexec.
I've used 4.20 + kexec on a TX2 system though, and although it takes
absolutely ages, it reliably works.

> 
> __crash_kexec
>   machine_kexec
>     cpu_soft_restart
>       restart
>         __cpu_soft_restart
> 
> The earlycon was enabled but there was no output from the 2nd kernel, so it was
> most likely stuck somewhere in the assembly code in arch/arm64/kernel/head.S or
> in the early part of start_kernel(), before earlycon was initialized.

Could it instead be in the purgatory code provided by userspace?

> 
> It turned out this has something to do with nr_cpus in the 1st kernel, although
> the 2nd kernel always has nr_cpus=1 [1]. It was tested with both
> crashkernel=512M and 768M.

James was saying something about a timeout, which may or may not be long
enough.
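
Since the report hinges on the 1st kernel's nr_cpus differing from the 2nd
kernel's, it may help to check which value each command line really carries.
A minimal sketch follows, using the string from footnote [1] in the quoted
report. The helper name parse_cmdline is hypothetical, and the whitespace
split is a simplification that does not handle quoted values containing
spaces.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into a dict; bare flags map to None.

    Simplification (assumption): tokens are separated by whitespace and
    no value contains quoted spaces.
    """
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


# String taken from footnote [1] of the quoted report.
kdump_append = "irqpoll nr_cpus=1 swiotlb=noforce reset_devices"
params = parse_cmdline(kdump_append)
print(params.get("nr_cpus"))
```

The same helper could be pointed at the contents of /proc/cmdline on the 1st
kernel to compare the two values.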

> 
> nr_cpus <= 96  GOOD (2nd kernel was up in 2-3 mins.)
> nr_cpus=256    BAD  (2nd kernel was NOT up after 1 hour.)
> nr_cpus=127    BAD  (2nd kernel was NOT up after 10 mins.)
> 
> I also tested with and without CONFIG_ARM64_VHE (i.e., el2_switch), which made
> no difference.
> 
> [1] KDUMP_COMMANDLINE_APPEND="irqpoll nr_cpus=1 swiotlb=noforce reset_devices"
> 
> I am still figuring out a way to debug that assembly code to find where it
> actually hung, but the server is hooked up to a conserver that is not able to
> generate any sysrq and I have no shell access to the conserver, so it seems a
> bit difficult to use kgdb or kdb in this case.
> 
> CPU information,
> 
> # lscpu
> Architecture:        aarch64
> Byte Order:          Little Endian
> CPU(s):              256
> On-line CPU(s) list: 0-255
> Thread(s) per core:  4
> Core(s) per socket:  32
> Socket(s):           2
> NUMA node(s):        2
> Vendor ID:           Cavium
> Model:               1
> Model name:          ThunderX2 99xx
> Stepping:            0x1
> BogoMIPS:            400.00
> L1d cache:           32K
> L1i cache:           32K
> L2 cache:            256K
> L3 cache:            32768K
> NUMA node0 CPU(s):   0-127
> NUMA node1 CPU(s):   128-255
> Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics cpuid asimdrdm
> 

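The quoted results leave a gap between the last known-good value
(nr_cpus=96) and the first known-bad one (nr_cpus=127). One way to narrow it
is a plain binary search, sketched below. It assumes the failure is monotonic
in nr_cpus, which the report does not establish. The boots_ok callback is
hypothetical: in practice it stands for booting the 1st kernel with that
nr_cpus, triggering a crash, and noting by hand whether the 2nd kernel comes
up.

```python
def first_bad_nr_cpus(good, bad, boots_ok):
    """Return the smallest failing nr_cpus in (good, bad].

    Assumes boots_ok(good) is True, boots_ok(bad) is False, and that
    failures are monotonic in nr_cpus (an assumption, not a known fact).
    """
    while bad - good > 1:
        mid = (good + bad) // 2
        if boots_ok(mid):
            good = mid  # kdump still works here, look higher
        else:
            bad = mid   # kdump hangs here, look lower
    return bad


# Stand-in for the manual test; the threshold 112 is made up for the demo.
def fake_boots_ok(n):
    return n < 112


print(first_bad_nr_cpus(96, 127, fake_boots_ok))
```

With the 96 and 127 bounds from the report, this takes five test boots to
converge.
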
Thanks,

	M.
-- 
Jazz is not dead. It just smells funny...


Thread overview: 17+ messages
2018-12-10 22:30 arm64: kdump broken on a large CPU system Qian Cai
2018-12-11 10:09 ` Marc Zyngier [this message]
2018-12-11 11:34   ` James Morse
2018-12-12  2:51     ` AKASHI, Takahiro
2018-12-12  4:39       ` Qian Cai
2018-12-12 22:37         ` Qian Cai
2018-12-13  5:22           ` [PATCH] arm64: invalidate TLB before turning MMU on Qian Cai
2018-12-13  5:40             ` Bhupesh Sharma
2018-12-13 13:39               ` Qian Cai
2018-12-13 10:44             ` James Morse
2018-12-13 13:44               ` Qian Cai
2018-12-14  4:08             ` [PATCH v2] arm64: invalidate TLB just " Qian Cai
2018-12-14  5:01               ` Bhupesh Sharma
2018-12-14 12:54                 ` Qian Cai
2018-12-14  7:23               ` Ard Biesheuvel
2018-12-15  1:53                 ` Qian Cai
2019-01-10 20:00                   ` Bhupesh Sharma
