linuxppc-dev.lists.ozlabs.org archive mirror
From: Michael Ellerman <mpe@ellerman.id.au>
To: Aneesh Kumar K V <aneesh.kumar@linux.ibm.com>,
	Mahesh Salgaonkar <mahesh@linux.ibm.com>,
	linuxppc-dev <linuxppc-dev@ozlabs.org>
Cc: Ganesh Goudar <ganeshgr@linux.ibm.com>,
	Nicholas Piggin <npiggin@gmail.com>
Subject: Re: [PATCH v4] powerpc: Avoid nmi_enter/nmi_exit in real mode interrupt.
Date: Fri, 08 Mar 2024 19:08:50 +1100
Message-ID: <874jdhno19.fsf@mail.lhotse>
In-Reply-To: <8d973907-8e86-4b9f-8995-cf3a8621f6b6@linux.ibm.com>

Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
> On 3/7/24 5:13 PM, Michael Ellerman wrote:
>> Mahesh Salgaonkar <mahesh@linux.ibm.com> writes:
>>> nmi_enter()/nmi_exit() touches per cpu variables which can lead to kernel
>>> crash when invoked during real mode interrupt handling (e.g. early HMI/MCE
>>> interrupt handler) if percpu allocation comes from vmalloc area.
>>>
>>> Early HMI/MCE handlers are called through DEFINE_INTERRUPT_HANDLER_NMI()
>>> wrapper which invokes nmi_enter/nmi_exit calls. We don't see any issue when
>>> percpu allocation is from the embedded first chunk. However with
>>> CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK enabled there are chances where percpu
>>> allocation can come from the vmalloc area.
>>>
>>> With kernel command line "percpu_alloc=page" we can force percpu allocation
>>> to come from vmalloc area and can see kernel crash in machine_check_early:
>>>
>>> [    1.215714] NIP [c000000000e49eb4] rcu_nmi_enter+0x24/0x110
>>> [    1.215717] LR [c0000000000461a0] machine_check_early+0xf0/0x2c0
>>> [    1.215719] --- interrupt: 200
>>> [    1.215720] [c000000fffd73180] [0000000000000000] 0x0 (unreliable)
>>> [    1.215722] [c000000fffd731b0] [0000000000000000] 0x0
>>> [    1.215724] [c000000fffd73210] [c000000000008364] machine_check_early_common+0x134/0x1f8
>>>
>>> Fix this by avoiding use of nmi_enter()/nmi_exit() in real mode if percpu
>>> first chunk is not embedded.
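
For illustration only, the condition described in the quoted commit message can be modelled in plain userspace C. This is a sketch, not the patch itself: nmi_enter_is_safe() and the first_chunk_is_paged flag are hypothetical names, and MSR_DR is used here simply as "data translation is on".

```c
#include <stdbool.h>
#include <stdint.h>

/* Data-relocation bit of the MSR; treat the value as illustrative. */
#define MSR_DR 0x10ULL

/*
 * Hypothetical helper: decide whether an NMI-style handler may touch
 * percpu state (as nmi_enter()/nmi_exit() do).
 *
 * msr                  - MSR value at handler entry
 * first_chunk_is_paged - true if the percpu first chunk was page-mapped,
 *                        i.e. may live in the vmalloc area
 */
static bool nmi_enter_is_safe(uint64_t msr, bool first_chunk_is_paged)
{
    /* Translation on: vmalloc addresses can be dereferenced. */
    if (msr & MSR_DR)
        return true;

    /* Real mode: only an embedded first chunk is reachable. */
    return !first_chunk_is_paged;
}
```

A real-mode handler would skip the nmi_enter()/nmi_exit() pair whenever this returns false, which matches the "real mode and first chunk not embedded" case from the commit message.
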
>> 
>> My system (powernv) doesn't even boot with percpu_alloc=page.
>
>
> Can you share the crash details?

Yes but it's not pretty :)

  [    1.725257][  T714] systemd-journald[714]: Collecting audit messages is disabled.
  [    1.729401][    T1] systemd[1]: Finished systemd-tmpfiles-setup-dev.service - Create Static Device Nodes in /dev.
  [  OK  ] Finished systemd-tmpfiles-…reate Static Device Nodes in /dev.
  [    1.773902][   C22] Disabling lock debugging due to kernel taint
  [    1.773905][   C23] Oops: Machine check, sig: 7 [#1]
  [    1.773911][   C23] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA PowerNV
  [    1.773916][   C23] Modules linked in:
  [    1.773920][   C23] CPU: 23 PID: 0 Comm: swapper/23 Tainted: G   M               6.8.0-rc7-02500-g23515c370cbb #1
  [    1.773924][   C23] Hardware name: 8335-GTH POWER9 0x4e1202 opal:skiboot-v6.5.3-35-g1851b2a06 PowerNV
  [    1.773926][   C23] NIP:  0000000000000000 LR: 0000000000000000 CTR: 0000000000000000
  [    1.773929][   C23] REGS: c000000fffa6ef50 TRAP: 0000   Tainted: G   M                (6.8.0-rc7-02500-g23515c370cbb)
  [    1.773932][   C23] MSR:  0000000000000000 <>  CR: 00000000  XER: 00000000
  [    1.773937][   C23] CFAR: 0000000000000000 IRQMASK: 3 
  [    1.773937][   C23] GPR00: 0000000000000000 c000000fffa6efe0 c000000fffa6efb0 0000000000000000 
  [    1.773937][   C23] GPR04: c00000000003d8c0 c000000001f5f000 0000000000000000 0000000000000103 
  [    1.773937][   C23] GPR08: 0000000000000003 653a0d962a590300 0000000000000000 0000000000000000 
  [    1.773937][   C23] GPR12: c000000fffa6f280 0000000000000000 c0000000000084a4 0000000000000000 
  [    1.773937][   C23] GPR16: 0000000053474552 0000000000000000 c00000000003d8c0 c000000fffa6f280 
  [    1.773937][   C23] GPR20: c000000001f5f000 c000000fffa6f340 c000000fffa6f2e8 0000000000000000 
  [    1.773937][   C23] GPR24: 0007fffffecf0000 c0000000065bbb80 0000000000550102 c000000002172b20 
  [    1.773937][   C23] GPR28: 0000000000000000 0000000053474552 0000000000000000 c000000ffffc6d80 
  [    1.773982][   C23] NIP [0000000000000000] 0x0
  [    1.773988][   C23] LR [0000000000000000] 0x0
  [    1.773990][   C23] Call Trace:
  [    1.773991][   C23] [c000000fffa6efe0] [c000000001f5f000] .TOC.+0x0/0xa1000 (unreliable)
  [    1.773999][   C23] Code: XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX 
  [    1.774021][   C23] ---[ end trace 0000000000000000 ]---

Something has gone badly wrong.

That was a test kernel with some other commits, but nothing that should
cause that. Removing percpu_alloc=page fixes it.

It's based on fddff98e83b4b4d54470902ea0d520c4d423ca3b.

>> AFAIK the only reason we added support for it was to handle 4K kernels
>> with HPT. See commit eb553f16973a ("powerpc/64/mm: implement page
>> mapping percpu first chunk allocator").
>> 
>> So I wonder if we should change the Kconfig to only offer it as an
>> option in that case, and change the logic in setup_per_cpu_areas() to
>> only use it as a last resort.
>> 
>> I guess we probably still need this commit though, even if just for 4K HPT.
>> 
>>
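
To make the "last resort" idea quoted above concrete, here is a small userspace model of the selection order. It is only a sketch under assumptions: choose_first_chunk(), the fc_kind enum and the stub allocators are hypothetical and do not reflect the actual setup_per_cpu_areas() code.

```c
#include <stdbool.h>

/* Which first-chunk allocator ended up being used. */
enum fc_kind { FC_EMBED, FC_PAGE, FC_NONE };

/* Hypothetical allocator callback: returns 0 on success, negative on error. */
typedef int (*fc_alloc_fn)(void);

/*
 * Try the embedded first chunk first. Fall back to the page-mapped
 * allocator only if embedding failed and the configuration offers it
 * (for example, 4K pages with HPT in the scenario discussed above).
 */
static enum fc_kind choose_first_chunk(fc_alloc_fn embed, fc_alloc_fn page,
                                       bool page_supported)
{
    if (embed() == 0)
        return FC_EMBED;

    if (page_supported && page() == 0)
        return FC_PAGE;

    return FC_NONE;
}

/* Stub allocators used to exercise the selection logic. */
static int alloc_ok(void)   { return 0; }
static int alloc_fail(void) { return -12; }
```

With this ordering, the page-mapped chunk is never selected when the embedded allocator succeeds, regardless of what else is available.
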
> We have also observed some errors when there is a large gap between the start memory of
> NUMA nodes. That made the percpu offset really large, causing boot failures even on 64K.

Yeah, I have vague memories of that :)

cheers
